Compare commits


387 Commits

Author SHA1 Message Date
leejet
11ab095230
fix: resolve embedding loading issue when calling generate_image multiple times (#1078) 2025-12-12 23:08:12 +08:00
Wagner Bruna
a3a88fc9b2
fix: avoid crash loading LoRAs with bf16 weights (#1077) 2025-12-12 22:36:54 +08:00
leejet
8823dc48bc
feat: align the spatial size to the corresponding multiple (#1073) 2025-12-10 23:15:08 +08:00
Pedrito
1ac5a616de
feat: support custom upscale tile size (#896) 2025-12-10 22:25:19 +08:00
leejet
d939f6e86a
refactor: optimize the handling of LoRA models (#1070) 2025-12-10 00:26:07 +08:00
Wagner Bruna
e72aea796e
feat: embed version string and git commit hash (#1008) 2025-12-09 22:38:54 +08:00
wuhei
a908436729
docs: update download link for Stable Diffusion v1.5 (#1063) 2025-12-09 22:06:16 +08:00
stduhpf
583a02e29e
feat: add Flux.2 VAE proj matrix for previews (#1017) 2025-12-09 22:00:45 +08:00
leejet
96c3e64057
refactor: optimize the handling of embedding (#1068)
* optimize the handling of embedding

* support case-insensitive embedding names
2025-12-08 23:59:04 +08:00
Weiqi Gao
0392273e10
chore: add compute kernels to Windows CUDA build (#1062)
* Fix syntax for CUDA architecture definitions

* Extend CUDA support from the GTX 10 Series to the RTX 50 Series

* update cuda installer step version to install cuda 12.8.1

* Remove unsupported compute capability
2025-12-07 22:12:50 +08:00
leejet
bf1a388b44 docs: update logo 2025-12-07 15:09:32 +08:00
leejet
c9005337a8 docs: update logo 2025-12-07 14:56:21 +08:00
leejet
2f0bd31a84
feat: add ovis image support (#1057) 2025-12-07 12:32:56 +08:00
leejet
bfbb929790
feat: do not convert bf16 to f32 (#1055) 2025-12-06 23:55:51 +08:00
leejet
689e44c9a8
fix: correct ggml_ext_silu_act (#1056) 2025-12-06 23:55:28 +08:00
leejet
985aedda32
refactor: optimize the handling of pred type (#1048) 2025-12-04 23:31:55 +08:00
leejet
3f3610b5cd
chore: optimize lora log (#1047) 2025-12-04 22:44:58 +08:00
Wagner Bruna
118683de8a
fix: correct preview method selection (#1038) 2025-12-04 22:43:16 +08:00
stduhpf
bcc9c0d0b3
feat: handle ggml compute failures without crashing the program (#1003)
* Feat: handle compute failures more gracefully

* fix Unreachable code after return

Co-authored-by: idostyle <idostyl3@googlemail.com>

* adjust z_image.hpp

---------

Co-authored-by: idostyle <idostyl3@googlemail.com>
Co-authored-by: leejet <leejet714@gmail.com>
2025-12-04 22:04:27 +08:00
leejet
5865b5e703
refactor: split SDParams to SDCliParams/SDContextParams/SDGenerationParams (#1032) 2025-12-03 22:31:46 +08:00
stduhpf
edf2cb3846
fix: fix CosXL not being detected (#989) 2025-12-03 22:25:02 +08:00
Wagner Bruna
99e17232a4
fix: prevent NaN issues with Z-Image on certain ROCm setups (#1034) 2025-12-03 22:19:34 +08:00
leejet
710169df5c docs: update news 2025-12-01 22:46:15 +08:00
Wagner Bruna
e4c50f1de5
chore: add sd_ prefix to a few functions (#967) 2025-12-01 22:43:52 +08:00
rmatif
0743a1b3b5
fix: fix vae tiling for flux2 (#1025) 2025-12-01 22:41:56 +08:00
leejet
34a6fd4e60
feat: add z-image support (#1020)
* add z-image support

* use flux_latent_rgb_proj for z-image

* fix qwen3 rope type

* add support for qwen3 4b gguf

* add support for diffusers format lora

* fix nan issue that occurs when using CUDA with k-quants weights

* add z-image docs
2025-12-01 22:39:43 +08:00
leejet
3c1187ce83 docs: correct the time of adding flux2 support 2025-11-30 12:40:56 +08:00
leejet
20eb674100
fix: avoid crash in immediately mode when the lora file is not found (#1022) 2025-11-30 12:19:37 +08:00
leejet
bc80225336
fix: make the immediate LoRA apply mode work better when using Vulkan (#1021) 2025-11-30 12:08:25 +08:00
leejet
ab7e8d285e docs: update news 2025-11-30 11:51:23 +08:00
Wagner Bruna
673dbdda17
fix: add missing line cleanup for s/it progress display (#891) 2025-11-30 11:45:30 +08:00
Wagner Bruna
0249509a30
refactor: add user data pointer to the image preview callback (#1001) 2025-11-30 11:34:17 +08:00
leejet
52b67c538b
feat: add flux2 support (#1016)
* add flux2 support

* rename qwenvl to llm

* add Flux2FlowDenoiser

* update docs
2025-11-30 11:32:56 +08:00
leejet
20345888a3
refactor: optimize the handling of sample method (#999) 2025-11-22 14:00:25 +08:00
akleine
490c51d963
feat: report success/failure when saving PNG/JPG output (#912) 2025-11-22 13:57:44 +08:00
Wagner Bruna
45c46779af
feat: add LCM scheduler (#983) 2025-11-22 13:53:31 +08:00
leejet
869d023416
refactor: optimize the handling of scheduler (#998) 2025-11-22 12:48:53 +08:00
akleine
e9bc3b6c06
fix: check the PhotoMaker id_embeds tensor ONLY in PhotoMaker V2 mode (#987) 2025-11-22 12:47:40 +08:00
Wagner Bruna
b542894fb9
fix: avoid crash on default video preview path (#997)
Co-authored-by: masamaru-san
2025-11-22 12:46:27 +08:00
leejet
5498cc0d67
feat: add Wan2.1-I2V-1.3B(SkyReels) support (#988) 2025-11-19 23:56:46 +08:00
stduhpf
aa2b8e0ca5
fix: patch 1x1 conv weights at runtime (#986) 2025-11-19 23:27:23 +08:00
rmatif
a14e2b321d
feat: add easycache support (#940) 2025-11-19 23:19:32 +08:00
leejet
28ffb6c13d
fix: resolve issue with concat multiple LoRA output diffs at runtime (#985) 2025-11-17 22:56:07 +08:00
leejet
b88cc32346
fix: avoid using same type but diff instances for rng and sampler_rng (#982) 2025-11-16 23:37:14 +08:00
leejet
f532972d60
fix: avoid precision issues on vulkan backend (#980) 2025-11-16 20:57:08 +08:00
leejet
d5b05f70c6
feat: support independent sampler rng (#978) 2025-11-16 17:11:02 +08:00
akleine
6d6dc1b8ed
fix: make PhotoMakerV2 more robust by image count check (#970) 2025-11-16 17:10:48 +08:00
Wagner Bruna
199e675cc7
feat: support for --tensor-type-rules on generation modes (#932) 2025-11-16 17:07:32 +08:00
leejet
742a7333c3
feat: add cpu rng (#977) 2025-11-16 14:48:15 +08:00
Wagner Bruna
e8eb3791c8
fix: typo in --lora-apply-mode help (#972) 2025-11-16 14:48:00 +08:00
Wagner Bruna
aa44e06890
fix: avoid crash with LoRAs and type override (#974) 2025-11-16 14:47:36 +08:00
Daniele
6448430dbb
feat: add break pseudo token support (#422)
---------

Co-authored-by: Urs Ganse <urs.ganse@helsinki.fi>
2025-11-16 14:45:20 +08:00
leejet
347710f68f
feat: support applying LoRA at runtime (#969) 2025-11-13 21:48:44 +08:00
lcy
59ebdf0bb5
chore: enable Windows ROCm(HIP) build release (#956)
* build: fix missing commit sha in macOS and Ubuntu build zip name

The build workflows for macOS and Ubuntu incorrectly check for the
"main" branch instead of "master" when retrieving the commit hash for
naming the build artifacts.

* build: correct Vulkan SDK installation condition in build workflow

* build: Enable Windows ROCm(HIP) build release

Refer to the build workflow of llama.cpp to add a Windows ROCm (HIP)
build release to the workflow.
Since there are many differences between the HIP build and other
builds, this commit adds a separate "windows-latest-cmake-hip" job,
instead of enabling the ROCm matrix entry in the existing Windows
build job.

Main differences include:

- Install ROCm SDK from AMD official installer.
- Add a cache step for ROCm installation and a ccache step for build
  processing, since the HIP build takes much longer than other
  builds.
- Include the ROCm/HIP artifact in the release assets.
2025-11-12 00:28:55 +08:00
Flavio Bizzarri
4ffcbcaed7
fix: specify enum modifier in sd_set_preview_callback signature (#959) 2025-11-12 00:27:23 +08:00
leejet
694f0d9235
refactor: optimize the logic for name conversion and the processing of the LoRA model (#955) 2025-11-10 00:12:20 +08:00
stduhpf
8ecdf053ac
feat: add image preview support (#522) 2025-11-10 00:12:02 +08:00
leejet
ee89afc878
fix: resolve issue with pmid (#957) 2025-11-09 22:47:53 +08:00
akleine
d2d3944f50
feat: add support for SD2.x with TINY U-Nets (#939) 2025-11-09 22:47:37 +08:00
akleine
0fa3e1a383
fix: prevent core dump in PM V2 in case of incomplete cmd line (#950) 2025-11-09 22:36:43 +08:00
leejet
c2d8ffc22c
fix: compatibility for models with modified tensor shapes (#951) 2025-11-07 23:04:41 +08:00
stduhpf
fb748bb8a4
fix: TAE encoding (#935) 2025-11-07 22:58:59 +08:00
leejet
8f6c5c217b
refactor: simplify the model loading logic (#933)
* remove String2GGMLType

* remove preprocess_tensor

* fix clip init

* simplify the logic for reading weights
2025-11-03 21:21:34 +08:00
leejet
6103d86e2c
refactor: introduce GGMLRunnerContext (#928)
* introduce GGMLRunnerContext

* add Flash Attention enable control through GGMLRunnerContext

* add conv2d_direct enable control through GGMLRunnerContext
2025-11-02 02:11:04 +08:00
stduhpf
c42826b77c
fix: resolve multiple inpainting issues (#926)
* Fix inpainting masked image being broken by side effect

* Fix unet inpainting concat not being set

* Fix Flex.2 inpaint mode crash (+ use scale factor)
2025-11-02 02:10:32 +08:00
Wagner Bruna
945d9a9ee3
docs: add Koboldcpp as an available UI (#930) 2025-11-02 02:03:01 +08:00
Wagner Bruna
353e708844
docs: update ggml and llama.cpp URLs (#931) 2025-11-02 02:02:44 +08:00
leejet
dd75fc081c
refactor: unify the naming style of ggml extension functions (#921) 2025-10-28 23:26:48 +08:00
stduhpf
77eb95f8e4
docs: fix taesd direct download link (#917) 2025-10-28 23:26:23 +08:00
Wagner Bruna
8a45d0ff7f
chore: clean up stb includes (#919) 2025-10-28 23:25:45 +08:00
leejet
9e28be6479
feat: add chroma radiance support (#910)
* add chroma radiance support

* fix ci

* simplify generate_init_latent

* workaround: avoid ggml cuda error

* format code

* add chroma radiance doc
2025-10-25 23:56:14 +08:00
akleine
062490aa7c
feat: add SSD1B and tiny-sd support (#897)
* feat: add code and doc for running SSD1B models

* Added some more lines to support SD1.x with TINY U-Nets too.

* support SSD-1B.safetensors

* fix sdv1.5 diffusers format loader

---------

Co-authored-by: leejet <leejet714@gmail.com>
2025-10-25 23:35:54 +08:00
stduhpf
faabc5ad3c
feat: allow models to run without all text encoder(s) (#645) 2025-10-25 22:00:56 +08:00
leejet
69b9511ce9 sync: update ggml 2025-10-24 00:32:45 +08:00
stduhpf
917f7bfe99
fix: support --flow-shift for flux models with default pred (#913) 2025-10-23 21:35:18 +08:00
leejet
48e0a28ddf
feat: add shift factor support (#903) 2025-10-23 01:20:29 +08:00
leejet
d05e46ca5e
chore: add .clang-tidy configuration and apply modernize checks (#902) 2025-10-18 23:23:40 +08:00
Wagner Bruna
64a7698347
chore: report number of Qwen layers as info (#901) 2025-10-18 23:22:01 +08:00
leejet
0723ee51c9
refactor: optimize option printing (#900) 2025-10-18 17:50:30 +08:00
leejet
90ef5f8246
feat: add auto-resize support for reference images (was Qwen-Image-Edit only) (#898) 2025-10-18 16:37:09 +08:00
leejet
db6f4791b4
feat: add wtype stat (#899) 2025-10-17 23:40:32 +08:00
leejet
b25785bc10 sync: update ggml 2025-10-17 21:46:39 +08:00
leejet
0585e2609d docs: split README sections (build, performance, etc.) into separate docs 2025-10-16 23:22:06 +08:00
leejet
683d6d08a8 chore: add github issue template 2025-10-16 21:04:41 +08:00
leejet
40a6a8710e
fix: resolve precision issues in SDXL VAE under fp16 (#888)
* fix: resolve precision issues in SDXL VAE under fp16

* add --force-sdxl-vae-conv-scale option

* update docs
2025-10-15 23:01:00 +08:00
Daniele
e3702585cb
feat: added prediction argument (#334) 2025-10-15 23:00:10 +08:00
cmdr2
a7d6d296c7
chore: allow building ggml as a separate shared lib (#468) 2025-10-15 22:10:26 +08:00
leejet
2e9242e37f
feat: add Qwen Image Edit support (#877)
* add ref latent support for qwen image

* optimize clip_preprocess and fix get_first_stage_encoding

* add qwen2vl vit support

* add qwen image edit support

* fix qwen image edit pipeline

* add mmproj file support

* support dynamic number of Qwen image transformer blocks

* set prompt_template_encode_start_idx every time

* to_add_out precision fix

* to_out.0 precision fix

* update docs
2025-10-13 23:17:18 +08:00
Wagner Bruna
c64994dc1d
fix: better progress display for second-order samplers (#834) 2025-10-13 22:12:48 +08:00
Wagner Bruna
5436f6b814
fix: correct canny preprocessor (#861) 2025-10-13 22:02:35 +08:00
leejet
1c32fa03bc
fix: avoid generating black images when running T5 on the GPU (#882) 2025-10-13 00:01:06 +08:00
Wagner Bruna
9727c6bb98
fix: resolve VAE tiling problem in Qwen Image (#873) 2025-10-12 23:45:53 +08:00
leejet
beb99a2de2
feat: add Qwen Image support (#851)
* add qwen tokenizer

* add qwen2.5 vl support

* mv qwen.hpp -> qwenvl.hpp

* add qwen image model

* add qwen image t2i pipeline

* fix qwen image flash attn

* add qwen image i2i pipeline

* change encoding of vocab_qwen.hpp to utf8

* fix get_first_stage_encoding

* apply jeffbolz f32 patch

https://github.com/leejet/stable-diffusion.cpp/pull/851#issuecomment-3335515302

* fix the issue that occurs when using CUDA with k-quants weights

* optimize the handling of the FeedForward precision fix

* to_add_out precision fix

* update docs
2025-10-12 23:23:19 +08:00
Wagner Bruna
aa68b875b9
refactor: deal with default img-cfg-scale at the library level (#869) 2025-10-12 23:17:52 +08:00
Wagner Bruna
5b261b9cee
feat: add a stand-alone upscale mode (#865)
* feat: add a stand-alone upscale mode

* fix prompt option check

* format code

* update README.md

---------

Co-authored-by: leejet <leejet714@gmail.com>
2025-10-12 23:10:02 +08:00
Pedrito
e70d0205ca
feat: add support for more esrgan models & x2 & x1 models (#855) 2025-10-12 22:53:31 +08:00
leejet
02af48a97f
chore: fix vulkan ci (#878) 2025-10-11 00:40:57 +08:00
leejet
e12d5e0aaf
fix: ensure directory iteration results are sorted by filename (#858) 2025-10-11 00:18:39 +08:00
Serkan Sahin
940a2018e1
chore: fix dockerfile libgomp1 dependency + improvements (#852) 2025-10-11 00:17:45 +08:00
Sharuzzaman Ahmat Raslan
b451728b2f
docs: update README.md (#866) 2025-10-11 00:11:10 +08:00
stduhpf
11f436c483
feat: add support for Flux Controls and Flex.2 (#692) 2025-10-11 00:06:57 +08:00
leejet
35843c77ea
fix: optimize the handling of embedding weight (#859) 2025-09-25 23:09:59 +08:00
leejet
6ad46bb700 sync: update ggml 2025-09-25 21:57:43 +08:00
leejet
1ba30ce005 sync: update ggml 2025-09-25 00:38:38 +08:00
leejet
2abe9451c4
fix: optimize the handling of CLIP embedding weight (#840) 2025-09-25 00:28:20 +08:00
Wagner Bruna
f3140eadbb
fix: tensor loading thread count (#854) 2025-09-25 00:26:38 +08:00
Stefan-Olt
98ba155fc6
docs: HipBLAS / ROCm build instruction fix (#843) 2025-09-25 00:03:05 +08:00
Wagner Bruna
513f36d495
docs: include Vulkan compatibility for LoRA quants (#845) 2025-09-25 00:01:10 +08:00
rmatif
1e0d2821bb
fix: correct tensor deduplication logic (#844) 2025-09-24 23:22:40 +08:00
leejet
fd693ac6a2
refactor: remove unused --normalize-input parameter (#835) 2025-09-18 00:12:53 +08:00
Wagner Bruna
171b2222a5
fix: avoid segfault for pix2pix models without reference images (#766)
* fix: avoid segfault for pix2pix models with no reference images

* fix: default to empty reference on pix2pix models to avoid segfault

* use resize instead of reserve

* format code

---------

Co-authored-by: leejet <leejet714@gmail.com>
2025-09-18 00:11:38 +08:00
leejet
567f9f14f0 fix: avoid multithreading issues in the model loader 2025-09-18 00:00:15 +08:00
leejet
1e5f207006
chore: fix workflow (#836) 2025-09-17 22:11:55 +08:00
leejet
79426d578e chore: set release tag by commit count 2025-09-16 23:24:36 +08:00
vmobilis
97ad3e7ff9
refactor: simplify DPM++ (2S) Ancestral (#667) 2025-09-16 23:05:25 +08:00
Erik Scholz
8909523e92
refactor: move tiling calc and debug print into the tiling code branch (#833) 2025-09-16 22:46:56 +08:00
rmatif
8376dfba2a
feat: add sgm_uniform scheduler, simple scheduler, and support for NitroFusion (#675)
* feat: Add timestep shift and two new schedulers

* update readme

* fix spaces

* format code

* simplify SGMUniformSchedule

* simplify shifted_timestep logic

* avoid conflict

---------

Co-authored-by: leejet <leejet714@gmail.com>
2025-09-16 22:42:09 +08:00
leejet
0ebe6fe118
refactor: simplify the logic of pm id image loading (#827) 2025-09-14 22:50:21 +08:00
rmatif
55c2e05d98
feat: optimize tensor loading time (#790)
* opt tensor loading

* fix build failure

* revert the changes

* allow the use of n_threads

* fix lora loading

* optimize lora loading

* add mutex

* use atomic

* fix build

* fix potential duplicate issue

* avoid duplicate lookup of lora tensor

* fix progress bar

* remove unused remove_duplicates

---------

Co-authored-by: leejet <leejet714@gmail.com>
2025-09-14 22:48:35 +08:00
leejet
52a97b3ac1
feat: add vace support (#819)
* add wan vace t2v support

* add --vace-strength option

* add vace i2v support

* fix the processing of vace_context

* add vace v2v support

* update docs
2025-09-14 16:57:33 +08:00
stduhpf
2c9b1e2594
feat: add VAE encoding tiling support and adaptive overlap (#484)
* implement tiling vae encode support

* Tiling (vae/upscale): adaptive overlap

* Tiling: fix edge case

* Tiling: fix crash when less than 2 tiles per dim

* remove extra dot

* Tiling: fix edge cases for adaptive overlap

* tiling: fix edge case

* set vae tile size via env var

* vae tiling: refactor again, based on a smaller buffer for alignment

* Use bigger tiles for encode (to match compute buffer size)

* Fix edge case when tile is bigger than latent

* non-square VAE tiling (#3)

* refactor tile number calculation

* support non-square tiles

* add env var to change tile overlap

* add safeguards and better error messages for SD_TILE_OVERLAP

* add safeguards and include overlapping factor for SD_TILE_SIZE

* avoid rounding issues when specifying SD_TILE_SIZE as a factor

* lower SD_TILE_OVERLAP limit

* zero-init empty output buffer

* Fix decode latent size

* fix encode

* tile size params instead of env

* Tiled vae parameter validation (#6)

* avoid crash with invalid tile sizes, use 0 for default

* refactor default tile size, limit overlap factor

* remove explicit parameter for relative tile size

* limit encoding tile to latent size

* unify code style and format code

* update docs

* fix get_tile_sizes in decode_first_stage

---------

Co-authored-by: Wagner Bruna <wbruna@users.noreply.github.com>
Co-authored-by: leejet <leejet714@gmail.com>
2025-09-14 16:00:29 +08:00
leejet
288e2d63c0 docs: update docs 2025-09-14 14:24:24 +08:00
leejet
dc46993b55
feat: increase work_ctx memory buffer size (#814) 2025-09-14 13:19:20 +08:00
Richard Palethorpe
a6a8569ea0
feat: Add SYCL Dockerfile (#651) 2025-09-14 13:02:59 +08:00
Erik Scholz
9e7befa320
fix: harden for large files (#643) 2025-09-14 12:44:19 +08:00
Wagner Bruna
c607fc3ed4
feat: use Euler sampling by default for SD3 and Flux (#753)
Thank you for your contribution.
2025-09-14 12:34:41 +08:00
Wagner Bruna
b54bec3f18
fix: do not force VAE type to f32 on SDXL (#716)
This seems to be a leftover from the initial SDXL support: it's
not enough to avoid NaN issues, and it's not needed for the
fixed sdxl-vae-fp16-fix.
2025-09-14 12:19:59 +08:00
Wagner Bruna
5869987fe4
fix: make weight override more robust against ggml changes (#760) 2025-09-14 12:15:53 +08:00
Wagner Bruna
48956ffb87
feat: reduce CLIP memory usage with no embeddings (#768) 2025-09-14 12:08:00 +08:00
Wagner Bruna
ddc4a18b92
fix: make tiled VAE reuse the compute buffer (#821) 2025-09-14 11:41:50 +08:00
leejet
fce6afcc6a
feat: add sd3 flash attn support (#815) 2025-09-11 23:24:29 +08:00
Erik Scholz
49d6570c43
feat: add SmoothStep Scheduler (#813) 2025-09-11 23:17:46 +08:00
clibdev
6bbaf161ad
chore: add install() support in CMakeLists.txt (#540) 2025-09-11 22:24:16 +08:00
clibdev
87cdbd5978
feat: use log_printf to print ggml logs (#545) 2025-09-11 22:16:05 +08:00
leejet
b017918106
chore: remove sd3 flash attention warn (#812) 2025-09-10 22:21:02 +08:00
Wagner Bruna
ac5a215998
fix: use {} for params init instead of memset (#781) 2025-09-10 21:49:29 +08:00
Wagner Bruna
abb36d66b5
chore: update flash attention warnings (#805) 2025-09-10 21:38:21 +08:00
Wagner Bruna
ff4fdbb88d
fix: accept NULL in sd_img_gen_params_t::input_id_images_path (#809) 2025-09-10 21:22:55 +08:00
Markus Hartung
abb115cd02
fix: clarify lora quant support and small fixes (#792) 2025-09-08 22:39:25 +08:00
leejet
c648001030
feat: add detailed tensor loading time stat (#793) 2025-09-07 22:51:44 +08:00
stduhpf
c587a43c99
feat: support incrementing ref image index (omni-kontext) (#755)
* kontext: support ref image indices

* lora: support x_embedder

* update help message

* Support for negative indices

* support for OmniControl (offsets at index 0)

* c++11 compat

* add --increase-ref-index option

* simplify the logic and fix some issues

* update README.md

* remove unused variable

---------

Co-authored-by: leejet <leejet714@gmail.com>
2025-09-07 22:35:16 +08:00
leejet
f8fe4e7db9
fix: add flash attn support check (#803) 2025-09-07 21:29:06 +08:00
leejet
1c07fb6fb1 docs: update docs/wan.md 2025-09-07 12:07:20 +08:00
leejet
675208dcb6 chore: update to c++17 2025-09-07 12:04:17 +08:00
leejet
d7f430cd69 docs: update docs and help message 2025-09-07 02:26:44 +08:00
stduhpf
141a4b4113
feat: add flow shift parameter (for SD3 and Wan) (#780)
* Add flow shift parameter (for SD3 and Wan)

* unify code style and fix some issues

---------

Co-authored-by: leejet <leejet714@gmail.com>
2025-09-07 02:16:59 +08:00
stduhpf
21ce9fe2cf
feat: add support for timestep boundary based automatic expert routing in Wan MoE (#779)
* Wan MoE: Automatic expert routing based on timestep boundary

* unify code style and fix some issues

---------

Co-authored-by: leejet <leejet714@gmail.com>
2025-09-07 01:44:10 +08:00
leejet
cb1d975e96
feat: add wan2.1/2.2 support (#778)
* add wan vae support

* add wan model support

* add umt5 support

* add wan2.1 t2i support

* make flash attn work with wan

* make wan a little faster

* add wan2.1 t2v support

* add wan gguf support

* add offload params to cpu support

* add wan2.1 i2v support

* crop image before resize

* set default fps to 16

* add diff lora support

* fix wan2.1 i2v

* introduce sd_sample_params_t

* add wan2.2 t2v support

* add wan2.2 14B i2v support

* add wan2.2 ti2v support

* add high noise lora support

* sync: update ggml submodule url

* avoid build failure on linux

* avoid build failure

* update ggml

* update ggml

* fix sd_version_is_wan

* update ggml, fix cpu im2col_3d

* fix ggml_nn_attention_ext mask

* add cache support to ggml runner

* fix the issue of illegal memory access

* unify image loading processing

* add wan2.1/2.2 FLF2V support

* fix end_image mask

* update to latest ggml

* add GGUFReader

* update docs
2025-09-06 18:08:03 +08:00
Wagner Bruna
2eb3845df5
fix: typo in the verbose long flag (#783) 2025-09-04 00:49:01 +08:00
stduhpf
4c6475f917
feat: show usage on unknown arg (#767) 2025-09-01 21:38:34 +08:00
SmallAndSoft
f0fa7ddc40
docs: add compile option needed by Ninja (#770) 2025-09-01 21:35:25 +08:00
SmallAndSoft
a7c7905c6d
docs: add missing dash to docs/chroma.md (#771) 2025-09-01 21:34:34 +08:00
Wagner Bruna
eea77cbad9
feat: throttle model loading progress updates (#782)
Some terminals have slow display latency, so frequent output
during model loading can actually slow down the process.

Also, since tensor loading times can vary a lot, the progress
display now shows the average across past iterations instead
of just the last one.
2025-09-01 21:32:01 +08:00
NekopenDev
0e86d90ee4
chore: add Nvidia 30 series (cuda arch 86) to build 2025-09-01 21:21:34 +08:00
leejet
5900ef6605 sync: update ggml, make cuda im2col a little faster 2025-08-03 01:29:40 +08:00
Daniele
5b8996f74a
Conv2D direct support (#744)
* Conv2DDirect for VAE stage

* Enable only for Vulkan, reduced duplicated code

* Cmake option to use conv2d direct

* conv2d direct always on for opencl

* conv direct as a flag

* fix merge typo

* Align conv2d behavior to flash attention's

* fix readme

* add conv2d direct for controlnet

* add conv2d direct for esrgan

* clean code, use enable_conv2d_direct/get_all_blocks

* format code

---------

Co-authored-by: leejet <leejet714@gmail.com>
2025-08-03 01:25:17 +08:00
Wagner Bruna
f7f05fb185
chore: avoid setting GGML_MAX_NAME when building against external ggml (#751)
An external ggml will most likely have been built with the default
GGML_MAX_NAME value (64), which would be inconsistent with the value
set by our build (128). That would be an ODR violation, and it could
easily cause memory corruption issues due to the different
sizeof(struct ggml_tensor) values.

For now, when linking against an external ggml, we demand it has been
patched with a bigger GGML_MAX_NAME, since we can't check against a
value defined only at build time.
2025-08-03 01:24:40 +08:00
Seas0
6167e2927a
feat: support build against system installed GGML library (#749) 2025-08-02 11:03:18 +08:00
leejet
f6b9aa1a43 refactor: optimize the usage of tensor_types 2025-07-28 23:18:29 +08:00
Wagner Bruna
7eb30d00e5
feat: add missing models and parameters to image metadata (#743)
* feat: add new scheduler types, clip skip and vae to image embedded params

- If a non default scheduler is set, include it in the 'Sampler' tag in the data
embedded into the final image.
- If a custom VAE path is set, include the vae name (without path and extension)
in embedded image params under a `VAE:` tag.
- If a custom Clip skip is set, include that Clip skip value in embedded image
params under a `Clip skip:` tag.

* feat: add separate diffusion and text models to metadata

---------

Co-authored-by: one-lithe-rune <skapusniak@lithe-runes.com>
2025-07-28 22:00:27 +08:00
stduhpf
59080d3ce1
feat: change image dimensions requirement for DiT models (#742) 2025-07-28 21:58:17 +08:00
R0CKSTAR
8c3c788f31
feat: upgrade musa sdk to rc4.2.0 (#732) 2025-07-28 21:51:11 +08:00
leejet
f54524f620 sync: update ggml 2025-07-28 21:50:12 +08:00
leejet
eed97a5e1d sync: update ggml 2025-07-24 23:04:08 +08:00
Ettore Di Giacinto
fb86bf4cb0
docs: add LocalAI to README's UIs (#741) 2025-07-24 22:39:26 +08:00
leejet
bd1eaef93e fix: convert f64 to f32 and i64 to i32 when loading weights 2025-07-24 00:59:38 +08:00
Erik Scholz
ab835f7d39
fix: correct head dim check and L_k padding of flash attention (#736) 2025-07-24 00:57:45 +08:00
Daniele
26f3f61d37
docs: add sd.cpp-webui as an available frontend (#738) 2025-07-23 23:51:57 +08:00
Oleg Skutte
1896b28ef2
fix: make --taesd work (#731) 2025-07-15 00:45:22 +08:00
leejet
0739361bfe fix: avoid macOS build failure 2025-07-13 20:18:10 +08:00
leejet
ca0bd9396e
refactor: update c api (#728) 2025-07-13 18:48:42 +08:00
stduhpf
a772dca27a
feat: add Instruct-Pix2pix/CosXL-Edit support (#679)
* Instruct-p2p support

* support 2 conditionings cfg

* Do not re-encode the exact same image twice

* fixes for 2-cfg

* Fix pix2pix latent inputs + improve inpainting a bit + fix naming

* prepare for other pix2pix-like models

* Support sdxl ip2p

* fix reference image embeddings

* Support 2-cond cfg properly in cli

* fix typo in help

* Support masks for ip2p models

* unify code style

* delete unused code

* use edit mode

* add img_cond

* format code

---------

Co-authored-by: leejet <leejet714@gmail.com>
2025-07-12 15:36:45 +08:00
Wagner Bruna
6d84a30c66
feat: overriding quant types for specific tensors on model conversion (#724) 2025-07-08 00:11:38 +08:00
stduhpf
dafc32d0dd
feat: add support for f64/i64 and clip_g diffusers model (#681) 2025-07-06 23:24:55 +08:00
idostyle
225162f270
fix: mark encoder.embed_tokens.weight as unused tensor (#721) 2025-07-06 23:10:10 +08:00
leejet
b9e4718fac fix: correct --chroma-enable-t5-mask argument 2025-07-06 11:11:47 +08:00
leejet
1ce1c1adca feat: make lora graph size variable 2025-07-05 22:44:22 +08:00
stduhpf
19fbfd8639
feat: override text encoders for unet models (#682) 2025-07-04 22:19:47 +08:00
Wagner Bruna
76c72628b1
fix: fix a few typos on cli help and error messages (#714) 2025-07-04 22:15:41 +08:00
vmobilis
3bae667f3d
fix: break the line after skipping tensors in VAE (#591) 2025-07-03 22:50:42 +08:00
stduhpf
8d0819c548
fix: actually use embeddings with SDXL (#657) 2025-07-03 22:39:57 +08:00
Binozo
7a8ff2e819
docs: add golang cgo bindings to README (#635) 2025-07-02 23:19:49 +08:00
rmatif
0927e8e322
docs: add Android app to README (#647) 2025-07-02 23:18:16 +08:00
stduhpf
83ef4e44ce
feat: add T5 with llama.cpp naming convention support (#654) 2025-07-02 23:13:00 +08:00
leejet
7dac89ad75 refactor: reuse some code 2025-07-01 23:33:50 +08:00
stduhpf
9251756086
feat: add CosXL support (#683) 2025-07-01 23:13:04 +08:00
leejet
ecf5db97ae chore: fix windows build and release 2025-07-01 23:05:48 +08:00
stduhpf
ea46fd6948
fix: force zero-initialize output of tiling (#703) 2025-07-01 23:01:29 +08:00
leejet
23de7fc44a chore: avoid warnings when building on linux 2025-06-30 23:49:52 +08:00
rmatif
d42fd59464
feat: add OpenCL backend support (#680) 2025-06-30 23:32:23 +08:00
Wagner Bruna
0d8b39f0ba
fix: avoid crash on sdxl loras (#658)
Some SDXL LoRAs (eg. PCM) can exceed 12k nodes.
2025-06-30 23:29:32 +08:00
R0CKSTAR
539b5b9374
fix: fix musa docker build (#662)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-06-30 23:27:40 +08:00
Wagner Bruna
b1fc16b504
fix: allow resetting clip_skip to its default value (#697) 2025-06-30 23:23:21 +08:00
leejet
d6c87dce5c docs: add chroma doc 2025-06-29 23:58:15 +08:00
leejet
a28d04dd81 fix: fix the issue in parsing --chroma-disable-dit-mask 2025-06-29 23:52:36 +08:00
leejet
45d0ebb30c style: format code 2025-06-29 23:40:55 +08:00
stduhpf
b1cc40c35c
feat: add Chroma support (#696)
---------

Co-authored-by: Green Sky <Green-Sky@users.noreply.github.com>
Co-authored-by: leejet <leejet714@gmail.com>
2025-06-29 23:36:42 +08:00
leejet
884e23eeeb docs: add kontext doc 2025-06-29 10:35:31 +08:00
stduhpf
c9b5735116
feat: add FLUX.1 Kontext dev support (#707)
* Kontext support
* add edit mode

---------

Co-authored-by: leejet <leejet714@gmail.com>
2025-06-29 10:08:53 +08:00
vmobilis
10c6501bd0
fix missing argument in prototype of stbi_write_jpg (#613) 2025-03-09 12:30:10 +08:00
vmobilis
10feacf031
fix: correct img2img time (#616) 2025-03-09 12:29:08 +08:00
vmobilis
655f8a5169
fix: clang complains about needless braces (#618) 2025-03-09 12:26:41 +08:00
idostyle
d7c7a34712
fix: ModelLoader::load_tensors duplicated check (#623)
Introduced in 2b6ec97fe244d03c40aa8d70131d40bb086099b0
2025-03-09 12:23:23 +08:00
vmobilis
81556f3136
chore: silence some warnings about precision loss (#620) 2025-03-09 12:22:39 +08:00
stduhpf
3fb275a67b
fix: support sdxl embeddings (#621) 2025-03-09 12:21:23 +08:00
leejet
30b3ac8e62 fix: avoid potential dangling pointer problem 2025-03-01 16:58:26 +08:00
leejet
195d170136 sync: update ggml 2025-03-01 12:09:55 +08:00
stduhpf
f50a7f66aa
fix: fix race condition causing inconsistent value for decoder_only (#609) 2025-03-01 11:49:06 +08:00
stduhpf
85e9a12988
fix: preprocess tensor names in tensor types map (#607)
Thank you for your contribution
2025-03-01 11:48:04 +08:00
stduhpf
fbd42b6fc1
fix: fix embeddings with quantized models (#601) 2025-03-01 11:45:39 +08:00
yslai
19d876ee30
feat: implement DDIM with the "trailing" timestep spacing and TCD (#568) 2025-02-22 21:34:22 +08:00
lalala
f27f2b2aa2
docs: add missing --mask and --guidance options to print_usage (#572) 2025-02-22 21:32:37 +08:00
piallai
99609761dc
docs: fix typo in readme (#574) 2025-02-22 21:30:28 +08:00
stduhpf
69c73789fe
fix: force binary mask for inpaint models (#589)
Co-authored-by: leejet <leejet714@gmail.com>
2025-02-22 21:29:57 +08:00
Meng, Hengyu
838beb9b5e
chore: add global SYCL compile flags (#597) 2025-02-22 21:23:58 +08:00
stduhpf
f23b803a6b
fix: unapply current loras properly (#590) 2025-02-22 21:22:22 +08:00
stduhpf
1be2491dcf
feat: partial LyCORIS support (tucker decomposition for LoCon + LoHa + LoKr) (#577) 2025-02-22 21:19:26 +08:00
Matti Pulkkinen
3753223982
fix: make get_files_from_dir work with absolute paths (#598)
Co-authored-by: Matti Pulkkinen <pulkkinen@ultimatium.com>
2025-02-22 21:16:50 +08:00
R0CKSTAR
59ca2b0f16
chore: bump MUSA SDK version to rc3.1.1 (#599)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-02-22 21:14:26 +08:00
vmobilis
d46ed5e184
feat: support JPEG compression (#583) 2025-02-05 16:18:02 +08:00
ag2s20150909
2535ad5a43
chore: fix cuda on github action (#580) 2025-02-05 16:15:41 +08:00
stduhpf
e500d95abd
fix: fix rank 1 loras (#575) 2025-02-05 16:13:17 +08:00
R0CKSTAR
a3cbdf6dcb
chore: SD_USE_CUBLAS => SD_USE_CUDA for MUSA backend (#578)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-02-05 16:11:26 +08:00
piallai
5eb15ef4d0
docs: add CLI-GUI to list (#546) 2025-01-18 13:16:54 +08:00
stduhpf
d9b5942d98
feat: add sdxl v-pred support (#536) 2025-01-18 13:15:54 +08:00
stduhpf
587a37b2e2
fix: avoid sd2 (non inpaint) crash on v-pred check (#537) 2025-01-18 13:13:34 +08:00
ag2s20150909
4fe83d52cf
chore: fix CUDA on GitHub Action (#567) 2025-01-18 13:12:26 +08:00
null-define
b70aaa672a
chore: fix amd rocm build (#571) 2025-01-18 13:11:39 +08:00
idostyle
27edb765a5
chore: fix CI windows release artifacts (#532) 2025-01-18 13:09:22 +08:00
leejet
dcf91f9e0f chore: change SD_CUBLAS/SD_USE_CUBLAS to SD_CUDA/SD_USE_CUDA 2024-12-28 13:27:51 +08:00
stduhpf
348a54e34a
feat: use pretty-progress for tensor loading (#516) 2024-12-28 13:14:52 +08:00
stduhpf
d50473dc49
feat: support 16 channel tae (taesd/taef1) (#527) 2024-12-28 13:13:48 +08:00
piallai
b5cc1422da
fix: fix typo for skip layers parameters (#492) 2024-12-28 13:12:08 +08:00
R0CKSTAR
5cc74d1f09
feat: support Moore Threads GPU (#529)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2024-12-28 13:08:36 +08:00
stduhpf
0d9d6659a7
fix: fix metal build (#513) 2024-12-28 13:06:17 +08:00
stduhpf
8f4ab9add3
feat: support Inpaint models (#511) 2024-12-28 13:04:49 +08:00
stduhpf
cc92a6a1b3
feat: support more LoRA models (#520) 2024-12-28 12:56:44 +08:00
leejet
9578fdcc46 chore: remove rocm5.5 build temporarily 2024-11-30 14:26:29 +08:00
stduhpf
9148b980be
feat: remove type restrictions (#489) 2024-11-30 14:22:15 +08:00
stduhpf
7ce63e740c
feat: flexible model architecture for dit models (Flux & SD3) (#490)
* Refactor: wtype per tensor

* Fix default args

* refactor: fix flux

* Refactor photomaker v2 support

* unet: refactor the refactoring

* Refactor: fix controlnet and tae

* refactor: upscaler

* Refactor: fix runtime type override

* upscaler: use fp16 again

* Refactor: Flexible sd3 arch

* Refactor: Flexible Flux arch

* format code

---------

Co-authored-by: leejet <leejet714@gmail.com>
2024-11-30 14:18:53 +08:00
leejet
4570715727 fix: use ggml_nn_attention in vae 2024-11-24 18:21:31 +08:00
stduhpf
53b415f787
fix: remove default variables in c headers (#478) 2024-11-24 18:10:25 +08:00
leejet
c3eeb669cd sync: update ggml 2024-11-23 13:29:32 +08:00
leejet
b5f4932696 refactor: add some sd version helper functions 2024-11-23 13:02:44 +08:00
Erik Scholz
1c168d98a5
fix: repair flash attention support (#386)
* repair flash attention in _ext
this does not fix the currently broken fa behind the define, which is only used by VAE

Co-authored-by: FSSRepo <FSSRepo@users.noreply.github.com>

* make flash attention in the diffusion model a runtime flag
no support for sd3 or video

* remove old flash attention option and switch vae over to attn_ext

* update docs

* format code

---------

Co-authored-by: FSSRepo <FSSRepo@users.noreply.github.com>
Co-authored-by: leejet <leejet714@gmail.com>
2024-11-23 12:39:08 +08:00
William Murray
ea9b647080
docs: update readme, add python bindings (#423) 2024-11-23 11:52:33 +08:00
bssrdf
2b1bc06477
feat: add PhotoMaker Version 2 support (#358)
* first attempt at updating to photomaker v2

* continue adding photomaker v2 modules

* finishing the last few pieces for photomaker v2; id_embeds need to be done by a manual step and passed as an input file

* added a name converter for Photomaker V2; build ok

* more debugging underway

* failing at cuda mat_mul

* updated chunk_half to be more efficient; redo feedforward

* fixed a bug: carefully using ggml_view_4d to get chunks of a tensor; strides need to be recalculated or set properly; still failing at soft_max cuda op

* redo weight calculation and weight*v

* fixed a bug; now Photomaker V2 kind of works

* add python script for face detection (Photomaker V2 needs)

* updated readme for photomaker

* fixed a bug causing PMV1 crashing; both V1 and V2 work

* fixed clean_input_ids for PMV2

* fixed a double counting bug in tokenize_with_trigger_token

* updated photomaker readme

* removed some commented code

* improved reconstructing class word free prompt

* changed reading id_embed to raw binary using existing load tensor function; this is more efficient than using model load and also makes it easier to work with sd server

* minor clean up

---------

Co-authored-by: bssrdf <bssrdf@gmail.com>
2024-11-23 11:50:14 +08:00
Flavio Bizzarri
b99cbfe4dc
docs: update README.md (#452) 2024-11-23 11:46:50 +08:00
Plamen Minev
8c7719fe9a
fix: typo in clip-g encoder arg (#472) 2024-11-23 11:46:00 +08:00
LostRuins Concedo
8f94efafa3
feat: add support for loading F8_E5M2 weights (#460) 2024-11-23 11:45:11 +08:00
fszontagh
07585448ad
docs: update readme (#462) 2024-11-23 11:42:12 +08:00
stduhpf
6ea812256e
feat: add flux 1 lite 8B (freepik) support (#474)
* Flux Lite (Freepik) support

* format code

---------

Co-authored-by: leejet <leejet714@gmail.com>
2024-11-23 11:41:30 +08:00
stduhpf
9b1d90bc23
fix: improve clip text_projection support (#397) 2024-11-23 11:19:27 +08:00
stduhpf
65fa646684
feat: add sd3.5 medium and skip layer guidance support (#451)
* mmdit-x

* add support for sd3.5 medium

* add skip layer guidance support (mmdit only)

* ignore slg if slg_scale is zero (optimization)

* init out_skip once

* slg support for flux (experimental)

* warn if version doesn't support slg

* refactor slg cli args

* set default slg_scale to 0 (oops)

* format code

---------

Co-authored-by: leejet <leejet714@gmail.com>
2024-11-23 11:15:31 +08:00
leejet
ac54e00760
feat: add sd3.5 support (#445) 2024-10-24 21:58:03 +08:00
stduhpf
14206fd488
fix: fix clip tokenizer (#383) 2024-09-02 22:31:46 +08:00
zhentaoyu
e410aeb534
sync: update ggml to fix large image generation with SYCL backend (#380)
* turn off fast-math on host in SYCL backend

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

* update ggml for sync some sycl ops

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

* update sycl readme and ggml

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

---------

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>
2024-09-02 22:29:35 +08:00
leejet
58d54738e2 docs: add star history 2024-08-28 00:27:54 +08:00
leejet
4f87b232c2 docs: add Vulkan build command 2024-08-28 00:25:31 +08:00
Erik Scholz
e71ddcedad
fix: improve VAE tiling (#372)
* fix and improve: VAE tiling
- properly handle the upper left corner interpolating both x and y
- refactor out lerp
- use smootherstep to preserve more detail and spend less area blending

* actually fix vae tile merging

Co-authored-by: stduhpf <stephduh@live.fr>

* remove the now unused lerp function

---------

Co-authored-by: stduhpf <stephduh@live.fr>
2024-08-28 00:21:12 +08:00
stduhpf
f4c937cb94
fix: add some missing cli args to usage (#363) 2024-08-28 00:17:46 +08:00
Daniele
0362cc4874
fix: fix some typos (#361) 2024-08-28 00:15:37 +08:00
Yu Xing
6c88ad3fd6
fix: resolve naming conflict while llama.cpp and sd.cpp both build (#351) 2024-08-28 00:14:41 +08:00
Daniele
dc0882cdc9
feat: add exponential scheduler (#346)
* feat: added exponential scheduler

* updated README

* improved exponential formatting

---------

Co-authored-by: leejet <leejet714@gmail.com>
2024-08-28 00:13:35 +08:00
Daniele
d00c94844d
feat: add ipndm and ipndm_v samplers (#344) 2024-08-28 00:03:41 +08:00
Daniele
2d4a2f7982
feat: add GITS scheduler (#343) 2024-08-28 00:02:17 +08:00
Tim Miller
353ee93e2d
fix: add enum type to sd_type_t (#293) 2024-08-27 23:57:24 +08:00
soham
2027b16fda
feat: add vulkan backend support (#291)
* Fix includes and init vulkan the same as llama.cpp

* Add Windows Vulkan CI

* Updated ggml submodule

* support epsilon as a parameter for ggml_group_norm

---------

Co-authored-by: Cloudwalk <cloudwalk@icculus.org>
Co-authored-by: Oleg Skutte <00.00.oleg.00.00@gmail.com>
Co-authored-by: leejet <leejet714@gmail.com>
2024-08-27 23:56:09 +08:00
leejet
8847114abf fix: fix issue when applying lora 2024-08-25 22:39:39 +08:00
leejet
5c561eab31 feat: do not convert more flux tensors 2024-08-25 16:01:36 +08:00
leejet
f5997a1951 fix: do not force using f32 for some flux layers
This sometimes leads to worse results
2024-08-25 14:07:22 +08:00
leejet
1bdc767aaf feat: force using f32 for some layers 2024-08-25 13:53:16 +08:00
leejet
79c9fe9556 feat: do not convert some tensors 2024-08-25 13:37:37 +08:00
leejet
28a614769a docs: update docs/flux.md 2024-08-25 13:11:34 +08:00
leejet
c837c5d9cc style: format code 2024-08-25 00:19:37 +08:00
leejet
d08d7fa632 docs: update README.md 2024-08-24 14:38:44 +08:00
leejet
64d231f384
feat: add flux support (#356)
* add flux support

* avoid build failures in non-CUDA environments

* fix schnell support

* add k quants support

* add support for applying lora to quantized tensors

* add inplace conversion support for f8_e4m3 (#359)

in the same way it is done for bf16
like how bf16 converts losslessly to fp32,
f8_e4m3 converts losslessly to fp16

* add xlabs flux comfy converted lora support

* update docs

---------

Co-authored-by: Erik Scholz <Green-Sky@users.noreply.github.com>
2024-08-24 14:29:52 +08:00
zhentaoyu
697d000f49
feat: add SYCL Backend Support for Intel GPUs (#330)
* update ggml and add SYCL CMake option

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

* hacky CMakeLists.txt for updating ggml in cpu backend

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

* rebase and clean code

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

* add sycl in README

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

* rebase ggml commit

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

* refine README

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

* update ggml for supporting sycl tsembd op

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

---------

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>
2024-08-10 13:42:50 +08:00
leejet
5b8d16aa68 docs: reorganize README.md 2024-08-03 12:06:34 +08:00
leejet
3d854f7917 sync: update ggml submodule url 2024-08-03 11:42:12 +08:00
leejet
4a6e36edc5 sync: update ggml 2024-07-28 18:30:35 +08:00
leejet
73c2176648
feat: add sd3 support (#298) 2024-07-28 15:44:08 +08:00
Phu Tran
9c51d8787f
chore: fix cuda CI (#286) 2024-06-12 23:13:24 +08:00
leejet
f9f0d4685b fix: sample_k_diffusion should be static 2024-06-10 23:04:02 +08:00
leejet
8d2050a5cf sync: update ggml 2024-06-10 22:59:36 +08:00
leejet
08f5b41956 refactor: make the sampling module more independent 2024-06-10 22:42:15 +08:00
Eugene
b6daf5c55b
fix: use PRI64 instead of %i for some log (#269) 2024-06-01 14:01:58 +08:00
leejet
be6cd1a4bf sync: update ggml 2024-06-01 13:44:09 +08:00
Justine Tunney
e1384defca
perf: make crc32 100x faster on x86-64 (#278)
This change makes checkpoints load significantly faster by optimizing
pkzip's cyclic redundancy check. This code was developed by Intel and
Google and Mozilla. See Chromium's zlib codebase for further details.
2024-06-01 12:58:30 +08:00
Phu Tran
814280343c
chore: update artifact actions (#267) 2024-06-01 12:33:13 +08:00
leejet
1d2af5ca3f fix: set n_dims of tensor storage to 1 when it's 0 2024-05-14 23:06:52 +08:00
Grauho
ce1bcc74a6
feat: add AYS(Align Your Steps) scheduler (#241)
Added NVIDIA's new "Align Your Steps" style scheduler in accordance with their
quick start guide. Currently has handling for SD1.5, SDXL, and SVD, using the
noise levels from their paper to generate the sigma values. Can be selected
using the --schedule ays command line switch. Updates the main.cpp help
message and README to reflect this option; they now also inform the user
of the --color switch.

---------

Co-authored-by: leejet <leejet714@gmail.com>
2024-04-29 23:21:32 +08:00
Eugene
760cfaa618
fix: ignore tensors with the particular dim while loading (#233) 2024-04-29 23:04:27 +08:00
Eugene
6d16f6853e
fix: correct upscale progressbar (#232) 2024-04-29 22:59:46 +08:00
leejet
036ba9e6d8 feat: enable controlnet and photo maker for img2img mode 2024-04-14 16:36:08 +08:00
leejet
ec82d5279a refactor: remove some useless code 2024-04-14 14:04:52 +08:00
bssrdf
afea457eda
fix: support more SDXL LoRA names (#216)
* apply pmid lora only once for multiple txt2img calls

* add better support for SDXL LoRA

* fix for some sdxl lora, like lcm-lora-xl

---------

Co-authored-by: bssrdf <bssrdf@gmail.com>
Co-authored-by: leejet <leejet714@gmail.com>
2024-04-06 17:12:03 +08:00
null-define
646e77638e
fix: fix tiles_ctx not freed in sd_tiling (#219) 2024-04-06 16:51:48 +08:00
leejet
3ac48ea1a7 fix: use static implementation of stb_image_resize 2024-04-06 16:37:08 +08:00
Phu Tran
607e39489f
docs: add Jellybox as UI using sd.cpp (#214) 2024-04-02 12:31:54 +08:00
delldu
ccae95aec9
feat: support RGBA image input of flexible size (#212)
* Support png image and resize image with 64 pixels in img2img mode

* update the error information

---------

Co-authored-by: leejet <leejet714@gmail.com>
2024-04-02 12:29:18 +08:00
bssrdf
90e9178d18
fix: apply pmid lora only once for multiple txt2img calls (#208)
Co-authored-by: bssrdf <bssrdf@gmail.com>
2024-04-02 11:58:29 +08:00
Grauho
48bcce493f
fix: avoid double free and fix sdxl lora naming conversion
* Fixed a double free issue when running multiple backends on the CPU, eg: CLIP
and the primary backend, as this would result in the *_backend pointers both
pointing to the same thing resulting in a segfault when calling the
StableDiffusionGGML destructor.

* Improve logging to allow for a color switch on the command line interface.
Changed the base log_printf function to not bake the log level directly into
the log buffer, as that information is already passed to the logging function via
the level parameter and it's easier to add it there than strip it out.

* Added a fix for certain SDXL LoRAs that don't seem to follow the expected
naming convention, converts over the tensor name during the LoRA model
loading. Added some logging of useful LoRA loading information. Had to
increase the base size of the GGML graph as the existing size results in an
insufficient graph memory error when using SDXL LoRAs.

* small fixes

---------

Co-authored-by: leejet <leejet714@gmail.com>
2024-03-20 22:00:22 +08:00
bssrdf
a469688e30
feat: add TencentARC PhotoMaker support (#179)
* first efforts at implementing photomaker; lots more to do

* added PhotoMakerIDEncoder model in SD

* fixed some bugs; now photomaker model weights can be loaded into their tensor buffers

* added input id image loading

* added preprocessing of input id images

* finished get_num_tensors

* fixed a bug in remove_duplicates

* add a get_learned_condition_with_trigger function to do photomaker stuff

* add a convert_token_to_id function for photomaker to extract trigger word's token id

* making progress; need to implement tokenizer decoder

* making more progress; finishing vision model forward

* debugging vision_model outputs

* corrected clip vision model output

* continue making progress in id fusion process

* finished stacked id embedding; to be tested

* remove garbage file

* debugging graph compute

* more progress; now alloc buffer failed

* fixed wtype issue; only 1 input image works because of an issue with the transformer when batch size > 1 (to be investigated)

* added delayed subject conditioning; now photomaker runs and generates images

* fixed stat_merge_step

* added photomaker lora model (to be tested)

* reworked pmid lora

* finished applying pmid lora; to be tested

* finalized pmid lora

* add a few print tensor; tweak in sample again

* small tweak; still not getting ID faces

* fixed a bug in FuseBlock forward; also removed diag_mask op for the vision transformer; getting better results

* disable pmid lora apply for now; 1 input image seems working; > 1 not working

* turn pmid lora apply back on

* fixed a decode bug

* fixed a bug in ggml's conv_2d, and now > 1 input images working

* add style_ratio as a cli param; reworked encode with trigger for attention weights

* merge commit fixing lora free param buffer error

* change default style ratio to 10%

* added an option to offload vae decoder to CPU for mem-limited gpus

* removing the image normalization step seems to make ID fidelity much higher

* revert default style ratio back to 20%

* added an option for normalizing input ID images; cleaned up debugging code

* more clean up

* fixed bugs; now failed with cuda error; likely out-of-mem on GPU

* free pmid model params when required

* photomaker working properly now after merging and adapting to GGMLBlock API

* remove tensor renaming; fixing names in the photomaker model file

* updated README.md to include instructions and notes for running PhotoMaker

* a bit clean up

* remove -DGGML_CUDA_FORCE_MMQ; more clean up and README update

* add input image requirement in README

* bring back freeing pmid lora params buffer; simply pooled output of CLIPvision

* remove MultiheadAttention2; customized MultiheadAttention

* added a WIN32 get_files_from_dir; turn off PhotoMaker if receiving no input images

* update docs

* fix ci error

* make stable-diffusion.h a pure c header file

This reverts commit 27887b630db6a92f269f0aef8de9bc9832ab50a9.

* fix ci error

* format code

* reuse get_learned_condition

* reuse pad_tokens

* reuse CLIPVisionModel

* reuse LoraModel

* add --clip-on-cpu

* fix lora name conversion for SDXL

---------

Co-authored-by: bssrdf <bssrdf@gmail.com>
Co-authored-by: leejet <leejet714@gmail.com>
2024-03-12 23:15:17 +08:00
leejet
61980171a1 sync: update ggml 2024-03-10 17:23:11 +08:00
Cyberhan123
583cc5bba2
docs: add binding (#189) 2024-03-03 13:27:07 +08:00
Phu Tran
1ce9470f27
fix: fix building shared library (#188) 2024-03-03 13:24:59 +08:00
leejet
a65c410463 sync: update ggml 2024-03-02 19:49:41 +08:00
leejet
a17ae7b7d2 sync: update ggml 2024-03-02 19:23:11 +08:00
leejet
e1b37b4ef6 fix: update ggml submodule url 2024-03-02 17:34:08 +08:00
fszontagh
7be65faa7c
feat: add progress callback (#170) 2024-03-02 17:28:41 +08:00
Phu Tran
d164236b2a
fix: fix metal build issues (#183) 2024-03-02 17:17:57 +08:00
leejet
ef5c3f7401 feat: add support for prompt longer than 77 2024-03-02 17:13:18 +08:00
Cyberhan123
b7870a0f89
chore: improve ci (#150)
---------

Co-authored-by: leejet <leejet714@gmail.com>
2024-02-26 22:01:34 +08:00
leejet
4a8190405a fix: fix the issue with dynamic linking 2024-02-25 21:39:01 +08:00
leejet
730585d515
sync: update ggml (#180) 2024-02-25 21:11:01 +08:00
Sean Bailey
193fb620b1
feat: add capability to repeatedly run the upscaler in a row (#174)
* Add in upscale repeater logic

---------

Co-authored-by: leejet <leejet714@gmail.com>
2024-02-24 21:31:01 +08:00
leejet
b6368868d9
feat: introduce GGMLBlock and implement SVD(Broken) (#159)
* introduce GGMLBlock and implement SVD(Broken)

* add sdxl vae warning
2024-02-24 20:06:39 +08:00
leejet
349439f239 style: format code 2024-01-29 23:05:18 +08:00
Steward Garcia
36ec16ac99
feat: Control Net support + Textual Inversion (embeddings) (#131)
* add controlnet to pipeline

* add cli params

* control strength cli param

* cli param keep controlnet in cpu

* add Textual Inversion

* add canny preprocessor

* refactor: change ggml_type_sizef to ggml_row_size

* process hint only once

* ignore the embedding name case

---------

Co-authored-by: leejet <leejet714@gmail.com>
2024-01-29 22:38:51 +08:00
旺旺碎冰冰
c6071fa82f
feat: add hipBlas support (#94) 2024-01-14 11:53:42 +08:00
leejet
5c614e4bc2
feat: add convert api (#142) 2024-01-14 11:43:24 +08:00
leejet
2b6ec97fe2
sync: update ggml (#134) 2024-01-05 23:18:41 +08:00
leejet
db382348cc fix: change GGML_MAX_NAME to 128 2024-01-03 22:42:42 +08:00
leejet
7cb41b190f fix: avoid encountering 'std::set undefined' in some environments 2024-01-02 22:37:01 +08:00
leejet
7fb8a51318 chore: make SD_BUILD_DLL visible only to SD_LIB 2024-01-02 22:31:40 +08:00
leejet
2c5f3fc53a chore: add support for building shared library 2024-01-02 21:05:44 +08:00
Erik Scholz
f2e4d9793b
fix: avoid some memory leaks (#136)
---------

Co-authored-by: leejet <leejet714@gmail.com>
2024-01-01 23:27:29 +08:00
Erik Scholz
4a5e7b58e2
fix: never use a log message as a format string (#135) 2024-01-01 20:43:47 +08:00
leejet
2e79a82f85
refactor: reorganize code and use c api (#133) 2024-01-01 16:22:18 +08:00
leejet
b139434b57 docs: update README.md 2023-12-31 11:48:41 +08:00
leejet
14da17a923 fix: initialize some pointers to NULL 2023-12-30 14:24:45 +08:00
leejet
78ad76f3f4
feat: add SDXL support (#117)
* add SDXL support

* fix the issue with generating large images
2023-12-29 00:16:10 +08:00
Steward Garcia
004dfbef27
feat: implement ESRGAN upscaler + Metal Backend (#104)
* add esrgan upscaler

* add sd_tiling

* support metal backend

* add clip_skip

---------

Co-authored-by: leejet <leejet714@gmail.com>
2023-12-28 23:46:48 +08:00
旺旺碎冰冰
0e64238e4c
feat: implement the complete bpe function (#119)
* implement the complete bpe function
---------

Co-authored-by: leejet <leejet714@gmail.com>
2023-12-23 12:11:07 +08:00
leejet
8f6b4a39d6
fix: enhance the tokenizer's handling of Unicode (#120) 2023-12-21 00:22:03 +08:00
Kreijstal
9842a3f819
fix: add support for int32_t on other compilers (#114) 2023-12-11 23:32:39 +08:00
leejet
ac8f5a044c feat: add SD-Turbo support 2023-12-10 13:15:09 +08:00
Sam Jones
ca33304318
fix: remove dangling pointer to work_output in CLIPTextModel (#111) 2023-12-10 10:05:02 +08:00
leejet
69efe3ce2b chore: make code cleaner 2023-12-09 17:35:10 +08:00
leejet
2eac844bbd fix: generate image correctly in img2img mode 2023-12-09 14:39:43 +08:00
leejet
968226abb2 docs: update v2-1_768-nonema-pruned.safetensors url 2023-12-05 22:52:19 +08:00
Steward Garcia
134883aec4
feat: add TAESD implementation - faster autoencoder (#88)
* add taesd implementation

* taesd gpu offloading

* show seed when generating image with -s -1

* less restrictive with larger images

* cuda: im2col speedup x2

* cuda: group norm speedup x90

* quantized models now work in cuda :)

* fix cal mem size

---------

Co-authored-by: leejet <leejet714@gmail.com>
2023-12-05 22:40:03 +08:00
leejet
f99bcd1f76 fix: detect model format based on file content 2023-12-03 20:30:31 +08:00
leejet
8a87b273ad fix: allow model and vae using different format 2023-12-03 17:12:04 +08:00
leejet
d7af2c2ba9
feat: load weights from safetensors and ckpt (#101) 2023-12-03 15:47:20 +08:00
旺旺碎冰冰
47dd704198
fix: avoid build fail on msvc (#93) 2023-11-28 20:49:11 +08:00
Erik Scholz
f469b835a3
fix: reading memory of stack allocated object past its scope (#91) 2023-11-27 21:37:12 +08:00
Steward Garcia
8124588cf1
feat: ggml-alloc integration and gpu acceleration (#75)
* set ggml url to FSSRepo/ggml

* ggml-alloc integration

* offload all functions to gpu

* gguf format + native converter

* merge custom vae to a model

* full offload to gpu

* improve pretty progress

---------

Co-authored-by: leejet <leejet714@gmail.com>
2023-11-26 19:02:36 +08:00
Erik Scholz
c874063408
fix: support bf16 lora weights (#82) 2023-11-20 22:34:17 +08:00
Urs Ganse
ae1d5dcebb
feat: allow LoRAs with negative multiplier (#83)
* Allow Loras with negative weight, too.

There are a couple of loras, which serve to adjust certain concepts in
both positive and negative directions (like exposure, detail level etc).

The current code rejects them if loaded with a negative weight, but I
suggest that this check can simply be dropped.

* ignore lora in the case of multiplier == 0.f

---------

Co-authored-by: Urs Ganse <urs@nerd2nerd.org>
Co-authored-by: leejet <leejet714@gmail.com>
2023-11-20 22:23:52 +08:00
leejet
51b53d4cb1 chore: typo remote => remove 2023-11-19 23:21:49 +08:00
leejet
0d9b801aaa fix: fix multi loras prompt parse 2023-11-19 23:19:37 +08:00
leejet
176a00b606 chore: add .clang-format 2023-11-19 19:35:33 +08:00
leejet
64f6002457 docs: add contributors info to README.md 2023-11-19 18:35:19 +08:00
leejet
9a9f3daf8e feat: add LoRA support 2023-11-19 17:43:49 +08:00
leejet
536f3af672 feat: add lcm sampler support
This references an issue discussion of the stable-diffusion-webui at
https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/13952, which
may not be perfect.
2023-11-17 22:53:46 +08:00
leejet
3bf1665885 chore: clear the msvc compilation warning 2023-10-28 20:55:24 +08:00
leejet
3001c23f7d perf: change ggml graph eval order to RIGHT_TO_LEFT to optimize memory usage 2023-10-28 20:19:15 +08:00
leejet
ed374983f3 fix: set eps of ggml_norm(LayerNorm) to 1e-5 2023-10-27 00:50:23 +08:00
leejet
4c96185fcc fix: update ggml to avoid insufficient memory error on macOS 2023-10-24 22:04:00 +08:00
leejet
fbd18e1059 fix: avoid stack overflow on MSVC 2023-10-23 21:10:46 +08:00
leejet
09cab2a2ae chore: set default BUILD_SHARED_LIBS to OFF 2023-10-22 14:59:03 +08:00
leejet
69e54ace14 sync: update ggml 2023-10-22 14:11:06 +08:00
Robert Bledsaw
29a56f2e98
docs: update README.md (#71)
fixed typo in curl command.
2023-10-22 13:03:32 +08:00
Urs Ganse
afec5051cf
feat: write generation parameter exif data into output png (#57)
* Write generation parameter exif data into output pngs.

This adds prompt, negative prompt (if nonempty) and other generation
parameters to the output file as a tEXt PNG block, in the same format as
AUTOMATIC1111 webui does.

In order to keep everything free of external library dependencies, I
have somewhat dirtily hacked this into the stb_image_write
implementation.

* Mention png text data in README.md, include "karras" in sampler text

* add Steps/Model/RNG to parameter string

---------

Co-authored-by: leejet <leejet714@gmail.com>
2023-09-18 21:09:15 +08:00
leejet
bd62138751 feat: adapt to more weight formats 2023-09-13 00:07:16 +08:00
Urs Ganse
3a25179d52
feat: add DPM2 and DPM++(2s) a samplers (#56)
* Add DPM2 sampler.

* Add DPM++ (2s) a sampler.

* Update README.md with added samplers

---------

Co-authored-by: leejet <leejet714@gmail.com>
2023-09-12 23:02:09 +08:00
Urs Ganse
968fbf02aa
feat: add option to switch the sigma schedule (#51)
Concretely, this allows switching to the "Karras" schedule from the
Karras et al 2022 paper, equivalent to the samplers marked as "Karras"
in the AUTOMATIC1111 WebUI. This choice is in principle orthogonal to
the sampler choice and can be given independently.
2023-09-09 00:02:07 +08:00
Urs Ganse
b6899e8fc2
feat: add Euler, Heun and DPM++ (2M) samplers (#50)
* Add Euler sampler

* Add Heun sampler

* Add DPM++ (2M) sampler

* Add modified DPM++ (2M) "v2" sampler.

This was proposed in an issue discussion of the stable-diffusion-webui,
at https://github.com/AUTOMATIC1111/stable-diffusion-webui/discussions/8457
and apparently works around overstepping of the DPM++ (2M) method with
small step counts.

The parameter is called dpmpp2mv2 here.

* match code style

---------

Co-authored-by: Urs Ganse <urs@nerd2nerd.org>
Co-authored-by: leejet <leejet714@gmail.com>
2023-09-08 23:47:28 +08:00
leejet
b85b236b13 feat: set default rng to cuda 2023-09-04 21:46:54 +08:00
leejet
34a118d407 fix: avoid coredump when steps == 1 2023-09-04 21:44:38 +08:00
leejet
f6ff06fcb7 fix: avoid coredump when generating large image 2023-09-04 21:37:46 +08:00
leejet
cf38e238d4 fix: width and height should be a multiple of 64 2023-09-04 20:49:52 +08:00
leejet
b247581782 fix: insufficient memory error on macOS 2023-09-04 03:50:42 +08:00
leejet
bb3f19cb40 fix: increase ctx_size 2023-09-04 03:45:43 +08:00
leejet
7620b920c8 use new graph api to avoid stack overflow on msvc 2023-09-03 22:56:33 +08:00
leejet
3ffffa6929 fix: do not check weights of open clip last layer 2023-09-03 21:10:08 +08:00
leejet
45842865ff fix: seed should be 64 bit 2023-09-03 20:08:22 +08:00
leejet
e5a7aec252 feat: add CUDA RNG 2023-09-03 19:24:07 +08:00
leejet
31e77e1573
feat: add SD2.x support (#40) 2023-09-03 16:00:33 +08:00
leejet
c542a77a3f fix: correct the handling of weight loading 2023-08-30 21:44:06 +08:00
Derek Anderson
1b5a868296
fix: flushes after printf (#38) 2023-08-30 20:47:25 +08:00
leejet
c8f85a4e30 sync: update ggml 2023-08-27 14:35:26 +08:00
leejet
d765b95ed1 perf: make ggml_conv_2d faster 2023-08-26 17:08:59 +08:00
leejet
008d80a0b1 docs: update README.md 2023-08-25 20:59:18 +08:00
leejet
467bc5baeb perf: make ggml_conv_2d a little faster 2023-08-25 00:09:34 +08:00
173 changed files with 4418095 additions and 4465 deletions

12
.clang-format Normal file
View File

@ -0,0 +1,12 @@
BasedOnStyle: Chromium
UseTab: Never
IndentWidth: 4
TabWidth: 4
AllowShortIfStatementsOnASingleLine: false
ColumnLimit: 0
AccessModifierOffset: -4
NamespaceIndentation: All
FixNamespaceComments: false
AlignAfterOpenBracket: true
AlignConsecutiveAssignments: true
IndentCaseLabels: true
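For contributors, a minimal sketch of how this style file would typically be applied locally (assumes clang-format is installed and is run from the repository root; the file list is illustrative):

```sh
# Format the main sources in place using the repository's .clang-format
clang-format -i stable-diffusion.cpp stable-diffusion.h *.hpp

# Or just check a single file without modifying it
clang-format --dry-run --Werror stable-diffusion.cpp
```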

10
.clang-tidy Normal file
View File

@ -0,0 +1,10 @@
Checks: >
modernize-make-shared,
modernize-use-nullptr,
modernize-use-override,
modernize-pass-by-value,
modernize-return-braced-init-list,
modernize-deprecated-headers,
HeaderFilterRegex: '^$'
WarningsAsErrors: ''
FormatStyle: none
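And a corresponding sketch for the `.clang-tidy` checks above (assumes clang-tidy is installed; the include flags after `--` are an assumption and may need adjusting for your setup):

```sh
# Run the modernize-* checks configured in .clang-tidy on one translation unit
clang-tidy stable-diffusion.cpp -- -std=c++17 -I. -Iggml/include
```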

73
.github/ISSUE_TEMPLATE/bug_report.yml vendored Normal file
View File

@ -0,0 +1,73 @@
name: 🐞 Bug Report
description: Report a bug or unexpected behavior
title: "[Bug] "
labels: ["bug"]
body:
- type: markdown
attributes:
value: |
Please use this template and include as many details as possible to help us reproduce and fix the issue.
- type: textarea
id: commit
attributes:
label: Git commit
description: Which commit are you trying to compile?
placeholder: |
$git rev-parse HEAD
40a6a8710ec15b1b5db6b5a098409f6bc8f654a4
validations:
required: true
- type: input
id: os
attributes:
label: Operating System & Version
placeholder: e.g. “Ubuntu 22.04”, “Windows 11 23H2”, “macOS 14.3”
validations:
required: true
- type: dropdown
id: backends
attributes:
label: GGML backends
description: Which GGML backends do you know to be affected?
options: [CPU, CUDA, HIP, Metal, Musa, SYCL, Vulkan, OpenCL]
multiple: true
validations:
required: true
- type: input
id: cmd_arguments
attributes:
label: Command-line arguments used
placeholder: The full command line you ran (with all flags)
validations:
required: true
- type: textarea
id: steps_to_reproduce
attributes:
label: Steps to reproduce
placeholder: A step-by-step list of what you did
validations:
required: true
- type: textarea
id: expected_behavior
attributes:
label: What you expected to happen
placeholder: Describe the expected behavior or result
validations:
required: true
- type: textarea
id: actual_behavior
attributes:
label: What actually happened
placeholder: Describe what you saw instead (errors, logs, crash, etc.)
validations:
required: true
- type: textarea
id: logs_and_errors
attributes:
label: Logs / error messages / stack trace
placeholder: Paste complete logs or error output
- type: textarea
id: additional_info
attributes:
label: Additional context / environment details
placeholder: e.g. CPU model, GPU, RAM, model file versions, quantization type, etc.

33
.github/ISSUE_TEMPLATE/feature_request.yml vendored Normal file
View File

@ -0,0 +1,33 @@
name: 💡 Feature Request
description: Suggest a new feature or improvement
title: "[Feature] "
labels: ["enhancement"]
body:
- type: markdown
attributes:
value: |
Thank you for suggesting an improvement! Please fill in the fields below.
- type: input
id: summary
attributes:
label: Feature Summary
placeholder: A one-line summary of the feature you'd like
validations:
required: true
- type: textarea
id: description
attributes:
label: Detailed Description
placeholder: What problem does this solve? How do you expect it to work?
validations:
required: true
- type: textarea
id: alternatives
attributes:
label: Alternatives you considered
placeholder: Any alternative designs or workarounds you tried
- type: textarea
id: additional_context
attributes:
label: Additional context
placeholder: Any extra information (use cases, related functionalities, constraints)

.github/workflows/build.yml
View File

@ -4,17 +4,36 @@ on:
workflow_dispatch: # allows manual triggering
inputs:
create_release:
description: 'Create new release'
description: "Create new release"
required: true
type: boolean
push:
branches:
- master
- ci
paths: ['.github/workflows/**', '**/CMakeLists.txt', '**/Makefile', '**/*.h', '**/*.hpp', '**/*.c', '**/*.cpp', '**/*.cu']
paths:
[
".github/workflows/**",
"**/CMakeLists.txt",
"**/Makefile",
"**/*.h",
"**/*.hpp",
"**/*.c",
"**/*.cpp",
"**/*.cu",
]
pull_request:
types: [opened, synchronize, reopened]
paths: ['**/CMakeLists.txt', '**/Makefile', '**/*.h', '**/*.hpp', '**/*.c', '**/*.cpp', '**/*.cu']
paths:
[
"**/CMakeLists.txt",
"**/Makefile",
"**/*.h",
"**/*.hpp",
"**/*.c",
"**/*.cpp",
"**/*.cu",
]
env:
BRANCH_NAME: ${{ github.head_ref || github.ref_name }}
@ -30,7 +49,6 @@ jobs:
with:
submodules: recursive
- name: Dependencies
id: depends
run: |
@ -42,14 +60,37 @@ jobs:
run: |
mkdir build
cd build
cmake ..
cmake .. -DGGML_AVX2=ON -DSD_BUILD_SHARED_LIBS=ON
cmake --build . --config Release
#- name: Test
#id: cmake_test
#run: |
#cd build
#ctest --verbose --timeout 900
- name: Get commit hash
id: commit
if: ${{ ( github.event_name == 'push' && github.ref == 'refs/heads/master' ) || github.event.inputs.create_release == 'true' }}
uses: pr-mpt/actions-commit-hash@v2
- name: Fetch system info
id: system-info
run: |
echo "CPU_ARCH=`uname -m`" >> "$GITHUB_OUTPUT"
echo "OS_NAME=`lsb_release -s -i`" >> "$GITHUB_OUTPUT"
echo "OS_VERSION=`lsb_release -s -r`" >> "$GITHUB_OUTPUT"
echo "OS_TYPE=`uname -s`" >> "$GITHUB_OUTPUT"
- name: Pack artifacts
id: pack_artifacts
if: ${{ ( github.event_name == 'push' && github.ref == 'refs/heads/master' ) || github.event.inputs.create_release == 'true' }}
run: |
cp ggml/LICENSE ./build/bin/ggml.txt
cp LICENSE ./build/bin/stable-diffusion.cpp.txt
zip -j sd-${{ env.BRANCH_NAME }}-${{ steps.commit.outputs.short }}-bin-${{ steps.system-info.outputs.OS_TYPE }}-${{ steps.system-info.outputs.OS_NAME }}-${{ steps.system-info.outputs.OS_VERSION }}-${{ steps.system-info.outputs.CPU_ARCH }}.zip ./build/bin/*
- name: Upload artifacts
if: ${{ ( github.event_name == 'push' && github.ref == 'refs/heads/master' ) || github.event.inputs.create_release == 'true' }}
uses: actions/upload-artifact@v4
with:
name: sd-${{ env.BRANCH_NAME }}-${{ steps.commit.outputs.short }}-bin-${{ steps.system-info.outputs.OS_TYPE }}-${{ steps.system-info.outputs.OS_NAME }}-${{ steps.system-info.outputs.OS_VERSION }}-${{ steps.system-info.outputs.CPU_ARCH }}.zip
path: |
sd-${{ env.BRANCH_NAME }}-${{ steps.commit.outputs.short }}-bin-${{ steps.system-info.outputs.OS_TYPE }}-${{ steps.system-info.outputs.OS_NAME }}-${{ steps.system-info.outputs.OS_VERSION }}-${{ steps.system-info.outputs.CPU_ARCH }}.zip
macOS-latest-cmake:
runs-on: macos-latest
@ -63,9 +104,8 @@ jobs:
- name: Dependencies
id: depends
continue-on-error: true
run: |
brew update
brew install zip
- name: Build
id: cmake_build
@ -73,30 +113,59 @@ jobs:
sysctl -a
mkdir build
cd build
cmake ..
cmake .. -DGGML_AVX2=ON -DCMAKE_OSX_ARCHITECTURES="arm64;x86_64" -DSD_BUILD_SHARED_LIBS=ON
cmake --build . --config Release
#- name: Test
#id: cmake_test
#run: |
#cd build
#ctest --verbose --timeout 900
- name: Get commit hash
id: commit
if: ${{ ( github.event_name == 'push' && github.ref == 'refs/heads/master' ) || github.event.inputs.create_release == 'true' }}
uses: pr-mpt/actions-commit-hash@v2
- name: Fetch system info
id: system-info
run: |
echo "CPU_ARCH=`uname -m`" >> "$GITHUB_OUTPUT"
echo "OS_NAME=`sw_vers -productName`" >> "$GITHUB_OUTPUT"
echo "OS_VERSION=`sw_vers -productVersion`" >> "$GITHUB_OUTPUT"
echo "OS_TYPE=`uname -s`" >> "$GITHUB_OUTPUT"
- name: Pack artifacts
id: pack_artifacts
if: ${{ ( github.event_name == 'push' && github.ref == 'refs/heads/master' ) || github.event.inputs.create_release == 'true' }}
run: |
cp ggml/LICENSE ./build/bin/ggml.txt
cp LICENSE ./build/bin/stable-diffusion.cpp.txt
zip -j sd-${{ env.BRANCH_NAME }}-${{ steps.commit.outputs.short }}-bin-${{ steps.system-info.outputs.OS_TYPE }}-${{ steps.system-info.outputs.OS_NAME }}-${{ steps.system-info.outputs.OS_VERSION }}-${{ steps.system-info.outputs.CPU_ARCH }}.zip ./build/bin/*
- name: Upload artifacts
if: ${{ ( github.event_name == 'push' && github.ref == 'refs/heads/master' ) || github.event.inputs.create_release == 'true' }}
uses: actions/upload-artifact@v4
with:
name: sd-${{ env.BRANCH_NAME }}-${{ steps.commit.outputs.short }}-bin-${{ steps.system-info.outputs.OS_TYPE }}-${{ steps.system-info.outputs.OS_NAME }}-${{ steps.system-info.outputs.OS_VERSION }}-${{ steps.system-info.outputs.CPU_ARCH }}.zip
path: |
sd-${{ env.BRANCH_NAME }}-${{ steps.commit.outputs.short }}-bin-${{ steps.system-info.outputs.OS_TYPE }}-${{ steps.system-info.outputs.OS_NAME }}-${{ steps.system-info.outputs.OS_VERSION }}-${{ steps.system-info.outputs.CPU_ARCH }}.zip
windows-latest-cmake:
runs-on: windows-latest
runs-on: windows-2025
env:
VULKAN_VERSION: 1.4.328.1
strategy:
matrix:
include:
- build: 'noavx'
defines: '-DGGML_AVX=OFF -DGGML_AVX2=OFF -DGGML_FMA=OFF'
- build: 'avx2'
defines: '-DGGML_AVX2=ON'
- build: 'avx'
defines: '-DGGML_AVX2=OFF'
- build: 'avx512'
defines: '-DGGML_AVX512=ON'
- build: "noavx"
defines: "-DGGML_NATIVE=OFF -DGGML_AVX=OFF -DGGML_AVX2=OFF -DGGML_FMA=OFF -DSD_BUILD_SHARED_LIBS=ON"
- build: "avx2"
defines: "-DGGML_NATIVE=OFF -DGGML_AVX2=ON -DSD_BUILD_SHARED_LIBS=ON"
- build: "avx"
defines: "-DGGML_NATIVE=OFF -DGGML_AVX=ON -DGGML_AVX2=OFF -DSD_BUILD_SHARED_LIBS=ON"
- build: "avx512"
defines: "-DGGML_NATIVE=OFF -DGGML_AVX512=ON -DGGML_AVX=ON -DGGML_AVX2=ON -DSD_BUILD_SHARED_LIBS=ON"
- build: "cuda12"
defines: "-DSD_CUDA=ON -DSD_BUILD_SHARED_LIBS=ON -DCMAKE_CUDA_ARCHITECTURES='61;70;75;80;86;89;90;100;120'"
- build: 'vulkan'
defines: "-DSD_VULKAN=ON -DSD_BUILD_SHARED_LIBS=ON"
steps:
- name: Clone
id: checkout
@ -104,6 +173,24 @@ jobs:
with:
submodules: recursive
- name: Install cuda-toolkit
id: cuda-toolkit
if: ${{ matrix.build == 'cuda12' }}
uses: Jimver/cuda-toolkit@v0.2.22
with:
cuda: "12.8.1"
method: "network"
sub-packages: '["nvcc", "cudart", "cublas", "cublas_dev", "thrust", "visual_studio_integration"]'
- name: Install Vulkan SDK
id: get_vulkan
if: ${{ matrix.build == 'vulkan' }}
run: |
curl.exe -o $env:RUNNER_TEMP/VulkanSDK-Installer.exe -L "https://sdk.lunarg.com/sdk/download/${env:VULKAN_VERSION}/windows/vulkansdk-windows-X64-${env:VULKAN_VERSION}.exe"
& "$env:RUNNER_TEMP\VulkanSDK-Installer.exe" --accept-licenses --default-answer --confirm-command install
Add-Content $env:GITHUB_ENV "VULKAN_SDK=C:\VulkanSDK\${env:VULKAN_VERSION}"
Add-Content $env:GITHUB_PATH "C:\VulkanSDK\${env:VULKAN_VERSION}\bin"
- name: Build
id: cmake_build
run: |
@ -125,12 +212,6 @@ jobs:
& $cl /O2 /GS- /kernel avx512f.c /link /nodefaultlib /entry:main
.\avx512f.exe && echo "AVX512F: YES" && ( echo HAS_AVX512F=1 >> $env:GITHUB_ENV ) || echo "AVX512F: NO"
#- name: Test
#id: cmake_test
#run: |
#cd build
#ctest -C Release --verbose --timeout 900
- name: Get commit hash
id: commit
if: ${{ ( github.event_name == 'push' && github.ref == 'refs/heads/master' ) || github.event.inputs.create_release == 'true' }}
@ -140,17 +221,145 @@ jobs:
id: pack_artifacts
if: ${{ ( github.event_name == 'push' && github.ref == 'refs/heads/master' ) || github.event.inputs.create_release == 'true' }}
run: |
Copy-Item ggml/LICENSE .\build\bin\Release\ggml.txt
Copy-Item LICENSE .\build\bin\Release\stable-diffusion.cpp.txt
7z a sd-${{ env.BRANCH_NAME }}-${{ steps.commit.outputs.short }}-bin-win-${{ matrix.build }}-x64.zip .\build\bin\Release\*
$filePath = ".\build\bin\Release\*"
if (Test-Path $filePath) {
echo "Exists at path $filePath"
Copy-Item ggml/LICENSE .\build\bin\Release\ggml.txt
Copy-Item LICENSE .\build\bin\Release\stable-diffusion.cpp.txt
} elseif (Test-Path ".\build\bin\stable-diffusion.dll") {
$filePath = ".\build\bin\*"
echo "Exists at path $filePath"
Copy-Item ggml/LICENSE .\build\bin\ggml.txt
Copy-Item LICENSE .\build\bin\stable-diffusion.cpp.txt
} else {
ls .\build\bin
throw "Can't find stable-diffusion.dll"
}
7z a sd-${{ env.BRANCH_NAME }}-${{ steps.commit.outputs.short }}-bin-win-${{ matrix.build }}-x64.zip $filePath
- name: Copy and pack Cuda runtime
id: pack_cuda_runtime
if: ${{ matrix.build == 'cuda12' && (github.event_name == 'push' && github.ref == 'refs/heads/master' || github.event.inputs.create_release == 'true') }}
run: |
echo "Cuda install location: ${{steps.cuda-toolkit.outputs.CUDA_PATH}}"
$dst='.\build\bin\cudart\'
robocopy "${{steps.cuda-toolkit.outputs.CUDA_PATH}}\bin" $dst cudart64_*.dll cublas64_*.dll cublasLt64_*.dll
7z a cudart-sd-bin-win-cu12-x64.zip $dst\*
- name: Upload Cuda runtime
if: ${{ matrix.build == 'cuda12' && (github.event_name == 'push' && github.ref == 'refs/heads/master' || github.event.inputs.create_release == 'true') }}
uses: actions/upload-artifact@v4
with:
name: sd-cudart-sd-bin-win-cu12-x64.zip
path: |
cudart-sd-bin-win-cu12-x64.zip
- name: Upload artifacts
if: ${{ ( github.event_name == 'push' && github.ref == 'refs/heads/master' ) || github.event.inputs.create_release == 'true' }}
uses: actions/upload-artifact@v3
uses: actions/upload-artifact@v4
with:
name: sd-${{ env.BRANCH_NAME }}-${{ steps.commit.outputs.short }}-bin-win-${{ matrix.build }}-x64.zip
path: |
sd-${{ env.BRANCH_NAME }}-${{ steps.commit.outputs.short }}-bin-win-${{ matrix.build }}-x64.zip
windows-latest-cmake-hip:
runs-on: windows-2022
env:
HIPSDK_INSTALLER_VERSION: "25.Q3"
GPU_TARGETS: "gfx1151;gfx1200;gfx1201;gfx1100;gfx1101;gfx1102;gfx1030;gfx1031;gfx1032"
steps:
- uses: actions/checkout@v3
with:
submodules: recursive
- name: Cache ROCm Installation
id: cache-rocm
uses: actions/cache@v4
with:
path: C:\Program Files\AMD\ROCm
key: rocm-${{ env.HIPSDK_INSTALLER_VERSION }}-${{ runner.os }}
- name: ccache
uses: ggml-org/ccache-action@v1.2.16
with:
key: windows-latest-cmake-hip-${{ env.HIPSDK_INSTALLER_VERSION }}-x64
evict-old-files: 1d
- name: Install ROCm
if: steps.cache-rocm.outputs.cache-hit != 'true'
run: |
$ErrorActionPreference = "Stop"
write-host "Downloading AMD HIP SDK Installer"
Invoke-WebRequest -Uri "https://download.amd.com/developer/eula/rocm-hub/AMD-Software-PRO-Edition-${{ env.HIPSDK_INSTALLER_VERSION }}-WinSvr2022-For-HIP.exe" -OutFile "${env:RUNNER_TEMP}\rocm-install.exe"
write-host "Installing AMD HIP SDK"
$proc = Start-Process "${env:RUNNER_TEMP}\rocm-install.exe" -ArgumentList '-install' -NoNewWindow -PassThru
$completed = $proc.WaitForExit(600000)
if (-not $completed) {
Write-Error "ROCm installation timed out after 10 minutes. Killing the process"
$proc.Kill()
exit 1
}
if ($proc.ExitCode -ne 0) {
Write-Error "ROCm installation failed with exit code $($proc.ExitCode)"
exit 1
}
write-host "Completed AMD HIP SDK installation"
- name: Verify ROCm
run: |
# Find and test ROCm installation
$clangPath = Get-ChildItem 'C:\Program Files\AMD\ROCm\*\bin\clang.exe' | Select-Object -First 1
if (-not $clangPath) {
Write-Error "ROCm installation not found"
exit 1
}
& $clangPath.FullName --version
# Set HIP_PATH environment variable for later steps
echo "HIP_PATH=$(Resolve-Path 'C:\Program Files\AMD\ROCm\*\bin\clang.exe' | split-path | split-path)" >> $env:GITHUB_ENV
- name: Build
run: |
mkdir build
cd build
$env:CMAKE_PREFIX_PATH="${env:HIP_PATH}"
cmake .. `
-G "Unix Makefiles" `
-DSD_HIPBLAS=ON `
-DSD_BUILD_SHARED_LIBS=ON `
-DGGML_NATIVE=OFF `
-DCMAKE_C_COMPILER=clang `
-DCMAKE_CXX_COMPILER=clang++ `
-DCMAKE_BUILD_TYPE=Release `
-DGPU_TARGETS="${{ env.GPU_TARGETS }}"
cmake --build . --config Release --parallel ${env:NUMBER_OF_PROCESSORS}
- name: Get commit hash
id: commit
if: ${{ ( github.event_name == 'push' && github.ref == 'refs/heads/master' ) || github.event.inputs.create_release == 'true' }}
uses: pr-mpt/actions-commit-hash@v2
- name: Pack artifacts
if: ${{ ( github.event_name == 'push' && github.ref == 'refs/heads/master' ) || github.event.inputs.create_release == 'true' }}
run: |
md "build\bin\rocblas\library\"
md "build\bin\hipblaslt\library"
cp "${env:HIP_PATH}\bin\hipblas.dll" "build\bin\"
cp "${env:HIP_PATH}\bin\hipblaslt.dll" "build\bin\"
cp "${env:HIP_PATH}\bin\rocblas.dll" "build\bin\"
cp "${env:HIP_PATH}\bin\rocblas\library\*" "build\bin\rocblas\library\"
cp "${env:HIP_PATH}\bin\hipblaslt\library\*" "build\bin\hipblaslt\library\"
7z a sd-${{ env.BRANCH_NAME }}-${{ steps.commit.outputs.short }}-bin-win-rocm-x64.zip .\build\bin\*
- name: Upload artifacts
if: ${{ ( github.event_name == 'push' && github.ref == 'refs/heads/master' ) || github.event.inputs.create_release == 'true' }}
uses: actions/upload-artifact@v4
with:
name: sd-${{ env.BRANCH_NAME }}-${{ steps.commit.outputs.short }}-bin-win-rocm-x64.zip
path: |
sd-${{ env.BRANCH_NAME }}-${{ steps.commit.outputs.short }}-bin-win-rocm-x64.zip
release:
if: ${{ ( github.event_name == 'push' && github.ref == 'refs/heads/master' ) || github.event.inputs.create_release == 'true' }}
@ -160,11 +369,26 @@ jobs:
- ubuntu-latest-cmake
- macOS-latest-cmake
- windows-latest-cmake
- windows-latest-cmake-hip
steps:
- name: Clone
uses: actions/checkout@v3
with:
fetch-depth: 0
- name: Download artifacts
id: download-artifact
uses: actions/download-artifact@v3
uses: actions/download-artifact@v4
with:
path: ./artifact
pattern: sd-*
merge-multiple: true
- name: Get commit count
id: commit_count
run: |
echo "count=$(git rev-list --count HEAD)" >> $GITHUB_OUTPUT
- name: Get commit hash
id: commit
@ -172,14 +396,16 @@ jobs:
- name: Create release
id: create_release
if: ${{ github.event_name == 'workflow_dispatch' || github.ref_name == 'master' }}
uses: anzz1/action-create-release@v1
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
with:
tag_name: ${{ env.BRANCH_NAME }}-${{ steps.commit.outputs.short }}
tag_name: ${{ format('{0}-{1}-{2}', env.BRANCH_NAME, steps.commit_count.outputs.count, steps.commit.outputs.short) }}
- name: Upload release
id: upload_release
if: ${{ github.event_name == 'workflow_dispatch' || github.ref_name == 'master' }}
uses: actions/github-script@v3
with:
github-token: ${{secrets.GITHUB_TOKEN}}
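To reproduce one of the Windows CUDA matrix entries above locally, the configure step would look roughly like this (a sketch; it assumes a CUDA toolkit and CMake are already installed, and the architecture list mirrors the `cuda12` defines above):

```sh
mkdir build && cd build
cmake .. -DSD_CUDA=ON -DSD_BUILD_SHARED_LIBS=ON \
  -DCMAKE_CUDA_ARCHITECTURES="61;70;75;80;86;89;90;100;120"
cmake --build . --config Release
```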

12
.gitignore vendored
View File

@ -1,5 +1,15 @@
build*/
cmake-build-*/
test/
.vscode/
.idea/
.cache/
*.swp
*.bat
*.bin
*.exe
*.gguf
output*.png
models*
*.log
preview.png

4
.gitmodules vendored
View File

@ -1,3 +1,3 @@
[submodule "ggml"]
path = ggml
url = https://github.com/leejet/ggml.git
path = ggml
url = https://github.com/ggml-org/ggml.git

CMakeLists.txt
View File

@ -24,21 +24,168 @@ endif()
# general
#option(SD_BUILD_TESTS "sd: build tests" ${SD_STANDALONE})
option(SD_BUILD_EXAMPLES "sd: build examples" ${SD_STANDALONE})
option(SD_CUDA "sd: cuda backend" OFF)
option(SD_HIPBLAS "sd: rocm backend" OFF)
option(SD_METAL "sd: metal backend" OFF)
option(SD_VULKAN "sd: vulkan backend" OFF)
option(SD_OPENCL "sd: opencl backend" OFF)
option(SD_SYCL "sd: sycl backend" OFF)
option(SD_MUSA "sd: musa backend" OFF)
option(SD_FAST_SOFTMAX "sd: x1.5 faster softmax, indeterministic (sometimes, same seed don't generate same image), cuda only" OFF)
option(SD_BUILD_SHARED_LIBS "sd: build shared libs" OFF)
option(SD_BUILD_SHARED_GGML_LIB "sd: build ggml as a separate shared lib" OFF)
option(SD_USE_SYSTEM_GGML "sd: use system-installed GGML library" OFF)
#option(SD_BUILD_SERVER "sd: build server example" ON)
if(SD_CUDA)
message("-- Use CUDA as backend stable-diffusion")
set(GGML_CUDA ON)
add_definitions(-DSD_USE_CUDA)
endif()
# deps
add_subdirectory(ggml)
if(SD_METAL)
message("-- Use Metal as backend stable-diffusion")
set(GGML_METAL ON)
add_definitions(-DSD_USE_METAL)
endif()
if (SD_VULKAN)
message("-- Use Vulkan as backend stable-diffusion")
set(GGML_VULKAN ON)
add_definitions(-DSD_USE_VULKAN)
endif ()
if (SD_OPENCL)
message("-- Use OpenCL as backend stable-diffusion")
set(GGML_OPENCL ON)
add_definitions(-DSD_USE_OPENCL)
endif ()
if (SD_HIPBLAS)
message("-- Use HIPBLAS as backend stable-diffusion")
set(GGML_HIP ON)
add_definitions(-DSD_USE_CUDA)
if(SD_FAST_SOFTMAX)
set(GGML_CUDA_FAST_SOFTMAX ON)
endif()
endif ()
if(SD_MUSA)
message("-- Use MUSA as backend stable-diffusion")
set(GGML_MUSA ON)
add_definitions(-DSD_USE_CUDA)
if(SD_FAST_SOFTMAX)
set(GGML_CUDA_FAST_SOFTMAX ON)
endif()
endif()
set(SD_LIB stable-diffusion)
add_library(${SD_LIB} stable-diffusion.h stable-diffusion.cpp)
target_link_libraries(${SD_LIB} PUBLIC ggml)
target_include_directories(${SD_LIB} PUBLIC .)
target_compile_features(${SD_LIB} PUBLIC cxx_std_11)
file(GLOB SD_LIB_SOURCES
"*.h"
"*.cpp"
"*.hpp"
)
find_program(GIT_EXE NAMES git git.exe NO_CMAKE_FIND_ROOT_PATH)
if(GIT_EXE)
execute_process(COMMAND ${GIT_EXE} describe --tags --abbrev=7 --dirty=+
WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}
OUTPUT_VARIABLE SDCPP_BUILD_VERSION
OUTPUT_STRIP_TRAILING_WHITESPACE
ERROR_QUIET
)
execute_process(COMMAND ${GIT_EXE} rev-parse --short HEAD
WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}
OUTPUT_VARIABLE SDCPP_BUILD_COMMIT
OUTPUT_STRIP_TRAILING_WHITESPACE
ERROR_QUIET
)
endif()
if(NOT SDCPP_BUILD_VERSION)
set(SDCPP_BUILD_VERSION unknown)
endif()
message(STATUS "stable-diffusion.cpp version ${SDCPP_BUILD_VERSION}")
if(NOT SDCPP_BUILD_COMMIT)
set(SDCPP_BUILD_COMMIT unknown)
endif()
message(STATUS "stable-diffusion.cpp commit ${SDCPP_BUILD_COMMIT}")
set_property(
SOURCE ${CMAKE_CURRENT_SOURCE_DIR}/version.cpp
APPEND PROPERTY COMPILE_DEFINITIONS
SDCPP_BUILD_COMMIT=${SDCPP_BUILD_COMMIT} SDCPP_BUILD_VERSION=${SDCPP_BUILD_VERSION}
)
if(SD_BUILD_SHARED_LIBS)
message("-- Build shared library")
message(${SD_LIB_SOURCES})
if(NOT SD_BUILD_SHARED_GGML_LIB)
set(BUILD_SHARED_LIBS OFF)
endif()
add_library(${SD_LIB} SHARED ${SD_LIB_SOURCES})
add_definitions(-DSD_BUILD_SHARED_LIB)
target_compile_definitions(${SD_LIB} PRIVATE -DSD_BUILD_DLL)
set(CMAKE_POSITION_INDEPENDENT_CODE ON)
else()
message("-- Build static library")
if(NOT SD_BUILD_SHARED_GGML_LIB)
set(BUILD_SHARED_LIBS OFF)
endif()
add_library(${SD_LIB} STATIC ${SD_LIB_SOURCES})
endif()
if(SD_SYCL)
message("-- Use SYCL as backend stable-diffusion")
set(GGML_SYCL ON)
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wno-narrowing -fsycl")
add_definitions(-DSD_USE_SYCL)
# disable fast-math on host, see:
# https://www.intel.com/content/www/us/en/docs/cpp-compiler/developer-guide-reference/2021-10/fp-model-fp.html
if (WIN32)
set(SYCL_COMPILE_OPTIONS /fp:precise)
else()
set(SYCL_COMPILE_OPTIONS -fp-model=precise)
endif()
message("-- Turn off fast-math for host in SYCL backend")
target_compile_options(${SD_LIB} PRIVATE ${SYCL_COMPILE_OPTIONS})
endif()
set(CMAKE_POLICY_DEFAULT_CMP0077 NEW)
if (NOT SD_USE_SYSTEM_GGML)
# see https://github.com/ggerganov/ggml/pull/682
add_definitions(-DGGML_MAX_NAME=128)
endif()
# deps
# Only add ggml if it hasn't been added yet
if (NOT TARGET ggml)
if (SD_USE_SYSTEM_GGML)
find_package(ggml REQUIRED)
if (NOT ggml_FOUND)
message(FATAL_ERROR "System-installed GGML library not found.")
endif()
add_library(ggml ALIAS ggml::ggml)
else()
add_subdirectory(ggml)
endif()
endif()
add_subdirectory(thirdparty)
target_link_libraries(${SD_LIB} PUBLIC ggml zip)
target_include_directories(${SD_LIB} PUBLIC . thirdparty)
target_compile_features(${SD_LIB} PUBLIC c_std_11 cxx_std_17)
if (SD_BUILD_EXAMPLES)
add_subdirectory(examples)
endif()
set(SD_PUBLIC_HEADERS stable-diffusion.h)
set_target_properties(${SD_LIB} PROPERTIES PUBLIC_HEADER "${SD_PUBLIC_HEADERS}")
install(TARGETS ${SD_LIB} LIBRARY PUBLIC_HEADER)
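The backend options above are plain CMake switches, so enabling an accelerated build is just a matter of passing the corresponding flag at configure time. A sketch for a Vulkan build (assumes the Vulkan SDK is installed and discoverable by CMake):

```sh
mkdir build && cd build
cmake .. -DSD_VULKAN=ON -DSD_BUILD_SHARED_LIBS=ON
cmake --build . --config Release
```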

Dockerfile
View File

@ -1,16 +1,21 @@
ARG UBUNTU_VERSION=22.04
FROM ubuntu:$UBUNTU_VERSION as build
FROM ubuntu:$UBUNTU_VERSION AS build
RUN apt-get update && apt-get install -y build-essential git cmake
RUN apt-get update && apt-get install -y --no-install-recommends build-essential git cmake
WORKDIR /sd.cpp
COPY . .
RUN mkdir build && cd build && cmake .. && cmake --build . --config Release
RUN cmake . -B ./build
RUN cmake --build ./build --config Release --parallel
FROM ubuntu:$UBUNTU_VERSION as runtime
FROM ubuntu:$UBUNTU_VERSION AS runtime
RUN apt-get update && \
apt-get install --yes --no-install-recommends libgomp1 && \
apt-get clean
COPY --from=build /sd.cpp/build/bin/sd /sd

23
Dockerfile.musa Normal file
View File

@ -0,0 +1,23 @@
ARG MUSA_VERSION=rc4.2.0
ARG UBUNTU_VERSION=22.04
FROM mthreads/musa:${MUSA_VERSION}-devel-ubuntu${UBUNTU_VERSION}-amd64 as build
RUN apt-get update && apt-get install -y ccache cmake git
WORKDIR /sd.cpp
COPY . .
RUN mkdir build && cd build && \
cmake .. -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ \
-DCMAKE_C_FLAGS="${CMAKE_C_FLAGS} -fopenmp -I/usr/lib/llvm-14/lib/clang/14.0.0/include -L/usr/lib/llvm-14/lib" \
-DCMAKE_CXX_FLAGS="${CMAKE_CXX_FLAGS} -fopenmp -I/usr/lib/llvm-14/lib/clang/14.0.0/include -L/usr/lib/llvm-14/lib" \
-DSD_MUSA=ON -DCMAKE_BUILD_TYPE=Release && \
cmake --build . --config Release
FROM mthreads/musa:${MUSA_VERSION}-runtime-ubuntu${UBUNTU_VERSION}-amd64 as runtime
COPY --from=build /sd.cpp/build/bin/sd /sd
ENTRYPOINT [ "/sd" ]

19
Dockerfile.sycl Normal file
View File

@ -0,0 +1,19 @@
ARG SYCL_VERSION=2025.1.0-0
FROM intel/oneapi-basekit:${SYCL_VERSION}-devel-ubuntu24.04 AS build
RUN apt-get update && apt-get install -y cmake
WORKDIR /sd.cpp
COPY . .
RUN mkdir build && cd build && \
cmake .. -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DSD_SYCL=ON -DCMAKE_BUILD_TYPE=Release && \
cmake --build . --config Release -j$(nproc)
FROM intel/oneapi-basekit:${SYCL_VERSION}-devel-ubuntu24.04 AS runtime
COPY --from=build /sd.cpp/build/bin/sd /sd
ENTRYPOINT [ "/sd" ]

292
README.md
View File

@ -1,185 +1,195 @@
<p align="center">
<img src="./assets/a%20lovely%20cat.png" width="256x">
<img src="./assets/logo.png" width="360x">
</p>
# stable-diffusion.cpp
Inference of [Stable Diffusion](https://github.com/CompVis/stable-diffusion) in pure C/C++
<div align="center">
<a href="https://trendshift.io/repositories/9714" target="_blank"><img src="https://trendshift.io/api/badge/repositories/9714" alt="leejet%2Fstable-diffusion.cpp | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>
</div>
Diffusion model (SD, Flux, Wan, ...) inference in pure C/C++
***Note that this project is under active development. \
API and command-line options may change frequently.***
## 🔥Important News
* **2025/12/01** 🚀 stable-diffusion.cpp now supports **Z-Image**
👉 Details: [PR #1020](https://github.com/leejet/stable-diffusion.cpp/pull/1020)
* **2025/11/30** 🚀 stable-diffusion.cpp now supports **FLUX.2-dev**
👉 Details: [PR #1016](https://github.com/leejet/stable-diffusion.cpp/pull/1016)
* **2025/10/13** 🚀 stable-diffusion.cpp now supports **Qwen-Image-Edit / Qwen-Image-Edit 2509**
👉 Details: [PR #877](https://github.com/leejet/stable-diffusion.cpp/pull/877)
* **2025/10/12** 🚀 stable-diffusion.cpp now supports **Qwen-Image**
👉 Details: [PR #851](https://github.com/leejet/stable-diffusion.cpp/pull/851)
* **2025/09/14** 🚀 stable-diffusion.cpp now supports **Wan2.1 Vace**
👉 Details: [PR #819](https://github.com/leejet/stable-diffusion.cpp/pull/819)
* **2025/09/06** 🚀 stable-diffusion.cpp now supports **Wan2.1 / Wan2.2**
👉 Details: [PR #778](https://github.com/leejet/stable-diffusion.cpp/pull/778)
## Features
- Plain C/C++ implementation based on [ggml](https://github.com/ggerganov/ggml), working in the same way as [llama.cpp](https://github.com/ggerganov/llama.cpp)
- 16-bit, 32-bit float support
- 4-bit, 5-bit and 8-bit integer quantization support
- Accelerated memory-efficient CPU inference
- Only requires ~2.3GB when using txt2img with fp16 precision to generate a 512x512 image
- AVX, AVX2 and AVX512 support for x86 architectures
- Original `txt2img` and `img2img` mode
- Negative prompt
- [stable-diffusion-webui](https://github.com/AUTOMATIC1111/stable-diffusion-webui) style tokenizer (not all the features, only token weighting for now)
- Sampling method
- `Euler A`
- Plain C/C++ implementation based on [ggml](https://github.com/ggml-org/ggml), working in the same way as [llama.cpp](https://github.com/ggml-org/llama.cpp)
- Super lightweight and without external dependencies
- Supported models
- Image Models
- SD1.x, SD2.x, [SD-Turbo](https://huggingface.co/stabilityai/sd-turbo)
- SDXL, [SDXL-Turbo](https://huggingface.co/stabilityai/sdxl-turbo)
- [Some SD1.x and SDXL distilled models](./docs/distilled_sd.md)
- [SD3/SD3.5](./docs/sd3.md)
- [FLUX.1-dev/FLUX.1-schnell](./docs/flux.md)
- [FLUX.2-dev](./docs/flux2.md)
- [Chroma](./docs/chroma.md)
- [Chroma1-Radiance](./docs/chroma_radiance.md)
- [Qwen Image](./docs/qwen_image.md)
- [Z-Image](./docs/z_image.md)
- [Ovis-Image](./docs/ovis_image.md)
- Image Edit Models
- [FLUX.1-Kontext-dev](./docs/kontext.md)
- [Qwen Image Edit/Qwen Image Edit 2509](./docs/qwen_image_edit.md)
- Video Models
- [Wan2.1/Wan2.2](./docs/wan.md)
- [PhotoMaker](https://github.com/TencentARC/PhotoMaker) support.
- Control Net support with SD 1.5
- LoRA support, same as [stable-diffusion-webui](https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Features#lora)
- Latent Consistency Models support (LCM/LCM-LoRA)
- Faster and memory efficient latent decoding with [TAESD](https://github.com/madebyollin/taesd)
- Upscale images generated with [ESRGAN](https://github.com/xinntao/Real-ESRGAN)
- Supported backends
- CPU (AVX, AVX2 and AVX512 support for x86 architectures)
- CUDA
- Vulkan
- Metal
- OpenCL
- SYCL
- Supported weight formats
- Pytorch checkpoint (`.ckpt` or `.pth`)
- Safetensors (`.safetensors`)
- GGUF (`.gguf`)
- Supported platforms
- Linux
- Mac OS
- Windows
- Android (via Termux, [Local Diffusion](https://github.com/rmatif/Local-Diffusion))
- Flash Attention for memory usage optimization
- Negative prompt
- [stable-diffusion-webui](https://github.com/AUTOMATIC1111/stable-diffusion-webui) style tokenizer (not all the features, only token weighting for now)
- VAE tiling processing to reduce memory usage
- Sampling method
- `Euler A`
- `Euler`
- `Heun`
- `DPM2`
- `DPM++ 2M`
- [`DPM++ 2M v2`](https://github.com/AUTOMATIC1111/stable-diffusion-webui/discussions/8457)
- `DPM++ 2S a`
- [`LCM`](https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/13952)
- Cross-platform reproducibility
- `--rng cuda`, default, consistent with the `stable-diffusion-webui GPU RNG`
- `--rng cpu`, consistent with the `comfyui RNG`
- Embeds generation parameters into png output as webui-compatible text string (see the example below)
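A minimal sketch combining the sampler, RNG and seed options listed above (flag spellings follow the usage text further down this README; the sampler name string is an assumption, so check the CLI help for the exact accepted values):

```sh
./bin/sd -m ../models/v1-5-pruned-emaonly.safetensors \
  -p "a lovely cat" \
  --sample-method dpm++2m --steps 20 --rng cpu -s 42
```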
### TODO
## Quick Start
- [ ] More sampling methods
- [ ] GPU support
- [ ] Make inference faster
- The current implementation of ggml_conv_2d is slow and has high memory usage
- [ ] Continuing to reduce memory usage (quantizing the weights of ggml_conv_2d)
- [ ] LoRA support
- [ ] k-quants support
- [ ] Cross-platform reproducibility (perhaps ensuring consistency with the original SD)
- [ ] Adapting to more weight formats
### Get the sd executable
## Usage
- Download pre-built binaries from the [releases page](https://github.com/leejet/stable-diffusion.cpp/releases)
- Or build from source by following the [build guide](./docs/build.md)
### Get the Code
### Download model weights
```
git clone --recursive https://github.com/leejet/stable-diffusion.cpp
cd stable-diffusion.cpp
```
- download weights (.ckpt, .safetensors or .gguf). For example
- Stable Diffusion v1.5 from https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5
- If you have already cloned the repository, you can use the following command to update the repository to the latest code.
```
cd stable-diffusion.cpp
git pull origin master
git submodule init
git submodule update
```
### Convert weights
- download original weights(.ckpt or .safetensors). For example
- Stable Diffusion v1.4 from https://huggingface.co/CompVis/stable-diffusion-v-1-4-original
- Stable Diffusion v1.5 from https://huggingface.co/runwayml/stable-diffusion-v1-5
```shell
curl -L -O https://huggingface.co/CompVis/stable-diffusion-v-1-4-original/resolve/main/sd-v1-4.ckpt
# curl -L -O https://huggingface.co/runwayml/stable-diffusion-v1-5/resolve/main/v1-5-pruned-emaonly.safetensors
```sh
curl -L -O https://huggingface.co/runwayml/stable-diffusion-v1-5/resolve/main/v1-5-pruned-emaonly.safetensors
```
- convert weights to ggml model format
### Generate an image with just one command
```shell
cd models
pip install -r requirements.txt
python convert.py [path to weights] --out_type [output precision]
# For example, python convert.py sd-v1-4.ckpt --out_type f16
```
### Quantization
You can specify the output model format using the `--out_type` parameter (see the example below)
- `f16` for 16-bit floating-point
- `f32` for 32-bit floating-point
- `q8_0` for 8-bit integer quantization
- `q5_0` or `q5_1` for 5-bit integer quantization
- `q4_0` or `q4_1` for 4-bit integer quantization
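A sketch of the conversion step with a quantized output type, following the example above (assumes the legacy `models/convert.py` script and its requirements are installed):

```sh
cd models
python convert.py v1-5-pruned-emaonly.safetensors --out_type q8_0
```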
### Build
#### Build from scratch
```shell
mkdir build
cd build
cmake ..
cmake --build . --config Release
```sh
./bin/sd -m ../models/v1-5-pruned-emaonly.safetensors -p "a lovely cat"
```
##### Using OpenBLAS
***For detailed command-line arguments, check out [cli doc](./examples/cli/README.md).***
```
cmake .. -DGGML_OPENBLAS=ON
cmake --build . --config Release
```
## Performance
### Run
If you want to improve performance or reduce VRAM/RAM usage, please refer to [performance guide](./docs/performance.md).
```
usage: ./bin/sd [arguments]
## More Guides
arguments:
-h, --help show this help message and exit
-M, --mode [txt2img or img2img] generation mode (default: txt2img)
-t, --threads N number of threads to use during computation (default: -1).
If threads <= 0, then threads will be set to the number of CPU physical cores
-m, --model [MODEL] path to model
-i, --init-img [IMAGE] path to the input image, required by img2img
-o, --output OUTPUT path to write result image to (default: .\output.png)
-p, --prompt [PROMPT] the prompt to render
-n, --negative-prompt PROMPT the negative prompt (default: "")
--cfg-scale SCALE unconditional guidance scale: (default: 7.0)
--strength STRENGTH strength for noising/unnoising (default: 0.75)
1.0 corresponds to full destruction of information in init image
-H, --height H image height, in pixel space (default: 512)
-W, --width W image width, in pixel space (default: 512)
--sample-method SAMPLE_METHOD sample method (default: "eular a")
--steps STEPS number of sample steps (default: 20)
-s SEED, --seed SEED RNG seed (default: 42, use random seed for < 0)
-v, --verbose print extra info
```
- [SD1.x/SD2.x/SDXL](./docs/sd.md)
- [SD3/SD3.5](./docs/sd3.md)
- [FLUX.1-dev/FLUX.1-schnell](./docs/flux.md)
- [FLUX.2-dev](./docs/flux2.md)
- [FLUX.1-Kontext-dev](./docs/kontext.md)
- [Chroma](./docs/chroma.md)
- [🔥Qwen Image](./docs/qwen_image.md)
- [🔥Qwen Image Edit/Qwen Image Edit 2509](./docs/qwen_image_edit.md)
- [🔥Wan2.1/Wan2.2](./docs/wan.md)
- [🔥Z-Image](./docs/z_image.md)
- [Ovis-Image](./docs/ovis_image.md)
- [LoRA](./docs/lora.md)
- [LCM/LCM-LoRA](./docs/lcm.md)
- [Using PhotoMaker to personalize image generation](./docs/photo_maker.md)
- [Using ESRGAN to upscale results](./docs/esrgan.md)
- [Using TAESD for faster decoding](./docs/taesd.md)
- [Docker](./docs/docker.md)
- [Quantization and GGUF](./docs/quantization_and_gguf.md)
#### txt2img example
## Bindings
```
./bin/sd -m ../models/sd-v1-4-ggml-model-f16.bin -p "a lovely cat"
```
These projects wrap `stable-diffusion.cpp` for easier use in other languages/frameworks.
Using formats of different precisions will yield results of varying quality.
* Golang (non-cgo): [seasonjs/stable-diffusion](https://github.com/seasonjs/stable-diffusion)
* Golang (cgo): [Binozo/GoStableDiffusion](https://github.com/Binozo/GoStableDiffusion)
* C#: [DarthAffe/StableDiffusion.NET](https://github.com/DarthAffe/StableDiffusion.NET)
* Python: [william-murray1204/stable-diffusion-cpp-python](https://github.com/william-murray1204/stable-diffusion-cpp-python)
* Rust: [newfla/diffusion-rs](https://github.com/newfla/diffusion-rs)
* Flutter/Dart: [rmatif/Local-Diffusion](https://github.com/rmatif/Local-Diffusion)
| f32 | f16 |q8_0 |q5_0 |q5_1 |q4_0 |q4_1 |
| ---- |---- |---- |---- |---- |---- |---- |
| ![](./assets/f32.png) |![](./assets/f16.png) |![](./assets/q8_0.png) |![](./assets/q5_0.png) |![](./assets/q5_1.png) |![](./assets/q4_0.png) |![](./assets/q4_1.png) |
## UIs
#### img2img example
These projects use `stable-diffusion.cpp` as a backend for their image generation.
- `./output.png` is the image generated from the above txt2img pipeline
- [Jellybox](https://jellybox.com)
- [Stable Diffusion GUI](https://github.com/fszontagh/sd.cpp.gui.wx)
- [Stable Diffusion CLI-GUI](https://github.com/piallai/stable-diffusion.cpp)
- [Local Diffusion](https://github.com/rmatif/Local-Diffusion)
- [sd.cpp-webui](https://github.com/daniandtheweb/sd.cpp-webui)
- [LocalAI](https://github.com/mudler/LocalAI)
- [Neural-Pixel](https://github.com/Luiz-Alcantara/Neural-Pixel)
- [KoboldCpp](https://github.com/LostRuins/koboldcpp)
## Contributors
```
./bin/sd --mode img2img -m ../models/sd-v1-4-ggml-model-f16.bin -p "cat with blue eyes" -i ./output.png -o ./img2img_output.png --strength 0.4
```
Thank you to all the people who have already contributed to stable-diffusion.cpp!
<p align="center">
<img src="./assets/img2img_output.png" width="256x">
</p>
[![Contributors](https://contrib.rocks/image?repo=leejet/stable-diffusion.cpp)](https://github.com/leejet/stable-diffusion.cpp/graphs/contributors)
### Docker
#### Building using Docker
```shell
docker build -t sd .
```
#### Run
```shell
docker run -v /path/to/models:/models -v /path/to/output/:/output sd [args...]
# For example
# docker run -v ./models:/models -v ./build:/output sd -m /models/sd-v1-4-ggml-model-f16.bin -p "a lovely cat" -v -o /output/output.png
```
## Memory/Disk Requirements
| precision | f32 | f16 |q8_0 |q5_0 |q5_1 |q4_0 |q4_1 |
| ---- | ---- |---- |---- |---- |---- |---- |---- |
| **Disk** | 2.7G | 2.0G | 1.7G | 1.6G | 1.6G | 1.5G | 1.5G |
| **Memory**(txt2img - 512 x 512) | ~2.8G | ~2.3G | ~2.1G | ~2.0G | ~2.0G | ~2.0G | ~2.0G |
## Star History
[![Star History Chart](https://api.star-history.com/svg?repos=leejet/stable-diffusion.cpp&type=Date)](https://star-history.com/#leejet/stable-diffusion.cpp&Date)
## References
- [ggml](https://github.com/ggerganov/ggml)
- [ggml](https://github.com/ggml-org/ggml)
- [diffusers](https://github.com/huggingface/diffusers)
- [stable-diffusion](https://github.com/CompVis/stable-diffusion)
- [sd3-ref](https://github.com/Stability-AI/sd3-ref)
- [stable-diffusion-stability-ai](https://github.com/Stability-AI/stablediffusion)
- [stable-diffusion-webui](https://github.com/AUTOMATIC1111/stable-diffusion-webui)
- [ComfyUI](https://github.com/comfyanonymous/ComfyUI)
- [k-diffusion](https://github.com/crowsonkb/k-diffusion)
- [latent-consistency-model](https://github.com/luosiallen/latent-consistency-model)
- [generative-models](https://github.com/Stability-AI/generative-models/)
- [PhotoMaker](https://github.com/TencentARC/PhotoMaker)
- [Wan2.1](https://github.com/Wan-Video/Wan2.1)
- [Wan2.2](https://github.com/Wan-Video/Wan2.2)

Binary asset changes (image files, contents not shown). Newly added assets in this range include:

assets/logo.png (1.0 MiB), assets/control.png (4.3 KiB), assets/control_2.png (6.1 KiB), assets/control_3.png (18 KiB), assets/flux/chroma_v40.png (539 KiB), assets/flux2/example.png (556 KiB), assets/qwen/example.png (1.4 MiB), assets/sd3.5_large.png (1.8 MiB), assets/sycl_sd3_output.png (1.7 MiB), assets/with_lcm.png (596 KiB), assets/without_lcm.png (533 KiB), and assets/z_image/{bf16,q2_K,q3_K,q4_0,q4_K,q5_0,q6_K,q8_0}.png (roughly 1.0 to 1.1 MiB each).

A number of other added or modified image assets appear here as well, but their file names are not preserved in this view.
1003
clip.hpp Normal file

File diff suppressed because it is too large

591
common.hpp Normal file
View File

@ -0,0 +1,591 @@
#ifndef __COMMON_HPP__
#define __COMMON_HPP__
#include "ggml_extend.hpp"
class DownSampleBlock : public GGMLBlock {
protected:
int channels;
int out_channels;
bool vae_downsample;
public:
DownSampleBlock(int channels,
int out_channels,
bool vae_downsample = false)
: channels(channels),
out_channels(out_channels),
vae_downsample(vae_downsample) {
if (vae_downsample) {
blocks["conv"] = std::shared_ptr<GGMLBlock>(new Conv2d(channels, out_channels, {3, 3}, {2, 2}, {0, 0}));
} else {
blocks["op"] = std::shared_ptr<GGMLBlock>(new Conv2d(channels, out_channels, {3, 3}, {2, 2}, {1, 1}));
}
}
struct ggml_tensor* forward(GGMLRunnerContext* ctx, struct ggml_tensor* x) {
// x: [N, channels, h, w]
if (vae_downsample) {
auto conv = std::dynamic_pointer_cast<Conv2d>(blocks["conv"]);
x = ggml_pad(ctx->ggml_ctx, x, 1, 1, 0, 0);
x = conv->forward(ctx, x);
} else {
auto conv = std::dynamic_pointer_cast<Conv2d>(blocks["op"]);
x = conv->forward(ctx, x);
}
return x; // [N, out_channels, h/2, w/2]
}
};
class UpSampleBlock : public GGMLBlock {
protected:
int channels;
int out_channels;
public:
UpSampleBlock(int channels,
int out_channels)
: channels(channels),
out_channels(out_channels) {
blocks["conv"] = std::shared_ptr<GGMLBlock>(new Conv2d(channels, out_channels, {3, 3}, {1, 1}, {1, 1}));
}
struct ggml_tensor* forward(GGMLRunnerContext* ctx, struct ggml_tensor* x) {
// x: [N, channels, h, w]
auto conv = std::dynamic_pointer_cast<Conv2d>(blocks["conv"]);
x = ggml_upscale(ctx->ggml_ctx, x, 2, GGML_SCALE_MODE_NEAREST); // [N, channels, h*2, w*2]
x = conv->forward(ctx, x); // [N, out_channels, h*2, w*2]
return x;
}
};
class ResBlock : public GGMLBlock {
protected:
// network hparams
int64_t channels; // model_channels * (1, 1, 1, 2, 2, 4, 4, 4)
int64_t emb_channels; // time_embed_dim
int64_t out_channels; // mult * model_channels
std::pair<int, int> kernel_size;
int dims;
bool skip_t_emb;
bool exchange_temb_dims;
std::shared_ptr<GGMLBlock> conv_nd(int dims,
int64_t in_channels,
int64_t out_channels,
std::pair<int, int> kernel_size,
std::pair<int, int> padding) {
GGML_ASSERT(dims == 2 || dims == 3);
if (dims == 3) {
return std::shared_ptr<GGMLBlock>(new Conv3dnx1x1(in_channels, out_channels, kernel_size.first, 1, padding.first));
} else {
return std::shared_ptr<GGMLBlock>(new Conv2d(in_channels, out_channels, kernel_size, {1, 1}, padding));
}
}
public:
ResBlock(int64_t channels,
int64_t emb_channels,
int64_t out_channels,
std::pair<int, int> kernel_size = {3, 3},
int dims = 2,
bool exchange_temb_dims = false,
bool skip_t_emb = false)
: channels(channels),
emb_channels(emb_channels),
out_channels(out_channels),
kernel_size(kernel_size),
dims(dims),
skip_t_emb(skip_t_emb),
exchange_temb_dims(exchange_temb_dims) {
std::pair<int, int> padding = {kernel_size.first / 2, kernel_size.second / 2};
blocks["in_layers.0"] = std::shared_ptr<GGMLBlock>(new GroupNorm32(channels));
// in_layer_1 is nn.SILU()
blocks["in_layers.2"] = conv_nd(dims, channels, out_channels, kernel_size, padding);
if (!skip_t_emb) {
// emb_layer_0 is nn.SILU()
blocks["emb_layers.1"] = std::shared_ptr<GGMLBlock>(new Linear(emb_channels, out_channels));
}
blocks["out_layers.0"] = std::shared_ptr<GGMLBlock>(new GroupNorm32(out_channels));
// out_layer_1 is nn.SILU()
// out_layer_2 is nn.Dropout(), skip for inference
blocks["out_layers.3"] = conv_nd(dims, out_channels, out_channels, kernel_size, padding);
if (out_channels != channels) {
blocks["skip_connection"] = conv_nd(dims, channels, out_channels, {1, 1}, {0, 0});
}
}
virtual struct ggml_tensor* forward(GGMLRunnerContext* ctx, struct ggml_tensor* x, struct ggml_tensor* emb = nullptr) {
// For dims==3, we reduce dimension from 5d to 4d by merging h and w, in order not to change ggml
// [N, c, t, h, w] => [N, c, t, h * w]
// x: [N, channels, h, w] if dims == 2 else [N, channels, t, h, w]
// emb: [N, emb_channels] if dims == 2 else [N, t, emb_channels]
auto in_layers_0 = std::dynamic_pointer_cast<GroupNorm32>(blocks["in_layers.0"]);
auto in_layers_2 = std::dynamic_pointer_cast<UnaryBlock>(blocks["in_layers.2"]);
auto out_layers_0 = std::dynamic_pointer_cast<GroupNorm32>(blocks["out_layers.0"]);
auto out_layers_3 = std::dynamic_pointer_cast<UnaryBlock>(blocks["out_layers.3"]);
if (emb == nullptr) {
GGML_ASSERT(skip_t_emb);
}
// in_layers
auto h = in_layers_0->forward(ctx, x);
h = ggml_silu_inplace(ctx->ggml_ctx, h);
h = in_layers_2->forward(ctx, h); // [N, out_channels, h, w] if dims == 2 else [N, out_channels, t, h, w]
// emb_layers
if (!skip_t_emb) {
auto emb_layer_1 = std::dynamic_pointer_cast<Linear>(blocks["emb_layers.1"]);
auto emb_out = ggml_silu(ctx->ggml_ctx, emb);
emb_out = emb_layer_1->forward(ctx, emb_out); // [N, out_channels] if dims == 2 else [N, t, out_channels]
if (dims == 2) {
emb_out = ggml_reshape_4d(ctx->ggml_ctx, emb_out, 1, 1, emb_out->ne[0], emb_out->ne[1]); // [N, out_channels, 1, 1]
} else {
emb_out = ggml_reshape_4d(ctx->ggml_ctx, emb_out, 1, emb_out->ne[0], emb_out->ne[1], emb_out->ne[2]); // [N, t, out_channels, 1]
if (exchange_temb_dims) {
// emb_out = rearrange(emb_out, "b t c ... -> b c t ...")
emb_out = ggml_cont(ctx->ggml_ctx, ggml_permute(ctx->ggml_ctx, emb_out, 0, 2, 1, 3)); // [N, out_channels, t, 1]
}
}
h = ggml_add(ctx->ggml_ctx, h, emb_out); // [N, out_channels, h, w] if dims == 2 else [N, out_channels, t, h, w]
}
// out_layers
h = out_layers_0->forward(ctx, h);
h = ggml_silu_inplace(ctx->ggml_ctx, h);
// dropout, skip for inference
h = out_layers_3->forward(ctx, h);
// skip connection
if (out_channels != channels) {
auto skip_connection = std::dynamic_pointer_cast<UnaryBlock>(blocks["skip_connection"]);
x = skip_connection->forward(ctx, x); // [N, out_channels, h, w] if dims == 2 else [N, out_channels, t, h, w]
}
h = ggml_add(ctx->ggml_ctx, h, x);
return h; // [N, out_channels, h, w] if dims == 2 else [N, out_channels, t, h, w]
}
};
class GEGLU : public UnaryBlock {
protected:
int64_t dim_in;
int64_t dim_out;
public:
GEGLU(int64_t dim_in, int64_t dim_out)
: dim_in(dim_in), dim_out(dim_out) {
blocks["proj"] = std::shared_ptr<GGMLBlock>(new Linear(dim_in, dim_out * 2));
}
struct ggml_tensor* forward(GGMLRunnerContext* ctx, struct ggml_tensor* x) override {
// x: [ne3, ne2, ne1, dim_in]
// return: [ne3, ne2, ne1, dim_out]
auto proj = std::dynamic_pointer_cast<Linear>(blocks["proj"]);
x = proj->forward(ctx, x); // [ne3, ne2, ne1, dim_out*2]
auto x_vec = ggml_ext_chunk(ctx->ggml_ctx, x, 2, 0);
x = x_vec[0]; // [ne3, ne2, ne1, dim_out]
auto gate = x_vec[1]; // [ne3, ne2, ne1, dim_out]
gate = ggml_gelu_inplace(ctx->ggml_ctx, gate);
x = ggml_mul(ctx->ggml_ctx, x, gate); // [ne3, ne2, ne1, dim_out]
return x;
}
};
class GELU : public UnaryBlock {
public:
GELU(int64_t dim_in, int64_t dim_out, bool bias = true) {
blocks["proj"] = std::shared_ptr<GGMLBlock>(new Linear(dim_in, dim_out, bias));
}
struct ggml_tensor* forward(GGMLRunnerContext* ctx, struct ggml_tensor* x) override {
// x: [ne3, ne2, ne1, dim_in]
// return: [ne3, ne2, ne1, dim_out]
auto proj = std::dynamic_pointer_cast<Linear>(blocks["proj"]);
x = proj->forward(ctx, x);
x = ggml_gelu_inplace(ctx->ggml_ctx, x);
return x;
}
};
class FeedForward : public GGMLBlock {
public:
enum class Activation {
GEGLU,
GELU
};
FeedForward(int64_t dim,
int64_t dim_out,
int64_t mult = 4,
Activation activation = Activation::GEGLU,
bool precision_fix = false) {
int64_t inner_dim = dim * mult;
if (activation == Activation::GELU) {
blocks["net.0"] = std::shared_ptr<GGMLBlock>(new GELU(dim, inner_dim));
} else {
blocks["net.0"] = std::shared_ptr<GGMLBlock>(new GEGLU(dim, inner_dim));
}
// net_1 is nn.Dropout(), skip for inference
bool force_prec_f32 = false;
float scale = 1.f;
if (precision_fix) {
scale = 1.f / 128.f;
#ifdef SD_USE_VULKAN
force_prec_f32 = true;
#endif
}
// The purpose of the scale here is to prevent NaN issues in certain situations.
// For example, when using Vulkan without enabling force_prec_f32,
// or when using CUDA but the weights are k-quants.
blocks["net.2"] = std::shared_ptr<GGMLBlock>(new Linear(inner_dim, dim_out, true, false, force_prec_f32, scale));
}
struct ggml_tensor* forward(GGMLRunnerContext* ctx, struct ggml_tensor* x) {
// x: [ne3, ne2, ne1, dim]
// return: [ne3, ne2, ne1, dim_out]
auto net_0 = std::dynamic_pointer_cast<UnaryBlock>(blocks["net.0"]);
auto net_2 = std::dynamic_pointer_cast<Linear>(blocks["net.2"]);
x = net_0->forward(ctx, x); // [ne3, ne2, ne1, inner_dim]
x = net_2->forward(ctx, x); // [ne3, ne2, ne1, dim_out]
return x;
}
};
class CrossAttention : public GGMLBlock {
protected:
int64_t query_dim;
int64_t context_dim;
int64_t n_head;
int64_t d_head;
public:
CrossAttention(int64_t query_dim,
int64_t context_dim,
int64_t n_head,
int64_t d_head)
: n_head(n_head),
d_head(d_head),
query_dim(query_dim),
context_dim(context_dim) {
int64_t inner_dim = d_head * n_head;
blocks["to_q"] = std::shared_ptr<GGMLBlock>(new Linear(query_dim, inner_dim, false));
blocks["to_k"] = std::shared_ptr<GGMLBlock>(new Linear(context_dim, inner_dim, false));
blocks["to_v"] = std::shared_ptr<GGMLBlock>(new Linear(context_dim, inner_dim, false));
blocks["to_out.0"] = std::shared_ptr<GGMLBlock>(new Linear(inner_dim, query_dim));
// to_out_1 is nn.Dropout(), skip for inference
}
struct ggml_tensor* forward(GGMLRunnerContext* ctx,
struct ggml_tensor* x,
struct ggml_tensor* context) {
// x: [N, n_token, query_dim]
// context: [N, n_context, context_dim]
// return: [N, n_token, query_dim]
auto to_q = std::dynamic_pointer_cast<Linear>(blocks["to_q"]);
auto to_k = std::dynamic_pointer_cast<Linear>(blocks["to_k"]);
auto to_v = std::dynamic_pointer_cast<Linear>(blocks["to_v"]);
auto to_out_0 = std::dynamic_pointer_cast<Linear>(blocks["to_out.0"]);
int64_t n = x->ne[2];
int64_t n_token = x->ne[1];
int64_t n_context = context->ne[1];
int64_t inner_dim = d_head * n_head;
auto q = to_q->forward(ctx, x); // [N, n_token, inner_dim]
auto k = to_k->forward(ctx, context); // [N, n_context, inner_dim]
auto v = to_v->forward(ctx, context); // [N, n_context, inner_dim]
x = ggml_ext_attention_ext(ctx->ggml_ctx, ctx->backend, q, k, v, n_head, nullptr, false, false, ctx->flash_attn_enabled); // [N, n_token, inner_dim]
x = to_out_0->forward(ctx, x); // [N, n_token, query_dim]
return x;
}
};
class BasicTransformerBlock : public GGMLBlock {
protected:
int64_t n_head;
int64_t d_head;
bool ff_in;
public:
BasicTransformerBlock(int64_t dim,
int64_t n_head,
int64_t d_head,
int64_t context_dim,
bool ff_in = false)
: n_head(n_head), d_head(d_head), ff_in(ff_in) {
// disable_self_attn is always False
// disable_temporal_crossattention is always False
// switch_temporal_ca_to_sa is always False
// inner_dim is always None or equal to dim
// gated_ff is always True
blocks["attn1"] = std::shared_ptr<GGMLBlock>(new CrossAttention(dim, dim, n_head, d_head));
blocks["attn2"] = std::shared_ptr<GGMLBlock>(new CrossAttention(dim, context_dim, n_head, d_head));
blocks["ff"] = std::shared_ptr<GGMLBlock>(new FeedForward(dim, dim));
blocks["norm1"] = std::shared_ptr<GGMLBlock>(new LayerNorm(dim));
blocks["norm2"] = std::shared_ptr<GGMLBlock>(new LayerNorm(dim));
blocks["norm3"] = std::shared_ptr<GGMLBlock>(new LayerNorm(dim));
if (ff_in) {
blocks["norm_in"] = std::shared_ptr<GGMLBlock>(new LayerNorm(dim));
blocks["ff_in"] = std::shared_ptr<GGMLBlock>(new FeedForward(dim, dim));
}
}
struct ggml_tensor* forward(GGMLRunnerContext* ctx,
struct ggml_tensor* x,
struct ggml_tensor* context) {
// x: [N, n_token, query_dim]
// context: [N, n_context, context_dim]
// return: [N, n_token, query_dim]
auto attn1 = std::dynamic_pointer_cast<CrossAttention>(blocks["attn1"]);
auto attn2 = std::dynamic_pointer_cast<CrossAttention>(blocks["attn2"]);
auto ff = std::dynamic_pointer_cast<FeedForward>(blocks["ff"]);
auto norm1 = std::dynamic_pointer_cast<LayerNorm>(blocks["norm1"]);
auto norm2 = std::dynamic_pointer_cast<LayerNorm>(blocks["norm2"]);
auto norm3 = std::dynamic_pointer_cast<LayerNorm>(blocks["norm3"]);
if (ff_in) {
auto norm_in = std::dynamic_pointer_cast<LayerNorm>(blocks["norm_in"]);
auto ff_in = std::dynamic_pointer_cast<FeedForward>(blocks["ff_in"]);
auto x_skip = x;
x = norm_in->forward(ctx, x);
x = ff_in->forward(ctx, x);
// self.is_res is always True
x = ggml_add(ctx->ggml_ctx, x, x_skip);
}
auto r = x;
x = norm1->forward(ctx, x);
x = attn1->forward(ctx, x, x); // self-attention
x = ggml_add(ctx->ggml_ctx, x, r);
r = x;
x = norm2->forward(ctx, x);
x = attn2->forward(ctx, x, context); // cross-attention
x = ggml_add(ctx->ggml_ctx, x, r);
r = x;
x = norm3->forward(ctx, x);
x = ff->forward(ctx, x);
x = ggml_add(ctx->ggml_ctx, x, r);
return x;
}
};
class SpatialTransformer : public GGMLBlock {
protected:
int64_t in_channels; // mult * model_channels
int64_t n_head;
int64_t d_head;
int64_t depth = 1; // 1
int64_t context_dim = 768; // hidden_size, 1024 for VERSION_SD2
bool use_linear = false;
void init_params(struct ggml_context* ctx, const String2TensorStorage& tensor_storage_map = {}, const std::string prefix = "") {
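// Some checkpoints store proj_in/proj_out as 1x1 convolutions (4-dim weights) while others store them as
// linear layers (2-dim weights); switch use_linear so the blocks match the weight shape in the checkpoint.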
auto iter = tensor_storage_map.find(prefix + "proj_out.weight");
if (iter != tensor_storage_map.end()) {
int64_t inner_dim = n_head * d_head;
if (iter->second.n_dims == 4 && use_linear) {
use_linear = false;
blocks["proj_in"] = std::make_shared<Conv2d>(in_channels, inner_dim, std::pair{1, 1});
blocks["proj_out"] = std::make_shared<Conv2d>(inner_dim, in_channels, std::pair{1, 1});
} else if (iter->second.n_dims == 2 && !use_linear) {
use_linear = true;
blocks["proj_in"] = std::make_shared<Linear>(in_channels, inner_dim);
blocks["proj_out"] = std::make_shared<Linear>(inner_dim, in_channels);
}
}
}
public:
SpatialTransformer(int64_t in_channels,
int64_t n_head,
int64_t d_head,
int64_t depth,
int64_t context_dim,
bool use_linear)
: in_channels(in_channels),
n_head(n_head),
d_head(d_head),
depth(depth),
context_dim(context_dim),
use_linear(use_linear) {
// disable_self_attn is always False
int64_t inner_dim = n_head * d_head; // in_channels
blocks["norm"] = std::shared_ptr<GGMLBlock>(new GroupNorm32(in_channels));
if (use_linear) {
blocks["proj_in"] = std::shared_ptr<GGMLBlock>(new Linear(in_channels, inner_dim));
} else {
blocks["proj_in"] = std::shared_ptr<GGMLBlock>(new Conv2d(in_channels, inner_dim, {1, 1}));
}
for (int i = 0; i < depth; i++) {
std::string name = "transformer_blocks." + std::to_string(i);
blocks[name] = std::shared_ptr<GGMLBlock>(new BasicTransformerBlock(inner_dim, n_head, d_head, context_dim, false));
}
if (use_linear) {
blocks["proj_out"] = std::shared_ptr<GGMLBlock>(new Linear(inner_dim, in_channels));
} else {
blocks["proj_out"] = std::shared_ptr<GGMLBlock>(new Conv2d(inner_dim, in_channels, {1, 1}));
}
}
virtual struct ggml_tensor* forward(GGMLRunnerContext* ctx,
struct ggml_tensor* x,
struct ggml_tensor* context) {
// x: [N, in_channels, h, w]
// context: [N, max_position(aka n_token), hidden_size(aka context_dim)]
auto norm = std::dynamic_pointer_cast<GroupNorm32>(blocks["norm"]);
auto proj_in = std::dynamic_pointer_cast<UnaryBlock>(blocks["proj_in"]);
auto proj_out = std::dynamic_pointer_cast<UnaryBlock>(blocks["proj_out"]);
auto x_in = x;
int64_t n = x->ne[3];
int64_t h = x->ne[1];
int64_t w = x->ne[0];
int64_t inner_dim = n_head * d_head;
x = norm->forward(ctx, x);
if (use_linear) {
x = ggml_cont(ctx->ggml_ctx, ggml_permute(ctx->ggml_ctx, x, 1, 2, 0, 3)); // [N, h, w, inner_dim]
x = ggml_reshape_3d(ctx->ggml_ctx, x, inner_dim, w * h, n); // [N, h * w, inner_dim]
x = proj_in->forward(ctx, x); // [N, h * w, inner_dim]
} else {
x = proj_in->forward(ctx, x); // [N, inner_dim, h, w]
x = ggml_cont(ctx->ggml_ctx, ggml_permute(ctx->ggml_ctx, x, 1, 2, 0, 3)); // [N, h, w, inner_dim]
x = ggml_reshape_3d(ctx->ggml_ctx, x, inner_dim, w * h, n); // [N, h * w, inner_dim]
}
for (int i = 0; i < depth; i++) {
std::string name = "transformer_blocks." + std::to_string(i);
auto transformer_block = std::dynamic_pointer_cast<BasicTransformerBlock>(blocks[name]);
x = transformer_block->forward(ctx, x, context);
}
if (use_linear) {
// proj_out
x = proj_out->forward(ctx, x); // [N, h * w, in_channels]
x = ggml_cont(ctx->ggml_ctx, ggml_permute(ctx->ggml_ctx, x, 1, 0, 2, 3)); // [N, inner_dim, h * w]
x = ggml_reshape_4d(ctx->ggml_ctx, x, w, h, inner_dim, n); // [N, inner_dim, h, w]
} else {
x = ggml_cont(ctx->ggml_ctx, ggml_permute(ctx->ggml_ctx, x, 1, 0, 2, 3)); // [N, inner_dim, h * w]
x = ggml_reshape_4d(ctx->ggml_ctx, x, w, h, inner_dim, n); // [N, inner_dim, h, w]
// proj_out
x = proj_out->forward(ctx, x); // [N, in_channels, h, w]
}
x = ggml_add(ctx->ggml_ctx, x, x_in);
return x;
}
};
class AlphaBlender : public GGMLBlock {
protected:
void init_params(struct ggml_context* ctx, const String2TensorStorage& tensor_storage_map = {}, std::string prefix = "") override {
// Get the type of the "mix_factor" tensor from the input tensors map with the specified prefix
enum ggml_type wtype = GGML_TYPE_F32;
params["mix_factor"] = ggml_new_tensor_1d(ctx, wtype, 1);
}
float get_alpha() {
// image_only_indicator is always tensor([0.]) and since mix_factor.shape is [1,]
// so learned_with_images is same as learned
float alpha = ggml_ext_backend_tensor_get_f32(params["mix_factor"]);
return sigmoid(alpha);
}
public:
AlphaBlender() {
// merge_strategy is always learned_with_images
// for inference, we don't need to set alpha
// since mix_factor.shape is [1,], we don't need rearrange using rearrange_pattern
}
struct ggml_tensor* forward(GGMLRunnerContext* ctx,
struct ggml_tensor* x_spatial,
struct ggml_tensor* x_temporal) {
// image_only_indicator is always tensor([0.])
float alpha = get_alpha();
auto x = ggml_add(ctx->ggml_ctx,
ggml_scale(ctx->ggml_ctx, x_spatial, alpha),
ggml_scale(ctx->ggml_ctx, x_temporal, 1.0f - alpha));
return x;
}
};
class VideoResBlock : public ResBlock {
public:
VideoResBlock(int channels,
int emb_channels,
int out_channels,
std::pair<int, int> kernel_size = {3, 3},
int64_t video_kernel_size = 3,
int dims = 2) // always 2
: ResBlock(channels, emb_channels, out_channels, kernel_size, dims) {
blocks["time_stack"] = std::shared_ptr<GGMLBlock>(new ResBlock(out_channels, emb_channels, out_channels, kernel_size, 3, true));
blocks["time_mixer"] = std::shared_ptr<GGMLBlock>(new AlphaBlender());
}
struct ggml_tensor* forward(GGMLRunnerContext* ctx,
struct ggml_tensor* x,
struct ggml_tensor* emb,
int num_video_frames) {
// x: [N, channels, h, w] aka [b*t, channels, h, w]
// emb: [N, emb_channels] aka [b*t, emb_channels]
// image_only_indicator is always tensor([0.])
auto time_stack = std::dynamic_pointer_cast<ResBlock>(blocks["time_stack"]);
auto time_mixer = std::dynamic_pointer_cast<AlphaBlender>(blocks["time_mixer"]);
x = ResBlock::forward(ctx, x, emb);
int64_t T = num_video_frames;
int64_t B = x->ne[3] / T;
int64_t C = x->ne[2];
int64_t H = x->ne[1];
int64_t W = x->ne[0];
x = ggml_reshape_4d(ctx->ggml_ctx, x, W * H, C, T, B); // (b t) c h w -> b t c (h w)
x = ggml_cont(ctx->ggml_ctx, ggml_permute(ctx->ggml_ctx, x, 0, 2, 1, 3)); // b t c (h w) -> b c t (h w)
auto x_mix = x;
emb = ggml_reshape_4d(ctx->ggml_ctx, emb, emb->ne[0], T, B, emb->ne[3]); // (b t) ... -> b t ...
x = time_stack->forward(ctx, x, emb); // b t c (h w)
x = time_mixer->forward(ctx, x_mix, x); // b t c (h w)
x = ggml_cont(ctx->ggml_ctx, ggml_permute(ctx->ggml_ctx, x, 0, 2, 1, 3)); // b c t (h w) -> b t c (h w)
x = ggml_reshape_4d(ctx->ggml_ctx, x, W, H, C, T * B); // b t c (h w) -> (b t) c h w
return x;
}
};
#endif // __COMMON_HPP__

conditioner.hpp (new file, 1897 lines; diff too large, not shown)

control.hpp (new file, 466 lines)
#ifndef __CONTROL_HPP__
#define __CONTROL_HPP__
#include "common.hpp"
#include "ggml_extend.hpp"
#include "model.h"
#define CONTROL_NET_GRAPH_SIZE 1536
/*
=================================== ControlNet ===================================
Reference: https://github.com/comfyanonymous/ComfyUI/blob/master/comfy/cldm/cldm.py
*/
class ControlNetBlock : public GGMLBlock {
protected:
SDVersion version = VERSION_SD1;
// network hparams
int in_channels = 4;
int out_channels = 4;
int hint_channels = 3;
int num_res_blocks = 2;
std::vector<int> attention_resolutions = {4, 2, 1};
std::vector<int> channel_mult = {1, 2, 4, 4};
std::vector<int> transformer_depth = {1, 1, 1, 1};
int time_embed_dim = 1280; // model_channels*4
int num_heads = 8;
int num_head_channels = -1; // channels // num_heads
int context_dim = 768; // 1024 for VERSION_SD2, 2048 for VERSION_SDXL
bool use_linear_projection = false;
public:
int model_channels = 320;
int adm_in_channels = 2816; // only for VERSION_SDXL
ControlNetBlock(SDVersion version = VERSION_SD1)
: version(version) {
if (sd_version_is_sd2(version)) {
context_dim = 1024;
num_head_channels = 64;
num_heads = -1;
} else if (sd_version_is_sdxl(version)) {
context_dim = 2048;
attention_resolutions = {4, 2};
channel_mult = {1, 2, 4};
transformer_depth = {1, 2, 10};
num_head_channels = 64;
num_heads = -1;
} else if (version == VERSION_SVD) {
in_channels = 8;
out_channels = 4;
context_dim = 1024;
adm_in_channels = 768;
num_head_channels = 64;
num_heads = -1;
}
blocks["time_embed.0"] = std::shared_ptr<GGMLBlock>(new Linear(model_channels, time_embed_dim));
// time_embed_1 is nn.SiLU()
blocks["time_embed.2"] = std::shared_ptr<GGMLBlock>(new Linear(time_embed_dim, time_embed_dim));
if (sd_version_is_sdxl(version) || version == VERSION_SVD) {
blocks["label_emb.0.0"] = std::shared_ptr<GGMLBlock>(new Linear(adm_in_channels, time_embed_dim));
// label_emb_1 is nn.SiLU()
blocks["label_emb.0.2"] = std::shared_ptr<GGMLBlock>(new Linear(time_embed_dim, time_embed_dim));
}
// input_blocks
blocks["input_blocks.0.0"] = std::shared_ptr<GGMLBlock>(new Conv2d(in_channels, model_channels, {3, 3}, {1, 1}, {1, 1}));
std::vector<int> input_block_chans;
input_block_chans.push_back(model_channels);
int ch = model_channels;
int input_block_idx = 0;
int ds = 1;
auto get_resblock = [&](int64_t channels, int64_t emb_channels, int64_t out_channels) -> ResBlock* {
return new ResBlock(channels, emb_channels, out_channels);
};
auto get_attention_layer = [&](int64_t in_channels,
int64_t n_head,
int64_t d_head,
int64_t depth,
int64_t context_dim) -> SpatialTransformer* {
return new SpatialTransformer(in_channels, n_head, d_head, depth, context_dim, use_linear_projection);
};
auto make_zero_conv = [&](int64_t channels) {
return new Conv2d(channels, channels, {1, 1});
};
blocks["zero_convs.0.0"] = std::shared_ptr<GGMLBlock>(make_zero_conv(model_channels));
blocks["input_hint_block.0"] = std::shared_ptr<GGMLBlock>(new Conv2d(hint_channels, 16, {3, 3}, {1, 1}, {1, 1}));
// nn.SiLU()
blocks["input_hint_block.2"] = std::shared_ptr<GGMLBlock>(new Conv2d(16, 16, {3, 3}, {1, 1}, {1, 1}));
// nn.SiLU()
blocks["input_hint_block.4"] = std::shared_ptr<GGMLBlock>(new Conv2d(16, 32, {3, 3}, {2, 2}, {1, 1}));
// nn.SiLU()
blocks["input_hint_block.6"] = std::shared_ptr<GGMLBlock>(new Conv2d(32, 32, {3, 3}, {1, 1}, {1, 1}));
// nn.SiLU()
blocks["input_hint_block.8"] = std::shared_ptr<GGMLBlock>(new Conv2d(32, 96, {3, 3}, {2, 2}, {1, 1}));
// nn.SiLU()
blocks["input_hint_block.10"] = std::shared_ptr<GGMLBlock>(new Conv2d(96, 96, {3, 3}, {1, 1}, {1, 1}));
// nn.SiLU()
blocks["input_hint_block.12"] = std::shared_ptr<GGMLBlock>(new Conv2d(96, 256, {3, 3}, {2, 2}, {1, 1}));
// nn.SiLU()
blocks["input_hint_block.14"] = std::shared_ptr<GGMLBlock>(new Conv2d(256, model_channels, {3, 3}, {1, 1}, {1, 1}));
size_t len_mults = channel_mult.size();
for (int i = 0; i < len_mults; i++) {
int mult = channel_mult[i];
for (int j = 0; j < num_res_blocks; j++) {
input_block_idx += 1;
std::string name = "input_blocks." + std::to_string(input_block_idx) + ".0";
blocks[name] = std::shared_ptr<GGMLBlock>(get_resblock(ch, time_embed_dim, mult * model_channels));
ch = mult * model_channels;
if (std::find(attention_resolutions.begin(), attention_resolutions.end(), ds) != attention_resolutions.end()) {
int n_head = num_heads;
int d_head = ch / num_heads;
if (num_head_channels != -1) {
d_head = num_head_channels;
n_head = ch / d_head;
}
std::string name = "input_blocks." + std::to_string(input_block_idx) + ".1";
blocks[name] = std::shared_ptr<GGMLBlock>(get_attention_layer(ch,
n_head,
d_head,
transformer_depth[i],
context_dim));
}
blocks["zero_convs." + std::to_string(input_block_idx) + ".0"] = std::shared_ptr<GGMLBlock>(make_zero_conv(ch));
input_block_chans.push_back(ch);
}
if (i != len_mults - 1) {
input_block_idx += 1;
std::string name = "input_blocks." + std::to_string(input_block_idx) + ".0";
blocks[name] = std::shared_ptr<GGMLBlock>(new DownSampleBlock(ch, ch));
blocks["zero_convs." + std::to_string(input_block_idx) + ".0"] = std::shared_ptr<GGMLBlock>(make_zero_conv(ch));
input_block_chans.push_back(ch);
ds *= 2;
}
}
// middle blocks
int n_head = num_heads;
int d_head = ch / num_heads;
if (num_head_channels != -1) {
d_head = num_head_channels;
n_head = ch / d_head;
}
blocks["middle_block.0"] = std::shared_ptr<GGMLBlock>(get_resblock(ch, time_embed_dim, ch));
blocks["middle_block.1"] = std::shared_ptr<GGMLBlock>(get_attention_layer(ch,
n_head,
d_head,
transformer_depth[transformer_depth.size() - 1],
context_dim));
blocks["middle_block.2"] = std::shared_ptr<GGMLBlock>(get_resblock(ch, time_embed_dim, ch));
// middle_block_out
blocks["middle_block_out.0"] = std::shared_ptr<GGMLBlock>(make_zero_conv(ch));
}
struct ggml_tensor* resblock_forward(std::string name,
GGMLRunnerContext* ctx,
struct ggml_tensor* x,
struct ggml_tensor* emb) {
auto block = std::dynamic_pointer_cast<ResBlock>(blocks[name]);
return block->forward(ctx, x, emb);
}
struct ggml_tensor* attention_layer_forward(std::string name,
GGMLRunnerContext* ctx,
struct ggml_tensor* x,
struct ggml_tensor* context) {
auto block = std::dynamic_pointer_cast<SpatialTransformer>(blocks[name]);
return block->forward(ctx, x, context);
}
struct ggml_tensor* input_hint_block_forward(GGMLRunnerContext* ctx,
struct ggml_tensor* hint,
struct ggml_tensor* emb,
struct ggml_tensor* context) {
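// input_hint_block is a stack of Conv2d layers (even indices 0..14) with SiLU activations in between (odd indices);
// emb and context are unused here.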
int num_input_blocks = 15;
auto h = hint;
for (int i = 0; i < num_input_blocks; i++) {
if (i % 2 == 0) {
auto block = std::dynamic_pointer_cast<Conv2d>(blocks["input_hint_block." + std::to_string(i)]);
h = block->forward(ctx, h);
} else {
h = ggml_silu_inplace(ctx->ggml_ctx, h);
}
}
return h;
}
std::vector<struct ggml_tensor*> forward(GGMLRunnerContext* ctx,
struct ggml_tensor* x,
struct ggml_tensor* hint,
struct ggml_tensor* guided_hint,
struct ggml_tensor* timesteps,
struct ggml_tensor* context,
struct ggml_tensor* y = nullptr) {
// x: [N, in_channels, h, w] or [N, in_channels/2, h, w]
// timesteps: [N,]
// context: [N, max_position, hidden_size] or [1, max_position, hidden_size]. for example, [N, 77, 768]
// y: [N, adm_in_channels] or [1, adm_in_channels]
if (context != nullptr) {
if (context->ne[2] != x->ne[3]) {
context = ggml_repeat(ctx->ggml_ctx, context, ggml_new_tensor_3d(ctx->ggml_ctx, GGML_TYPE_F32, context->ne[0], context->ne[1], x->ne[3]));
}
}
if (y != nullptr) {
if (y->ne[1] != x->ne[3]) {
y = ggml_repeat(ctx->ggml_ctx, y, ggml_new_tensor_2d(ctx->ggml_ctx, GGML_TYPE_F32, y->ne[0], x->ne[3]));
}
}
auto time_embed_0 = std::dynamic_pointer_cast<Linear>(blocks["time_embed.0"]);
auto time_embed_2 = std::dynamic_pointer_cast<Linear>(blocks["time_embed.2"]);
auto input_blocks_0_0 = std::dynamic_pointer_cast<Conv2d>(blocks["input_blocks.0.0"]);
auto zero_convs_0 = std::dynamic_pointer_cast<Conv2d>(blocks["zero_convs.0.0"]);
auto middle_block_out = std::dynamic_pointer_cast<Conv2d>(blocks["middle_block_out.0"]);
auto t_emb = ggml_ext_timestep_embedding(ctx->ggml_ctx, timesteps, model_channels); // [N, model_channels]
auto emb = time_embed_0->forward(ctx, t_emb);
emb = ggml_silu_inplace(ctx->ggml_ctx, emb);
emb = time_embed_2->forward(ctx, emb); // [N, time_embed_dim]
// SDXL/SVD
if (y != nullptr) {
auto label_embed_0 = std::dynamic_pointer_cast<Linear>(blocks["label_emb.0.0"]);
auto label_embed_2 = std::dynamic_pointer_cast<Linear>(blocks["label_emb.0.2"]);
auto label_emb = label_embed_0->forward(ctx, y);
label_emb = ggml_silu_inplace(ctx->ggml_ctx, label_emb);
label_emb = label_embed_2->forward(ctx, label_emb); // [N, time_embed_dim]
emb = ggml_add(ctx->ggml_ctx, emb, label_emb); // [N, time_embed_dim]
}
std::vector<struct ggml_tensor*> outs;
if (guided_hint == nullptr) {
guided_hint = input_hint_block_forward(ctx, hint, emb, context);
}
outs.push_back(guided_hint);
// input_blocks
// input block 0
auto h = input_blocks_0_0->forward(ctx, x);
h = ggml_add(ctx->ggml_ctx, h, guided_hint);
outs.push_back(zero_convs_0->forward(ctx, h));
// input block 1-11
size_t len_mults = channel_mult.size();
int input_block_idx = 0;
int ds = 1;
for (int i = 0; i < len_mults; i++) {
int mult = channel_mult[i];
for (int j = 0; j < num_res_blocks; j++) {
input_block_idx += 1;
std::string name = "input_blocks." + std::to_string(input_block_idx) + ".0";
h = resblock_forward(name, ctx, h, emb); // [N, mult*model_channels, h, w]
if (std::find(attention_resolutions.begin(), attention_resolutions.end(), ds) != attention_resolutions.end()) {
std::string name = "input_blocks." + std::to_string(input_block_idx) + ".1";
h = attention_layer_forward(name, ctx, h, context); // [N, mult*model_channels, h, w]
}
auto zero_conv = std::dynamic_pointer_cast<Conv2d>(blocks["zero_convs." + std::to_string(input_block_idx) + ".0"]);
outs.push_back(zero_conv->forward(ctx, h));
}
if (i != len_mults - 1) {
ds *= 2;
input_block_idx += 1;
std::string name = "input_blocks." + std::to_string(input_block_idx) + ".0";
auto block = std::dynamic_pointer_cast<DownSampleBlock>(blocks[name]);
h = block->forward(ctx, h); // [N, mult*model_channels, h/(2^(i+1)), w/(2^(i+1))]
auto zero_conv = std::dynamic_pointer_cast<Conv2d>(blocks["zero_convs." + std::to_string(input_block_idx) + ".0"]);
outs.push_back(zero_conv->forward(ctx, h));
}
}
// [N, 4*model_channels, h/8, w/8]
// middle_block
h = resblock_forward("middle_block.0", ctx, h, emb); // [N, 4*model_channels, h/8, w/8]
h = attention_layer_forward("middle_block.1", ctx, h, context); // [N, 4*model_channels, h/8, w/8]
h = resblock_forward("middle_block.2", ctx, h, emb); // [N, 4*model_channels, h/8, w/8]
// out
outs.push_back(middle_block_out->forward(ctx, h));
return outs;
}
};
struct ControlNet : public GGMLRunner {
SDVersion version = VERSION_SD1;
ControlNetBlock control_net;
ggml_backend_buffer_t control_buffer = nullptr; // keep control output tensors in backend memory
ggml_context* control_ctx = nullptr;
std::vector<struct ggml_tensor*> controls; // (12 input block outputs, 1 middle block output) SD 1.5
struct ggml_tensor* guided_hint = nullptr; // guided_hint cache, for faster inference
bool guided_hint_cached = false;
ControlNet(ggml_backend_t backend,
bool offload_params_to_cpu,
const String2TensorStorage& tensor_storage_map = {},
SDVersion version = VERSION_SD1)
: GGMLRunner(backend, offload_params_to_cpu), control_net(version) {
control_net.init(params_ctx, tensor_storage_map, "");
}
~ControlNet() override {
free_control_ctx();
}
void alloc_control_ctx(std::vector<struct ggml_tensor*> outs) {
struct ggml_init_params params;
params.mem_size = static_cast<size_t>(outs.size() * ggml_tensor_overhead()) + 1024 * 1024;
params.mem_buffer = nullptr;
params.no_alloc = true;
control_ctx = ggml_init(params);
controls.resize(outs.size() - 1);
size_t control_buffer_size = 0;
guided_hint = ggml_dup_tensor(control_ctx, outs[0]);
control_buffer_size += ggml_nbytes(guided_hint);
for (int i = 0; i < outs.size() - 1; i++) {
controls[i] = ggml_dup_tensor(control_ctx, outs[i + 1]);
control_buffer_size += ggml_nbytes(controls[i]);
}
control_buffer = ggml_backend_alloc_ctx_tensors(control_ctx, runtime_backend);
LOG_DEBUG("control buffer size %.2fMB", control_buffer_size * 1.f / 1024.f / 1024.f);
}
void free_control_ctx() {
if (control_buffer != nullptr) {
ggml_backend_buffer_free(control_buffer);
control_buffer = nullptr;
}
if (control_ctx != nullptr) {
ggml_free(control_ctx);
control_ctx = nullptr;
}
guided_hint = nullptr;
guided_hint_cached = false;
controls.clear();
}
std::string get_desc() override {
return "control_net";
}
void get_param_tensors(std::map<std::string, struct ggml_tensor*>& tensors, const std::string prefix) {
control_net.get_param_tensors(tensors, prefix);
}
struct ggml_cgraph* build_graph(struct ggml_tensor* x,
struct ggml_tensor* hint,
struct ggml_tensor* timesteps,
struct ggml_tensor* context,
struct ggml_tensor* y = nullptr) {
struct ggml_cgraph* gf = new_graph_custom(CONTROL_NET_GRAPH_SIZE);
x = to_backend(x);
if (guided_hint_cached) {
hint = nullptr;
} else {
hint = to_backend(hint);
}
context = to_backend(context);
y = to_backend(y);
timesteps = to_backend(timesteps);
auto runner_ctx = get_context();
auto outs = control_net.forward(&runner_ctx,
x,
hint,
guided_hint_cached ? guided_hint : nullptr,
timesteps,
context,
y);
if (control_ctx == nullptr) {
alloc_control_ctx(outs);
}
ggml_build_forward_expand(gf, ggml_cpy(compute_ctx, outs[0], guided_hint));
for (int i = 0; i < outs.size() - 1; i++) {
ggml_build_forward_expand(gf, ggml_cpy(compute_ctx, outs[i + 1], controls[i]));
}
return gf;
}
bool compute(int n_threads,
struct ggml_tensor* x,
struct ggml_tensor* hint,
struct ggml_tensor* timesteps,
struct ggml_tensor* context,
struct ggml_tensor* y,
struct ggml_tensor** output = nullptr,
struct ggml_context* output_ctx = nullptr) {
// x: [N, in_channels, h, w]
// timesteps: [N, ]
// context: [N, max_position, hidden_size]([N, 77, 768]) or [1, max_position, hidden_size]
// y: [N, adm_in_channels] or [1, adm_in_channels]
auto get_graph = [&]() -> struct ggml_cgraph* {
return build_graph(x, hint, timesteps, context, y);
};
bool res = GGMLRunner::compute(get_graph, n_threads, false, output, output_ctx);
if (res) {
// cache guided_hint
guided_hint_cached = true;
}
return res;
}
bool load_from_file(const std::string& file_path, int n_threads) {
LOG_INFO("loading control net from '%s'", file_path.c_str());
alloc_params_buffer();
std::map<std::string, ggml_tensor*> tensors;
control_net.get_param_tensors(tensors);
std::set<std::string> ignore_tensors;
ModelLoader model_loader;
if (!model_loader.init_from_file_and_convert_name(file_path)) {
LOG_ERROR("init control net model loader from file failed: '%s'", file_path.c_str());
return false;
}
bool success = model_loader.load_tensors(tensors, ignore_tensors, n_threads);
if (!success) {
LOG_ERROR("load control net tensors from model loader failed");
return false;
}
LOG_INFO("control net model loaded");
return success;
}
};
#endif // __CONTROL_HPP__

denoiser.hpp (new file, 1605 lines; diff too large, not shown)

diffusion_model.hpp (new file, 424 lines)
#ifndef __DIFFUSION_MODEL_H__
#define __DIFFUSION_MODEL_H__
#include "flux.hpp"
#include "mmdit.hpp"
#include "qwen_image.hpp"
#include "unet.hpp"
#include "wan.hpp"
#include "z_image.hpp"
struct DiffusionParams {
struct ggml_tensor* x = nullptr;
struct ggml_tensor* timesteps = nullptr;
struct ggml_tensor* context = nullptr;
struct ggml_tensor* c_concat = nullptr;
struct ggml_tensor* y = nullptr;
struct ggml_tensor* guidance = nullptr;
std::vector<ggml_tensor*> ref_latents = {};
bool increase_ref_index = false;
int num_video_frames = -1;
std::vector<struct ggml_tensor*> controls = {};
float control_strength = 0.f;
struct ggml_tensor* vace_context = nullptr;
float vace_strength = 1.f;
std::vector<int> skip_layers = {};
};
struct DiffusionModel {
virtual std::string get_desc() = 0;
virtual bool compute(int n_threads,
DiffusionParams diffusion_params,
struct ggml_tensor** output = nullptr,
struct ggml_context* output_ctx = nullptr) = 0;
virtual void alloc_params_buffer() = 0;
virtual void free_params_buffer() = 0;
virtual void free_compute_buffer() = 0;
virtual void get_param_tensors(std::map<std::string, struct ggml_tensor*>& tensors) = 0;
virtual size_t get_params_buffer_size() = 0;
virtual void set_weight_adapter(const std::shared_ptr<WeightAdapter>& adapter) {}
virtual int64_t get_adm_in_channels() = 0;
virtual void set_flash_attn_enabled(bool enabled) = 0;
};
struct UNetModel : public DiffusionModel {
UNetModelRunner unet;
UNetModel(ggml_backend_t backend,
bool offload_params_to_cpu,
const String2TensorStorage& tensor_storage_map = {},
SDVersion version = VERSION_SD1)
: unet(backend, offload_params_to_cpu, tensor_storage_map, "model.diffusion_model", version) {
}
std::string get_desc() override {
return unet.get_desc();
}
void alloc_params_buffer() override {
unet.alloc_params_buffer();
}
void free_params_buffer() override {
unet.free_params_buffer();
}
void free_compute_buffer() override {
unet.free_compute_buffer();
}
void get_param_tensors(std::map<std::string, struct ggml_tensor*>& tensors) override {
unet.get_param_tensors(tensors, "model.diffusion_model");
}
size_t get_params_buffer_size() override {
return unet.get_params_buffer_size();
}
void set_weight_adapter(const std::shared_ptr<WeightAdapter>& adapter) override {
unet.set_weight_adapter(adapter);
}
int64_t get_adm_in_channels() override {
return unet.unet.adm_in_channels;
}
void set_flash_attn_enabled(bool enabled) override {
unet.set_flash_attention_enabled(enabled);
}
bool compute(int n_threads,
DiffusionParams diffusion_params,
struct ggml_tensor** output = nullptr,
struct ggml_context* output_ctx = nullptr) override {
return unet.compute(n_threads,
diffusion_params.x,
diffusion_params.timesteps,
diffusion_params.context,
diffusion_params.c_concat,
diffusion_params.y,
diffusion_params.num_video_frames,
diffusion_params.controls,
diffusion_params.control_strength, output, output_ctx);
}
};
struct MMDiTModel : public DiffusionModel {
MMDiTRunner mmdit;
MMDiTModel(ggml_backend_t backend,
bool offload_params_to_cpu,
const String2TensorStorage& tensor_storage_map = {})
: mmdit(backend, offload_params_to_cpu, tensor_storage_map, "model.diffusion_model") {
}
std::string get_desc() override {
return mmdit.get_desc();
}
void alloc_params_buffer() override {
mmdit.alloc_params_buffer();
}
void free_params_buffer() override {
mmdit.free_params_buffer();
}
void free_compute_buffer() override {
mmdit.free_compute_buffer();
}
void get_param_tensors(std::map<std::string, struct ggml_tensor*>& tensors) override {
mmdit.get_param_tensors(tensors, "model.diffusion_model");
}
size_t get_params_buffer_size() override {
return mmdit.get_params_buffer_size();
}
void set_weight_adapter(const std::shared_ptr<WeightAdapter>& adapter) override {
mmdit.set_weight_adapter(adapter);
}
int64_t get_adm_in_channels() override {
return 768 + 1280;
}
void set_flash_attn_enabled(bool enabled) override {
mmdit.set_flash_attention_enabled(enabled);
}
bool compute(int n_threads,
DiffusionParams diffusion_params,
struct ggml_tensor** output = nullptr,
struct ggml_context* output_ctx = nullptr) override {
return mmdit.compute(n_threads,
diffusion_params.x,
diffusion_params.timesteps,
diffusion_params.context,
diffusion_params.y,
output,
output_ctx,
diffusion_params.skip_layers);
}
};
struct FluxModel : public DiffusionModel {
Flux::FluxRunner flux;
FluxModel(ggml_backend_t backend,
bool offload_params_to_cpu,
const String2TensorStorage& tensor_storage_map = {},
SDVersion version = VERSION_FLUX,
bool use_mask = false)
: flux(backend, offload_params_to_cpu, tensor_storage_map, "model.diffusion_model", version, use_mask) {
}
std::string get_desc() override {
return flux.get_desc();
}
void alloc_params_buffer() override {
flux.alloc_params_buffer();
}
void free_params_buffer() override {
flux.free_params_buffer();
}
void free_compute_buffer() override {
flux.free_compute_buffer();
}
void get_param_tensors(std::map<std::string, struct ggml_tensor*>& tensors) override {
flux.get_param_tensors(tensors, "model.diffusion_model");
}
size_t get_params_buffer_size() override {
return flux.get_params_buffer_size();
}
void set_weight_adapter(const std::shared_ptr<WeightAdapter>& adapter) override {
flux.set_weight_adapter(adapter);
}
int64_t get_adm_in_channels() override {
return 768;
}
void set_flash_attn_enabled(bool enabled) override {
flux.set_flash_attention_enabled(enabled);
}
bool compute(int n_threads,
DiffusionParams diffusion_params,
struct ggml_tensor** output = nullptr,
struct ggml_context* output_ctx = nullptr) override {
return flux.compute(n_threads,
diffusion_params.x,
diffusion_params.timesteps,
diffusion_params.context,
diffusion_params.c_concat,
diffusion_params.y,
diffusion_params.guidance,
diffusion_params.ref_latents,
diffusion_params.increase_ref_index,
output,
output_ctx,
diffusion_params.skip_layers);
}
};
struct WanModel : public DiffusionModel {
std::string prefix;
WAN::WanRunner wan;
WanModel(ggml_backend_t backend,
bool offload_params_to_cpu,
const String2TensorStorage& tensor_storage_map = {},
const std::string prefix = "model.diffusion_model",
SDVersion version = VERSION_WAN2)
: prefix(prefix), wan(backend, offload_params_to_cpu, tensor_storage_map, prefix, version) {
}
std::string get_desc() override {
return wan.get_desc();
}
void alloc_params_buffer() override {
wan.alloc_params_buffer();
}
void free_params_buffer() override {
wan.free_params_buffer();
}
void free_compute_buffer() override {
wan.free_compute_buffer();
}
void get_param_tensors(std::map<std::string, struct ggml_tensor*>& tensors) override {
wan.get_param_tensors(tensors, prefix);
}
size_t get_params_buffer_size() override {
return wan.get_params_buffer_size();
}
void set_weight_adapter(const std::shared_ptr<WeightAdapter>& adapter) override {
wan.set_weight_adapter(adapter);
}
int64_t get_adm_in_channels() override {
return 768;
}
void set_flash_attn_enabled(bool enabled) override {
wan.set_flash_attention_enabled(enabled);
}
bool compute(int n_threads,
DiffusionParams diffusion_params,
struct ggml_tensor** output = nullptr,
struct ggml_context* output_ctx = nullptr) override {
return wan.compute(n_threads,
diffusion_params.x,
diffusion_params.timesteps,
diffusion_params.context,
diffusion_params.y,
diffusion_params.c_concat,
nullptr,
diffusion_params.vace_context,
diffusion_params.vace_strength,
output,
output_ctx);
}
};
struct QwenImageModel : public DiffusionModel {
std::string prefix;
Qwen::QwenImageRunner qwen_image;
QwenImageModel(ggml_backend_t backend,
bool offload_params_to_cpu,
const String2TensorStorage& tensor_storage_map = {},
const std::string prefix = "model.diffusion_model",
SDVersion version = VERSION_QWEN_IMAGE)
: prefix(prefix), qwen_image(backend, offload_params_to_cpu, tensor_storage_map, prefix, version) {
}
std::string get_desc() override {
return qwen_image.get_desc();
}
void alloc_params_buffer() override {
qwen_image.alloc_params_buffer();
}
void free_params_buffer() override {
qwen_image.free_params_buffer();
}
void free_compute_buffer() override {
qwen_image.free_compute_buffer();
}
void get_param_tensors(std::map<std::string, struct ggml_tensor*>& tensors) override {
qwen_image.get_param_tensors(tensors, prefix);
}
size_t get_params_buffer_size() override {
return qwen_image.get_params_buffer_size();
}
void set_weight_adapter(const std::shared_ptr<WeightAdapter>& adapter) override {
qwen_image.set_weight_adapter(adapter);
}
int64_t get_adm_in_channels() override {
return 768;
}
void set_flash_attn_enabled(bool enabled) override {
qwen_image.set_flash_attention_enabled(enabled);
}
bool compute(int n_threads,
DiffusionParams diffusion_params,
struct ggml_tensor** output = nullptr,
struct ggml_context* output_ctx = nullptr) override {
return qwen_image.compute(n_threads,
diffusion_params.x,
diffusion_params.timesteps,
diffusion_params.context,
diffusion_params.ref_latents,
true, // increase_ref_index
output,
output_ctx);
}
};
struct ZImageModel : public DiffusionModel {
std::string prefix;
ZImage::ZImageRunner z_image;
ZImageModel(ggml_backend_t backend,
bool offload_params_to_cpu,
const String2TensorStorage& tensor_storage_map = {},
const std::string prefix = "model.diffusion_model",
SDVersion version = VERSION_Z_IMAGE)
: prefix(prefix), z_image(backend, offload_params_to_cpu, tensor_storage_map, prefix, version) {
}
std::string get_desc() override {
return z_image.get_desc();
}
void alloc_params_buffer() override {
z_image.alloc_params_buffer();
}
void free_params_buffer() override {
z_image.free_params_buffer();
}
void free_compute_buffer() override {
z_image.free_compute_buffer();
}
void get_param_tensors(std::map<std::string, struct ggml_tensor*>& tensors) override {
z_image.get_param_tensors(tensors, prefix);
}
size_t get_params_buffer_size() override {
return z_image.get_params_buffer_size();
}
void set_weight_adapter(const std::shared_ptr<WeightAdapter>& adapter) override {
z_image.set_weight_adapter(adapter);
}
int64_t get_adm_in_channels() override {
return 768;
}
void set_flash_attn_enabled(bool enabled) override {
z_image.set_flash_attention_enabled(enabled);
}
bool compute(int n_threads,
DiffusionParams diffusion_params,
struct ggml_tensor** output = nullptr,
struct ggml_context* output_ctx = nullptr) override {
return z_image.compute(n_threads,
diffusion_params.x,
diffusion_params.timesteps,
diffusion_params.context,
diffusion_params.ref_latents,
true, // increase_ref_index
output,
output_ctx);
}
};
#endif // __DIFFUSION_MODEL_H__

docs/build.md (new file, 173 lines)
# Build from scratch
## Get the Code
```
git clone --recursive https://github.com/leejet/stable-diffusion.cpp
cd stable-diffusion.cpp
```
- If you have already cloned the repository, you can use the following command to update the repository to the latest code.
```
cd stable-diffusion.cpp
git pull origin master
git submodule init
git submodule update
```
## Build (CPU only)
If you don't have a GPU or CUDA installed, you can build a CPU-only version.
```shell
mkdir build && cd build
cmake ..
cmake --build . --config Release
```
## Build with OpenBLAS
```shell
mkdir build && cd build
cmake .. -DGGML_OPENBLAS=ON
cmake --build . --config Release
```
## Build with CUDA
This provides GPU acceleration on NVIDIA GPUs. Make sure the CUDA toolkit is installed. You can install it from your Linux distro's package manager (e.g. `apt install nvidia-cuda-toolkit`) or download it from here: [CUDA Toolkit](https://developer.nvidia.com/cuda-downloads). At least 4 GB of VRAM is recommended.
```shell
mkdir build && cd build
cmake .. -DSD_CUDA=ON
cmake --build . --config Release
```
## Build with HipBLAS
This provides GPU acceleration on AMD GPUs. Make sure the ROCm toolkit is installed.
To build for a GPU architecture other than the one installed in your system, set `$GFX_NAME` manually to the desired architecture, replacing the first command below (see the sketch after the build commands). This is also necessary if your GPU is not officially supported by ROCm; for example, you have to set `$GFX_NAME` to `gfx1030` for consumer RDNA2 cards.
Windows users: refer to [docs/hipBLAS_on_Windows.md](docs%2FhipBLAS_on_Windows.md) for a comprehensive guide.
```shell
mkdir build && cd build
if command -v rocminfo; then export GFX_NAME=$(rocminfo | awk '/ *Name: +gfx[1-9]/ {print $2; exit}'); else echo "rocminfo missing!"; fi
if [ -z "${GFX_NAME}" ]; then echo "Error: Couldn't detect GPU!"; else echo "Building for GPU: ${GFX_NAME}"; fi
cmake .. -G "Ninja" -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DSD_HIPBLAS=ON -DCMAKE_BUILD_TYPE=Release -DGPU_TARGETS=$GFX_NAME -DAMDGPU_TARGETS=$GFX_NAME -DCMAKE_BUILD_WITH_INSTALL_RPATH=ON -DCMAKE_POSITION_INDEPENDENT_CODE=ON
cmake --build . --config Release
```
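If the auto-detection above does not apply (for example, a consumer RDNA2 card that ROCm does not officially support), replace the first command with a manual export. A minimal sketch, assuming `gfx1030` is the right target for your card; the `cmake` commands that follow stay the same:
```shell
# skip rocminfo auto-detection and target RDNA2 explicitly
export GFX_NAME=gfx1030
```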
## Build with MUSA
This provides GPU acceleration on Moore Threads GPUs. Make sure the MUSA toolkit is installed.
```shell
mkdir build && cd build
cmake .. -DCMAKE_C_COMPILER=/usr/local/musa/bin/clang -DCMAKE_CXX_COMPILER=/usr/local/musa/bin/clang++ -DSD_MUSA=ON -DCMAKE_BUILD_TYPE=Release
cmake --build . --config Release
```
## Build with Metal
Using Metal makes the computation run on the GPU. Currently, there are some issues with Metal when performing operations on very large matrices, making it highly inefficient at the moment. Performance improvements are expected in the near future.
```shell
mkdir build && cd build
cmake .. -DSD_METAL=ON
cmake --build . --config Release
```
## Build with Vulkan
Install Vulkan SDK from https://www.lunarg.com/vulkan-sdk/.
```shell
mkdir build && cd build
cmake .. -DSD_VULKAN=ON
cmake --build . --config Release
```
## Build with OpenCL (for Adreno GPU)
Currently, it supports only Adreno GPUs and is primarily optimized for the Q4_0 type.
To build for Windows on ARM, please refer to [Windows 11 Arm64](https://github.com/ggml-org/llama.cpp/blob/master/docs/backend/OPENCL.md#windows-11-arm64).
Building for Android:
Android NDK:
Download and install the Android NDK from the [official Android developer site](https://developer.android.com/ndk/downloads).
Setup OpenCL Dependencies for NDK:
You need to provide OpenCL headers and the ICD loader library to your NDK sysroot.
* OpenCL Headers:
```bash
# In a temporary working directory
git clone https://github.com/KhronosGroup/OpenCL-Headers
cd OpenCL-Headers
# Replace <YOUR_NDK_PATH> with your actual NDK installation path
# e.g., cp -r CL /path/to/android-ndk-r26c/toolchains/llvm/prebuilt/linux-x86_64/sysroot/usr/include
sudo cp -r CL <YOUR_NDK_PATH>/toolchains/llvm/prebuilt/linux-x86_64/sysroot/usr/include
cd ..
```
* OpenCL ICD Loader:
```shell
# In the same temporary working directory
git clone https://github.com/KhronosGroup/OpenCL-ICD-Loader
cd OpenCL-ICD-Loader
mkdir build_ndk && cd build_ndk
# Replace <YOUR_NDK_PATH> in the CMAKE_TOOLCHAIN_FILE and OPENCL_ICD_LOADER_HEADERS_DIR
cmake .. -G Ninja -DCMAKE_BUILD_TYPE=Release \
-DCMAKE_TOOLCHAIN_FILE=<YOUR_NDK_PATH>/build/cmake/android.toolchain.cmake \
-DOPENCL_ICD_LOADER_HEADERS_DIR=<YOUR_NDK_PATH>/toolchains/llvm/prebuilt/linux-x86_64/sysroot/usr/include \
-DANDROID_ABI=arm64-v8a \
-DANDROID_PLATFORM=24 \
-DANDROID_STL=c++_shared
ninja
# Replace <YOUR_NDK_PATH>
# e.g., cp libOpenCL.so /path/to/android-ndk-r26c/toolchains/llvm/prebuilt/linux-x86_64/sysroot/usr/lib/aarch64-linux-android
sudo cp libOpenCL.so <YOUR_NDK_PATH>/toolchains/llvm/prebuilt/linux-x86_64/sysroot/usr/lib/aarch64-linux-android
cd ../..
```
Build `stable-diffusion.cpp` for Android with OpenCL:
```shell
mkdir build-android && cd build-android
# Replace <YOUR_NDK_PATH> with your actual NDK installation path
# e.g., -DCMAKE_TOOLCHAIN_FILE=/path/to/android-ndk-r26c/build/cmake/android.toolchain.cmake
cmake .. -G Ninja \
-DCMAKE_TOOLCHAIN_FILE=<YOUR_NDK_PATH>/build/cmake/android.toolchain.cmake \
-DANDROID_ABI=arm64-v8a \
-DANDROID_PLATFORM=android-28 \
-DGGML_OPENMP=OFF \
-DSD_OPENCL=ON
ninja
```
*(Note: Don't forget to include `LD_LIBRARY_PATH=/vendor/lib64` in your command line before running the binary)*
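For example, a hypothetical invocation on the device (the model and output paths are placeholders; the flags match the examples elsewhere in these docs):
```shell
# run on the Android device, pointing the dynamic loader at the vendor OpenCL library
LD_LIBRARY_PATH=/vendor/lib64 ./sd -m /data/local/tmp/sd-v1-4.ckpt -p "a lovely cat" -o /data/local/tmp/output.png
```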
## Build with SYCL
Using SYCL makes the computation run on Intel GPUs. Please make sure you have installed the related driver and the [Intel® oneAPI Base Toolkit](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit.html) before starting. For more details and steps, refer to the [llama.cpp SYCL backend](https://github.com/ggml-org/llama.cpp/blob/master/docs/backend/SYCL.md#linux) documentation.
```shell
# Export relevant ENV variables
source /opt/intel/oneapi/setvars.sh
mkdir build && cd build
# Option 1: Use FP32 (recommended for better performance in most cases)
cmake .. -DSD_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
# Option 2: Use FP16
cmake .. -DSD_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DGGML_SYCL_F16=ON
cmake --build . --config Release
```

docs/chroma.md (new file, 33 lines)
# How to Use
You can run Chroma using stable-diffusion.cpp with a GPU that has 6GB or even 4GB of VRAM, without needing to offload to RAM.
## Download weights
- Download Chroma
- If you don't want to do the conversion yourself, download the preconverted gguf model from [silveroxides/Chroma-GGUF](https://huggingface.co/silveroxides/Chroma-GGUF)
- Otherwise, download chroma's safetensors from [lodestones/Chroma](https://huggingface.co/lodestones/Chroma)
- Download vae from https://huggingface.co/black-forest-labs/FLUX.1-dev/blob/main/ae.safetensors
- Download t5xxl from https://huggingface.co/comfyanonymous/flux_text_encoders/blob/main/t5xxl_fp16.safetensors
## Convert Chroma weights
You can download the preconverted gguf weights from [silveroxides/Chroma-GGUF](https://huggingface.co/silveroxides/Chroma-GGUF), this way you don't have to do the conversion yourself.
```
.\bin\Release\sd.exe -M convert -m ..\..\ComfyUI\models\unet\chroma-unlocked-v40.safetensors -o ..\models\chroma-unlocked-v40-q8_0.gguf -v --type q8_0
```
## Run
### Example
For example:
```
.\bin\Release\sd.exe --diffusion-model ..\models\chroma-unlocked-v40-q8_0.gguf --vae ..\models\ae.sft --t5xxl ..\models\t5xxl_fp16.safetensors -p "a lovely cat holding a sign says 'chroma.cpp'" --cfg-scale 4.0 --sampling-method euler -v --chroma-disable-dit-mask --clip-on-cpu
```
![](../assets/flux/chroma_v40.png)

docs/chroma_radiance.md (new file, 21 lines)
# How to Use
## Download weights
- Download Chroma1-Radiance
- safetensors: https://huggingface.co/lodestones/Chroma1-Radiance/tree/main
- gguf: https://huggingface.co/silveroxides/Chroma1-Radiance-GGUF/tree/main
- Download t5xxl
- safetensors: https://huggingface.co/comfyanonymous/flux_text_encoders/blob/main/t5xxl_fp16.safetensors
## Examples
```
.\bin\Release\sd.exe --diffusion-model ..\..\ComfyUI\models\diffusion_models\Chroma1-Radiance-v0.4-Q8_0.gguf --t5xxl ..\..\ComfyUI\models\clip\t5xxl_fp16.safetensors -p "a lovely cat holding a sign says 'chroma radiance cpp'" --cfg-scale 4.0 --sampling-method euler -v
```
<img alt="Chroma1-Radiance" src="../assets/flux/chroma1-radiance.png" />

docs/distilled_sd.md (new file, 99 lines)
# Running distilled models: SSD1B and SDx.x with tiny U-Nets
## Preface
These models feature a reduced U-Net architecture. Unlike standard SDXL models, the SSD-1B U-Net contains only one middle block and fewer attention layers in its up- and down-blocks, resulting in significantly smaller file sizes. Using these models can reduce inference time by more than 33%. For more details, refer to Segmind's paper: https://arxiv.org/abs/2401.02677v1.
Similarly, SD1.x- and SD2.x-style models with a tiny U-Net consist of only 6 U-Net blocks, leading to very small files and time savings of up to 50%. For more information, see the paper: https://arxiv.org/pdf/2305.15798.pdf.
## SSD1B
Note that not all of these models follow the standard parameter naming conventions. However, several useful SSD-1B models are available online, such as:
* https://huggingface.co/segmind/SSD-1B/resolve/main/SSD-1B-A1111.safetensors
* https://huggingface.co/hassenhamdi/SSD-1B-fp8_e4m3fn/resolve/main/SSD-1B_fp8_e4m3fn.safetensors
Useful LoRAs are also available:
* https://huggingface.co/seungminh/lora-swarovski-SSD-1B/resolve/main/pytorch_lora_weights.safetensors
* https://huggingface.co/kylielee505/mylcmlorassd/resolve/main/pytorch_lora_weights.safetensors
These files can be used out-of-the-box, unlike the models described in the next section.
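As a minimal sketch (assuming the files above were downloaded into `../models`, and that the single-file checkpoint bundles the VAE and text encoders like the other examples in these docs), an SSD-1B checkpoint and its LoRA can be used like any other model:
```bash
./bin/sd -m ../models/SSD-1B-A1111.safetensors -p "a lovely cat<lora:pytorch_lora_weights:1>" --lora-model-dir ../models
```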
## SD1.x, SD2.x with tiny U-Nets
These models require conversion before use. You will need a Python script provided by the diffusers team, available on GitHub:
* https://raw.githubusercontent.com/huggingface/diffusers/refs/heads/main/scripts/convert_diffusers_to_original_stable_diffusion.py
### SD2.x
NotaAI provides the following model online:
* https://huggingface.co/nota-ai/bk-sdm-v2-tiny
Creating a .safetensors file involves two steps. First, run this short Python script to download the model from Hugging Face:
```python
from diffusers import StableDiffusionPipeline
pipe = StableDiffusionPipeline.from_pretrained("nota-ai/bk-sdm-v2-tiny",cache_dir="./")
```
Second, create the .safetensors file by running:
```bash
python convert_diffusers_to_original_stable_diffusion.py \
--model_path models--nota-ai--bk-sdm-v2-tiny/snapshots/68277af553777858cd47e133f92e4db47321bc74 \
--checkpoint_path bk-sdm-v2-tiny.safetensors --half --use_safetensors
```
This will generate the file **bk-sdm-v2-tiny.safetensors**, which is now ready for use with sd.cpp.
### SD1.x
Several Tiny SD 1.x models are available online, such as:
* https://huggingface.co/segmind/tiny-sd
* https://huggingface.co/segmind/portrait-finetuned
* https://huggingface.co/nota-ai/bk-sdm-tiny
These models also require conversion, partly because some tensors are stored in a non-contiguous manner. To create a usable checkpoint file, follow these simple steps:
##### Download and prepare the model using Python on your computer, for example this way:
```python
import torch
from diffusers import StableDiffusionPipeline
pipe = StableDiffusionPipeline.from_pretrained("segmind/tiny-sd")
unet=pipe.unet
for param in unet.parameters():
param.data = param.data.contiguous() # <- important here
pipe.save_pretrained("segmindtiny-sd", safe_serialization=True)
```
##### Run the conversion script:
```bash
python convert_diffusers_to_original_stable_diffusion.py \
--model_path ./segmindtiny-sd \
--checkpoint_path ./segmind_tiny-sd.ckpt --half
```
The file segmind_tiny-sd.ckpt will be generated and is now ready for use with sd.cpp. You can follow a similar process for the other models mentioned above.
### Another available .ckpt file:
* https://huggingface.co/ClashSAN/small-sd/resolve/main/tinySDdistilled.ckpt
To use this file, you must first adjust its non-contiguous tensors:
```python
import torch
ckpt = torch.load("tinySDdistilled.ckpt", map_location=torch.device('cpu'))
for key, value in ckpt['state_dict'].items():
if isinstance(value, torch.Tensor):
ckpt['state_dict'][key] = value.contiguous()
torch.save(ckpt, "tinySDdistilled_fixed.ckpt")
```

docs/docker.md (new file, 15 lines)
## Docker
### Building using Docker
```shell
docker build -t sd .
```
### Run
```shell
docker run -v /path/to/models:/models -v /path/to/output/:/output sd [args...]
# For example
# docker run -v ./models:/models -v ./build:/output sd -m /models/sd-v1-4.ckpt -p "a lovely cat" -v -o /output/output.png
```

docs/esrgan.md (new file, 9 lines)
## Using ESRGAN to upscale results
You can use ESRGAN to upscale the generated images. At the moment, only the [RealESRGAN_x4plus_anime_6B.pth](https://github.com/xinntao/Real-ESRGAN/releases/download/v0.2.2.4/RealESRGAN_x4plus_anime_6B.pth) model is supported. Support for more models of this architecture will be added soon.
- Specify the model path using the `--upscale-model PATH` parameter. example:
```bash
sd -m ../models/v1-5-pruned-emaonly.safetensors -p "a lovely cat" --upscale-model ../models/RealESRGAN_x4plus_anime_6B.pth
```

docs/flux.md (new file, 66 lines)
# How to Use
You can run Flux using stable-diffusion.cpp with a GPU that has 6GB or even 4GB of VRAM, without needing to offload to RAM.
## Download weights
- Download flux
- If you don't want to do the conversion yourself, download the preconverted gguf model from [FLUX.1-dev-gguf](https://huggingface.co/leejet/FLUX.1-dev-gguf) or [FLUX.1-schnell](https://huggingface.co/leejet/FLUX.1-schnell-gguf)
- Otherwise, download flux-dev from https://huggingface.co/black-forest-labs/FLUX.1-dev/blob/main/flux1-dev.safetensors or flux-schnell from https://huggingface.co/black-forest-labs/FLUX.1-schnell/blob/main/flux1-schnell.safetensors
- Download vae from https://huggingface.co/black-forest-labs/FLUX.1-dev/blob/main/ae.safetensors
- Download clip_l from https://huggingface.co/comfyanonymous/flux_text_encoders/blob/main/clip_l.safetensors
- Download t5xxl from https://huggingface.co/comfyanonymous/flux_text_encoders/blob/main/t5xxl_fp16.safetensors
## Convert flux weights
You can download the preconverted gguf weights from [FLUX.1-dev-gguf](https://huggingface.co/leejet/FLUX.1-dev-gguf) or [FLUX.1-schnell](https://huggingface.co/leejet/FLUX.1-schnell-gguf), this way you don't have to do the conversion yourself.
For example:
```
.\bin\Release\sd.exe -M convert -m ..\..\ComfyUI\models\unet\flux1-dev.sft -o ..\models\flux1-dev-q8_0.gguf -v --type q8_0
```
## Run
- `--cfg-scale` is recommended to be set to 1.
### Flux-dev
For example:
```
.\bin\Release\sd.exe --diffusion-model ..\models\flux1-dev-q8_0.gguf --vae ..\models\ae.sft --clip_l ..\models\clip_l.safetensors --t5xxl ..\models\t5xxl_fp16.safetensors -p "a lovely cat holding a sign says 'flux.cpp'" --cfg-scale 1.0 --sampling-method euler -v --clip-on-cpu
```
Using formats of different precisions will yield results of varying quality.
| Type | q8_0 | q4_0 | q4_k | q3_k | q2_k |
|---- | ---- |---- |---- |---- |---- |
| **Memory** | 12068.09 MB | 6394.53 MB | 6395.17 MB | 4888.16 MB | 3735.73 MB |
| **Result** | ![](../assets/flux/flux1-dev-q8_0.png) |![](../assets/flux/flux1-dev-q4_0.png) |![](../assets/flux/flux1-dev-q4_k.png) |![](../assets/flux/flux1-dev-q3_k.png) |![](../assets/flux/flux1-dev-q2_k.png)|
### Flux-schnell
```
.\bin\Release\sd.exe --diffusion-model ..\models\flux1-schnell-q8_0.gguf --vae ..\models\ae.sft --clip_l ..\models\clip_l.safetensors --t5xxl ..\models\t5xxl_fp16.safetensors -p "a lovely cat holding a sign says 'flux.cpp'" --cfg-scale 1.0 --sampling-method euler -v --steps 4 --clip-on-cpu
```
| q8_0 |
| ---- |
|![](../assets/flux/flux1-schnell-q8_0.png) |
## Run with LoRA
Since many flux LoRA training libraries have used various LoRA naming formats, it is possible that not all flux LoRA naming formats are supported. It is recommended to use LoRA with naming formats compatible with ComfyUI.
### Flux-dev q8_0 with LoRA
- LoRA model from https://huggingface.co/XLabs-AI/flux-lora-collection/tree/main (using comfy converted version!!!)
```
.\bin\Release\sd.exe --diffusion-model ..\models\flux1-dev-q8_0.gguf --vae ..\models\ae.sft --clip_l ..\models\clip_l.safetensors --t5xxl ..\models\t5xxl_fp16.safetensors -p "a lovely cat holding a sign says 'flux.cpp'<lora:realism_lora_comfy_converted:1>" --cfg-scale 1.0 --sampling-method euler -v --lora-model-dir ../models --clip-on-cpu
```
![output](../assets/flux/flux1-dev-q8_0%20with%20lora.png)

docs/flux2.md (new file, 21 lines)
# How to Use
## Download weights
- Download FLUX.2-dev
- gguf: https://huggingface.co/city96/FLUX.2-dev-gguf/tree/main
- Download vae
- safetensors: https://huggingface.co/black-forest-labs/FLUX.2-dev/tree/main
- Download Mistral-Small-3.2-24B-Instruct-2506-GGUF
- gguf: https://huggingface.co/unsloth/Mistral-Small-3.2-24B-Instruct-2506-GGUF/tree/main
## Examples
```
.\bin\Release\sd.exe --diffusion-model ..\..\ComfyUI\models\diffusion_models\flux2-dev-Q4_K_S.gguf --vae ..\..\ComfyUI\models\vae\flux2_ae.safetensors --llm ..\..\ComfyUI\models\text_encoders\Mistral-Small-3.2-24B-Instruct-2506-Q4_K_M.gguf -r .\kontext_input.png -p "change 'flux.cpp' to 'flux2-dev.cpp'" --cfg-scale 1.0 --sampling-method euler -v --diffusion-fa --offload-to-cpu
```
<img alt="flux2 example" src="../assets/flux2/example.png" />

docs/hipBLAS_on_Windows.md (new file, 85 lines)
# Using hipBLAS on Windows
To get hipBLAS in `stable-diffusion.cpp` working on Windows, go through this guide section by section.
## Build Tools for Visual Studio 2022
Skip this step if you already have Build Tools installed.
To install Build Tools, go to [Visual Studio Downloads](https://visualstudio.microsoft.com/vs/), download `Visual Studio 2022 and other Products` and run the installer.
## CMake
Skip this step if you already have CMake installed: running `cmake --version` should output `cmake version x.y.z`.
Download latest `Windows x64 Installer` from [Download | CMake](https://cmake.org/download/) and run it.
## ROCm
Skip this step if you already have ROCm installed.
The [validation tools](https://rocm.docs.amd.com/en/latest/reference/validation_tools.html) are not supported on Windows, so you have to confirm the ROCm version yourself.
Fortunately, AMD provides complete help documentation; you can use it to install [ROCm](https://rocm.docs.amd.com/en/latest/deploy/windows/quick_start.html).
>**If you encounter [AMD ROCm Windows Installation Error 215](https://github.com/RadeonOpenCompute/ROCm/issues/2363), don't worry: ROCm itself has been installed correctly, only the Visual Studio plugin installation failed, and that can be ignored.**
Then the ROCm compilers must be set as environment variables before running CMake.
If you installed following the official tutorial and did not change the ROCm path, they are usually located in `C:\Program Files\AMD\ROCm\5.5\bin`.
This is what I use to set clang:
```Commandline
set CC=C:\Program Files\AMD\ROCm\5.5\bin\clang.exe
set CXX=C:\Program Files\AMD\ROCm\5.5\bin\clang++.exe
```
## Ninja
Skip this step if you already have Ninja installed: running `ninja --version` should output `1.11.1`.
Download latest `ninja-win.zip` from [GitHub Releases Page](https://github.com/ninja-build/ninja/releases/tag/v1.11.1) and unzip. Then set as environment variables. I unzipped it in `C:\Program Files\ninja`, so I set it like this:
```Commandline
set ninja=C:\Program Files\ninja\ninja.exe
```
## Building stable-diffusion.cpp
The differences from the regular CPU build are the flags `-DSD_HIPBLAS=ON`, `-G "Ninja"`, `-DCMAKE_C_COMPILER=clang`, `-DCMAKE_CXX_COMPILER=clang++`, and `-DAMDGPU_TARGETS=gfx1100`.
>**Notice**: check the `clang` and `clang++` information:
```Commandline
clang --version
clang++ --version
```
If you see output like this, you can continue:
```
clang version 17.0.0 (git@github.amd.com:Compute-Mirrors/llvm-project e3201662d21c48894f2156d302276eb1cf47c7be)
Target: x86_64-pc-windows-msvc
Thread model: posix
InstalledDir: C:\Program Files\AMD\ROCm\5.5\bin
```
```
clang version 17.0.0 (git@github.amd.com:Compute-Mirrors/llvm-project e3201662d21c48894f2156d302276eb1cf47c7be)
Target: x86_64-pc-windows-msvc
Thread model: posix
InstalledDir: C:\Program Files\AMD\ROCm\5.5\bin
```
>**Notice** that `gfx1100` is the architecture of my GPU; change it to match your GPU architecture. See [LLVM Target](https://rocm.docs.amd.com/en/latest/release/windows_support.html#windows-supported-gpus) to look up yours.
My GPU is an AMD Radeon™ RX 7900 XTX, so I set it to `gfx1100`.
Then configure and build:
```commandline
mkdir build
cd build
cmake .. -G "Ninja" -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DSD_HIPBLAS=ON -DCMAKE_BUILD_TYPE=Release -DAMDGPU_TARGETS=gfx1100
cmake --build . --config Release
```
If everything went OK, `build\bin\sd.exe` file should appear.

docs/kontext.md (new file, 39 lines)
# How to Use
You can run Kontext using stable-diffusion.cpp with a GPU that has 6GB or even 4GB of VRAM, without needing to offload to RAM.
## Download weights
- Download Kontext
- If you don't want to do the conversion yourself, download the preconverted gguf model from [FLUX.1-Kontext-dev-GGUF](https://huggingface.co/QuantStack/FLUX.1-Kontext-dev-GGUF)
- Otherwise, download FLUX.1-Kontext-dev from https://huggingface.co/black-forest-labs/FLUX.1-Kontext-dev/blob/main/flux1-kontext-dev.safetensors
- Download vae from https://huggingface.co/black-forest-labs/FLUX.1-dev/blob/main/ae.safetensors
- Download clip_l from https://huggingface.co/comfyanonymous/flux_text_encoders/blob/main/clip_l.safetensors
- Download t5xxl from https://huggingface.co/comfyanonymous/flux_text_encoders/blob/main/t5xxl_fp16.safetensors
## Convert Kontext weights
You can download the preconverted gguf weights from [FLUX.1-Kontext-dev-GGUF](https://huggingface.co/QuantStack/FLUX.1-Kontext-dev-GGUF); this way you don't have to do the conversion yourself.
```
.\bin\Release\sd.exe -M convert -m ..\..\ComfyUI\models\unet\flux1-kontext-dev.safetensors -o ..\models\flux1-kontext-dev-q8_0.gguf -v --type q8_0
```
## Run
- It is recommended to set `--cfg-scale` to `1.0`.
### Example
```
.\bin\Release\sd.exe -r .\flux1-dev-q8_0.png --diffusion-model ..\models\flux1-kontext-dev-q8_0.gguf --vae ..\models\ae.sft --clip_l ..\models\clip_l.safetensors --t5xxl ..\models\t5xxl_fp16.safetensors -p "change 'flux.cpp' to 'kontext.cpp'" --cfg-scale 1.0 --sampling-method euler -v --clip-on-cpu
```
| ref_image | prompt | output |
| ---- | ---- |---- |
| ![](../assets/flux/flux1-dev-q8_0.png) | change 'flux.cpp' to 'kontext.cpp' |![](../assets/flux/kontext1_dev_output.png) |

docs/lcm.md
## LCM/LCM-LoRA
- Download LCM-LoRA from https://huggingface.co/latent-consistency/lcm-lora-sdv1-5
- Specify LCM-LoRA by adding `<lora:lcm-lora-sdv1-5:1>` to prompt
- It's advisable to set `--cfg-scale` to `1.0` instead of the default `7.0`. For `--steps`, a range of `2-8` steps is recommended. For `--sampling-method`, `lcm`/`euler_a` is recommended.
Here's a simple example:
```
./bin/sd -m ../models/v1-5-pruned-emaonly.safetensors -p "a lovely cat<lora:lcm-lora-sdv1-5:1>" --steps 4 --lora-model-dir ../models -v --cfg-scale 1
```
| without LCM-LoRA (--cfg-scale 7) | with LCM-LoRA (--cfg-scale 1) |
| ---- |---- |
| ![](../assets/without_lcm.png) |![](../assets/with_lcm.png) |
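Since `lcm` is one of the recommended `--sampling-method` values above, here is a variant of the same command with the sampler made explicit (a sketch; only `--sampling-method lcm` is added compared to the example above):
```
./bin/sd -m ../models/v1-5-pruned-emaonly.safetensors -p "a lovely cat<lora:lcm-lora-sdv1-5:1>" --steps 4 --lora-model-dir ../models -v --cfg-scale 1 --sampling-method lcm
```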

docs/lora.md
## LoRA
- You can specify the directory where the lora weights are stored via `--lora-model-dir`. If not specified, the default is the current working directory.
- LoRA is specified via prompt, just like [stable-diffusion-webui](https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Features#lora).
Here's a simple example:
```
./bin/sd -m ../models/v1-5-pruned-emaonly.safetensors -p "a lovely cat<lora:marblesh:1>" --lora-model-dir ../models
```
`../models/marblesh.safetensors` or `../models/marblesh.ckpt` will be applied to the model.
## LoRA Apply Mode
There are two ways to apply LoRA: **immediately** and **at_runtime**. You can specify it using the `--lora-apply-mode` parameter.
By default, the mode is selected automatically:
* If the model weights contain any quantized parameters, the **at_runtime** mode is used;
* Otherwise, the **immediately** mode is used.
The **immediately** mode may have precision and compatibility issues with quantized parameters, but it usually offers faster inference speed and, in some cases, lower memory usage.
In contrast, the **at_runtime** mode provides better compatibility and higher precision, but inference may be slower and memory usage may be higher in some cases.
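For example, to force the runtime mode explicitly, reusing the command from the LoRA example above (a sketch; `at_runtime` is one of the two mode names described in this section, and the model paths are unchanged):
```
./bin/sd -m ../models/v1-5-pruned-emaonly.safetensors -p "a lovely cat<lora:marblesh:1>" --lora-model-dir ../models --lora-apply-mode at_runtime
```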

docs/ovis_image.md
# How to Use
## Download weights
- Download Ovis-Image-7B
- safetensors: https://huggingface.co/Comfy-Org/Ovis-Image/tree/main/split_files/diffusion_models
- gguf: https://huggingface.co/leejet/Ovis-Image-7B-GGUF
- Download vae
- safetensors: https://huggingface.co/black-forest-labs/FLUX.1-schnell/tree/main
- Download Ovis 2.5
- safetensors: https://huggingface.co/Comfy-Org/Ovis-Image/tree/main/split_files/text_encoders
## Examples
```
.\bin\Release\sd.exe --diffusion-model ovis_image-Q4_0.gguf --vae ..\..\ComfyUI\models\vae\ae.sft --llm ..\..\ComfyUI\models\text_encoders\ovis_2.5.safetensors -p "a lovely cat" --cfg-scale 5.0 -v --offload-to-cpu --diffusion-fa
```
<img alt="ovis image example" src="../assets/ovis_image/example.png" />

docs/performance.md
## Use Flash Attention to save memory and improve speed.
Enabling flash attention for the diffusion model reduces memory usage; the amount saved varies by model and resolution, e.g.:
- flux 768x768: ~600 MB
- SD2 768x768: ~1400 MB
On most backends it slows things down, but on CUDA it generally speeds things up as well.
At the moment, it is only supported for some models and some backends (such as CPU, CUDA/ROCm, and Metal).
Enable it by adding `--diffusion-fa` to the arguments and watch for:
```
[INFO ] stable-diffusion.cpp:312 - Using flash attention in the diffusion model
```
and watch the compute buffer shrink in the debug log:
```
[DEBUG] ggml_extend.hpp:1004 - flux compute buffer size: 650.00 MB(VRAM)
```
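For example (a minimal sketch; the model path is a placeholder, and `--diffusion-fa` is the only flag relevant to flash attention):
```
./bin/sd -m ../models/v1-5-pruned-emaonly.safetensors -p "a lovely cat" --diffusion-fa -v
```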
## Offload weights to the CPU to save VRAM without reducing generation speed.
Using `--offload-to-cpu` allows you to offload weights to the CPU, saving VRAM without reducing generation speed.
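For example (a minimal sketch with a placeholder model path):
```
./bin/sd -m ../models/v1-5-pruned-emaonly.safetensors -p "a lovely cat" --offload-to-cpu -v
```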
## Use quantization to reduce memory usage.
[quantization](./quantization_and_gguf.md)

docs/photo_maker.md
## Using PhotoMaker to personalize image generation
You can use [PhotoMaker](https://github.com/TencentARC/PhotoMaker) to personalize generated images with your own ID.
**NOTE**, currently PhotoMaker **ONLY** works with **SDXL** (any SDXL model files will work).
Download the PhotoMaker model file (in safetensors format) from [here](https://huggingface.co/bssrdf/PhotoMaker). The official release of the model file (in .bin format) does not work with ```stable-diffusion.cpp```.
- Specify the PhotoMaker model path using the `--photo-maker PATH` parameter.
- Specify the input images path using the `--pm-id-images-dir PATH` parameter.
In the prompt, make sure you have a class word followed by the trigger word ```"img"``` (hard-coded for now). The class word can be one of ```"man, woman, girl, boy"```. If the input ID images contain Asian faces, add ```Asian``` before the class word.
Another PhotoMaker-specific parameter:
- ```--pm-style-strength (0-100)%```: the default is 20, and 10-20 typically gives good results. A lower ratio means the output follows the input ID more faithfully (not necessarily better quality).
Other parameters recommended for running Photomaker:
- ```--cfg-scale 5.0```
- ```-H 1024```
- ```-W 1024```
On low-memory GPUs (<= 8GB), it is recommended to run with the ```--vae-on-cpu``` option to get artifact-free images.
Example:
```bash
bin/sd -m ../models/sdxlUnstableDiffusers_v11.safetensors --vae ../models/sdxl_vae.safetensors --photo-maker ../models/photomaker-v1.safetensors --pm-id-images-dir ../assets/photomaker_examples/scarletthead_woman -p "a girl img, retro futurism, retro game art style but extremely beautiful, intricate details, masterpiece, best quality, space-themed, cosmic, celestial, stars, galaxies, nebulas, planets, science fiction, highly detailed" -n "realistic, photo-realistic, worst quality, greyscale, bad anatomy, bad hands, error, text" --cfg-scale 5.0 --sampling-method euler -H 1024 -W 1024 --pm-style-strength 10 --vae-on-cpu --steps 50
```
## PhotoMaker Version 2
[PhotoMaker Version 2 (PMV2)](https://github.com/TencentARC/PhotoMaker/blob/main/README_pmv2.md) has some key improvements. Unfortunately it has a very heavy dependency which makes running it a bit involved in ```SD.cpp```.
Running PMV2 is now a two-step process:
- Run a python script ```face_detect.py``` to obtain **id_embeds** for the given input images
```
python face_detect.py input_image_dir
```
An ```id_embeds.bin``` file will be generated in ```input_image_dir```.
**Note: this step is only needed to run once; the same ```id_embeds``` can be reused**
- Run the same command as in version 1 but replacing ```photomaker-v1.safetensors``` with ```photomaker-v2.safetensors```.
You can download ```photomaker-v2.safetensors``` from [here](https://huggingface.co/bssrdf/PhotoMakerV2)
- All the command line parameters from Version 1 remain the same for Version 2, plus one extra pointing to a valid ```id_embeds``` file: ```--pm-id-embed-path [path_to_id_embeds.bin]``` (see the sketch below).
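A sketch of what the full PMV2 command might look like, based on the Version 1 example above (the ```id_embeds.bin``` path is hypothetical; point it at the file produced by ```face_detect.py``` for your own input directory):
```bash
bin/sd -m ../models/sdxlUnstableDiffusers_v11.safetensors --vae ../models/sdxl_vae.safetensors --photo-maker ../models/photomaker-v2.safetensors --pm-id-images-dir ../assets/photomaker_examples/scarletthead_woman --pm-id-embed-path ../assets/photomaker_examples/scarletthead_woman/id_embeds.bin -p "a girl img, retro futurism, intricate details, masterpiece, best quality" -n "realistic, photo-realistic, worst quality, bad anatomy" --cfg-scale 5.0 --sampling-method euler -H 1024 -W 1024 --pm-style-strength 10 --vae-on-cpu --steps 50
```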

docs/quantization_and_gguf.md
## Quantization
You can specify the model weight type using the `--type` parameter. The weights are automatically converted when loading the model.
- `f16` for 16-bit floating-point
- `f32` for 32-bit floating-point
- `q8_0` for 8-bit integer quantization
- `q5_0` or `q5_1` for 5-bit integer quantization
- `q4_0` or `q4_1` for 4-bit integer quantization
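For example, a text-to-image run that quantizes the weights to `q8_0` while loading a safetensors checkpoint might look like this (a sketch; the model path matches the convert example further below):
```sh
./bin/sd -m ../models/v1-5-pruned-emaonly.safetensors -p "a lovely cat" --type q8_0 -v
```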
### Memory Requirements of Stable Diffusion 1.x
| precision | f32 | f16 |q8_0 |q5_0 |q5_1 |q4_0 |q4_1 |
| ---- | ---- |---- |---- |---- |---- |---- |---- |
| **Memory** (txt2img - 512 x 512) | ~2.8G | ~2.3G | ~2.1G | ~2.0G | ~2.0G | ~2.0G | ~2.0G |
| **Memory** (txt2img - 512 x 512) *with Flash Attention* | ~2.4G | ~1.9G | ~1.6G | ~1.5G | ~1.5G | ~1.5G | ~1.5G |
## Convert to GGUF
You can also convert weights in the formats `ckpt/safetensors/diffusers` to gguf and perform quantization in advance, avoiding the need for quantization every time you load them.
For example:
```sh
./bin/sd -M convert -m ../models/v1-5-pruned-emaonly.safetensors -o ../models/v1-5-pruned-emaonly.q8_0.gguf -v --type q8_0
```

docs/qwen_image.md
# How to Use
## Download weights
- Download Qwen Image
- safetensors: https://huggingface.co/Comfy-Org/Qwen-Image_ComfyUI/tree/main/split_files/diffusion_models
- gguf: https://huggingface.co/QuantStack/Qwen-Image-GGUF/tree/main
- Download vae
- safetensors: https://huggingface.co/Comfy-Org/Qwen-Image_ComfyUI/tree/main/split_files/vae
- Download qwen_2.5_vl 7b
- safetensors: https://huggingface.co/Comfy-Org/Qwen-Image_ComfyUI/tree/main/split_files/text_encoders
- gguf: https://huggingface.co/mradermacher/Qwen2.5-VL-7B-Instruct-GGUF/tree/main
## Examples
```
.\bin\Release\sd.exe --diffusion-model ..\..\ComfyUI\models\diffusion_models\qwen-image-Q8_0.gguf --vae ..\..\ComfyUI\models\vae\qwen_image_vae.safetensors --llm ..\..\ComfyUI\models\text_encoders\Qwen2.5-VL-7B-Instruct-Q8_0.gguf -p '一个穿着"QWEN"标志的T恤的中国美女正拿着黑色的马克笔面相镜头微笑。她身后的玻璃板上手写体写着 “一、Qwen-Image的技术路线 探索视觉生成基础模型的极限开创理解与生成一体化的未来。二、Qwen-Image的模型特色1、复杂文字渲染。支持中英渲染、自动布局 2、精准图像编辑。支持文字编辑、物体增减、风格变换。三、Qwen-Image的未来愿景赋能专业内容创作、助力生成式AI发展。”' --cfg-scale 2.5 --sampling-method euler -v --offload-to-cpu -H 1024 -W 1024 --diffusion-fa --flow-shift 3
```
<img alt="qwen example" src="../assets/qwen/example.png" />

Some files were not shown because too many files have changed in this diff.