61 Commits

Author SHA1 Message Date
leejet
6103d86e2c
refactor: introduce GGMLRunnerContext (#928)
* introduce GGMLRunnerContext

* add Flash Attention enable control through GGMLRunnerContext

* add conv2d_direct enable control through GGMLRunnerContext
2025-11-02 02:11:04 +08:00
leejet
dd75fc081c
refactor: unify the naming style of ggml extension functions (#921) 2025-10-28 23:26:48 +08:00
leejet
9e28be6479
feat: add chroma radiance support (#910)
* add chroma radiance support

* fix ci

* simplify generate_init_latent

* workaround: avoid ggml cuda error

* format code

* add chroma radiance doc
2025-10-25 23:56:14 +08:00
leejet
d05e46ca5e
chore: add .clang-tidy configuration and apply modernize checks (#902) 2025-10-18 23:23:40 +08:00
leejet
40a6a8710e
fix: resolve precision issues in SDXL VAE under fp16 (#888)
* fix: resolve precision issues in SDXL VAE under fp16

* add --force-sdxl-vae-conv-scale option

* update docs
2025-10-15 23:01:00 +08:00
leejet
2e9242e37f
feat: add Qwen Image Edit support (#877)
* add ref latent support for qwen image

* optimize clip_preprocess and fix get_first_stage_encoding

* add qwen2vl vit support

* add qwen image edit support

* fix qwen image edit pipeline

* add mmproj file support

* support dynamic number of Qwen image transformer blocks

* set prompt_template_encode_start_idx every time

* to_add_out precision fix

* to_out.0 precision fix

* update docs
2025-10-13 23:17:18 +08:00
Wagner Bruna
5436f6b814
fix: correct canny preprocessor (#861) 2025-10-13 22:02:35 +08:00
Wagner Bruna
9727c6bb98
fix: resolve VAE tiling problem in Qwen Image (#873) 2025-10-12 23:45:53 +08:00
leejet
beb99a2de2
feat: add Qwen Image support (#851)
* add qwen tokenizer

* add qwen2.5 vl support

* mv qwen.hpp -> qwenvl.hpp

* add qwen image model

* add qwen image t2i pipeline

* fix qwen image flash attn

* add qwen image i2i pipeline

* change encoding of vocab_qwen.hpp to utf8

* fix get_first_stage_encoding

* apply jeffbolz f32 patch

https://github.com/leejet/stable-diffusion.cpp/pull/851#issuecomment-3335515302

* fix the issue that occurs when using CUDA with k-quants weights

* optimize the handling of the FeedForward precision fix

* to_add_out precision fix

* update docs
2025-10-12 23:23:19 +08:00
stduhpf
11f436c483
feat: add support for Flux Controls and Flex.2 (#692) 2025-10-11 00:06:57 +08:00
leejet
35843c77ea
fix: optimize the handling of embedding weight (#859) 2025-09-25 23:09:59 +08:00
leejet
0ebe6fe118
refactor: simplify the logic of pm id image loading (#827) 2025-09-14 22:50:21 +08:00
leejet
52a97b3ac1
feat: add vace support (#819)
* add wan vace t2v support

* add --vace-strength option

* add vace i2v support

* fix the processing of vace_context

* add vace v2v support

* update docs
2025-09-14 16:57:33 +08:00
stduhpf
2c9b1e2594
feat: add VAE encoding tiling support and adaptive overlap (#484)
* implement tiled VAE encode support

* Tiling (vae/upscale): adaptive overlap

* Tiling: fix edge case

* Tiling: fix crash when less than 2 tiles per dim

* remove extra dot

* Tiling: fix edge cases for adaptive overlap

* tiling: fix edge case

* set vae tile size via env var

* vae tiling: refactor again, based on a smaller buffer for alignment

* Use bigger tiles for encode (to match compute buffer size)

* Fix edge case when tile is bigger than latent

* non-square VAE tiling (#3)

* refactor tile number calculation

* support non-square tiles

* add env var to change tile overlap

* add safeguards and better error messages for SD_TILE_OVERLAP

* add safeguards and include overlapping factor for SD_TILE_SIZE

* avoid rounding issues when specifying SD_TILE_SIZE as a factor

* lower SD_TILE_OVERLAP limit

* zero-init empty output buffer

* Fix decode latent size

* fix encode

* tile size params instead of env

* Tiled vae parameter validation (#6)

* avoid crash with invalid tile sizes, use 0 for default

* refactor default tile size, limit overlap factor

* remove explicit parameter for relative tile size

* limit encoding tile to latent size

* unify code style and format code

* update docs

* fix get_tile_sizes in decode_first_stage

---------

Co-authored-by: Wagner Bruna <wbruna@users.noreply.github.com>
Co-authored-by: leejet <leejet714@gmail.com>
2025-09-14 16:00:29 +08:00
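The adaptive-overlap idea above, choosing the tile count from the requested size and then re-spacing the tiles so the last one ends exactly at the image edge, can be sketched per dimension as follows (a minimal illustration with made-up names, not the actual implementation):

```cpp
#include <cmath>

struct TilePlan {
    int   num_tiles;  // tiles along one dimension
    float stride;     // distance between tile origins
};

// Place tiles of tile_size over image_size with at least min_overlap
// (a fraction of tile_size) between neighbors, then stretch the overlap
// so the last tile ends exactly at the edge (no thin remainder tile).
TilePlan plan_tiles_1d(int image_size, int tile_size, float min_overlap) {
    if (tile_size >= image_size) return {1, 0.0f};  // one tile covers everything
    float min_stride = tile_size * (1.0f - min_overlap);
    int n = (int)std::ceil((image_size - tile_size) / min_stride) + 1;
    float stride = (float)(image_size - tile_size) / (n - 1);  // re-space evenly
    return {n, stride};
}
```

Running this independently for width and height also yields the non-square tiling added in sub-PR #3.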
clibdev
87cdbd5978
feat: use log_printf to print ggml logs (#545) 2025-09-11 22:16:05 +08:00
leejet
f8fe4e7db9
fix: add flash attn support check (#803) 2025-09-07 21:29:06 +08:00
leejet
cb1d975e96
feat: add wan2.1/2.2 support (#778)
* add wan vae support

* add wan model support

* add umt5 support

* add wan2.1 t2i support

* make flash attn work with wan

* make wan a little faster

* add wan2.1 t2v support

* add wan gguf support

* add offload params to cpu support

* add wan2.1 i2v support

* crop image before resize

* set default fps to 16

* add diff lora support

* fix wan2.1 i2v

* introduce sd_sample_params_t

* add wan2.2 t2v support

* add wan2.2 14B i2v support

* add wan2.2 ti2v support

* add high noise lora support

* sync: update ggml submodule url

* avoid build failure on linux

* avoid build failure

* update ggml

* update ggml

* fix sd_version_is_wan

* update ggml, fix cpu im2col_3d

* fix ggml_nn_attention_ext mask

* add cache support to ggml runner

* fix the issue of illegal memory access

* unify image loading processing

* add wan2.1/2.2 FLF2V support

* fix end_image mask

* update to latest ggml

* add GGUFReader

* update docs
2025-09-06 18:08:03 +08:00
Daniele
5b8996f74a
Conv2D direct support (#744)
* Conv2DDirect for VAE stage

* Enable only for Vulkan, reduce duplicated code

* Cmake option to use conv2d direct

* conv2d direct always on for opencl

* conv direct as a flag

* fix merge typo

* Align conv2d behavior to flash attention's

* fix readme

* add conv2d direct for controlnet

* add conv2d direct for esrgan

* clean code, use enable_conv2d_direct/get_all_blocks

* format code

---------

Co-authored-by: leejet <leejet714@gmail.com>
2025-08-03 01:25:17 +08:00
Wagner Bruna
f7f05fb185
chore: avoid setting GGML_MAX_NAME when building against external ggml (#751)
An external ggml will most likely have been built with the default
GGML_MAX_NAME value (64), which would be inconsistent with the value
set by our build (128). That would be an ODR violation, and it could
easily cause memory corruption issues due to the different
sizeof(struct ggml_tensor) values.

For now, when linking against an external ggml, we demand it has been
patched with a bigger GGML_MAX_NAME, since we can't check against a
value defined only at build time.
2025-08-03 01:24:40 +08:00
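A minimal sketch of the kind of guard this implies (the SD_USE_SYSTEM_GGML flag is an assumed name, not the project's actual code): since struct ggml_tensor embeds a char name[GGML_MAX_NAME] buffer, the header macro can be checked at compile time when an external ggml is selected.

```cpp
#include "ggml.h"

// Hypothetical guard: if the external ggml's headers (and thus its built
// library) use a GGML_MAX_NAME smaller than ours, sizeof(struct ggml_tensor)
// differs across translation units -- an ODR violation that can silently
// corrupt memory.
#if defined(SD_USE_SYSTEM_GGML) && (GGML_MAX_NAME < 128)
#error "external ggml must be patched with GGML_MAX_NAME >= 128"
#endif
```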
leejet
f6b9aa1a43 refactor: optimize the usage of tensor_types 2025-07-28 23:18:29 +08:00
leejet
eed97a5e1d sync: update ggml 2025-07-24 23:04:08 +08:00
Erik Scholz
ab835f7d39
fix: correct head dim check and L_k padding of flash attention (#736) 2025-07-24 00:57:45 +08:00
leejet
7dac89ad75 refactor: reuse some code 2025-07-01 23:33:50 +08:00
stduhpf
ea46fd6948
fix: force zero-initialize output of tiling (#703) 2025-07-01 23:01:29 +08:00
rmatif
d42fd59464
feat: add OpenCL backend support (#680) 2025-06-30 23:32:23 +08:00
stduhpf
b1cc40c35c
feat: add Chroma support (#696)
---------

Co-authored-by: Green Sky <Green-Sky@users.noreply.github.com>
Co-authored-by: leejet <leejet714@gmail.com>
2025-06-29 23:36:42 +08:00
stduhpf
69c73789fe
fix: force binary mask for inpaint models (#589)
Co-authored-by: leejet <leejet714@gmail.com>
2025-02-22 21:29:57 +08:00
stduhpf
1be2491dcf
feat: partial LyCORIS support (tucker decomposition for LoCon + LoHa + LoKr) (#577) 2025-02-22 21:19:26 +08:00
leejet
dcf91f9e0f chore: change SD_CUBLAS/SD_USE_CUBLAS to SD_CUDA/SD_USE_CUDA 2024-12-28 13:27:51 +08:00
stduhpf
0d9d6659a7
fix: fix metal build (#513) 2024-12-28 13:06:17 +08:00
stduhpf
8f4ab9add3
feat: support Inpaint models (#511) 2024-12-28 13:04:49 +08:00
stduhpf
7ce63e740c
feat: flexible model architecture for dit models (Flux & SD3) (#490)
* Refactor: wtype per tensor

* Fix default args

* refactor: fix flux

* Refactor photomaker v2 support

* unet: refactor the refactoring

* Refactor: fix controlnet and tae

* refactor: upscaler

* Refactor: fix runtime type override

* upscaler: use fp16 again

* Refactor: Flexible sd3 arch

* Refactor: Flexible Flux arch

* format code

---------

Co-authored-by: leejet <leejet714@gmail.com>
2024-11-30 14:18:53 +08:00
leejet
4570715727 fix: use ggml_nn_attention in vae 2024-11-24 18:21:31 +08:00
leejet
c3eeb669cd sync: update ggml 2024-11-23 13:29:32 +08:00
Erik Scholz
1c168d98a5
fix: repair flash attention support (#386)
* repair flash attention in _ext
this does not fix the currently broken fa behind the define, which is only used by VAE

Co-authored-by: FSSRepo <FSSRepo@users.noreply.github.com>

* make flash attention in the diffusion model a runtime flag
no support for sd3 or video

* remove old flash attention option and switch vae over to attn_ext

* update docs

* format code

---------

Co-authored-by: FSSRepo <FSSRepo@users.noreply.github.com>
Co-authored-by: leejet <leejet714@gmail.com>
2024-11-23 12:39:08 +08:00
bssrdf
2b1bc06477
feat: add PhotoMaker Version 2 support (#358)
* first attempt at updating to photomaker v2

* continue adding photomaker v2 modules

* finish the last few pieces for photomaker v2; id_embeds need to be generated by a manual step and passed as an input file

* added a name converter for Photomaker V2; build ok

* more debugging underway

* failing at cuda mat_mul

* updated chunk_half to be more efficient; redo feedforward

* fixed a bug: carefully using ggml_view_4d to get chunks of a tensor; strides need to be recalculated or set properly; still failing at soft_max cuda op

* redo weight calculation and weight*v

* fixed a bug; now PhotoMaker V2 kind of works

* add python script for face detection (Photomaker V2 needs)

* updated readme for photomaker

* fixed a bug causing PMV1 to crash; both V1 and V2 work

* fixed clean_input_ids for PMV2

* fixed a double counting bug in tokenize_with_trigger_token

* updated photomaker readme

* removed some commented code

* improved reconstruction of the class-word-free prompt

* changed id_embed reading to raw binary using the existing load-tensor function; this is more efficient than a full model load and also makes it easier to work with the sd server

* minor clean up

---------

Co-authored-by: bssrdf <bssrdf@gmail.com>
2024-11-23 11:50:14 +08:00
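The ggml_view_4d pitfall mentioned above is worth spelling out. A sketch of slicing a tensor in half along dimension 0 (illustrative, not the actual PhotoMaker code): the byte strides nb1..nb3 passed to the view must stay the parent tensor's strides; only the extent and the byte offset change.

```cpp
#include "ggml.h"

// Return the first or second half of x along dimension 0 as a view.
// Note: nb1..nb3 keep the *parent* layout; deriving them from the view's
// own shape is exactly the stride bug alluded to above.
static struct ggml_tensor* chunk_half_dim0(struct ggml_context* ctx,
                                           struct ggml_tensor* x,
                                           int half /* 0 or 1 */) {
    int64_t ne0    = x->ne[0] / 2;
    size_t  offset = (size_t)half * ne0 * x->nb[0];  // byte offset of the chunk
    return ggml_view_4d(ctx, x,
                        ne0, x->ne[1], x->ne[2], x->ne[3],
                        x->nb[1], x->nb[2], x->nb[3],  // parent strides
                        offset);
}
```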
leejet
ac54e00760
feat: add sd3.5 support (#445) 2024-10-24 21:58:03 +08:00
zhentaoyu
e410aeb534
sync: update ggml to fix large image generation with SYCL backend (#380)
* turn off fast-math on host in SYCL backend

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

* update ggml for sync some sycl ops

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

* update sycl readme and ggml

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

---------

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>
2024-09-02 22:29:35 +08:00
Erik Scholz
e71ddcedad
fix: improve VAE tiling (#372)
* fix and improve: VAE tiling
- properly handle the upper-left corner by interpolating both x and y
- refactor out lerp
- use smootherstep to preserve more detail and spend less area blending

* actually fix vae tile merging

Co-authored-by: stduhpf <stephduh@live.fr>

* remove the now unused lerp function

---------

Co-authored-by: stduhpf <stephduh@live.fr>
2024-08-28 00:21:12 +08:00
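Smootherstep is Perlin's 6t^5 - 15t^4 + 10t^3 curve; compared with a linear ramp it keeps the blend weight near 0 or 1 for longer, so less of each tile's area is visibly mixed. A sketch of the blend, assuming this is how the overlap band is weighted (names are illustrative):

```cpp
#include <algorithm>

// Perlin's smootherstep: zero 1st and 2nd derivatives at t = 0 and t = 1.
static float smootherstep(float t) {
    t = std::clamp(t, 0.0f, 1.0f);
    return t * t * t * (t * (t * 6.0f - 15.0f) + 10.0f);
}

// Blend two overlapping tile pixels at fractional position t across the band.
static float blend(float a, float b, float t) {
    float w = smootherstep(t);
    return a * (1.0f - w) + b * w;
}
```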
soham
2027b16fda
feat: add vulkan backend support (#291)
* Fix includes and init vulkan the same as llama.cpp

* Add Windows Vulkan CI

* Updated ggml submodule

* support epsilon as a parameter for ggml_group_norm

---------

Co-authored-by: Cloudwalk <cloudwalk@icculus.org>
Co-authored-by: Oleg Skutte <00.00.oleg.00.00@gmail.com>
Co-authored-by: leejet <leejet714@gmail.com>
2024-08-27 23:56:09 +08:00
leejet
1bdc767aaf feat: force using f32 for some layers 2024-08-25 13:53:16 +08:00
leejet
c837c5d9cc style: format code 2024-08-25 00:19:37 +08:00
leejet
64d231f384
feat: add flux support (#356)
* add flux support

* avoid build failures in non-CUDA environments

* fix schnell support

* add k quants support

* add support for applying lora to quantized tensors

* add inplace conversion support for f8_e4m3 (#359)

in the same way it is done for bf16: just as bf16 converts
losslessly to fp32, f8_e4m3 converts losslessly to fp16

* add xlabs flux comfy converted lora support

* update docs

---------

Co-authored-by: Erik Scholz <Green-Sky@users.noreply.github.com>
2024-08-24 14:29:52 +08:00
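The f8_e4m3 note rests on a width argument: fp16's 5-bit exponent and 10-bit mantissa strictly contain e4m3's 4 and 3 bits, so widening is exact, just as bf16 to fp32 is. A bit-level sketch of the conversion (an illustration, not the PR's code; e4m3 bias is 7, fp16 bias is 15):

```cpp
#include <cstdint>

// Widen an f8_e4m3 value (1 sign, 4 exponent, 3 mantissa bits) to fp16 bits.
static uint16_t f8_e4m3_to_fp16_bits(uint8_t f8) {
    uint16_t sign = (uint16_t)(f8 & 0x80) << 8;  // sign moves to bit 15
    uint32_t exp  = (f8 >> 3) & 0x0F;
    uint32_t man  = f8 & 0x07;
    if (exp == 0) {
        if (man == 0) return sign;               // +-0
        // e4m3 subnormal (man * 2^-9): normalize it, fp16 holds it as normal
        int shift = 0;
        while (!(man & 0x08)) { man <<= 1; shift++; }
        uint16_t e16 = (uint16_t)(1 - 7 + 15 - shift);       // rebias
        return sign | (e16 << 10) | (uint16_t)((man & 0x07) << 7);
    }
    if (exp == 0x0F && man == 0x07) return sign | 0x7E00;    // e4m3 NaN
    uint16_t e16 = (uint16_t)(exp - 7 + 15);                 // rebias 7 -> 15
    return sign | (e16 << 10) | (uint16_t)(man << 7);
}
```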
zhentaoyu
697d000f49
feat: add SYCL Backend Support for Intel GPUs (#330)
* update ggml and add SYCL CMake option

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

* hacky CMakeLists.txt for updating ggml in cpu backend

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

* rebase and clean code

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

* add sycl in README

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

* rebase ggml commit

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

* refine README

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

* update ggml for supporting sycl tsembd op

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

---------

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>
2024-08-10 13:42:50 +08:00
leejet
4a6e36edc5 sync: update ggml 2024-07-28 18:30:35 +08:00
leejet
73c2176648
feat: add sd3 support (#298) 2024-07-28 15:44:08 +08:00
leejet
be6cd1a4bf sync: update ggml 2024-06-01 13:44:09 +08:00
Eugene
6d16f6853e
fix: correct upscale progressbar (#232) 2024-04-29 22:59:46 +08:00
leejet
036ba9e6d8 feat: enable controlnet and photo maker for img2img mode 2024-04-14 16:36:08 +08:00
leejet
ec82d5279a refactor: remove some useless code 2024-04-14 14:04:52 +08:00