* introduce GGMLRunnerContext (see the sketch at the end of this list)
* add Flash Attention enable control through GGMLRunnerContext
* add conv2d_direct enable control through GGMLRunnerContext
* add ref latent support for qwen image
* optimize clip_preprocess and fix get_first_stage_encoding
* add qwen2vl vit support
* add qwen image edit support
* fix qwen image edit pipeline
* add mmproj file support
* support dynamic number of Qwen image transformer blocks
* set prompt_template_encode_start_idx every time
* to_add_out precision fix
* to_out.0 precision fix
* update docs
* add wan vace t2v support
* add --vace-strength option
* add vace i2v support
* fix the processing of vace_context
* add vace v2v support
* update docs
* add wan vae support
* add wan model support
* add umt5 support
* add wan2.1 t2i support
* make flash attn work with wan
* make wan a little faster
* add wan2.1 t2v support
* add wan gguf support
* add offload params to cpu support
* add wan2.1 i2v support
* crop image before resize
* set default fps to 16
* add diff lora support
* fix wan2.1 i2v
* introduce sd_sample_params_t
* add wan2.2 t2v support
* add wan2.2 14B i2v support
* add wan2.2 ti2v support
* add high noise lora support
* sync: update ggml submodule url
* avoid build failure on linux
* avoid build failure
* update ggml
* update ggml
* fix sd_version_is_wan
* update ggml, fix cpu im2col_3d
* fix ggml_nn_attention_ext mask
* add cache support to ggml runner
* fix the issue of illegal memory access
* unify image loading processing
* add wan2.1/2.2 FLF2V support
* fix end_image mask
* update to latest ggml
* add GGUFReader
* update docs
* Conv2DDirect for VAE stage
* Enable only for Vulkan, reduce duplicated code
* CMake option to use conv2d direct
* conv2d direct always on for opencl
* conv direct as a flag
* fix merge typo
* Align conv2d behavior to flash attention's
* fix readme
* add conv2d direct for controlnet
* add conv2d direct for esrgan
* clean code, use enable_conv2d_direct/get_all_blocks
* format code
---------
Co-authored-by: leejet <leejet714@gmail.com>
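The GGMLRunnerContext items above boil down to moving backend choices from compile-time defines to per-run state. A minimal sketch of the idea, with illustrative names rather than the exact ones used in the repository:

```cpp
// Sketch of the GGMLRunnerContext idea: per-run flags threaded through
// graph construction, replacing compile-time defines. Names are
// illustrative, not the exact ones from stable-diffusion.cpp.
#include <cstdio>

struct runner_context_sketch {
    bool flash_attn_enabled    = false; // use fused attention when the backend supports it
    bool conv2d_direct_enabled = false; // use direct conv2d instead of im2col + GEMM
};

// Blocks consult the context while building their graphs, so one model
// implementation serves every backend/flag combination.
void build_vae_conv(const runner_context_sketch& rc) {
    if (rc.conv2d_direct_enabled) {
        std::printf("conv2d: direct kernel path\n");
    } else {
        std::printf("conv2d: im2col + matmul path\n");
    }
}

int main() {
    runner_context_sketch rc;
    rc.conv2d_direct_enabled = true; // e.g. always on for OpenCL, opt-in elsewhere
    build_vae_conv(rc);
}
```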
An external ggml will most likely have been built with the default
GGML_MAX_NAME value (64), which would be inconsistent with the value
set by our build (128). That would be an ODR violation, and it could
easily cause memory corruption issues due to the different
sizeof(struct ggml_tensor) values.
For now, when linking against an external ggml, we require that it has
been patched with a larger GGML_MAX_NAME, since we cannot check against
a value that is defined only at build time.
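One way to catch the mismatch at runtime, sketched here as an illustration (ggml_tensor_overhead() is real ggml API; the threshold logic is an assumption, not the project's actual check):

```cpp
// Sketch: detect a GGML_MAX_NAME mismatch against a dynamically linked
// ggml. ggml_tensor_overhead() is compiled into the library, so it
// reflects the library's sizeof(struct ggml_tensor); the sizeof below
// reflects the headers we compiled against.
#include <stdio.h>
#include <stdlib.h>
#include "ggml.h"

static void check_ggml_tensor_abi(void) {
    size_t lib_overhead = ggml_tensor_overhead();     // library's view (object header + tensor)
    size_t our_tensor   = sizeof(struct ggml_tensor); // our header's view
    // The overhead adds a small object header on top of the tensor struct,
    // so it must be at least as large as our tensor; if it is smaller, the
    // library was built with a smaller GGML_MAX_NAME.
    if (lib_overhead < our_tensor) {
        fprintf(stderr, "external ggml was built with a smaller GGML_MAX_NAME; "
                        "sizeof(struct ggml_tensor) mismatch risks memory corruption\n");
        abort();
    }
}
```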
* repair flash attention in _ext
this does not fix the currently broken flash attention behind the compile-time define, which is only used by the VAE
Co-authored-by: FSSRepo <FSSRepo@users.noreply.github.com>
* make flash attention in the diffusion model a runtime flag (see the dispatch sketch below)
no support for sd3 or video
* remove old flash attention option and switch vae over to attn_ext
* update docs
* format code
---------
Co-authored-by: FSSRepo <FSSRepo@users.noreply.github.com>
Co-authored-by: leejet <leejet714@gmail.com>
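A hedged sketch of what the runtime flag amounts to inside an attention helper: one branch emits the fused ggml_flash_attn_ext op, the other builds the naive softmax(QK^T * scale)V graph. The helper name and exact tensor layouts are illustrative; the ggml calls are real API:

```cpp
// Sketch of a runtime flash-attention toggle. Note that in real code the
// two paths expect different v layouts (ggml_flash_attn_ext takes v
// un-transposed, while the naive path below wants v pre-transposed to
// [n_tok, d_head, n_head]); the caller is assumed to provide a match.
#include <math.h>
#include "ggml.h"

struct ggml_tensor* attention_ext_sketch(struct ggml_context* ctx,
                                         struct ggml_tensor* q,    // [d_head, n_tok, n_head]
                                         struct ggml_tensor* k,    // [d_head, n_tok, n_head]
                                         struct ggml_tensor* v,
                                         struct ggml_tensor* mask,
                                         bool flash_attn) {
    const float scale = 1.0f / sqrtf((float)q->ne[0]);
    if (flash_attn) {
        // fused path: one op the backend can implement efficiently
        return ggml_flash_attn_ext(ctx, q, k, v, mask, scale, 0.0f, 0.0f);
    }
    // naive path: QK^T, scale, mask, softmax, then weight V
    struct ggml_tensor* kq = ggml_mul_mat(ctx, k, q); // [n_tok, n_tok, n_head]
    kq = ggml_scale(ctx, kq, scale);
    if (mask) {
        kq = ggml_add(ctx, kq, mask);
    }
    kq = ggml_soft_max(ctx, kq);
    return ggml_mul_mat(ctx, v, kq); // [d_head, n_tok, n_head]
}
```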
* first attempt at updating to photomaker v2
* continue adding photomaker v2 modules
* finish the last few pieces for photomaker v2; id_embeds need to be computed in a manual step and passed as an input file
* added a name converter for Photomaker V2; build ok
* more debugging underway
* failing at cuda mat_mul
* updated chunk_half to be more efficient; redid the feedforward
* fixed a bug: carefully using ggml_view_4d to get chunks of a tensor; strides need to be recalculated or set properly (see the sketch after this list); still failing at the soft_max cuda op
* redo weight calculation and weight*v
* fixed a bug; Photomaker V2 is now mostly working
* add python script for face detection (needed by Photomaker V2)
* updated readme for photomaker
* fixed a bug causing PMV1 to crash; both V1 and V2 work
* fixed clean_input_ids for PMV2
* fixed a double counting bug in tokenize_with_trigger_token
* updated photomaker readme
* removed some commented code
* improved reconstruction of the class-word-free prompt
* changed id_embed reading to raw binary using the existing load-tensor function; this is more efficient than a full model load and also makes it easier to work with the sd server
* minor clean up
---------
Co-authored-by: bssrdf <bssrdf@gmail.com>
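A sketch of the ggml_view_4d chunking fix mentioned above, with a hypothetical chunk_half_sketch helper: the views must inherit the parent's byte strides (recomputing them as if each half were contiguous is exactly the bug), and the second half needs a byte offset into dim 0:

```cpp
// Sketch: split a tensor into two halves along dim 0 with ggml_view_4d.
#include "ggml.h"

void chunk_half_sketch(struct ggml_context* ctx, struct ggml_tensor* x,
                       struct ggml_tensor** lo, struct ggml_tensor** hi) {
    const int64_t half = x->ne[0] / 2;
    *lo = ggml_view_4d(ctx, x, half, x->ne[1], x->ne[2], x->ne[3],
                       x->nb[1], x->nb[2], x->nb[3], // parent strides, not recomputed
                       0);
    *hi = ggml_view_4d(ctx, x, half, x->ne[1], x->ne[2], x->ne[3],
                       x->nb[1], x->nb[2], x->nb[3],
                       half * x->nb[0]);             // byte offset into dim 0
    // Ops that require contiguous inputs need a ggml_cont() on the views.
}
```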
* fix and improve: VAE tiling
- properly handle the upper-left corner by interpolating in both x and y
- refactor out lerp
- use smootherstep to preserve more detail and spend less area on blending (see the sketch after this PR's notes)
* actually fix vae tile merging
Co-authored-by: stduhpf <stephduh@live.fr>
* remove the now unused lerp function
---------
Co-authored-by: stduhpf <stephduh@live.fr>
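A sketch of the blending change, with hypothetical helper names: smootherstep (6t^5 - 15t^4 + 10t^3) has zero first and second derivatives at both endpoints, so each tile keeps near-full weight for longer across the overlap and seams transition more gently than with a linear ramp:

```cpp
// Sketch: tile-merge weighting with Perlin's smootherstep instead of lerp.
#include <stddef.h>

static float smootherstep(float t) { // t in [0, 1]
    return t * t * t * (t * (t * 6.0f - 15.0f) + 10.0f);
}

// Blend one overlapping row of two decoded tiles (assumes overlap >= 2).
static void blend_overlap_row(const float* left, const float* right,
                              float* out, size_t overlap) {
    for (size_t i = 0; i < overlap; i++) {
        float w = smootherstep((float)i / (float)(overlap - 1));
        out[i]  = (1.0f - w) * left[i] + w * right[i];
    }
}
```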
* Fix includes and init Vulkan the same way as llama.cpp
* Add Windows Vulkan CI
* Updated ggml submodule
* support epsilon as a parameter for ggml_group_norm (usage sketched below)
---------
Co-authored-by: Cloudwalk <cloudwalk@icculus.org>
Co-authored-by: Oleg Skutte <00.00.oleg.00.00@gmail.com>
Co-authored-by: leejet <leejet714@gmail.com>
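For the group-norm change: epsilon was previously hard-coded inside the op and is now a parameter, matching the usual y = (x - mean) / sqrt(var + eps) per group. A minimal usage sketch; the eps value shown is just a common default, not one mandated by the API:

```cpp
#include "ggml.h"

struct ggml_tensor* group_norm_example(struct ggml_context* ctx,
                                       struct ggml_tensor* x) {
    const int   n_groups = 32;
    const float eps      = 1e-6f; // previously fixed inside the op
    return ggml_group_norm(ctx, x, n_groups, eps);
}
```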
* add flux support
* avoid build failures in non-CUDA environments
* fix schnell support
* add k quants support
* add support for applying lora to quantized tensors
* add inplace conversion support for f8_e4m3 (#359)
in the same way it is done for bf16: just as bf16 widens losslessly to fp32, f8_e4m3 widens losslessly to fp16 (see the sketch at the end of this list)
* add xlabs flux comfy converted lora support
* update docs
---------
Co-authored-by: Erik Scholz <Green-Sky@users.noreply.github.com>
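A sketch of the lossless widening described in the f8_e4m3 item; the exact helper in the repository may differ. e4m3fn has 1 sign bit, 4 exponent bits (bias 7), 3 mantissa bits, no infinities, and a single NaN encoding; every finite value lands in fp16's normal range (bias 15, 10 mantissa bits), so the conversion is exact:

```cpp
#include <stdint.h>

// Widen an f8_e4m3(fn) byte to an IEEE fp16 bit pattern, exactly.
static uint16_t f8_e4m3_to_f16_bits(uint8_t f8) {
    uint16_t sign = (uint16_t)(f8 & 0x80) << 8; // sign -> bit 15
    int      e    = (f8 >> 3) & 0x0f;           // biased exponent (bias 7)
    uint16_t m    = f8 & 0x07;                  // 3-bit mantissa

    if (e == 0x0f && m == 0x07) {
        return sign | 0x7e00;                   // NaN -> quiet fp16 NaN
    }
    if (e == 0) {
        if (m == 0) {
            return sign;                        // signed zero
        }
        // subnormal: value = m * 2^-9; renormalize to get an implicit 1
        e = 1;
        while ((m & 0x08) == 0) { m <<= 1; e--; }
        m &= 0x07;
    }
    // rebias 7 -> 15 (+8) and widen the mantissa from 3 to 10 bits
    return sign | (uint16_t)((e + 8) << 10) | (uint16_t)(m << 7);
}
```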