* feat: add code and doc for running SSD1B models
* Added some more lines to support SD1.x with TINY U-Nets too.
* support SSD-1B.safetensors
* fix sdv1.5 diffusers format loader
---------
Co-authored-by: leejet <leejet714@gmail.com>
* add ref latent support for qwen image
* optimize clip_preprocess and fix get_first_stage_encoding
* add qwen2vl vit support
* add qwen image edit support
* fix qwen image edit pipeline
* add mmproj file support
* support dynamic number of Qwen image transformer blocks
* set prompt_template_encode_start_idx every time
* to_add_out precision fix
* to_out.0 precision fix
* update docs
* add wan vae suppport
* add wan model support
* add umt5 support
* add wan2.1 t2i support
* make flash attn work with wan
* make wan a little faster
* add wan2.1 t2v support
* add wan gguf support
* add offload params to cpu support
* add wan2.1 i2v support
* crop image before resize
* set default fps to 16
* add diff lora support
* fix wan2.1 i2v
* introduce sd_sample_params_t
* add wan2.2 t2v support
* add wan2.2 14B i2v support
* add wan2.2 ti2v support
* add high noise lora support
* sync: update ggml submodule url
* avoid build failure on linux
* avoid build failure
* update ggml
* update ggml
* fix sd_version_is_wan
* update ggml, fix cpu im2col_3d
* fix ggml_nn_attention_ext mask
* add cache support to ggml runner
* fix the issue of illegal memory access
* unify image loading processing
* add wan2.1/2.2 FLF2V support
* fix end_image mask
* update to latest ggml
* add GGUFReader
* update docs
Some terminals have slow display latency, so frequent output
during model loading can actually slow down the process.
Also, since tensor loading times can vary a lot, the progress
display now shows the average across past iterations instead
of just the last one.
* Instruct-p2p support
* support 2 conditionings cfg
* Do not re-encode the exact same image twice
* fixes for 2-cfg
* Fix pix2pix latent inputs + improve inpainting a bit + fix naming
* prepare for other pix2pix-like models
* Support sdxl ip2p
* fix reference image embeddings
* Support 2-cond cfg properly in cli
* fix typo in help
* Support masks for ip2p models
* unify code style
* delete unused code
* use edit mode
* add img_cond
* format code
---------
Co-authored-by: leejet <leejet714@gmail.com>
* repair flash attention in _ext
this does not fix the currently broken fa behind the define, which is only used by VAE
Co-authored-by: FSSRepo <FSSRepo@users.noreply.github.com>
* make flash attention in the diffusion model a runtime flag
no support for sd3 or video
* remove old flash attention option and switch vae over to attn_ext
* update docs
* format code
---------
Co-authored-by: FSSRepo <FSSRepo@users.noreply.github.com>
Co-authored-by: leejet <leejet714@gmail.com>
* first attempt at updating to photomaker v2
* continue adding photomaker v2 modules
* finishing the last few pieces for photomaker v2; id_embeds need to be done by a manual step and pass as an input file
* added a name converter for Photomaker V2; build ok
* more debugging underway
* failing at cuda mat_mul
* updated chunk_half to be more efficient; redo feedforward
* fixed a bug: carefully using ggml_view_4d to get chunks of a tensor; strides need to be recalculated or set properly; still failing at soft_max cuda op
* redo weight calculation and weight*v
* fixed a bug now Photomaker V2 kinds of working
* add python script for face detection (Photomaker V2 needs)
* updated readme for photomaker
* fixed a bug causing PMV1 crashing; both V1 and V2 work
* fixed clean_input_ids for PMV2
* fixed a double counting bug in tokenize_with_trigger_token
* updated photomaker readme
* removed some commented code
* improved reconstructing class word free prompt
* changed reading id_embed to raw binary using existing load tensor function; this is more efficient than using model load and also makes it easier to work with sd server
* minor clean up
---------
Co-authored-by: bssrdf <bssrdf@gmail.com>
* mmdit-x
* add support for sd3.5 medium
* add skip layer guidance support (mmdit only)
* ignore slg if slg_scale is zero (optimization)
* init out_skip once
* slg support for flux (expermiental)
* warn if version doesn't support slg
* refactor slg cli args
* set default slg_scale to 0 (oops)
* format code
---------
Co-authored-by: leejet <leejet714@gmail.com>
* Fix includes and init vulkan the same as llama.cpp
* Add Windows Vulkan CI
* Updated ggml submodule
* support epsilon as a parameter for ggml_group_norm
---------
Co-authored-by: Cloudwalk <cloudwalk@icculus.org>
Co-authored-by: Oleg Skutte <00.00.oleg.00.00@gmail.com>
Co-authored-by: leejet <leejet714@gmail.com>
* add flux support
* avoid build failures in non-CUDA environments
* fix schnell support
* add k quants support
* add support for applying lora to quantized tensors
* add inplace conversion support for f8_e4m3 (#359)
in the same way it is done for bf16
like how bf16 converts losslessly to fp32,
f8_e4m3 converts losslessly to fp16
* add xlabs flux comfy converted lora support
* update docs
---------
Co-authored-by: Erik Scholz <Green-Sky@users.noreply.github.com>