fix: avoid writable mmap for read-only weights (#1698)

feat: support guidance_schedule (#1684)
refactor: add Flux VAE version helper (#1696)
2026-06-23 22:56:42 +00:00 · 2026-06-23 00:39:31 +08:00 · 2026-06-23 00:05:55 +08:00 · 2026-06-22 22:39:42 +08:00 · 2026-06-22 22:16:54 +08:00 · 2026-06-22 22:10:09 +08:00
44 changed files with 2468 additions and 166 deletions
--- a/README.md
+++ b/README.md
@ -34,8 +34,8 @@ API and command-line option may change frequently.***
 - Super lightweight and without external dependencies
 - Supported models
  - Image Models
-    - SD1.x, SD2.x, [SD-Turbo](https://huggingface.co/stabilityai/sd-turbo)
-    - SDXL, [SDXL-Turbo](https://huggingface.co/stabilityai/sdxl-turbo)
+    - [SD1.x, SD2.x, SD-Turbo](./docs/sd.md)
+    - [SDXL, SDXL-Turbo](./docs/sd.md)
    - [Some SD1.x and SDXL distilled models](./docs/distilled_sd.md)
    - [SD3/SD3.5](./docs/sd3.md)
    - [FLUX.1-dev/FLUX.1-schnell](./docs/flux.md)
@ -50,21 +50,23 @@ API and command-line option may change frequently.***
    - [Ovis-Image](./docs/ovis_image.md)
    - [Anima](./docs/anima.md)
    - [ERNIE-Image](./docs/ernie_image.md)
+    - [Boogu Image](./docs/boogu_image.md)
    - [HiDream-O1-Image](./docs/hidream_o1_image.md)
    - [Ideogram4](./docs/ideogram4.md)
  - Image Edit Models
    - [FLUX.1-Kontext-dev](./docs/kontext.md)
    - [Qwen Image Edit series](./docs/qwen_image_edit.md)
    - [LongCat Image Edit](./docs/longcat_image.md)
+    - [Boogu Image Edit](./docs/boogu_image.md)
  - Video Models
    - [Wan2.1/Wan2.2](./docs/wan.md)
    - [LTX-2.3](./docs/ltx2.md)
-  - [PhotoMaker](https://github.com/TencentARC/PhotoMaker) support.
+  - [PhotoMaker](./docs/photo_maker.md) support.
  - Control Net support with SD 1.5
  - LoRA support, same as [stable-diffusion-webui](https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Features#lora)
  - Latent Consistency Models support (LCM/LCM-LoRA)
-  - Faster and memory efficient latent decoding with [TAESD](https://github.com/madebyollin/taesd)
-  - Upscale images generated with [ESRGAN](https://github.com/xinntao/Real-ESRGAN)
+  - Faster and memory efficient latent decoding with [TAESD](./docs/taesd.md)
+  - Upscale images generated with [ESRGAN](./docs/esrgan.md)
 - Supported backends
  - CPU (AVX, AVX2 and AVX512 support for x86 architectures)
  - CUDA
@ -133,28 +135,9 @@ For runtime and parameter backend placement, see the [backend selection guide](.
 ## More Guides

 - [Backend selection](./docs/backend.md)
- [SD1.x/SD2.x/SDXL](./docs/sd.md)
- [SD3/SD3.5](./docs/sd3.md)
- [FLUX.1-dev/FLUX.1-schnell](./docs/flux.md)
- [FLUX.2-dev/FLUX.2-klein](./docs/flux2.md)
- [FLUX.1-Kontext-dev](./docs/kontext.md)
- [Chroma](./docs/chroma.md)
- [🔥Qwen Image](./docs/qwen_image.md)
- [🔥Qwen Image Edit series](./docs/qwen_image_edit.md)
- [🔥Wan2.1/Wan2.2](./docs/wan.md)
- [🔥LTX-2.3](./docs/ltx2.md)
- [🔥Z-Image](./docs/z_image.md)
- [Ovis-Image](./docs/ovis_image.md)
- [Anima](./docs/anima.md)
- [ERNIE-Image](./docs/ernie_image.md)
- [HiDream-O1-Image](./docs/hidream_o1_image.md)
- [Lens](./docs/lens.md)
- [LongCat Image / LongCat Image Edit](./docs/longcat_image.md)
+- [RPC](./docs/rpc.md)
 - [LoRA](./docs/lora.md)
 - [LCM/LCM-LoRA](./docs/lcm.md)
- [Using PhotoMaker to personalize image generation](./docs/photo_maker.md)
- [Using ESRGAN to upscale results](./docs/esrgan.md)
- [Using TAESD to faster decoding](./docs/taesd.md)
 - [Docker](./docs/docker.md)
 - [Quantization and GGUF](./docs/quantization_and_gguf.md)
 - [Inference acceleration via caching](./docs/caching.md)
--- a/assets/boogu/edit_example.png
+++ b/assets/boogu/edit_example.png
--- a/assets/boogu/example.png
+++ b/assets/boogu/example.png
--- a/docs/backend.md
+++ b/docs/backend.md
@ -35,6 +35,14 @@ sd-cli -m model.safetensors -p "a cat" --backend cuda0 --params-backend te=cpu,v
 sd-cli -m model.safetensors -p "a cat" --backend cuda0 --params-backend disk
 ```

+`--max-vram` can target resolved backend/device names:
+
+```shell
+sd-cli -m model.safetensors -p "a cat" --backend diffusion=cuda0,vae=vulkan0 --max-vram cuda0=6,vulkan0=2
+```
+
+The budget applies to every module running on that backend.
+
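As a rough illustration of the `--max-vram` spec shown above, the sketch below splits a value such as `cuda0=6,vulkan0=2` into per-device budgets. The function name `parse_vram_spec` and the `"*"` key used for a bare number are hypothetical choices made for this example; the project's actual parser is not shown in this diff and may behave differently.

```cpp
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper, not the project's parser: splits a --max-vram style
// spec into per-device budgets in GiB. A bare number (e.g. "6") is stored
// under the key "*", which is an assumption made for this sketch.
std::map<std::string, double> parse_vram_spec(const std::string& spec) {
    std::map<std::string, double> budgets;
    std::istringstream in(spec);
    std::string item;
    while (std::getline(in, item, ',')) {
        if (item.empty()) {
            continue;  // tolerate stray commas
        }
        const auto eq = item.find('=');
        if (eq == std::string::npos) {
            budgets["*"] = std::stod(item);  // single global budget
        } else {
            // name=value assignment; std::stod throws on a non-numeric value
            budgets[item.substr(0, eq)] = std::stod(item.substr(eq + 1));
        }
    }
    return budgets;
}
```

Under these assumptions, `parse_vram_spec("cuda0=6,vulkan0=2")` yields one entry per device name, and `parse_vram_spec("6")` yields a single `"*"` entry.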
 Module names are case-insensitive. Hyphens and underscores in module names are ignored, so `clip_vision`, `clip-vision`, and `clipvision` are equivalent.

 `all=`, `default=`, and `*=` can be used to set the default backend inside a mixed assignment:
--- a/docs/boogu_image.md
+++ b/docs/boogu_image.md
@ -0,0 +1,31 @@
+# How to Use
+
+Boogu Image uses a Boogu diffusion transformer, the FLUX VAE, and Qwen3-VL as the LLM text and vision encoder.
+
+## Download weights
+
+- Download Boogu Image
+    - safetensors: https://huggingface.co/Comfy-Org/Boogu-Image/tree/main/diffusion_models
+- Download vae
+    - safetensors: https://huggingface.co/black-forest-labs/FLUX.1-dev/blob/main/ae.safetensors
+- Download Qwen3-VL 8B
+    - gguf: https://huggingface.co/unsloth/Qwen3-VL-8B-Instruct-GGUF/tree/main
+        - For image editing with GGUF text encoders, also download the matching mmproj file and pass it with `--llm_vision`.
+
+## Examples
+
+### Boogu Image Base
+
+```
+.\bin\Release\sd-cli.exe --diffusion-model ..\..\ComfyUI\models\diffusion_models\boogu_image_base_bf16.safetensors --llm ..\..\llm\Qwen3VL-8B-Instruct-Q4_K_M.gguf --vae ..\..\ComfyUI\models\vae\ae.sft -p "a lovely cat" --diffusion-fa -v --offload-to-cpu
+```
+
+<img width="256" alt="Boogu Image Base example" src="../assets/boogu/example.png" />
+
+### Boogu Image Edit
+
+```
+.\bin\Release\sd-cli.exe --diffusion-model ..\..\ComfyUI\models\diffusion_models\boogu_image_edit_bf16.safetensors --llm ..\..\llm\Qwen3VL-8B-Instruct-Q4_K_M.gguf --llm_vision ..\..\llm\mmproj-Qwen3VL-8B-Instruct-F16.gguf --vae ..\..\ComfyUI\models\vae\ae.sft --diffusion-fa -v --offload-to-cpu -r ..\assets\flux\flux1-dev-q8_0.png -p "change 'flux.cpp' to 'boogu.cpp'"
+```
+
+<img width="256" alt="Boogu Image Edit example" src="../assets/boogu/edit_example.png" />
--- a/docs/pulid.md
+++ b/docs/pulid.md
@ -0,0 +1,196 @@
+# PuLID-Flux face-identity preservation
+
+stable-diffusion.cpp supports the [PuLID-Flux](https://github.com/ToTheBeginning/PuLID)
+identity-injection technique on top of Flux.1 (schnell or dev) models.
+Given a single source portrait, PuLID-Flux produces new generations that
+preserve the source person's face across arbitrary scenes, poses, and
+prompts.
+
+Unlike PhotoMaker (which extracts the identity inside the inference
+process from a directory of images), PuLID-Flux's identity extractor is
+a heavy stack (insightface ArcFace + EVA-CLIP-L + IDFormer encoder) that
+is impractical to port to C++/ggml. To keep this implementation small and
+cross-vendor, **stable-diffusion.cpp consumes a precomputed identity
+embedding** produced by an external Python tool that runs once per source
+portrait. Everything downstream of that one-shot extraction is C++ and
+runs on any backend (Vulkan, CUDA, Metal, ROCm, CPU).
+
+## Architecture summary
+
+The PuLID-Flux contribution to the Flux denoise loop is a stack of 20
+small cross-attention modules (`PerceiverAttentionCA`) inserted between
+the Flux transformer blocks:
+
+- After every 2nd of the 19 double-stream blocks (10 hook points)
+- After every 4th of the 38 single-stream blocks (10 hook points)
+
+Each cross-attention layer takes the current image tokens as query, the
+32-token / 2048-dim identity embedding as key+value, and adds its output
+(scaled by `id_weight`, typically 1.0) back to the image tokens.
+
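The hook counts in the architecture summary can be checked with a few lines of arithmetic. The sketch below assumes a hook fires after every block whose zero-based index is a multiple of the interval; that alignment is an assumption, and `hook_blocks` is a hypothetical helper rather than a function from this change.

```cpp
#include <vector>

// Hypothetical helper: indices of the blocks that are followed by a PuLID
// cross-attention hook. Assumes a hook fires when the zero-based block index
// is a multiple of `interval`; this alignment is an assumption of the sketch.
std::vector<int> hook_blocks(int n_blocks, int interval) {
    std::vector<int> hooks;
    for (int i = 0; i < n_blocks; ++i) {
        if (i % interval == 0) {
            hooks.push_back(i);
        }
    }
    return hooks;
}
```

With this alignment, `hook_blocks(19, 2)` and `hook_blocks(38, 4)` each contain 10 indices, matching the 20 modules described above. Firing on `(i + 1) % interval == 0` instead would give 9 + 9 hooks, which would not match the documented counts.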
+## Required weights
+
+Three sets of files are required:
+
+1. **Flux base** (transformer + VAE + clip_l + t5xxl) -- exactly as
+   [docs/flux.md](flux.md) describes.
+2. **PuLID weights** -- download from
+   [guozinan/PuLID](https://huggingface.co/guozinan/PuLID):
+   - `pulid_flux_v0.9.0.safetensors` or `pulid_flux_v0.9.1.safetensors`
+     (recommended; this implementation is verified against v0.9.1)
+   - **v1.1 (`pulid_v1.1.safetensors`) is NOT yet supported** -- it uses
+     renamed keys (`id_adapter_attn_layers.*` instead of `pulid_ca.*`)
+     and possibly a different module structure. Support is left to a future PR.
+3. **Identity embedding (.pulidembd)** -- produced by the precompute
+   tool below.
+
+## Precompute the identity embedding
+
+The precompute tool runs the PyTorch identity-extraction stack on a
+single portrait image and writes the resulting `(32, 2048)` embedding
+to a `.pulidembd` binary file (about 131 KB). Run it once per source
+person; the same file is reused for any number of generations.
+
+A reference Python script is provided at
+[`script/pulid_extract_id.py`](../script/pulid_extract_id.py). It
+requires:
+- A working CUDA / CPU PyTorch stack
+- `insightface`, `facexlib`, `eva-clip`, `torchvision`, `opencv-python`,
+  `huggingface_hub`, `gguf`
+- The PuLID weights file (same one stable-diffusion.cpp will load below)
+- The ToTheBeginning/PuLID repo's `pulid/` package (including
+  `pulid/pipeline_flux.py`) and `eva_clip/` package on `PYTHONPATH`; `flux/`
+  is not needed for embedding extraction
+
+Run it as:
+
+```
+python pulid_extract_id.py \
+  --portrait /path/to/source-photo.jpg \
+  --pulid-weights /path/to/pulid_flux_v0.9.1.safetensors \
+  --out /path/to/source.pulidembd
+```
+
+## Format (gguf)
+
+The embedding is a standard **gguf** container holding a single tensor:
+
+```
+tensor name : "pulid_id"
+shape       : [token_dim, num_tokens]   (ggml order; typically [2048, 32])
+type        : F16 (also accepts F32 / BF16)
+metadata    : general.architecture = "pulid", pulid.version = 1
+```
+
+stable-diffusion.cpp loads it with the normal gguf reader
+(`gguf_init_from_file`) and converts to fp32 at load time -- no bespoke
+parser. Total file size for the typical (32, 2048, fp16) case is ~131 KB.
+
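The quoted file size follows directly from the tensor shape. The sketch below computes the raw payload only; the gguf header and metadata add a small amount on top. `embedding_bytes` is a hypothetical name used for illustration.

```cpp
#include <cstddef>

// Raw tensor payload in bytes for an identity embedding of the given shape.
// Excludes the gguf header and metadata, which add a little on top.
constexpr std::size_t embedding_bytes(std::size_t num_tokens,
                                      std::size_t token_dim,
                                      std::size_t bytes_per_value) {
    return num_tokens * token_dim * bytes_per_value;
}

// 32 tokens x 2048 dims x 2 bytes (F16) = 131072 bytes (128 KiB, about 131 KB).
static_assert(embedding_bytes(32, 2048, 2) == 131072, "F16 payload size");
// The same shape stored as F32 doubles the payload.
static_assert(embedding_bytes(32, 2048, 4) == 262144, "F32 payload size");
```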
+## Command-line usage
+
+```
+.\bin\Release\sd-cli.exe \
+  --diffusion-model     models\flux1-schnell-Q4_K_S.gguf \
+  --vae                 models\ae.safetensors \
+  --clip_l              models\clip_l.safetensors \
+  --t5xxl               models\t5xxl_fp16.safetensors \
+  --pulid-weights       models\pulid_flux_v0.9.1.safetensors \
+  --pulid-id-embedding  source.pulidembd \
+  --pulid-id-weight     1.0 \
+  -p "candid photograph of a young woman on a beach at sunset" \
+  --cfg-scale 1.0 --sampling-method euler --steps 4 -W 512 -H 512 \
+  --seed 42 --clip-on-cpu \
+  -o out.png
+```
+
+For Flux Dev (instead of Schnell), add `--guidance 3.5` and `--steps 20`.
+
+## Flags
+
+| Flag                       | Purpose                                                           |
+|----------------------------|-------------------------------------------------------------------|
+| `--pulid-weights <path>`   | Path to `pulid_flux_v0.9.x.safetensors`. Loaded with the model.   |
+| `--pulid-id-embedding <p>` | Path to a `.pulidembd` binary produced by the precompute tool.    |
+| `--pulid-id-weight <f>`    | Identity-injection strength. Typical 0.7-1.2; default 1.0.        |
+
+PuLID is active only when both `--pulid-weights` and `--pulid-id-embedding`
+are set; `--pulid-id-weight` is optional and defaults to 1.0. Setting only
+`--pulid-weights` (no embedding) loads the weights but disables injection
+at runtime. Setting `--pulid-id-weight 0` zeros out the contribution, which
+is useful for falsification testing: outputs should be byte-identical to
+a no-PuLID run with the same seed.
+
+## Memory budget
+
+At 512x512, 4 steps (Schnell), the 20 cross-attention layers add roughly
+10% to denoise time and almost nothing to peak VRAM. Tested on a 12 GB
+consumer card alongside Flux Schnell Q4 GGUF + CPU-offloaded clip_l and
+t5xxl + GPU-resident VAE.
+
+At 1024x1024 with Flux Dev Q4 + 20 steps + PuLID, the VAE decode compute
+buffer doesn't fit on a 12 GB card even with `--vae-on-cpu`. Workaround:
+explicitly route VAE to the CPU backend instead of the offload flag:
+
+```
+--backend "diffusion=vulkan0,vae=cpu"
+```
+
+The `--vae-on-cpu` flag offloads VAE weights but leaves the compute graph
+on the default backend; this is existing stable-diffusion.cpp behavior,
+not a PuLID-specific issue. Documented here because anyone running PuLID
+at 1024 will hit it.
+
+## Backend selection
+
+The standard `--backend` flag works as documented. Common patterns:
+
+```
+# AMD Vulkan
+--backend "diffusion=vulkan0,vae=cpu"
+
+# NVIDIA Vulkan
+--backend "diffusion=vulkan1,vae=cpu"
+
+# CUDA
+--backend "diffusion=cuda0,vae=cpu"
+```
+
+The PuLID cross-attention layers run on the same backend as the main
+diffusion model. They have not yet been independently profiled on every
+backend; only Vulkan and CPU have been tested by the original contributor.
+
+## Verification
+
+A three-way SHA-256 check is the recommended sanity test when bringing up
+a new combination of model + backend + hardware:
+
+| Run                                     | Expected hash relation                                |
+|-----------------------------------------|-------------------------------------------------------|
+| A: no `--pulid-*` flags                 | baseline                                              |
+| B: PuLID flags, `--pulid-id-weight 0.0` | **byte-identical to A**                               |
+| C: PuLID flags, `--pulid-id-weight 1.0` | **different from A and B**, preserves source identity |
+
+If A and C differ but A and B differ too, the injection is allocating
+or computing something even at zero weight -- likely a bug.
+
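Where a SHA-256 tool is not at hand, the byte-identical relation between runs A and B can also be checked by comparing the two output files directly. The sketch below is one possible way to do that; `files_identical` is a hypothetical helper and is not part of stable-diffusion.cpp.

```cpp
#include <algorithm>
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: true only when both files can be opened and contain
// exactly the same bytes. Uses the four-iterator std::equal (C++14), which
// also reports a mismatch when one file is a prefix of the other.
bool files_identical(const std::string& path_a, const std::string& path_b) {
    std::ifstream a(path_a, std::ios::binary);
    std::ifstream b(path_b, std::ios::binary);
    if (!a || !b) {
        return false;  // a missing or unreadable file never counts as identical
    }
    return std::equal(std::istreambuf_iterator<char>(a), std::istreambuf_iterator<char>(),
                      std::istreambuf_iterator<char>(b), std::istreambuf_iterator<char>());
}
```

This compares whole files, so any difference in embedded metadata registers as a mismatch just as a pixel difference would.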
+## Limitations / not yet supported
+
+- **`--skip-layers` (skip-layer-guidance / SLG) combined with PuLID** is not
+  supported. The `pulid_ca` index advances per non-skipped block, so a
+  skipped block silently misaligns the cross-attention weight assignment
+  vs. the trained intervals. The reference PyTorch implementation does
+  not have SLG either, so there is no well-defined behavior to emulate.
+  Use either feature alone.
+- **PuLID v1.1 weights** (`pulid_v1.1.safetensors`, renamed key layout).
+- **Multiple ID images.** The reference PyTorch implementation can fuse
+  several portraits into one embedding for stronger identity. This
+  implementation accepts a single embedding produced from one or more
+  images by the external precompute tool.
+- **Negative-prompt branch of CFG.** PuLID only injects on the positive
+  conditioning path in the published reference, and the implementation
+  here follows that. Flux's distilled guidance doesn't run a separate
+  uncond branch in normal use, so this matters only for `--true-cfg`
+  workflows that aren't standard for Flux.
+- **Backends other than Vulkan and CPU** are untested by the original
+  contributor. The implementation is pure-ggml and should work on CUDA,
+  ROCm, and Metal, but verification by users on those backends is
+  welcome.
--- a/examples/cli/main.cpp
+++ b/examples/cli/main.cpp
@ -62,18 +62,22 @@ struct SDCliParams {
            {"-o",
             "--output",
             "path to write result image to. you can use printf-style %d format specifiers for image sequences (default: ./output.png) (eg. output_%03d.png). Single-file video outputs support .avi, .webm, and animated .webp",
+             0,
             &output_path},
            {"",
             "--image",
             "path to the image to inspect (for metadata mode)",
+             0,
             &image_path},
            {"",
             "--metadata-format",
             "metadata output format, one of [text, json] (default: text)",
+             0,
             &metadata_format},
            {"",
             "--preview-path",
             "path to write preview image to (default: ./preview.png). Multi-frame previews support .avi, .webm, and animated .webp",
+             0,
             &preview_path},
        };

--- a/examples/common/common.cpp
+++ b/examples/common/common.cpp
@ -6,6 +6,7 @@
 #include <cstdlib>
 #include <ctime>
 #include <filesystem>
+#include <fstream>
 #include <iomanip>
 #include <iostream>
 #include <regex>
@ -260,8 +261,15 @@ bool parse_options(int argc, const char** argv, const std::vector<ArgOptions>& o
                        invalid_arg = true;
                        return;
                    }
-                    *option.target = argv_to_utf8(i, argv);
-                    found_arg      = true;
+                    if (option.concat && !option.target->empty()) {
+                        if (option.concat > 0 && option.concat <= 0xff) {
+                            *option.target += static_cast<char>(option.concat);
+                        }
+                        *option.target += argv_to_utf8(i, argv);
+                    } else {
+                        *option.target = argv_to_utf8(i, argv);
+                    }
+                    found_arg = true;
                }))
                break;

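The concat rule in the hunk above can be exercised in isolation. The sketch below mirrors that logic in a free function so the effect of repeating an option is easy to see; `assign_string_option` is a hypothetical name introduced for this example, not a function in `common.cpp`.

```cpp
#include <string>

// Hypothetical stand-alone mirror of the concat rule from the hunk above.
// `concat` is 0 for plain overwrite, otherwise a separator character code
// (options such as --backend are registered with (int)',' in this diff).
void assign_string_option(std::string& target, int concat, const std::string& value) {
    if (concat && !target.empty()) {
        // Append to the existing value, inserting the separator when it
        // fits in a single byte.
        if (concat > 0 && concat <= 0xff) {
            target += static_cast<char>(concat);
        }
        target += value;
    } else {
        // First occurrence, or a non-concatenating option: overwrite.
        target = value;
    }
}
```

Calling it twice with `','` and the values `clip=cpu` and `vae=cuda0` leaves `clip=cpu,vae=cuda0` in the target, while a `concat` of 0 keeps only the last value.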
@ -324,113 +332,152 @@ ArgOptions SDContextParams::get_options() {
        {"-m",
         "--model",
         "path to full model",
+         0,
         &model_path},
        {"",
         "--clip_l",
-         "path to the clip-l text encoder", &clip_l_path},
+         "path to the clip-l text encoder",
+         0,
+         &clip_l_path},
        {"", "--clip_g",
         "path to the clip-g text encoder",
+         0,
         &clip_g_path},
        {"",
         "--clip_vision",
         "path to the clip-vision encoder",
+         0,
         &clip_vision_path},
        {"",
         "--t5xxl",
         "path to the t5xxl text encoder",
+         0,
         &t5xxl_path},
        {"",
         "--llm",
         "path to the llm text encoder. For example: (qwenvl2.5 for qwen-image, mistral-small3.2 for flux2, ...)",
+         0,
         &llm_path},
        {"",
         "--llm_vision",
         "path to the llm vit",
+         0,
         &llm_vision_path},
        {"",
         "--qwen2vl",
         "alias of --llm. Deprecated.",
+         0,
         &llm_path},
        {"",
         "--qwen2vl_vision",
         "alias of --llm_vision. Deprecated.",
+         0,
         &llm_vision_path},
        {"",
         "--diffusion-model",
         "path to the standalone diffusion model",
+         0,
         &diffusion_model_path},
        {"",
         "--high-noise-diffusion-model",
         "path to the standalone high noise diffusion model",
+         0,
         &high_noise_diffusion_model_path},
        {"",
         "--uncond-diffusion-model",
         "path to the standalone unconditional diffusion model, currently used by Ideogram4 CFG",
+         0,
         &uncond_diffusion_model_path},
        {"",
         "--embeddings-connectors",
         "path to LTXAV embeddings connectors",
+         0,
         &embeddings_connectors_path},
        {"",
         "--vae",
         "path to standalone vae model",
+         0,
         &vae_path},
        {"",
         "--vae-format",
         "VAE latent format override: auto, flux, sd3, or flux2 (default: auto)",
+         0,
         &vae_format},
        {"",
         "--audio-vae",
         "path to standalone LTX audio vae model",
+         0,
         &audio_vae_path},
        {"",
         "--taesd",
         "path to taesd. Using Tiny AutoEncoder for fast decoding (low quality)",
+         0,
         &taesd_path},
        {"",
         "--tae",
         "alias of --taesd",
+         0,
         &taesd_path},
        {"",
         "--control-net",
         "path to control net model",
+         0,
         &control_net_path},
        {"",
         "--embd-dir",
         "embeddings directory",
+         0,
         &embedding_dir},
        {"",
         "--lora-model-dir",
         "lora model directory",
+         0,
         &lora_model_dir},
        {"",
         "--hires-upscalers-dir",
         "highres fix upscaler model directory",
+         0,
         &hires_upscalers_dir},
        {"",
         "--tensor-type-rules",
         "weight type per tensor pattern (example: \"^vae\\.=f16,model\\.=q8_0\")",
+         (int)',',
         &tensor_type_rules},
        {"",
         "--photo-maker",
         "path to PHOTOMAKER model",
+         0,
         &photo_maker_path},
+        {"",
+         "--pulid-weights",
+         "path to PuLID Flux weights",
+         0,
+         &pulid_weights_path},
        {"",
         "--upscale-model",
         "path to esrgan model.",
+         0,
         &esrgan_path},
        {"",
         "--backend",
         "runtime backend assignment, e.g. cpu or clip=cpu,vae=cuda0,diffusion=vulkan0",
+         (int)',',
         &backend},
        {"",
         "--params-backend",
         "parameter backend assignment, e.g. disk, cpu, or diffusion=disk,clip=cpu",
+         (int)',',
         &params_backend},
        {"",
         "--rpc-servers",
         "comma-separated list of RPC servers to connect to for offloading, in the format host:port, e.g. localhost:50052,192.168.1.3:50052",
+         (int)',',
         &rpc_servers},
+        {"",
+         "--max-vram",
+         "maximum VRAM budget in GiB for graph-cut segmented execution. Accepts a single value or assignments by backend/device, e.g. 6 or cuda0=6,vulkan0=4. 0 disables graph splitting; a negative value auto-detects free VRAM, sparing the specified value",
+         0,
+         &max_vram},
    };

    options.int_options = {
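The `--max-vram` help text describes three cases: zero, a positive budget, and a negative value that spares some free VRAM. The sketch below shows one way those cases could resolve to an effective budget for a single device. `resolve_vram_budget` is a hypothetical helper, and clamping at zero is an assumption of this sketch rather than documented behavior.

```cpp
// Hypothetical helper: resolves one --max-vram value (in GiB) against the
// free VRAM reported for a device. A return value of 0 means graph splitting
// is disabled.
double resolve_vram_budget(double requested_gib, double free_gib) {
    if (requested_gib == 0.0) {
        return 0.0;  // 0 disables graph splitting
    }
    if (requested_gib > 0.0) {
        return requested_gib;  // explicit budget
    }
    // Negative value: use the free VRAM minus the amount to spare.
    const double budget = free_gib + requested_gib;
    return budget > 0.0 ? budget : 0.0;  // clamp at zero (assumption)
}
```

For example, a request of `-0.5` on a device with 8 GiB free resolves to a 7.5 GiB budget under these assumptions.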
@ -445,18 +492,15 @@ ArgOptions SDContextParams::get_options() {
         &chroma_t5_mask_pad},
    };

-    options.float_options = {
-        {"",
-         "--max-vram",
-         "maximum VRAM budget in GiB for graph-cut segmented execution. 0 disables graph splitting; a negative value auto-detects free VRAM, sparing the specified value (e.g. -0.5 will keep at least 0.5 GiB free)",
-         &max_vram},
-    };
-
    options.bool_options = {
        {"",
         "--stream-layers",
         "enable residency+prefetch streaming on top of --max-vram (no effect without --max-vram; defaults to false)",
         true, &stream_layers},
+        {"",
+         "--eager-load",
+         "load all params into the params backend at model-load time instead of lazily on first use (defaults to false)",
+         true, &eager_load},
        {"",
         "--force-sdxl-vae-conv-scale",
         "force use of conv scale on sdxl vae",
@ -758,8 +802,9 @@ std::string SDContextParams::to_string() const {
        << "  rng_type: " << sd_rng_type_name(rng_type) << ",\n"
        << "  sampler_rng_type: " << sd_rng_type_name(sampler_rng_type) << ",\n"
        << "  offload_params_to_cpu: " << (offload_params_to_cpu ? "true" : "false") << ",\n"
-        << "  max_vram: " << max_vram << ",\n"
+        << "  max_vram: \"" << max_vram << "\",\n"
        << "  stream_layers: " << (stream_layers ? "true" : "false") << ",\n"
+        << "  eager_load: " << (eager_load ? "true" : "false") << ",\n"
        << "  backend: \"" << backend << "\",\n"
        << "  params_backend: \"" << params_backend << "\",\n"
        << "  enable_mmap: " << (enable_mmap ? "true" : "false") << ",\n"
@ -815,6 +860,7 @@ sd_ctx_params_t SDContextParams::to_sd_ctx_params_t(bool taesd_preview) {
    sd_ctx_params.embeddings                      = embedding_vec.data();
    sd_ctx_params.embedding_count                 = static_cast<uint32_t>(embedding_vec.size());
    sd_ctx_params.photo_maker_path                = photo_maker_path.c_str();
+    sd_ctx_params.pulid_weights_path              = pulid_weights_path.c_str();
    sd_ctx_params.tensor_type_rules               = tensor_type_rules.c_str();
    sd_ctx_params.n_threads                       = n_threads;
    sd_ctx_params.wtype                           = wtype;
@ -836,8 +882,9 @@ sd_ctx_params_t SDContextParams::to_sd_ctx_params_t(bool taesd_preview) {
    sd_ctx_params.chroma_t5_mask_pad              = chroma_t5_mask_pad;
    sd_ctx_params.qwen_image_zero_cond_t          = qwen_image_zero_cond_t;
    sd_ctx_params.vae_format                      = str_to_vae_format(vae_format);
-    sd_ctx_params.max_vram                        = max_vram;
+    sd_ctx_params.max_vram                        = max_vram.c_str();
    sd_ctx_params.stream_layers                   = stream_layers;
+    sd_ctx_params.eager_load                      = eager_load;
    sd_ctx_params.backend                         = effective_backend.c_str();
    sd_ctx_params.params_backend                  = effective_params_backend.c_str();
    sd_ctx_params.rpc_servers                     = rpc_servers.c_str();
@ -855,54 +902,71 @@ ArgOptions SDGenerationParams::get_options() {
        {"-p",
         "--prompt",
         "the prompt to render",
+         0,
         &prompt},
        {"-n",
         "--negative-prompt",
         "the negative prompt (default: \"\")",
+         0,
         &negative_prompt},
        {"-i",
         "--init-img",
         "path to the init image",
+         0,
         &init_image_path},
        {"",
         "--end-img",
         "path to the end image, required by flf2v",
+         0,
         &end_image_path},
        {"",
         "--mask",
         "path to the mask image",
+         0,
         &mask_image_path},
        {"",
         "--control-image",
         "path to control image, control net",
+         0,
         &control_image_path},
        {"",
         "--control-video",
         "path to control video frames, It must be a directory path. The video frames inside should be stored as images in "
         "lexicographical (character) order. For example, if the control video path is `frames`, the directory contain images "
         "such as 00.png, 01.png, ... etc.",
+         0,
         &control_video_path},
        {"",
         "--pm-id-images-dir",
         "path to PHOTOMAKER input id images dir",
+         0,
         &pm_id_images_dir},
        {"",
         "--pm-id-embed-path",
         "path to PHOTOMAKER v2 id embed",
+         0,
         &pm_id_embed_path},
+        {"",
+         "--pulid-id-embedding",
+         "path to PuLID id embedding",
+         0,
+         &pulid_id_embedding_path},
        {"",
         "--hires-upscaler",
         "highres fix upscaler, Lanczos, Nearest, Latent, Latent (nearest), Latent (nearest-exact), "
         "Latent (antialiased), Latent (bicubic), Latent (bicubic antialiased), or a model name "
         "under --hires-upscalers-dir (default: Latent)",
+         0,
         &hires_upscaler},
        {"",
         "--extra-sample-args",
-         "extra sampler/scheduler/guidance args, key=value list. APG supports apg_eta, apg_momentum, apg_norm_threshold, apg_norm_threshold_smoothing; SLG supports slg_uncond; lcm supports noise_clip_std, noise_scale_start, noise_scale_end; ltx2 supports max_shift, base_shift, stretch, terminal; euler_ge supports gamma",
+         "extra sampler/scheduler/guidance args, key=value list. CFG supports guidance_schedule; APG supports apg_eta, apg_momentum, apg_norm_threshold, apg_norm_threshold_smoothing; SLG supports slg_uncond; lcm supports noise_clip_std, noise_scale_start, noise_scale_end; ltx2 supports max_shift, base_shift, stretch, terminal; euler_ge supports gamma;",
+         (int)',',
         &extra_sample_args},
        {"",
         "--extra-tiling-args",
         "extra VAE tiling args, key=value list. LTX video VAE supports temporal_tile_frames (default: 4), temporal_tile_overlap (default: 1)",
+         (int)',',
         &extra_tiling_args},
    };

@ -1040,6 +1104,10 @@ ArgOptions SDGenerationParams::get_options() {
         "--pm-style-strength",
         "",
         &pm_style_strength},
+        {"",
+         "--pulid-id-weight",
+         "strength of PuLID identity injection",
+         &pulid_id_weight},
        {"",
         "--control-strength",
         "strength to apply Control Net (default: 0.9). 1.0 corresponds to full destruction of information in init image",
@ -1354,6 +1422,42 @@ ArgOptions SDGenerationParams::get_options() {
        return 1;
    };

+    auto on_prompt_file_arg = [&](int argc, const char** argv, int index) {
+        if (++index >= argc) {
+            return -1;
+        }
+        const char* arg = argv[index];
+        std::ifstream f(arg, std::ios::binary);
+        try {
+            prompt = std::string(std::istreambuf_iterator<char>{f}, {});
+        } catch (const std::ios_base::failure&) {
+            f.setstate(std::ios_base::failbit);
+        }
+        if (f.fail()) {
+            LOG_ERROR("error: failed to read prompt file '%s'\n", arg);
+            return -1;
+        }
+        return 1;
+    };
+
+    auto on_negative_prompt_file_arg = [&](int argc, const char** argv, int index) {
+        if (++index >= argc) {
+            return -1;
+        }
+        const char* arg = argv[index];
+        std::ifstream f(arg, std::ios::binary);
+        try {
+            negative_prompt = std::string(std::istreambuf_iterator<char>{f}, {});
+        } catch (const std::ios_base::failure&) {
+            f.setstate(std::ios_base::failbit);
+        }
+        if (f.fail()) {
+            LOG_ERROR("error: failed to read negative prompt file '%s'\n", arg);
+            return -1;
+        }
+        return 1;
+    };
+
    options.manual_options = {
        {"-s",
         "--seed",
@ -1417,6 +1521,14 @@ ArgOptions SDGenerationParams::get_options() {
         "--vae-relative-tile-size",
         "relative tile size for vae tiling, format [X]x[Y], in fraction of image size if < 1, in number of tiles per dim if >=1 (overrides --vae-tile-size)",
         on_relative_tile_size_arg},
+        {"",
+         "--prompt-file",
+         "path to the file containing the prompt to render",
+         on_prompt_file_arg},
+        {"",
+         "--negative-prompt-file",
+         "path to the file containing the negative prompt",
+         on_negative_prompt_file_arg},

    };

@ -2272,6 +2384,11 @@ sd_img_gen_params_t SDGenerationParams::to_sd_img_gen_params_t() {
        pm_style_strength,
    };

+    sd_pulid_params_t pulid_params = {
+        pulid_id_embedding_path.empty() ? nullptr : pulid_id_embedding_path.c_str(),
+        pulid_id_weight,
+    };
+
    params.loras                 = lora_vec.empty() ? nullptr : lora_vec.data();
    params.lora_count            = static_cast<uint32_t>(lora_vec.size());
    params.prompt                = prompt.c_str();
@ -2292,6 +2409,7 @@ sd_img_gen_params_t SDGenerationParams::to_sd_img_gen_params_t() {
    params.control_image         = control_image.get();
    params.control_strength      = control_strength;
    params.pm_params             = pm_params;
+    params.pulid_params          = pulid_params;
    params.vae_tiling_params     = vae_tiling_params;
    params.cache                 = cache_params;

--- a/examples/common/common.h
+++ b/examples/common/common.h
@ -31,6 +31,7 @@ struct StringOption {
    std::string short_name;
    std::string long_name;
    std::string desc;
+    int concat;
    std::string* target;
 };

@ -133,6 +134,7 @@ struct SDContextParams {
    std::string control_net_path;
    std::string embedding_dir;
    std::string photo_maker_path;
+    std::string pulid_weights_path;
    sd_type_t wtype = SD_TYPE_COUNT;
    std::string tensor_type_rules;
    std::string lora_model_dir = ".";
@ -144,8 +146,9 @@ struct SDContextParams {
    rng_type_t rng_type         = CUDA_RNG;
    rng_type_t sampler_rng_type = RNG_TYPE_COUNT;
    bool offload_params_to_cpu  = false;
-    float max_vram              = 0.f;
+    std::string max_vram        = "0";
    bool stream_layers          = false;
+    bool eager_load             = false;
    std::string backend;
    std::string params_backend;
    std::string rpc_servers;
@ -234,6 +237,9 @@ struct SDGenerationParams {
    std::string pm_id_embed_path;
    float pm_style_strength = 20.f;

+    std::string pulid_id_embedding_path;
+    float pulid_id_weight = 1.0f;
+
    int upscale_repeats   = 1;
    int upscale_tile_size = 128;

--- a/examples/server/frontend
+++ b/examples/server/frontend
@ -1 +1 @@
-Subproject commit 797ccf80825cc035508ba9b599b2a21953e7f835
+Subproject commit c4bce3d6b3f236614cca21014f076083b7270ba8
--- a/examples/server/runtime.cpp
+++ b/examples/server/runtime.cpp
@ -190,8 +190,8 @@ ArgOptions SDSvrParams::get_options() {
    ArgOptions options;

    options.string_options = {
-        {"-l", "--listen-ip", "server listen ip (default: 127.0.0.1)", &listen_ip},
-        {"", "--serve-html-path", "path to HTML file to serve at root (optional)", &serve_html_path},
+        {"-l", "--listen-ip", "server listen ip (default: 127.0.0.1)", 0, &listen_ip},
+        {"", "--serve-html-path", "path to HTML file to serve at root (optional)", 0, &serve_html_path},
    };

    options.int_options = {
--- a/2
+++ b/2
@ -1 +1 @@
-Subproject commit 0ce7ad348a3151e1da9f65d962044546bcaad421
+Subproject commit 3af5f5760e19a96427f5f7a93b79cbdf3d4b265b
--- a/include/stable-diffusion.h
+++ b/include/stable-diffusion.h
@ -195,6 +195,7 @@ typedef struct {
    const sd_embedding_t* embeddings;
    uint32_t embedding_count;
    const char* photo_maker_path;
+    const char* pulid_weights_path;
    const char* tensor_type_rules;
    int n_threads;
    enum sd_type_t wtype;
@ -216,8 +217,9 @@ typedef struct {
    int chroma_t5_mask_pad;
    bool qwen_image_zero_cond_t;
    enum sd_vae_format_t vae_format;
-    float max_vram;  // GiB budget for graph-cut segmented param offload (0 = disabled, -1 = auto free VRAM minus 1 GiB)
+    const char* max_vram;  // GiB budget or backend assignment spec for graph-cut segmented param offload (0 = disabled, -1 = auto)
    bool stream_layers;  // Enable residency+prefetch streaming on top of --max-vram (no effect without --max-vram)
+    bool eager_load;  // Load all params into the params backend at model-load time instead of lazily on first use
    const char* backend;
    const char* params_backend;
    const char* rpc_servers;
@ -272,6 +274,11 @@ typedef struct {
    float style_strength;
 } sd_pm_params_t;  // photo maker

+typedef struct {
+    const char* id_embedding_path;
+    float id_weight;
+} sd_pulid_params_t;
+
 enum sd_cache_mode_t {
    SD_CACHE_DISABLED = 0,
    SD_CACHE_EASYCACHE,
@ -364,6 +371,7 @@ typedef struct {
    sd_image_t control_image;
    float control_strength;
    sd_pm_params_t pm_params;
+    sd_pulid_params_t pulid_params;
    sd_tiling_params_t vae_tiling_params;
    sd_cache_params_t cache;
    sd_hires_params_t hires;
@ -445,6 +453,17 @@ SD_API void sd_img_gen_params_init(sd_img_gen_params_t* sd_img_gen_params);
 SD_API char* sd_img_gen_params_to_str(const sd_img_gen_params_t* sd_img_gen_params);
 SD_API sd_image_t* generate_image(sd_ctx_t* sd_ctx, const sd_img_gen_params_t* sd_img_gen_params);

+enum sd_cancel_mode_t {
+    // Stop the current generation as soon as possible.
+    SD_CANCEL_ALL,
+    // Finish the current image sample, then skip additional batch latents and return completed images.
+    SD_CANCEL_NEW_LATENTS,
+    // Clear a pending cancellation request.
+    SD_CANCEL_RESET
+};
+
+SD_API void sd_cancel_generation(sd_ctx_t* sd_ctx, enum sd_cancel_mode_t mode);
+
 SD_API void sd_vid_gen_params_init(sd_vid_gen_params_t* sd_vid_gen_params);
 SD_API bool generate_video(sd_ctx_t* sd_ctx,
                           const sd_vid_gen_params_t* sd_vid_gen_params,
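
The comments on `sd_cancel_mode_t` describe three behaviours. The following is a toy model of how a batch loop could honor them, not the library's implementation: `ToyContext`, `toy_cancel_generation`, `toy_generate_batch`, and the `TOY_*` enums are hypothetical names, and returning the count of completed images after `TOY_CANCEL_ALL` is an assumption.

```cpp
#include <atomic>
#include <cassert>
#include <functional>

// Hypothetical mirror of sd_cancel_mode_t (prefixed to mark it as illustrative).
enum toy_cancel_mode_t { TOY_CANCEL_ALL, TOY_CANCEL_NEW_LATENTS, TOY_CANCEL_RESET };

// Hypothetical pending-request state; the real context layout is not shown in the diff.
enum toy_pending_t { TOY_PENDING_NONE, TOY_PENDING_ALL, TOY_PENDING_NEW_LATENTS };

struct ToyContext {
    std::atomic<int> pending{TOY_PENDING_NONE};
};

// Stand-in for sd_cancel_generation(): only records the request.
inline void toy_cancel_generation(ToyContext& ctx, toy_cancel_mode_t mode) {
    switch (mode) {
        case TOY_CANCEL_ALL:         ctx.pending = TOY_PENDING_ALL; break;
        case TOY_CANCEL_NEW_LATENTS: ctx.pending = TOY_PENDING_NEW_LATENTS; break;
        case TOY_CANCEL_RESET:       ctx.pending = TOY_PENDING_NONE; break;
    }
}

// Runs `batch` images of `steps` sampling steps each and returns how many
// images completed. `on_step` runs after every step so a caller can request
// cancellation in the middle of a run.
inline int toy_generate_batch(ToyContext& ctx, int batch, int steps,
                              const std::function<void(int, int)>& on_step) {
    int completed = 0;
    for (int image = 0; image < batch; image++) {
        // NEW_LATENTS: never start another image once the request is pending.
        if (ctx.pending.load() == TOY_PENDING_NEW_LATENTS) {
            break;
        }
        bool aborted = false;
        for (int step = 0; step < steps; step++) {
            // ALL: abandon the image that is currently being sampled.
            if (ctx.pending.load() == TOY_PENDING_ALL) {
                aborted = true;
                break;
            }
            on_step(image, step);
        }
        if (aborted) {
            break;
        }
        completed++;
    }
    return completed;
}
```

In this model a `TOY_CANCEL_NEW_LATENTS` request made while image 1 is sampling still lets image 1 finish, whereas `TOY_CANCEL_ALL` drops it.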
--- a/script/pulid_extract_id.py
+++ b/script/pulid_extract_id.py
@ -0,0 +1,134 @@
+"""
+Precompute a PuLID-Flux identity embedding from a single source portrait.
+
+Writes a gguf file (a single tensor `pulid_id`) that stable-diffusion.cpp's
+`--pulid-id-embedding` flag consumes.
+
+Dependencies (recommended: vendor rather than pip-install due to upstream
+packaging quirks):
+  - torch + safetensors
+  - The ToTheBeginning/PuLID repository's `pulid/` package and `eva_clip/`.
+    Put them on PYTHONPATH or sys.path before running this script.
+  - insightface, facexlib, torchvision, opencv-python, huggingface_hub, gguf
+  - numpy, Pillow
+
+Usage:
+  python script/pulid_extract_id.py \\
+    --portrait /path/to/source-photo.jpg \\
+    --pulid-weights /path/to/pulid_flux_v0.9.1.safetensors \\
+    --out /path/to/source.pulidembd
+
+The portrait must contain a clearly visible face. insightface's antelopev2
+detector will be auto-downloaded on first run.
+"""
+
+from __future__ import annotations
+
+import argparse
+import os
+import sys
+from types import SimpleNamespace
+
+
+def extract(portrait_path: str, pulid_weights: str) -> "torch.Tensor":
+    import numpy as np
+    import torch
+    from PIL import Image
+    from pulid.pipeline_flux import PuLIDPipeline
+
+    if torch.cuda.is_available():
+        device, onnx_provider = "cuda", "gpu"
+    else:
+        device, onnx_provider = "cpu", "cpu"
+
+    print(f"device={device}", flush=True)
+
+    # PuLIDPipeline only attaches pulid_ca attributes to `dit` during
+    # construction; get_id_embedding() never runs Flux, so a dummy object is
+    # enough and avoids importing/building a Flux skeleton.
+    print("instantiating PuLIDPipeline with a dummy Flux object", flush=True)
+    dit = SimpleNamespace()
+    pulid = PuLIDPipeline(dit=dit,
+                          device=device,
+                          weight_dtype=torch.bfloat16,
+                          onnx_provider=onnx_provider)
+
+    print(f"loading PuLID weights from {pulid_weights}", flush=True)
+    pulid.load_pretrain(pretrain_path=pulid_weights, version="v0.9.1")
+
+    print(f"extracting ID embedding from {portrait_path}", flush=True)
+    face_img = np.array(Image.open(portrait_path).convert("RGB"))
+    id_embedding, _ = pulid.get_id_embedding(face_img)
+    print(f"id embedding shape={tuple(id_embedding.shape)} dtype={id_embedding.dtype}",
+          flush=True)
+
+    if id_embedding.ndim == 3 and id_embedding.shape[0] == 1:
+        id_embedding = id_embedding[0]
+    return id_embedding
+
+
+def write_embd(tensor, out_path: str, dtype_choice: str) -> None:
+    import gguf
+    import torch
+
+    if tensor.ndim != 2:
+        raise ValueError(f"expected (num_tokens, token_dim); got {tuple(tensor.shape)}")
+    num_tokens, token_dim = tensor.shape
+
+    os.makedirs(os.path.dirname(out_path) or ".", exist_ok=True)
+
+    writer = gguf.GGUFWriter(out_path, arch="pulid")
+    writer.add_uint32("pulid.version", 1)
+
+    if dtype_choice == "fp16":
+        arr = tensor.to(torch.float16).contiguous().cpu().numpy()
+        writer.add_tensor("pulid_id", arr)
+    elif dtype_choice == "fp32":
+        arr = tensor.to(torch.float32).contiguous().cpu().numpy()
+        writer.add_tensor("pulid_id", arr)
+    elif dtype_choice == "bf16":
+        raw = tensor.to(torch.bfloat16).contiguous().view(torch.uint16).cpu().numpy()
+        writer.add_tensor("pulid_id", raw,
+                          raw_shape=(int(num_tokens), int(token_dim)),
+                          raw_dtype=gguf.GGMLQuantizationType.BF16)
+    else:
+        raise ValueError(f"unknown --dtype {dtype_choice}")
+
+    writer.write_header_to_file()
+    writer.write_kv_data_to_file()
+    writer.write_tensors_to_file()
+    writer.close()
+
+    print(f"wrote {out_path}: gguf, tensor pulid_id [{token_dim}, {num_tokens}] {dtype_choice}",
+          flush=True)
+
+
+def main() -> int:
+    ap = argparse.ArgumentParser(
+        description=__doc__,
+        formatter_class=argparse.RawDescriptionHelpFormatter)
+    ap.add_argument("--portrait", required=True,
+                    help="Path to the source portrait image (JPG/PNG).")
+    ap.add_argument("--pulid-weights", required=True,
+                    help="Path to the PuLID-Flux weights, e.g. pulid_flux_v0.9.1.safetensors (loaded as version v0.9.1).")
+    ap.add_argument("--out", required=True,
+                    help="Output path for the gguf embedding file (e.g. source.pulidembd).")
+    ap.add_argument("--dtype", default="fp16",
+                    choices=["fp16", "bf16", "fp32"],
+                    help="Storage dtype (default fp16; produces ~131 KB).")
+    args = ap.parse_args()
+
+    if not os.path.exists(args.portrait):
+        print(f"ERROR: portrait not found at {args.portrait}", file=sys.stderr)
+        return 2
+    if not os.path.exists(args.pulid_weights):
+        print(f"ERROR: PuLID weights not found at {args.pulid_weights}", file=sys.stderr)
+        return 3
+
+    embedding = extract(args.portrait, args.pulid_weights)
+    write_embd(embedding, args.out, args.dtype)
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
--- a/src/conditioning/conditioner.hpp
+++ b/src/conditioning/conditioner.hpp
@ -142,8 +142,7 @@ struct FrozenCLIPEmbedderWithCustomWords : public Conditioner {
                                      std::shared_ptr<RunnerWeightManager> weight_manager = nullptr)
        : version(version), tokenizer(sd_version_is_sd2(version) ? 0 : 49407) {
        for (const auto& kv : orig_embedding_map) {
-            std::string name = kv.first;
-            std::transform(name.begin(), name.end(), name.begin(), [](unsigned char c) { return std::tolower(c); });
+            std::string name    = normalize_embedding_name(kv.first);
            embedding_map[name] = kv.second;
            tokenizer.add_special_token(name);
        }
@ -278,17 +277,23 @@ struct FrozenCLIPEmbedderWithCustomWords : public Conditioner {
        return true;
    }

+    static std::string normalize_embedding_name(std::string name) {
+        std::transform(name.begin(), name.end(), name.begin(), [](unsigned char c) { return std::tolower(c); });
+        return name;
+    }
+
+    bool append_embedding_tokens(std::string str, std::vector<int32_t>& bpe_tokens) {
+        std::string name = normalize_embedding_name(std::move(str));
+        auto iter        = embedding_map.find(name);
+        if (iter == embedding_map.end()) {
+            return false;
+        }
+        return load_embedding(name, iter->second, bpe_tokens);
+    }
+
    std::vector<int> convert_token_to_id(std::string text) {
        auto on_new_token_cb = [&](std::string& str, std::vector<int32_t>& bpe_tokens) -> bool {
-            auto iter = embedding_map.find(str);
-            if (iter == embedding_map.end()) {
-                return false;
-            }
-            std::string embedding_path = iter->second;
-            if (load_embedding(str, embedding_path, bpe_tokens)) {
-                return true;
-            }
-            return false;
+            return append_embedding_tokens(str, bpe_tokens);
        };
        std::vector<int> curr_tokens = tokenizer.encode(text, on_new_token_cb);
        return curr_tokens;
@ -315,15 +320,7 @@ struct FrozenCLIPEmbedderWithCustomWords : public Conditioner {
        }

        auto on_new_token_cb = [&](std::string& str, std::vector<int32_t>& bpe_tokens) -> bool {
-            auto iter = embedding_map.find(str);
-            if (iter == embedding_map.end()) {
-                return false;
-            }
-            std::string embedding_path = iter->second;
-            if (load_embedding(str, embedding_path, bpe_tokens)) {
-                return true;
-            }
-            return false;
+            return append_embedding_tokens(str, bpe_tokens);
        };

        std::vector<int> tokens;
@ -1521,7 +1518,7 @@ struct LLMEmbedder : public Conditioner {
            arch = LLM::LLMArch::GPT_OSS_20B;
        } else if (sd_version_is_pid(version)) {
            arch = LLM::LLMArch::GEMMA2_2B;
-        } else if (sd_version_is_ideogram4(version)) {
+        } else if (sd_version_is_ideogram4(version) || sd_version_is_boogu_image(version)) {
            arch = LLM::LLMArch::QWEN3_VL;
        } else if (sd_version_is_z_image(version) || version == VERSION_OVIS_IMAGE || version == VERSION_FLUX2_KLEIN) {
            arch = LLM::LLMArch::QWEN3;
@ -1781,6 +1778,65 @@ struct LLMEmbedder : public Conditioner {

                prompt += "<|im_end|>\n<|im_start|>assistant\n";
            }
+        } else if (sd_version_is_boogu_image(version)) {
+            prompt_template_encode_start_idx = 0;
+
+            const std::string t2i_system_prompt =
+                "You are a helpful assistant that generates high-quality images based on user instructions. The instructions are as follows.";
+            const std::string edit_system_prompt =
+                "Describe the key features of the input image (color, shape, size, texture, objects, background), then explain how the user's text instruction should alter or modify the image. Generate a new image that meets the user's requirements while maintaining consistency with the original input where appropriate.";
+            const bool has_ref_images = llm->enable_vision && conditioner_params.ref_images != nullptr && !conditioner_params.ref_images->empty();
+            const bool text_empty     = conditioner_params.text.find_first_not_of(" \t\r\n") == std::string::npos;
+
+            if (has_ref_images) {
+                LOG_INFO("BooguImageEditPipeline");
+                const std::string prompt_prefix = "<|im_start|>system\n" + edit_system_prompt + "<|im_end|>\n<|im_start|>user\n";
+                std::string img_prompt;
+                const std::string placeholder = "<|image_pad|>";
+
+                for (int i = 0; i < conditioner_params.ref_images->size(); i++) {
+                    const auto& image = (*conditioner_params.ref_images)[i];
+                    double factor     = llm->config.vision.patch_size * llm->config.vision.spatial_merge_size;
+                    int height        = static_cast<int>(image.shape()[1]);
+                    int width         = static_cast<int>(image.shape()[0]);
+                    double beta       = std::sqrt((384.0 * 384.0) / (static_cast<double>(height) * static_cast<double>(width)));
+                    int h_bar         = std::max(static_cast<int>(factor),
+                                                 static_cast<int>(std::round(height * beta / factor)) * static_cast<int>(factor));
+                    int w_bar         = std::max(static_cast<int>(factor),
+                                                 static_cast<int>(std::round(width * beta / factor)) * static_cast<int>(factor));
+
+                    LOG_DEBUG("resize conditioner ref image %d from %dx%d to %dx%d", i, height, width, h_bar, w_bar);
+
+                    auto resized_image = clip_preprocess(image, w_bar, h_bar);
+                    auto image_embed   = llm->encode_image(n_threads, resized_image, false, true, true);
+                    GGML_ASSERT(!image_embed.empty());
+
+                    std::string image_prefix = prompt_prefix + img_prompt + "<|vision_start|>";
+                    int image_embed_idx      = static_cast<int>(tokenizer->encode(image_prefix, nullptr).size());
+                    image_embeds.emplace_back(image_embed_idx, image_embed);
+
+                    img_prompt += "<|vision_start|>";
+                    int64_t num_image_tokens = image_embed.shape()[1];
+                    img_prompt.reserve(img_prompt.size() + static_cast<size_t>(num_image_tokens) * placeholder.size() + 32);
+                    for (int j = 0; j < num_image_tokens; j++) {
+                        img_prompt += placeholder;
+                    }
+                    img_prompt += "<|vision_end|>";
+                }
+
+                prompt                  = prompt_prefix + img_prompt;
+                prompt_attn_range.first = static_cast<int>(prompt.size());
+                prompt += conditioner_params.text;
+                prompt_attn_range.second = static_cast<int>(prompt.size());
+                prompt += "<|im_end|>\n";
+            } else {
+                const std::string& system_prompt = text_empty ? edit_system_prompt : t2i_system_prompt;
+                prompt                           = "<|im_start|>system\n" + system_prompt + "<|im_end|>\n<|im_start|>user\n";
+                prompt_attn_range.first          = static_cast<int>(prompt.size());
+                prompt += conditioner_params.text;
+                prompt_attn_range.second = static_cast<int>(prompt.size());
+                prompt += "<|im_end|>\n";
+            }
        } else if (sd_version_is_longcat(version)) {
            spell_quotes = true;
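
The Boogu edit branch above rescales each reference image to roughly 384x384 pixels of area, rounds each side to a multiple of `patch_size * spatial_merge_size`, and clamps it to at least one such unit. A minimal sketch of that arithmetic in isolation; `toy_ref_image_size` is a hypothetical helper, and the factor of 32 used in the usage note is an assumed example value, not one read from the model config.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <utility>

// Returns {h_bar, w_bar}: the input scaled to about 384*384 pixels of area,
// each side rounded to a multiple of `factor` and clamped to >= `factor`.
inline std::pair<int, int> toy_ref_image_size(int height, int width, int factor) {
    const double beta = std::sqrt((384.0 * 384.0) /
                                  (static_cast<double>(height) * static_cast<double>(width)));
    const int h_bar = std::max(factor,
                               static_cast<int>(std::round(height * beta / factor)) * factor);
    const int w_bar = std::max(factor,
                               static_cast<int>(std::round(width * beta / factor)) * factor);
    return {h_bar, w_bar};
}
```

With an assumed factor of 32, a 768x768 input maps to 384x384, and a 1024x512 (height x width) input maps to 544x256. The clamp only matters for extreme aspect ratios, where rounding would otherwise produce a zero-sized side.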

--- a/src/convert.cpp
+++ b/src/convert.cpp
@ -99,7 +99,7 @@ bool convert(const char* input_path,
        model_loader.convert_tensors_name();
    }

-    ggml_type type             = (ggml_type)output_type;
+    ggml_type type             = sd_type_to_ggml_type(output_type);
    bool output_is_safetensors = ends_with(output_path, ".safetensors");
    TensorTypeRules type_rules = parse_tensor_type_rules(tensor_type_rules);

--- a/src/core/ggml_extend_backend.cpp
+++ b/src/core/ggml_extend_backend.cpp
@ -280,7 +280,7 @@ static std::string get_default_backend_name() {
    return resolve_first_device_by_type(GGML_BACKEND_DEVICE_TYPE_CPU);
 }

-static std::string sd_resolve_backend_name(const std::string& name) {
+std::string sd_backend_resolve_name(const std::string& name) {
    ggml_backend_load_all_once();
    std::string requested = trim_copy(name);
    std::string lower     = lower_copy(requested);
@ -318,7 +318,7 @@ static std::string sd_resolve_backend_name(const std::string& name) {
 }

 static bool backend_name_exists(const std::string& name) {
-    return !sd_resolve_backend_name(name).empty();
+    return !sd_backend_resolve_name(name).empty();
 }

 static ggml_backend_t init_named_backend(const std::string& name) {
@ -328,7 +328,7 @@ static ggml_backend_t init_named_backend(const std::string& name) {
        return ggml_backend_init_best();
    }

-    std::string resolved = sd_resolve_backend_name(name);
+    std::string resolved = sd_backend_resolve_name(name);
    if (resolved.empty()) {
        return nullptr;
    }
@ -599,7 +599,7 @@ bool SDBackendManager::validate(std::string* error) const {
            }
            return false;
        }
-        if (!sd_resolve_backend_name(name).empty()) {
+        if (!sd_backend_resolve_name(name).empty()) {
            return true;
        }
        if (error != nullptr) {
@ -632,7 +632,7 @@ bool SDBackendManager::validate(std::string* error) const {
 }

 ggml_backend_t SDBackendManager::init_cached_backend(const std::string& name) {
-    std::string resolved   = sd_resolve_backend_name(name);
+    std::string resolved   = sd_backend_resolve_name(name);
    std::string key        = lower_copy(resolved);
    ggml_backend_t backend = nullptr;

--- a/src/core/ggml_extend_backend.h
+++ b/src/core/ggml_extend_backend.h
@ -71,6 +71,7 @@ bool sd_backend_is(ggml_backend_t backend, const std::string& name);
 bool sd_backend_is_cpu(ggml_backend_t backend);
 ggml_backend_t sd_backend_cpu_init();
 bool sd_backend_cpu_set_n_threads(ggml_backend_t backend_cpu, int n_threads);
+std::string sd_backend_resolve_name(const std::string& name);
 const char* sd_backend_module_name(SDBackendModule module);
 void ggml_ext_im_set_f32_1d(const struct ggml_tensor* tensor, int i, float value);
 bool add_rpc_devices(const std::string& servers);
--- a/src/core/ggml_graph_cut.cpp
+++ b/src/core/ggml_graph_cut.cpp
@ -1,6 +1,8 @@
 #include "core/ggml_graph_cut.h"

 #include <algorithm>
+#include <cctype>
+#include <cmath>
 #include <cstring>
 #include <map>
 #include <set>
@ -8,6 +10,7 @@
 #include <stack>
 #include <unordered_map>

+#include "core/ggml_extend_backend.h"
 #include "core/util.h"
 #include "ggml-alloc.h"
 #include "ggml-backend.h"
@ -83,6 +86,157 @@ namespace sd::ggml_graph_cut {
               segment.output_bytes;
    }

+    static std::string lower_ascii_copy(std::string value) {
+        std::transform(value.begin(), value.end(), value.begin(), [](unsigned char c) {
+            return static_cast<char>(std::tolower(c));
+        });
+        return value;
+    }
+
+    static std::string normalize_backend_budget_key(const std::string& value) {
+        return lower_ascii_copy(trim(value));
+    }
+
+    static bool is_default_max_vram_key(const std::string& key) {
+        std::string normalized = normalize_backend_budget_key(key);
+        return normalized == "all" || normalized == "default" || normalized == "*";
+    }
+
+    static bool parse_max_vram_budget_value(const std::string& text, float* value, std::string* error) {
+        float parsed = 0.f;
+        if (!parse_strict_float(text, parsed) || !std::isfinite(parsed)) {
+            if (error != nullptr) {
+                *error = "invalid --max-vram value '" + text + "'";
+            }
+            return false;
+        }
+        *value = parsed;
+        return true;
+    }
+
+    static std::vector<std::string> backend_budget_keys(ggml_backend_t backend) {
+        std::vector<std::string> keys;
+        if (backend == nullptr) {
+            return keys;
+        }
+
+        ggml_backend_dev_t dev = ggml_backend_get_device(backend);
+        if (dev != nullptr) {
+            keys.push_back(normalize_backend_budget_key(ggml_backend_dev_name(dev)));
+        }
+        const char* backend_name = ggml_backend_name(backend);
+        if (backend_name != nullptr) {
+            keys.push_back(normalize_backend_budget_key(backend_name));
+        }
+        return keys;
+    }
+
+    void MaxVramAssignment::reset(float fallback_gib) {
+        default_gib = fallback_gib;
+        backend_gib.clear();
+        resolved_backend_bytes.clear();
+    }
+
+    bool MaxVramAssignment::parse(const std::string& raw_spec, std::string* error) {
+        const std::string in = trim(raw_spec);
+        if (in.empty()) {
+            return true;
+        }
+
+        for (const std::string& raw_part : split_string(in, ',')) {
+            const std::string part = trim(raw_part);
+            if (part.empty()) {
+                continue;
+            }
+
+            const size_t eq = part.find('=');
+            if (eq == std::string::npos) {
+                float value = 0.f;
+                if (!parse_max_vram_budget_value(part, &value, error)) {
+                    return false;
+                }
+                default_gib = value;
+                continue;
+            }
+
+            const std::string key        = trim(part.substr(0, eq));
+            const std::string value_text = trim(part.substr(eq + 1));
+            if (key.empty() || value_text.empty()) {
+                if (error != nullptr) {
+                    *error = "invalid --max-vram assignment '" + part + "'";
+                }
+                return false;
+            }
+
+            float value = 0.f;
+            if (!parse_max_vram_budget_value(value_text, &value, error)) {
+                return false;
+            }
+
+            if (is_default_max_vram_key(key)) {
+                default_gib = value;
+                continue;
+            }
+
+            const std::string backend_key = trim(key);
+            if (backend_key.empty()) {
+                if (error != nullptr) {
+                    *error = "invalid --max-vram backend key in '" + part + "'";
+                }
+                return false;
+            }
+            backend_gib[backend_key] = value;
+        }
+        resolved_backend_bytes.clear();
+        return true;
+    }
+
+    bool MaxVramAssignment::canonicalize_backend_keys(std::string* error) {
+        if (backend_gib.empty()) {
+            return true;
+        }
+
+        std::unordered_map<std::string, float> normalized;
+        for (const auto& kv : backend_gib) {
+            std::string resolved = sd_backend_resolve_name(kv.first);
+            if (resolved.empty()) {
+                if (error != nullptr) {
+                    *error = "unknown --max-vram backend '" + kv.first + "'";
+                }
+                return false;
+            }
+            normalized[normalize_backend_budget_key(resolved)] = kv.second;
+        }
+        backend_gib = std::move(normalized);
+        resolved_backend_bytes.clear();
+        return true;
+    }
+
+    size_t MaxVramAssignment::bytes_for_backend(ggml_backend_t backend) {
+        std::vector<std::string> keys = backend_budget_keys(backend);
+        const std::string cache_key   = keys.empty() ? std::string("<none>") : keys.front();
+        auto cached                   = resolved_backend_bytes.find(cache_key);
+        if (cached != resolved_backend_bytes.end()) {
+            return cached->second;
+        }
+
+        float budget_gib = default_gib;
+        if (!backend_gib.empty()) {
+            for (const std::string& key : keys) {
+                auto backend_it = backend_gib.find(key);
+                if (backend_it != backend_gib.end()) {
+                    budget_gib = backend_it->second;
+                    break;
+                }
+            }
+        }
+
+        const float resolved_gib          = resolve_max_vram_gib(budget_gib, backend);
+        const size_t bytes                = max_vram_gib_to_bytes(resolved_gib);
+        resolved_backend_bytes[cache_key] = bytes;
+        return bytes;
+    }
+
    size_t max_vram_gib_to_bytes(float max_vram) {
        if (max_vram <= 0.f) {
            return 0;
--- a/src/core/ggml_graph_cut.h
+++ b/src/core/ggml_graph_cut.h
@ -4,6 +4,7 @@
 #include <array>
 #include <cstdint>
 #include <string>
+#include <unordered_map>
 #include <unordered_set>
 #include <vector>

@ -68,6 +69,17 @@ namespace sd::ggml_graph_cut {

    static constexpr const char* GGML_RUNNER_CUT_PREFIX = "ggml_runner_cut:";

+    struct MaxVramAssignment {
+        float default_gib = 0.f;
+        std::unordered_map<std::string, float> backend_gib;
+        std::unordered_map<std::string, size_t> resolved_backend_bytes;
+
+        void reset(float fallback_gib);
+        bool parse(const std::string& raw_spec, std::string* error);
+        bool canonicalize_backend_keys(std::string* error);
+        size_t bytes_for_backend(ggml_backend_t backend);
+    };
+
    bool is_graph_cut_tensor(const ggml_tensor* tensor);
    std::string make_graph_cut_name(const std::string& group, const std::string& output);
    void mark_graph_cut(ggml_tensor* tensor, const std::string& group, const std::string& output);
--- a/src/core/util.cpp
+++ b/src/core/util.cpp
@ -406,6 +406,15 @@ std::vector<std::string> split_string(const std::string& str, char delimiter) {
    return result;
 }

+ggml_type sd_type_to_ggml_type(sd_type_t sdtype) {
+    const int type_value = static_cast<int>(sdtype);
+    if (type_value >= 0 && type_value < std::min<int>(SD_TYPE_COUNT, GGML_TYPE_COUNT)) {
+        return static_cast<ggml_type>(type_value);
+    } else {
+        return GGML_TYPE_COUNT;
+    }
+}
+
 KeyValueArgs parse_key_value_args(const char* args, const char* context) {
    KeyValueArgs pairs;

--- a/src/core/util.h
+++ b/src/core/util.h
@ -80,6 +80,8 @@ void pretty_bytes_progress(int step, int steps, uint64_t bytes_processed, float

 void log_printf(sd_log_level_t level, const char* file, int line, const char* format, ...);

+ggml_type sd_type_to_ggml_type(sd_type_t sdtype);
+
 std::string trim(const std::string& s);

 std::vector<std::pair<std::string, float>> parse_prompt_attention(const std::string& text);
--- a/src/extensions/generation_extension.h
+++ b/src/extensions/generation_extension.h
@ -10,6 +10,7 @@

 #include "conditioning/conditioner.hpp"
 #include "core/ggml_extend_backend.h"
+#include "model/diffusion/model.hpp"
 #include "model_loader.h"
 #include "model_manager.h"
 #include "stable-diffusion.h"
@ -30,6 +31,7 @@ struct GenerationExtensionConditionContext {
    Conditioner* conditioner;
    ConditionerParams& condition_params;
    const sd_pm_params_t& pm_params;
+    const sd_pulid_params_t& pulid_params;
    int n_threads;
    int total_steps;
 };
@ -56,8 +58,20 @@ struct GenerationExtension {
                                                const SDCondition& condition) const {
        return condition;
    }
+
+    // Called in the denoise loop for each enabled extension, after the per-step
+    // DiffusionParams (including its version-specific `extra`) has been built,
+    // but before diffusion_model->compute(). Lets an extension feed data into
+    // the diffusion forward that the conditioning-side hooks can't reach -- it
+    // can set/override fields on `params` (typically the architecture-specific
+    // `params.extra`, e.g. a guidance tensor, control payload, or an identity
+    // embedding for an adapter that injects inside the model's blocks). The
+    // extension targets whichever `extra` variant matches the active model.
+    // Mutates `params` only, never the extension. Default no-op.
+    virtual void before_diffusion(DiffusionParams& /*params*/, int /*step*/) const {}
 };

 std::shared_ptr<GenerationExtension> create_photomaker_extension();
+std::shared_ptr<GenerationExtension> create_pulid_extension();

 #endif
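
The `before_diffusion` hook lets an extension write into whichever `extra` variant matches the active model and stay a no-op otherwise. A reduced sketch of that dispatch pattern using `std::variant` and `std::get_if`; every `Toy*` type and field is a hypothetical stand-in, since the real `DiffusionParams` layout is not part of this diff.

```cpp
#include <cassert>
#include <memory>
#include <variant>
#include <vector>

// Toy stand-ins for architecture-specific payloads; field names are illustrative.
struct ToyFluxExtra {
    const std::vector<float>* id = nullptr;
    float id_weight              = 0.f;
};
struct ToyUNetExtra {
    int unused = 0;
};

struct ToyDiffusionParams {
    std::variant<std::monostate, ToyFluxExtra, ToyUNetExtra> extra;
};

struct ToyExtension {
    virtual ~ToyExtension() = default;
    virtual bool is_enabled() const { return true; }
    // Default no-op, like the hook added to GenerationExtension.
    virtual void before_diffusion(ToyDiffusionParams& /*params*/, int /*step*/) const {}
};

struct ToyIdentityExtension : ToyExtension {
    std::vector<float> embedding{0.1f, 0.2f};
    float weight = 0.8f;

    void before_diffusion(ToyDiffusionParams& params, int /*step*/) const override {
        // Only touch the payload when the active model carries the Flux variant.
        if (auto* flux = std::get_if<ToyFluxExtra>(&params.extra)) {
            flux->id        = &embedding;
            flux->id_weight = weight;
        }
    }
};

// Hypothetical denoise-loop helper: runs the hook of every enabled extension.
inline void toy_run_hooks(const std::vector<std::shared_ptr<ToyExtension>>& exts,
                          ToyDiffusionParams& params,
                          int step) {
    for (const auto& ext : exts) {
        if (ext->is_enabled()) {
            ext->before_diffusion(params, step);
        }
    }
}
```

Because the hook is `const` and stores only a pointer into the extension, the extension has to outlive the diffusion call that reads the payload.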
--- a/src/extensions/pulid_extension.cpp
+++ b/src/extensions/pulid_extension.cpp
@ -0,0 +1,123 @@
+#include "extensions/generation_extension.h"
+
+#include <cstring>
+#include <variant>
+
+#include "core/tensor_ggml.hpp"
+#include "core/util.h"
+#include "gguf.h"
+
+static sd::Tensor<float> load_pulid_id_embedding(const char* path) {
+    sd::Tensor<float> empty;
+    if (path == nullptr || strlen(path) == 0) {
+        return empty;
+    }
+
+    struct ggml_context* ctx_data = nullptr;
+    struct gguf_init_params gp    = {/*.no_alloc =*/false, /*.ctx =*/&ctx_data};
+    struct gguf_context* gguf_ctx = gguf_init_from_file(path, gp);
+    if (gguf_ctx == nullptr || ctx_data == nullptr) {
+        LOG_WARN("PuLID id-embedding: cannot read gguf '%s'", path);
+        if (gguf_ctx != nullptr)
+            gguf_free(gguf_ctx);
+        if (ctx_data != nullptr)
+            ggml_free(ctx_data);
+        return empty;
+    }
+
+    struct ggml_tensor* t = ggml_get_tensor(ctx_data, "pulid_id");
+    if (t == nullptr) {
+        LOG_WARN("PuLID id-embedding: no 'pulid_id' tensor in '%s'", path);
+        gguf_free(gguf_ctx);
+        ggml_free(ctx_data);
+        return empty;
+    }
+
+    const int64_t token_dim  = t->ne[0];
+    const int64_t num_tokens = t->ne[1];
+    if (token_dim <= 0 || num_tokens <= 0 || token_dim > 65536 || num_tokens > 1024 ||
+        t->ne[2] != 1 || t->ne[3] != 1) {
+        LOG_WARN("PuLID id-embedding: implausible shape [%lld, %lld] in '%s'",
+                 (long long)token_dim, (long long)num_tokens, path);
+        gguf_free(gguf_ctx);
+        ggml_free(ctx_data);
+        return empty;
+    }
+
+    const size_t n_elem = (size_t)token_dim * (size_t)num_tokens;
+    sd::Tensor<float> out({token_dim, num_tokens, 1});
+    float* dst = out.data();
+    if (t->type == GGML_TYPE_F32) {
+        memcpy(dst, t->data, n_elem * sizeof(float));
+    } else if (t->type == GGML_TYPE_F16) {
+        const ggml_fp16_t* src = reinterpret_cast<const ggml_fp16_t*>(t->data);
+        for (size_t i = 0; i < n_elem; i++) {
+            dst[i] = ggml_fp16_to_fp32(src[i]);
+        }
+    } else if (t->type == GGML_TYPE_BF16) {
+        const ggml_bf16_t* src = reinterpret_cast<const ggml_bf16_t*>(t->data);
+        for (size_t i = 0; i < n_elem; i++) {
+            dst[i] = ggml_bf16_to_fp32(src[i]);
+        }
+    } else {
+        LOG_WARN("PuLID id-embedding: unsupported tensor type %s in '%s'",
+                 ggml_type_name(t->type), path);
+        gguf_free(gguf_ctx);
+        ggml_free(ctx_data);
+        return empty;
+    }
+
+    LOG_INFO("PuLID id-embedding: loaded [%lld, %lld] type=%s from '%s'",
+             (long long)token_dim, (long long)num_tokens, ggml_type_name(t->type), path);
+    gguf_free(gguf_ctx);
+    ggml_free(ctx_data);
+    return out;
+}
+
+struct PuLIDExtension : public GenerationExtension {
+    bool enabled = false;
+    sd::Tensor<float> id_embedding;
+    float id_weight = 1.0f;
+
+    const char* name() const override {
+        return "pulid";
+    }
+
+    bool is_enabled() const override {
+        return enabled;
+    }
+
+    bool init(const GenerationExtensionInitContext& ctx) override {
+        enabled = strlen(SAFE_STR(ctx.params->pulid_weights_path)) > 0;
+        return true;
+    }
+
+    void reset_runtime_condition() override {
+        id_embedding = {};
+        id_weight    = 1.0f;
+    }
+
+    bool prepare_condition(GenerationExtensionConditionContext& ctx) override {
+        reset_runtime_condition();
+        if (!enabled) {
+            return false;
+        }
+        id_embedding = load_pulid_id_embedding(ctx.pulid_params.id_embedding_path);
+        id_weight    = ctx.pulid_params.id_weight;
+        return false;  // PuLID does not modify the conditioning
+    }
+
+    void before_diffusion(DiffusionParams& params, int /*step*/) const override {
+        if (!enabled || id_embedding.empty()) {
+            return;
+        }
+        if (auto* flux_extra = std::get_if<FluxDiffusionExtra>(&params.extra)) {
+            flux_extra->pulid_id        = &id_embedding;
+            flux_extra->pulid_id_weight = id_weight;
+        }
+    }
+};
+
+std::shared_ptr<GenerationExtension> create_pulid_extension() {
+    return std::make_shared<PuLIDExtension>();
+}
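
Note on the BF16 branch of `load_pulid_id_embedding` above: the loop widens each element with `ggml_bf16_to_fp32`. The sketch below shows what that widening amounts to, assuming the usual bfloat16 layout in which the 16 stored bits are the upper half of an IEEE-754 binary32 value. The names `bf16_bits_to_fp32` and `widen_bf16` are hypothetical stand-ins for illustration, not part of ggml or of this PR.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for ggml_bf16_to_fp32: place the 16 stored bits in the
// upper half of a 32-bit word and reinterpret that word as a float.
inline float bf16_bits_to_fp32(uint16_t bits) {
    const uint32_t widened = static_cast<uint32_t>(bits) << 16;
    float out;
    std::memcpy(&out, &widened, sizeof(out));  // bit copy, avoids aliasing issues
    return out;
}

// Widen a whole buffer element by element, like the loop in the loader.
inline std::vector<float> widen_bf16(const std::vector<uint16_t>& src) {
    std::vector<float> dst(src.size());
    for (size_t i = 0; i < src.size(); i++) {
        dst[i] = bf16_bits_to_fp32(src[i]);
    }
    return dst;
}
```

Because the conversion only appends zero mantissa bits, it is exact: every bfloat16 value has an identical float32 representation.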
--- a/src/model.h
+++ b/src/model.h
@ -42,6 +42,7 @@ enum SDVersion {
    VERSION_LTXAV,
    VERSION_HIDREAM_O1,
    VERSION_Z_IMAGE,
+    VERSION_BOOGU_IMAGE,
    VERSION_OVIS_IMAGE,
    VERSION_ERNIE_IMAGE,
    VERSION_LENS,
@ -143,6 +144,13 @@ static inline bool sd_version_is_z_image(SDVersion version) {
    return false;
 }

+static inline bool sd_version_is_boogu_image(SDVersion version) {
+    if (version == VERSION_BOOGU_IMAGE) {
+        return true;
+    }
+    return false;
+}
+
 static inline bool sd_version_is_longcat(SDVersion version) {
    if (version == VERSION_LONGCAT) {
        return true;
@ -178,6 +186,13 @@ static inline bool sd_version_is_ideogram4(SDVersion version) {
    return false;
 }

+static inline bool sd_version_uses_flux_vae(SDVersion version) {
+    if (sd_version_is_flux(version) || sd_version_is_z_image(version) || sd_version_is_boogu_image(version) || sd_version_is_longcat(version)) {
+        return true;
+    }
+    return false;
+}
+
 static inline bool sd_version_uses_flux2_vae(SDVersion version) {
    if (sd_version_is_flux2(version) || sd_version_is_ernie_image(version) || sd_version_is_lens(version) || sd_version_is_ideogram4(version)) {
        return true;
@ -206,6 +221,7 @@ static inline bool sd_version_is_dit(SDVersion version) {
        version == VERSION_HIDREAM_O1 ||
        sd_version_is_anima(version) ||
        sd_version_is_z_image(version) ||
+        sd_version_is_boogu_image(version) ||
        sd_version_is_ernie_image(version) ||
        sd_version_is_lens(version) ||
        sd_version_is_longcat(version) ||
--- a/src/model/adapter/pulid.hpp
+++ b/src/model/adapter/pulid.hpp
@ -0,0 +1,76 @@
+#ifndef __PULID_HPP__
+#define __PULID_HPP__
+
+#include "core/ggml_extend.hpp"
+#include "model/common/block.hpp"
+
+class PuLIDPerceiverAttentionCA : public GGMLBlock {
+public:
+    static constexpr int64_t DEFAULT_DIM      = 3072;  // Flux hidden size
+    static constexpr int64_t DEFAULT_DIM_HEAD = 128;
+    static constexpr int64_t DEFAULT_HEADS    = 16;
+    static constexpr int64_t DEFAULT_KV_DIM   = 2048;  // PuLID ID-embedding dim
+
+protected:
+    int64_t dim;
+    int64_t dim_head;
+    int64_t heads;
+    int64_t kv_dim;
+    int64_t inner_dim;
+
+public:
+    PuLIDPerceiverAttentionCA(int64_t dim      = DEFAULT_DIM,
+                              int64_t dim_head = DEFAULT_DIM_HEAD,
+                              int64_t heads    = DEFAULT_HEADS,
+                              int64_t kv_dim   = DEFAULT_KV_DIM)
+        : dim(dim),
+          dim_head(dim_head),
+          heads(heads),
+          kv_dim(kv_dim),
+          inner_dim(dim_head * heads) {
+        blocks["norm1"]  = std::shared_ptr<GGMLBlock>(new LayerNorm(kv_dim));
+        blocks["norm2"]  = std::shared_ptr<GGMLBlock>(new LayerNorm(dim));
+        blocks["to_q"]   = std::shared_ptr<GGMLBlock>(new Linear(dim, inner_dim, /*bias=*/false));
+        blocks["to_kv"]  = std::shared_ptr<GGMLBlock>(new Linear(kv_dim, inner_dim * 2, /*bias=*/false));
+        blocks["to_out"] = std::shared_ptr<GGMLBlock>(new Linear(inner_dim, dim, /*bias=*/false));
+    }
+
+    ggml_tensor* forward(GGMLRunnerContext* ctx,
+                         ggml_tensor* id_embedding,
+                         ggml_tensor* image_tokens) {
+        auto norm1  = std::dynamic_pointer_cast<LayerNorm>(blocks["norm1"]);
+        auto norm2  = std::dynamic_pointer_cast<LayerNorm>(blocks["norm2"]);
+        auto to_q   = std::dynamic_pointer_cast<Linear>(blocks["to_q"]);
+        auto to_kv  = std::dynamic_pointer_cast<Linear>(blocks["to_kv"]);
+        auto to_out = std::dynamic_pointer_cast<Linear>(blocks["to_out"]);
+
+        ggml_tensor* x_normed   = norm1->forward(ctx, id_embedding);
+        ggml_tensor* lat_normed = norm2->forward(ctx, image_tokens);
+
+        ggml_tensor* q  = to_q->forward(ctx, lat_normed);  // [N, T_img, 2048]
+        ggml_tensor* kv = to_kv->forward(ctx, x_normed);   // [N, T_id, 4096] (K and V fused, 2 * inner_dim)
+
+        ggml_tensor* k = ggml_view_3d(ctx->ggml_ctx, kv,
+                                      inner_dim, kv->ne[1], kv->ne[2],
+                                      kv->nb[1], kv->nb[2],
+                                      /*offset=*/0);
+        ggml_tensor* v = ggml_view_3d(ctx->ggml_ctx, kv,
+                                      inner_dim, kv->ne[1], kv->ne[2],
+                                      kv->nb[1], kv->nb[2],
+                                      /*offset=*/inner_dim * ggml_element_size(kv));
+        k              = ggml_cont(ctx->ggml_ctx, k);
+        v              = ggml_cont(ctx->ggml_ctx, v);
+
+        ggml_tensor* attn_out = ggml_ext_attention_ext(
+            ctx->ggml_ctx, ctx->backend,
+            q, k, v,
+            heads,
+            /*mask=*/nullptr,
+            /*diag_mask_inf=*/false);
+
+        ggml_tensor* out = to_out->forward(ctx, attn_out);
+        return out;
+    }
+};
+
+#endif  // __PULID_HPP__
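
Note on the K/V split in `PuLIDPerceiverAttentionCA::forward` above: `to_kv` produces one fused tensor whose innermost dimension is `2 * inner_dim`, and the two `ggml_view_3d` calls select the first and second half of each row before `ggml_cont` makes them contiguous. The sketch below reproduces that indexing on a plain row-major buffer. `split_fused_kv` is a hypothetical helper written for illustration, not code from this PR, and it assumes a single batch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Split a fused row-major [num_tokens, 2 * inner_dim] buffer into contiguous
// K and V buffers of shape [num_tokens, inner_dim] each.
inline std::pair<std::vector<float>, std::vector<float>> split_fused_kv(
    const std::vector<float>& kv, size_t num_tokens, size_t inner_dim) {
    assert(kv.size() == num_tokens * 2 * inner_dim);
    std::vector<float> k(num_tokens * inner_dim);
    std::vector<float> v(num_tokens * inner_dim);
    const size_t row_stride = 2 * inner_dim;  // plays the role of kv->nb[1]
    for (size_t t = 0; t < num_tokens; t++) {
        const float* row = kv.data() + t * row_stride;
        // K is the view at offset 0, V is the view at offset inner_dim.
        std::copy(row, row + inner_dim, k.begin() + t * inner_dim);
        std::copy(row + inner_dim, row + row_stride, v.begin() + t * inner_dim);
    }
    return {k, v};
}
```

The views keep the stride of the fused tensor while exposing only `inner_dim` elements per row, which is why the diff calls `ggml_cont` before passing `k` and `v` on to attention.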
--- a/src/model/common/rope.hpp
+++ b/src/model/common/rope.hpp
@ -899,10 +899,12 @@ namespace Rope {
        // q,k,v: [N, L, n_head, d_head]
        // pe: [L, d_head/2, 2, 2]
        // return: [N, L, n_head*d_head]
+        int64_t n_head = q->ne[1];
+
        q = apply_rope(ctx->ggml_ctx, q, pe, rope_interleaved);  // [N*n_head, L, d_head]
        k = apply_rope(ctx->ggml_ctx, k, pe, rope_interleaved);  // [N*n_head, L, d_head]

-        auto x = ggml_ext_attention_ext(ctx->ggml_ctx, ctx->backend, q, k, v, v->ne[1], mask, true, ctx->flash_attn_enabled, kv_scale);  // [N, L, n_head*d_head]
+        auto x = ggml_ext_attention_ext(ctx->ggml_ctx, ctx->backend, q, k, v, n_head, mask, true, ctx->flash_attn_enabled, kv_scale);  // [N, L, n_head*d_head]
        return x;
    }
 };  // namespace Rope
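
Note on the `n_head` change in `Rope::attention` above: the head count is now read from `q` instead of `v`. That matters when K and V carry fewer heads than Q, as in the Boogu defaults later in this diff (`num_attention_heads = 28`, `num_kv_heads = 7`), presumably a grouped-query layout. The sketch below shows the usual mapping in that layout, assuming consecutive query heads share one K/V head. `kv_head_for_query_head` is a hypothetical helper, not a function from this repository.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: index of the K/V head read by a given query head under
// grouped-query attention, assuming consecutive query heads share a K/V head.
inline int64_t kv_head_for_query_head(int64_t q_head, int64_t n_head, int64_t n_kv_head) {
    assert(n_kv_head > 0 && n_head % n_kv_head == 0);
    assert(q_head >= 0 && q_head < n_head);
    const int64_t group_size = n_head / n_kv_head;  // query heads per K/V head
    return q_head / group_size;
}
```

With 28 query heads and 7 K/V heads the group size is 4, so taking the head count from `v` would describe only 7 heads rather than the 28 that `q` carries.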
--- a/src/model/diffusion/anima.hpp
+++ b/src/model/diffusion/anima.hpp
@ -227,6 +227,7 @@ namespace Anima {
            k4 = k_norm->forward(ctx, k4);

            ggml_tensor* attn_out = nullptr;
+            float scale           = (sd_backend_is(ctx->backend, "Vulkan") && ctx->flash_attn_enabled) ? 1.0f / 32.0f : 1.0f;
            if (pe_q != nullptr || pe_k != nullptr) {
                if (pe_q == nullptr) {
                    pe_q = pe_k;
@ -244,7 +245,8 @@ namespace Anima {
                                                     num_heads,
                                                     nullptr,
                                                     true,
-                                                     ctx->flash_attn_enabled);
+                                                     ctx->flash_attn_enabled,
+                                                     scale);
            } else {
                auto q_flat = ggml_reshape_3d(ctx->ggml_ctx, q4, head_dim * num_heads, L_q, N);
                auto k_flat = ggml_reshape_3d(ctx->ggml_ctx, k4, head_dim * num_heads, L_k, N);
@ -256,7 +258,8 @@ namespace Anima {
                                                     num_heads,
                                                     nullptr,
                                                     false,
-                                                     ctx->flash_attn_enabled);
+                                                     ctx->flash_attn_enabled,
+                                                     scale);
            }

            return out_proj->forward(ctx, attn_out);
--- a/src/model/diffusion/boogu.hpp
+++ b/src/model/diffusion/boogu.hpp
@ -0,0 +1,835 @@
+#ifndef __SD_MODEL_DIFFUSION_BOOGU_HPP__
+#define __SD_MODEL_DIFFUSION_BOOGU_HPP__
+
+#include <algorithm>
+#include <cmath>
+#include <tuple>
+#include <vector>
+
+#include "core/ggml_extend.hpp"
+#include "model/common/rope.hpp"
+#include "model/diffusion/dit.hpp"
+#include "model/diffusion/model.hpp"
+#include "model/diffusion/qwen_image.hpp"
+#include "model_loader.h"
+
+namespace Boogu {
+    constexpr int BOOGU_GRAPH_SIZE = 65536;
+
+    struct BooguConfig {
+        int patch_size                   = 2;
+        int64_t in_channels              = 16;
+        int64_t out_channels             = 16;
+        int64_t hidden_size              = 3360;
+        int64_t num_layers               = 32;
+        int64_t num_double_stream_layers = 8;
+        int64_t num_refiner_layers       = 2;
+        int64_t num_attention_heads      = 28;
+        int64_t num_kv_heads             = 7;
+        int64_t head_dim                 = 120;
+        int64_t multiple_of              = 256;
+        int64_t instruction_feat_dim     = 4096;
+        int64_t timestep_embed_dim       = 1024;
+        int theta                        = 10000;
+        float timestep_scale             = 1000.0f;
+        float norm_eps                   = 1e-5f;
+        std::vector<int> axes_dim        = {40, 40, 40};
+        int64_t axes_dim_sum             = 120;
+
+        static int64_t count_blocks(const String2TensorStorage& tensor_storage_map,
+                                    const std::string& prefix,
+                                    const std::string& block_prefix) {
+            int64_t count = 0;
+            for (const auto& [name, _] : tensor_storage_map) {
+                if (!starts_with(name, prefix)) {
+                    continue;
+                }
+                size_t pos = name.find(block_prefix);
+                if (pos == std::string::npos) {
+                    continue;
+                }
+                auto items = split_string(name.substr(pos), '.');
+                if (items.size() > 1) {
+                    count = std::max<int64_t>(count, atoi(items[1].c_str()) + 1);
+                }
+            }
+            return count;
+        }
+
+        static BooguConfig detect_from_weights(const String2TensorStorage& tensor_storage_map, const std::string& prefix) {
+            BooguConfig config;
+            int64_t detected_head_dim = 0;
+            int64_t detected_kv_dim   = 0;
+
+            for (const auto& [name, tensor_storage] : tensor_storage_map) {
+                if (!starts_with(name, prefix)) {
+                    continue;
+                }
+                if (ends_with(name, "x_embedder.weight") && tensor_storage.n_dims == 2) {
+                    int64_t patch_area = config.patch_size * config.patch_size;
+                    config.in_channels = tensor_storage.ne[0] / patch_area;
+                    config.hidden_size = tensor_storage.ne[1];
+                } else if (ends_with(name, "time_caption_embed.caption_embedder.1.weight") && tensor_storage.n_dims == 2) {
+                    config.instruction_feat_dim = tensor_storage.ne[0];
+                    config.hidden_size          = tensor_storage.ne[1];
+                } else if (ends_with(name, "single_stream_layers.0.attn.norm_q.weight") && tensor_storage.n_dims == 1) {
+                    detected_head_dim = tensor_storage.ne[0];
+                } else if (ends_with(name, "double_stream_layers.0.img_self_attn.norm_q.weight") && tensor_storage.n_dims == 1) {
+                    detected_head_dim = tensor_storage.ne[0];
+                } else if (ends_with(name, "single_stream_layers.0.attn.to_k.weight") && tensor_storage.n_dims == 2) {
+                    detected_kv_dim = tensor_storage.ne[1];
+                } else if (ends_with(name, "double_stream_layers.0.img_instruct_attn.processor.img_to_k.weight") && tensor_storage.n_dims == 2) {
+                    detected_kv_dim = tensor_storage.ne[1];
+                } else if (ends_with(name, "norm_out.linear_2.weight") && tensor_storage.n_dims == 2) {
+                    int64_t patch_area  = config.patch_size * config.patch_size;
+                    config.out_channels = tensor_storage.ne[1] / patch_area;
+                }
+            }
+
+            config.num_layers               = std::max<int64_t>(1, count_blocks(tensor_storage_map, prefix, "single_stream_layers."));
+            config.num_double_stream_layers = std::max<int64_t>(0, count_blocks(tensor_storage_map, prefix, "double_stream_layers."));
+            int64_t noise_refiner_layers    = count_blocks(tensor_storage_map, prefix, "noise_refiner.");
+            int64_t ref_refiner_layers      = count_blocks(tensor_storage_map, prefix, "ref_image_refiner.");
+            int64_t context_refiner_layers  = count_blocks(tensor_storage_map, prefix, "context_refiner.");
+            config.num_refiner_layers       = std::max<int64_t>(1, std::max(noise_refiner_layers, std::max(ref_refiner_layers, context_refiner_layers)));
+
+            if (detected_head_dim > 0) {
+                config.head_dim            = detected_head_dim;
+                config.num_attention_heads = config.hidden_size / config.head_dim;
+                config.axes_dim_sum        = config.head_dim;
+                if (detected_kv_dim > 0) {
+                    config.num_kv_heads = detected_kv_dim / config.head_dim;
+                }
+                if (config.axes_dim_sum == 120) {
+                    config.axes_dim = {40, 40, 40};
+                } else if (config.axes_dim_sum % 3 == 0) {
+                    int axis        = static_cast<int>(config.axes_dim_sum / 3);
+                    config.axes_dim = {axis, axis, axis};
+                }
+            }
+            config.timestep_embed_dim = std::min<int64_t>(config.hidden_size, 1024);
+
+            LOG_DEBUG("boogu_image: layers=%" PRId64 ", double_stream_layers=%" PRId64 ", refiner_layers=%" PRId64 ", hidden=%" PRId64 ", heads=%" PRId64 ", kv_heads=%" PRId64 ", head_dim=%" PRId64 ", in_channels=%" PRId64 ", out_channels=%" PRId64,
+                      config.num_layers,
+                      config.num_double_stream_layers,
+                      config.num_refiner_layers,
+                      config.hidden_size,
+                      config.num_attention_heads,
+                      config.num_kv_heads,
+                      config.head_dim,
+                      config.in_channels,
+                      config.out_channels);
+            return config;
+        }
+    };
+
+    __STATIC_INLINE__ ggml_tensor* scale_modulate(ggml_context* ctx, ggml_tensor* x, ggml_tensor* scale) {
+        scale = ggml_reshape_3d(ctx, scale, scale->ne[0], 1, scale->ne[1]);
+        return ggml_add(ctx, x, ggml_mul(ctx, x, scale));
+    }
+
+    __STATIC_INLINE__ ggml_tensor* gate_residual(ggml_context* ctx, ggml_tensor* residual, ggml_tensor* x, ggml_tensor* gate) {
+        gate = ggml_tanh(ctx, gate);
+        gate = ggml_reshape_3d(ctx, gate, gate->ne[0], 1, gate->ne[1]);
+        x    = ggml_mul(ctx, x, gate);
+        return ggml_add(ctx, residual, x);
+    }
+
+    struct LuminaCombinedTimestepCaptionEmbedding : public GGMLBlock {
+        int64_t frequency_embedding_size;
+        float timestep_scale;
+
+        LuminaCombinedTimestepCaptionEmbedding(int64_t hidden_size,
+                                               int64_t instruction_feat_dim,
+                                               int64_t frequency_embedding_size,
+                                               float norm_eps,
+                                               float timestep_scale)
+            : frequency_embedding_size(frequency_embedding_size),
+              timestep_scale(timestep_scale) {
+            blocks["timestep_embedder"]  = std::make_shared<Qwen::TimestepEmbedding>(frequency_embedding_size, std::min<int64_t>(hidden_size, 1024));
+            blocks["caption_embedder.0"] = std::make_shared<RMSNorm>(instruction_feat_dim, norm_eps);
+            blocks["caption_embedder.1"] = std::make_shared<Linear>(instruction_feat_dim, hidden_size, true);
+        }
+
+        std::pair<ggml_tensor*, ggml_tensor*> forward(GGMLRunnerContext* ctx, ggml_tensor* timestep, ggml_tensor* text_hidden_states) {
+            auto timestep_embedder  = std::dynamic_pointer_cast<Qwen::TimestepEmbedding>(blocks["timestep_embedder"]);
+            auto caption_embedder_0 = std::dynamic_pointer_cast<RMSNorm>(blocks["caption_embedder.0"]);
+            auto caption_embedder_1 = std::dynamic_pointer_cast<Linear>(blocks["caption_embedder.1"]);
+
+            auto timestep_proj = ggml_ext_timestep_embedding(ctx->ggml_ctx, timestep, static_cast<int>(frequency_embedding_size), 10000, timestep_scale);
+            auto time_embed    = timestep_embedder->forward(ctx, timestep_proj);
+            auto caption_embed = caption_embedder_1->forward(ctx, caption_embedder_0->forward(ctx, text_hidden_states));
+            return {time_embed, caption_embed};
+        }
+    };
+
+    struct LuminaRMSNormZero : public GGMLBlock {
+        LuminaRMSNormZero(int64_t embedding_dim, int64_t conditioning_embedding_dim, float norm_eps) {
+            blocks["linear"] = std::make_shared<Linear>(conditioning_embedding_dim, 4 * embedding_dim, true);
+            blocks["norm"]   = std::make_shared<RMSNorm>(embedding_dim, norm_eps);
+        }
+
+        std::tuple<ggml_tensor*, ggml_tensor*, ggml_tensor*, ggml_tensor*> forward(GGMLRunnerContext* ctx, ggml_tensor* x, ggml_tensor* emb) {
+            auto linear = std::dynamic_pointer_cast<Linear>(blocks["linear"]);
+            auto norm   = std::dynamic_pointer_cast<RMSNorm>(blocks["norm"]);
+
+            emb       = linear->forward(ctx, ggml_silu(ctx->ggml_ctx, emb));
+            auto mods = ggml_ext_chunk(ctx->ggml_ctx, emb, 4, 0);
+
+            auto scale_msa = mods[0];
+            auto gate_msa  = mods[1];
+            auto scale_mlp = mods[2];
+            auto gate_mlp  = mods[3];
+
+            x = scale_modulate(ctx->ggml_ctx, norm->forward(ctx, x), scale_msa);
+            return {x, gate_msa, scale_mlp, gate_mlp};
+        }
+    };
+
+    struct LuminaFeedForward : public GGMLBlock {
+        LuminaFeedForward(int64_t dim, int64_t inner_dim, int64_t multiple_of) {
+            inner_dim          = multiple_of * ((inner_dim + multiple_of - 1) / multiple_of);
+            blocks["linear_1"] = std::make_shared<Linear>(dim, inner_dim, false);
+            blocks["linear_2"] = std::make_shared<Linear>(inner_dim, dim, false);
+            blocks["linear_3"] = std::make_shared<Linear>(dim, inner_dim, false);
+        }
+
+        ggml_tensor* forward(GGMLRunnerContext* ctx, ggml_tensor* x) {
+            auto linear_1 = std::dynamic_pointer_cast<Linear>(blocks["linear_1"]);
+            auto linear_2 = std::dynamic_pointer_cast<Linear>(blocks["linear_2"]);
+            auto linear_3 = std::dynamic_pointer_cast<Linear>(blocks["linear_3"]);
+
+            if (sd_backend_is(ctx->backend, "Vulkan")) {
+                linear_2->set_force_prec_f32(true);
+            }
+
+            auto h1 = linear_1->forward(ctx, x);
+            auto h2 = linear_3->forward(ctx, x);
+            x       = ggml_swiglu_split(ctx->ggml_ctx, h1, h2);
+            x       = linear_2->forward(ctx, x);
+            return x;
+        }
+    };
+
+    struct LuminaLayerNormContinuous : public GGMLBlock {
+        LuminaLayerNormContinuous(int64_t embedding_dim,
+                                  int64_t conditioning_embedding_dim,
+                                  int64_t out_dim) {
+            blocks["linear_1"] = std::make_shared<Linear>(conditioning_embedding_dim, embedding_dim, true);
+            blocks["norm"]     = std::make_shared<LayerNorm>(embedding_dim, 1e-6f, false);
+            blocks["linear_2"] = std::make_shared<Linear>(embedding_dim, out_dim, true);
+        }
+
+        ggml_tensor* forward(GGMLRunnerContext* ctx, ggml_tensor* x, ggml_tensor* conditioning_embedding) {
+            auto linear_1 = std::dynamic_pointer_cast<Linear>(blocks["linear_1"]);
+            auto norm     = std::dynamic_pointer_cast<LayerNorm>(blocks["norm"]);
+            auto linear_2 = std::dynamic_pointer_cast<Linear>(blocks["linear_2"]);
+
+            auto emb = linear_1->forward(ctx, ggml_silu(ctx->ggml_ctx, conditioning_embedding));
+            x        = scale_modulate(ctx->ggml_ctx, norm->forward(ctx, x), emb);
+            x        = linear_2->forward(ctx, x);
+            return x;
+        }
+    };
+
+    struct Attention : public GGMLBlock {
+        int64_t dim_head;
+        int64_t heads;
+        int64_t kv_heads;
+
+        Attention(int64_t query_dim, int64_t dim_head, int64_t heads, int64_t kv_heads, float eps = 1e-5f)
+            : dim_head(dim_head), heads(heads), kv_heads(kv_heads) {
+            blocks["to_q"]     = std::make_shared<Linear>(query_dim, heads * dim_head, false);
+            blocks["to_k"]     = std::make_shared<Linear>(query_dim, kv_heads * dim_head, false);
+            blocks["to_v"]     = std::make_shared<Linear>(query_dim, kv_heads * dim_head, false);
+            blocks["norm_q"]   = std::make_shared<RMSNorm>(dim_head, eps);
+            blocks["norm_k"]   = std::make_shared<RMSNorm>(dim_head, eps);
+            blocks["to_out.0"] = std::make_shared<Linear>(heads * dim_head, query_dim, false);
+        }
+
+        ggml_tensor* forward(GGMLRunnerContext* ctx,
+                             ggml_tensor* hidden_states,
+                             ggml_tensor* encoder_hidden_states,
+                             ggml_tensor* rotary_emb,
+                             ggml_tensor* attention_mask = nullptr) {
+            auto to_q     = std::dynamic_pointer_cast<Linear>(blocks["to_q"]);
+            auto to_k     = std::dynamic_pointer_cast<Linear>(blocks["to_k"]);
+            auto to_v     = std::dynamic_pointer_cast<Linear>(blocks["to_v"]);
+            auto norm_q   = std::dynamic_pointer_cast<RMSNorm>(blocks["norm_q"]);
+            auto norm_k   = std::dynamic_pointer_cast<RMSNorm>(blocks["norm_k"]);
+            auto to_out_0 = std::dynamic_pointer_cast<Linear>(blocks["to_out.0"]);
+
+            if (sd_backend_is(ctx->backend, "Vulkan")) {
+                to_out_0->set_force_prec_f32(true);
+            }
+
+            int64_t N  = hidden_states->ne[2];
+            int64_t Lq = hidden_states->ne[1];
+            int64_t Lk = encoder_hidden_states->ne[1];
+
+            auto q = to_q->forward(ctx, hidden_states);
+            q      = ggml_reshape_4d(ctx->ggml_ctx, q, dim_head, heads, Lq, N);
+            auto k = to_k->forward(ctx, encoder_hidden_states);
+            k      = ggml_reshape_4d(ctx->ggml_ctx, k, dim_head, kv_heads, Lk, N);
+            auto v = to_v->forward(ctx, encoder_hidden_states);
+            v      = ggml_reshape_4d(ctx->ggml_ctx, v, dim_head, kv_heads, Lk, N);
+
+            q = norm_q->forward(ctx, q);
+            k = norm_k->forward(ctx, k);
+
+            auto out = Rope::attention(ctx, q, k, v, rotary_emb, attention_mask);
+            out      = to_out_0->forward(ctx, out);
+            return out;
+        }
+    };
+
+    struct BooguImageTransformerBlock : public GGMLBlock {
+        bool modulation;
+
+        BooguImageTransformerBlock(int64_t dim,
+                                   int64_t num_attention_heads,
+                                   int64_t num_kv_heads,
+                                   int64_t multiple_of,
+                                   float norm_eps,
+                                   bool modulation)
+            : modulation(modulation) {
+            int64_t head_dim       = dim / num_attention_heads;
+            blocks["attn"]         = std::make_shared<Attention>(dim, head_dim, num_attention_heads, num_kv_heads, 1e-5f);
+            blocks["feed_forward"] = std::make_shared<LuminaFeedForward>(dim, 4 * dim, multiple_of);
+            if (modulation) {
+                blocks["norm1"] = std::make_shared<LuminaRMSNormZero>(dim, std::min<int64_t>(dim, 1024), norm_eps);
+            } else {
+                blocks["norm1"] = std::make_shared<RMSNorm>(dim, norm_eps);
+            }
+            blocks["ffn_norm1"] = std::make_shared<RMSNorm>(dim, norm_eps);
+            blocks["norm2"]     = std::make_shared<RMSNorm>(dim, norm_eps);
+            blocks["ffn_norm2"] = std::make_shared<RMSNorm>(dim, norm_eps);
+        }
+
+        ggml_tensor* forward(GGMLRunnerContext* ctx,
+                             ggml_tensor* hidden_states,
+                             ggml_tensor* rotary_emb,
+                             ggml_tensor* temb           = nullptr,
+                             ggml_tensor* attention_mask = nullptr) {
+            auto attn         = std::dynamic_pointer_cast<Attention>(blocks["attn"]);
+            auto feed_forward = std::dynamic_pointer_cast<LuminaFeedForward>(blocks["feed_forward"]);
+            auto ffn_norm1    = std::dynamic_pointer_cast<RMSNorm>(blocks["ffn_norm1"]);
+            auto norm2        = std::dynamic_pointer_cast<RMSNorm>(blocks["norm2"]);
+            auto ffn_norm2    = std::dynamic_pointer_cast<RMSNorm>(blocks["ffn_norm2"]);
+
+            if (modulation) {
+                auto norm1 = std::dynamic_pointer_cast<LuminaRMSNormZero>(blocks["norm1"]);
+                auto mods  = norm1->forward(ctx, hidden_states, temb);
+
+                auto norm_hidden_states = std::get<0>(mods);
+                auto gate_msa           = std::get<1>(mods);
+                auto scale_mlp          = std::get<2>(mods);
+                auto gate_mlp           = std::get<3>(mods);
+
+                auto attn_output = attn->forward(ctx, norm_hidden_states, norm_hidden_states, rotary_emb, attention_mask);
+                hidden_states    = gate_residual(ctx->ggml_ctx, hidden_states, norm2->forward(ctx, attn_output), gate_msa);
+
+                auto mlp_input  = scale_modulate(ctx->ggml_ctx, ffn_norm1->forward(ctx, hidden_states), scale_mlp);
+                auto mlp_output = feed_forward->forward(ctx, mlp_input);
+                hidden_states   = gate_residual(ctx->ggml_ctx, hidden_states, ffn_norm2->forward(ctx, mlp_output), gate_mlp);
+            } else {
+                auto norm1 = std::dynamic_pointer_cast<RMSNorm>(blocks["norm1"]);
+
+                auto norm_hidden_states = norm1->forward(ctx, hidden_states);
+                auto attn_output        = attn->forward(ctx, norm_hidden_states, norm_hidden_states, rotary_emb, attention_mask);
+                hidden_states           = ggml_add(ctx->ggml_ctx, hidden_states, norm2->forward(ctx, attn_output));
+
+                auto mlp_output = feed_forward->forward(ctx, ffn_norm1->forward(ctx, hidden_states));
+                hidden_states   = ggml_add(ctx->ggml_ctx, hidden_states, ffn_norm2->forward(ctx, mlp_output));
+            }
+            return hidden_states;
+        }
+    };
+
+    struct BooguImageJointAttention : public GGMLBlock {
+        int64_t dim_head;
+        int64_t heads;
+        int64_t kv_heads;
+
+        BooguImageJointAttention(int64_t dim, int64_t dim_head, int64_t heads, int64_t kv_heads)
+            : dim_head(dim_head), heads(heads), kv_heads(kv_heads) {
+            blocks["norm_q"]                  = std::make_shared<RMSNorm>(dim_head, 1e-5f);
+            blocks["norm_k"]                  = std::make_shared<RMSNorm>(dim_head, 1e-5f);
+            blocks["to_out.0"]                = std::make_shared<Linear>(heads * dim_head, dim, false);
+            blocks["processor.img_to_q"]      = std::make_shared<Linear>(dim, heads * dim_head, false);
+            blocks["processor.img_to_k"]      = std::make_shared<Linear>(dim, kv_heads * dim_head, false);
+            blocks["processor.img_to_v"]      = std::make_shared<Linear>(dim, kv_heads * dim_head, false);
+            blocks["processor.instruct_to_q"] = std::make_shared<Linear>(dim, heads * dim_head, false);
+            blocks["processor.instruct_to_k"] = std::make_shared<Linear>(dim, kv_heads * dim_head, false);
+            blocks["processor.instruct_to_v"] = std::make_shared<Linear>(dim, kv_heads * dim_head, false);
+            blocks["processor.instruct_out"]  = std::make_shared<Linear>(heads * dim_head, dim, false);
+            blocks["processor.img_out"]       = std::make_shared<Linear>(heads * dim_head, dim, false);
+        }
+
+        ggml_tensor* forward(GGMLRunnerContext* ctx,
+                             ggml_tensor* img_hidden_states,
+                             ggml_tensor* instruct_hidden_states,
+                             ggml_tensor* rotary_emb,
+                             ggml_tensor* attention_mask = nullptr) {
+            auto norm_q        = std::dynamic_pointer_cast<RMSNorm>(blocks["norm_q"]);
+            auto norm_k        = std::dynamic_pointer_cast<RMSNorm>(blocks["norm_k"]);
+            auto to_out_0      = std::dynamic_pointer_cast<Linear>(blocks["to_out.0"]);
+            auto img_to_q      = std::dynamic_pointer_cast<Linear>(blocks["processor.img_to_q"]);
+            auto img_to_k      = std::dynamic_pointer_cast<Linear>(blocks["processor.img_to_k"]);
+            auto img_to_v      = std::dynamic_pointer_cast<Linear>(blocks["processor.img_to_v"]);
+            auto instruct_to_q = std::dynamic_pointer_cast<Linear>(blocks["processor.instruct_to_q"]);
+            auto instruct_to_k = std::dynamic_pointer_cast<Linear>(blocks["processor.instruct_to_k"]);
+            auto instruct_to_v = std::dynamic_pointer_cast<Linear>(blocks["processor.instruct_to_v"]);
+            auto instruct_out  = std::dynamic_pointer_cast<Linear>(blocks["processor.instruct_out"]);
+            auto img_out       = std::dynamic_pointer_cast<Linear>(blocks["processor.img_out"]);
+
+            if (sd_backend_is(ctx->backend, "Vulkan")) {
+                to_out_0->set_force_prec_f32(true);
+            }
+
+            int64_t N          = img_hidden_states->ne[2];
+            int64_t L_img      = img_hidden_states->ne[1];
+            int64_t L_instruct = instruct_hidden_states->ne[1];
+
+            auto img_q = img_to_q->forward(ctx, img_hidden_states);
+            img_q      = ggml_reshape_4d(ctx->ggml_ctx, img_q, dim_head, heads, L_img, N);
+            auto img_k = img_to_k->forward(ctx, img_hidden_states);
+            img_k      = ggml_reshape_4d(ctx->ggml_ctx, img_k, dim_head, kv_heads, L_img, N);
+            auto img_v = img_to_v->forward(ctx, img_hidden_states);
+            img_v      = ggml_reshape_4d(ctx->ggml_ctx, img_v, dim_head, kv_heads, L_img, N);
+
+            auto instruct_q = instruct_to_q->forward(ctx, instruct_hidden_states);
+            instruct_q      = ggml_reshape_4d(ctx->ggml_ctx, instruct_q, dim_head, heads, L_instruct, N);
+            auto instruct_k = instruct_to_k->forward(ctx, instruct_hidden_states);
+            instruct_k      = ggml_reshape_4d(ctx->ggml_ctx, instruct_k, dim_head, kv_heads, L_instruct, N);
+            auto instruct_v = instruct_to_v->forward(ctx, instruct_hidden_states);
+            instruct_v      = ggml_reshape_4d(ctx->ggml_ctx, instruct_v, dim_head, kv_heads, L_instruct, N);
+
+            auto q = ggml_concat(ctx->ggml_ctx, instruct_q, img_q, 2);
+            auto k = ggml_concat(ctx->ggml_ctx, instruct_k, img_k, 2);
+            auto v = ggml_concat(ctx->ggml_ctx, instruct_v, img_v, 2);
+            q      = norm_q->forward(ctx, q);
+            k      = norm_k->forward(ctx, k);
+
+            auto hidden_states = Rope::attention(ctx, q, k, v, rotary_emb, attention_mask);
+            auto instruct_attn = ggml_ext_slice(ctx->ggml_ctx, hidden_states, 1, 0, L_instruct);
+            auto img_attn      = ggml_ext_slice(ctx->ggml_ctx, hidden_states, 1, L_instruct, L_instruct + L_img);
+
+            instruct_attn = instruct_out->forward(ctx, instruct_attn);
+            img_attn      = img_out->forward(ctx, img_attn);
+            hidden_states = ggml_concat(ctx->ggml_ctx, instruct_attn, img_attn, 1);
+            hidden_states = to_out_0->forward(ctx, hidden_states);
+            return hidden_states;
+        }
+    };
+
+    struct BooguImageDoubleStreamBlock : public GGMLBlock {
+        BooguImageDoubleStreamBlock(int64_t dim,
+                                    int64_t num_attention_heads,
+                                    int64_t num_kv_heads,
+                                    int64_t multiple_of,
+                                    float norm_eps) {
+            int64_t head_dim                = dim / num_attention_heads;
+            blocks["img_instruct_attn"]     = std::make_shared<BooguImageJointAttention>(dim, head_dim, num_attention_heads, num_kv_heads);
+            blocks["img_self_attn"]         = std::make_shared<Attention>(dim, head_dim, num_attention_heads, num_kv_heads, 1e-5f);
+            blocks["img_feed_forward"]      = std::make_shared<LuminaFeedForward>(dim, 4 * dim, multiple_of);
+            blocks["instruct_feed_forward"] = std::make_shared<LuminaFeedForward>(dim, 4 * dim, multiple_of);
+            blocks["img_norm1"]             = std::make_shared<LuminaRMSNormZero>(dim, std::min<int64_t>(dim, 1024), norm_eps);
+            blocks["img_norm2"]             = std::make_shared<LuminaRMSNormZero>(dim, std::min<int64_t>(dim, 1024), norm_eps);
+            blocks["img_norm3"]             = std::make_shared<LuminaRMSNormZero>(dim, std::min<int64_t>(dim, 1024), norm_eps);
+            blocks["instruct_norm1"]        = std::make_shared<LuminaRMSNormZero>(dim, std::min<int64_t>(dim, 1024), norm_eps);
+            blocks["instruct_norm2"]        = std::make_shared<LuminaRMSNormZero>(dim, std::min<int64_t>(dim, 1024), norm_eps);
+            blocks["img_attn_norm"]         = std::make_shared<RMSNorm>(dim, norm_eps);
+            blocks["img_self_attn_norm"]    = std::make_shared<RMSNorm>(dim, norm_eps);
+            blocks["img_ffn_norm1"]         = std::make_shared<RMSNorm>(dim, norm_eps);
+            blocks["img_ffn_norm2"]         = std::make_shared<RMSNorm>(dim, norm_eps);
+            blocks["instruct_attn_norm"]    = std::make_shared<RMSNorm>(dim, norm_eps);
+            blocks["instruct_ffn_norm1"]    = std::make_shared<RMSNorm>(dim, norm_eps);
+            blocks["instruct_ffn_norm2"]    = std::make_shared<RMSNorm>(dim, norm_eps);
+        }
+
+        std::pair<ggml_tensor*, ggml_tensor*> forward(GGMLRunnerContext* ctx,
+                                                      ggml_tensor* img_hidden_states,
+                                                      ggml_tensor* instruct_hidden_states,
+                                                      ggml_tensor* joint_rotary_emb,
+                                                      ggml_tensor* img_rotary_emb,
+                                                      ggml_tensor* temb) {
+            auto img_instruct_attn     = std::dynamic_pointer_cast<BooguImageJointAttention>(blocks["img_instruct_attn"]);
+            auto img_self_attn         = std::dynamic_pointer_cast<Attention>(blocks["img_self_attn"]);
+            auto img_feed_forward      = std::dynamic_pointer_cast<LuminaFeedForward>(blocks["img_feed_forward"]);
+            auto instruct_feed_forward = std::dynamic_pointer_cast<LuminaFeedForward>(blocks["instruct_feed_forward"]);
+            auto img_norm1             = std::dynamic_pointer_cast<LuminaRMSNormZero>(blocks["img_norm1"]);
+            auto img_norm2             = std::dynamic_pointer_cast<LuminaRMSNormZero>(blocks["img_norm2"]);
+            auto img_norm3             = std::dynamic_pointer_cast<LuminaRMSNormZero>(blocks["img_norm3"]);
+            auto instruct_norm1        = std::dynamic_pointer_cast<LuminaRMSNormZero>(blocks["instruct_norm1"]);
+            auto instruct_norm2        = std::dynamic_pointer_cast<LuminaRMSNormZero>(blocks["instruct_norm2"]);
+            auto img_attn_norm         = std::dynamic_pointer_cast<RMSNorm>(blocks["img_attn_norm"]);
+            auto img_self_attn_norm    = std::dynamic_pointer_cast<RMSNorm>(blocks["img_self_attn_norm"]);
+            auto img_ffn_norm1         = std::dynamic_pointer_cast<RMSNorm>(blocks["img_ffn_norm1"]);
+            auto img_ffn_norm2         = std::dynamic_pointer_cast<RMSNorm>(blocks["img_ffn_norm2"]);
+            auto instruct_attn_norm    = std::dynamic_pointer_cast<RMSNorm>(blocks["instruct_attn_norm"]);
+            auto instruct_ffn_norm1    = std::dynamic_pointer_cast<RMSNorm>(blocks["instruct_ffn_norm1"]);
+            auto instruct_ffn_norm2    = std::dynamic_pointer_cast<RMSNorm>(blocks["instruct_ffn_norm2"]);
+
+            int64_t L_instruct = instruct_hidden_states->ne[1];
+
+            auto img_norm1_out_vec      = img_norm1->forward(ctx, img_hidden_states, temb);
+            auto img_norm2_out_vec      = img_norm2->forward(ctx, img_hidden_states, temb);
+            auto img_norm3_out_vec      = img_norm3->forward(ctx, img_hidden_states, temb);
+            auto instruct_norm1_out_vec = instruct_norm1->forward(ctx, instruct_hidden_states, temb);
+            auto instruct_norm2_out_vec = instruct_norm2->forward(ctx, instruct_hidden_states, temb);
+
+            auto img_norm1_out = std::get<0>(img_norm1_out_vec);
+            auto img_gate_msa  = std::get<1>(img_norm1_out_vec);
+            auto img_scale_mlp = std::get<2>(img_norm1_out_vec);
+            auto img_gate_mlp  = std::get<3>(img_norm1_out_vec);
+
+            auto img_norm2_out = std::get<0>(img_norm2_out_vec);
+            auto img_shift_mlp = std::get<1>(img_norm2_out_vec);
+
+            auto img_norm3_out = std::get<0>(img_norm3_out_vec);
+            auto img_gate_self = std::get<1>(img_norm3_out_vec);
+
+            auto instruct_norm1_out = std::get<0>(instruct_norm1_out_vec);
+            auto instruct_gate_msa  = std::get<1>(instruct_norm1_out_vec);
+            auto instruct_scale_mlp = std::get<2>(instruct_norm1_out_vec);
+            auto instruct_gate_mlp  = std::get<3>(instruct_norm1_out_vec);
+
+            auto instruct_norm2_out = std::get<0>(instruct_norm2_out_vec);
+            auto instruct_shift_mlp = std::get<1>(instruct_norm2_out_vec);
+
+            auto joint_attn_out    = img_instruct_attn->forward(ctx, img_norm1_out, instruct_norm1_out, joint_rotary_emb);
+            auto instruct_attn_out = ggml_ext_slice(ctx->ggml_ctx, joint_attn_out, 1, 0, L_instruct);
+            auto img_attn_out      = ggml_ext_slice(ctx->ggml_ctx, joint_attn_out, 1, L_instruct, joint_attn_out->ne[1]);
+
+            auto img_self_attn_out = img_self_attn->forward(ctx, img_norm3_out, img_norm3_out, img_rotary_emb);
+
+            img_hidden_states = gate_residual(ctx->ggml_ctx, img_hidden_states, img_attn_norm->forward(ctx, img_attn_out), img_gate_msa);
+            img_hidden_states = gate_residual(ctx->ggml_ctx, img_hidden_states, img_self_attn_norm->forward(ctx, img_self_attn_out), img_gate_self);
+
+            auto img_mlp_input = scale_modulate(ctx->ggml_ctx, img_norm2_out, img_scale_mlp);
+            img_shift_mlp      = ggml_reshape_3d(ctx->ggml_ctx, img_shift_mlp, img_shift_mlp->ne[0], 1, img_shift_mlp->ne[1]);
+            img_mlp_input      = ggml_add(ctx->ggml_ctx, img_mlp_input, img_shift_mlp);
+            auto img_mlp_out   = img_feed_forward->forward(ctx, img_ffn_norm1->forward(ctx, img_mlp_input));
+            img_hidden_states  = gate_residual(ctx->ggml_ctx, img_hidden_states, img_ffn_norm2->forward(ctx, img_mlp_out), img_gate_mlp);
+
+            instruct_hidden_states  = gate_residual(ctx->ggml_ctx, instruct_hidden_states, instruct_attn_norm->forward(ctx, instruct_attn_out), instruct_gate_msa);
+            auto instruct_mlp_input = scale_modulate(ctx->ggml_ctx, instruct_norm2_out, instruct_scale_mlp);
+            instruct_shift_mlp      = ggml_reshape_3d(ctx->ggml_ctx, instruct_shift_mlp, instruct_shift_mlp->ne[0], 1, instruct_shift_mlp->ne[1]);
+            instruct_mlp_input      = ggml_add(ctx->ggml_ctx, instruct_mlp_input, instruct_shift_mlp);
+            auto instruct_mlp_out   = instruct_feed_forward->forward(ctx, instruct_ffn_norm1->forward(ctx, instruct_mlp_input));
+            instruct_hidden_states  = gate_residual(ctx->ggml_ctx, instruct_hidden_states, instruct_ffn_norm2->forward(ctx, instruct_mlp_out), instruct_gate_mlp);
+
+            return {img_hidden_states, instruct_hidden_states};
+        }
+    };
+
+    struct BooguImageModel : public GGMLBlock {
+        BooguConfig config;
+
+        void init_params(ggml_context* ctx, const String2TensorStorage& tensor_storage_map = {}, const std::string prefix = "") override {
+            GGML_UNUSED(tensor_storage_map);
+            GGML_UNUSED(prefix);
+            params["image_index_embedding"] = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, config.hidden_size, 5);
+        }
+
+        BooguImageModel() = default;
+        BooguImageModel(BooguConfig config)
+            : config(std::move(config)) {
+            blocks["x_embedder"]               = std::make_shared<Linear>(this->config.patch_size * this->config.patch_size * this->config.in_channels, this->config.hidden_size, true);
+            blocks["ref_image_patch_embedder"] = std::make_shared<Linear>(this->config.patch_size * this->config.patch_size * this->config.in_channels, this->config.hidden_size, true);
+            blocks["time_caption_embed"]       = std::make_shared<LuminaCombinedTimestepCaptionEmbedding>(this->config.hidden_size,
+                                                                                                    this->config.instruction_feat_dim,
+                                                                                                    256,
+                                                                                                    this->config.norm_eps,
+                                                                                                    this->config.timestep_scale);
+
+            for (int i = 0; i < this->config.num_refiner_layers; i++) {
+                blocks["noise_refiner." + std::to_string(i)]     = std::make_shared<BooguImageTransformerBlock>(this->config.hidden_size,
+                                                                                                            this->config.num_attention_heads,
+                                                                                                            this->config.num_kv_heads,
+                                                                                                            this->config.multiple_of,
+                                                                                                            this->config.norm_eps,
+                                                                                                            true);
+                blocks["ref_image_refiner." + std::to_string(i)] = std::make_shared<BooguImageTransformerBlock>(this->config.hidden_size,
+                                                                                                                this->config.num_attention_heads,
+                                                                                                                this->config.num_kv_heads,
+                                                                                                                this->config.multiple_of,
+                                                                                                                this->config.norm_eps,
+                                                                                                                true);
+                blocks["context_refiner." + std::to_string(i)]   = std::make_shared<BooguImageTransformerBlock>(this->config.hidden_size,
+                                                                                                              this->config.num_attention_heads,
+                                                                                                              this->config.num_kv_heads,
+                                                                                                              this->config.multiple_of,
+                                                                                                              this->config.norm_eps,
+                                                                                                              false);
+            }
+
+            for (int i = 0; i < this->config.num_double_stream_layers; i++) {
+                blocks["double_stream_layers." + std::to_string(i)] = std::make_shared<BooguImageDoubleStreamBlock>(this->config.hidden_size,
+                                                                                                                    this->config.num_attention_heads,
+                                                                                                                    this->config.num_kv_heads,
+                                                                                                                    this->config.multiple_of,
+                                                                                                                    this->config.norm_eps);
+            }
+
+            for (int i = 0; i < this->config.num_layers; i++) {
+                blocks["single_stream_layers." + std::to_string(i)] = std::make_shared<BooguImageTransformerBlock>(this->config.hidden_size,
+                                                                                                                   this->config.num_attention_heads,
+                                                                                                                   this->config.num_kv_heads,
+                                                                                                                   this->config.multiple_of,
+                                                                                                                   this->config.norm_eps,
+                                                                                                                   true);
+            }
+
+            blocks["norm_out"] = std::make_shared<LuminaLayerNormContinuous>(this->config.hidden_size,
+                                                                             this->config.timestep_embed_dim,
+                                                                             this->config.patch_size * this->config.patch_size * this->config.out_channels);
+        }
+
+        ggml_tensor* image_index_embedding(GGMLRunnerContext* ctx, int index) {
+            GGML_ASSERT(index >= 0 && index < 5);
+            auto embedding = params["image_index_embedding"];
+            auto out       = ggml_view_1d(ctx->ggml_ctx,
+                                          embedding,
+                                          config.hidden_size,
+                                          index * config.hidden_size * ggml_element_size(embedding));
+            out            = ggml_reshape_3d(ctx->ggml_ctx, out, config.hidden_size, 1, 1);
+            return out;
+        }
+
+        ggml_tensor* embed_refs(GGMLRunnerContext* ctx, const std::vector<ggml_tensor*>& ref_latents) {
+            if (ref_latents.empty()) {
+                return nullptr;
+            }
+            auto ref_image_patch_embedder = std::dynamic_pointer_cast<Linear>(blocks["ref_image_patch_embedder"]);
+
+            ggml_tensor* ref_img = nullptr;
+            for (int i = 0; i < static_cast<int>(ref_latents.size()); i++) {
+                auto ref = DiT::pad_and_patchify(ctx, ref_latents[i], config.patch_size, config.patch_size, false);
+                ref      = ref_image_patch_embedder->forward(ctx, ref);
+                ref      = ggml_add(ctx->ggml_ctx, ref, image_index_embedding(ctx, std::min(i, 4)));
+                ref_img  = ref_img == nullptr ? ref : ggml_concat(ctx->ggml_ctx, ref_img, ref, 1);
+            }
+            return ref_img;
+        }
+
+        ggml_tensor* forward(GGMLRunnerContext* ctx,
+                             ggml_tensor* x,
+                             ggml_tensor* timesteps,
+                             ggml_tensor* context,
+                             ggml_tensor* pe,
+                             std::vector<ggml_tensor*> ref_latents = {}) {
+            int64_t W = x->ne[0];
+            int64_t H = x->ne[1];
+            int64_t N = x->ne[3];
+            GGML_ASSERT(N == 1);
+
+            auto x_embedder         = std::dynamic_pointer_cast<Linear>(blocks["x_embedder"]);
+            auto time_caption_embed = std::dynamic_pointer_cast<LuminaCombinedTimestepCaptionEmbedding>(blocks["time_caption_embed"]);
+            auto norm_out           = std::dynamic_pointer_cast<LuminaLayerNormContinuous>(blocks["norm_out"]);
+
+            auto timestep = ggml_sub(ctx->ggml_ctx, ggml_ext_ones_like(ctx->ggml_ctx, timesteps), timesteps);
+            auto embeds   = time_caption_embed->forward(ctx, timestep, context);
+            auto temb     = embeds.first;
+            auto txt      = embeds.second;
+
+            auto img        = DiT::pad_and_patchify(ctx, x, config.patch_size, config.patch_size, false);
+            int64_t img_len = img->ne[1];
+            img             = x_embedder->forward(ctx, img);
+            auto ref_img    = embed_refs(ctx, ref_latents);
+            int64_t ref_len = ref_img != nullptr ? ref_img->ne[1] : 0;
+            int64_t txt_len = txt->ne[1];
+
+            GGML_ASSERT(pe->ne[3] == txt_len + ref_len + img_len);
+            auto txt_pe   = ggml_ext_slice(ctx->ggml_ctx, pe, 3, 0, txt_len);
+            auto noise_pe = ggml_ext_slice(ctx->ggml_ctx, pe, 3, txt_len + ref_len, txt_len + ref_len + img_len);
+
+            for (int i = 0; i < config.num_refiner_layers; i++) {
+                auto block = std::dynamic_pointer_cast<BooguImageTransformerBlock>(blocks["context_refiner." + std::to_string(i)]);
+                txt        = block->forward(ctx, txt, txt_pe);
+                sd::ggml_graph_cut::mark_graph_cut(txt, "boogu.context_refiner." + std::to_string(i), "txt");
+            }
+
+            for (int i = 0; i < config.num_refiner_layers; i++) {
+                auto block = std::dynamic_pointer_cast<BooguImageTransformerBlock>(blocks["noise_refiner." + std::to_string(i)]);
+                img        = block->forward(ctx, img, noise_pe, temb);
+                sd::ggml_graph_cut::mark_graph_cut(img, "boogu.noise_refiner." + std::to_string(i), "img");
+            }
+
+            ggml_tensor* combined_img = img;
+            if (ref_img != nullptr) {
+                auto ref_pe = ggml_ext_slice(ctx->ggml_ctx, pe, 3, txt_len, txt_len + ref_len);
+                for (int i = 0; i < config.num_refiner_layers; i++) {
+                    auto block = std::dynamic_pointer_cast<BooguImageTransformerBlock>(blocks["ref_image_refiner." + std::to_string(i)]);
+                    ref_img    = block->forward(ctx, ref_img, ref_pe, temb);
+                    sd::ggml_graph_cut::mark_graph_cut(ref_img, "boogu.ref_image_refiner." + std::to_string(i), "ref_img");
+                }
+                combined_img = ggml_concat(ctx->ggml_ctx, ref_img, img, 1);
+            }
+
+            auto img_pe = ggml_ext_slice(ctx->ggml_ctx, pe, 3, txt_len, txt_len + combined_img->ne[1]);
+            for (int i = 0; i < config.num_double_stream_layers; i++) {
+                auto block   = std::dynamic_pointer_cast<BooguImageDoubleStreamBlock>(blocks["double_stream_layers." + std::to_string(i)]);
+                auto result  = block->forward(ctx, combined_img, txt, pe, img_pe, temb);
+                combined_img = result.first;
+                txt          = result.second;
+                sd::ggml_graph_cut::mark_graph_cut(combined_img, "boogu.double_stream_layers." + std::to_string(i), "img");
+                sd::ggml_graph_cut::mark_graph_cut(txt, "boogu.double_stream_layers." + std::to_string(i), "txt");
+            }
+
+            auto hidden_states = ggml_concat(ctx->ggml_ctx, txt, combined_img, 1);
+            for (int i = 0; i < config.num_layers; i++) {
+                auto block    = std::dynamic_pointer_cast<BooguImageTransformerBlock>(blocks["single_stream_layers." + std::to_string(i)]);
+                hidden_states = block->forward(ctx, hidden_states, pe, temb);
+                sd::ggml_graph_cut::mark_graph_cut(hidden_states, "boogu.single_stream_layers." + std::to_string(i), "hidden_states");
+            }
+
+            hidden_states = norm_out->forward(ctx, hidden_states, temb);
+            hidden_states = ggml_ext_slice(ctx->ggml_ctx, hidden_states, 1, hidden_states->ne[1] - img_len, hidden_states->ne[1]);
+            hidden_states = DiT::unpatchify_and_crop(ctx->ggml_ctx, hidden_states, H, W, config.patch_size, config.patch_size, false);
+            hidden_states = ggml_ext_scale(ctx->ggml_ctx, hidden_states, -1.f);
+            return hidden_states;
+        }
+    };
+
+    __STATIC_INLINE__ int patched_token_count(int64_t size, int patch_size) {
+        int pad = (patch_size - (static_cast<int>(size) % patch_size)) % patch_size;
+        return (static_cast<int>(size) + pad) / patch_size;
+    }
+
+    __STATIC_INLINE__ void append_spatial_ids(std::vector<std::vector<float>>& ids,
+                                              int bs,
+                                              int pe_shift,
+                                              int h_tokens,
+                                              int w_tokens) {
+        std::vector<std::vector<float>> image_ids(h_tokens * w_tokens, std::vector<float>(3, 0.0f));
+        for (int h = 0; h < h_tokens; h++) {
+            for (int w = 0; w < w_tokens; w++) {
+                image_ids[h * w_tokens + w][0] = static_cast<float>(pe_shift);
+                image_ids[h * w_tokens + w][1] = static_cast<float>(h);
+                image_ids[h * w_tokens + w][2] = static_cast<float>(w);
+            }
+        }
+        for (int b = 0; b < bs; b++) {
+            ids.insert(ids.end(), image_ids.begin(), image_ids.end());
+        }
+    }
+
+    __STATIC_INLINE__ std::vector<float> gen_boogu_pe(int h,
+                                                      int w,
+                                                      int patch_size,
+                                                      int bs,
+                                                      int context_len,
+                                                      const std::vector<ggml_tensor*>& ref_latents,
+                                                      int theta,
+                                                      const std::vector<int>& axes_dim) {
+        std::vector<std::vector<float>> ids;
+        ids.reserve(static_cast<size_t>(bs) * context_len);
+        for (int b = 0; b < bs; b++) {
+            for (int i = 0; i < context_len; i++) {
+                float pos = static_cast<float>(i);
+                ids.push_back({pos, pos, pos});
+            }
+        }
+
+        int pe_shift = context_len;
+        for (ggml_tensor* ref : ref_latents) {
+            int ref_h_tokens = patched_token_count(ref->ne[1], patch_size);
+            int ref_w_tokens = patched_token_count(ref->ne[0], patch_size);
+            append_spatial_ids(ids, bs, pe_shift, ref_h_tokens, ref_w_tokens);
+            pe_shift += std::max(ref_h_tokens, ref_w_tokens);
+        }
+
+        int h_tokens = patched_token_count(h, patch_size);
+        int w_tokens = patched_token_count(w, patch_size);
+        append_spatial_ids(ids, bs, pe_shift, h_tokens, w_tokens);
+
+        return Rope::embed_nd(ids, bs, static_cast<float>(theta), axes_dim);
+    }
+
+    struct BooguImageRunner : public DiffusionModelRunner {
+        BooguConfig config;
+        BooguImageModel boogu;
+        std::vector<float> pe_vec;
+
+        BooguImageRunner(ggml_backend_t backend,
+                         const String2TensorStorage& tensor_storage_map      = {},
+                         const std::string prefix                            = "",
+                         SDVersion version                                   = VERSION_BOOGU_IMAGE,
+                         std::shared_ptr<RunnerWeightManager> weight_manager = nullptr)
+            : DiffusionModelRunner(backend, prefix, weight_manager),
+              config(BooguConfig::detect_from_weights(tensor_storage_map, prefix)) {
+            boogu = BooguImageModel(config);
+            boogu.init(params_ctx, tensor_storage_map, prefix);
+        }
+
+        std::string get_desc() override {
+            return "boogu_image";
+        }
+
+        void get_param_tensors(std::map<std::string, ggml_tensor*>& tensors, const std::string& prefix) override {
+            boogu.get_param_tensors(tensors, prefix);
+        }
+
+        ggml_cgraph* build_graph(const sd::Tensor<float>& x_tensor,
+                                 const sd::Tensor<float>& timesteps_tensor,
+                                 const sd::Tensor<float>& context_tensor,
+                                 const std::vector<sd::Tensor<float>>& ref_latents_tensor = {}) {
+            ggml_cgraph* gf        = new_graph_custom(BOOGU_GRAPH_SIZE);
+            ggml_tensor* x         = make_input(x_tensor);
+            ggml_tensor* timesteps = make_input(timesteps_tensor);
+            GGML_ASSERT(x->ne[3] == 1);
+            GGML_ASSERT(!context_tensor.empty());
+            ggml_tensor* context = make_input(context_tensor);
+
+            std::vector<ggml_tensor*> ref_latents;
+            ref_latents.reserve(ref_latents_tensor.size());
+            for (const auto& ref_latent_tensor : ref_latents_tensor) {
+                ref_latents.push_back(make_input(ref_latent_tensor));
+            }
+
+            pe_vec      = gen_boogu_pe(static_cast<int>(x->ne[1]),
+                                       static_cast<int>(x->ne[0]),
+                                       config.patch_size,
+                                       static_cast<int>(x->ne[3]),
+                                       static_cast<int>(context->ne[1]),
+                                       ref_latents,
+                                       config.theta,
+                                       config.axes_dim);
+            int pos_len = static_cast<int>(pe_vec.size() / config.axes_dim_sum / 2);
+            auto pe     = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, 2, 2, config.axes_dim_sum / 2, pos_len);
+            set_backend_tensor_data(pe, pe_vec.data());
+
+            auto runner_ctx  = get_context();
+            ggml_tensor* out = boogu.forward(&runner_ctx, x, timesteps, context, pe, ref_latents);
+            ggml_build_forward_expand(gf, out);
+            return gf;
+        }
+
+        sd::Tensor<float> compute(int n_threads,
+                                  const sd::Tensor<float>& x,
+                                  const sd::Tensor<float>& timesteps,
+                                  const sd::Tensor<float>& context,
+                                  const std::vector<sd::Tensor<float>>& ref_latents = {}) {
+            auto get_graph = [&]() -> ggml_cgraph* {
+                return build_graph(x, timesteps, context, ref_latents);
+            };
+            return restore_trailing_singleton_dims(GGMLRunner::compute<float>(get_graph, n_threads, false, false, false), x.dim());
+        }
+
+        sd::Tensor<float> compute(int n_threads,
+                                  const DiffusionParams& diffusion_params) override {
+            GGML_ASSERT(diffusion_params.x != nullptr);
+            GGML_ASSERT(diffusion_params.timesteps != nullptr);
+            static const std::vector<sd::Tensor<float>> empty_ref_latents;
+            return compute(n_threads,
+                           *diffusion_params.x,
+                           *diffusion_params.timesteps,
+                           tensor_or_empty(diffusion_params.context),
+                           diffusion_params.ref_latents ? *diffusion_params.ref_latents : empty_ref_latents);
+        }
+    };
+}  // namespace Boogu
+
+#endif  // __SD_MODEL_DIFFUSION_BOOGU_HPP__
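
Note on `gen_boogu_pe`: the position-ID layout above may be easier to follow in isolation. The sketch below mirrors the same arithmetic under simplifying assumptions: batch size 1 (the model's `forward` asserts `N == 1`), integer IDs instead of floats, and plain `(height, width)` pairs instead of `ggml_tensor*`. `layout_ids` is a hypothetical name used only for this illustration; the final `Rope::embed_nd` step is not reproduced.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <utility>
#include <vector>

// Number of patch tokens along one axis after padding `size` up to a multiple of `patch_size`.
int patched_token_count(int size, int patch_size) {
    assert(size >= 0 && patch_size > 0);
    int pad = (patch_size - size % patch_size) % patch_size;
    return (size + pad) / patch_size;
}

// Hypothetical helper: builds (index, row, col) ids for batch size 1.
// Order is text tokens, then each reference image, then the noise image.
// `refs` and `image` hold (height, width) in latent pixels.
std::vector<std::array<int, 3>> layout_ids(int context_len,
                                           const std::vector<std::pair<int, int>>& refs,
                                           std::pair<int, int> image,
                                           int patch_size) {
    std::vector<std::array<int, 3>> ids;
    // Text tokens use the same running position on all three axes.
    for (int i = 0; i < context_len; i++) {
        ids.push_back(std::array<int, 3>{i, i, i});
    }

    int shift = context_len;
    // Every token of one image shares `shift` on axis 0 and carries its grid position on axes 1 and 2.
    auto append_grid = [&](int h_tokens, int w_tokens) {
        for (int h = 0; h < h_tokens; h++) {
            for (int w = 0; w < w_tokens; w++) {
                ids.push_back(std::array<int, 3>{shift, h, w});
            }
        }
    };

    for (const auto& ref : refs) {
        int h_tokens = patched_token_count(ref.first, patch_size);
        int w_tokens = patched_token_count(ref.second, patch_size);
        append_grid(h_tokens, w_tokens);
        // The next image starts past the larger side of this one.
        shift += std::max(h_tokens, w_tokens);
    }

    append_grid(patched_token_count(image.first, patch_size),
                patched_token_count(image.second, patch_size));
    return ids;
}
```

For example, with 3 text tokens, one 4x6 reference latent, a 4x4 noise latent and `patch_size = 2`, the reference tokens all carry index 3 on axis 0 and the noise tokens carry index 6, because the reference grid is 2x3 tokens and the shift advances by its larger side.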
--- a/src/model/diffusion/ernie_image.hpp
+++ b/src/model/diffusion/ernie_image.hpp
@ -162,6 +162,8 @@ namespace ErnieImage {
            int64_t S = x->ne[1];
            int64_t N = x->ne[2];

+            float scale = (sd_backend_is(ctx->backend, "Vulkan") && ctx->flash_attn_enabled) ? 1.0f / 32.0f : 1.0f;
+
            auto q = to_q->forward(ctx, x);
            auto k = to_k->forward(ctx, x);
            auto v = to_v->forward(ctx, x);
@ -182,7 +184,7 @@ namespace ErnieImage {
            k = ggml_cont(ctx->ggml_ctx, ggml_permute(ctx->ggml_ctx, k, 0, 2, 1, 3));  // [N, heads, S, head_dim]
            k = ggml_reshape_3d(ctx->ggml_ctx, k, k->ne[0], k->ne[1], k->ne[2] * k->ne[3]);

-            x = ggml_ext_attention_ext(ctx->ggml_ctx, ctx->backend, q, k, v, num_heads, attention_mask, true, ctx->flash_attn_enabled);  // [N, S, hidden_size]
+            x = ggml_ext_attention_ext(ctx->ggml_ctx, ctx->backend, q, k, v, num_heads, attention_mask, true, ctx->flash_attn_enabled, scale);  // [N, S, hidden_size]
            x = to_out_0->forward(ctx, x);
            return x;
        }
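
Note on the new `scale` argument: this diff does not show how `ggml_ext_attention_ext` consumes it, so the following is only one plausible reading, not a description of the actual implementation. A common reason for such a factor is to shrink K and V before a reduced-precision attention kernel and compensate afterwards. The sketch below assumes exactly that (K and V multiplied by `kv_scale`, the softmax scale divided by it, the output divided by it again) and shows that the result is then unchanged. `attend` and `kv_scale` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Single-query softmax attention in plain float.
// ASSUMPTION: kv_scale pre-scales K and V and is undone afterwards; this is an
// illustration of the idea, not the behaviour of ggml_ext_attention_ext.
std::vector<float> attend(const std::vector<float>& q,
                          const std::vector<std::vector<float>>& keys,
                          const std::vector<std::vector<float>>& values,
                          float kv_scale) {
    assert(!keys.empty() && keys.size() == values.size() && kv_scale > 0.0f);
    const float softmax_scale = 1.0f / std::sqrt(static_cast<float>(q.size()));

    // Scores use the scaled keys; dividing the softmax scale by kv_scale cancels the factor.
    std::vector<float> scores(keys.size());
    float max_score = -INFINITY;
    for (std::size_t j = 0; j < keys.size(); j++) {
        float dot = 0.0f;
        for (std::size_t d = 0; d < q.size(); d++) {
            dot += q[d] * (keys[j][d] * kv_scale);
        }
        scores[j] = dot * (softmax_scale / kv_scale);
        max_score = std::max(max_score, scores[j]);
    }

    // Numerically stable softmax.
    float denom = 0.0f;
    for (float& s : scores) {
        s = std::exp(s - max_score);
        denom += s;
    }

    // Weighted sum over the scaled values, then undo the value scaling.
    std::vector<float> out(values[0].size(), 0.0f);
    for (std::size_t j = 0; j < values.size(); j++) {
        for (std::size_t d = 0; d < out.size(); d++) {
            out[d] += (scores[j] / denom) * (values[j][d] * kv_scale);
        }
    }
    for (float& o : out) {
        o /= kv_scale;
    }
    return out;
}
```

Under this assumption, calling `attend` with `kv_scale = 1.0f` and with `kv_scale = 1.0f / 32.0f` gives the same output up to rounding, while the intermediate K and V magnitudes are 32 times smaller in the second case.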
--- a/src/model/diffusion/flux.hpp
+++ b/src/model/diffusion/flux.hpp
@ -4,6 +4,7 @@
 #include <memory>
 #include <vector>

+#include "model/adapter/pulid.hpp"
 #include "model/common/rope.hpp"
 #include "model/diffusion/dit.hpp"
 #include "model/diffusion/model.hpp"
@ -49,6 +50,10 @@ namespace Flux {
        float ref_index_scale     = 1.f;
        ChromaRadianceConfig chroma_radiance_params;

+        bool pulid_enabled        = false;
+        int pulid_double_interval = 2;
+        int pulid_single_interval = 4;
+
        static FluxConfig detect_from_weights(const String2TensorStorage& tensor_storage_map,
                                              const std::string& prefix,
                                              SDVersion version = VERSION_FLUX) {
@ -138,6 +143,9 @@ namespace Flux {
                if (ends_with(name, "double_blocks.0.txt_attn.norm.key_norm.scale")) {
                    head_dim = tensor_storage.ne[0];
                }
+                if (name.find("pulid_ca.") != std::string::npos) {
+                    config.pulid_enabled = true;
+                }
            }
            if (actual_radiance_patch_size > 0 && actual_radiance_patch_size != config.patch_size) {
                GGML_ASSERT(config.patch_size == 2 * actual_radiance_patch_size);
@ -957,6 +965,20 @@ namespace Flux {
                blocks["double_stream_modulation_txt"] = std::make_shared<Modulation>(config.hidden_size, true, !config.disable_bias);
                blocks["single_stream_modulation"]     = std::make_shared<Modulation>(config.hidden_size, false, !config.disable_bias);
            }
+
+            if (config.pulid_enabled) {
+                int num_double_ca = (config.depth + config.pulid_double_interval - 1) / config.pulid_double_interval;
+                int num_single_ca = (config.depth_single_blocks + config.pulid_single_interval - 1) / config.pulid_single_interval;
+                int num_ca        = num_double_ca + num_single_ca;
+                for (int i = 0; i < num_ca; i++) {
+                    blocks["pulid_ca." + std::to_string(i)] =
+                        std::shared_ptr<GGMLBlock>(new PuLIDPerceiverAttentionCA(
+                            /*dim=*/config.hidden_size,
+                            /*dim_head=*/PuLIDPerceiverAttentionCA::DEFAULT_DIM_HEAD,
+                            /*heads=*/PuLIDPerceiverAttentionCA::DEFAULT_HEADS,
+                            /*kv_dim=*/PuLIDPerceiverAttentionCA::DEFAULT_KV_DIM));
+                }
+            }
        }

        ggml_tensor* forward_orig(GGMLRunnerContext* ctx,
@ -967,7 +989,9 @@ namespace Flux {
                                  ggml_tensor* guidance,
                                  ggml_tensor* pe,
                                  ggml_tensor* mod_index_arange = nullptr,
-                                  std::vector<int> skip_layers  = {}) {
+                                  std::vector<int> skip_layers  = {},
+                                  ggml_tensor* pulid_id         = nullptr,
+                                  float pulid_id_weight         = 1.0f) {
            auto img_in      = std::dynamic_pointer_cast<Linear>(blocks["img_in"]);
            auto txt_in      = std::dynamic_pointer_cast<Linear>(blocks["txt_in"]);
            auto final_layer = std::dynamic_pointer_cast<LastLayer>(blocks["final_layer"]);
@ -1044,6 +1068,13 @@ namespace Flux {
            sd::ggml_graph_cut::mark_graph_cut(txt, "flux.prelude", "txt");
            sd::ggml_graph_cut::mark_graph_cut(vec, "flux.prelude", "vec");

+            const bool pulid_active = config.pulid_enabled && pulid_id != nullptr;
+            if (pulid_active && !skip_layers.empty()) {
+                LOG_WARN("PuLID + skip_layers is not supported; disabling PuLID for this generation.");
+            }
+            const bool pulid_run = pulid_active && skip_layers.empty();
+            int ca_idx           = 0;
+
            for (int i = 0; i < config.depth; i++) {
                if (skip_layers.size() > 0 && std::find(skip_layers.begin(), skip_layers.end(), i) != skip_layers.end()) {
                    continue;
@ -1056,9 +1087,19 @@ namespace Flux {
                txt          = img_txt.second;  // [N, n_txt_token, hidden_size]
                sd::ggml_graph_cut::mark_graph_cut(img, "flux.double_blocks." + std::to_string(i), "img");
                sd::ggml_graph_cut::mark_graph_cut(txt, "flux.double_blocks." + std::to_string(i), "txt");
+
+                if (pulid_run && (i % config.pulid_double_interval == 0)) {
+                    auto pulid_ca = std::dynamic_pointer_cast<PuLIDPerceiverAttentionCA>(
+                        blocks["pulid_ca." + std::to_string(ca_idx)]);
+                    ggml_tensor* ca_out = pulid_ca->forward(ctx, pulid_id, img);  // [N, n_img_token, hidden_size]
+                    img                 = ggml_add(ctx->ggml_ctx, img, ggml_scale(ctx->ggml_ctx, ca_out, pulid_id_weight));
+                    sd::ggml_graph_cut::mark_graph_cut(img, "flux.pulid_ca." + std::to_string(ca_idx), "img");
+                    ca_idx++;
+                }
            }

-            auto txt_img = ggml_concat(ctx->ggml_ctx, txt, img, 1);  // [N, n_txt_token + n_img_token, hidden_size]
+            auto txt_img            = ggml_concat(ctx->ggml_ctx, txt, img, 1);  // [N, n_txt_token + n_img_token, hidden_size]
+            const int64_t n_txt_tok = txt->ne[1];
            for (int i = 0; i < config.depth_single_blocks; i++) {
                if (skip_layers.size() > 0 && std::find(skip_layers.begin(), skip_layers.end(), i + config.depth) != skip_layers.end()) {
                    continue;
@ -1067,6 +1108,29 @@ namespace Flux {

                txt_img = block->forward(ctx, txt_img, vec, pe, txt_img_mask, ss_mods);
                sd::ggml_graph_cut::mark_graph_cut(txt_img, "flux.single_blocks." + std::to_string(i), "txt_img");
+
+                if (pulid_run && (i % config.pulid_single_interval == 0)) {
+                    auto pulid_ca = std::dynamic_pointer_cast<PuLIDPerceiverAttentionCA>(
+                        blocks["pulid_ca." + std::to_string(ca_idx)]);
+                    ggml_tensor* txt_part = ggml_view_3d(ctx->ggml_ctx, txt_img,
+                                                         txt_img->ne[0], n_txt_tok, txt_img->ne[2],
+                                                         txt_img->nb[1], txt_img->nb[2],
+                                                         0);
+                    ggml_tensor* img_part = ggml_view_3d(ctx->ggml_ctx, txt_img,
+                                                         txt_img->ne[0],
+                                                         txt_img->ne[1] - n_txt_tok,
+                                                         txt_img->ne[2],
+                                                         txt_img->nb[1],
+                                                         txt_img->nb[2],
+                                                         n_txt_tok * txt_img->nb[1]);
+                    txt_part              = ggml_cont(ctx->ggml_ctx, txt_part);
+                    img_part              = ggml_cont(ctx->ggml_ctx, img_part);
+                    ggml_tensor* ca_out   = pulid_ca->forward(ctx, pulid_id, img_part);
+                    img_part              = ggml_add(ctx->ggml_ctx, img_part, ggml_scale(ctx->ggml_ctx, ca_out, pulid_id_weight));
+                    txt_img               = ggml_concat(ctx->ggml_ctx, txt_part, img_part, 1);
+                    sd::ggml_graph_cut::mark_graph_cut(txt_img, "flux.pulid_ca." + std::to_string(ca_idx), "txt_img");
+                    ca_idx++;
+                }
            }

            img = ggml_view_3d(ctx->ggml_ctx,
@ -1105,7 +1169,9 @@ namespace Flux {
                                             ggml_tensor* mod_index_arange         = nullptr,
                                             ggml_tensor* dct                      = nullptr,
                                             std::vector<ggml_tensor*> ref_latents = {},
-                                             std::vector<int> skip_layers          = {}) {
+                                             std::vector<int> skip_layers          = {},
+                                             ggml_tensor* pulid_id                 = nullptr,
+                                             float pulid_id_weight                 = 1.0f) {
            GGML_ASSERT(x->ne[3] == 1);

            int64_t W      = x->ne[0];
@ -1131,7 +1197,8 @@ namespace Flux {
            img = ggml_reshape_3d(ctx->ggml_ctx, img, img->ne[0] * img->ne[1], img->ne[2], img->ne[3]);  // [N, hidden_size, H/patch_size*W/patch_size]
            img = ggml_cont(ctx->ggml_ctx, ggml_ext_torch_permute(ctx->ggml_ctx, img, 1, 0, 2, 3));      // [N, H/patch_size*W/patch_size, hidden_size]

-            auto out = forward_orig(ctx, img, context, timestep, y, guidance, pe, mod_index_arange, skip_layers);  // [N, n_img_token, hidden_size]
+            auto out = forward_orig(ctx, img, context, timestep, y, guidance, pe, mod_index_arange, skip_layers,
+                                    pulid_id, pulid_id_weight);  // [N, n_img_token, hidden_size]

            // nerf decode
            auto nerf_image_embedder   = std::dynamic_pointer_cast<NerfEmbedder>(blocks["nerf_image_embedder"]);
@ -1179,7 +1246,9 @@ namespace Flux {
                                         ggml_tensor* mod_index_arange         = nullptr,
                                         ggml_tensor* dct                      = nullptr,
                                         std::vector<ggml_tensor*> ref_latents = {},
-                                         std::vector<int> skip_layers          = {}) {
+                                         std::vector<int> skip_layers          = {},
+                                         ggml_tensor* pulid_id                 = nullptr,
+                                         float pulid_id_weight                 = 1.0f) {
            GGML_ASSERT(x->ne[3] == 1);

            int64_t W      = x->ne[0];
@ -1226,7 +1295,8 @@ namespace Flux {
                }
            }

-            auto out = forward_orig(ctx, img, context, timestep, y, guidance, pe, mod_index_arange, skip_layers);  // [N, num_tokens, C * patch_size * patch_size]
+            auto out = forward_orig(ctx, img, context, timestep, y, guidance, pe, mod_index_arange, skip_layers,
+                                    pulid_id, pulid_id_weight);  // [N, num_tokens, C * patch_size * patch_size]

            if (out->ne[1] > img_tokens) {
                out = ggml_view_3d(ctx->ggml_ctx, out, out->ne[0], img_tokens, out->ne[2], out->nb[1], out->nb[2], 0);
@ -1248,7 +1318,9 @@ namespace Flux {
                             ggml_tensor* mod_index_arange         = nullptr,
                             ggml_tensor* dct                      = nullptr,
                             std::vector<ggml_tensor*> ref_latents = {},
-                             std::vector<int> skip_layers          = {}) {
+                             std::vector<int> skip_layers          = {},
+                             ggml_tensor* pulid_id                 = nullptr,
+                             float pulid_id_weight                 = 1.0f) {
            // Forward pass of DiT.
            // x: (N, C, H, W) tensor of spatial inputs (images or latent representations of images)
            // timestep: (N,) tensor of diffusion timesteps
@ -1271,7 +1343,9 @@ namespace Flux {
                                               mod_index_arange,
                                               dct,
                                               ref_latents,
-                                               skip_layers);
+                                               skip_layers,
+                                               pulid_id,
+                                               pulid_id_weight);
            } else {
                return forward_flux_chroma(ctx,
                                           x,
@ -1284,7 +1358,9 @@ namespace Flux {
                                           mod_index_arange,
                                           dct,
                                           ref_latents,
-                                           skip_layers);
+                                           skip_layers,
+                                           pulid_id,
+                                           pulid_id_weight);
            }
        }
    };
@ -1384,7 +1460,9 @@ namespace Flux {
                                 const sd::Tensor<float>& guidance_tensor                 = {},
                                 const std::vector<sd::Tensor<float>>& ref_latents_tensor = {},
                                 bool increase_ref_index                                  = false,
-                                 std::vector<int> skip_layers                             = {}) {
+                                 std::vector<int> skip_layers                             = {},
+                                 const sd::Tensor<float>& pulid_id_tensor                 = {},
+                                 float pulid_id_weight                                    = 1.0f) {
            ggml_tensor* x         = make_input(x_tensor);
            ggml_tensor* timesteps = make_input(timesteps_tensor);
            ggml_tensor* context   = make_optional_input(context_tensor);
@ -1461,6 +1539,10 @@ namespace Flux {
                set_backend_tensor_data(dct, dct_vec.data());
            }

+            ggml_tensor* pulid_id = pulid_id_tensor.empty()
+                                        ? nullptr
+                                        : make_input(pulid_id_tensor);
+
            auto runner_ctx = get_context();

            ggml_tensor* out = flux.forward(&runner_ctx,
@ -1474,7 +1556,9 @@ namespace Flux {
                                            mod_index_arange,
                                            dct,
                                            ref_latents,
-                                            skip_layers);
+                                            skip_layers,
+                                            pulid_id,
+                                            pulid_id_weight);

            ggml_build_forward_expand(gf, out);

@ -1490,14 +1574,17 @@ namespace Flux {
                                  const sd::Tensor<float>& guidance                 = {},
                                  const std::vector<sd::Tensor<float>>& ref_latents = {},
                                  bool increase_ref_index                           = false,
-                                  std::vector<int> skip_layers                      = std::vector<int>()) {
+                                  std::vector<int> skip_layers                      = std::vector<int>(),
+                                  const sd::Tensor<float>& pulid_id                 = {},
+                                  float pulid_id_weight                             = 1.0f) {
            // x: [N, in_channels, h, w]
            // timesteps: [N, ]
            // context: [N, max_position, hidden_size]
            // y: [N, adm_in_channels] or [1, adm_in_channels]
            // guidance: [N, ]
+            // pulid_id: empty (no injection) or [N, num_id_tokens=32, kv_dim=2048]
            auto get_graph = [&]() -> ggml_cgraph* {
-                return build_graph(x, timesteps, context, c_concat, y, guidance, ref_latents, increase_ref_index, skip_layers);
+                return build_graph(x, timesteps, context, c_concat, y, guidance, ref_latents, increase_ref_index, skip_layers, pulid_id, pulid_id_weight);
            };

            auto result = restore_trailing_singleton_dims(GGMLRunner::compute<float>(get_graph, n_threads, false, false, false), x.dim());
@ -1520,7 +1607,9 @@ namespace Flux {
                           tensor_or_empty(extra->guidance),
                           diffusion_params.ref_latents ? *diffusion_params.ref_latents : empty_ref_latents,
                           diffusion_params.increase_ref_index,
-                           extra->skip_layers ? *extra->skip_layers : empty_skip_layers);
+                           extra->skip_layers ? *extra->skip_layers : empty_skip_layers,
+                           tensor_or_empty(extra->pulid_id),
+                           extra->pulid_id_weight);
        }

        void test() {
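Review note on the `pulid_ca` sizing in `flux.hpp`: the constructor allocates `ceil(depth / interval)` cross-attention modules per stream, and the forward pass consumes one module whenever `i % interval == 0`. The sketch below checks that these two counts agree. The depths (19 double blocks, 38 single blocks) and the double-block interval of 2 are assumptions for illustration; only `pulid_single_interval = 4` is visible in this part of the diff, and `ceil_div` and `injection_points` are hypothetical helper names.

```cpp
#include <vector>

// Ceiling division for positive integers, mirroring (n + d - 1) / d in the patch.
int ceil_div(int n, int d) {
    return (n + d - 1) / d;
}

// Block indices after which a cross-attention module would run:
// every i in [0, depth) with i % interval == 0.
std::vector<int> injection_points(int depth, int interval) {
    std::vector<int> points;
    for (int i = 0; i < depth; i++) {
        if (i % interval == 0) {
            points.push_back(i);
        }
    }
    return points;
}
```

The running `ca_idx` only lines up with the allocated modules if every block is visited. That is consistent with the patch disabling PuLID when `skip_layers` is non-empty.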
--- a/src/model/diffusion/model.hpp
+++ b/src/model/diffusion/model.hpp
@ -22,6 +22,8 @@ struct SkipLayerDiffusionExtra {
 struct FluxDiffusionExtra {
    const sd::Tensor<float>* guidance   = nullptr;
    const std::vector<int>* skip_layers = nullptr;
+    const sd::Tensor<float>* pulid_id   = nullptr;
+    float pulid_id_weight               = 1.0f;
 };

 struct AnimaDiffusionExtra {
--- a/src/model/te/llm.hpp
+++ b/src/model/te/llm.hpp
@ -79,6 +79,7 @@ namespace LLM {
        int window_size                     = 112;
        int num_position_embeddings         = 0;
        std::set<int> fullatt_block_indexes = {7, 15, 23, 31};
+        bool split_patch_embed              = false;
    };

    struct LLMConfig {
@ -179,7 +180,8 @@ namespace LLM {
                config.num_experts_per_tok     = 4;
            }

-            config.num_layers = 0;
+            config.num_layers          = 0;
+            int detected_vision_layers = 0;
            for (const auto& [name, tensor_storage] : tensor_storage_map) {
                if (!starts_with(name, prefix)) {
                    continue;
@ -190,6 +192,38 @@ namespace LLM {
                    if (contains(name, "attn.q_proj")) {
                        config.llama_cpp_style = true;
                    }
+                    if (contains(name, "visual.patch_embed.proj.1.weight")) {
+                        config.vision.split_patch_embed = true;
+                    }
+                    if (contains(name, "visual.patch_embed.proj.0.weight")) {
+                        config.vision.patch_size  = static_cast<int>(tensor_storage.ne[0]);
+                        config.vision.in_channels = tensor_storage.ne[2];
+                        config.vision.hidden_size = tensor_storage.ne[3];
+                    }
+                    if (contains(name, "visual.patch_embed.bias")) {
+                        config.vision.hidden_size = tensor_storage.ne[0];
+                    }
+                    if (contains(name, "visual.pos_embed.weight")) {
+                        config.vision.hidden_size             = tensor_storage.ne[0];
+                        config.vision.num_position_embeddings = static_cast<int>(tensor_storage.ne[1]);
+                    }
+                    if (contains(name, "visual.blocks.")) {
+                        auto items = split_string(name.substr(pos), '.');
+                        if (items.size() > 2) {
+                            int block_index = atoi(items[2].c_str());
+                            if (block_index + 1 > detected_vision_layers) {
+                                detected_vision_layers = block_index + 1;
+                            }
+                        }
+                    }
+                    if (contains(name, "visual.blocks.0.mlp.linear_fc1.weight") ||
+                        contains(name, "visual.blocks.0.mlp.gate_proj.weight")) {
+                        config.vision.intermediate_size = tensor_storage.ne[1];
+                    }
+                    if (contains(name, "visual.merger.linear_fc2.weight") ||
+                        contains(name, "visual.merger.mlp.2.weight")) {
+                        config.vision.out_hidden_size = tensor_storage.ne[1];
+                    }
                    continue;
                }
                pos = name.find("layers.");
@ -219,6 +253,9 @@ namespace LLM {
            if (arch == LLMArch::QWEN3 && config.num_layers == 28) {
                config.num_heads = 16;
            }
+            if (detected_vision_layers > 0) {
+                config.vision.num_layers = detected_vision_layers;
+            }
            LOG_DEBUG("llm: num_layers = %" PRId64 ", vocab_size = %" PRId64 ", hidden_size = %" PRId64 ", intermediate_size = %" PRId64,
                      config.num_layers,
                      config.vocab_size,
@ -539,40 +576,51 @@ namespace LLM {

    struct VisionPatchEmbed : public GGMLBlock {
    protected:
-        bool llama_cpp_style;
+        bool split_patch_embed;
+        bool bias;
        int patch_size;
        int temporal_patch_size;
        int64_t in_channels;
        int64_t embed_dim;

+        void init_params(ggml_context* ctx,
+                         const String2TensorStorage& tensor_storage_map = {},
+                         const std::string prefix                       = "") override {
+            GGML_UNUSED(tensor_storage_map);
+            GGML_UNUSED(prefix);
+            if (split_patch_embed && bias) {
+                params["bias"] = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, embed_dim);
+            }
+        }
+
    public:
-        VisionPatchEmbed(bool llama_cpp_style,
+        VisionPatchEmbed(bool split_patch_embed,
                         LLMVisionArch arch,
                         int patch_size          = 14,
                         int temporal_patch_size = 2,
                         int64_t in_channels     = 3,
                         int64_t embed_dim       = 1152)
-            : llama_cpp_style(llama_cpp_style),
+            : split_patch_embed(split_patch_embed),
+              bias(arch == LLMVisionArch::QWEN3_VL),
              patch_size(patch_size),
              temporal_patch_size(temporal_patch_size),
              in_channels(in_channels),
              embed_dim(embed_dim) {
-            bool bias = arch == LLMVisionArch::QWEN3_VL;
-            if (llama_cpp_style) {
+            if (split_patch_embed) {
                blocks["proj.0"] = std::shared_ptr<GGMLBlock>(new Conv2d(in_channels,
                                                                         embed_dim,
                                                                         {patch_size, patch_size},
                                                                         {patch_size, patch_size},
                                                                         {0, 0},
                                                                         {1, 1},
-                                                                         bias));
+                                                                         false));
                blocks["proj.1"] = std::shared_ptr<GGMLBlock>(new Conv2d(in_channels,
                                                                         embed_dim,
                                                                         {patch_size, patch_size},
                                                                         {patch_size, patch_size},
                                                                         {0, 0},
                                                                         {1, 1},
-                                                                         bias));
+                                                                         false));
            } else {
                std::tuple<int, int, int> kernel_size = {(int)temporal_patch_size, (int)patch_size, (int)patch_size};
                blocks["proj"]                        = std::shared_ptr<GGMLBlock>(new Conv3d(in_channels,
@ -593,7 +641,7 @@ namespace LLM {
                                temporal_patch_size,
                                ggml_nelements(x) / (temporal_patch_size * patch_size * patch_size));

-            if (llama_cpp_style) {
+            if (split_patch_embed) {
                auto proj_0 = std::dynamic_pointer_cast<Conv2d>(blocks["proj.0"]);
                auto proj_1 = std::dynamic_pointer_cast<Conv2d>(blocks["proj.1"]);

@ -606,6 +654,10 @@ namespace LLM {
                x1      = proj_1->forward(ctx, x1);

                x = ggml_add(ctx->ggml_ctx, x0, x1);
+                if (bias) {
+                    auto b = ggml_reshape_4d(ctx->ggml_ctx, params["bias"], 1, 1, embed_dim, 1);
+                    x      = ggml_add_inplace(ctx->ggml_ctx, x, b);
+                }
            } else {
                auto proj = std::dynamic_pointer_cast<Conv3d>(blocks["proj"]);

@ -798,7 +850,7 @@ namespace LLM {
              spatial_merge_size(vision_params.spatial_merge_size),
              num_grid_per_side(vision_params.num_position_embeddings > 0 ? static_cast<int>(std::sqrt(vision_params.num_position_embeddings)) : 0),
              fullatt_block_indexes(vision_params.fullatt_block_indexes) {
-            blocks["patch_embed"] = std::shared_ptr<GGMLBlock>(new VisionPatchEmbed(llama_cpp_style,
+            blocks["patch_embed"] = std::shared_ptr<GGMLBlock>(new VisionPatchEmbed(vision_params.split_patch_embed,
                                                                                    arch_,
                                                                                    vision_params.patch_size,
                                                                                    vision_params.temporal_patch_size,
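Review note on the vision-layer detection in `llm.hpp`: the loader derives `vision.num_layers` from the highest `visual.blocks.<n>` index found in the tensor names, plus one. A minimal sketch of that idea follows. `count_vision_layers` is a hypothetical name, and the parsing here uses `std::atoi` on the text after the marker rather than the patch's `split_string` call.

```cpp
#include <algorithm>
#include <cstdlib>
#include <string>
#include <vector>

// Returns (max block index + 1) over names containing "visual.blocks.<n>.",
// or 0 when no name matches.
int count_vision_layers(const std::vector<std::string>& names) {
    const std::string marker = "visual.blocks.";
    int layers = 0;
    for (const auto& name : names) {
        const auto pos = name.find(marker);
        if (pos == std::string::npos) {
            continue;
        }
        // atoi stops at the first non-digit, so "26.mlp.linear_fc1.weight" yields 26.
        const int block_index = std::atoi(name.c_str() + pos + marker.size());
        layers = std::max(layers, block_index + 1);
    }
    return layers;
}
```

Returning 0 for "nothing detected" matches the patch's guard, which overrides the configured default only when `detected_vision_layers > 0`.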
--- a/src/model/vae/auto_encoder_kl.hpp
+++ b/src/model/vae/auto_encoder_kl.hpp
@ -682,7 +682,7 @@ struct AutoEncoderKL : public VAE {
        } else if (sd_version_is_sd3(version)) {
            scale_factor = 1.5305f;
            shift_factor = 0.0609f;
-        } else if (sd_version_is_flux(version) || sd_version_is_z_image(version) || sd_version_is_longcat(version)) {
+        } else if (sd_version_uses_flux_vae(version)) {
            scale_factor = 0.3611f;
            shift_factor = 0.1159f;
        } else if (sd_version_uses_flux2_vae(version)) {
--- a/src/model_loader.cpp
+++ b/src/model_loader.cpp
@ -485,6 +485,9 @@ SDVersion ModelLoader::get_sd_version() {
        if (tensor_storage.name.find("model.diffusion_model.cap_embedder.0.weight") != std::string::npos) {
            return VERSION_Z_IMAGE;
        }
+        if (tensor_storage.name.find("double_stream_layers.0.img_instruct_attn.processor.img_to_q.weight") != std::string::npos) {
+            return VERSION_BOOGU_IMAGE;
+        }
        if (tensor_storage.name.find("model.diffusion_model.layers.0.adaLN_sa_ln.weight") != std::string::npos) {
            return VERSION_ERNIE_IMAGE;
        }
--- a/src/model_manager.cpp
+++ b/src/model_manager.cpp
@ -147,6 +147,17 @@ bool ModelManager::register_param_tensors(const std::string& desc,
    return true;
 }

+bool ModelManager::load_all_params_eagerly() {
+    std::vector<TensorState*> all_states;
+    all_states.reserve(tensor_states_.size());
+    for (const auto& s : tensor_states_) {
+        if (s != nullptr) {
+            all_states.push_back(s.get());
+        }
+    }
+    return load_tensors_to_params_backend(all_states);
+}
+
 bool ModelManager::validate_registered_tensors() {
    bool ok = true;
    for (const auto& state : tensor_states_) {
@ -469,7 +480,7 @@ bool ModelManager::mmap_params(const std::vector<TensorState*>& states,
        return true;
    }

-    auto mmap_store = model_loader_.mmap_tensors(mmap_candidates, {}, true);
+    auto mmap_store = model_loader_.mmap_tensors(mmap_candidates, {}, writable_mmap_);
    if (mmap_store.empty()) {
        return true;
    }
@ -577,13 +588,8 @@ bool ModelManager::alloc_params_buffers(const std::vector<TensorState*>& states,
        for (TensorState* state : states) {
            ggml_tensor* tensor = state->tensor;
            size_t tensor_size  = GGML_PAD(ggml_backend_buft_get_alloc_size(params_buft, tensor), alignment);
-            if (max_size > 0 && tensor_size > max_size) {
-                LOG_ERROR("model manager tensor '%s' is too large for params buffer: %zu > %zu",
-                          ggml_get_name(tensor),
-                          tensor_size,
-                          max_size);
-                return false;
-            }
+            // Some backends, e.g. Vulkan, report a preferred chunk size here rather than a
+            // hard per-tensor allocation limit. Oversized tensors are allocated alone.
            if (!chunk.empty() && max_size > 0 && chunk_size + tensor_size > max_size) {
                if (!alloc_chunk(chunk, chunk_size)) {
                    return false;
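Review note on the `alloc_params_buffers` change: with the hard size check removed, a tensor larger than `max_size` no longer fails the load. The existing flush condition places it in a chunk of its own. The sketch below models that packing on plain byte sizes. `pack_chunks` is a hypothetical stand-in, and it assumes the chunk is reset after each flush, which this hunk does not show.

```cpp
#include <cstddef>
#include <vector>

// Groups tensor sizes into chunks of at most max_size bytes where possible.
// max_size == 0 means "no limit". A tensor larger than max_size ends up alone.
std::vector<std::vector<std::size_t>> pack_chunks(const std::vector<std::size_t>& sizes,
                                                  std::size_t max_size) {
    std::vector<std::vector<std::size_t>> chunks;
    std::vector<std::size_t> chunk;
    std::size_t chunk_size = 0;
    for (std::size_t size : sizes) {
        // Flush the current chunk when adding this tensor would exceed the preferred size.
        if (!chunk.empty() && max_size > 0 && chunk_size + size > max_size) {
            chunks.push_back(chunk);
            chunk.clear();
            chunk_size = 0;
        }
        chunk.push_back(size);
        chunk_size += size;
    }
    if (!chunk.empty()) {
        chunks.push_back(chunk);
    }
    return chunks;
}
```

With sizes `{4, 4, 20, 4}` and a preferred size of 10, the oversized entry is isolated between two ordinary chunks. Whether the backend can actually satisfy the oversized allocation is a separate question.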
--- a/src/model_manager.h
+++ b/src/model_manager.h
@ -69,6 +69,7 @@ private:
    uint64_t current_lora_epoch_ = 0;
    int n_threads_               = 0;
    bool enable_mmap_            = false;
+    bool writable_mmap_          = false;

    void finish_compute_backend_usage(const std::vector<TensorState*>& states);
    void release_all();
@ -110,6 +111,7 @@ public:
        model_loader_.set_n_threads(n_threads);
    }
    void set_enable_mmap(bool enable_mmap) { enable_mmap_ = enable_mmap; }
+    void set_writable_mmap(bool writable_mmap) { writable_mmap_ = writable_mmap; }
    void set_common_ignore_tensors(std::set<std::string> ignore_tensors);
    void set_loras(std::vector<LoraSpec> loras, SDVersion version);

@ -158,6 +160,7 @@ public:
    }

    bool validate_registered_tensors();
+    bool load_all_params_eagerly();

    bool prepare_params(const std::vector<ggml_tensor*>& tensors) override;
    void release_compute_backend_params(const std::vector<ggml_tensor*>& tensors) override;
--- a/src/name_conversion.cpp
+++ b/src/name_conversion.cpp
@ -184,6 +184,27 @@ std::string convert_cond_stage_model_name(std::string name, std::string prefix)
    return name;
 }

+std::string convert_qwen3_vl_vision_name(std::string name) {
+    static const std::vector<std::pair<std::string, std::string>> qwen3_vl_vision_name_map{
+        {"mm.0.", "merger.linear_fc1."},
+        {"mm.2.", "merger.linear_fc2."},
+        {"v.post_ln.", "merger.norm."},
+        {"v.position_embd.weight", "pos_embed.weight"},
+        {"v.patch_embd.weight.1", "patch_embed.proj.1.weight"},
+        {"v.patch_embd.weight", "patch_embed.proj.0.weight"},
+        {"v.patch_embd.bias", "patch_embed.bias"},
+        {"v.blk.", "blocks."},
+        {"attn_qkv.", "attn.qkv."},
+        {"attn_out.", "attn.proj."},
+        {"ffn_up.", "mlp.linear_fc1."},
+        {"ffn_down.", "mlp.linear_fc2."},
+        {"ln1.", "norm1."},
+        {"ln2.", "norm2."},
+    };
+    replace_with_name_map(name, qwen3_vl_vision_name_map);
+    return name;
+}
+
 // ref: https://github.com/huggingface/diffusers/blob/main/scripts/convert_diffusers_to_original_stable_diffusion.py
 std::string convert_diffusers_unet_to_original_sd1(std::string name) {
    // (stable-diffusion, HF Diffusers)
@ -1154,6 +1175,10 @@ std::string convert_tensor_name(std::string name, SDVersion version) {

    replace_with_prefix_map(name, prefix_map);

+    if (sd_version_is_boogu_image(version) && starts_with(name, "text_encoders.llm.visual.")) {
+        name = convert_qwen3_vl_vision_name(std::move(name));
+    }
+
    // diffusion model
    {
        for (const auto& prefix : diffuison_model_prefix_vec) {
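Review note on `convert_qwen3_vl_vision_name`: the entry for `v.patch_embd.weight.1` is listed before `v.patch_embd.weight`. If the map is applied in order as substring replacements, that ordering matters. The sketch below illustrates this with `apply_name_map`, a hypothetical stand-in for `replace_with_name_map`, whose exact behaviour is not shown in this diff.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using NameMap = std::vector<std::pair<std::string, std::string>>;

// Hypothetical stand-in: applies each entry in order, replacing every
// occurrence of the key with its value.
std::string apply_name_map(std::string name, const NameMap& map) {
    for (const auto& [from, to] : map) {
        std::size_t pos = 0;
        while ((pos = name.find(from, pos)) != std::string::npos) {
            name.replace(pos, from.size(), to);
            pos += to.size();  // continue after the inserted text
        }
    }
    return name;
}
```

Under that assumption, putting the shorter key first would turn `v.patch_embd.weight.1` into `patch_embed.proj.0.weight.1`, so the longer key has to come first.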
--- a/src/runtime/guidance.cpp
+++ b/src/runtime/guidance.cpp
@ -3,6 +3,7 @@
 #include <algorithm>
 #include <cmath>
 #include <cstdlib>
+#include <optional>
 #include <string>
 #include <utility>

@ -63,6 +64,82 @@ namespace sd::guidance {
        return uncond;
    }

+    std::vector<float> parse_guidance_schedule_from_spec(std::string spec) {
+        std::vector<float> schedule;
+
+        while (!spec.empty()) {
+            auto sep     = spec.find('+');
+            auto segment = spec.substr(0, sep);
+
+            auto x = segment.find('x');
+            if (x == std::string::npos) {
+                LOG_ERROR("Invalid guidance schedule segment: '%s' (expected <guidance>x<count>)", segment.c_str());
+                return {};
+            }
+
+            float guidance;
+            int count;
+
+            auto guidance_str = segment.substr(0, x);
+            auto count_str    = segment.substr(x + 1);
+
+            try {
+                size_t idx = 0;
+                guidance   = std::stof(guidance_str, &idx);
+                if (idx != guidance_str.size()) {
+                    LOG_ERROR("Invalid guidance value in guidance schedule: '%s'", guidance_str.c_str());
+                    return {};
+                }
+            } catch (const std::exception&) {
+                LOG_ERROR("Invalid guidance value in guidance schedule: '%s'", guidance_str.c_str());
+                return {};
+            }
+
+            try {
+                size_t idx = 0;
+                count      = std::stoi(count_str, &idx);
+                if (idx != count_str.size()) {
+                    LOG_ERROR("Invalid count in guidance schedule: '%s'", count_str.c_str());
+                    return {};
+                }
+            } catch (const std::exception&) {
+                LOG_ERROR("Invalid count in guidance schedule: '%s'", count_str.c_str());
+                return {};
+            }
+
+            if (count <= 0) {
+                LOG_ERROR("Guidance schedule count must be positive, got %d", count);
+                return {};
+            }
+
+            schedule.insert(schedule.end(), count, guidance);
+
+            if (sep == std::string::npos) {
+                break;
+            }
+
+            spec = spec.substr(sep + 1);
+        }
+
+        return schedule;
+    }
+
+    std::vector<float> parse_guidance_schedule(const char* extra_sample_args) {
+        std::vector<float> guidance_schedule;
+        std::string guidance_schedule_str;
+        for (const auto& [key, value] : parse_key_value_args(extra_sample_args, "extra sample arg")) {
+            // Only the guidance_schedule key is consumed here; other keys are ignored.
+            if (key == "guidance_schedule") {
+                guidance_schedule_str = value;
+            }
+        }
+
+        if (!guidance_schedule_str.empty()) {
+            guidance_schedule = parse_guidance_schedule_from_spec(guidance_schedule_str);
+        }
+        return guidance_schedule;
+    }
+
    ClassifierFreeGuidance::ClassifierFreeGuidance(float guidance_scale,
                                                   float image_guidance_scale)
        : guidance_scale_(guidance_scale),
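Review note on `parse_guidance_schedule_from_spec`: the spec is a `+`-separated list of `<guidance>x<count>` segments, each expanded into `count` copies of `guidance`. The standalone sketch below follows the same parsing steps so the format can be tried in isolation. `parse_schedule_sketch` is a hypothetical name, and it returns `std::nullopt` on malformed input where the patch logs an error and returns an empty vector.

```cpp
#include <cstddef>
#include <exception>
#include <optional>
#include <string>
#include <vector>

// Parses "<guidance>x<count>" segments joined by '+', e.g. "3.5x4+1.0x2".
std::optional<std::vector<float>> parse_schedule_sketch(std::string spec) {
    std::vector<float> schedule;
    while (!spec.empty()) {
        const auto sep            = spec.find('+');
        const std::string segment = spec.substr(0, sep);
        const auto x              = segment.find('x');
        if (x == std::string::npos) {
            return std::nullopt;  // segment is missing the 'x' separator
        }
        const std::string guidance_str = segment.substr(0, x);
        const std::string count_str    = segment.substr(x + 1);
        float guidance                 = 0.0f;
        int count                      = 0;
        try {
            std::size_t idx = 0;
            guidance        = std::stof(guidance_str, &idx);
            if (idx != guidance_str.size()) {
                return std::nullopt;  // trailing characters after the number
            }
            idx   = 0;
            count = std::stoi(count_str, &idx);
            if (idx != count_str.size()) {
                return std::nullopt;
            }
        } catch (const std::exception&) {
            return std::nullopt;  // not a number, or out of range
        }
        if (count <= 0) {
            return std::nullopt;
        }
        schedule.insert(schedule.end(), static_cast<std::size_t>(count), guidance);
        if (sep == std::string::npos) {
            break;
        }
        spec = spec.substr(sep + 1);
    }
    return schedule;
}
```

For example, `"3.5x4+1.0x2"` expands to four entries of 3.5 followed by two entries of 1.0, giving one guidance value per sampling step.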
@ -70,8 +147,10 @@ namespace sd::guidance {
    }

    GuiderOutput ClassifierFreeGuidance::forward(const GuidanceInput& input,
-                                                 GuiderOutput previous) const {
+                                                 GuiderOutput previous,
+                                                 std::optional<float> scale_override) const {
        (void)previous;
+        float guidance_scale = scale_override.value_or(guidance_scale_);

        GuiderOutput output;
        if (!has_tensor(input.pred_cond)) {
@ -86,14 +165,14 @@ namespace sd::guidance {
                const sd::Tensor<float>& pred_img_uncond = *input.pred_img_uncond;
                output.pred                              = pred_img_uncond +
                              image_guidance_scale_ * (pred_uncond - pred_img_uncond) +
-                              guidance_scale_ * (pred_cond - pred_uncond);
+                              guidance_scale * (pred_cond - pred_uncond);

            } else {
-                output.pred = pred_uncond + guidance_scale_ * (pred_cond - pred_uncond);
+                output.pred = pred_uncond + guidance_scale * (pred_cond - pred_uncond);
            }
        } else if (has_tensor(input.pred_img_uncond)) {
            const sd::Tensor<float>& pred_img_uncond = *input.pred_img_uncond;
-            output.pred                              = pred_img_uncond + guidance_scale_ * (pred_cond - pred_img_uncond);
+            output.pred                              = pred_img_uncond + guidance_scale * (pred_cond - pred_img_uncond);
        }

        return output;
@ -128,8 +207,10 @@ namespace sd::guidance {
    }

    GuiderOutput AdaptiveProjectedGuidance::forward(const GuidanceInput& input,
-                                                    GuiderOutput previous) const {
+                                                    GuiderOutput previous,
+                                                    std::optional<float> scale_override) const {
        (void)previous;
+        float guidance_scale = scale_override.value_or(guidance_scale_);

        GuiderOutput output;
        if (!has_tensor(input.pred_cond)) {
@ -144,13 +225,13 @@ namespace sd::guidance {
                const sd::Tensor<float>& pred_img_uncond = *input.pred_img_uncond;
                output.pred                              = pred_img_uncond +
                              image_guidance_scale_ * (pred_uncond - pred_img_uncond) +
-                              guidance_scale_ * (pred_cond - pred_uncond);
+                              guidance_scale * (pred_cond - pred_uncond);
            } else {
-                output.pred = pred_uncond + guidance_scale_ * (pred_cond - pred_uncond);
+                output.pred = pred_uncond + guidance_scale * (pred_cond - pred_uncond);
            }
        } else if (has_tensor(input.pred_img_uncond)) {
            const sd::Tensor<float>& pred_img_uncond = *input.pred_img_uncond;
-            output.pred                              = pred_img_uncond + guidance_scale_ * (pred_cond - pred_img_uncond);
+            output.pred                              = pred_img_uncond + guidance_scale * (pred_cond - pred_img_uncond);
        }
        if (!has_tensor(input.pred_uncond) && !has_tensor(input.pred_img_uncond)) {
            return output;
@ -162,7 +243,7 @@ namespace sd::guidance {
        sd::Tensor<float> deltas = calculate_guidance_delta(pred_cond,
                                                            pred_uncond,
                                                            pred_img_uncond,
-                                                            guidance_scale_,
+                                                            guidance_scale,
                                                            image_guidance_scale_);
        if (params_.momentum != 0.0f) {
            if (momentum_buffer_.shape() != deltas.shape()) {
@ -239,7 +320,8 @@ namespace sd::guidance {
    }

    GuiderOutput SkipLayerGuidance::forward(const GuidanceInput& input,
-                                            GuiderOutput output) const {
+                                            GuiderOutput output,
+                                            std::optional<float> /*scale_override*/) const {
        if (scale_ == 0.0f || !is_enabled_for_step(input) || !input.predict_skip_layer) {
            return output;
        }
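Note on the `scale_override` parameter added above: the per-step override is resolved with `std::optional<float>::value_or`, so callers that pass nothing keep the stored scale. A minimal sketch of that pattern on plain floats follows. `ScalarCfg` is a hypothetical stand-in and not part of this change; the real guiders operate on `sd::Tensor<float>`.

```cpp
#include <optional>

// Hypothetical scalar stand-in for a guider; predictions are plain floats here.
struct ScalarCfg {
    float default_scale;

    // Classifier-free guidance on scalars: uncond + scale * (cond - uncond).
    // A per-step override, when present, replaces the stored default scale.
    float forward(float pred_cond,
                  float pred_uncond,
                  std::optional<float> scale_override = std::nullopt) const {
        const float scale = scale_override.value_or(default_scale);
        return pred_uncond + scale * (pred_cond - pred_uncond);
    }
};
```

Calling `forward(cond, uncond)` uses `default_scale`, while `forward(cond, uncond, 0.5f)` uses 0.5 for that call only and leaves the stored default untouched.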
--- a/src/runtime/guidance.h
+++ b/src/runtime/guidance.h
@ -3,6 +3,7 @@

 #include <cstddef>
 #include <functional>
+#include <optional>
 #include <vector>

 #include "core/tensor.hpp"
@ -27,6 +28,7 @@ namespace sd::guidance {
    AdaptiveProjectedGuidanceParams parse_adaptive_projected_guidance_args(const char* extra_sample_args);
    bool is_adaptive_projected_guidance_enabled(const AdaptiveProjectedGuidanceParams& params);
    bool parse_skip_layer_guidance_uncond_arg(const char* extra_sample_args);
+    std::vector<float> parse_guidance_schedule(const char* extra_sample_args);

    struct GuidanceInput {
        int step                                 = 0;
@ -40,9 +42,10 @@ namespace sd::guidance {

    class BaseGuidance {
    public:
-        virtual ~BaseGuidance()                                   = default;
+        virtual ~BaseGuidance()                                                                = default;
        virtual GuiderOutput forward(const GuidanceInput& input,
-                                     GuiderOutput previous) const = 0;
+                                     GuiderOutput previous,
+                                     std::optional<float> scale_override = std::nullopt) const = 0;
    };

    class ClassifierFreeGuidance : public BaseGuidance {
@ -54,7 +57,8 @@ namespace sd::guidance {
                               float image_guidance_scale);

        GuiderOutput forward(const GuidanceInput& input,
-                             GuiderOutput previous) const override;
+                             GuiderOutput previous,
+                             std::optional<float> scale_override = std::nullopt) const override;
    };

    class AdaptiveProjectedGuidance : public BaseGuidance {
@ -69,7 +73,8 @@ namespace sd::guidance {
                                  AdaptiveProjectedGuidanceParams params);

        GuiderOutput forward(const GuidanceInput& input,
-                             GuiderOutput previous) const override;
+                             GuiderOutput previous,
+                             std::optional<float> scale_override = std::nullopt) const override;
    };

    class SkipLayerGuidance : public BaseGuidance {
@ -88,7 +93,8 @@ namespace sd::guidance {
        const std::vector<int>& layers() const;

        GuiderOutput forward(const GuidanceInput& input,
-                             GuiderOutput previous) const override;
+                             GuiderOutput previous,
+                             std::optional<float> scale_override = std::nullopt) const override;
    };

 }  // namespace sd::guidance
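Note on `parse_guidance_schedule`: the returned vector has to line up with the sampler's step count (`sigmas.size() - 1` in this change), so the caller truncates a schedule that is too long and pads one that is too short with `cfg_scale`. A minimal sketch of that fitting step, assuming a hypothetical helper name `fit_guidance_schedule` that does not exist in the codebase:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: make a non-empty schedule exactly `steps` entries long.
std::vector<float> fit_guidance_schedule(std::vector<float> schedule,
                                         std::size_t steps,
                                         float cfg_scale) {
    if (schedule.empty()) {
        return schedule;  // empty means "no schedule"; the caller keeps using cfg_scale
    }
    // resize() drops trailing entries when too long and appends cfg_scale when too short.
    schedule.resize(steps, cfg_scale);
    return schedule;
}
```

Comparing against the step count in one place avoids separate truncate and pad branches drifting apart.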
--- a/src/stable-diffusion.cpp
+++ b/src/stable-diffusion.cpp
@ -3,6 +3,7 @@
 #include <cstdlib>
 #include <set>
 #include <unordered_set>
+#include <vector>

 #include "core/ggml_extend.hpp"
 #include "core/ggml_graph_cut.h"
@ -19,6 +20,7 @@
 #include "extensions/generation_extension.h"
 #include "model/adapter/lora.hpp"
 #include "model/diffusion/anima.hpp"
+#include "model/diffusion/boogu.hpp"
 #include "model/diffusion/control.hpp"
 #include "model/diffusion/ernie_image.hpp"
 #include "model/diffusion/flux.hpp"
@ -52,6 +54,8 @@
 const char* sd_vae_format_name(enum sd_vae_format_t format);
 static SDVersion sd_vae_format_to_version(enum sd_vae_format_t format, SDVersion fallback);

+#include <atomic>
+
 const char* model_version_to_str[] = {
    "SD 1.x",
    "SD 1.x Inpaint",
@ -84,6 +88,7 @@ const char* model_version_to_str[] = {
    "LTXAV",
    "HiDream O1",
    "Z-Image",
+    "Boogu Image",
    "Ovis Image",
    "Ernie Image",
    "Lens",
@ -121,7 +126,8 @@ static bool sd_version_supports_ref_latent_img_cfg(SDVersion version) {
           sd_version_is_flux2(version) ||
           sd_version_is_qwen_image(version) ||
           sd_version_is_longcat(version) ||
-           sd_version_is_z_image(version);
+           sd_version_is_z_image(version) ||
+           sd_version_is_boogu_image(version);
 }

 static bool sd_version_supports_img_cfg(SDVersion version, bool has_ref_images) {
@ -158,6 +164,9 @@ static float get_cache_reuse_threshold(const sd_cache_params_t& params) {

 /*=============================================== StableDiffusionGGML ================================================*/

+static_assert(std::atomic<sd_cancel_mode_t>::is_always_lock_free,
+              "sd_cancel_mode_t must be lock-free");
+
 class StableDiffusionGGML {
 public:
    SDBackendManager backend_manager;
@ -188,8 +197,9 @@ public:
    std::string taesd_path;
    sd_tiling_params_t vae_tiling_params = {false, false, 0, 0, 0.5f, 0, 0, nullptr};
    bool enable_mmap                     = false;
-    float max_vram                       = 0.f;
-    bool stream_layers                   = false;
+    sd::ggml_graph_cut::MaxVramAssignment max_vram_assignment;
+    bool stream_layers = false;
+    bool eager_load    = false;
    std::string backend_spec;
    std::string params_backend_spec;

@ -221,6 +231,24 @@ public:
        return module_backend;
    }

+    std::atomic<sd_cancel_mode_t> cancellation_flag = SD_CANCEL_RESET;
+
+    void set_cancel_flag(enum sd_cancel_mode_t flag) {
+        cancellation_flag.store(flag, std::memory_order_release);
+    }
+
+    void reset_cancel_flag() {
+        set_cancel_flag(SD_CANCEL_RESET);
+    }
+
+    enum sd_cancel_mode_t get_cancel_flag() {
+        return cancellation_flag.load(std::memory_order_acquire);
+    }
+
+    size_t max_graph_vram_bytes_for_module(SDBackendModule module) {
+        return max_vram_assignment.bytes_for_backend(backend_for(module));
+    }
+
    bool ensure_backend_pair(SDBackendModule module) {
        if (backend_for(module) == nullptr) {
            return false;
@ -314,19 +342,22 @@ public:
    bool init(const sd_ctx_params_t* sd_ctx_params) {
        n_threads           = sd_ctx_params->n_threads;
        enable_mmap         = sd_ctx_params->enable_mmap;
-        max_vram            = sd_ctx_params->max_vram;
        stream_layers       = sd_ctx_params->stream_layers;
+        eager_load          = sd_ctx_params->eager_load;
        backend_spec        = SAFE_STR(sd_ctx_params->backend);
        params_backend_spec = SAFE_STR(sd_ctx_params->params_backend);
+        max_vram_assignment.reset(0.f);
+        {
+            std::string error;
+            if (!max_vram_assignment.parse(SAFE_STR(sd_ctx_params->max_vram), &error)) {
+                LOG_ERROR("%s", error.c_str());
+                return false;
+            }
+        }

        std::string rpc_servers_spec = SAFE_STR(sd_ctx_params->rpc_servers);
        add_rpc_devices(rpc_servers_spec);

-        if (stream_layers && max_vram == 0.f) {
-            LOG_WARN("--stream-layers has no effect without --max-vram set; ignoring");
-            stream_layers = false;
-        }
-
        bool use_tae         = false;
        bool use_audio_vae   = false;
        bool use_control_net = false;
@ -343,11 +374,17 @@ public:
        if (!init_backend()) {
            return false;
        }
+        {
+            std::string error;
+            if (!max_vram_assignment.canonicalize_backend_keys(&error)) {
+                LOG_ERROR("%s", error.c_str());
+                return false;
+            }
+        }
        if (stream_layers && !backend_manager.params_backend_is_cpu(SDBackendModule::DIFFUSION)) {
            LOG_WARN("--stream-layers has no effect unless diffusion params backend is cpu; ignoring");
            stream_layers = false;
        }
-        max_vram = sd::ggml_graph_cut::resolve_max_vram_gib(max_vram, backend_for(SDBackendModule::DIFFUSION));

        model_manager = std::make_shared<ModelManager>();
        model_manager->set_n_threads(n_threads);
@ -415,6 +452,14 @@ public:
            }
        }

+        if (strlen(SAFE_STR(sd_ctx_params->pulid_weights_path)) > 0) {
+            LOG_INFO("loading PuLID weights from '%s'", sd_ctx_params->pulid_weights_path);
+            if (!model_loader.init_from_file(sd_ctx_params->pulid_weights_path,
+                                             "model.diffusion_model.")) {
+                LOG_WARN("loading PuLID weights from '%s' failed", sd_ctx_params->pulid_weights_path);
+            }
+        }
+
        if (strlen(SAFE_STR(sd_ctx_params->llm_path)) > 0) {
            LOG_INFO("loading llm from '%s'", sd_ctx_params->llm_path);
            if (!model_loader.init_from_file(sd_ctx_params->llm_path, "text_encoders.llm.")) {
@ -482,14 +527,11 @@ public:
        auto& tensor_storage_map = model_loader.get_tensor_storage_map();

        LOG_INFO("Version: %s ", model_version_to_str[version]);
-        ggml_type wtype               = (int)sd_ctx_params->wtype < std::min<int>(SD_TYPE_COUNT, GGML_TYPE_COUNT)
-                                            ? (ggml_type)sd_ctx_params->wtype
-                                            : GGML_TYPE_COUNT;
+        ggml_type wtype               = sd_type_to_ggml_type(sd_ctx_params->wtype);
        std::string tensor_type_rules = SAFE_STR(sd_ctx_params->tensor_type_rules);
        if (wtype != GGML_TYPE_COUNT || tensor_type_rules.size() > 0) {
            model_loader.set_wtype_override(wtype, tensor_type_rules);
        }
-        model_loader.process_model_files(enable_mmap, true);

        std::map<ggml_type, uint32_t> wtype_stat                 = model_loader.get_wtype_stat();
        std::map<ggml_type, uint32_t> conditioner_wtype_stat     = model_loader.get_conditioner_wtype_stat();
@ -543,9 +585,12 @@ public:
            apply_lora_immediately = false;
        }

+        bool needs_writable_mmap = enable_mmap && apply_lora_immediately;
+        model_manager->set_writable_mmap(needs_writable_mmap);
        if (enable_mmap && apply_lora_immediately) {
            LOG_WARN("in mode 'immediately', LoRAs will cause extra memory usage with mmap");
        }
+        model_loader.process_model_files(enable_mmap, needs_writable_mmap);
        load_alphas_cumprod(model_loader);

        size_t text_encoder_params_mem_size = 0;
@ -564,8 +609,6 @@ public:
            LOG_INFO("Using circular padding for convolutions");
        }

-        const size_t max_graph_vram_bytes = sd::ggml_graph_cut::max_vram_gib_to_bytes(max_vram);
-
        {
            if (!ensure_backend_pair(SDBackendModule::TE) ||
                !ensure_backend_pair(SDBackendModule::DIFFUSION)) {
@ -687,7 +730,7 @@ public:
                    clip_vision = std::make_shared<FrozenCLIPVisionEmbedder>(backend_for(SDBackendModule::CLIP_VISION),
                                                                             tensor_storage_map,
                                                                             model_manager);
-                    clip_vision->set_max_graph_vram_bytes(max_graph_vram_bytes);
+                    clip_vision->set_max_graph_vram_bytes(max_graph_vram_bytes_for_module(SDBackendModule::CLIP_VISION));
                    if (!register_runner_params("CLIP vision",
                                                clip_vision,
                                                SDBackendModule::CLIP_VISION)) {
@ -748,6 +791,18 @@ public:
                                                                         "model.diffusion_model",
                                                                         version,
                                                                         model_manager);
+            } else if (sd_version_is_boogu_image(version)) {
+                cond_stage_model = std::make_shared<LLMEmbedder>(backend_for(SDBackendModule::TE),
+                                                                 tensor_storage_map,
+                                                                 version,
+                                                                 "",
+                                                                 true,
+                                                                 model_manager);
+                diffusion_model  = std::make_shared<Boogu::BooguImageRunner>(backend_for(SDBackendModule::DIFFUSION),
+                                                                            tensor_storage_map,
+                                                                            "model.diffusion_model",
+                                                                            version,
+                                                                            model_manager);
            } else if (sd_version_is_ernie_image(version)) {
                cond_stage_model = std::make_shared<LLMEmbedder>(backend_for(SDBackendModule::TE),
                                                                 tensor_storage_map,
@ -791,7 +846,7 @@ public:
                }
            }

-            cond_stage_model->set_max_graph_vram_bytes(max_graph_vram_bytes);
+            cond_stage_model->set_max_graph_vram_bytes(max_graph_vram_bytes_for_module(SDBackendModule::TE));
            if (!register_runner_params("Conditioner model",
                                        cond_stage_model,
                                        SDBackendModule::TE,
@ -799,7 +854,7 @@ public:
                return false;
            }

-            diffusion_model->set_max_graph_vram_bytes(max_graph_vram_bytes);
+            diffusion_model->set_max_graph_vram_bytes(max_graph_vram_bytes_for_module(SDBackendModule::DIFFUSION));
            diffusion_model->set_stream_layers_enabled(stream_layers);
            if (!register_runner_params("Diffusion model",
                                        diffusion_model,
@ -809,7 +864,7 @@ public:
            }

            if (high_noise_diffusion_model) {
-                high_noise_diffusion_model->set_max_graph_vram_bytes(max_graph_vram_bytes);
+                high_noise_diffusion_model->set_max_graph_vram_bytes(max_graph_vram_bytes_for_module(SDBackendModule::DIFFUSION));
                high_noise_diffusion_model->set_stream_layers_enabled(stream_layers);
                if (!register_runner_params("High noise diffusion model",
                                            high_noise_diffusion_model,
@ -908,7 +963,7 @@ public:
            } else if (use_tae && !tae_preview_only) {
                LOG_INFO("using TAE for encoding / decoding");
                first_stage_model = create_tae(false);
-                first_stage_model->set_max_graph_vram_bytes(max_graph_vram_bytes);
+                first_stage_model->set_max_graph_vram_bytes(max_graph_vram_bytes_for_module(SDBackendModule::VAE));
                if (!register_runner_params("VAE",
                                            first_stage_model,
                                            SDBackendModule::VAE,
@ -918,7 +973,7 @@ public:
            } else {
                LOG_INFO("using VAE for encoding / decoding");
                first_stage_model = create_vae();
-                first_stage_model->set_max_graph_vram_bytes(max_graph_vram_bytes);
+                first_stage_model->set_max_graph_vram_bytes(max_graph_vram_bytes_for_module(SDBackendModule::VAE));
                if (!register_runner_params("VAE",
                                            first_stage_model,
                                            SDBackendModule::VAE,
@ -928,7 +983,7 @@ public:
                if (use_tae && tae_preview_only) {
                    LOG_INFO("using TAE for preview");
                    preview_vae = create_tae(true);
-                    preview_vae->set_max_graph_vram_bytes(max_graph_vram_bytes);
+                    preview_vae->set_max_graph_vram_bytes(max_graph_vram_bytes_for_module(SDBackendModule::VAE));
                    if (!register_runner_params("preview VAE",
                                                preview_vae,
                                                SDBackendModule::VAE,
@ -1001,6 +1056,14 @@ public:
                if (photomaker_extension->is_enabled()) {
                    generation_extensions.push_back(photomaker_extension);
                }
+
+                auto pulid_extension = create_pulid_extension();
+                if (!pulid_extension->init(extension_ctx)) {
+                    return false;
+                }
+                if (pulid_extension->is_enabled()) {
+                    generation_extensions.push_back(pulid_extension);
+                }
            }
            for (auto& extension : generation_extensions) {
                if (!register_runner_params(extension->name(),
@ -1094,7 +1157,15 @@ public:
            return false;
        }

-        LOG_DEBUG("model metadata validated; weights will be prepared lazily");
+        if (eager_load) {
+            if (!model_manager->load_all_params_eagerly()) {
+                LOG_ERROR("model params eager load failed");
+                return false;
+            }
+            LOG_DEBUG("model metadata validated; weights pre-loaded to params backend");
+        } else {
+            LOG_DEBUG("model metadata validated; weights will be prepared lazily");
+        }

        {
            size_t total_params_ram_size  = 0;
@ -1176,6 +1247,7 @@ public:
                           sd_version_is_anima(version) ||
                           sd_version_is_ernie_image(version) ||
                           sd_version_is_z_image(version) ||
+                           sd_version_is_boogu_image(version) ||
                           sd_version_is_pid(version) ||
                           sd_version_is_ideogram4(version)) {
                    pred_type = FLOW_PRED;
@ -1187,6 +1259,8 @@ public:
                        default_flow_shift = 1.5f;
                    } else if (sd_version_is_ideogram4(version)) {
                        default_flow_shift = 1.0f;
+                    } else if (sd_version_is_boogu_image(version)) {
+                        default_flow_shift = 3.16f;
                    } else {
                        default_flow_shift = 3.f;
                    }
@ -1511,6 +1585,7 @@ public:
    }

    void prepare_generation_extensions(const sd_pm_params_t& pm_params,
+                                       const sd_pulid_params_t& pulid_params,
                                       ConditionerParams& condition_params,
                                       int total_steps) {
        reset_generation_extensions();
@ -1518,6 +1593,7 @@ public:
            cond_stage_model.get(),
            condition_params,
            pm_params,
+            pulid_params,
            n_threads,
            total_steps,
        };
@ -1645,7 +1721,7 @@ public:
                if (sd_version_is_sd3(version)) {
                    latent_rgb_proj = sd3_latent_rgb_proj;
                    latent_rgb_bias = sd3_latent_rgb_bias;
-                } else if (sd_version_is_flux(version) || sd_version_is_z_image(version) || sd_version_is_longcat(version)) {
+                } else if (sd_version_uses_flux_vae(version)) {
                    latent_rgb_proj = flux_latent_rgb_proj;
                    latent_rgb_bias = flux_latent_rgb_bias;
                } else if (sd_version_is_wan(version) || sd_version_is_qwen_image(version) || sd_version_is_anima(version)) {
@ -1740,6 +1816,9 @@ public:
        if (sd_version_is_anima(version)) {
            return std::vector<float>{t / static_cast<float>(TIMESTEPS)};
        }
+        if (sd_version_is_boogu_image(version)) {
+            return std::vector<float>{t / static_cast<float>(TIMESTEPS)};
+        }
        if (version == VERSION_HIDREAM_O1) {
            return std::vector<float>{1.0f - (t / static_cast<float>(TIMESTEPS))};
        }
@ -1865,6 +1944,32 @@ public:
        float slg_scale     = guidance.slg.scale;
        bool slg_uncond     = sd::guidance::parse_skip_layer_guidance_uncond_arg(extra_sample_args);

+        std::vector<float> guidance_schedule = sd::guidance::parse_guidance_schedule(extra_sample_args);
+        if (!guidance_schedule.empty() && guidance_schedule.size() != sigmas.size() - 1) {
+            if (guidance_schedule.size() > sigmas.size() - 1) {
+                LOG_WARN("guidance_schedule length (%zu) is greater than number of steps (%zu)", guidance_schedule.size(), sigmas.size() - 1);
+                LOG_WARN("truncating guidance_schedule to match step count");
+                guidance_schedule.resize(sigmas.size() - 1);
+            } else {
+                LOG_INFO("padding guidance_schedule with cfg_scale");
+                while (guidance_schedule.size() < sigmas.size() - 1) {
+                    guidance_schedule.push_back(cfg_scale);
+                }
+            }
+        }
+
+        if (!guidance_schedule.empty()) {
+            std::string schedule_str = "[";
+            for (size_t i = 0; i < guidance_schedule.size(); ++i) {
+                schedule_str += std::to_string(guidance_schedule[i]);
+                if (i < guidance_schedule.size() - 1) {
+                    schedule_str += ", ";
+                }
+            }
+            schedule_str += "]";
+            LOG_DEBUG("using guidance schedule: %s", schedule_str.c_str());
+        }
+
        sd_sample::SampleCacheRuntime cache_runtime = sd_sample::init_sample_cache_runtime(version,
                                                                                           cache_params,
                                                                                           denoiser.get(),
@ -1912,6 +2017,11 @@ public:
        SamplePreviewContext preview = prepare_sample_preview_context();

        auto denoise = [&](const sd::Tensor<float>& x, float sigma, int step) -> sd::guidance::GuiderOutput {
+            if (get_cancel_flag() == SD_CANCEL_ALL) {
+                LOG_DEBUG("cancelling generation");
+                return {};
+            }
+
            if (step == 1 || step == -1) {
                pretty_progress(0, (int)steps, 0);
                last_progress_us = ggml_time_us();
@ -2032,6 +2142,10 @@ public:
                    return std::move(cached_output);
                }

+                for (const auto& extension : generation_extensions) {
+                    extension->before_diffusion(diffusion_params, step);
+                }
+
                auto output_opt = work_diffusion_model->compute(n_threads, diffusion_params);
                if (output_opt.empty()) {
                    LOG_ERROR("diffusion model compute failed");
@ -2096,7 +2210,7 @@ public:
            guidance_input.pred_uncond     = uncond_out.empty() ? nullptr : &uncond_out;
            guidance_input.pred_img_uncond = img_uncond_out.empty() ? nullptr : &img_uncond_out;

-            sd::guidance::GuiderOutput guided = primary_guidance.forward(guidance_input, {});
+            sd::guidance::GuiderOutput guided = guidance_schedule.empty() ? primary_guidance.forward(guidance_input, {}) : primary_guidance.forward(guidance_input, {}, guidance_schedule[std::min<size_t>(static_cast<size_t>(std::abs(step)) - 1, guidance_schedule.size() - 1)]);
            if (guided.pred.empty()) {
                return {};
            }
@ -2618,8 +2732,9 @@ void sd_ctx_params_init(sd_ctx_params_t* sd_ctx_params) {
    sd_ctx_params->sampler_rng_type     = RNG_TYPE_COUNT;
    sd_ctx_params->prediction           = PREDICTION_COUNT;
    sd_ctx_params->lora_apply_mode      = LORA_APPLY_AUTO;
-    sd_ctx_params->max_vram             = 0.f;
+    sd_ctx_params->max_vram             = nullptr;
    sd_ctx_params->stream_layers        = false;
+    sd_ctx_params->eager_load           = false;
    sd_ctx_params->enable_mmap          = false;
    sd_ctx_params->diffusion_flash_attn = false;
    sd_ctx_params->circular_x           = false;
@ -2630,6 +2745,8 @@ void sd_ctx_params_init(sd_ctx_params_t* sd_ctx_params) {
    sd_ctx_params->vae_format           = SD_VAE_FORMAT_AUTO;
    sd_ctx_params->backend              = nullptr;
    sd_ctx_params->params_backend       = nullptr;
+    sd_ctx_params->rpc_servers          = nullptr;
+    sd_ctx_params->pulid_weights_path   = nullptr;
 }

 char* sd_ctx_params_to_str(const sd_ctx_params_t* sd_ctx_params) {
@ -2655,14 +2772,16 @@ char* sd_ctx_params_to_str(const sd_ctx_params_t* sd_ctx_params) {
             "taesd_path: %s\n"
             "control_net_path: %s\n"
             "photo_maker_path: %s\n"
+             "pulid_weights_path: %s\n"
             "tensor_type_rules: %s\n"
             "n_threads: %d\n"
             "wtype: %s\n"
             "rng_type: %s\n"
             "sampler_rng_type: %s\n"
             "prediction: %s\n"
-             "max_vram: %.3f\n"
+             "max_vram: %s\n"
             "stream_layers: %s\n"
+             "eager_load: %s\n"
             "backend: %s\n"
             "params_backend: %s\n"
             "flash_attn: %s\n"
@ -2689,14 +2808,16 @@ char* sd_ctx_params_to_str(const sd_ctx_params_t* sd_ctx_params) {
             SAFE_STR(sd_ctx_params->taesd_path),
             SAFE_STR(sd_ctx_params->control_net_path),
             SAFE_STR(sd_ctx_params->photo_maker_path),
+             SAFE_STR(sd_ctx_params->pulid_weights_path),
             SAFE_STR(sd_ctx_params->tensor_type_rules),
             sd_ctx_params->n_threads,
             sd_type_name(sd_ctx_params->wtype),
             sd_rng_type_name(sd_ctx_params->rng_type),
             sd_rng_type_name(sd_ctx_params->sampler_rng_type),
             sd_prediction_name(sd_ctx_params->prediction),
-             sd_ctx_params->max_vram,
+             SAFE_STR(sd_ctx_params->max_vram),
             BOOL_STR(sd_ctx_params->stream_layers),
+             BOOL_STR(sd_ctx_params->eager_load),
             SAFE_STR(sd_ctx_params->backend),
             SAFE_STR(sd_ctx_params->params_backend),
             BOOL_STR(sd_ctx_params->flash_attn),
@ -2783,6 +2904,7 @@ void sd_img_gen_params_init(sd_img_gen_params_t* sd_img_gen_params) {
    sd_img_gen_params->batch_count       = 1;
    sd_img_gen_params->control_strength  = 0.9f;
    sd_img_gen_params->pm_params         = {nullptr, 0, nullptr, 20.f};
+    sd_img_gen_params->pulid_params      = {nullptr, 1.0f};
    sd_img_gen_params->vae_tiling_params = {false, false, 0, 0, 0.5f, 0.0f, 0.0f, nullptr};
    sd_cache_params_init(&sd_img_gen_params->cache);
    sd_hires_params_init(&sd_img_gen_params->hires);
@ -2925,6 +3047,15 @@ void free_sd_ctx(sd_ctx_t* sd_ctx) {
    free(sd_ctx);
 }

+SD_API void sd_cancel_generation(sd_ctx_t* sd_ctx, enum sd_cancel_mode_t mode) {
+    if (sd_ctx && sd_ctx->sd) {
+        if (mode < SD_CANCEL_ALL || mode > SD_CANCEL_RESET) {
+            mode = SD_CANCEL_ALL;
+        }
+        sd_ctx->sd->set_cancel_flag(mode);
+    }
+}
+
 static sd_audio_t* waveform_to_sd_audio(const StableDiffusionGGML* sd,
                                        const sd::Tensor<float>& waveform) {
    if (sd == nullptr || waveform.empty()) {
@ -3084,6 +3215,7 @@ struct GenerationRequest {
    sd_guidance_params_t guidance            = {};
    sd_guidance_params_t high_noise_guidance = {};
    sd_pm_params_t pm_params                 = {};
+    sd_pulid_params_t pulid_params           = {};
    sd_hires_params_t hires                  = {};
    int frames                               = -1;
    int requested_frames                     = -1;
@ -3109,6 +3241,7 @@ struct GenerationRequest {
        has_ref_images              = sd_img_gen_params->ref_images_count > 0;
        guidance                    = sd_img_gen_params->sample_params.guidance;
        pm_params                   = sd_img_gen_params->pm_params;
+        pulid_params                = sd_img_gen_params->pulid_params;
        hires                       = sd_img_gen_params->hires;
        cache_params                = &sd_img_gen_params->cache;
        resolve(sd_ctx);
@ -4035,6 +4168,7 @@ static std::optional<ImageGenerationEmbeds> prepare_image_generation_embeds(sd_c
    condition_params.ref_images = &latents->ref_images;

    sd_ctx->sd->prepare_generation_extensions(request->pm_params,
+                                              request->pulid_params,
                                              condition_params,
                                              plan->total_steps);
    int64_t prepare_start_ms         = ggml_time_ms();
@ -4109,15 +4243,29 @@ static std::optional<ImageGenerationEmbeds> prepare_image_generation_embeds(sd_c
 static sd_image_t* decode_image_outputs(sd_ctx_t* sd_ctx,
                                        const GenerationRequest& request,
                                        const std::vector<sd::Tensor<float>>& final_latents) {
-    if (final_latents.size() != static_cast<size_t>(request.batch_count)) {
-        LOG_ERROR("expected %d latents, got %zu", request.batch_count, final_latents.size());
+    if (final_latents.empty()) {
+        LOG_ERROR("no latent images to decode");
        return nullptr;
    }
-    LOG_INFO("decoding %zu latents", final_latents.size());
+    if (final_latents.size() > static_cast<size_t>(request.batch_count)) {
+        LOG_ERROR("expected at most %d latents, got %zu", request.batch_count, final_latents.size());
+        return nullptr;
+    }
+    if (final_latents.size() < static_cast<size_t>(request.batch_count)) {
+        LOG_INFO("decoding %zu/%d latents", final_latents.size(), request.batch_count);
+    } else {
+        LOG_INFO("decoding %zu latents", final_latents.size());
+    }
    std::vector<sd::Tensor<float>> decoded_images;
-    int64_t t0 = ggml_time_ms();
+    int64_t t0     = ggml_time_ms();
+    bool cancelled = false;

    for (size_t i = 0; i < final_latents.size(); i++) {
+        if (sd_ctx->sd->get_cancel_flag() == SD_CANCEL_ALL) {
+            LOG_ERROR("cancelling latent decodings");
+            cancelled = true;
+            break;
+        }
        int64_t t1              = ggml_time_ms();
        sd::Tensor<float> image = sd_ctx->sd->decode_first_stage(final_latents[i]);
        if (image.empty()) {
@ -4131,6 +4279,10 @@ static sd_image_t* decode_image_outputs(sd_ctx_t* sd_ctx,

    int64_t t4 = ggml_time_ms();
    LOG_INFO("decode_first_stage completed, taking %.2fs", (t4 - t0) * 1.0f / 1000);
+    if (decoded_images.empty()) {
+        LOG_ERROR(cancelled ? "cancelled before any latent images were decoded" : "no decoded images");
+        return nullptr;
+    }

    sd_image_t* result_images = (sd_image_t*)calloc(request.batch_count, sizeof(sd_image_t));
    if (result_images == nullptr) {
@ -4149,6 +4301,11 @@ static sd::Tensor<float> upscale_hires_latent(sd_ctx_t* sd_ctx,
                                              const sd::Tensor<float>& latent,
                                              const GenerationRequest& request,
                                              UpscalerGGML* upscaler) {
+    if (sd_ctx->sd->get_cancel_flag() == SD_CANCEL_ALL) {
+        LOG_ERROR("cancelling hires latent upscale");
+        return {};
+    }
+
    auto get_hires_latent_target_shape = [&]() {
        std::vector<int64_t> target_shape = latent.shape();
        if (target_shape.size() < 2) {
@ -4221,6 +4378,10 @@ static sd::Tensor<float> upscale_hires_latent(sd_ctx_t* sd_ctx,
                      sd_hires_upscaler_name(request.hires.upscaler));
            return {};
        }
+        if (sd_ctx->sd->get_cancel_flag() == SD_CANCEL_ALL) {
+            LOG_ERROR("cancelling hires image upscale");
+            return {};
+        }

        sd::Tensor<float> upscaled_tensor;
        if (request.hires.upscaler == SD_HIRES_UPSCALER_MODEL) {
@ -4257,6 +4418,10 @@ static sd::Tensor<float> upscale_hires_latent(sd_ctx_t* sd_ctx,
            upscaled_tensor = sd::ops::clamp(upscaled_tensor, 0.0f, 1.0f);
        }

+        if (sd_ctx->sd->get_cancel_flag() == SD_CANCEL_ALL) {
+            LOG_ERROR("cancelling hires latent encode");
+            return {};
+        }
        sd::Tensor<float> upscaled_latent = sd_ctx->sd->encode_first_stage(upscaled_tensor);
        if (upscaled_latent.empty()) {
            LOG_ERROR("encode_first_stage failed after hires %s upscale",
@ -4321,6 +4486,8 @@ SD_API sd_image_t* generate_image(sd_ctx_t* sd_ctx, const sd_img_gen_params_t* s
        return nullptr;
    }

+    sd_ctx->sd->reset_cancel_flag();
+
    int64_t t0                    = ggml_time_ms();
    sd_ctx->sd->vae_tiling_params = sd_img_gen_params->vae_tiling_params;
    GenerationRequest request(sd_ctx, sd_img_gen_params);
@ -4356,6 +4523,18 @@ SD_API sd_image_t* generate_image(sd_ctx_t* sd_ctx, const sd_img_gen_params_t* s
    std::vector<sd::Tensor<float>> final_latents;
    int64_t denoise_start = ggml_time_ms();
    for (int b = 0; b < request.batch_count; b++) {
+        sd_cancel_mode_t cancel = sd_ctx->sd->get_cancel_flag();
+        if (cancel == SD_CANCEL_ALL) {
+            LOG_ERROR("cancelling generation");
+            return nullptr;
+        }
+        if (cancel == SD_CANCEL_NEW_LATENTS) {
+            LOG_INFO("cancelling new latent generation, returning %zu/%d completed latents",
+                     final_latents.size(),
+                     request.batch_count);
+            break;
+        }
+
        int64_t sampling_start = ggml_time_ms();
        int64_t cur_seed       = request.seed + b;
        LOG_INFO("generating image: %i/%i - seed %" PRId64, b + 1, request.batch_count, cur_seed);
@ -4405,19 +4584,31 @@ SD_API sd_image_t* generate_image(sd_ctx_t* sd_ctx, const sd_img_gen_params_t* s
    LOG_INFO("generating %zu latent images completed, taking %.2fs",
             final_latents.size(),
             (denoise_end - denoise_start) * 1.0f / 1000);
+    if (final_latents.empty()) {
+        LOG_ERROR("no latent images generated");
+        return nullptr;
+    }

    if (request.hires.enabled && request.hires.target_width > 0) {
+        if (sd_ctx->sd->get_cancel_flag() == SD_CANCEL_ALL) {
+            LOG_ERROR("cancelling generation before hires fix");
+            return nullptr;
+        }
        LOG_INFO("hires fix: upscaling to %dx%d", request.hires.target_width, request.hires.target_height);

        std::unique_ptr<UpscalerGGML> hires_upscaler;
        if (request.hires.upscaler == SD_HIRES_UPSCALER_MODEL) {
+            if (sd_ctx->sd->get_cancel_flag() == SD_CANCEL_ALL) {
+                LOG_ERROR("cancelling generation before hires model load");
+                return nullptr;
+            }
            LOG_INFO("hires fix: loading model upscaler from '%s'", request.hires.model_path);
            hires_upscaler                    = std::make_unique<UpscalerGGML>(sd_ctx->sd->n_threads,
                                                            false,
                                                            request.hires.upscale_tile_size,
                                                            sd_ctx->sd->backend_spec,
                                                            sd_ctx->sd->params_backend_spec);
-            const size_t max_graph_vram_bytes = sd::ggml_graph_cut::max_vram_gib_to_bytes(sd_ctx->sd->max_vram);
+            const size_t max_graph_vram_bytes = sd_ctx->sd->max_graph_vram_bytes_for_module(SDBackendModule::UPSCALER);
            hires_upscaler->set_max_graph_vram_bytes(max_graph_vram_bytes);
            if (!hires_upscaler->load_from_file(request.hires.model_path,
                                                sd_ctx->sd->n_threads)) {
@ -4444,6 +4635,10 @@ SD_API sd_image_t* generate_image(sd_ctx_t* sd_ctx, const sd_img_gen_params_t* s
        std::vector<sd::Tensor<float>> hires_final_latents;
        int64_t hires_denoise_start = ggml_time_ms();
        for (int b = 0; b < (int)final_latents.size(); b++) {
+            if (sd_ctx->sd->get_cancel_flag() == SD_CANCEL_ALL) {
+                LOG_ERROR("cancelling generation during hires fix");
+                return nullptr;
+            }
            int64_t cur_seed = request.seed + b;
            sd_ctx->sd->rng->manual_seed(cur_seed);
            sd_ctx->sd->sampler_rng->manual_seed(cur_seed);
@ -4874,6 +5069,10 @@ static sd_image_t* decode_video_outputs(sd_ctx_t* sd_ctx,
        LOG_ERROR("no latent video to decode");
        return nullptr;
    }
+    if (sd_ctx->sd->get_cancel_flag() == SD_CANCEL_ALL) {
+        LOG_ERROR("cancelling video decode");
+        return nullptr;
+    }
    sd::Tensor<float> video_latent = final_latent;
    if (sd_version_is_ltxav(sd_ctx->sd->version) &&
        video_latent.shape()[3] > sd_ctx->sd->get_latent_channel()) {
@ -4966,7 +5165,7 @@ static sd::Tensor<float> upscale_ltx_spatial_video_latent(sd_ctx_t* sd_ctx,
        std::make_unique<LTXVUpsampler::LatentUpsamplerRunner>(sd_ctx->sd->backend_for(SDBackendModule::UPSCALER),
                                                               model_loader.get_tensor_storage_map(),
                                                               upsampler_manager);
-    const size_t max_graph_vram_bytes = sd::ggml_graph_cut::max_vram_gib_to_bytes(sd_ctx->sd->max_vram);
+    const size_t max_graph_vram_bytes = sd_ctx->sd->max_graph_vram_bytes_for_module(SDBackendModule::UPSCALER);
    upsampler->set_max_graph_vram_bytes(max_graph_vram_bytes);
    if (upsampler->model == nullptr) {
        LOG_ERROR("init LTX latent upsampler from metadata failed");
@ -5119,6 +5318,9 @@ SD_API bool generate_video(sd_ctx_t* sd_ctx,
    if (audio_out != nullptr) {
        *audio_out = nullptr;
    }
+
+    sd_ctx->sd->reset_cancel_flag();
+
    if (num_frames_out != nullptr) {
        *num_frames_out = 0;
    }
@ -5180,6 +5382,10 @@ SD_API bool generate_video(sd_ctx_t* sd_ctx,
    sd::Tensor<float> noise = sd::Tensor<float>::randn_like(x_t, sd_ctx->sd->rng);

    if (plan.high_noise_sample_steps > 0) {
+        if (sd_ctx->sd->get_cancel_flag() == SD_CANCEL_ALL) {
+            LOG_ERROR("cancelling generation before high-noise sampling");
+            return false;
+        }
        LOG_DEBUG("sample(high noise) %dx%dx%d", W, H, T);

        int64_t sampling_start = ggml_time_ms();
@ -5222,6 +5428,10 @@ SD_API bool generate_video(sd_ctx_t* sd_ctx,
        LOG_INFO("sampling(high noise) completed, taking %.2fs", (sampling_end - sampling_start) * 1.0f / 1000);
    }

+    if (sd_ctx->sd->get_cancel_flag() == SD_CANCEL_ALL) {
+        LOG_ERROR("cancelling generation before sampling");
+        return false;
+    }
    LOG_DEBUG("sample %dx%dx%d", W, H, T);
    int64_t sampling_start         = ggml_time_ms();
    sd::Tensor<float> final_latent = sd_ctx->sd->sample(sd_ctx->sd->diffusion_model,
@ -5258,6 +5468,10 @@ SD_API bool generate_video(sd_ctx_t* sd_ctx,
    LOG_INFO("sampling completed, taking %.2fs", (sampling_end - sampling_start) * 1.0f / 1000);

    if (latent_upscale_enabled) {
+        if (sd_ctx->sd->get_cancel_flag() == SD_CANCEL_ALL) {
+            LOG_ERROR("cancelling generation before latent upscale");
+            return false;
+        }
        int64_t upscale_start             = ggml_time_ms();
        sd::Tensor<float> upscaled_latent = upscale_ltx_spatial_video_latent(sd_ctx,
                                                                             request.hires.model_path,
@ -5317,6 +5531,10 @@ SD_API bool generate_video(sd_ctx_t* sd_ctx,
        }
        sd::Tensor<float> hires_denoise_mask;
        sd::Tensor<float> hires_video_positions;
+        if (sd_ctx->sd->get_cancel_flag() == SD_CANCEL_ALL) {
+            LOG_ERROR("cancelling generation before latent upscale refine");
+            return false;
+        }
        if (!apply_ltxv_refine_image_conditioning(sd_ctx,
                                                  sd_vid_gen_params,
                                                  hires_request,
@ -5396,6 +5614,10 @@ SD_API bool generate_video(sd_ctx_t* sd_ctx,
    if (sd_version_is_ltxav(sd_ctx->sd->version) &&
        latents.audio_length > 0 &&
        sd_ctx->sd->audio_vae_model != nullptr) {
+        if (sd_ctx->sd->get_cancel_flag() == SD_CANCEL_ALL) {
+            LOG_ERROR("cancelling generation before audio decode");
+            return false;
+        }
        int64_t audio_latent_decode_start = ggml_time_ms();

        auto audio_latent = unpack_ltxav_audio_latent(final_latent,
@ -5428,6 +5650,11 @@ SD_API bool generate_video(sd_ctx_t* sd_ctx,
        final_latent = sd::ops::slice(final_latent, 2, latents.ref_image_num, final_latent.shape()[2]);
    }

+    if (sd_ctx->sd->get_cancel_flag() == SD_CANCEL_ALL) {
+        LOG_ERROR("cancelling generation before video decode");
+        free_sd_audio(generated_audio);
+        return false;
+    }
    auto result = decode_video_outputs(sd_ctx, latent_upscale_enabled ? hires_request : request, final_latent, num_frames_out);
    if (result == nullptr) {
        free_sd_audio(generated_audio);
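
The cancellation checks added in the hunks above follow one pattern: the flag is reset when a generation starts, polled before each expensive stage, and interpreted in two ways. `SD_CANCEL_ALL` abandons the run and returns nothing, while `SD_CANCEL_NEW_LATENTS` stops starting new batch items but keeps the latents already finished. The sketch below is a minimal, self-contained illustration of that control flow, not the project's implementation. `CancelMode`, `Generator`, `request_cancel`, and `run` are hypothetical names, the use of `std::atomic` is an assumption (the diff does not show how the flag is stored), and plain integers stand in for latents.

```cpp
#include <atomic>
#include <cassert>
#include <vector>

// Hypothetical stand-ins for the sd_cancel_mode_t values used in the diff;
// the real enum's numeric values are not shown there.
enum class CancelMode { None, All, NewLatents };

struct Generator {
    // Assumption: an atomic flag, so another thread could set it safely.
    std::atomic<CancelMode> cancel_flag{CancelMode::None};

    void request_cancel(CancelMode mode) { cancel_flag.store(mode); }

    // Produces up to batch_count "latents" (plain ints in this sketch).
    // cancel_before simulates a cancel request arriving just before that
    // batch index starts; pass -1 for no cancellation.
    std::vector<int> run(int batch_count, int cancel_before, CancelMode mode) {
        assert(batch_count >= 0);
        cancel_flag.store(CancelMode::None);  // reset at the start of every run
        std::vector<int> latents;
        for (int b = 0; b < batch_count; b++) {
            if (b == cancel_before) {
                request_cancel(mode);
            }
            CancelMode cancel = cancel_flag.load();
            if (cancel == CancelMode::All) {
                return {};  // discard everything, including finished items
            }
            if (cancel == CancelMode::NewLatents) {
                break;  // keep finished items, start no new ones
            }
            latents.push_back(b);
        }
        return latents;
    }
};
```

With four batch items and a cancel request arriving before index 2, `run(4, 2, CancelMode::NewLatents)` returns the two finished items, whereas `run(4, 2, CancelMode::All)` returns an empty vector. This mirrors why `decode_image_outputs` in the diff now accepts fewer latents than `batch_count` instead of treating a short batch as an error.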
--- a/src/tokenizers/bpe_tokenizer.cpp
+++ b/src/tokenizers/bpe_tokenizer.cpp
@ -134,7 +134,8 @@ std::vector<int> BPETokenizer::encode(const std::string& text, on_new_token_cb_t
    std::vector<int32_t> bpe_tokens;
    std::vector<std::string> token_strs;

-    auto splited_texts = split_with_special_tokens(text, special_tokens);
+    std::string normalized_text = normalize_before_split ? normalize(text) : text;
+    auto splited_texts          = split_with_special_tokens(normalized_text, special_tokens);

    for (auto& splited_text : splited_texts) {
        if (is_special_token(splited_text)) {
@ -159,7 +160,7 @@ std::vector<int> BPETokenizer::encode(const std::string& text, on_new_token_cb_t
                }
            }

-            std::string token_str = normalize(token);
+            std::string token_str = normalize_before_split ? token : normalize(token);
            std::u32string utf32_token;
            if (byte_level_bpe) {
                for (int i = 0; i < token_str.length(); i++) {
--- a/src/tokenizers/clip_tokenizer.cpp
+++ b/src/tokenizers/clip_tokenizer.cpp
@ -22,9 +22,10 @@ CLIPTokenizer::CLIPTokenizer(int pad_token_id, const std::string& merges_utf8_st
    EOS_TOKEN_ID = 49407;
    PAD_TOKEN_ID = pad_token_id;

-    end_of_word_suffix = "</w>";
-    add_bos_token      = true;
-    add_eos_token      = true;
+    end_of_word_suffix     = "</w>";
+    add_bos_token          = true;
+    add_eos_token          = true;
+    normalize_before_split = true;

    if (merges_utf8_str.size() > 0) {
        load_from_merges(merges_utf8_str);
--- a/src/tokenizers/tokenizer.h
+++ b/src/tokenizers/tokenizer.h
@ -12,9 +12,10 @@ using on_new_token_cb_t = std::function<bool(std::string&, std::vector<int32_t>&
 class Tokenizer {
 protected:
    std::vector<std::string> special_tokens;
-    bool add_bos_token = false;
-    bool add_eos_token = false;
-    bool pad_left      = false;
+    bool add_bos_token          = false;
+    bool add_eos_token          = false;
+    bool pad_left               = false;
+    bool normalize_before_split = false;
    std::string end_of_word_suffix;

    virtual std::string decode_token(int token_id) const = 0;
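
The tokenizer hunks above change the order of two steps for CLIP: with `normalize_before_split` set, the whole prompt is normalized first and then split on special tokens, and the resulting pieces are not normalized a second time. The sketch below illustrates one effect this ordering can have. It is a simplified model under stated assumptions: `normalize` here is a hypothetical ASCII-lowercasing helper (what the real `normalize` does is not shown in the diff), and `split_with_special` and `pre_tokenize` are illustrative functions handling a single special token, not the project's `split_with_special_tokens`.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical normalizer: ASCII lowercasing only.
std::string normalize(const std::string& text) {
    std::string out = text;
    std::transform(out.begin(), out.end(), out.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return out;
}

// Splits text into pieces, keeping each exact occurrence of `special`
// as its own piece. An empty `special` yields the text unchanged.
std::vector<std::string> split_with_special(const std::string& text, const std::string& special) {
    if (special.empty()) {
        return {text};
    }
    std::vector<std::string> pieces;
    size_t pos = 0;
    while (true) {
        size_t hit = text.find(special, pos);
        if (hit == std::string::npos) {
            break;
        }
        if (hit > pos) {
            pieces.push_back(text.substr(pos, hit - pos));
        }
        pieces.push_back(special);
        pos = hit + special.size();
    }
    if (pos < text.size()) {
        pieces.push_back(text.substr(pos));
    }
    return pieces;
}

// Mirrors the ordering in the diff: normalize either before the split
// (whole text) or after it (each non-special piece), never both.
std::vector<std::string> pre_tokenize(const std::string& text,
                                      const std::string& special,
                                      bool normalize_before_split) {
    std::vector<std::string> pieces =
        split_with_special(normalize_before_split ? normalize(text) : text, special);
    if (!normalize_before_split) {
        for (auto& piece : pieces) {
            if (piece != special) {
                piece = normalize(piece);
            }
        }
    }
    return pieces;
}
```

In this model, a prompt containing a differently cased special token is split into three pieces when normalization runs first, but stays a single piece when normalization runs after the split, because the exact-match search never sees the lowercased form.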
Author	SHA1	Message	Date
leejet	f440ad9c29	fix: avoid writable mmap for read-only weights (#1698)	2026-06-23 00:39:31 +08:00
stduhpf	41f7acbfb0	feat: support guidance_schedule (#1684)	2026-06-23 00:05:55 +08:00
leejet	b395a6972d	refactor: add Flux VAE version helper (#1696)	2026-06-22 22:39:42 +08:00
Alex Klinkhamer	854bebfe02	feat: add --prompt-file and --negative-prompt-file flags (#1693)	2026-06-22 22:16:54 +08:00
fszontagh	787d229d84	perf: --eager-load to pre-load params at model-load time (#1687)	2026-06-22 22:10:09 +08:00
leejet	b12098f5d0	feat: add boogu image support (#1688)	2026-06-22 00:36:17 +08:00
stduhpf	2bd249c971	feat: concatenate repeated cli arg strings (#1686)	2026-06-22 00:24:13 +08:00
Daniele	e9e952462f	fix: workaround for Ernie with Vulkan and Flash Attention (#1680)	2026-06-22 00:21:38 +08:00
Wagner Bruna	e8e012eef2	fix: workaround for Anima with Vulkan and Flash Attention (#1678)	2026-06-22 00:20:00 +08:00
leejet	7f0e728b7d	fix: normalize CLIP prompts before special-token splitting (#1670)	2026-06-17 00:33:00 +08:00
leejet	92a3b73cdb	sync: update sdcpp-webui (#1668)	2026-06-16 23:55:03 +08:00
Wagner Bruna	710bc91c8f	fix: correct conversion from sd_type_t to ggml_type (#1519)	2026-06-16 23:54:42 +08:00
Wagner Bruna	5a34bc7f6e	feat: support for cancelling generations (#1124); feat: support for canceling the ongoing generation; return partial image batches on cancel. Co-authored-by: leejet <leejet714@gmail.com>	2026-06-16 00:36:38 +08:00
leejet	146b6cc49e	fix: simplify PuLID ID extraction setup (#1664)	2026-06-15 23:55:38 +08:00
RapidMark	93527fda74	feat: add PuLID-Flux identity-injection support (#1595)	2026-06-15 23:33:50 +08:00
leejet	6e66a1a4a4	fix: allow oversized Vulkan parameter tensors (#1662)	2026-06-15 23:18:52 +08:00
leejet	bb90bfa00f	feat: support backend-specific max-vram budgets	2026-06-14 22:46:32 +08:00
leejet	517abc777d	sync: update ggml (#1656 )	2026-06-14 20:45:05 +08:00
leejet	6f00939f75	docs: refresh README guide links	2026-06-14 17:58:58 +08:00