fix: avoid writable mmap for read-only weights (#1698)

feat: support guidance_schedule (#1684)
refactor: add Flux VAE version helper (#1696)
2026-06-23 22:56:42 +00:00 · 2026-06-23 00:39:31 +08:00 · 2026-06-23 00:05:55 +08:00 · 2026-06-22 22:39:42 +08:00 · 2026-06-22 22:16:54 +08:00 · 2026-06-22 22:10:09 +08:00
53 changed files with 2900 additions and 741 deletions
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@ -204,6 +204,12 @@ if(SD_WEBM)
    endif()
 endif()

+if (SD_RPC)
+    message("-- Use RPC as backend stable-diffusion")
+    set(GGML_RPC ON)
+    add_definitions(-DSD_USE_RPC)
+endif ()
+
 set(SD_LIB stable-diffusion)

 file(GLOB SD_LIB_SOURCES CONFIGURE_DEPENDS
--- a/README.md
+++ b/README.md
@ -34,8 +34,8 @@ API and command-line option may change frequently.***
 - Super lightweight and without external dependencies
 - Supported models
  - Image Models
-    - SD1.x, SD2.x, [SD-Turbo](https://huggingface.co/stabilityai/sd-turbo)
-    - SDXL, [SDXL-Turbo](https://huggingface.co/stabilityai/sdxl-turbo)
+    - [SD1.x, SD2.x, SD-Turbo](./docs/sd.md)
+    - [SDXL, SDXL-Turbo](./docs/sd.md)
    - [Some SD1.x and SDXL distilled models](./docs/distilled_sd.md)
    - [SD3/SD3.5](./docs/sd3.md)
    - [FLUX.1-dev/FLUX.1-schnell](./docs/flux.md)
@ -50,21 +50,23 @@ API and command-line option may change frequently.***
    - [Ovis-Image](./docs/ovis_image.md)
    - [Anima](./docs/anima.md)
    - [ERNIE-Image](./docs/ernie_image.md)
+    - [Boogu Image](./docs/boogu_image.md)
    - [HiDream-O1-Image](./docs/hidream_o1_image.md)
    - [Ideogram4](./docs/ideogram4.md)
  - Image Edit Models
    - [FLUX.1-Kontext-dev](./docs/kontext.md)
    - [Qwen Image Edit series](./docs/qwen_image_edit.md)
    - [LongCat Image Edit](./docs/longcat_image.md)
+    - [Boogu Image Edit](./docs/boogu_image.md)
  - Video Models
    - [Wan2.1/Wan2.2](./docs/wan.md)
    - [LTX-2.3](./docs/ltx2.md)
-  - [PhotoMaker](https://github.com/TencentARC/PhotoMaker) support.
+  - [PhotoMaker](./docs/photo_maker.md) support.
  - Control Net support with SD 1.5
  - LoRA support, same as [stable-diffusion-webui](https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Features#lora)
  - Latent Consistency Models support (LCM/LCM-LoRA)
-  - Faster and memory efficient latent decoding with [TAESD](https://github.com/madebyollin/taesd)
-  - Upscale images generated with [ESRGAN](https://github.com/xinntao/Real-ESRGAN)
+  - Faster and memory efficient latent decoding with [TAESD](./docs/taesd.md)
+  - Upscale images generated with [ESRGAN](./docs/esrgan.md)
 - Supported backends
  - CPU (AVX, AVX2 and AVX512 support for x86 architectures)
  - CUDA
@ -133,28 +135,9 @@ For runtime and parameter backend placement, see the [backend selection guide](.
 ## More Guides

 - [Backend selection](./docs/backend.md)
- [SD1.x/SD2.x/SDXL](./docs/sd.md)
- [SD3/SD3.5](./docs/sd3.md)
- [FLUX.1-dev/FLUX.1-schnell](./docs/flux.md)
- [FLUX.2-dev/FLUX.2-klein](./docs/flux2.md)
- [FLUX.1-Kontext-dev](./docs/kontext.md)
- [Chroma](./docs/chroma.md)
- [🔥Qwen Image](./docs/qwen_image.md)
- [🔥Qwen Image Edit series](./docs/qwen_image_edit.md)
- [🔥Wan2.1/Wan2.2](./docs/wan.md)
- [🔥LTX-2.3](./docs/ltx2.md)
- [🔥Z-Image](./docs/z_image.md)
- [Ovis-Image](./docs/ovis_image.md)
- [Anima](./docs/anima.md)
- [ERNIE-Image](./docs/ernie_image.md)
- [HiDream-O1-Image](./docs/hidream_o1_image.md)
- [Lens](./docs/lens.md)
- [LongCat Image / LongCat Image Edit](./docs/longcat_image.md)
+- [RPC](./docs/rpc.md)
 - [LoRA](./docs/lora.md)
 - [LCM/LCM-LoRA](./docs/lcm.md)
- [Using PhotoMaker to personalize image generation](./docs/photo_maker.md)
- [Using ESRGAN to upscale results](./docs/esrgan.md)
- [Using TAESD to faster decoding](./docs/taesd.md)
 - [Docker](./docs/docker.md)
 - [Quantization and GGUF](./docs/quantization_and_gguf.md)
 - [Inference acceleration via caching](./docs/caching.md)
--- a/assets/boogu/edit_example.png
+++ b/assets/boogu/edit_example.png
--- a/assets/boogu/example.png
+++ b/assets/boogu/example.png
--- a/docs/backend.md
+++ b/docs/backend.md
@ -35,6 +35,14 @@ sd-cli -m model.safetensors -p "a cat" --backend cuda0 --params-backend te=cpu,v
 sd-cli -m model.safetensors -p "a cat" --backend cuda0 --params-backend disk
 ```

+`--max-vram` can target resolved backend/device names:
+
+```shell
+sd-cli -m model.safetensors -p "a cat" --backend diffusion=cuda0,vae=vulkan0 --max-vram cuda0=6,vulkan0=2
+```
+
+The budget applies to every module running on that backend.
+
 Module names are case-insensitive. Hyphens and underscores in module names are ignored, so `clip_vision`, `clip-vision`, and `clipvision` are equivalent.

 `all=`, `default=`, and `*=` can be used to set the default backend inside a mixed assignment:
@ -124,16 +132,16 @@ Runtime and parameter assignments also share the same backend cache. If `--backe

 ## Compatibility flags

-The older CPU placement flags are still supported:
+The example CLI/server still accepts these older CPU placement flags as compatibility aliases:

 - `--clip-on-cpu`
 - `--vae-on-cpu`
 - `--control-net-cpu`
 - `--offload-to-cpu`

-`--clip-on-cpu`, `--vae-on-cpu`, and `--control-net-cpu` affect runtime backend assignment only when `--backend` is not set. They map to `te=cpu`, `vae=cpu`, and `controlnet=cpu`.
+`--clip-on-cpu`, `--vae-on-cpu`, and `--control-net-cpu` are deprecated. The example argument layer prepends `te=cpu`, `vae=cpu`, and `controlnet=cpu` to `--backend` before creating the context.

-`--offload-to-cpu` prepends a CPU default to the parameter assignment before parsing:
+`--offload-to-cpu` prepends a CPU default to the parameter assignment in the caller before creating the context:

 ```shell
 --params-backend '*=cpu'
@ -141,4 +149,4 @@ The older CPU placement flags are still supported:

 Because this default is inserted first, later explicit `--params-backend` entries can still override it, for example `--offload-to-cpu --params-backend te=disk` keeps non-TE parameters on CPU and reloads TE parameters from disk.

-Explicit `--backend` and `--params-backend` assignments are preferred for new commands.
+Library callers should set `backend` and `params_backend` directly. The old CPU/offload fields are no longer part of the C API. Explicit `--backend` and `--params-backend` assignments are preferred for new commands.
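
A minimal Python sketch of the assignment rules described in the docs/backend.md changes above: entries are read left to right, later entries override earlier ones, `all`, `default`, and `*` set the default, and module names are compared case-insensitively with hyphens and underscores ignored. The names `normalize`, `parse_assignment`, and `backend_for` are hypothetical, and treating a bare value such as `cpu` as a default is an assumption drawn from the `--params-backend cpu` examples. This illustrates the documented behavior and is not the project's actual parser.

```python
DEFAULT_KEYS = {"all", "default", "*"}


def normalize(name: str) -> str:
    """Lowercase a module name and drop hyphens and underscores."""
    return name.lower().replace("-", "").replace("_", "")


def parse_assignment(spec: str) -> dict:
    """Parse 'te=cpu,vae=cuda0' into {module: backend}; later entries win."""
    table = {}
    for entry in spec.split(","):
        entry = entry.strip()
        if not entry:
            continue
        if "=" in entry:
            key, backend = entry.split("=", 1)
            key = key.strip()
        else:
            # Assumption: a bare backend name sets the default.
            key, backend = "*", entry
        key = "*" if key.lower() in DEFAULT_KEYS else normalize(key)
        table[key] = backend.strip()
    return table


def backend_for(table: dict, module: str, fallback=None):
    """Look up a module's backend, falling back to the default entry."""
    return table.get(normalize(module), table.get("*", fallback))


# --offload-to-cpu prepends '*=cpu'; the explicit entry comes after it.
params = parse_assignment("*=cpu" + "," + "te=disk")
print(backend_for(params, "TE"))   # explicit entry overrides the default
print(backend_for(params, "vae"))  # falls back to the prepended default
```

Because the prepended default is just the first entry in the list, no special override logic is needed: the ordinary "last entry wins" rule is enough.
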
--- a/docs/boogu_image.md
+++ b/docs/boogu_image.md
@ -0,0 +1,31 @@
+# How to Use
+
+Boogu Image uses a Boogu diffusion transformer, the FLUX VAE, and Qwen3-VL as the LLM text and vision encoder.
+
+## Download weights
+
+- Download Boogu Image
+    - safetensors: https://huggingface.co/Comfy-Org/Boogu-Image/tree/main/diffusion_models
+- Download vae
+    - safetensors: https://huggingface.co/black-forest-labs/FLUX.1-dev/blob/main/ae.safetensors
+- Download Qwen3-VL 8B
+    - gguf: https://huggingface.co/unsloth/Qwen3-VL-8B-Instruct-GGUF/tree/main
+        - For image editing with GGUF text encoders, also download the matching mmproj file and pass it with `--llm_vision`.
+
+## Examples
+
+### Boogu Image Base
+
+```
+.\bin\Release\sd-cli.exe --diffusion-model ..\..\ComfyUI\models\diffusion_models\boogu_image_base_bf16.safetensors --llm ..\..\llm\Qwen3VL-8B-Instruct-Q4_K_M.gguf --vae ..\..\ComfyUI\models\vae\ae.sft -p "a lovely cat" --diffusion-fa -v --offload-to-cpu
+```
+
+<img width="256" alt="Boogu Image Base example" src="../assets/boogu/example.png" />
+
+### Boogu Image Edit
+
+```
+.\bin\Release\sd-cli.exe --diffusion-model ..\..\ComfyUI\models\diffusion_models\boogu_image_edit_bf16.safetensors --llm ..\..\llm\Qwen3VL-8B-Instruct-Q4_K_M.gguf --llm_vision ..\..\llm\mmproj-Qwen3VL-8B-Instruct-F16.gguf --vae ..\..\ComfyUI\models\vae\ae.sft --diffusion-fa -v --offload-to-cpu -r ..\assets\flux\flux1-dev-q8_0.png -p "change 'flux.cpp' to 'boogu.cpp'"
+```
+
+<img width="256" alt="Boogu Image Edit example" src="../assets/boogu/edit_example.png" />
--- a/docs/performance.md
+++ b/docs/performance.md
@ -31,7 +31,7 @@ Use CPU params to reduce VRAM usage:
 --backend cuda0 --params-backend cpu
 ```

-This keeps model weights in system RAM and moves them to the runtime backend when needed. `--offload-to-cpu` is a compatibility shortcut that prepends `*=cpu` to `--params-backend`, so explicit module assignments can still override it:
+This keeps model weights in system RAM and moves them to the runtime backend when needed. In the example CLI/server, `--offload-to-cpu` is a compatibility shortcut that prepends `*=cpu` to `--params-backend` before creating the context, so explicit module assignments can still override it:

 ```shell
 --offload-to-cpu --params-backend te=disk
--- a/docs/pulid.md
+++ b/docs/pulid.md
@ -0,0 +1,196 @@
+# PuLID-Flux face-identity preservation
+
+stable-diffusion.cpp supports the [PuLID-Flux](https://github.com/ToTheBeginning/PuLID)
+identity-injection technique on top of Flux.1 (schnell or dev) models.
+Given a single source portrait, PuLID-Flux produces new generations that
+preserve the source person's face across arbitrary scenes, poses, and
+prompts.
+
+Unlike PhotoMaker (which extracts the identity inside the inference
+process from a directory of images), PuLID-Flux's identity extractor is
+a heavy stack (insightface ArcFace + EVA-CLIP-L + IDFormer encoder) that
+is impractical to port to C++/ggml. To keep this implementation small and
+cross-vendor, **stable-diffusion.cpp consumes a precomputed identity
+embedding** produced by an external Python tool that runs once per source
+portrait. Everything downstream of that one-shot extraction is C++ and
+runs on any backend (Vulkan, CUDA, Metal, ROCm, CPU).
+
+## Architecture summary
+
+The PuLID-Flux contribution to the Flux denoise loop is a stack of 20
+small cross-attention modules (`PerceiverAttentionCA`) inserted between
+the Flux transformer blocks:
+
+- After every 2nd of the 19 double-stream blocks (10 hook points)
+- After every 4th of the 38 single-stream blocks (10 hook points)
+
+Each cross-attention layer takes the current image tokens as query, the
+32-token / 2048-dim identity embedding as key+value, and adds its output
+(scaled by `id_weight`, typically 1.0) back to the image tokens.
+
+## Required weights
+
+Three files in addition to the standard Flux weight set:
+
+1. **Flux base** (transformer + VAE + clip_l + t5xxl) -- exactly as
+   [docs/flux.md](flux.md) describes.
+2. **PuLID weights** -- download from
+   [guozinan/PuLID](https://huggingface.co/guozinan/PuLID):
+   - `pulid_flux_v0.9.0.safetensors` or `pulid_flux_v0.9.1.safetensors`
+     (recommended; this implementation is verified against v0.9.1)
+   - **v1.1 (`pulid_v1.1.safetensors`) is NOT yet supported** -- it uses
+     renamed keys (`id_adapter_attn_layers.*` instead of `pulid_ca.*`)
+     and possibly different module structure. Future PR.
+3. **Identity embedding (.pulidembd)** -- produced by the precompute
+   tool below.
+
+## Precompute the identity embedding
+
+The precompute tool runs the PyTorch identity-extraction stack on a
+single portrait image and writes the resulting `(32, 2048)` embedding
+to a `.pulidembd` binary file (about 131 KB). Run it once per source
+person; the same file is reused for any number of generations.
+
+A reference Python script is provided alongside this docs file at
+[`script/pulid_extract_id.py`](../script/pulid_extract_id.py). It
+requires:
+- A working CUDA / CPU PyTorch stack
+- `insightface`, `facexlib`, `eva-clip`, `torchvision`, `opencv-python`,
+  `huggingface_hub`, `gguf`
+- The PuLID weights file (same one stable-diffusion.cpp will load below)
+- The ToTheBeginning/PuLID repo's `pulid/` package (including
+  `pulid/pipeline_flux.py`) and `eva_clip/` package on `PYTHONPATH`; `flux/`
+  is not needed for embedding extraction
+
+Run it as:
+
+```
+python pulid_extract_id.py \
+  --portrait /path/to/source-photo.jpg \
+  --pulid-weights /path/to/pulid_flux_v0.9.1.safetensors \
+  --out /path/to/source.pulidembd
+```
+
+## Format (gguf)
+
+The embedding is a standard **gguf** container holding a single tensor:
+
+```
+tensor name : "pulid_id"
+shape       : [token_dim, num_tokens]   (ggml order; typically [2048, 32])
+type        : F16 (also accepts F32 / BF16)
+metadata    : general.architecture = "pulid", pulid.version = 1
+```
+
+stable-diffusion.cpp loads it with the normal gguf reader
+(`gguf_init_from_file`) and converts to fp32 at load time -- no bespoke
+parser. Total file size for the typical (32, 2048, fp16) case is ~131 KB.
+
+## Command-line usage
+
+```
+.\bin\Release\sd-cli.exe \
+  --diffusion-model     models\flux1-schnell-Q4_K_S.gguf \
+  --vae                 models\ae.safetensors \
+  --clip_l              models\clip_l.safetensors \
+  --t5xxl               models\t5xxl_fp16.safetensors \
+  --pulid-weights       models\pulid_flux_v0.9.1.safetensors \
+  --pulid-id-embedding  source.pulidembd \
+  --pulid-id-weight     1.0 \
+  -p "candid photograph of a young woman on a beach at sunset" \
+  --cfg-scale 1.0 --sampling-method euler --steps 4 -W 512 -H 512 \
+  --seed 42 --clip-on-cpu \
+  -o out.png
+```
+
+For Flux Dev (instead of Schnell), add `--guidance 3.5` and `--steps 20`.
+
+## Flags
+
+| Flag                       | Purpose                                                           |
+|----------------------------|-------------------------------------------------------------------|
+| `--pulid-weights <path>`   | Path to `pulid_flux_v0.9.x.safetensors`. Loaded with the model.   |
+| `--pulid-id-embedding <p>` | Path to a `.pulidembd` binary produced by the precompute tool.    |
+| `--pulid-id-weight <f>`    | Identity-injection strength. Typical 0.7-1.2; default 1.0.        |
+
+All three flags must be set together to activate PuLID. Setting only
+`--pulid-weights` (no embedding) loads the weights but disables injection
+at runtime. Setting `--pulid-id-weight 0` zeros out the contribution
+(useful for falsification testing: outputs should be byte-identical to
+a no-PuLID run with the same seed).
+
+## Memory budget
+
+At 512x512, 4 steps (Schnell), the 20 cross-attention layers add roughly
+10% to denoise time and almost nothing to peak VRAM. Tested on a 12 GB
+consumer card alongside Flux Schnell Q4 GGUF + CPU-offloaded clip_l and
+t5xxl + GPU-resident VAE.
+
+At 1024x1024 with Flux Dev Q4 + 20 steps + PuLID, the VAE decode compute
+buffer doesn't fit on a 12 GB card even with `--vae-on-cpu`. Workaround:
+explicitly route VAE to the CPU backend instead of the offload flag:
+
+```
+--backend "diffusion=vulkan0,vae=cpu"
+```
+
+The `--vae-on-cpu` flag offloads VAE weights but leaves the compute graph
+on the default backend; this is existing stable-diffusion.cpp behavior,
+not a PuLID-specific issue. Documented here because anyone running PuLID
+at 1024 will hit it.
+
+## Backend selection
+
+The standard `--backend` flag works as documented. Common patterns:
+
+```
+# AMD Vulkan
+--backend "diffusion=vulkan0,vae=cpu"
+
+# NVIDIA Vulkan
+--backend "diffusion=vulkan1,vae=cpu"
+
+# CUDA
+--backend "diffusion=cuda0,vae=cpu"
+```
+
+The PuLID cross-attention layers run on the same backend as the main
+diffusion model. They have not yet been independently profiled on every
+backend; only Vulkan and CPU have been tested by the original contributor.
+
+## Verification
+
+A three-way SHA-256 check is the recommended sanity test when bringing up
+a new combination of model + backend + hardware:
+
+| Run                                          | Expected hash relation             |
+|----------------------------------------------|------------------------------------|
+| A: no `--pulid-*` flags                      | baseline                           |
+| B: PuLID flags, `--pulid-id-weight 0.0`      | **byte-identical to A**            |
+| C: PuLID flags, `--pulid-id-weight 1.0`      | **different from A,B**, preserves source identity |
+
+If A and C differ but A and B differ too, the injection is allocating
+or computing something even at zero weight -- likely a bug.
+
+## Limitations / not yet supported
+
+- **`--skip-layers` (skip-layer-guidance / SLG) combined with PuLID** is not
+  supported. The `pulid_ca` index advances per non-skipped block, so a
+  skipped block silently misaligns the cross-attention weight assignment
+  vs. the trained intervals. The reference PyTorch implementation does
+  not have SLG either, so there is no well-defined behavior to emulate.
+  Use either feature alone.
+- **PuLID v1.1 weights** (`pulid_v1.1.safetensors`, renamed key layout).
+- **Multiple ID images.** The reference PyTorch implementation can fuse
+  several portraits into one embedding for stronger identity. This
+  implementation accepts a single embedding produced from one or more
+  images by the external precompute tool.
+- **Negative-prompt branch of CFG.** PuLID only injects on the positive
+  conditioning path in the published reference, and the implementation
+  here follows that. Flux's distilled guidance doesn't run a separate
+  uncond branch in normal use, so this matters only for `--true-cfg`
+  workflows that aren't standard for Flux.
+- **Backends other than Vulkan and CPU** are untested by the original
+  contributor. The implementation is pure-ggml and should work on CUDA,
+  ROCm, and Metal, but verification by users on those backends is
+  welcomed.
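
The counts quoted in the docs/pulid.md changes above (10 + 10 hook points, an embedding of about 131 KB) can be checked with a few lines of Python. This sketch assumes a hook fires on block indices divisible by the interval (0, 2, 4, ... and 0, 4, 8, ...), which reproduces the documented counts; the exact index rule and all names below are illustrative and not taken from the implementation.

```python
DOUBLE_BLOCKS, DOUBLE_INTERVAL = 19, 2   # double-stream blocks, hook every 2nd
SINGLE_BLOCKS, SINGLE_INTERVAL = 38, 4   # single-stream blocks, hook every 4th


def hook_indices(num_blocks: int, interval: int) -> list:
    """Block indices that are followed by a cross-attention hook (assumed rule)."""
    return [i for i in range(num_blocks) if i % interval == 0]


double_hooks = hook_indices(DOUBLE_BLOCKS, DOUBLE_INTERVAL)
single_hooks = hook_indices(SINGLE_BLOCKS, SINGLE_INTERVAL)
total_hooks = len(double_hooks) + len(single_hooks)

# Tensor payload of the identity embedding: 32 tokens x 2048 dims x 2 bytes (F16).
NUM_TOKENS, TOKEN_DIM, F16_BYTES = 32, 2048, 2
payload_bytes = NUM_TOKENS * TOKEN_DIM * F16_BYTES

print(len(double_hooks), len(single_hooks), total_hooks)
print(payload_bytes)
```

The payload figure covers only the tensor data; the gguf header and metadata add a small amount on top.
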
--- a/docs/rpc.md
+++ b/docs/rpc.md
@ -0,0 +1,220 @@
+# Building and Using the RPC Server with `stable-diffusion.cpp`
+
+This guide covers how to build a version of [the RPC server from `llama.cpp`](https://github.com/ggml-org/llama.cpp/blob/master/tools/rpc/README.md) that is compatible with your version of `stable-diffusion.cpp`, for managing multi-backend setups. RPC allows you to offload specific model components to a remote server.
+
+> **Note on Model Location:** The model files (e.g., `.safetensors` or `.gguf`) remain on the **Client** machine. The client parses the file and transmits the necessary tensor data and computational graphs to the server. The server does not need to store the model files locally.
+
+## 1. Building `stable-diffusion.cpp` with RPC client
+
+First, build the client application from source. Set `SD_RPC=ON` to include the RPC backend in your client.
+
+```bash
+mkdir build
+cd build
+cmake .. \
+    -DSD_RPC=ON \
+    # Add other build flags here (e.g., -DSD_VULKAN=ON)
+cmake --build . --config Release -j $(nproc)
+```
+
+> **Note:** Ensure you add the other flags you would normally use (e.g., `-DSD_VULKAN=ON`, `-DSD_CUDA=ON`, `-DSD_HIPBLAS=ON`, or `-DGGML_METAL=ON`). For more information about building `stable-diffusion.cpp` from source, refer to the [build.md](build.md) documentation.
+
+## 2. Ensure `llama.cpp` is at the correct commit
+
+`stable-diffusion.cpp`'s RPC client is designed to work with a specific version of `llama.cpp` (compatible with the `ggml` submodule) to ensure API compatibility. The commit hash for `llama.cpp` is stored in `ggml/scripts/sync-llama.last`.
+
+> **Start from Root:** Perform these steps from the root of your `stable-diffusion.cpp` directory.
+
+1.  Read the target commit hash from the submodule tracker:
+
+    ```bash
+    # Linux / WSL / macOS
+    HASH=$(cat ggml/scripts/sync-llama.last)
+
+    # Windows (PowerShell)
+    $HASH = Get-Content -Path "ggml\scripts\sync-llama.last"
+    ```
+
+2.  Clone `llama.cpp` at the target commit:
+    ```bash
+    git clone https://github.com/ggml-org/llama.cpp.git
+    cd llama.cpp
+    git checkout $HASH
+    ```
+    To save on download time and storage, you can use a shallow clone to download only the target commit:
+    ```bash
+    mkdir -p llama.cpp
+    cd llama.cpp
+    git init
+    git remote add origin https://github.com/ggml-org/llama.cpp.git
+    git fetch --depth 1 origin $HASH
+    git checkout FETCH_HEAD
+    ```
+
+## 3. Build `llama.cpp` (RPC Server)
+
+The RPC server acts as the worker. You must explicitly enable the **backend** (the hardware interface, such as CUDA for Nvidia, Metal for Apple Silicon, or Vulkan) when building, otherwise the server will default to using only the CPU.
+
+To find the correct flags for your system, refer to the official documentation for the [`llama.cpp`](https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md) repository.
+
+> **Crucial:** You must pass `-DGGML_MAX_NAME=128` to the compiler to stay API-compatible with `stable-diffusion.cpp`. Without this flag, `GGML_MAX_NAME` defaults to `64` on the server, and data transfers between the client and server will fail. `-DGGML_RPC=ON` must also be enabled.
+>
+> I recommend disabling the `LLAMA_CURL` flag to avoid unnecessary dependencies, and disabling shared library builds to avoid potential conflicts.
+
+> **Build Target:** We are specifically building the `rpc-server` target. This prevents the build system from compiling the entire `llama.cpp` suite (like `llama-server`), making the build significantly faster.
+
+### Linux / WSL (Vulkan)
+
+```bash
+mkdir build
+cd build
+# Ensure the backend flag (here Vulkan) is enabled
+cmake .. -DGGML_RPC=ON \
+    -DGGML_VULKAN=ON \
+    -DGGML_BUILD_SHARED_LIBS=OFF \
+    -DLLAMA_CURL=OFF \
+    -DCMAKE_C_FLAGS=-DGGML_MAX_NAME=128 \
+    -DCMAKE_CXX_FLAGS=-DGGML_MAX_NAME=128
+cmake --build . --config Release --target rpc-server -j $(nproc)
+```
+
+### macOS (Metal)
+
+```bash
+mkdir build
+cd build
+cmake .. -DGGML_RPC=ON \
+    -DGGML_METAL=ON \
+    -DGGML_BUILD_SHARED_LIBS=OFF \
+    -DLLAMA_CURL=OFF \
+    -DCMAKE_C_FLAGS=-DGGML_MAX_NAME=128 \
+    -DCMAKE_CXX_FLAGS=-DGGML_MAX_NAME=128
+cmake --build . --config Release --target rpc-server
+```
+
+### Windows (Visual Studio 2022, Vulkan)
+
+```powershell
+mkdir build
+cd build
+cmake .. -G "Visual Studio 17 2022" -A x64 `
+    -DGGML_RPC=ON `
+    -DGGML_VULKAN=ON `
+    -DGGML_BUILD_SHARED_LIBS=OFF `
+    -DLLAMA_CURL=OFF `
+    -DCMAKE_C_FLAGS=-DGGML_MAX_NAME=128 `
+    -DCMAKE_CXX_FLAGS=-DGGML_MAX_NAME=128
+cmake --build . --config Release --target rpc-server
+```
+
+## 4. Usage
+
+Once both applications are built, you can run the server and the client to manage your GPU allocation.
+
+### Step A: Run the RPC Server
+
+Start the server. It listens for connections on the default address (usually `localhost:50052`). If your server is on a different machine, ensure the server binds to the correct interface and your firewall allows the connection.
+
+**On the Server:**
+If running on the same machine, you can use the default address:
+
+```bash
+./rpc-server
+```
+
+If you want to allow connections from other machines on the network:
+
+```bash
+./rpc-server --host 0.0.0.0
+```
+
+> **Security Warning:** The RPC server does not currently support authentication or encryption. **Only run the server on trusted local networks**. Never expose the RPC server directly to the open internet.
+
+> **Drivers & Hardware:** Ensure the Server machine has the necessary drivers installed and functional (e.g., Nvidia Drivers for CUDA, Vulkan SDK, or Metal). If no devices are found, the server will simply fall back to the CPU.
+
+<!-- ### Step B: Check if the client is able to connect to the server and see the available devices
+
+We're assuming the server is running on your local machine, and listening on the default port `50052`. If it's running on a different machine, you can replace `localhost` with the IP address of the server.
+
+**On the Client:**
+
+```bash
+./sd-cli --rpc-servers localhost:50052 --list-devices
+```
+
+If the server is running and the client is able to connect, you should see `RPC0    localhost:50052` in the list of devices.
+
+Example output:
+(Client built without GPU acceleration, two GPUs available on the server)
+
+```
+List of available GGML devices:
+Name    Description
+-------------------
+CPU     AMD Ryzen 9 5900X 12-Core Processor
+RPC0    localhost:50052
+RPC1    localhost:50052
+``` -->
+
+### Step B: Run with RPC device
+
+If everything is working correctly, you can now run the client while offloading some or all of the work to the RPC server.
+
+Example: set the main backend to the RPC0 device so that all the work runs on the server.
+
+```bash
+./sd-cli -m models/sd1.5.safetensors -p "A cat" --rpc-servers localhost:50052  --backend RPC0
+```
+
+---
+
+## 5. Scaling: Multiple RPC Servers
+
+You can connect the client to multiple RPC servers simultaneously to scale out your hardware usage.
+
+Example: A main machine (192.168.1.10) with 3 GPUs, one running CUDA and the other two running Vulkan, and a second machine (192.168.1.11) with only one GPU.
+
+**On the first machine (Running two server instances):**
+
+**Terminal 1 (CUDA):**
+
+```bash
+# Linux / WSL
+export CUDA_VISIBLE_DEVICES=0
+cd ./build_cuda/bin/Release
+./rpc-server --host 0.0.0.0
+
+# Windows PowerShell
+$env:CUDA_VISIBLE_DEVICES="0"
+cd .\build_cuda\bin\Release
+./rpc-server --host 0.0.0.0
+```
+
+**Terminal 2 (Vulkan):**
+
+```bash
+cd ./build_vulkan/bin/Release
+# ignore the first GPU (used by CUDA server)
+./rpc-server --host 0.0.0.0 --port 50053 -d Vulkan1,Vulkan2
+```
+
+**On the second machine:**
+
+```bash
+cd ./build/bin/Release
+./rpc-server --host 0.0.0.0
+```
+
+**On the Client:**
+Pass multiple server addresses separated by commas.
+
+```bash
+./sd-cli --rpc-servers 192.168.1.10:50052,192.168.1.10:50053,192.168.1.11:50052 [...]
+```
+
+The client will map these servers to sequential device IDs (e.g., RPC0 from the first server, RPC1 and RPC2 from the second, and RPC3 from the third). With this setup, you could for example use RPC0 for the main backend, RPC1 and RPC2 for the text encoders, and RPC3 for the VAE.
+
+---
+
+## 6. Performance Considerations
+
+RPC performance depends heavily on network bandwidth, because large weights and activations must be transferred back and forth over the network, especially for large models or high resolutions. For best results, ensure your network connection is stable and has sufficient bandwidth (>1Gbps recommended). This should not be a concern if you are running the server and client on the same machine, as the data transfer will happen over the loopback interface.
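
The sequential device numbering described in section 5 of docs/rpc.md can be sketched in Python. The per-server device counts are supplied by hand here as an assumption (in practice each server reports its own devices), and `assign_rpc_devices` is a hypothetical helper, not part of `sd-cli`.

```python
def assign_rpc_devices(servers):
    """Map sequential RPC device names to server endpoints.

    servers: list of (endpoint, device_count) in --rpc-servers order.
    Returns a dict such as {"RPC0": "host:port", ...}.
    """
    mapping = {}
    next_id = 0
    for endpoint, device_count in servers:
        for _ in range(device_count):
            mapping[f"RPC{next_id}"] = endpoint
            next_id += 1
    return mapping


# Device counts are assumed from the example setup: one CUDA GPU,
# two Vulkan GPUs, and one GPU on the second machine.
servers = [
    ("192.168.1.10:50052", 1),
    ("192.168.1.10:50053", 2),
    ("192.168.1.11:50052", 1),
]
devices = assign_rpc_devices(servers)
for name, endpoint in devices.items():
    print(name, endpoint)
```

Working out this table before writing the `--backend` assignment makes it easier to see which physical GPU each `RPCn` name refers to.
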
--- a/examples/cli/README.md
+++ b/examples/cli/README.md
@ -1,204 +1,9 @@
-# Run
+# Usage

-```
-usage: ./bin/sd-cli  [options]
+For detailed command-line arguments, run:

-CLI Options:
-  -o, --output <string>         path to write result image to. you can use printf-style %d format specifiers for image
-                                sequences (default: ./output.png) (eg. output_%03d.png). Single-file video outputs
-                                support .avi, .webm, and animated .webp
-  --image <string>              path to the image to inspect (for metadata mode)
-  --metadata-format <string>    metadata output format, one of [text, json] (default: text)
-  --preview-path <string>       path to write preview image to (default: ./preview.png). Multi-frame previews support
-                                .avi, .webm, and animated .webp
-  --preview-interval <int>      interval in denoising steps between consecutive updates of the image preview file
-                                (default is 1, meaning updating at every step)
-  --output-begin-idx <int>      starting index for output image sequence, must be non-negative (default 0 if specified
-                                %d in output path, 1 otherwise)
-  --canny                       apply canny preprocessor (edge detection)
-  --convert-name                convert tensor name (for convert mode)
-  -v, --verbose                 print extra info
-  --color                       colors the logging tags according to level
-  --taesd-preview-only          prevents usage of taesd for decoding the final image. (for use with --preview tae)
-  --preview-noisy               enables previewing noisy inputs of the models rather than the denoised outputs
-  --metadata-raw                include raw hex previews for unparsed metadata payloads
-  --metadata-brief              truncate long metadata text values in text output
-  --metadata-all                include structural/container entries such as IHDR, IDAT, and non-metadata JPEG segments
-  -M, --mode                    run mode, one of [img_gen, vid_gen, upscale, convert, metadata], default: img_gen
-  --preview                     preview method. must be one of the following [none, proj, tae, vae] (default is none)
-  -h, --help                    show this help message and exit
-
-Context Options:
-  -m, --model <string>                     path to full model
-  --clip_l <string>                        path to the clip-l text encoder
-  --clip_g <string>                        path to the clip-g text encoder
-  --clip_vision <string>                   path to the clip-vision encoder
-  --t5xxl <string>                         path to the t5xxl text encoder
-  --llm <string>                           path to the llm text encoder. For example: (qwenvl2.5 for qwen-image,
-                                           mistral-small3.2 for flux2, ...)
-  --llm_vision <string>                    path to the llm vit
-  --qwen2vl <string>                       alias of --llm. Deprecated.
-  --qwen2vl_vision <string>                alias of --llm_vision. Deprecated.
-  --diffusion-model <string>               path to the standalone diffusion model
-  --high-noise-diffusion-model <string>    path to the standalone high noise diffusion model
-  --uncond-diffusion-model <string>        path to the standalone unconditional diffusion model, currently used by
-                                           Ideogram4 CFG
-  --vae <string>                           path to standalone vae model
-  --taesd <string>                         path to taesd. Using Tiny AutoEncoder for fast decoding (low quality)
-  --tae <string>                           alias of --taesd
-  --control-net <string>                   path to control net model
-  --embd-dir <string>                      embeddings directory
-  --lora-model-dir <string>                lora model directory
-  --hires-upscalers-dir <string>           highres fix upscaler model directory
-  --tensor-type-rules <string>             weight type per tensor pattern (example: "^vae\.=f16,model\.=q8_0")
-  --photo-maker <string>                   path to PHOTOMAKER model
-  --upscale-model <string>                 path to esrgan model.
-  -t, --threads <int>                      number of threads to use during computation (default: -1). If threads <= 0,
-                                           then threads will be set to the number of CPU physical cores
-  --chroma-t5-mask-pad <int>               t5 mask pad size of chroma
-  --max-vram <float>                       maximum VRAM budget in GiB for graph-cut segmented execution. 0 disables
-                                           graph splitting; a negative value auto-detects free VRAM, sparing the
-                                           specified value (e.g. -0.5 will keep at least 0.5 GiB free)
-  --force-sdxl-vae-conv-scale              force use of conv scale on sdxl vae
-  --offload-to-cpu                         place the weights in RAM to save VRAM, and automatically load them into VRAM
-                                           when needed
-  --mmap                                   whether to memory-map model
-  --control-net-cpu                        keep controlnet in cpu (for low vram)
-  --clip-on-cpu                            keep clip in cpu (for low vram)
-  --vae-on-cpu                             keep vae in cpu (for low vram)
-  --fa                                     use flash attention
-  --diffusion-fa                           use flash attention in the diffusion model only
-  --diffusion-conv-direct                  use ggml_conv2d_direct in the diffusion model
-  --vae-conv-direct                        use ggml_conv2d_direct in the vae model
-  --circular                               enable circular padding for convolutions
-  --circularx                              enable circular RoPE wrapping on x-axis (width) only
-  --circulary                              enable circular RoPE wrapping on y-axis (height) only
-  --chroma-disable-dit-mask                disable dit mask for chroma
-  --qwen-image-zero-cond-t                 enable zero_cond_t for qwen image
-  --chroma-enable-t5-mask                  enable t5 mask for chroma
-  --type                                   weight type (examples: f32, f16, q4_0, q4_1, q5_0, q5_1, q8_0, q2_K, q3_K,
-                                           q4_K). If not specified, the default is the type of the weight file
-  --rng                                    RNG, one of [std_default, cuda, cpu], default: cuda(sd-webui), cpu(comfyui)
-  --sampler-rng                            sampler RNG, one of [std_default, cuda, cpu]. If not specified, use --rng
-  --prediction                             prediction type override, one of [eps, v, edm_v, sd3_flow, flux_flow,
-                                           flux2_flow]
-  --lora-apply-mode                        the way to apply LoRA, one of [auto, immediately, at_runtime], default is
-                                           auto. In auto mode, if the model weights contain any quantized parameters,
-                                           the at_runtime mode will be used; otherwise, immediately will be used.The
-                                           immediately mode may have precision and compatibility issues with quantized
-                                           parameters, but it usually offers faster inference speed and, in some cases,
-                                           lower memory usage. The at_runtime mode, on the other hand, is exactly the
-                                           opposite.
-
-Generation Options:
-  -p, --prompt <string>                    the prompt to render
-  -n, --negative-prompt <string>           the negative prompt (default: "")
-  -i, --init-img <string>                  path to the init image
-  --end-img <string>                       path to the end image, required by flf2v
-  --mask <string>                          path to the mask image
-  --control-image <string>                 path to control image, control net
-  --control-video <string>                 path to control video frames, It must be a directory path. The video frames
-                                           inside should be stored as images in lexicographical (character) order. For
-                                           example, if the control video path is `frames`, the directory contain images
-                                           such as 00.png, 01.png, ... etc.
-  --pm-id-images-dir <string>              path to PHOTOMAKER input id images dir
-  --pm-id-embed-path <string>              path to PHOTOMAKER v2 id embed
-  --hires-upscaler <string>                highres fix upscaler, Lanczos, Nearest, Latent, Latent (nearest), Latent
-                                           (nearest-exact), Latent (antialiased), Latent (bicubic), Latent (bicubic
-                                           antialiased), or a model name under --hires-upscalers-dir (default: Latent)
-  --extra-sample-args <string>             extra sampler/scheduler/guidance args, key=value list. APG supports apg_eta,
-                                           apg_momentum, apg_norm_threshold, apg_norm_threshold_smoothing; SLG supports
-                                           slg_uncond; lcm supports noise_clip_std, noise_scale_start, noise_scale_end;
-                                           ltx2 supports max_shift, base_shift, stretch, terminal; euler_ge supports gamma
-  --extra-tiling-args <string>             extra VAE tiling args, key=value list. LTX video VAE supports
-                                           temporal_tile_frames (default: 4), temporal_tile_overlap (default: 1)
-  -H, --height <int>                       image height, in pixel space (default: 512)
-  -W, --width <int>                        image width, in pixel space (default: 512)
-  --steps <int>                            number of sample steps (default: 20)
-  --high-noise-steps <int>                 (high noise) number of sample steps (default: -1 = auto)
-  --clip-skip <int>                        ignore last layers of CLIP network; 1 ignores none, 2 ignores one layer
-                                           (default: -1). <= 0 represents unspecified, will be 1 for SD1.x, 2 for SD2.x
-  -b, --batch-count <int>                  batch count
-  --video-frames <int>                     video frames (default: 1)
-  --fps <int>                              fps (default: 24)
-  --timestep-shift <int>                   shift timestep for NitroFusion models (default: 0). recommended N for
-                                           NitroSD-Realism around 250 and 500 for NitroSD-Vibrant
-  --upscale-repeats <int>                  Run the ESRGAN upscaler this many times (default: 1)
-  --upscale-tile-size <int>                tile size for ESRGAN upscaling (default: 128)
-  --hires-width <int>                      highres fix target width, 0 to use --hires-scale (default: 0)
-  --hires-height <int>                     highres fix target height, 0 to use --hires-scale (default: 0)
-  --hires-steps <int>                      highres fix second pass sample steps, 0 to reuse --steps (default: 0)
-  --hires-upscale-tile-size <int>          highres fix upscaler tile size, reserved for model-backed upscalers (default:
-                                           128)
-  --cfg-scale <float>                      unconditional guidance scale: (default: 7.0)
-  --img-cfg-scale <float>                  image guidance scale for inpaint or image edit models: (default: same as
-                                           --cfg-scale)
-  --guidance <float>                       distilled guidance scale for models with guidance input (default: 3.5)
-  --slg-scale <float>                      skip layer guidance (SLG) scale, only for DiT models: (default: 0). 0 means
-                                           disabled, a value of 2.5 is nice for sd3.5 medium
-  --skip-layer-start <float>               SLG enabling point (default: 0.01)
-  --skip-layer-end <float>                 SLG disabling point (default: 0.2)
-  --eta <float>                            noise multiplier (default: 0 for ddim_trailing, tcd, res_multistep and
-                                           res_2s; 1 for euler_a, er_sde and dpm++2s_a)
-  --flow-shift <float>                     shift value for Flow models like SD3.x or WAN (default: auto)
-  --high-noise-cfg-scale <float>           (high noise) unconditional guidance scale: (default: 7.0)
-  --high-noise-img-cfg-scale <float>       (high noise) image guidance scale for inpaint or image edit models (default:
-                                           same as --cfg-scale)
-  --high-noise-guidance <float>            (high noise) distilled guidance scale for models with guidance input
-                                           (default: 3.5)
-  --high-noise-slg-scale <float>           (high noise) skip layer guidance (SLG) scale, only for DiT models: (default:
-                                           0)
-  --high-noise-skip-layer-start <float>    (high noise) SLG enabling point (default: 0.01)
-  --high-noise-skip-layer-end <float>      (high noise) SLG disabling point (default: 0.2)
-  --high-noise-eta <float>                 (high noise) noise multiplier (default: 0 for ddim_trailing, tcd,
-                                           res_multistep and res_2s; 1 for euler_a, er_sde and dpm++2s_a)
-  --strength <float>                       strength for noising/unnoising (default: 0.75)
-  --pm-style-strength <float>
-  --control-strength <float>               strength to apply Control Net (default: 0.9). 1.0 corresponds to full
-                                           destruction of information in init image
-  --moe-boundary <float>                   timestep boundary for Wan2.2 MoE model. (default: 0.875). Only enabled if
-                                           `--high-noise-steps` is set to -1
-  --vace-strength <float>                  wan vace strength
-  --vae-tile-overlap <float>               tile overlap for vae tiling, in fraction of tile size (default: 0.5)
-  --hires-scale <float>                    highres fix scale when target size is not set (default: 2.0)
-  --hires-denoising-strength <float>       highres fix second pass denoising strength (default: 0.7)
-  --increase-ref-index                     automatically increase the indices of references images based on the order
-                                           they are listed (starting with 1).
-  --disable-auto-resize-ref-image          disable auto resize of ref images
-  --disable-image-metadata                 do not embed generation metadata on image files
-  --vae-tiling                             process vae in tiles to reduce memory usage
-  --temporal-tiling                        enable temporal tiling for LTX video VAE decode
-  --hires                                  enable highres fix
-  -s, --seed                               RNG seed (default: 42, use random seed for < 0)
-  --sampling-method                        sampling method, one of [euler, euler_a, heun, dpm2, dpm++2s_a, dpm++2m,
-                                           dpm++2mv2, ipndm, ipndm_v, lcm, ddim_trailing, tcd, res_multistep, res_2s,
-                                           er_sde, euler_cfg_pp, euler_a_cfg_pp] (default: euler for Flux/SD3/Wan, euler_a otherwise)
-  --high-noise-sampling-method             (high noise) sampling method, one of [euler, euler_a, heun, dpm2, dpm++2s_a,
-                                           dpm++2m, dpm++2mv2, ipndm, ipndm_v, lcm, ddim_trailing, tcd, res_multistep,
-                                           res_2s, er_sde, euler_cfg_pp, euler_a_cfg_pp] default: euler for Flux/SD3/Wan, euler_a otherwise
-  --scheduler                              denoiser sigma scheduler, one of [discrete, karras, exponential, ays, gits,
-                                           smoothstep, sgm_uniform, simple, kl_optimal, lcm, bong_tangent, ltx2], default:
-                                           model-specific
-  --sigmas                                 custom sigma values for the sampler, comma-separated (e.g.,
-                                           "14.61,7.8,3.5,0.0").
-  --hires-sigmas                           custom sigma values for the highres fix second pass, comma-separated (e.g.,
-                                           "0.85,0.725,0.421875,0.0").
-  --skip-layers                            layers to skip for SLG steps (default: [7,8,9])
-  --high-noise-skip-layers                 (high noise) layers to skip for SLG steps (default: [7,8,9])
-  -r, --ref-image                          reference image for Flux Kontext models (can be used multiple times)
-  --cache-mode                             caching method: 'easycache' (DiT), 'ucache' (UNET),
-                                           'dbcache'/'taylorseer'/'cache-dit' (DiT block-level), 'spectrum' (UNET/DiT
-                                           Chebyshev+Taylor forecasting)
-  --cache-option                           named cache params (key=value format, comma-separated). easycache/ucache:
-                                           threshold=,start=,end=,decay=,relative=,reset=; dbcache/taylorseer/cache-dit:
-                                           Fn=,Bn=,threshold=,warmup=; spectrum: w=,m=,lam=,window=,flex=,warmup=,stop=.
-                                           Examples: "threshold=0.25" or "threshold=1.5,reset=0"
-  --scm-mask                               SCM steps mask for cache-dit: comma-separated 0/1 (e.g.,
-                                           "1,1,1,0,0,1,0,0,1,0") - 1=compute, 0=can cache
-  --scm-policy                             SCM policy: 'dynamic' (default) or 'static'
-  --vae-tile-size                          tile size for vae tiling, format [X]x[Y] (default: 32x32)
-  --vae-relative-tile-size                 relative tile size for vae tiling, format [X]x[Y], in fraction of image size
-                                           if < 1, in number of tiles per dim if >=1 (overrides --vae-tile-size)
+```bash
+./bin/sd-cli -h
 ```

 Metadata mode inspects PNG/JPEG container metadata without loading any model:
--- a/examples/cli/main.cpp
+++ b/examples/cli/main.cpp
@ -62,18 +62,22 @@ struct SDCliParams {
            {"-o",
             "--output",
             "path to write result image to. you can use printf-style %d format specifiers for image sequences (default: ./output.png) (eg. output_%03d.png). Single-file video outputs support .avi, .webm, and animated .webp",
+             0,
             &output_path},
            {"",
             "--image",
             "path to the image to inspect (for metadata mode)",
+             0,
             &image_path},
            {"",
             "--metadata-format",
             "metadata output format, one of [text, json] (default: text)",
+             0,
             &metadata_format},
            {"",
             "--preview-path",
             "path to write preview image to (default: ./preview.png). Multi-frame previews support .avi, .webm, and animated .webp",
+             0,
             &preview_path},
        };

@ -782,12 +786,11 @@ int main(int argc, const char* argv[]) {
    int upscale_factor = 4;  // unused for RealESRGAN_x4plus_anime_6B.pth
    if (ctx_params.esrgan_path.size() > 0 && gen_params.upscale_repeats > 0) {
        UpscalerCtxPtr upscaler_ctx(new_upscaler_ctx(ctx_params.esrgan_path.c_str(),
-                                                     ctx_params.offload_params_to_cpu,
                                                     ctx_params.diffusion_conv_direct,
                                                     ctx_params.n_threads,
                                                     gen_params.upscale_tile_size,
-                                                     ctx_params.backend.c_str(),
-                                                     ctx_params.params_backend.c_str()));
+                                                     sd_ctx_params.backend,
+                                                     sd_ctx_params.params_backend));

        if (upscaler_ctx == nullptr) {
            LOG_ERROR("new_upscaler_ctx failed");
--- a/examples/common/common.cpp
+++ b/examples/common/common.cpp
@ -6,6 +6,7 @@
 #include <cstdlib>
 #include <ctime>
 #include <filesystem>
+#include <fstream>
 #include <iomanip>
 #include <iostream>
 #include <regex>
@ -51,6 +52,10 @@ static sd_vae_format_t str_to_vae_format(const std::string& value) {
    return SD_VAE_FORMAT_COUNT;
 }

+static void prepend_backend_assignment(std::string& spec, const char* assignment) {
+    spec = spec.empty() ? assignment : std::string(assignment) + "," + spec;
+}
+
 #if defined(_WIN32)
 static std::string utf16_to_utf8(const std::wstring& wstr) {
    if (wstr.empty())
@ -256,8 +261,15 @@ bool parse_options(int argc, const char** argv, const std::vector<ArgOptions>& o
                        invalid_arg = true;
                        return;
                    }
-                    *option.target = argv_to_utf8(i, argv);
-                    found_arg      = true;
+                    if (option.concat && !option.target->empty()) {
+                        if (option.concat > 0 && option.concat <= 0xff) {
+                            *option.target += static_cast<char>(option.concat);
+                        }
+                        *option.target += argv_to_utf8(i, argv);
+                    } else {
+                        *option.target = argv_to_utf8(i, argv);
+                    }
+                    found_arg = true;
                }))
                break;
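
Review note: the new integer field in each string option (`0` or `(int)','` in the tables in this change) appears to control what happens when the same flag is given more than once. The sketch below restates the branch above as a free function so the behavior can be checked in isolation. `apply_string_option` and `demo_repeated_backend` are hypothetical names introduced for this note, not part of the codebase.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the string-option branch in parse_options().
// concat == 0      : the last value on the command line wins.
// concat in 1..255 : repeated values are joined with that byte as separator.
static void apply_string_option(std::string& target, int concat, const std::string& value) {
    if (concat && !target.empty()) {
        if (concat > 0 && concat <= 0xff) {
            target += static_cast<char>(concat);
        }
        target += value;
    } else {
        target = value;
    }
}

// Usage sketch: two --backend flags collapse into one comma-separated spec.
static std::string demo_repeated_backend() {
    std::string backend;
    apply_string_option(backend, ',', "clip=cpu");
    apply_string_option(backend, ',', "vae=cuda0");
    assert(backend == "clip=cpu,vae=cuda0");
    return backend;
}
```

One thing worth confirming with the author: a non-zero `concat` outside 1..255 appends the new value with no separator at all, which is probably not intended to be reachable.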

@ -320,109 +332,152 @@ ArgOptions SDContextParams::get_options() {
        {"-m",
         "--model",
         "path to full model",
+         0,
         &model_path},
        {"",
         "--clip_l",
-         "path to the clip-l text encoder", &clip_l_path},
+         "path to the clip-l text encoder",
+         0,
+         &clip_l_path},
        {"", "--clip_g",
         "path to the clip-g text encoder",
+         0,
         &clip_g_path},
        {"",
         "--clip_vision",
         "path to the clip-vision encoder",
+         0,
         &clip_vision_path},
        {"",
         "--t5xxl",
         "path to the t5xxl text encoder",
+         0,
         &t5xxl_path},
        {"",
         "--llm",
         "path to the llm text encoder. For example: (qwenvl2.5 for qwen-image, mistral-small3.2 for flux2, ...)",
+         0,
         &llm_path},
        {"",
         "--llm_vision",
         "path to the llm vit",
+         0,
         &llm_vision_path},
        {"",
         "--qwen2vl",
         "alias of --llm. Deprecated.",
+         0,
         &llm_path},
        {"",
         "--qwen2vl_vision",
         "alias of --llm_vision. Deprecated.",
+         0,
         &llm_vision_path},
        {"",
         "--diffusion-model",
         "path to the standalone diffusion model",
+         0,
         &diffusion_model_path},
        {"",
         "--high-noise-diffusion-model",
         "path to the standalone high noise diffusion model",
+         0,
         &high_noise_diffusion_model_path},
        {"",
         "--uncond-diffusion-model",
         "path to the standalone unconditional diffusion model, currently used by Ideogram4 CFG",
+         0,
         &uncond_diffusion_model_path},
        {"",
         "--embeddings-connectors",
         "path to LTXAV embeddings connectors",
+         0,
         &embeddings_connectors_path},
        {"",
         "--vae",
         "path to standalone vae model",
+         0,
         &vae_path},
        {"",
         "--vae-format",
         "VAE latent format override: auto, flux, sd3, or flux2 (default: auto)",
+         0,
         &vae_format},
        {"",
         "--audio-vae",
         "path to standalone LTX audio vae model",
+         0,
         &audio_vae_path},
        {"",
         "--taesd",
         "path to taesd. Using Tiny AutoEncoder for fast decoding (low quality)",
+         0,
         &taesd_path},
        {"",
         "--tae",
         "alias of --taesd",
+         0,
         &taesd_path},
        {"",
         "--control-net",
         "path to control net model",
+         0,
         &control_net_path},
        {"",
         "--embd-dir",
         "embeddings directory",
+         0,
         &embedding_dir},
        {"",
         "--lora-model-dir",
         "lora model directory",
+         0,
         &lora_model_dir},
        {"",
         "--hires-upscalers-dir",
         "highres fix upscaler model directory",
+         0,
         &hires_upscalers_dir},
        {"",
         "--tensor-type-rules",
         "weight type per tensor pattern (example: \"^vae\\.=f16,model\\.=q8_0\")",
+         (int)',',
         &tensor_type_rules},
        {"",
         "--photo-maker",
         "path to PHOTOMAKER model",
+         0,
         &photo_maker_path},
+        {"",
+         "--pulid-weights",
+         "path to PuLID Flux weights",
+         0,
+         &pulid_weights_path},
        {"",
         "--upscale-model",
         "path to esrgan model.",
+         0,
         &esrgan_path},
        {"",
         "--backend",
         "runtime backend assignment, e.g. cpu or clip=cpu,vae=cuda0,diffusion=vulkan0",
+         (int)',',
         &backend},
        {"",
         "--params-backend",
         "parameter backend assignment, e.g. disk, cpu, or diffusion=disk,clip=cpu",
+         (int)',',
         &params_backend},
+        {"",
+         "--rpc-servers",
+         "comma-separated list of RPC servers to connect to for offloading, in the format host:port, e.g. localhost:50052,192.168.1.3:50052",
+         (int)',',
+         &rpc_servers},
+        {"",
+         "--max-vram",
+         "maximum VRAM budget in GiB for graph-cut segmented execution. Accepts a single value or assignments by backend/device, e.g. 6 or cuda0=6,vulkan0=4. 0 disables graph splitting; a negative value auto-detects free VRAM, sparing the specified value",
+         0,
+         &max_vram},
    };

    options.int_options = {
@ -437,18 +492,15 @@ ArgOptions SDContextParams::get_options() {
         &chroma_t5_mask_pad},
    };

-    options.float_options = {
-        {"",
-         "--max-vram",
-         "maximum VRAM budget in GiB for graph-cut segmented execution. 0 disables graph splitting; a negative value auto-detects free VRAM, sparing the specified value (e.g. -0.5 will keep at least 0.5 GiB free)",
-         &max_vram},
-    };
-
    options.bool_options = {
        {"",
         "--stream-layers",
         "enable residency+prefetch streaming on top of --max-vram (no effect without --max-vram; defaults to false)",
         true, &stream_layers},
+        {"",
+         "--eager-load",
+         "load all params into the params backend at model-load time instead of lazily on first use (defaults to false)",
+         true, &eager_load},
        {"",
         "--force-sdxl-vae-conv-scale",
         "force use of conv scale on sdxl vae",
@ -463,15 +515,15 @@ ArgOptions SDContextParams::get_options() {
         true, &enable_mmap},
        {"",
         "--control-net-cpu",
-         "keep controlnet in cpu (for low vram)",
+         "deprecated; use --backend controlnet=cpu",
         true, &control_net_cpu},
        {"",
         "--clip-on-cpu",
-         "keep clip in cpu (for low vram)",
+         "deprecated; use --backend te=cpu",
         true, &clip_on_cpu},
        {"",
         "--vae-on-cpu",
-         "keep vae in cpu (for low vram)",
+         "deprecated; use --backend vae=cpu",
         true, &vae_on_cpu},
        {"",
         "--fa",
@ -688,6 +740,25 @@ bool SDContextParams::resolve_and_validate(SDMode mode) {
    return true;
 }

+void SDContextParams::prepare_backend_assignments() {
+    effective_backend        = backend;
+    effective_params_backend = params_backend;
+
+    if (offload_params_to_cpu) {
+        prepend_backend_assignment(effective_params_backend, "*=cpu");
+    }
+
+    if (clip_on_cpu) {
+        prepend_backend_assignment(effective_backend, "te=cpu");
+    }
+    if (vae_on_cpu) {
+        prepend_backend_assignment(effective_backend, "vae=cpu");
+    }
+    if (control_net_cpu) {
+        prepend_backend_assignment(effective_backend, "controlnet=cpu");
+    }
+}
+
 std::string SDContextParams::to_string() const {
    std::ostringstream emb_ss;
    emb_ss << "{\n";
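
Review note: `prepare_backend_assignments()` folds the deprecated `--clip-on-cpu`, `--vae-on-cpu` and `--control-net-cpu` flags into the backend spec by prepending entries. A minimal sketch of the resulting strings follows. `LegacyFlags` and `effective_backend_for` are hypothetical reductions written for this note; only `prepend_backend_assignment` is copied from the diff.

```cpp
#include <cassert>
#include <string>

// Same helper as in the diff, repeated so the sketch compiles on its own.
static void prepend_backend_assignment(std::string& spec, const char* assignment) {
    assert(assignment != nullptr);
    spec = spec.empty() ? assignment : std::string(assignment) + "," + spec;
}

// Hypothetical reduction of SDContextParams to the fields involved here.
struct LegacyFlags {
    std::string backend;
    bool clip_on_cpu     = false;
    bool vae_on_cpu      = false;
    bool control_net_cpu = false;
};

// Mirrors the order of the if-blocks in prepare_backend_assignments().
static std::string effective_backend_for(const LegacyFlags& f) {
    std::string spec = f.backend;
    if (f.clip_on_cpu) {
        prepend_backend_assignment(spec, "te=cpu");
    }
    if (f.vae_on_cpu) {
        prepend_backend_assignment(spec, "vae=cpu");
    }
    if (f.control_net_cpu) {
        prepend_backend_assignment(spec, "controlnet=cpu");
    }
    return spec;
}
```

With `--backend vae=cuda0 --clip-on-cpu --vae-on-cpu` this yields `vae=cpu,te=cpu,vae=cuda0`. Because the legacy entries come first, an explicit `--backend` entry would take precedence only if the spec parser lets later entries override earlier ones; that parser is not part of this hunk, so the precedence is an assumption to verify.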
@ -731,8 +802,9 @@ std::string SDContextParams::to_string() const {
        << "  rng_type: " << sd_rng_type_name(rng_type) << ",\n"
        << "  sampler_rng_type: " << sd_rng_type_name(sampler_rng_type) << ",\n"
        << "  offload_params_to_cpu: " << (offload_params_to_cpu ? "true" : "false") << ",\n"
-        << "  max_vram: " << max_vram << ",\n"
+        << "  max_vram: \"" << max_vram << "\",\n"
        << "  stream_layers: " << (stream_layers ? "true" : "false") << ",\n"
+        << "  eager_load: " << (eager_load ? "true" : "false") << ",\n"
        << "  backend: \"" << backend << "\",\n"
        << "  params_backend: \"" << params_backend << "\",\n"
        << "  enable_mmap: " << (enable_mmap ? "true" : "false") << ",\n"
@ -758,6 +830,7 @@ std::string SDContextParams::to_string() const {
 }

 sd_ctx_params_t SDContextParams::to_sd_ctx_params_t(bool taesd_preview) {
+    prepare_backend_assignments();
    embedding_vec.clear();
    embedding_vec.reserve(embedding_map.size());
    for (const auto& kv : embedding_map) {
@ -767,55 +840,54 @@ sd_ctx_params_t SDContextParams::to_sd_ctx_params_t(bool taesd_preview) {
        embedding_vec.emplace_back(item);
    }

-    sd_ctx_params_t sd_ctx_params = {
-        model_path.c_str(),
-        clip_l_path.c_str(),
-        clip_g_path.c_str(),
-        clip_vision_path.c_str(),
-        t5xxl_path.c_str(),
-        llm_path.c_str(),
-        llm_vision_path.c_str(),
-        diffusion_model_path.c_str(),
-        high_noise_diffusion_model_path.c_str(),
-        uncond_diffusion_model_path.c_str(),
-        embeddings_connectors_path.c_str(),
-        vae_path.c_str(),
-        audio_vae_path.c_str(),
-        taesd_path.c_str(),
-        control_net_path.c_str(),
-        embedding_vec.data(),
-        static_cast<uint32_t>(embedding_vec.size()),
-        photo_maker_path.c_str(),
-        tensor_type_rules.c_str(),
-        n_threads,
-        wtype,
-        rng_type,
-        sampler_rng_type,
-        prediction,
-        lora_apply_mode,
-        offload_params_to_cpu,
-        enable_mmap,
-        clip_on_cpu,
-        control_net_cpu,
-        vae_on_cpu,
-        flash_attn,
-        diffusion_flash_attn,
-        taesd_preview,
-        diffusion_conv_direct,
-        vae_conv_direct,
-        circular || circular_x,
-        circular || circular_y,
-        force_sdxl_vae_conv_scale,
-        chroma_use_dit_mask,
-        chroma_use_t5_mask,
-        chroma_t5_mask_pad,
-        qwen_image_zero_cond_t,
-        str_to_vae_format(vae_format),
-        max_vram,
-        stream_layers,
-        backend.c_str(),
-        params_backend.c_str(),
-    };
+    sd_ctx_params_t sd_ctx_params;
+    sd_ctx_params_init(&sd_ctx_params);
+    sd_ctx_params.model_path                      = model_path.c_str();
+    sd_ctx_params.clip_l_path                     = clip_l_path.c_str();
+    sd_ctx_params.clip_g_path                     = clip_g_path.c_str();
+    sd_ctx_params.clip_vision_path                = clip_vision_path.c_str();
+    sd_ctx_params.t5xxl_path                      = t5xxl_path.c_str();
+    sd_ctx_params.llm_path                        = llm_path.c_str();
+    sd_ctx_params.llm_vision_path                 = llm_vision_path.c_str();
+    sd_ctx_params.diffusion_model_path            = diffusion_model_path.c_str();
+    sd_ctx_params.high_noise_diffusion_model_path = high_noise_diffusion_model_path.c_str();
+    sd_ctx_params.uncond_diffusion_model_path     = uncond_diffusion_model_path.c_str();
+    sd_ctx_params.embeddings_connectors_path      = embeddings_connectors_path.c_str();
+    sd_ctx_params.vae_path                        = vae_path.c_str();
+    sd_ctx_params.audio_vae_path                  = audio_vae_path.c_str();
+    sd_ctx_params.taesd_path                      = taesd_path.c_str();
+    sd_ctx_params.control_net_path                = control_net_path.c_str();
+    sd_ctx_params.embeddings                      = embedding_vec.data();
+    sd_ctx_params.embedding_count                 = static_cast<uint32_t>(embedding_vec.size());
+    sd_ctx_params.photo_maker_path                = photo_maker_path.c_str();
+    sd_ctx_params.pulid_weights_path              = pulid_weights_path.c_str();
+    sd_ctx_params.tensor_type_rules               = tensor_type_rules.c_str();
+    sd_ctx_params.n_threads                       = n_threads;
+    sd_ctx_params.wtype                           = wtype;
+    sd_ctx_params.rng_type                        = rng_type;
+    sd_ctx_params.sampler_rng_type                = sampler_rng_type;
+    sd_ctx_params.prediction                      = prediction;
+    sd_ctx_params.lora_apply_mode                 = lora_apply_mode;
+    sd_ctx_params.enable_mmap                     = enable_mmap;
+    sd_ctx_params.flash_attn                      = flash_attn;
+    sd_ctx_params.diffusion_flash_attn            = diffusion_flash_attn;
+    sd_ctx_params.tae_preview_only                = taesd_preview;
+    sd_ctx_params.diffusion_conv_direct           = diffusion_conv_direct;
+    sd_ctx_params.vae_conv_direct                 = vae_conv_direct;
+    sd_ctx_params.circular_x                      = circular || circular_x;
+    sd_ctx_params.circular_y                      = circular || circular_y;
+    sd_ctx_params.force_sdxl_vae_conv_scale       = force_sdxl_vae_conv_scale;
+    sd_ctx_params.chroma_use_dit_mask             = chroma_use_dit_mask;
+    sd_ctx_params.chroma_use_t5_mask              = chroma_use_t5_mask;
+    sd_ctx_params.chroma_t5_mask_pad              = chroma_t5_mask_pad;
+    sd_ctx_params.qwen_image_zero_cond_t          = qwen_image_zero_cond_t;
+    sd_ctx_params.vae_format                      = str_to_vae_format(vae_format);
+    sd_ctx_params.max_vram                        = max_vram.c_str();
+    sd_ctx_params.stream_layers                   = stream_layers;
+    sd_ctx_params.eager_load                      = eager_load;
+    sd_ctx_params.backend                         = effective_backend.c_str();
+    sd_ctx_params.params_backend                  = effective_params_backend.c_str();
+    sd_ctx_params.rpc_servers                     = rpc_servers.c_str();
    return sd_ctx_params;
 }
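
Review note: replacing the positional aggregate initializer with `sd_ctx_params_init()` plus named assignments means a newly added field (such as `eager_load` or `rpc_servers`) no longer shifts every later initializer by one slot. The sketch below shows the pattern on a hypothetical struct; `demo_params_t`, `demo_params_init` and `make_demo_params` are invented for illustration, and the defaults shown are not claimed to match what `sd_ctx_params_init()` sets.

```cpp
#include <cassert>

// Hypothetical struct, not the real sd_ctx_params_t.
struct demo_params_t {
    const char* model_path;
    int         n_threads;
    bool        eager_load;  // imagine this field was inserted in a later version
    bool        enable_mmap;
};

// Init function owns the defaults, so callers never rely on field order.
static void demo_params_init(demo_params_t* p) {
    assert(p != nullptr);
    *p           = {};  // zero every field first
    p->n_threads = -1;  // assumed default for this sketch only
}

// Callers assign by name; fields they do not mention keep their defaults.
static demo_params_t make_demo_params(const char* model_path) {
    demo_params_t p;
    demo_params_init(&p);
    p.model_path  = model_path;
    p.enable_mmap = true;
    return p;
}
```

With the old positional form, inserting `eager_load` before `enable_mmap` would have silently routed the caller's `true` into the wrong field; with named assignment the same change compiles and behaves as before.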

@ -830,54 +902,71 @@ ArgOptions SDGenerationParams::get_options() {
        {"-p",
         "--prompt",
         "the prompt to render",
+         0,
         &prompt},
        {"-n",
         "--negative-prompt",
         "the negative prompt (default: \"\")",
+         0,
         &negative_prompt},
        {"-i",
         "--init-img",
         "path to the init image",
+         0,
         &init_image_path},
        {"",
         "--end-img",
         "path to the end image, required by flf2v",
+         0,
         &end_image_path},
        {"",
         "--mask",
         "path to the mask image",
+         0,
         &mask_image_path},
        {"",
         "--control-image",
         "path to control image, control net",
+         0,
         &control_image_path},
        {"",
         "--control-video",
         "path to control video frames, It must be a directory path. The video frames inside should be stored as images in "
         "lexicographical (character) order. For example, if the control video path is `frames`, the directory contain images "
         "such as 00.png, 01.png, ... etc.",
+         0,
         &control_video_path},
        {"",
         "--pm-id-images-dir",
         "path to PHOTOMAKER input id images dir",
+         0,
         &pm_id_images_dir},
        {"",
         "--pm-id-embed-path",
         "path to PHOTOMAKER v2 id embed",
+         0,
         &pm_id_embed_path},
+        {"",
+         "--pulid-id-embedding",
+         "path to PuLID id embedding",
+         0,
+         &pulid_id_embedding_path},
        {"",
         "--hires-upscaler",
         "highres fix upscaler, Lanczos, Nearest, Latent, Latent (nearest), Latent (nearest-exact), "
         "Latent (antialiased), Latent (bicubic), Latent (bicubic antialiased), or a model name "
         "under --hires-upscalers-dir (default: Latent)",
+         0,
         &hires_upscaler},
        {"",
         "--extra-sample-args",
-         "extra sampler/scheduler/guidance args, key=value list. APG supports apg_eta, apg_momentum, apg_norm_threshold, apg_norm_threshold_smoothing; SLG supports slg_uncond; lcm supports noise_clip_std, noise_scale_start, noise_scale_end; ltx2 supports max_shift, base_shift, stretch, terminal; euler_ge supports gamma",
+         "extra sampler/scheduler/guidance args, key=value list. CFG supports guidance_schedule; APG supports apg_eta, apg_momentum, apg_norm_threshold, apg_norm_threshold_smoothing; SLG supports slg_uncond; lcm supports noise_clip_std, noise_scale_start, noise_scale_end; ltx2 supports max_shift, base_shift, stretch, terminal; euler_ge supports gamma;",
+         (int)',',
         &extra_sample_args},
        {"",
         "--extra-tiling-args",
         "extra VAE tiling args, key=value list. LTX video VAE supports temporal_tile_frames (default: 4), temporal_tile_overlap (default: 1)",
+         (int)',',
         &extra_tiling_args},
    };

@ -1015,6 +1104,10 @@ ArgOptions SDGenerationParams::get_options() {
         "--pm-style-strength",
         "",
         &pm_style_strength},
+        {"",
+         "--pulid-id-weight",
+         "strength of PuLID identity injection (default: 1.0)",
+         &pulid_id_weight},
        {"",
         "--control-strength",
         "strength to apply Control Net (default: 0.9). 1.0 corresponds to full destruction of information in init image",
@ -1329,6 +1422,42 @@ ArgOptions SDGenerationParams::get_options() {
        return 1;
    };

+    auto on_prompt_file_arg = [&](int argc, const char** argv, int index) {
+        if (++index >= argc) {
+            return -1;
+        }
+        const char* arg = argv[index];
+        std::ifstream f(arg, std::ios::binary);
+        try {
+            prompt = std::string(std::istreambuf_iterator<char>{f}, {});
+        } catch (const std::ios_base::failure&) {
+            f.setstate(std::ios_base::failbit);
+        }
+        if (f.fail()) {
+            LOG_ERROR("error: failed to read prompt file '%s'\n", arg);
+            return -1;
+        }
+        return 1;
+    };
+
+    auto on_negative_prompt_file_arg = [&](int argc, const char** argv, int index) {
+        if (++index >= argc) {
+            return -1;
+        }
+        const char* arg = argv[index];
+        std::ifstream f(arg, std::ios::binary);
+        try {
+            negative_prompt = std::string(std::istreambuf_iterator<char>{f}, {});
+        } catch (const std::ios_base::failure&) {
+            f.setstate(std::ios_base::failbit);
+        }
+        if (f.fail()) {
+            LOG_ERROR("error: failed to read negative prompt file '%s'\n", arg);
+            return -1;
+        }
+        return 1;
+    };
+
    options.manual_options = {
        {"-s",
         "--seed",
@ -1392,6 +1521,14 @@ ArgOptions SDGenerationParams::get_options() {
         "--vae-relative-tile-size",
         "relative tile size for vae tiling, format [X]x[Y], in fraction of image size if < 1, in number of tiles per dim if >=1 (overrides --vae-tile-size)",
         on_relative_tile_size_arg},
+        {"",
+         "--prompt-file",
+         "path to the file containing the prompt to render",
+         on_prompt_file_arg},
+        {"",
+         "--negative-prompt-file",
+         "path to the file containing the negative prompt",
+         on_negative_prompt_file_arg},

    };

@ -2247,6 +2384,11 @@ sd_img_gen_params_t SDGenerationParams::to_sd_img_gen_params_t() {
        pm_style_strength,
    };

+    sd_pulid_params_t pulid_params = {
+        pulid_id_embedding_path.empty() ? nullptr : pulid_id_embedding_path.c_str(),
+        pulid_id_weight,
+    };
+
    params.loras                 = lora_vec.empty() ? nullptr : lora_vec.data();
    params.lora_count            = static_cast<uint32_t>(lora_vec.size());
    params.prompt                = prompt.c_str();
@ -2267,6 +2409,7 @@ sd_img_gen_params_t SDGenerationParams::to_sd_img_gen_params_t() {
    params.control_image         = control_image.get();
    params.control_strength      = control_strength;
    params.pm_params             = pm_params;
+    params.pulid_params          = pulid_params;
    params.vae_tiling_params     = vae_tiling_params;
    params.cache                 = cache_params;
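The `on_prompt_file_arg` and `on_negative_prompt_file_arg` callbacks above share one pattern: open the file in binary mode and copy its whole content into a string. A minimal standalone sketch of that pattern follows. The helper name `read_text_file` and the `std::optional` return type are assumptions made for illustration and are not part of the patch.

```cpp
#include <fstream>
#include <iterator>
#include <optional>
#include <string>

// Hypothetical helper (not in the patch): read a whole file into a string.
// Returns std::nullopt when the file cannot be opened.
std::optional<std::string> read_text_file(const std::string& path) {
    // Binary mode keeps the bytes as they are (no newline translation).
    std::ifstream f(path, std::ios::binary);
    if (!f.is_open()) {
        return std::nullopt;
    }
    // Copy everything from the stream buffer up to end-of-file.
    std::string content((std::istreambuf_iterator<char>(f)),
                        std::istreambuf_iterator<char>());
    return content;
}
```

With a helper of this shape, both callbacks could reduce to one call plus an error log, differing only in the target string. Note that a trailing newline in the file ends up in the prompt unchanged.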

--- a/examples/common/common.h
+++ b/examples/common/common.h
@ -31,6 +31,7 @@ struct StringOption {
    std::string short_name;
    std::string long_name;
    std::string desc;
+    int concat;
    std::string* target;
 };

@ -133,6 +134,7 @@ struct SDContextParams {
    std::string control_net_path;
    std::string embedding_dir;
    std::string photo_maker_path;
+    std::string pulid_weights_path;
    sd_type_t wtype = SD_TYPE_COUNT;
    std::string tensor_type_rules;
    std::string lora_model_dir = ".";
@ -144,10 +146,14 @@ struct SDContextParams {
    rng_type_t rng_type         = CUDA_RNG;
    rng_type_t sampler_rng_type = RNG_TYPE_COUNT;
    bool offload_params_to_cpu  = false;
-    float max_vram              = 0.f;
+    std::string max_vram        = "0";
    bool stream_layers          = false;
+    bool eager_load             = false;
    std::string backend;
    std::string params_backend;
+    std::string rpc_servers;
+    std::string effective_backend;
+    std::string effective_params_backend;
    bool enable_mmap           = false;
    bool control_net_cpu       = false;
    bool clip_on_cpu           = false;
@ -175,6 +181,7 @@ struct SDContextParams {
    float flow_shift = INFINITY;
    ArgOptions get_options();
    void build_embedding_map();
+    void prepare_backend_assignments();
    bool resolve(SDMode mode);
    bool validate(SDMode mode);
    bool resolve_and_validate(SDMode mode);
@ -230,6 +237,9 @@ struct SDGenerationParams {
    std::string pm_id_embed_path;
    float pm_style_strength = 20.f;

+    std::string pulid_id_embedding_path;
+    float pulid_id_weight = 1.0f;
+
    int upscale_repeats   = 1;
    int upscale_tile_size = 128;
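The new `concat` field on `StringOption` is set to `0` for most options and to `(int)','` for `--extra-sample-args` and `--extra-tiling-args`. The parsing code is not part of this excerpt, so the sketch below rests on an assumption: a non-zero `concat` is a separator character used to join the values of a repeated flag, while `0` means the last value wins. `StringOptionSketch`, `apply_string_option`, and the sample values are hypothetical.

```cpp
#include <string>

// Reduced stand-in for StringOption; only the fields needed here.
struct StringOptionSketch {
    std::string long_name;
    int concat;           // 0 = overwrite; otherwise a separator character (assumed meaning)
    std::string* target;
};

// Hypothetical helper: store one occurrence of the option's value.
void apply_string_option(const StringOptionSketch& opt, const std::string& value) {
    if (opt.concat != 0 && !opt.target->empty()) {
        // Repeated flag: append with the separator instead of replacing.
        opt.target->push_back(static_cast<char>(opt.concat));
        opt.target->append(value);
    } else {
        // First occurrence, or an option that does not concatenate.
        *opt.target = value;
    }
}
```

Under that assumption, passing `--extra-sample-args apg_eta=0.5 --extra-sample-args slg_uncond=1` would yield the single list `apg_eta=0.5,slg_uncond=1`, which matches the "key=value list" wording in the option descriptions.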

--- a/examples/server/README.md
+++ b/examples/server/README.md
@ -117,188 +117,10 @@ In this case, the server will load and serve the specified `index.html` file ins
 * using a custom UI
 * avoiding rebuilding the binary after frontend modifications

-# Run
+# Usage

-```
-usage: ./bin/sd-server  [options]
-
-Svr Options:
-  -l, --listen-ip <string>      server listen ip (default: 127.0.0.1)
-  --serve-html-path <string>    path to HTML file to serve at root (optional)
-  --listen-port <int>           server listen port (default: 1234)
-  -v, --verbose                 print extra info
-  --color                       colors the logging tags according to level
-  -h, --help                    show this help message and exit
-
-Context Options:
-  -m, --model <string>                     path to full model
-  --clip_l <string>                        path to the clip-l text encoder
-  --clip_g <string>                        path to the clip-g text encoder
-  --clip_vision <string>                   path to the clip-vision encoder
-  --t5xxl <string>                         path to the t5xxl text encoder
-  --llm <string>                           path to the llm text encoder. For example: (qwenvl2.5 for qwen-image,
-                                           mistral-small3.2 for flux2, ...)
-  --llm_vision <string>                    path to the llm vit
-  --qwen2vl <string>                       alias of --llm. Deprecated.
-  --qwen2vl_vision <string>                alias of --llm_vision. Deprecated.
-  --diffusion-model <string>               path to the standalone diffusion model
-  --high-noise-diffusion-model <string>    path to the standalone high noise diffusion model
-  --uncond-diffusion-model <string>        path to the standalone unconditional diffusion model, currently used by
-                                           Ideogram4 CFG
-  --vae <string>                           path to standalone vae model
-  --taesd <string>                         path to taesd. Using Tiny AutoEncoder for fast decoding (low quality)
-  --tae <string>                           alias of --taesd
-  --control-net <string>                   path to control net model
-  --embd-dir <string>                      embeddings directory
-  --lora-model-dir <string>                lora model directory
-  --hires-upscalers-dir <string>           highres fix upscaler model directory
-  --tensor-type-rules <string>             weight type per tensor pattern (example: "^vae\.=f16,model\.=q8_0")
-  --photo-maker <string>                   path to PHOTOMAKER model
-  --upscale-model <string>                 path to esrgan model.
-  -t, --threads <int>                      number of threads to use during computation (default: -1). If threads <= 0,
-                                           then threads will be set to the number of CPU physical cores
-  --chroma-t5-mask-pad <int>               t5 mask pad size of chroma
-  --max-vram <float>                       maximum VRAM budget in GiB for graph-cut segmented execution. 0 disables
-                                           graph splitting; a negative value auto-detects free VRAM, sparing the
-                                           specified value (e.g. -0.5 will keep at least 0.5 GiB free)
-  --force-sdxl-vae-conv-scale              force use of conv scale on sdxl vae
-  --offload-to-cpu                         place the weights in RAM to save VRAM, and automatically load them into VRAM
-                                           when needed
-  --mmap                                   whether to memory-map model
-  --control-net-cpu                        keep controlnet in cpu (for low vram)
-  --clip-on-cpu                            keep clip in cpu (for low vram)
-  --vae-on-cpu                             keep vae in cpu (for low vram)
-  --fa                                     use flash attention
-  --diffusion-fa                           use flash attention in the diffusion model only
-  --diffusion-conv-direct                  use ggml_conv2d_direct in the diffusion model
-  --vae-conv-direct                        use ggml_conv2d_direct in the vae model
-  --circular                               enable circular padding for convolutions
-  --circularx                              enable circular RoPE wrapping on x-axis (width) only
-  --circulary                              enable circular RoPE wrapping on y-axis (height) only
-  --chroma-disable-dit-mask                disable dit mask for chroma
-  --qwen-image-zero-cond-t                 enable zero_cond_t for qwen image
-  --chroma-enable-t5-mask                  enable t5 mask for chroma
-  --type                                   weight type (examples: f32, f16, q4_0, q4_1, q5_0, q5_1, q8_0, q2_K, q3_K,
-                                           q4_K). If not specified, the default is the type of the weight file
-  --rng                                    RNG, one of [std_default, cuda, cpu], default: cuda(sd-webui), cpu(comfyui)
-  --sampler-rng                            sampler RNG, one of [std_default, cuda, cpu]. If not specified, use --rng
-  --prediction                             prediction type override, one of [eps, v, edm_v, sd3_flow, flux_flow,
-                                           flux2_flow]
-  --lora-apply-mode                        the way to apply LoRA, one of [auto, immediately, at_runtime], default is
-                                           auto. In auto mode, if the model weights contain any quantized parameters,
-                                           the at_runtime mode will be used; otherwise, immediately will be used.The
-                                           immediately mode may have precision and compatibility issues with quantized
-                                           parameters, but it usually offers faster inference speed and, in some cases,
-                                           lower memory usage. The at_runtime mode, on the other hand, is exactly the
-                                           opposite.
-
-Default Generation Options:
-  -p, --prompt <string>                    the prompt to render
-  -n, --negative-prompt <string>           the negative prompt (default: "")
-  -i, --init-img <string>                  path to the init image
-  --end-img <string>                       path to the end image, required by flf2v
-  --mask <string>                          path to the mask image
-  --control-image <string>                 path to control image, control net
-  --control-video <string>                 path to control video frames, It must be a directory path. The video frames
-                                           inside should be stored as images in lexicographical (character) order. For
-                                           example, if the control video path is `frames`, the directory contain images
-                                           such as 00.png, 01.png, ... etc.
-  --pm-id-images-dir <string>              path to PHOTOMAKER input id images dir
-  --pm-id-embed-path <string>              path to PHOTOMAKER v2 id embed
-  --hires-upscaler <string>                highres fix upscaler, Lanczos, Nearest, Latent, Latent (nearest), Latent
-                                           (nearest-exact), Latent (antialiased), Latent (bicubic), Latent (bicubic
-                                           antialiased), or a model name under --hires-upscalers-dir (default: Latent)
-  --extra-sample-args <string>             extra sampler/scheduler/guidance args, key=value list. APG supports apg_eta,
-                                           apg_momentum, apg_norm_threshold, apg_norm_threshold_smoothing; SLG supports
-                                           slg_uncond; lcm supports noise_clip_std, noise_scale_start, noise_scale_end;
-                                           ltx2 supports max_shift, base_shift, stretch, terminal; euler_ge supports gamma
-  --extra-tiling-args <string>             extra VAE tiling args, key=value list. LTX video VAE supports
-                                           temporal_tile_frames (default: 4), temporal_tile_overlap (default: 1)
-  -H, --height <int>                       image height, in pixel space (default: 512)
-  -W, --width <int>                        image width, in pixel space (default: 512)
-  --steps <int>                            number of sample steps (default: 20)
-  --high-noise-steps <int>                 (high noise) number of sample steps (default: -1 = auto)
-  --clip-skip <int>                        ignore last layers of CLIP network; 1 ignores none, 2 ignores one layer
-                                           (default: -1). <= 0 represents unspecified, will be 1 for SD1.x, 2 for SD2.x
-  -b, --batch-count <int>                  batch count
-  --video-frames <int>                     video frames (default: 1)
-  --fps <int>                              fps (default: 24)
-  --timestep-shift <int>                   shift timestep for NitroFusion models (default: 0). recommended N for
-                                           NitroSD-Realism around 250 and 500 for NitroSD-Vibrant
-  --upscale-repeats <int>                  Run the ESRGAN upscaler this many times (default: 1)
-  --upscale-tile-size <int>                tile size for ESRGAN upscaling (default: 128)
-  --hires-width <int>                      highres fix target width, 0 to use --hires-scale (default: 0)
-  --hires-height <int>                     highres fix target height, 0 to use --hires-scale (default: 0)
-  --hires-steps <int>                      highres fix second pass sample steps, 0 to reuse --steps (default: 0)
-  --hires-upscale-tile-size <int>          highres fix upscaler tile size, reserved for model-backed upscalers (default:
-                                           128)
-  --cfg-scale <float>                      unconditional guidance scale: (default: 7.0)
-  --img-cfg-scale <float>                  image guidance scale for inpaint or image edit models: (default: same as
-                                           --cfg-scale)
-  --guidance <float>                       distilled guidance scale for models with guidance input (default: 3.5)
-  --slg-scale <float>                      skip layer guidance (SLG) scale, only for DiT models: (default: 0). 0 means
-                                           disabled, a value of 2.5 is nice for sd3.5 medium
-  --skip-layer-start <float>               SLG enabling point (default: 0.01)
-  --skip-layer-end <float>                 SLG disabling point (default: 0.2)
-  --eta <float>                            noise multiplier (default: 0 for ddim_trailing, tcd, res_multistep and
-                                           res_2s; 1 for euler_a, er_sde and dpm++2s_a)
-  --flow-shift <float>                     shift value for Flow models like SD3.x or WAN (default: auto)
-  --high-noise-cfg-scale <float>           (high noise) unconditional guidance scale: (default: 7.0)
-  --high-noise-img-cfg-scale <float>       (high noise) image guidance scale for inpaint or image edit models (default:
-                                           same as --cfg-scale)
-  --high-noise-guidance <float>            (high noise) distilled guidance scale for models with guidance input
-                                           (default: 3.5)
-  --high-noise-slg-scale <float>           (high noise) skip layer guidance (SLG) scale, only for DiT models: (default:
-                                           0)
-  --high-noise-skip-layer-start <float>    (high noise) SLG enabling point (default: 0.01)
-  --high-noise-skip-layer-end <float>      (high noise) SLG disabling point (default: 0.2)
-  --high-noise-eta <float>                 (high noise) noise multiplier (default: 0 for ddim_trailing, tcd,
-                                           res_multistep and res_2s; 1 for euler_a, er_sde and dpm++2s_a)
-  --strength <float>                       strength for noising/unnoising (default: 0.75)
-  --pm-style-strength <float>
-  --control-strength <float>               strength to apply Control Net (default: 0.9). 1.0 corresponds to full
-                                           destruction of information in init image
-  --moe-boundary <float>                   timestep boundary for Wan2.2 MoE model. (default: 0.875). Only enabled if
-                                           `--high-noise-steps` is set to -1
-  --vace-strength <float>                  wan vace strength
-  --vae-tile-overlap <float>               tile overlap for vae tiling, in fraction of tile size (default: 0.5)
-  --hires-scale <float>                    highres fix scale when target size is not set (default: 2.0)
-  --hires-denoising-strength <float>       highres fix second pass denoising strength (default: 0.7)
-  --increase-ref-index                     automatically increase the indices of references images based on the order
-                                           they are listed (starting with 1).
-  --disable-auto-resize-ref-image          disable auto resize of ref images
-  --disable-image-metadata                 do not embed generation metadata on image files
-  --vae-tiling                             process vae in tiles to reduce memory usage
-  --temporal-tiling                        enable temporal tiling for LTX video VAE decode
-  --hires                                  enable highres fix
-  -s, --seed                               RNG seed (default: 42, use random seed for < 0)
-  --sampling-method                        sampling method, one of [euler, euler_a, heun, dpm2, dpm++2s_a, dpm++2m,
-                                           dpm++2mv2, ipndm, ipndm_v, lcm, ddim_trailing, tcd, res_multistep, res_2s,
-                                           er_sde, euler_cfg_pp, euler_a_cfg_pp] (default: euler for Flux/SD3/Wan, euler_a otherwise)
-  --high-noise-sampling-method             (high noise) sampling method, one of [euler, euler_a, heun, dpm2, dpm++2s_a,
-                                           dpm++2m, dpm++2mv2, ipndm, ipndm_v, lcm, ddim_trailing, tcd, res_multistep,
-                                           res_2s, er_sde, euler_cfg_pp, euler_a_cfg_pp] default: euler for Flux/SD3/Wan, euler_a otherwise
-  --scheduler                              denoiser sigma scheduler, one of [discrete, karras, exponential, ays, gits,
-                                           smoothstep, sgm_uniform, simple, kl_optimal, lcm, bong_tangent, ltx2], default:
-                                           model-specific
-  --sigmas                                 custom sigma values for the sampler, comma-separated (e.g.,
-                                           "14.61,7.8,3.5,0.0").
-  --hires-sigmas                           custom sigma values for the highres fix second pass, comma-separated (e.g.,
-                                           "0.85,0.725,0.421875,0.0").
-  --skip-layers                            layers to skip for SLG steps (default: [7,8,9])
-  --high-noise-skip-layers                 (high noise) layers to skip for SLG steps (default: [7,8,9])
-  -r, --ref-image                          reference image for Flux Kontext models (can be used multiple times)
-  --cache-mode                             caching method: 'easycache' (DiT), 'ucache' (UNET),
-                                           'dbcache'/'taylorseer'/'cache-dit' (DiT block-level), 'spectrum' (UNET/DiT
-                                           Chebyshev+Taylor forecasting)
-  --cache-option                           named cache params (key=value format, comma-separated). easycache/ucache:
-                                           threshold=,start=,end=,decay=,relative=,reset=; dbcache/taylorseer/cache-dit:
-                                           Fn=,Bn=,threshold=,warmup=; spectrum: w=,m=,lam=,window=,flex=,warmup=,stop=.
-                                           Examples: "threshold=0.25" or "threshold=1.5,reset=0"
-  --scm-mask                               SCM steps mask for cache-dit: comma-separated 0/1 (e.g.,
-                                           "1,1,1,0,0,1,0,0,1,0") - 1=compute, 0=can cache
-  --scm-policy                             SCM policy: 'dynamic' (default) or 'static'
-  --vae-tile-size                          tile size for vae tiling, format [X]x[Y] (default: 32x32)
-  --vae-relative-tile-size                 relative tile size for vae tiling, format [X]x[Y], in fraction of image size
-                                           if < 1, in number of tiles per dim if >=1 (overrides --vae-tile-size)
+For detailed command-line arguments, run:
+
+```bash
+./bin/sd-server -h
 ```
--- a/examples/server/frontend
+++ b/examples/server/frontend
@ -1 +1 @@
-Subproject commit 797ccf80825cc035508ba9b599b2a21953e7f835
+Subproject commit c4bce3d6b3f236614cca21014f076083b7270ba8
--- a/examples/server/runtime.cpp
+++ b/examples/server/runtime.cpp
@ -190,8 +190,8 @@ ArgOptions SDSvrParams::get_options() {
    ArgOptions options;

    options.string_options = {
-        {"-l", "--listen-ip", "server listen ip (default: 127.0.0.1)", &listen_ip},
-        {"", "--serve-html-path", "path to HTML file to serve at root (optional)", &serve_html_path},
+        {"-l", "--listen-ip", "server listen ip (default: 127.0.0.1)", 0, &listen_ip},
+        {"", "--serve-html-path", "path to HTML file to serve at root (optional)", 0, &serve_html_path},
    };

    options.int_options = {
--- a/2
+++ b/2
@ -1 +1 @@
-Subproject commit 0ce7ad348a3151e1da9f65d962044546bcaad421
+Subproject commit 3af5f5760e19a96427f5f7a93b79cbdf3d4b265b
--- a/include/stable-diffusion.h
+++ b/include/stable-diffusion.h
@ -195,6 +195,7 @@ typedef struct {
    const sd_embedding_t* embeddings;
    uint32_t embedding_count;
    const char* photo_maker_path;
+    const char* pulid_weights_path;
    const char* tensor_type_rules;
    int n_threads;
    enum sd_type_t wtype;
@ -202,11 +203,7 @@ typedef struct {
    enum rng_type_t sampler_rng_type;
    enum prediction_t prediction;
    enum lora_apply_mode_t lora_apply_mode;
-    bool offload_params_to_cpu;
    bool enable_mmap;
-    bool keep_clip_on_cpu;
-    bool keep_control_net_on_cpu;
-    bool keep_vae_on_cpu;
    bool flash_attn;
    bool diffusion_flash_attn;
    bool tae_preview_only;
@ -220,10 +217,12 @@ typedef struct {
    int chroma_t5_mask_pad;
    bool qwen_image_zero_cond_t;
    enum sd_vae_format_t vae_format;
-    float max_vram;  // GiB budget for graph-cut segmented param offload (0 = disabled, -1 = auto free VRAM minus 1 GiB)
+    const char* max_vram;  // GiB budget or backend assignment spec for graph-cut segmented param offload (0 = disabled, -1 = auto)
    bool stream_layers;  // Enable residency+prefetch streaming on top of --max-vram (no effect without --max-vram)
+    bool eager_load;  // Load all params into the params backend at model-load time instead of lazily on first use
    const char* backend;
    const char* params_backend;
+    const char* rpc_servers;
 } sd_ctx_params_t;

 typedef struct {
@ -275,6 +274,11 @@ typedef struct {
    float style_strength;
 } sd_pm_params_t;  // photo maker

+typedef struct {
+    const char* id_embedding_path;
+    float id_weight;
+} sd_pulid_params_t;
+
 enum sd_cache_mode_t {
    SD_CACHE_DISABLED = 0,
    SD_CACHE_EASYCACHE,
@ -367,6 +371,7 @@ typedef struct {
    sd_image_t control_image;
    float control_strength;
    sd_pm_params_t pm_params;
+    sd_pulid_params_t pulid_params;
    sd_tiling_params_t vae_tiling_params;
    sd_cache_params_t cache;
    sd_hires_params_t hires;
@ -448,6 +453,17 @@ SD_API void sd_img_gen_params_init(sd_img_gen_params_t* sd_img_gen_params);
 SD_API char* sd_img_gen_params_to_str(const sd_img_gen_params_t* sd_img_gen_params);
 SD_API sd_image_t* generate_image(sd_ctx_t* sd_ctx, const sd_img_gen_params_t* sd_img_gen_params);

+enum sd_cancel_mode_t {
+    // Stop the current generation as soon as possible.
+    SD_CANCEL_ALL,
+    // Finish the current image sample, then skip additional batch latents and return completed images.
+    SD_CANCEL_NEW_LATENTS,
+    // Clear a pending cancellation request.
+    SD_CANCEL_RESET
+};
+
+SD_API void sd_cancel_generation(sd_ctx_t* sd_ctx, enum sd_cancel_mode_t mode);
+
 SD_API void sd_vid_gen_params_init(sd_vid_gen_params_t* sd_vid_gen_params);
 SD_API bool generate_video(sd_ctx_t* sd_ctx,
                           const sd_vid_gen_params_t* sd_vid_gen_params,
@ -458,7 +474,6 @@ SD_API bool generate_video(sd_ctx_t* sd_ctx,
 typedef struct upscaler_ctx_t upscaler_ctx_t;

 SD_API upscaler_ctx_t* new_upscaler_ctx(const char* esrgan_path,
-                                        bool offload_params_to_cpu,
                                        bool direct,
                                        int n_threads,
                                        int tile_size,
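The header comments on `sd_cancel_mode_t` describe three behaviours: stop as soon as possible, finish the current image and skip the rest of the batch, or clear a pending request. The sketch below simulates those semantics with an atomic flag checked inside a batch loop. It is an illustration only: `FakeCtx`, `fake_cancel_generation`, and `fake_generate` are hypothetical, and the library's actual implementation is not shown in this diff.

```cpp
#include <atomic>
#include <vector>

// Enum values copied from the header in this diff.
enum sd_cancel_mode_t {
    SD_CANCEL_ALL,
    SD_CANCEL_NEW_LATENTS,
    SD_CANCEL_RESET
};

// Hypothetical stand-in for the context; the real sd_ctx_t is opaque.
struct FakeCtx {
    std::atomic<int> cancel{SD_CANCEL_RESET};
};

// Hypothetical counterpart of sd_cancel_generation for the fake context.
void fake_cancel_generation(FakeCtx& ctx, sd_cancel_mode_t mode) {
    ctx.cancel.store(mode);
}

// Simulated batch loop: returns the indices of the images that completed.
// `cancel_at_step` injects a request at that step of image 0 (negative = never).
std::vector<int> fake_generate(FakeCtx& ctx, int batch, int steps,
                               int cancel_at_step, sd_cancel_mode_t mode) {
    std::vector<int> done;
    for (int b = 0; b < batch; ++b) {
        // NEW_LATENTS: do not start another image, keep what is finished.
        if (ctx.cancel.load() == SD_CANCEL_NEW_LATENTS) break;
        bool aborted = false;
        for (int s = 0; s < steps; ++s) {
            if (b == 0 && s == cancel_at_step) fake_cancel_generation(ctx, mode);
            // ALL: abandon the image that is currently being sampled.
            if (ctx.cancel.load() == SD_CANCEL_ALL) { aborted = true; break; }
        }
        if (aborted) break;
        done.push_back(b);
    }
    return done;
}
```

In a real application the request would typically come from another thread, for example a server handler calling `sd_cancel_generation(sd_ctx, SD_CANCEL_NEW_LATENTS)` while `generate_image` is running, followed by `SD_CANCEL_RESET` before the next job.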
--- a/script/pulid_extract_id.py
+++ b/script/pulid_extract_id.py
@ -0,0 +1,134 @@
+"""
+Precompute a PuLID-Flux identity embedding from a single source portrait.
+
+Writes a gguf file (a single tensor `pulid_id`) that stable-diffusion.cpp's
+`--pulid-id-embedding` flag consumes.
+
+Dependencies (recommended: vendor rather than pip-install due to upstream
+packaging quirks):
+  - torch + safetensors
+  - The ToTheBeginning/PuLID repository's `pulid/` package and `eva_clip/`.
+    Put them on PYTHONPATH or sys.path before running this script.
+  - insightface, facexlib, torchvision, opencv-python, huggingface_hub, gguf
+  - numpy, Pillow
+
+Usage:
+  python script/pulid_extract_id.py \\
+    --portrait /path/to/source-photo.jpg \\
+    --pulid-weights /path/to/pulid_flux_v0.9.1.safetensors \\
+    --out /path/to/source.pulidembd
+
+The portrait must contain a clearly visible face. insightface's antelopev2
+detector will be auto-downloaded on first run.
+"""
+
+from __future__ import annotations
+
+import argparse
+import os
+import sys
+from types import SimpleNamespace
+
+
+def extract(portrait_path: str, pulid_weights: str) -> "torch.Tensor":
+    import numpy as np
+    import torch
+    from PIL import Image
+    from pulid.pipeline_flux import PuLIDPipeline
+
+    if torch.cuda.is_available():
+        device, onnx_provider = "cuda", "gpu"
+    else:
+        device, onnx_provider = "cpu", "cpu"
+
+    print(f"device={device}", flush=True)
+
+    # PuLIDPipeline only attaches pulid_ca attributes to `dit` during
+    # construction; get_id_embedding() never runs Flux, so a dummy object is
+    # enough and avoids importing/building a Flux skeleton.
+    print("instantiating PuLIDPipeline with a dummy Flux object", flush=True)
+    dit = SimpleNamespace()
+    pulid = PuLIDPipeline(dit=dit,
+                          device=device,
+                          weight_dtype=torch.bfloat16,
+                          onnx_provider=onnx_provider)
+
+    print(f"loading PuLID weights from {pulid_weights}", flush=True)
+    pulid.load_pretrain(pretrain_path=pulid_weights, version="v0.9.1")
+
+    print(f"extracting ID embedding from {portrait_path}", flush=True)
+    face_img = np.array(Image.open(portrait_path).convert("RGB"))
+    id_embedding, _ = pulid.get_id_embedding(face_img)
+    print(f"id embedding shape={tuple(id_embedding.shape)} dtype={id_embedding.dtype}",
+          flush=True)
+
+    if id_embedding.ndim == 3 and id_embedding.shape[0] == 1:
+        id_embedding = id_embedding[0]
+    return id_embedding
+
+
+def write_embd(tensor, out_path: str, dtype_choice: str) -> None:
+    import gguf
+    import torch
+
+    if tensor.ndim != 2:
+        raise ValueError(f"expected (num_tokens, token_dim); got {tuple(tensor.shape)}")
+    num_tokens, token_dim = tensor.shape
+
+    os.makedirs(os.path.dirname(out_path) or ".", exist_ok=True)
+
+    writer = gguf.GGUFWriter(out_path, arch="pulid")
+    writer.add_uint32("pulid.version", 1)
+
+    if dtype_choice == "fp16":
+        arr = tensor.to(torch.float16).contiguous().cpu().numpy()
+        writer.add_tensor("pulid_id", arr)
+    elif dtype_choice == "fp32":
+        arr = tensor.to(torch.float32).contiguous().cpu().numpy()
+        writer.add_tensor("pulid_id", arr)
+    elif dtype_choice == "bf16":
+        raw = tensor.to(torch.bfloat16).contiguous().view(torch.uint16).cpu().numpy()
+        writer.add_tensor("pulid_id", raw,
+                          raw_shape=(int(num_tokens), int(token_dim)),
+                          raw_dtype=gguf.GGMLQuantizationType.BF16)
+    else:
+        raise ValueError(f"unknown --dtype {dtype_choice}")
+
+    writer.write_header_to_file()
+    writer.write_kv_data_to_file()
+    writer.write_tensors_to_file()
+    writer.close()
+
+    print(f"wrote {out_path}: gguf, tensor pulid_id [{token_dim}, {num_tokens}] {dtype_choice}",
+          flush=True)
+
+
+def main() -> int:
+    ap = argparse.ArgumentParser(
+        description=__doc__,
+        formatter_class=argparse.RawDescriptionHelpFormatter)
+    ap.add_argument("--portrait", required=True,
+                    help="Path to the source portrait image (JPG/PNG).")
+    ap.add_argument("--pulid-weights", required=True,
+                    help="Path to pulid_flux_v0.9.x.safetensors.")
+    ap.add_argument("--out", required=True,
+                    help="Output path for the gguf embedding file (e.g. source.pulidembd).")
+    ap.add_argument("--dtype", default="fp16",
+                    choices=["fp16", "bf16", "fp32"],
+                    help="Storage dtype (default fp16; produces ~131 KB).")
+    args = ap.parse_args()
+
+    if not os.path.exists(args.portrait):
+        print(f"ERROR: portrait not found at {args.portrait}", file=sys.stderr)
+        return 2
+    if not os.path.exists(args.pulid_weights):
+        print(f"ERROR: PuLID weights not found at {args.pulid_weights}", file=sys.stderr)
+        return 3
+
+    embedding = extract(args.portrait, args.pulid_weights)
+    write_embd(embedding, args.out, args.dtype)
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
--- a/src/conditioning/conditioner.hpp
+++ b/src/conditioning/conditioner.hpp
@ -142,8 +142,7 @@ struct FrozenCLIPEmbedderWithCustomWords : public Conditioner {
                                      std::shared_ptr<RunnerWeightManager> weight_manager = nullptr)
        : version(version), tokenizer(sd_version_is_sd2(version) ? 0 : 49407) {
        for (const auto& kv : orig_embedding_map) {
-            std::string name = kv.first;
-            std::transform(name.begin(), name.end(), name.begin(), [](unsigned char c) { return std::tolower(c); });
+            std::string name    = normalize_embedding_name(kv.first);
            embedding_map[name] = kv.second;
            tokenizer.add_special_token(name);
        }
@ -278,17 +277,23 @@ struct FrozenCLIPEmbedderWithCustomWords : public Conditioner {
        return true;
    }

+    static std::string normalize_embedding_name(std::string name) {
+        std::transform(name.begin(), name.end(), name.begin(), [](unsigned char c) { return std::tolower(c); });
+        return name;
+    }
+
+    bool append_embedding_tokens(std::string str, std::vector<int32_t>& bpe_tokens) {
+        std::string name = normalize_embedding_name(std::move(str));
+        auto iter        = embedding_map.find(name);
+        if (iter == embedding_map.end()) {
+            return false;
+        }
+        return load_embedding(name, iter->second, bpe_tokens);
+    }
+
    std::vector<int> convert_token_to_id(std::string text) {
        auto on_new_token_cb = [&](std::string& str, std::vector<int32_t>& bpe_tokens) -> bool {
-            auto iter = embedding_map.find(str);
-            if (iter == embedding_map.end()) {
-                return false;
-            }
-            std::string embedding_path = iter->second;
-            if (load_embedding(str, embedding_path, bpe_tokens)) {
-                return true;
-            }
-            return false;
+            return append_embedding_tokens(str, bpe_tokens);
        };
        std::vector<int> curr_tokens = tokenizer.encode(text, on_new_token_cb);
        return curr_tokens;
@ -315,15 +320,7 @@ struct FrozenCLIPEmbedderWithCustomWords : public Conditioner {
        }

        auto on_new_token_cb = [&](std::string& str, std::vector<int32_t>& bpe_tokens) -> bool {
-            auto iter = embedding_map.find(str);
-            if (iter == embedding_map.end()) {
-                return false;
-            }
-            std::string embedding_path = iter->second;
-            if (load_embedding(str, embedding_path, bpe_tokens)) {
-                return true;
-            }
-            return false;
+            return append_embedding_tokens(str, bpe_tokens);
        };

        std::vector<int> tokens;
@ -1521,7 +1518,7 @@ struct LLMEmbedder : public Conditioner {
            arch = LLM::LLMArch::GPT_OSS_20B;
        } else if (sd_version_is_pid(version)) {
            arch = LLM::LLMArch::GEMMA2_2B;
-        } else if (sd_version_is_ideogram4(version)) {
+        } else if (sd_version_is_ideogram4(version) || sd_version_is_boogu_image(version)) {
            arch = LLM::LLMArch::QWEN3_VL;
        } else if (sd_version_is_z_image(version) || version == VERSION_OVIS_IMAGE || version == VERSION_FLUX2_KLEIN) {
            arch = LLM::LLMArch::QWEN3;
@ -1781,6 +1778,65 @@ struct LLMEmbedder : public Conditioner {

                prompt += "<|im_end|>\n<|im_start|>assistant\n";
            }
+        } else if (sd_version_is_boogu_image(version)) {
+            prompt_template_encode_start_idx = 0;
+
+            const std::string t2i_system_prompt =
+                "You are a helpful assistant that generates high-quality images based on user instructions. The instructions are as follows.";
+            const std::string edit_system_prompt =
+                "Describe the key features of the input image (color, shape, size, texture, objects, background), then explain how the user's text instruction should alter or modify the image. Generate a new image that meets the user's requirements while maintaining consistency with the original input where appropriate.";
+            const bool has_ref_images = llm->enable_vision && conditioner_params.ref_images != nullptr && !conditioner_params.ref_images->empty();
+            const bool text_empty     = conditioner_params.text.find_first_not_of(" \t\r\n") == std::string::npos;
+
+            if (has_ref_images) {
+                LOG_INFO("BooguImageEditPipeline");
+                const std::string prompt_prefix = "<|im_start|>system\n" + edit_system_prompt + "<|im_end|>\n<|im_start|>user\n";
+                std::string img_prompt;
+                const std::string placeholder = "<|image_pad|>";
+
+                for (int i = 0; i < conditioner_params.ref_images->size(); i++) {
+                    const auto& image = (*conditioner_params.ref_images)[i];
+                    double factor     = llm->config.vision.patch_size * llm->config.vision.spatial_merge_size;
+                    int height        = static_cast<int>(image.shape()[1]);
+                    int width         = static_cast<int>(image.shape()[0]);
+                    double beta       = std::sqrt((384.0 * 384.0) / (static_cast<double>(height) * static_cast<double>(width)));
+                    int h_bar         = std::max(static_cast<int>(factor),
+                                                 static_cast<int>(std::round(height * beta / factor)) * static_cast<int>(factor));
+                    int w_bar         = std::max(static_cast<int>(factor),
+                                                 static_cast<int>(std::round(width * beta / factor)) * static_cast<int>(factor));
+
+                    LOG_DEBUG("resize conditioner ref image %d from %dx%d to %dx%d", i, height, width, h_bar, w_bar);
+
+                    auto resized_image = clip_preprocess(image, w_bar, h_bar);
+                    auto image_embed   = llm->encode_image(n_threads, resized_image, false, true, true);
+                    GGML_ASSERT(!image_embed.empty());
+
+                    std::string image_prefix = prompt_prefix + img_prompt + "<|vision_start|>";
+                    int image_embed_idx      = static_cast<int>(tokenizer->encode(image_prefix, nullptr).size());
+                    image_embeds.emplace_back(image_embed_idx, image_embed);
+
+                    img_prompt += "<|vision_start|>";
+                    int64_t num_image_tokens = image_embed.shape()[1];
+                    img_prompt.reserve(img_prompt.size() + static_cast<size_t>(num_image_tokens) * placeholder.size() + 32);
+                    for (int j = 0; j < num_image_tokens; j++) {
+                        img_prompt += placeholder;
+                    }
+                    img_prompt += "<|vision_end|>";
+                }
+
+                prompt                  = prompt_prefix + img_prompt;
+                prompt_attn_range.first = static_cast<int>(prompt.size());
+                prompt += conditioner_params.text;
+                prompt_attn_range.second = static_cast<int>(prompt.size());
+                prompt += "<|im_end|>\n";
+            } else {
+                const std::string& system_prompt = text_empty ? edit_system_prompt : t2i_system_prompt;
+                prompt                           = "<|im_start|>system\n" + system_prompt + "<|im_end|>\n<|im_start|>user\n";
+                prompt_attn_range.first          = static_cast<int>(prompt.size());
+                prompt += conditioner_params.text;
+                prompt_attn_range.second = static_cast<int>(prompt.size());
+                prompt += "<|im_end|>\n";
+            }
        } else if (sd_version_is_longcat(version)) {
            spell_quotes = true;
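Reviewer note on the Boogu edit branch above: each reference image is rescaled to roughly a 384x384 pixel area, and each side is then snapped to a multiple of `patch_size * spatial_merge_size`. The standalone sketch below reproduces only that arithmetic. The function name `boogu_ref_target_size` and the `factor` value of 32 used in the usage note are illustrative assumptions, not taken from the patch.

```cpp
#include <algorithm>
#include <cmath>
#include <utility>

// Hypothetical helper mirroring the h_bar / w_bar computation in the hunk above.
// Returns {h_bar, w_bar}. `factor` stands in for patch_size * spatial_merge_size.
static std::pair<int, int> boogu_ref_target_size(int height, int width, int factor,
                                                 double target_area = 384.0 * 384.0) {
    // Scale so that height * width is approximately target_area.
    const double beta = std::sqrt(target_area / (static_cast<double>(height) * static_cast<double>(width)));
    auto snap = [&](int side) {
        // Round to the nearest multiple of factor, but never below one factor.
        const int snapped = static_cast<int>(std::round(side * beta / factor)) * factor;
        return std::max(factor, snapped);
    };
    return {snap(height), snap(width)};
}
```

With an assumed factor of 32, a 768x768 input maps to 384x384, and a very thin input still gets at least one factor-sized unit on its short side.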

--- a/src/convert.cpp
+++ b/src/convert.cpp
@ -99,7 +99,7 @@ bool convert(const char* input_path,
        model_loader.convert_tensors_name();
    }

-    ggml_type type             = (ggml_type)output_type;
+    ggml_type type             = sd_type_to_ggml_type(output_type);
    bool output_is_safetensors = ends_with(output_path, ".safetensors");
    TensorTypeRules type_rules = parse_tensor_type_rules(tensor_type_rules);

--- a/src/core/ggml_extend.hpp
+++ b/src/core/ggml_extend.hpp
@ -2007,6 +2007,10 @@ protected:
    }

    bool copy_cache_tensors_to_cache_buffer(const std::unordered_set<std::string>* cache_keep_names = nullptr) {
+        if (cache_tensor_map.empty() && cache_keep_names == nullptr) {
+            return true;
+        }
+
        ggml_context* old_cache_ctx            = cache_ctx;
        ggml_backend_buffer_t old_cache_buffer = cache_buffer;
        cache_ctx                              = nullptr;
--- a/src/core/ggml_extend_backend.cpp
+++ b/src/core/ggml_extend_backend.cpp
@ -204,6 +204,36 @@ void ggml_ext_im_set_f32_1d(const struct ggml_tensor* tensor, int i, float value
    }
 }

+bool add_rpc_devices(const std::string& servers) {
+    const std::string in = trim_copy(servers);
+    if (in.empty()) {
+        return true;
+    }
+    auto rpc_servers = split_copy(in, ',');
+    if (rpc_servers.empty()) {
+        LOG_ERROR("invalid RPC servers specification: '%s'", servers.c_str());
+        return false;
+    }
+    ggml_backend_reg_t rpc_reg = ggml_backend_reg_by_name("RPC");
+    if (!rpc_reg) {
+        LOG_ERROR("RPC backend not found, cannot add RPC servers");
+        return false;
+    }
+    typedef ggml_backend_reg_t (*ggml_backend_rpc_add_server_t)(const char* endpoint);
+    ggml_backend_rpc_add_server_t ggml_backend_rpc_add_server_fn = (ggml_backend_rpc_add_server_t)ggml_backend_reg_get_proc_address(rpc_reg, "ggml_backend_rpc_add_server");
+    if (!ggml_backend_rpc_add_server_fn) {
+        LOG_ERROR("RPC backend does not have ggml_backend_rpc_add_server function, cannot add RPC servers");
+        return false;
+    }
+    for (const auto& server : rpc_servers) {
+        LOG_INFO("Adding RPC server: %s", server.c_str());
+        auto reg = ggml_backend_rpc_add_server_fn(server.c_str());
+        // The call gives no success status to check here; the RPC backend is expected to log an error if it fails to add the server.
+        ggml_backend_register(reg);
+    }
+    return true;
+}
+
 static void ggml_backend_load_all_once() {
    // If the registry already has devices and the CPU backend is present,
    // assume either static registration or explicit host-side preloading has
@ -250,7 +280,7 @@ static std::string get_default_backend_name() {
    return resolve_first_device_by_type(GGML_BACKEND_DEVICE_TYPE_CPU);
 }

-static std::string sd_resolve_backend_name(const std::string& name) {
+std::string sd_backend_resolve_name(const std::string& name) {
    ggml_backend_load_all_once();
    std::string requested = trim_copy(name);
    std::string lower     = lower_copy(requested);
@ -288,7 +318,7 @@ static std::string sd_resolve_backend_name(const std::string& name) {
 }

 static bool backend_name_exists(const std::string& name) {
-    return !sd_resolve_backend_name(name).empty();
+    return !sd_backend_resolve_name(name).empty();
 }

 static ggml_backend_t init_named_backend(const std::string& name) {
@ -298,7 +328,7 @@ static ggml_backend_t init_named_backend(const std::string& name) {
        return ggml_backend_init_best();
    }

-    std::string resolved = sd_resolve_backend_name(name);
+    std::string resolved = sd_backend_resolve_name(name);
    if (resolved.empty()) {
        return nullptr;
    }
@ -545,9 +575,6 @@ bool SDBackendManager::runtime_backend_supports_host_buffer(SDBackendModule modu

 bool SDBackendManager::init(const char* backend_spec,
                            const char* params_backend_spec,
-                            bool keep_clip_on_cpu,
-                            bool keep_vae_on_cpu,
-                            bool keep_control_net_on_cpu,
                            std::string* error) {
    reset();

@ -558,18 +585,6 @@ bool SDBackendManager::init(const char* backend_spec,
        return false;
    }

-    if (runtime_assignment_.empty()) {
-        if (keep_clip_on_cpu) {
-            runtime_assignment_.set_module(SDBackendModule::TE, "cpu");
-        }
-        if (keep_vae_on_cpu) {
-            runtime_assignment_.set_module(SDBackendModule::VAE, "cpu");
-        }
-        if (keep_control_net_on_cpu) {
-            runtime_assignment_.set_module(SDBackendModule::CONTROL_NET, "cpu");
-        }
-    }
-
    return validate(error);
 }

@ -584,7 +599,7 @@ bool SDBackendManager::validate(std::string* error) const {
            }
            return false;
        }
-        if (!sd_resolve_backend_name(name).empty()) {
+        if (!sd_backend_resolve_name(name).empty()) {
            return true;
        }
        if (error != nullptr) {
@ -617,7 +632,7 @@ bool SDBackendManager::validate(std::string* error) const {
 }

 ggml_backend_t SDBackendManager::init_cached_backend(const std::string& name) {
-    std::string resolved   = sd_resolve_backend_name(name);
+    std::string resolved   = sd_backend_resolve_name(name);
    std::string key        = lower_copy(resolved);
    ggml_backend_t backend = nullptr;

--- a/src/core/ggml_extend_backend.h
+++ b/src/core/ggml_extend_backend.h
@ -51,9 +51,6 @@ public:

    bool init(const char* backend_spec,
              const char* params_backend_spec,
-              bool keep_clip_on_cpu,
-              bool keep_vae_on_cpu,
-              bool keep_control_net_on_cpu,
              std::string* error);
    void reset();

@ -74,6 +71,8 @@ bool sd_backend_is(ggml_backend_t backend, const std::string& name);
 bool sd_backend_is_cpu(ggml_backend_t backend);
 ggml_backend_t sd_backend_cpu_init();
 bool sd_backend_cpu_set_n_threads(ggml_backend_t backend_cpu, int n_threads);
+std::string sd_backend_resolve_name(const std::string& name);
 const char* sd_backend_module_name(SDBackendModule module);
 void ggml_ext_im_set_f32_1d(const struct ggml_tensor* tensor, int i, float value);
+bool add_rpc_devices(const std::string& servers);
 #endif  // __SD_CORE_GGML_EXTEND_BACKEND_H__
--- a/src/core/ggml_graph_cut.cpp
+++ b/src/core/ggml_graph_cut.cpp
@ -1,6 +1,8 @@
 #include "core/ggml_graph_cut.h"

 #include <algorithm>
+#include <cctype>
+#include <cmath>
 #include <cstring>
 #include <map>
 #include <set>
@ -8,6 +10,7 @@
 #include <stack>
 #include <unordered_map>

+#include "core/ggml_extend_backend.h"
 #include "core/util.h"
 #include "ggml-alloc.h"
 #include "ggml-backend.h"
@ -83,6 +86,157 @@ namespace sd::ggml_graph_cut {
               segment.output_bytes;
    }

+    static std::string lower_ascii_copy(std::string value) {
+        std::transform(value.begin(), value.end(), value.begin(), [](unsigned char c) {
+            return static_cast<char>(std::tolower(c));
+        });
+        return value;
+    }
+
+    static std::string normalize_backend_budget_key(const std::string& value) {
+        return lower_ascii_copy(trim(value));
+    }
+
+    static bool is_default_max_vram_key(const std::string& key) {
+        std::string normalized = normalize_backend_budget_key(key);
+        return normalized == "all" || normalized == "default" || normalized == "*";
+    }
+
+    static bool parse_max_vram_budget_value(const std::string& text, float* value, std::string* error) {
+        float parsed = 0.f;
+        if (!parse_strict_float(text, parsed) || !std::isfinite(parsed)) {
+            if (error != nullptr) {
+                *error = "invalid --max-vram value '" + text + "'";
+            }
+            return false;
+        }
+        *value = parsed;
+        return true;
+    }
+
+    static std::vector<std::string> backend_budget_keys(ggml_backend_t backend) {
+        std::vector<std::string> keys;
+        if (backend == nullptr) {
+            return keys;
+        }
+
+        ggml_backend_dev_t dev = ggml_backend_get_device(backend);
+        if (dev != nullptr) {
+            keys.push_back(normalize_backend_budget_key(ggml_backend_dev_name(dev)));
+        }
+        const char* backend_name = ggml_backend_name(backend);
+        if (backend_name != nullptr) {
+            keys.push_back(normalize_backend_budget_key(backend_name));
+        }
+        return keys;
+    }
+
+    void MaxVramAssignment::reset(float fallback_gib) {
+        default_gib = fallback_gib;
+        backend_gib.clear();
+        resolved_backend_bytes.clear();
+    }
+
+    bool MaxVramAssignment::parse(const std::string& raw_spec, std::string* error) {
+        const std::string in = trim(raw_spec);
+        if (in.empty()) {
+            return true;
+        }
+
+        for (const std::string& raw_part : split_string(in, ',')) {
+            const std::string part = trim(raw_part);
+            if (part.empty()) {
+                continue;
+            }
+
+            const size_t eq = part.find('=');
+            if (eq == std::string::npos) {
+                float value = 0.f;
+                if (!parse_max_vram_budget_value(part, &value, error)) {
+                    return false;
+                }
+                default_gib = value;
+                continue;
+            }
+
+            const std::string key        = trim(part.substr(0, eq));
+            const std::string value_text = trim(part.substr(eq + 1));
+            if (key.empty() || value_text.empty()) {
+                if (error != nullptr) {
+                    *error = "invalid --max-vram assignment '" + part + "'";
+                }
+                return false;
+            }
+
+            float value = 0.f;
+            if (!parse_max_vram_budget_value(value_text, &value, error)) {
+                return false;
+            }
+
+            if (is_default_max_vram_key(key)) {
+                default_gib = value;
+                continue;
+            }
+
+            const std::string backend_key = trim(key);
+            if (backend_key.empty()) {
+                if (error != nullptr) {
+                    *error = "invalid --max-vram backend key in '" + part + "'";
+                }
+                return false;
+            }
+            backend_gib[backend_key] = value;
+        }
+        resolved_backend_bytes.clear();
+        return true;
+    }
+
+    bool MaxVramAssignment::canonicalize_backend_keys(std::string* error) {
+        if (backend_gib.empty()) {
+            return true;
+        }
+
+        std::unordered_map<std::string, float> normalized;
+        for (const auto& kv : backend_gib) {
+            std::string resolved = sd_backend_resolve_name(kv.first);
+            if (resolved.empty()) {
+                if (error != nullptr) {
+                    *error = "unknown --max-vram backend '" + kv.first + "'";
+                }
+                return false;
+            }
+            normalized[normalize_backend_budget_key(resolved)] = kv.second;
+        }
+        backend_gib = std::move(normalized);
+        resolved_backend_bytes.clear();
+        return true;
+    }
+
+    size_t MaxVramAssignment::bytes_for_backend(ggml_backend_t backend) {
+        std::vector<std::string> keys = backend_budget_keys(backend);
+        const std::string cache_key   = keys.empty() ? std::string("<none>") : keys.front();
+        auto cached                   = resolved_backend_bytes.find(cache_key);
+        if (cached != resolved_backend_bytes.end()) {
+            return cached->second;
+        }
+
+        float budget_gib = default_gib;
+        if (!backend_gib.empty()) {
+            for (const std::string& key : keys) {
+                auto backend_it = backend_gib.find(key);
+                if (backend_it != backend_gib.end()) {
+                    budget_gib = backend_it->second;
+                    break;
+                }
+            }
+        }
+
+        const float resolved_gib          = resolve_max_vram_gib(budget_gib, backend);
+        const size_t bytes                = max_vram_gib_to_bytes(resolved_gib);
+        resolved_backend_bytes[cache_key] = bytes;
+        return bytes;
+    }
+
    size_t max_vram_gib_to_bytes(float max_vram) {
        if (max_vram <= 0.f) {
            return 0;
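Reviewer note on `MaxVramAssignment::parse` above: the spec is a comma-separated list in which a bare number sets the default budget, `key=value` sets a per-backend budget, and the keys `all`, `default`, and `*` (case-insensitive) also set the default. The sketch below restates that grammar in a self-contained form. `BudgetSpec`, `parse_budget_spec`, and the helpers are hypothetical stand-ins; the patch itself uses `trim`, `split_string`, and `parse_strict_float` from the repository, whose exact behavior is assumed here.

```cpp
#include <algorithm>
#include <cctype>
#include <cmath>
#include <cstdlib>
#include <sstream>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for MaxVramAssignment (illustration only).
struct BudgetSpec {
    float default_gib = 0.f;
    std::unordered_map<std::string, float> backend_gib;
};

static std::string trim_ws(const std::string& s) {
    const char* ws = " \t\r\n";
    const size_t b = s.find_first_not_of(ws);
    if (b == std::string::npos) return "";
    const size_t e = s.find_last_not_of(ws);
    return s.substr(b, e - b + 1);
}

static std::string lower_ascii(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// Assumed semantics of a strict parse: the whole string is consumed and the value is finite.
static bool parse_float_strict(const std::string& text, float& out) {
    if (text.empty()) return false;
    char* end = nullptr;
    const float v = std::strtof(text.c_str(), &end);
    if (end == text.c_str() || *end != '\0' || !std::isfinite(v)) return false;
    out = v;
    return true;
}

static bool parse_budget_spec(const std::string& raw, BudgetSpec& spec) {
    std::stringstream ss(trim_ws(raw));
    std::string item;
    while (std::getline(ss, item, ',')) {
        const std::string part = trim_ws(item);
        if (part.empty()) continue;  // tolerate empty segments such as "8,,cuda0=4"
        float value = 0.f;
        const size_t eq = part.find('=');
        if (eq == std::string::npos) {  // bare number: default budget
            if (!parse_float_strict(part, value)) return false;
            spec.default_gib = value;
            continue;
        }
        const std::string key = trim_ws(part.substr(0, eq));
        const std::string val = trim_ws(part.substr(eq + 1));
        if (key.empty() || !parse_float_strict(val, value)) return false;
        const std::string lk = lower_ascii(key);
        if (lk == "all" || lk == "default" || lk == "*") {
            spec.default_gib = value;
        } else {
            spec.backend_gib[key] = value;  // backend keys are canonicalized in a later step
        }
    }
    return true;
}
```

Later entries override earlier ones, so a spec like `8, cuda0=4, DEFAULT=2` ends with a default of 2 GiB and a 4 GiB budget for the backend key `cuda0` (a made-up key for this example).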
--- a/src/core/ggml_graph_cut.h
+++ b/src/core/ggml_graph_cut.h
@ -4,6 +4,7 @@
 #include <array>
 #include <cstdint>
 #include <string>
+#include <unordered_map>
 #include <unordered_set>
 #include <vector>

@ -68,6 +69,17 @@ namespace sd::ggml_graph_cut {

    static constexpr const char* GGML_RUNNER_CUT_PREFIX = "ggml_runner_cut:";

+    struct MaxVramAssignment {
+        float default_gib = 0.f;
+        std::unordered_map<std::string, float> backend_gib;
+        std::unordered_map<std::string, size_t> resolved_backend_bytes;
+
+        void reset(float fallback_gib);
+        bool parse(const std::string& raw_spec, std::string* error);
+        bool canonicalize_backend_keys(std::string* error);
+        size_t bytes_for_backend(ggml_backend_t backend);
+    };
+
    bool is_graph_cut_tensor(const ggml_tensor* tensor);
    std::string make_graph_cut_name(const std::string& group, const std::string& output);
    void mark_graph_cut(ggml_tensor* tensor, const std::string& group, const std::string& output);
--- a/src/core/util.cpp
+++ b/src/core/util.cpp
@ -406,6 +406,15 @@ std::vector<std::string> split_string(const std::string& str, char delimiter) {
    return result;
 }

+ggml_type sd_type_to_ggml_type(sd_type_t sdtype) {
+    const int type_value = static_cast<int>(sdtype);
+    if (type_value >= 0 && type_value < std::min<int>(SD_TYPE_COUNT, GGML_TYPE_COUNT)) {
+        return static_cast<ggml_type>(type_value);
+    } else {
+        return GGML_TYPE_COUNT;
+    }
+}
+
 KeyValueArgs parse_key_value_args(const char* args, const char* context) {
    KeyValueArgs pairs;

--- a/src/core/util.h
+++ b/src/core/util.h
@ -80,6 +80,8 @@ void pretty_bytes_progress(int step, int steps, uint64_t bytes_processed, float

 void log_printf(sd_log_level_t level, const char* file, int line, const char* format, ...);

+ggml_type sd_type_to_ggml_type(sd_type_t sdtype);
+
 std::string trim(const std::string& s);

 std::vector<std::pair<std::string, float>> parse_prompt_attention(const std::string& text);
--- a/src/extensions/generation_extension.h
+++ b/src/extensions/generation_extension.h
@ -10,6 +10,7 @@

 #include "conditioning/conditioner.hpp"
 #include "core/ggml_extend_backend.h"
+#include "model/diffusion/model.hpp"
 #include "model_loader.h"
 #include "model_manager.h"
 #include "stable-diffusion.h"
@ -30,6 +31,7 @@ struct GenerationExtensionConditionContext {
    Conditioner* conditioner;
    ConditionerParams& condition_params;
    const sd_pm_params_t& pm_params;
+    const sd_pulid_params_t& pulid_params;
    int n_threads;
    int total_steps;
 };
@ -56,8 +58,20 @@ struct GenerationExtension {
                                                const SDCondition& condition) const {
        return condition;
    }
+
+    // Called in the denoise loop for each enabled extension, after the per-step
+    // DiffusionParams (including its version-specific `extra`) has been built,
+    // but before diffusion_model->compute(). Lets an extension feed data into
+    // the diffusion forward that the conditioning-side hooks can't reach -- it
+    // can set/override fields on `params` (typically the architecture-specific
+    // `params.extra`, e.g. a guidance tensor, control payload, or an identity
+    // embedding for an adapter that injects inside the model's blocks). The
+    // extension targets whichever `extra` variant matches the active model.
+    // Mutates `params` only, never the extension. Default no-op.
+    virtual void before_diffusion(DiffusionParams& /*params*/, int /*step*/) const {}
 };

 std::shared_ptr<GenerationExtension> create_photomaker_extension();
+std::shared_ptr<GenerationExtension> create_pulid_extension();

 #endif
--- a/src/extensions/pulid_extension.cpp
+++ b/src/extensions/pulid_extension.cpp
@ -0,0 +1,123 @@
+#include "extensions/generation_extension.h"
+
+#include <cstring>
+#include <variant>
+
+#include "core/tensor_ggml.hpp"
+#include "core/util.h"
+#include "gguf.h"
+
+static sd::Tensor<float> load_pulid_id_embedding(const char* path) {
+    sd::Tensor<float> empty;
+    if (path == nullptr || strlen(path) == 0) {
+        return empty;
+    }
+
+    struct ggml_context* ctx_data = nullptr;
+    struct gguf_init_params gp    = {/*.no_alloc =*/false, /*.ctx =*/&ctx_data};
+    struct gguf_context* gguf_ctx = gguf_init_from_file(path, gp);
+    if (gguf_ctx == nullptr || ctx_data == nullptr) {
+        LOG_WARN("PuLID id-embedding: cannot read gguf '%s'", path);
+        if (gguf_ctx != nullptr)
+            gguf_free(gguf_ctx);
+        if (ctx_data != nullptr)
+            ggml_free(ctx_data);
+        return empty;
+    }
+
+    struct ggml_tensor* t = ggml_get_tensor(ctx_data, "pulid_id");
+    if (t == nullptr) {
+        LOG_WARN("PuLID id-embedding: no 'pulid_id' tensor in '%s'", path);
+        gguf_free(gguf_ctx);
+        ggml_free(ctx_data);
+        return empty;
+    }
+
+    const int64_t token_dim  = t->ne[0];
+    const int64_t num_tokens = t->ne[1];
+    if (token_dim <= 0 || num_tokens <= 0 || token_dim > 65536 || num_tokens > 1024 ||
+        t->ne[2] != 1 || t->ne[3] != 1) {
+        LOG_WARN("PuLID id-embedding: implausible shape [%lld, %lld] in '%s'",
+                 (long long)token_dim, (long long)num_tokens, path);
+        gguf_free(gguf_ctx);
+        ggml_free(ctx_data);
+        return empty;
+    }
+
+    const size_t n_elem = (size_t)token_dim * (size_t)num_tokens;
+    sd::Tensor<float> out({token_dim, num_tokens, 1});
+    float* dst = out.data();
+    if (t->type == GGML_TYPE_F32) {
+        memcpy(dst, t->data, n_elem * sizeof(float));
+    } else if (t->type == GGML_TYPE_F16) {
+        const ggml_fp16_t* src = reinterpret_cast<const ggml_fp16_t*>(t->data);
+        for (size_t i = 0; i < n_elem; i++) {
+            dst[i] = ggml_fp16_to_fp32(src[i]);
+        }
+    } else if (t->type == GGML_TYPE_BF16) {
+        const ggml_bf16_t* src = reinterpret_cast<const ggml_bf16_t*>(t->data);
+        for (size_t i = 0; i < n_elem; i++) {
+            dst[i] = ggml_bf16_to_fp32(src[i]);
+        }
+    } else {
+        LOG_WARN("PuLID id-embedding: unsupported tensor type %s in '%s'",
+                 ggml_type_name(t->type), path);
+        gguf_free(gguf_ctx);
+        ggml_free(ctx_data);
+        return empty;
+    }
+
+    LOG_INFO("PuLID id-embedding: loaded [%lld, %lld] type=%s from '%s'",
+             (long long)token_dim, (long long)num_tokens, ggml_type_name(t->type), path);
+    gguf_free(gguf_ctx);
+    ggml_free(ctx_data);
+    return out;
+}
+
+struct PuLIDExtension : public GenerationExtension {
+    bool enabled = false;
+    sd::Tensor<float> id_embedding;
+    float id_weight = 1.0f;
+
+    const char* name() const override {
+        return "pulid";
+    }
+
+    bool is_enabled() const override {
+        return enabled;
+    }
+
+    bool init(const GenerationExtensionInitContext& ctx) override {
+        enabled = strlen(SAFE_STR(ctx.params->pulid_weights_path)) > 0;
+        return true;
+    }
+
+    void reset_runtime_condition() override {
+        id_embedding = {};
+        id_weight    = 1.0f;
+    }
+
+    bool prepare_condition(GenerationExtensionConditionContext& ctx) override {
+        reset_runtime_condition();
+        if (!enabled) {
+            return false;
+        }
+        id_embedding = load_pulid_id_embedding(ctx.pulid_params.id_embedding_path);
+        id_weight    = ctx.pulid_params.id_weight;
+        return false;  // PuLID does not modify the conditioning
+    }
+
+    void before_diffusion(DiffusionParams& params, int /*step*/) const override {
+        if (!enabled || id_embedding.empty()) {
+            return;
+        }
+        if (auto* flux_extra = std::get_if<FluxDiffusionExtra>(&params.extra)) {
+            flux_extra->pulid_id        = &id_embedding;
+            flux_extra->pulid_id_weight = id_weight;
+        }
+    }
+};
+
+std::shared_ptr<GenerationExtension> create_pulid_extension() {
+    return std::make_shared<PuLIDExtension>();
+}
--- a/src/model.h
+++ b/src/model.h
@ -42,6 +42,7 @@ enum SDVersion {
    VERSION_LTXAV,
    VERSION_HIDREAM_O1,
    VERSION_Z_IMAGE,
+    VERSION_BOOGU_IMAGE,
    VERSION_OVIS_IMAGE,
    VERSION_ERNIE_IMAGE,
    VERSION_LENS,
@ -143,6 +144,13 @@ static inline bool sd_version_is_z_image(SDVersion version) {
    return false;
 }

+static inline bool sd_version_is_boogu_image(SDVersion version) {
+    if (version == VERSION_BOOGU_IMAGE) {
+        return true;
+    }
+    return false;
+}
+
 static inline bool sd_version_is_longcat(SDVersion version) {
    if (version == VERSION_LONGCAT) {
        return true;
@ -178,6 +186,13 @@ static inline bool sd_version_is_ideogram4(SDVersion version) {
    return false;
 }

+static inline bool sd_version_uses_flux_vae(SDVersion version) {
+    if (sd_version_is_flux(version) || sd_version_is_z_image(version) || sd_version_is_boogu_image(version) || sd_version_is_longcat(version)) {
+        return true;
+    }
+    return false;
+}
+
 static inline bool sd_version_uses_flux2_vae(SDVersion version) {
    if (sd_version_is_flux2(version) || sd_version_is_ernie_image(version) || sd_version_is_lens(version) || sd_version_is_ideogram4(version)) {
        return true;
@ -206,6 +221,7 @@ static inline bool sd_version_is_dit(SDVersion version) {
        version == VERSION_HIDREAM_O1 ||
        sd_version_is_anima(version) ||
        sd_version_is_z_image(version) ||
+        sd_version_is_boogu_image(version) ||
        sd_version_is_ernie_image(version) ||
        sd_version_is_lens(version) ||
        sd_version_is_longcat(version) ||
--- a/src/model/adapter/pulid.hpp
+++ b/src/model/adapter/pulid.hpp
@ -0,0 +1,76 @@
+#ifndef __PULID_HPP__
+#define __PULID_HPP__
+
+#include "core/ggml_extend.hpp"
+#include "model/common/block.hpp"
+
+class PuLIDPerceiverAttentionCA : public GGMLBlock {
+public:
+    static constexpr int64_t DEFAULT_DIM      = 3072;  // Flux hidden size
+    static constexpr int64_t DEFAULT_DIM_HEAD = 128;
+    static constexpr int64_t DEFAULT_HEADS    = 16;
+    static constexpr int64_t DEFAULT_KV_DIM   = 2048;  // PuLID ID-embedding dim
+
+protected:
+    int64_t dim;
+    int64_t dim_head;
+    int64_t heads;
+    int64_t kv_dim;
+    int64_t inner_dim;
+
+public:
+    PuLIDPerceiverAttentionCA(int64_t dim      = DEFAULT_DIM,
+                              int64_t dim_head = DEFAULT_DIM_HEAD,
+                              int64_t heads    = DEFAULT_HEADS,
+                              int64_t kv_dim   = DEFAULT_KV_DIM)
+        : dim(dim),
+          dim_head(dim_head),
+          heads(heads),
+          kv_dim(kv_dim),
+          inner_dim(dim_head * heads) {
+        blocks["norm1"]  = std::shared_ptr<GGMLBlock>(new LayerNorm(kv_dim));
+        blocks["norm2"]  = std::shared_ptr<GGMLBlock>(new LayerNorm(dim));
+        blocks["to_q"]   = std::shared_ptr<GGMLBlock>(new Linear(dim, inner_dim, /*bias=*/false));
+        blocks["to_kv"]  = std::shared_ptr<GGMLBlock>(new Linear(kv_dim, inner_dim * 2, /*bias=*/false));
+        blocks["to_out"] = std::shared_ptr<GGMLBlock>(new Linear(inner_dim, dim, /*bias=*/false));
+    }
+
+    ggml_tensor* forward(GGMLRunnerContext* ctx,
+                         ggml_tensor* id_embedding,
+                         ggml_tensor* image_tokens) {
+        auto norm1  = std::dynamic_pointer_cast<LayerNorm>(blocks["norm1"]);
+        auto norm2  = std::dynamic_pointer_cast<LayerNorm>(blocks["norm2"]);
+        auto to_q   = std::dynamic_pointer_cast<Linear>(blocks["to_q"]);
+        auto to_kv  = std::dynamic_pointer_cast<Linear>(blocks["to_kv"]);
+        auto to_out = std::dynamic_pointer_cast<Linear>(blocks["to_out"]);
+
+        ggml_tensor* x_normed   = norm1->forward(ctx, id_embedding);
+        ggml_tensor* lat_normed = norm2->forward(ctx, image_tokens);
+
+        ggml_tensor* q  = to_q->forward(ctx, lat_normed);  // [N, T_img, 2048]
+        ggml_tensor* kv = to_kv->forward(ctx, x_normed);   // [N, T_id, 2 * inner_dim = 4096]
+
+        ggml_tensor* k = ggml_view_3d(ctx->ggml_ctx, kv,
+                                      inner_dim, kv->ne[1], kv->ne[2],
+                                      kv->nb[1], kv->nb[2],
+                                      /*offset=*/0);
+        ggml_tensor* v = ggml_view_3d(ctx->ggml_ctx, kv,
+                                      inner_dim, kv->ne[1], kv->ne[2],
+                                      kv->nb[1], kv->nb[2],
+                                      /*offset=*/inner_dim * ggml_element_size(kv));
+        k              = ggml_cont(ctx->ggml_ctx, k);
+        v              = ggml_cont(ctx->ggml_ctx, v);
+
+        ggml_tensor* attn_out = ggml_ext_attention_ext(
+            ctx->ggml_ctx, ctx->backend,
+            q, k, v,
+            heads,
+            /*mask=*/nullptr,
+            /*diag_mask_inf=*/false);
+
+        ggml_tensor* out = to_out->forward(ctx, attn_out);
+        return out;
+    }
+};
+
+#endif  // __PULID_HPP__
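
Review note on the `to_kv` split in `forward` above: the two `ggml_view_3d` calls select the first and second `inner_dim` columns of each fused row, and `ggml_cont` then copies each strided view into dense storage. The following is a minimal standalone sketch of the same column split on a plain row-major buffer, with no ggml dependency. `copy_columns` is a hypothetical helper written for illustration and is not part of this change.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the PR): copy `width` columns starting at
// `col_offset` out of a row-major [rows x row_stride] buffer into a dense
// [rows x width] buffer. This mirrors a strided view followed by a
// contiguous copy.
std::vector<float> copy_columns(const std::vector<float>& src,
                                std::size_t rows,
                                std::size_t row_stride,
                                std::size_t col_offset,
                                std::size_t width) {
    assert(col_offset + width <= row_stride);
    assert(src.size() >= rows * row_stride);
    std::vector<float> dst(rows * width);
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < width; ++c) {
            dst[r * width + c] = src[r * row_stride + col_offset + c];
        }
    }
    return dst;
}
```

With a fused row of `2 * inner_dim` values, K corresponds to `col_offset = 0` and V to `col_offset = inner_dim`, which is the element-count analogue of the byte offset `inner_dim * ggml_element_size(kv)` used in the diff.
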
--- a/src/model/common/rope.hpp
+++ b/src/model/common/rope.hpp
@ -899,10 +899,12 @@ namespace Rope {
        // q,k,v: [N, L, n_head, d_head]
        // pe: [L, d_head/2, 2, 2]
        // return: [N, L, n_head*d_head]
+        int64_t n_head = q->ne[1];  // query head count; v->ne[1] is the KV head count and differs under GQA
+
        q = apply_rope(ctx->ggml_ctx, q, pe, rope_interleaved);  // [N*n_head, L, d_head]
        k = apply_rope(ctx->ggml_ctx, k, pe, rope_interleaved);  // [N*n_head, L, d_head]

-        auto x = ggml_ext_attention_ext(ctx->ggml_ctx, ctx->backend, q, k, v, v->ne[1], mask, true, ctx->flash_attn_enabled, kv_scale);  // [N, L, n_head*d_head]
+        auto x = ggml_ext_attention_ext(ctx->ggml_ctx, ctx->backend, q, k, v, n_head, mask, true, ctx->flash_attn_enabled, kv_scale);  // [N, L, n_head*d_head]
        return x;
    }
 };  // namespace Rope
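
Review note on the `n_head` change above: when the key/value projections use fewer heads than the query projection (grouped-query attention), `v->ne[1]` no longer equals the query head count, so the head count has to be read from `q`. The sketch below shows the shape arithmetic only. It assumes ggml's dimension order, where `ne[0]` is innermost, and uses the `BooguConfig` defaults from this same change (28 heads, 7 KV heads, `head_dim` 120); the sequence length 1024 is arbitrary. `Shape4`, `query_head_count`, and `query_heads_per_kv_head` are hypothetical names, not part of the codebase.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Shape in ggml order (ne[0] is the innermost dimension), so a tensor the
// source comments describe as [N, L, n_head, d_head] is {d_head, n_head, L, N}.
using Shape4 = std::array<int64_t, 4>;

// Hypothetical helper: the head count handed to attention comes from the query.
int64_t query_head_count(const Shape4& q) {
    return q[1];
}

// Hypothetical helper: number of query heads that share one KV head.
int64_t query_heads_per_kv_head(const Shape4& q, const Shape4& kv) {
    assert(kv[1] > 0 && q[1] % kv[1] == 0);
    return q[1] / kv[1];
}
```

For `q = {120, 28, 1024, 1}` and `v = {120, 7, 1024, 1}`, reading the head count from `v` would give 7 instead of 28, which is the mismatch the change avoids.
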
--- a/src/model/diffusion/anima.hpp
+++ b/src/model/diffusion/anima.hpp
@ -227,6 +227,7 @@ namespace Anima {
            k4 = k_norm->forward(ctx, k4);

            ggml_tensor* attn_out = nullptr;
+            float scale           = (sd_backend_is(ctx->backend, "Vulkan") && ctx->flash_attn_enabled) ? 1.0f / 32.0f : 1.0f;
            if (pe_q != nullptr || pe_k != nullptr) {
                if (pe_q == nullptr) {
                    pe_q = pe_k;
@ -244,7 +245,8 @@ namespace Anima {
                                                     num_heads,
                                                     nullptr,
                                                     true,
-                                                     ctx->flash_attn_enabled);
+                                                     ctx->flash_attn_enabled,
+                                                     scale);
            } else {
                auto q_flat = ggml_reshape_3d(ctx->ggml_ctx, q4, head_dim * num_heads, L_q, N);
                auto k_flat = ggml_reshape_3d(ctx->ggml_ctx, k4, head_dim * num_heads, L_k, N);
@ -256,7 +258,8 @@ namespace Anima {
                                                     num_heads,
                                                     nullptr,
                                                     false,
-                                                     ctx->flash_attn_enabled);
+                                                     ctx->flash_attn_enabled,
+                                                     scale);
            }

            return out_proj->forward(ctx, attn_out);
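
Review note on the new `scale` argument above: the diff only shows that `1.0f / 32.0f` is selected when the backend is Vulkan and flash attention is enabled. It does not show how `ggml_ext_attention_ext` consumes the value. One common reason for such a factor is to keep reduced-precision intermediates in range by scaling values down before a weighted sum and scaling the result back up. The sketch below illustrates only that general identity, under that assumption; `weighted_sum` and `weighted_sum_prescaled` are hypothetical names and do not describe the library's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Plain weighted sum: sum_i w[i] * v[i].
float weighted_sum(const std::vector<float>& w, const std::vector<float>& v) {
    assert(w.size() == v.size());
    float acc = 0.0f;
    for (std::size_t i = 0; i < w.size(); ++i) {
        acc += w[i] * v[i];
    }
    return acc;
}

// Same sum, but every value is multiplied by `s` first and the result is
// divided by `s` afterwards. Because the sum is linear in v, the result is
// mathematically unchanged while the intermediates are s times smaller.
float weighted_sum_prescaled(const std::vector<float>& w,
                             const std::vector<float>& v,
                             float s) {
    assert(w.size() == v.size());
    assert(s != 0.0f);
    float acc = 0.0f;
    for (std::size_t i = 0; i < w.size(); ++i) {
        acc += w[i] * (v[i] * s);
    }
    return acc / s;
}
```

A power-of-two factor such as 1/32 only shifts the floating-point exponent, so the pre-scaling itself introduces no rounding error unless a value underflows.
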
--- a/src/model/diffusion/boogu.hpp
+++ b/src/model/diffusion/boogu.hpp
@ -0,0 +1,835 @@
+#ifndef __SD_MODEL_DIFFUSION_BOOGU_HPP__
+#define __SD_MODEL_DIFFUSION_BOOGU_HPP__
+
+#include <algorithm>
+#include <cmath>
+#include <tuple>
+#include <vector>
+
+#include "core/ggml_extend.hpp"
+#include "model/common/rope.hpp"
+#include "model/diffusion/dit.hpp"
+#include "model/diffusion/model.hpp"
+#include "model/diffusion/qwen_image.hpp"
+#include "model_loader.h"
+
+namespace Boogu {
+    constexpr int BOOGU_GRAPH_SIZE = 65536;
+
+    struct BooguConfig {
+        int patch_size                   = 2;
+        int64_t in_channels              = 16;
+        int64_t out_channels             = 16;
+        int64_t hidden_size              = 3360;
+        int64_t num_layers               = 32;
+        int64_t num_double_stream_layers = 8;
+        int64_t num_refiner_layers       = 2;
+        int64_t num_attention_heads      = 28;
+        int64_t num_kv_heads             = 7;
+        int64_t head_dim                 = 120;
+        int64_t multiple_of              = 256;
+        int64_t instruction_feat_dim     = 4096;
+        int64_t timestep_embed_dim       = 1024;
+        int theta                        = 10000;
+        float timestep_scale             = 1000.0f;
+        float norm_eps                   = 1e-5f;
+        std::vector<int> axes_dim        = {40, 40, 40};
+        int64_t axes_dim_sum             = 120;
+
+        static int64_t count_blocks(const String2TensorStorage& tensor_storage_map,
+                                    const std::string& prefix,
+                                    const std::string& block_prefix) {
+            int64_t count = 0;
+            for (const auto& [name, _] : tensor_storage_map) {
+                if (!starts_with(name, prefix)) {
+                    continue;
+                }
+                size_t pos = name.find(block_prefix);
+                if (pos == std::string::npos) {
+                    continue;
+                }
+                auto items = split_string(name.substr(pos), '.');
+                if (items.size() > 1) {
+                    count = std::max<int64_t>(count, atoi(items[1].c_str()) + 1);
+                }
+            }
+            return count;
+        }
+
+        static BooguConfig detect_from_weights(const String2TensorStorage& tensor_storage_map, const std::string& prefix) {
+            BooguConfig config;
+            int64_t detected_head_dim = 0;
+            int64_t detected_kv_dim   = 0;
+
+            for (const auto& [name, tensor_storage] : tensor_storage_map) {
+                if (!starts_with(name, prefix)) {
+                    continue;
+                }
+                if (ends_with(name, "x_embedder.weight") && tensor_storage.n_dims == 2) {
+                    int64_t patch_area = config.patch_size * config.patch_size;
+                    config.in_channels = tensor_storage.ne[0] / patch_area;
+                    config.hidden_size = tensor_storage.ne[1];
+                } else if (ends_with(name, "time_caption_embed.caption_embedder.1.weight") && tensor_storage.n_dims == 2) {
+                    config.instruction_feat_dim = tensor_storage.ne[0];
+                    config.hidden_size          = tensor_storage.ne[1];
+                } else if (ends_with(name, "single_stream_layers.0.attn.norm_q.weight") && tensor_storage.n_dims == 1) {
+                    detected_head_dim = tensor_storage.ne[0];
+                } else if (ends_with(name, "double_stream_layers.0.img_self_attn.norm_q.weight") && tensor_storage.n_dims == 1) {
+                    detected_head_dim = tensor_storage.ne[0];
+                } else if (ends_with(name, "single_stream_layers.0.attn.to_k.weight") && tensor_storage.n_dims == 2) {
+                    detected_kv_dim = tensor_storage.ne[1];
+                } else if (ends_with(name, "double_stream_layers.0.img_instruct_attn.processor.img_to_k.weight") && tensor_storage.n_dims == 2) {
+                    detected_kv_dim = tensor_storage.ne[1];
+                } else if (ends_with(name, "norm_out.linear_2.weight") && tensor_storage.n_dims == 2) {
+                    int64_t patch_area  = config.patch_size * config.patch_size;
+                    config.out_channels = tensor_storage.ne[1] / patch_area;
+                }
+            }
+
+            config.num_layers               = std::max<int64_t>(1, count_blocks(tensor_storage_map, prefix, "single_stream_layers."));
+            config.num_double_stream_layers = std::max<int64_t>(0, count_blocks(tensor_storage_map, prefix, "double_stream_layers."));
+            int64_t noise_refiner_layers    = count_blocks(tensor_storage_map, prefix, "noise_refiner.");
+            int64_t ref_refiner_layers      = count_blocks(tensor_storage_map, prefix, "ref_image_refiner.");
+            int64_t context_refiner_layers  = count_blocks(tensor_storage_map, prefix, "context_refiner.");
+            config.num_refiner_layers       = std::max<int64_t>(1, std::max(noise_refiner_layers, std::max(ref_refiner_layers, context_refiner_layers)));
+
+            if (detected_head_dim > 0) {
+                config.head_dim            = detected_head_dim;
+                config.num_attention_heads = config.hidden_size / config.head_dim;
+                config.axes_dim_sum        = config.head_dim;
+                if (detected_kv_dim > 0) {
+                    config.num_kv_heads = detected_kv_dim / config.head_dim;
+                }
+                if (config.axes_dim_sum == 120) {
+                    config.axes_dim = {40, 40, 40};
+                } else if (config.axes_dim_sum % 3 == 0) {
+                    int axis        = static_cast<int>(config.axes_dim_sum / 3);
+                    config.axes_dim = {axis, axis, axis};
+                }
+            }
+            config.timestep_embed_dim = std::min<int64_t>(config.hidden_size, 1024);
+
+            LOG_DEBUG("boogu_image: layers=%" PRId64 ", double_stream_layers=%" PRId64 ", refiner_layers=%" PRId64 ", hidden=%" PRId64 ", heads=%" PRId64 ", kv_heads=%" PRId64 ", head_dim=%" PRId64 ", in_channels=%" PRId64 ", out_channels=%" PRId64,
+                      config.num_layers,
+                      config.num_double_stream_layers,
+                      config.num_refiner_layers,
+                      config.hidden_size,
+                      config.num_attention_heads,
+                      config.num_kv_heads,
+                      config.head_dim,
+                      config.in_channels,
+                      config.out_channels);
+            return config;
+        }
+    };
+
+    __STATIC_INLINE__ ggml_tensor* scale_modulate(ggml_context* ctx, ggml_tensor* x, ggml_tensor* scale) {
+        scale = ggml_reshape_3d(ctx, scale, scale->ne[0], 1, scale->ne[1]);
+        return ggml_add(ctx, x, ggml_mul(ctx, x, scale));  // x * (1 + scale)
+    }
+
+    __STATIC_INLINE__ ggml_tensor* gate_residual(ggml_context* ctx, ggml_tensor* residual, ggml_tensor* x, ggml_tensor* gate) {
+        gate = ggml_tanh(ctx, gate);
+        gate = ggml_reshape_3d(ctx, gate, gate->ne[0], 1, gate->ne[1]);
+        x    = ggml_mul(ctx, x, gate);
+        return ggml_add(ctx, residual, x);
+    }
+
+    struct LuminaCombinedTimestepCaptionEmbedding : public GGMLBlock {
+        int64_t frequency_embedding_size;
+        float timestep_scale;
+
+        LuminaCombinedTimestepCaptionEmbedding(int64_t hidden_size,
+                                               int64_t instruction_feat_dim,
+                                               int64_t frequency_embedding_size,
+                                               float norm_eps,
+                                               float timestep_scale)
+            : frequency_embedding_size(frequency_embedding_size),
+              timestep_scale(timestep_scale) {
+            blocks["timestep_embedder"]  = std::make_shared<Qwen::TimestepEmbedding>(frequency_embedding_size, std::min<int64_t>(hidden_size, 1024));
+            blocks["caption_embedder.0"] = std::make_shared<RMSNorm>(instruction_feat_dim, norm_eps);
+            blocks["caption_embedder.1"] = std::make_shared<Linear>(instruction_feat_dim, hidden_size, true);
+        }
+
+        std::pair<ggml_tensor*, ggml_tensor*> forward(GGMLRunnerContext* ctx, ggml_tensor* timestep, ggml_tensor* text_hidden_states) {
+            auto timestep_embedder  = std::dynamic_pointer_cast<Qwen::TimestepEmbedding>(blocks["timestep_embedder"]);
+            auto caption_embedder_0 = std::dynamic_pointer_cast<RMSNorm>(blocks["caption_embedder.0"]);
+            auto caption_embedder_1 = std::dynamic_pointer_cast<Linear>(blocks["caption_embedder.1"]);
+
+            auto timestep_proj = ggml_ext_timestep_embedding(ctx->ggml_ctx, timestep, static_cast<int>(frequency_embedding_size), 10000, timestep_scale);
+            auto time_embed    = timestep_embedder->forward(ctx, timestep_proj);
+            auto caption_embed = caption_embedder_1->forward(ctx, caption_embedder_0->forward(ctx, text_hidden_states));
+            return {time_embed, caption_embed};
+        }
+    };
+
+    struct LuminaRMSNormZero : public GGMLBlock {
+        LuminaRMSNormZero(int64_t embedding_dim, int64_t conditioning_embedding_dim, float norm_eps) {
+            blocks["linear"] = std::make_shared<Linear>(conditioning_embedding_dim, 4 * embedding_dim, true);
+            blocks["norm"]   = std::make_shared<RMSNorm>(embedding_dim, norm_eps);
+        }
+
+        std::tuple<ggml_tensor*, ggml_tensor*, ggml_tensor*, ggml_tensor*> forward(GGMLRunnerContext* ctx, ggml_tensor* x, ggml_tensor* emb) {
+            auto linear = std::dynamic_pointer_cast<Linear>(blocks["linear"]);
+            auto norm   = std::dynamic_pointer_cast<RMSNorm>(blocks["norm"]);
+
+            emb       = linear->forward(ctx, ggml_silu(ctx->ggml_ctx, emb));
+            auto mods = ggml_ext_chunk(ctx->ggml_ctx, emb, 4, 0);
+
+            auto scale_msa = mods[0];
+            auto gate_msa  = mods[1];
+            auto scale_mlp = mods[2];
+            auto gate_mlp  = mods[3];
+
+            x = scale_modulate(ctx->ggml_ctx, norm->forward(ctx, x), scale_msa);
+            return {x, gate_msa, scale_mlp, gate_mlp};
+        }
+    };
+
+    struct LuminaFeedForward : public GGMLBlock {
+        LuminaFeedForward(int64_t dim, int64_t inner_dim, int64_t multiple_of) {
+            inner_dim          = multiple_of * ((inner_dim + multiple_of - 1) / multiple_of);  // round up to a multiple of multiple_of
+            blocks["linear_1"] = std::make_shared<Linear>(dim, inner_dim, false);
+            blocks["linear_2"] = std::make_shared<Linear>(inner_dim, dim, false);
+            blocks["linear_3"] = std::make_shared<Linear>(dim, inner_dim, false);
+        }
+
+        ggml_tensor* forward(GGMLRunnerContext* ctx, ggml_tensor* x) {
+            auto linear_1 = std::dynamic_pointer_cast<Linear>(blocks["linear_1"]);
+            auto linear_2 = std::dynamic_pointer_cast<Linear>(blocks["linear_2"]);
+            auto linear_3 = std::dynamic_pointer_cast<Linear>(blocks["linear_3"]);
+
+            if (sd_backend_is(ctx->backend, "Vulkan")) {
+                linear_2->set_force_prec_f32(true);
+            }
+
+            auto h1 = linear_1->forward(ctx, x);
+            auto h2 = linear_3->forward(ctx, x);
+            x       = ggml_swiglu_split(ctx->ggml_ctx, h1, h2);
+            x       = linear_2->forward(ctx, x);
+            return x;
+        }
+    };
+
+    struct LuminaLayerNormContinuous : public GGMLBlock {
+        LuminaLayerNormContinuous(int64_t embedding_dim,
+                                  int64_t conditioning_embedding_dim,
+                                  int64_t out_dim) {
+            blocks["linear_1"] = std::make_shared<Linear>(conditioning_embedding_dim, embedding_dim, true);
+            blocks["norm"]     = std::make_shared<LayerNorm>(embedding_dim, 1e-6f, false);
+            blocks["linear_2"] = std::make_shared<Linear>(embedding_dim, out_dim, true);
+        }
+
+        ggml_tensor* forward(GGMLRunnerContext* ctx, ggml_tensor* x, ggml_tensor* conditioning_embedding) {
+            auto linear_1 = std::dynamic_pointer_cast<Linear>(blocks["linear_1"]);
+            auto norm     = std::dynamic_pointer_cast<LayerNorm>(blocks["norm"]);
+            auto linear_2 = std::dynamic_pointer_cast<Linear>(blocks["linear_2"]);
+
+            auto emb = linear_1->forward(ctx, ggml_silu(ctx->ggml_ctx, conditioning_embedding));
+            x        = scale_modulate(ctx->ggml_ctx, norm->forward(ctx, x), emb);
+            x        = linear_2->forward(ctx, x);
+            return x;
+        }
+    };
+
+    struct Attention : public GGMLBlock {
+        int64_t dim_head;
+        int64_t heads;
+        int64_t kv_heads;
+
+        Attention(int64_t query_dim, int64_t dim_head, int64_t heads, int64_t kv_heads, float eps = 1e-5f)
+            : dim_head(dim_head), heads(heads), kv_heads(kv_heads) {
+            blocks["to_q"]     = std::make_shared<Linear>(query_dim, heads * dim_head, false);
+            blocks["to_k"]     = std::make_shared<Linear>(query_dim, kv_heads * dim_head, false);
+            blocks["to_v"]     = std::make_shared<Linear>(query_dim, kv_heads * dim_head, false);
+            blocks["norm_q"]   = std::make_shared<RMSNorm>(dim_head, eps);
+            blocks["norm_k"]   = std::make_shared<RMSNorm>(dim_head, eps);
+            blocks["to_out.0"] = std::make_shared<Linear>(heads * dim_head, query_dim, false);
+        }
+
+        ggml_tensor* forward(GGMLRunnerContext* ctx,
+                             ggml_tensor* hidden_states,
+                             ggml_tensor* encoder_hidden_states,
+                             ggml_tensor* rotary_emb,
+                             ggml_tensor* attention_mask = nullptr) {
+            auto to_q     = std::dynamic_pointer_cast<Linear>(blocks["to_q"]);
+            auto to_k     = std::dynamic_pointer_cast<Linear>(blocks["to_k"]);
+            auto to_v     = std::dynamic_pointer_cast<Linear>(blocks["to_v"]);
+            auto norm_q   = std::dynamic_pointer_cast<RMSNorm>(blocks["norm_q"]);
+            auto norm_k   = std::dynamic_pointer_cast<RMSNorm>(blocks["norm_k"]);
+            auto to_out_0 = std::dynamic_pointer_cast<Linear>(blocks["to_out.0"]);
+
+            if (sd_backend_is(ctx->backend, "Vulkan")) {
+                to_out_0->set_force_prec_f32(true);
+            }
+
+            int64_t N  = hidden_states->ne[2];
+            int64_t Lq = hidden_states->ne[1];
+            int64_t Lk = encoder_hidden_states->ne[1];
+
+            auto q = to_q->forward(ctx, hidden_states);
+            q      = ggml_reshape_4d(ctx->ggml_ctx, q, dim_head, heads, Lq, N);
+            auto k = to_k->forward(ctx, encoder_hidden_states);
+            k      = ggml_reshape_4d(ctx->ggml_ctx, k, dim_head, kv_heads, Lk, N);
+            auto v = to_v->forward(ctx, encoder_hidden_states);
+            v      = ggml_reshape_4d(ctx->ggml_ctx, v, dim_head, kv_heads, Lk, N);
+
+            q = norm_q->forward(ctx, q);
+            k = norm_k->forward(ctx, k);
+
+            auto out = Rope::attention(ctx, q, k, v, rotary_emb, attention_mask);
+            out      = to_out_0->forward(ctx, out);
+            return out;
+        }
+    };
+
+    struct BooguImageTransformerBlock : public GGMLBlock {
+        bool modulation;
+
+        BooguImageTransformerBlock(int64_t dim,
+                                   int64_t num_attention_heads,
+                                   int64_t num_kv_heads,
+                                   int64_t multiple_of,
+                                   float norm_eps,
+                                   bool modulation)
+            : modulation(modulation) {
+            int64_t head_dim       = dim / num_attention_heads;
+            blocks["attn"]         = std::make_shared<Attention>(dim, head_dim, num_attention_heads, num_kv_heads, 1e-5f);
+            blocks["feed_forward"] = std::make_shared<LuminaFeedForward>(dim, 4 * dim, multiple_of);
+            if (modulation) {
+                blocks["norm1"] = std::make_shared<LuminaRMSNormZero>(dim, std::min<int64_t>(dim, 1024), norm_eps);
+            } else {
+                blocks["norm1"] = std::make_shared<RMSNorm>(dim, norm_eps);
+            }
+            blocks["ffn_norm1"] = std::make_shared<RMSNorm>(dim, norm_eps);
+            blocks["norm2"]     = std::make_shared<RMSNorm>(dim, norm_eps);
+            blocks["ffn_norm2"] = std::make_shared<RMSNorm>(dim, norm_eps);
+        }
+
+        ggml_tensor* forward(GGMLRunnerContext* ctx,
+                             ggml_tensor* hidden_states,
+                             ggml_tensor* rotary_emb,
+                             ggml_tensor* temb           = nullptr,
+                             ggml_tensor* attention_mask = nullptr) {
+            auto attn         = std::dynamic_pointer_cast<Attention>(blocks["attn"]);
+            auto feed_forward = std::dynamic_pointer_cast<LuminaFeedForward>(blocks["feed_forward"]);
+            auto ffn_norm1    = std::dynamic_pointer_cast<RMSNorm>(blocks["ffn_norm1"]);
+            auto norm2        = std::dynamic_pointer_cast<RMSNorm>(blocks["norm2"]);
+            auto ffn_norm2    = std::dynamic_pointer_cast<RMSNorm>(blocks["ffn_norm2"]);
+
+            if (modulation) {
+                auto norm1 = std::dynamic_pointer_cast<LuminaRMSNormZero>(blocks["norm1"]);
+                auto mods  = norm1->forward(ctx, hidden_states, temb);
+
+                auto norm_hidden_states = std::get<0>(mods);
+                auto gate_msa           = std::get<1>(mods);
+                auto scale_mlp          = std::get<2>(mods);
+                auto gate_mlp           = std::get<3>(mods);
+
+                auto attn_output = attn->forward(ctx, norm_hidden_states, norm_hidden_states, rotary_emb, attention_mask);
+                hidden_states    = gate_residual(ctx->ggml_ctx, hidden_states, norm2->forward(ctx, attn_output), gate_msa);
+
+                auto mlp_input  = scale_modulate(ctx->ggml_ctx, ffn_norm1->forward(ctx, hidden_states), scale_mlp);
+                auto mlp_output = feed_forward->forward(ctx, mlp_input);
+                hidden_states   = gate_residual(ctx->ggml_ctx, hidden_states, ffn_norm2->forward(ctx, mlp_output), gate_mlp);
+            } else {
+                auto norm1 = std::dynamic_pointer_cast<RMSNorm>(blocks["norm1"]);
+
+                auto norm_hidden_states = norm1->forward(ctx, hidden_states);
+                auto attn_output        = attn->forward(ctx, norm_hidden_states, norm_hidden_states, rotary_emb, attention_mask);
+                hidden_states           = ggml_add(ctx->ggml_ctx, hidden_states, norm2->forward(ctx, attn_output));
+
+                auto mlp_output = feed_forward->forward(ctx, ffn_norm1->forward(ctx, hidden_states));
+                hidden_states   = ggml_add(ctx->ggml_ctx, hidden_states, ffn_norm2->forward(ctx, mlp_output));
+            }
+            return hidden_states;
+        }
+    };
+
+    struct BooguImageJointAttention : public GGMLBlock {
+        int64_t dim_head;
+        int64_t heads;
+        int64_t kv_heads;
+
+        BooguImageJointAttention(int64_t dim, int64_t dim_head, int64_t heads, int64_t kv_heads)
+            : dim_head(dim_head), heads(heads), kv_heads(kv_heads) {
+            blocks["norm_q"]                  = std::make_shared<RMSNorm>(dim_head, 1e-5f);
+            blocks["norm_k"]                  = std::make_shared<RMSNorm>(dim_head, 1e-5f);
+            blocks["to_out.0"]                = std::make_shared<Linear>(heads * dim_head, dim, false);
+            blocks["processor.img_to_q"]      = std::make_shared<Linear>(dim, heads * dim_head, false);
+            blocks["processor.img_to_k"]      = std::make_shared<Linear>(dim, kv_heads * dim_head, false);
+            blocks["processor.img_to_v"]      = std::make_shared<Linear>(dim, kv_heads * dim_head, false);
+            blocks["processor.instruct_to_q"] = std::make_shared<Linear>(dim, heads * dim_head, false);
+            blocks["processor.instruct_to_k"] = std::make_shared<Linear>(dim, kv_heads * dim_head, false);
+            blocks["processor.instruct_to_v"] = std::make_shared<Linear>(dim, kv_heads * dim_head, false);
+            blocks["processor.instruct_out"]  = std::make_shared<Linear>(heads * dim_head, dim, false);
+            blocks["processor.img_out"]       = std::make_shared<Linear>(heads * dim_head, dim, false);
+        }
+
+        ggml_tensor* forward(GGMLRunnerContext* ctx,
+                             ggml_tensor* img_hidden_states,
+                             ggml_tensor* instruct_hidden_states,
+                             ggml_tensor* rotary_emb,
+                             ggml_tensor* attention_mask = nullptr) {
+            auto norm_q        = std::dynamic_pointer_cast<RMSNorm>(blocks["norm_q"]);
+            auto norm_k        = std::dynamic_pointer_cast<RMSNorm>(blocks["norm_k"]);
+            auto to_out_0      = std::dynamic_pointer_cast<Linear>(blocks["to_out.0"]);
+            auto img_to_q      = std::dynamic_pointer_cast<Linear>(blocks["processor.img_to_q"]);
+            auto img_to_k      = std::dynamic_pointer_cast<Linear>(blocks["processor.img_to_k"]);
+            auto img_to_v      = std::dynamic_pointer_cast<Linear>(blocks["processor.img_to_v"]);
+            auto instruct_to_q = std::dynamic_pointer_cast<Linear>(blocks["processor.instruct_to_q"]);
+            auto instruct_to_k = std::dynamic_pointer_cast<Linear>(blocks["processor.instruct_to_k"]);
+            auto instruct_to_v = std::dynamic_pointer_cast<Linear>(blocks["processor.instruct_to_v"]);
+            auto instruct_out  = std::dynamic_pointer_cast<Linear>(blocks["processor.instruct_out"]);
+            auto img_out       = std::dynamic_pointer_cast<Linear>(blocks["processor.img_out"]);
+
+            if (sd_backend_is(ctx->backend, "Vulkan")) {
+                to_out_0->set_force_prec_f32(true);
+            }
+
+            int64_t N          = img_hidden_states->ne[2];
+            int64_t L_img      = img_hidden_states->ne[1];
+            int64_t L_instruct = instruct_hidden_states->ne[1];
+
+            auto img_q = img_to_q->forward(ctx, img_hidden_states);
+            img_q      = ggml_reshape_4d(ctx->ggml_ctx, img_q, dim_head, heads, L_img, N);
+            auto img_k = img_to_k->forward(ctx, img_hidden_states);
+            img_k      = ggml_reshape_4d(ctx->ggml_ctx, img_k, dim_head, kv_heads, L_img, N);
+            auto img_v = img_to_v->forward(ctx, img_hidden_states);
+            img_v      = ggml_reshape_4d(ctx->ggml_ctx, img_v, dim_head, kv_heads, L_img, N);
+
+            auto instruct_q = instruct_to_q->forward(ctx, instruct_hidden_states);
+            instruct_q      = ggml_reshape_4d(ctx->ggml_ctx, instruct_q, dim_head, heads, L_instruct, N);
+            auto instruct_k = instruct_to_k->forward(ctx, instruct_hidden_states);
+            instruct_k      = ggml_reshape_4d(ctx->ggml_ctx, instruct_k, dim_head, kv_heads, L_instruct, N);
+            auto instruct_v = instruct_to_v->forward(ctx, instruct_hidden_states);
+            instruct_v      = ggml_reshape_4d(ctx->ggml_ctx, instruct_v, dim_head, kv_heads, L_instruct, N);
+
+            auto q = ggml_concat(ctx->ggml_ctx, instruct_q, img_q, 2);
+            auto k = ggml_concat(ctx->ggml_ctx, instruct_k, img_k, 2);
+            auto v = ggml_concat(ctx->ggml_ctx, instruct_v, img_v, 2);
+            q      = norm_q->forward(ctx, q);
+            k      = norm_k->forward(ctx, k);
+
+            auto hidden_states = Rope::attention(ctx, q, k, v, rotary_emb, attention_mask);
+            auto instruct_attn = ggml_ext_slice(ctx->ggml_ctx, hidden_states, 1, 0, L_instruct);
+            auto img_attn      = ggml_ext_slice(ctx->ggml_ctx, hidden_states, 1, L_instruct, L_instruct + L_img);
+
+            instruct_attn = instruct_out->forward(ctx, instruct_attn);
+            img_attn      = img_out->forward(ctx, img_attn);
+            hidden_states = ggml_concat(ctx->ggml_ctx, instruct_attn, img_attn, 1);
+            hidden_states = to_out_0->forward(ctx, hidden_states);
+            return hidden_states;
+        }
+    };
+
+    struct BooguImageDoubleStreamBlock : public GGMLBlock {
+        BooguImageDoubleStreamBlock(int64_t dim,
+                                    int64_t num_attention_heads,
+                                    int64_t num_kv_heads,
+                                    int64_t multiple_of,
+                                    float norm_eps) {
+            int64_t head_dim                = dim / num_attention_heads;
+            blocks["img_instruct_attn"]     = std::make_shared<BooguImageJointAttention>(dim, head_dim, num_attention_heads, num_kv_heads);
+            blocks["img_self_attn"]         = std::make_shared<Attention>(dim, head_dim, num_attention_heads, num_kv_heads, 1e-5f);
+            blocks["img_feed_forward"]      = std::make_shared<LuminaFeedForward>(dim, 4 * dim, multiple_of);
+            blocks["instruct_feed_forward"] = std::make_shared<LuminaFeedForward>(dim, 4 * dim, multiple_of);
+            blocks["img_norm1"]             = std::make_shared<LuminaRMSNormZero>(dim, std::min<int64_t>(dim, 1024), norm_eps);
+            blocks["img_norm2"]             = std::make_shared<LuminaRMSNormZero>(dim, std::min<int64_t>(dim, 1024), norm_eps);
+            blocks["img_norm3"]             = std::make_shared<LuminaRMSNormZero>(dim, std::min<int64_t>(dim, 1024), norm_eps);
+            blocks["instruct_norm1"]        = std::make_shared<LuminaRMSNormZero>(dim, std::min<int64_t>(dim, 1024), norm_eps);
+            blocks["instruct_norm2"]        = std::make_shared<LuminaRMSNormZero>(dim, std::min<int64_t>(dim, 1024), norm_eps);
+            blocks["img_attn_norm"]         = std::make_shared<RMSNorm>(dim, norm_eps);
+            blocks["img_self_attn_norm"]    = std::make_shared<RMSNorm>(dim, norm_eps);
+            blocks["img_ffn_norm1"]         = std::make_shared<RMSNorm>(dim, norm_eps);
+            blocks["img_ffn_norm2"]         = std::make_shared<RMSNorm>(dim, norm_eps);
+            blocks["instruct_attn_norm"]    = std::make_shared<RMSNorm>(dim, norm_eps);
+            blocks["instruct_ffn_norm1"]    = std::make_shared<RMSNorm>(dim, norm_eps);
+            blocks["instruct_ffn_norm2"]    = std::make_shared<RMSNorm>(dim, norm_eps);
+        }
+
+        std::pair<ggml_tensor*, ggml_tensor*> forward(GGMLRunnerContext* ctx,
+                                                      ggml_tensor* img_hidden_states,
+                                                      ggml_tensor* instruct_hidden_states,
+                                                      ggml_tensor* joint_rotary_emb,
+                                                      ggml_tensor* img_rotary_emb,
+                                                      ggml_tensor* temb) {
+            auto img_instruct_attn     = std::dynamic_pointer_cast<BooguImageJointAttention>(blocks["img_instruct_attn"]);
+            auto img_self_attn         = std::dynamic_pointer_cast<Attention>(blocks["img_self_attn"]);
+            auto img_feed_forward      = std::dynamic_pointer_cast<LuminaFeedForward>(blocks["img_feed_forward"]);
+            auto instruct_feed_forward = std::dynamic_pointer_cast<LuminaFeedForward>(blocks["instruct_feed_forward"]);
+            auto img_norm1             = std::dynamic_pointer_cast<LuminaRMSNormZero>(blocks["img_norm1"]);
+            auto img_norm2             = std::dynamic_pointer_cast<LuminaRMSNormZero>(blocks["img_norm2"]);
+            auto img_norm3             = std::dynamic_pointer_cast<LuminaRMSNormZero>(blocks["img_norm3"]);
+            auto instruct_norm1        = std::dynamic_pointer_cast<LuminaRMSNormZero>(blocks["instruct_norm1"]);
+            auto instruct_norm2        = std::dynamic_pointer_cast<LuminaRMSNormZero>(blocks["instruct_norm2"]);
+            auto img_attn_norm         = std::dynamic_pointer_cast<RMSNorm>(blocks["img_attn_norm"]);
+            auto img_self_attn_norm    = std::dynamic_pointer_cast<RMSNorm>(blocks["img_self_attn_norm"]);
+            auto img_ffn_norm1         = std::dynamic_pointer_cast<RMSNorm>(blocks["img_ffn_norm1"]);
+            auto img_ffn_norm2         = std::dynamic_pointer_cast<RMSNorm>(blocks["img_ffn_norm2"]);
+            auto instruct_attn_norm    = std::dynamic_pointer_cast<RMSNorm>(blocks["instruct_attn_norm"]);
+            auto instruct_ffn_norm1    = std::dynamic_pointer_cast<RMSNorm>(blocks["instruct_ffn_norm1"]);
+            auto instruct_ffn_norm2    = std::dynamic_pointer_cast<RMSNorm>(blocks["instruct_ffn_norm2"]);
+
+            int64_t L_instruct = instruct_hidden_states->ne[1];
+
+            auto img_norm1_out_vec      = img_norm1->forward(ctx, img_hidden_states, temb);
+            auto img_norm2_out_vec      = img_norm2->forward(ctx, img_hidden_states, temb);
+            auto img_norm3_out_vec      = img_norm3->forward(ctx, img_hidden_states, temb);
+            auto instruct_norm1_out_vec = instruct_norm1->forward(ctx, instruct_hidden_states, temb);
+            auto instruct_norm2_out_vec = instruct_norm2->forward(ctx, instruct_hidden_states, temb);
+
+            auto img_norm1_out = std::get<0>(img_norm1_out_vec);
+            auto img_gate_msa  = std::get<1>(img_norm1_out_vec);
+            auto img_scale_mlp = std::get<2>(img_norm1_out_vec);
+            auto img_gate_mlp  = std::get<3>(img_norm1_out_vec);
+
+            auto img_norm2_out = std::get<0>(img_norm2_out_vec);
+            auto img_shift_mlp = std::get<1>(img_norm2_out_vec);
+
+            auto img_norm3_out = std::get<0>(img_norm3_out_vec);
+            auto img_gate_self = std::get<1>(img_norm3_out_vec);
+
+            auto instruct_norm1_out = std::get<0>(instruct_norm1_out_vec);
+            auto instruct_gate_msa  = std::get<1>(instruct_norm1_out_vec);
+            auto instruct_scale_mlp = std::get<2>(instruct_norm1_out_vec);
+            auto instruct_gate_mlp  = std::get<3>(instruct_norm1_out_vec);
+
+            auto instruct_norm2_out = std::get<0>(instruct_norm2_out_vec);
+            auto instruct_shift_mlp = std::get<1>(instruct_norm2_out_vec);
+
+            auto joint_attn_out    = img_instruct_attn->forward(ctx, img_norm1_out, instruct_norm1_out, joint_rotary_emb);
+            auto instruct_attn_out = ggml_ext_slice(ctx->ggml_ctx, joint_attn_out, 1, 0, L_instruct);
+            auto img_attn_out      = ggml_ext_slice(ctx->ggml_ctx, joint_attn_out, 1, L_instruct, joint_attn_out->ne[1]);
+
+            auto img_self_attn_out = img_self_attn->forward(ctx, img_norm3_out, img_norm3_out, img_rotary_emb);
+
+            img_hidden_states = gate_residual(ctx->ggml_ctx, img_hidden_states, img_attn_norm->forward(ctx, img_attn_out), img_gate_msa);
+            img_hidden_states = gate_residual(ctx->ggml_ctx, img_hidden_states, img_self_attn_norm->forward(ctx, img_self_attn_out), img_gate_self);
+
+            auto img_mlp_input = scale_modulate(ctx->ggml_ctx, img_norm2_out, img_scale_mlp);
+            img_shift_mlp      = ggml_reshape_3d(ctx->ggml_ctx, img_shift_mlp, img_shift_mlp->ne[0], 1, img_shift_mlp->ne[1]);
+            img_mlp_input      = ggml_add(ctx->ggml_ctx, img_mlp_input, img_shift_mlp);
+            auto img_mlp_out   = img_feed_forward->forward(ctx, img_ffn_norm1->forward(ctx, img_mlp_input));
+            img_hidden_states  = gate_residual(ctx->ggml_ctx, img_hidden_states, img_ffn_norm2->forward(ctx, img_mlp_out), img_gate_mlp);
+
+            instruct_hidden_states  = gate_residual(ctx->ggml_ctx, instruct_hidden_states, instruct_attn_norm->forward(ctx, instruct_attn_out), instruct_gate_msa);
+            auto instruct_mlp_input = scale_modulate(ctx->ggml_ctx, instruct_norm2_out, instruct_scale_mlp);
+            instruct_shift_mlp      = ggml_reshape_3d(ctx->ggml_ctx, instruct_shift_mlp, instruct_shift_mlp->ne[0], 1, instruct_shift_mlp->ne[1]);
+            instruct_mlp_input      = ggml_add(ctx->ggml_ctx, instruct_mlp_input, instruct_shift_mlp);
+            auto instruct_mlp_out   = instruct_feed_forward->forward(ctx, instruct_ffn_norm1->forward(ctx, instruct_mlp_input));
+            instruct_hidden_states  = gate_residual(ctx->ggml_ctx, instruct_hidden_states, instruct_ffn_norm2->forward(ctx, instruct_mlp_out), instruct_gate_mlp);
+
+            return {img_hidden_states, instruct_hidden_states};
+        }
+    };
+
+    struct BooguImageModel : public GGMLBlock {
+        BooguConfig config;
+
+        void init_params(ggml_context* ctx, const String2TensorStorage& tensor_storage_map = {}, const std::string prefix = "") override {
+            GGML_UNUSED(tensor_storage_map);
+            GGML_UNUSED(prefix);
+            params["image_index_embedding"] = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, config.hidden_size, 5);
+        }
+
+        BooguImageModel() = default;
+        BooguImageModel(BooguConfig config)
+            : config(std::move(config)) {
+            blocks["x_embedder"]               = std::make_shared<Linear>(this->config.patch_size * this->config.patch_size * this->config.in_channels, this->config.hidden_size, true);
+            blocks["ref_image_patch_embedder"] = std::make_shared<Linear>(this->config.patch_size * this->config.patch_size * this->config.in_channels, this->config.hidden_size, true);
+            blocks["time_caption_embed"]       = std::make_shared<LuminaCombinedTimestepCaptionEmbedding>(this->config.hidden_size,
+                                                                                                    this->config.instruction_feat_dim,
+                                                                                                    256,
+                                                                                                    this->config.norm_eps,
+                                                                                                    this->config.timestep_scale);
+
+            for (int i = 0; i < this->config.num_refiner_layers; i++) {
+                blocks["noise_refiner." + std::to_string(i)]     = std::make_shared<BooguImageTransformerBlock>(this->config.hidden_size,
+                                                                                                            this->config.num_attention_heads,
+                                                                                                            this->config.num_kv_heads,
+                                                                                                            this->config.multiple_of,
+                                                                                                            this->config.norm_eps,
+                                                                                                            true);
+                blocks["ref_image_refiner." + std::to_string(i)] = std::make_shared<BooguImageTransformerBlock>(this->config.hidden_size,
+                                                                                                                this->config.num_attention_heads,
+                                                                                                                this->config.num_kv_heads,
+                                                                                                                this->config.multiple_of,
+                                                                                                                this->config.norm_eps,
+                                                                                                                true);
+                blocks["context_refiner." + std::to_string(i)]   = std::make_shared<BooguImageTransformerBlock>(this->config.hidden_size,
+                                                                                                              this->config.num_attention_heads,
+                                                                                                              this->config.num_kv_heads,
+                                                                                                              this->config.multiple_of,
+                                                                                                              this->config.norm_eps,
+                                                                                                              false);
+            }
+
+            for (int i = 0; i < this->config.num_double_stream_layers; i++) {
+                blocks["double_stream_layers." + std::to_string(i)] = std::make_shared<BooguImageDoubleStreamBlock>(this->config.hidden_size,
+                                                                                                                    this->config.num_attention_heads,
+                                                                                                                    this->config.num_kv_heads,
+                                                                                                                    this->config.multiple_of,
+                                                                                                                    this->config.norm_eps);
+            }
+
+            for (int i = 0; i < this->config.num_layers; i++) {
+                blocks["single_stream_layers." + std::to_string(i)] = std::make_shared<BooguImageTransformerBlock>(this->config.hidden_size,
+                                                                                                                   this->config.num_attention_heads,
+                                                                                                                   this->config.num_kv_heads,
+                                                                                                                   this->config.multiple_of,
+                                                                                                                   this->config.norm_eps,
+                                                                                                                   true);
+            }
+
+            blocks["norm_out"] = std::make_shared<LuminaLayerNormContinuous>(this->config.hidden_size,
+                                                                             this->config.timestep_embed_dim,
+                                                                             this->config.patch_size * this->config.patch_size * this->config.out_channels);
+        }
+
+        ggml_tensor* image_index_embedding(GGMLRunnerContext* ctx, int index) {
+            GGML_ASSERT(index >= 0 && index < 5);
+            auto embedding = params["image_index_embedding"];
+            auto out       = ggml_view_1d(ctx->ggml_ctx,
+                                          embedding,
+                                          config.hidden_size,
+                                          index * config.hidden_size * ggml_element_size(embedding));
+            out            = ggml_reshape_3d(ctx->ggml_ctx, out, config.hidden_size, 1, 1);
+            return out;
+        }
+
+        ggml_tensor* embed_refs(GGMLRunnerContext* ctx, const std::vector<ggml_tensor*>& ref_latents) {
+            if (ref_latents.empty()) {
+                return nullptr;
+            }
+            auto ref_image_patch_embedder = std::dynamic_pointer_cast<Linear>(blocks["ref_image_patch_embedder"]);
+
+            ggml_tensor* ref_img = nullptr;
+            for (int i = 0; i < static_cast<int>(ref_latents.size()); i++) {
+                auto ref = DiT::pad_and_patchify(ctx, ref_latents[i], config.patch_size, config.patch_size, false);
+                ref      = ref_image_patch_embedder->forward(ctx, ref);
+                ref      = ggml_add(ctx->ggml_ctx, ref, image_index_embedding(ctx, std::min(i, 4)));
+                ref_img  = ref_img == nullptr ? ref : ggml_concat(ctx->ggml_ctx, ref_img, ref, 1);
+            }
+            return ref_img;
+        }
+
+        ggml_tensor* forward(GGMLRunnerContext* ctx,
+                             ggml_tensor* x,
+                             ggml_tensor* timesteps,
+                             ggml_tensor* context,
+                             ggml_tensor* pe,
+                             std::vector<ggml_tensor*> ref_latents = {}) {
+            int64_t W = x->ne[0];
+            int64_t H = x->ne[1];
+            int64_t N = x->ne[3];
+            GGML_ASSERT(N == 1);
+
+            auto x_embedder         = std::dynamic_pointer_cast<Linear>(blocks["x_embedder"]);
+            auto time_caption_embed = std::dynamic_pointer_cast<LuminaCombinedTimestepCaptionEmbedding>(blocks["time_caption_embed"]);
+            auto norm_out           = std::dynamic_pointer_cast<LuminaLayerNormContinuous>(blocks["norm_out"]);
+
+            auto timestep = ggml_sub(ctx->ggml_ctx, ggml_ext_ones_like(ctx->ggml_ctx, timesteps), timesteps);
+            auto embeds   = time_caption_embed->forward(ctx, timestep, context);
+            auto temb     = embeds.first;
+            auto txt      = embeds.second;
+
+            auto img        = DiT::pad_and_patchify(ctx, x, config.patch_size, config.patch_size, false);
+            int64_t img_len = img->ne[1];
+            img             = x_embedder->forward(ctx, img);
+            auto ref_img    = embed_refs(ctx, ref_latents);
+            int64_t ref_len = ref_img != nullptr ? ref_img->ne[1] : 0;
+            int64_t txt_len = txt->ne[1];
+
+            GGML_ASSERT(pe->ne[3] == txt_len + ref_len + img_len);
+            auto txt_pe   = ggml_ext_slice(ctx->ggml_ctx, pe, 3, 0, txt_len);
+            auto noise_pe = ggml_ext_slice(ctx->ggml_ctx, pe, 3, txt_len + ref_len, txt_len + ref_len + img_len);
+
+            for (int i = 0; i < config.num_refiner_layers; i++) {
+                auto block = std::dynamic_pointer_cast<BooguImageTransformerBlock>(blocks["context_refiner." + std::to_string(i)]);
+                txt        = block->forward(ctx, txt, txt_pe);
+                sd::ggml_graph_cut::mark_graph_cut(txt, "boogu.context_refiner." + std::to_string(i), "txt");
+            }
+
+            for (int i = 0; i < config.num_refiner_layers; i++) {
+                auto block = std::dynamic_pointer_cast<BooguImageTransformerBlock>(blocks["noise_refiner." + std::to_string(i)]);
+                img        = block->forward(ctx, img, noise_pe, temb);
+                sd::ggml_graph_cut::mark_graph_cut(img, "boogu.noise_refiner." + std::to_string(i), "img");
+            }
+
+            ggml_tensor* combined_img = img;
+            if (ref_img != nullptr) {
+                auto ref_pe = ggml_ext_slice(ctx->ggml_ctx, pe, 3, txt_len, txt_len + ref_len);
+                for (int i = 0; i < config.num_refiner_layers; i++) {
+                    auto block = std::dynamic_pointer_cast<BooguImageTransformerBlock>(blocks["ref_image_refiner." + std::to_string(i)]);
+                    ref_img    = block->forward(ctx, ref_img, ref_pe, temb);
+                    sd::ggml_graph_cut::mark_graph_cut(ref_img, "boogu.ref_image_refiner." + std::to_string(i), "ref_img");
+                }
+                combined_img = ggml_concat(ctx->ggml_ctx, ref_img, img, 1);
+            }
+
+            auto img_pe = ggml_ext_slice(ctx->ggml_ctx, pe, 3, txt_len, txt_len + combined_img->ne[1]);
+            for (int i = 0; i < config.num_double_stream_layers; i++) {
+                auto block   = std::dynamic_pointer_cast<BooguImageDoubleStreamBlock>(blocks["double_stream_layers." + std::to_string(i)]);
+                auto result  = block->forward(ctx, combined_img, txt, pe, img_pe, temb);
+                combined_img = result.first;
+                txt          = result.second;
+                sd::ggml_graph_cut::mark_graph_cut(combined_img, "boogu.double_stream_layers." + std::to_string(i), "img");
+                sd::ggml_graph_cut::mark_graph_cut(txt, "boogu.double_stream_layers." + std::to_string(i), "txt");
+            }
+
+            auto hidden_states = ggml_concat(ctx->ggml_ctx, txt, combined_img, 1);
+            for (int i = 0; i < config.num_layers; i++) {
+                auto block    = std::dynamic_pointer_cast<BooguImageTransformerBlock>(blocks["single_stream_layers." + std::to_string(i)]);
+                hidden_states = block->forward(ctx, hidden_states, pe, temb);
+                sd::ggml_graph_cut::mark_graph_cut(hidden_states, "boogu.single_stream_layers." + std::to_string(i), "hidden_states");
+            }
+
+            hidden_states = norm_out->forward(ctx, hidden_states, temb);
+            hidden_states = ggml_ext_slice(ctx->ggml_ctx, hidden_states, 1, hidden_states->ne[1] - img_len, hidden_states->ne[1]);
+            hidden_states = DiT::unpatchify_and_crop(ctx->ggml_ctx, hidden_states, H, W, config.patch_size, config.patch_size, false);
+            hidden_states = ggml_ext_scale(ctx->ggml_ctx, hidden_states, -1.f);
+            return hidden_states;
+        }
+    };
+
+    __STATIC_INLINE__ int patched_token_count(int64_t size, int patch_size) {
+        int pad = (patch_size - (static_cast<int>(size) % patch_size)) % patch_size;
+        return (static_cast<int>(size) + pad) / patch_size;
+    }
+
+    __STATIC_INLINE__ void append_spatial_ids(std::vector<std::vector<float>>& ids,
+                                              int bs,
+                                              int pe_shift,
+                                              int h_tokens,
+                                              int w_tokens) {
+        std::vector<std::vector<float>> image_ids(h_tokens * w_tokens, std::vector<float>(3, 0.0f));
+        for (int h = 0; h < h_tokens; h++) {
+            for (int w = 0; w < w_tokens; w++) {
+                image_ids[h * w_tokens + w][0] = static_cast<float>(pe_shift);
+                image_ids[h * w_tokens + w][1] = static_cast<float>(h);
+                image_ids[h * w_tokens + w][2] = static_cast<float>(w);
+            }
+        }
+        for (int b = 0; b < bs; b++) {
+            ids.insert(ids.end(), image_ids.begin(), image_ids.end());
+        }
+    }
+
+    __STATIC_INLINE__ std::vector<float> gen_boogu_pe(int h,
+                                                      int w,
+                                                      int patch_size,
+                                                      int bs,
+                                                      int context_len,
+                                                      const std::vector<ggml_tensor*>& ref_latents,
+                                                      int theta,
+                                                      const std::vector<int>& axes_dim) {
+        std::vector<std::vector<float>> ids;
+        ids.reserve(static_cast<size_t>(bs) * context_len);
+        for (int b = 0; b < bs; b++) {
+            for (int i = 0; i < context_len; i++) {
+                float pos = static_cast<float>(i);
+                ids.push_back({pos, pos, pos});
+            }
+        }
+
+        int pe_shift = context_len;
+        for (ggml_tensor* ref : ref_latents) {
+            int ref_h_tokens = patched_token_count(ref->ne[1], patch_size);
+            int ref_w_tokens = patched_token_count(ref->ne[0], patch_size);
+            append_spatial_ids(ids, bs, pe_shift, ref_h_tokens, ref_w_tokens);
+            pe_shift += std::max(ref_h_tokens, ref_w_tokens);
+        }
+
+        int h_tokens = patched_token_count(h, patch_size);
+        int w_tokens = patched_token_count(w, patch_size);
+        append_spatial_ids(ids, bs, pe_shift, h_tokens, w_tokens);
+
+        return Rope::embed_nd(ids, bs, static_cast<float>(theta), axes_dim);
+    }
+
+    struct BooguImageRunner : public DiffusionModelRunner {
+        BooguConfig config;
+        BooguImageModel boogu;
+        std::vector<float> pe_vec;
+
+        BooguImageRunner(ggml_backend_t backend,
+                         const String2TensorStorage& tensor_storage_map      = {},
+                         const std::string prefix                            = "",
+                         SDVersion version                                   = VERSION_BOOGU_IMAGE,
+                         std::shared_ptr<RunnerWeightManager> weight_manager = nullptr)
+            : DiffusionModelRunner(backend, prefix, weight_manager),
+              config(BooguConfig::detect_from_weights(tensor_storage_map, prefix)) {
+            boogu = BooguImageModel(config);
+            boogu.init(params_ctx, tensor_storage_map, prefix);
+        }
+
+        std::string get_desc() override {
+            return "boogu_image";
+        }
+
+        void get_param_tensors(std::map<std::string, ggml_tensor*>& tensors, const std::string& prefix) override {
+            boogu.get_param_tensors(tensors, prefix);
+        }
+
+        ggml_cgraph* build_graph(const sd::Tensor<float>& x_tensor,
+                                 const sd::Tensor<float>& timesteps_tensor,
+                                 const sd::Tensor<float>& context_tensor,
+                                 const std::vector<sd::Tensor<float>>& ref_latents_tensor = {}) {
+            ggml_cgraph* gf        = new_graph_custom(BOOGU_GRAPH_SIZE);
+            ggml_tensor* x         = make_input(x_tensor);
+            ggml_tensor* timesteps = make_input(timesteps_tensor);
+            GGML_ASSERT(x->ne[3] == 1);
+            GGML_ASSERT(!context_tensor.empty());
+            ggml_tensor* context = make_input(context_tensor);
+
+            std::vector<ggml_tensor*> ref_latents;
+            ref_latents.reserve(ref_latents_tensor.size());
+            for (const auto& ref_latent_tensor : ref_latents_tensor) {
+                ref_latents.push_back(make_input(ref_latent_tensor));
+            }
+
+            pe_vec      = gen_boogu_pe(static_cast<int>(x->ne[1]),
+                                       static_cast<int>(x->ne[0]),
+                                       config.patch_size,
+                                       static_cast<int>(x->ne[3]),
+                                       static_cast<int>(context->ne[1]),
+                                       ref_latents,
+                                       config.theta,
+                                       config.axes_dim);
+            int pos_len = static_cast<int>(pe_vec.size() / config.axes_dim_sum / 2);
+            auto pe     = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, 2, 2, config.axes_dim_sum / 2, pos_len);
+            set_backend_tensor_data(pe, pe_vec.data());
+
+            auto runner_ctx  = get_context();
+            ggml_tensor* out = boogu.forward(&runner_ctx, x, timesteps, context, pe, ref_latents);
+            ggml_build_forward_expand(gf, out);
+            return gf;
+        }
+
+        sd::Tensor<float> compute(int n_threads,
+                                  const sd::Tensor<float>& x,
+                                  const sd::Tensor<float>& timesteps,
+                                  const sd::Tensor<float>& context,
+                                  const std::vector<sd::Tensor<float>>& ref_latents = {}) {
+            auto get_graph = [&]() -> ggml_cgraph* {
+                return build_graph(x, timesteps, context, ref_latents);
+            };
+            return restore_trailing_singleton_dims(GGMLRunner::compute<float>(get_graph, n_threads, false, false, false), x.dim());
+        }
+
+        sd::Tensor<float> compute(int n_threads,
+                                  const DiffusionParams& diffusion_params) override {
+            GGML_ASSERT(diffusion_params.x != nullptr);
+            GGML_ASSERT(diffusion_params.timesteps != nullptr);
+            static const std::vector<sd::Tensor<float>> empty_ref_latents;
+            return compute(n_threads,
+                           *diffusion_params.x,
+                           *diffusion_params.timesteps,
+                           tensor_or_empty(diffusion_params.context),
+                           diffusion_params.ref_latents ? *diffusion_params.ref_latents : empty_ref_latents);
+        }
+    };
+}  // namespace Boogu
+
+#endif  // __SD_MODEL_DIFFUSION_BOOGU_HPP__
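
Review note on `gen_boogu_pe` above: the position ids are laid out as text tokens first, then one grid per reference latent, then the target grid. Each grid shares a single first-axis index (`pe_shift`), which advances by the larger side of the previous grid. The sketch below reproduces only that layout with plain integers so it can be checked without ggml. It assumes batch size 1; `PositionId` and `build_position_ids` are hypothetical names that do not appear in the patch, and reference sizes are passed as (height, width) pairs instead of tensors.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <utility>
#include <vector>

using PositionId = std::array<float, 3>;  // (index, row, col), hypothetical alias

// Same rounding as patched_token_count in the patch: ceil(size / patch_size).
static int patched_token_count(int size, int patch_size) {
    assert(patch_size > 0);
    int pad = (patch_size - (size % patch_size)) % patch_size;
    return (size + pad) / patch_size;
}

// Hypothetical helper mirroring the id layout of gen_boogu_pe for batch size 1.
// refs holds the (height, width) of each reference latent.
static std::vector<PositionId> build_position_ids(int h,
                                                  int w,
                                                  int patch_size,
                                                  int context_len,
                                                  const std::vector<std::pair<int, int>>& refs) {
    std::vector<PositionId> ids;

    // Text tokens: all three axes carry the token index.
    for (int i = 0; i < context_len; i++) {
        float pos = static_cast<float>(i);
        ids.push_back({pos, pos, pos});
    }

    int pe_shift = context_len;

    // Every token of a grid shares pe_shift on axis 0; axes 1 and 2 are row and column.
    auto append_grid = [&](int h_tokens, int w_tokens) {
        for (int r = 0; r < h_tokens; r++) {
            for (int c = 0; c < w_tokens; c++) {
                ids.push_back({static_cast<float>(pe_shift), static_cast<float>(r), static_cast<float>(c)});
            }
        }
    };

    // Reference grids come before the target grid; each one advances pe_shift
    // by its larger side, as in the loop over ref_latents in the patch.
    for (const auto& ref : refs) {
        int ref_h_tokens = patched_token_count(ref.first, patch_size);
        int ref_w_tokens = patched_token_count(ref.second, patch_size);
        append_grid(ref_h_tokens, ref_w_tokens);
        pe_shift += std::max(ref_h_tokens, ref_w_tokens);
    }

    append_grid(patched_token_count(h, patch_size), patched_token_count(w, patch_size));
    return ids;
}
```

For example, `build_position_ids(5, 4, 2, 3, {{4, 6}})` yields 15 ids: 3 text tokens, a 2x3 reference grid with index 3, and a 3x2 target grid with index 6. The order (text, references, target) matches the slices `forward` takes from `pe`.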
--- a/src/model/diffusion/control.hpp
+++ b/src/model/diffusion/control.hpp
@@ -312,16 +312,17 @@ struct ControlNet : public GGMLRunner {
    ControlNetBlock control_net;
    std::string weight_prefix;

-    ggml_backend_buffer_t control_buffer = nullptr;
-    ggml_context* control_ctx            = nullptr;
    std::vector<ggml_tensor*> control_outputs_ggml;
    ggml_tensor* guided_hint_output_ggml = nullptr;
    std::vector<sd::Tensor<float>> controls;
-    sd::Tensor<float> guided_hint;
    bool guided_hint_cached = false;
    std::shared_ptr<ModelManager> owned_model_manager;
    ggml_backend_t params_backend = nullptr;

+    static const char* guided_hint_cache_name() {
+        return "controlnet.guided_hint";
+    }
+
    ControlNet(ggml_backend_t backend,
               ggml_backend_t params_backend_,
               const String2TensorStorage& tensor_storage_map      = {},
@@ -336,44 +337,12 @@ struct ControlNet : public GGMLRunner {
        free_control_ctx();
    }

-    void alloc_control_ctx(std::vector<ggml_tensor*> outs) {
-        ggml_init_params params;
-        params.mem_size   = static_cast<size_t>(outs.size() * ggml_tensor_overhead()) + 1024 * 1024;
-        params.mem_buffer = nullptr;
-        params.no_alloc   = true;
-        control_ctx       = ggml_init(params);
-
-        control_outputs_ggml.resize(outs.size() - 1);
-
-        size_t control_buffer_size = 0;
-
-        guided_hint_output_ggml = ggml_dup_tensor(control_ctx, outs[0]);
-        control_buffer_size += ggml_nbytes(guided_hint_output_ggml);
-
-        for (int i = 0; i < outs.size() - 1; i++) {
-            control_outputs_ggml[i] = ggml_dup_tensor(control_ctx, outs[i + 1]);
-            control_buffer_size += ggml_nbytes(control_outputs_ggml[i]);
-        }
-
-        control_buffer = ggml_backend_alloc_ctx_tensors(control_ctx, runtime_backend);
-
-        LOG_DEBUG("control buffer size %.2fMB", control_buffer_size * 1.f / 1024.f / 1024.f);
-    }
-
    void free_control_ctx() {
-        if (control_buffer != nullptr) {
-            ggml_backend_buffer_free(control_buffer);
-            control_buffer = nullptr;
-        }
-        if (control_ctx != nullptr) {
-            ggml_free(control_ctx);
-            control_ctx = nullptr;
-        }
        guided_hint_output_ggml = nullptr;
        guided_hint_cached      = false;
-        guided_hint             = {};
        control_outputs_ggml.clear();
        controls.clear();
+        free_cache_ctx_and_buffer();
    }

    std::string get_desc() override {
@@ -397,11 +366,17 @@ struct ControlNet : public GGMLRunner {
        ggml_tensor* context   = make_optional_input(context_tensor);
        ggml_tensor* y         = make_optional_input(y_tensor);

+        guided_hint_output_ggml = nullptr;
+        control_outputs_ggml.clear();
+
        ggml_tensor* guided_hint_input = nullptr;
-        if (guided_hint_cached && !guided_hint.empty()) {
-            guided_hint_input = make_input(guided_hint);
-            hint              = nullptr;
-        } else {
+        if (guided_hint_cached) {
+            guided_hint_input = get_cache_tensor_by_name(guided_hint_cache_name());
+            if (guided_hint_input == nullptr) {
+                guided_hint_cached = false;
+            }
+        }
+        if (guided_hint_input == nullptr) {
            hint = make_input(hint_tensor);
        }

@@ -415,13 +390,19 @@ struct ControlNet : public GGMLRunner {
                                        context,
                                        y);

-        if (control_ctx == nullptr) {
-            alloc_control_ctx(outs);
+        if (guided_hint_input == nullptr && !outs.empty()) {
+            guided_hint_output_ggml = outs[0];
+            ggml_set_output(guided_hint_output_ggml);
+            cache(guided_hint_cache_name(), guided_hint_output_ggml);
+            ggml_build_forward_expand(gf, guided_hint_output_ggml);
        }

-        ggml_build_forward_expand(gf, ggml_cpy(compute_ctx, outs[0], guided_hint_output_ggml));
-        for (int i = 0; i < outs.size() - 1; i++) {
-            ggml_build_forward_expand(gf, ggml_cpy(compute_ctx, outs[i + 1], control_outputs_ggml[i]));
+        control_outputs_ggml.reserve(outs.size() > 0 ? outs.size() - 1 : 0);
+        for (size_t i = 1; i < outs.size(); i++) {
+            ggml_tensor* control_output = outs[i];
+            ggml_set_output(control_output);
+            ggml_build_forward_expand(gf, control_output);
+            control_outputs_ggml.push_back(control_output);
        }

        return gf;
@@ -441,15 +422,12 @@ struct ControlNet : public GGMLRunner {
            return build_graph(x, hint, timesteps, context, y);
        };

-        auto compute_result = GGMLRunner::compute<float>(get_graph, n_threads, false, false, false);
+        auto compute_result = GGMLRunner::compute<float>(get_graph, n_threads, false, false, false, true);
        if (!compute_result.has_value()) {
            return std::nullopt;
        }

-        if (guided_hint_output_ggml != nullptr) {
-            guided_hint = restore_trailing_singleton_dims(sd::make_sd_tensor_from_ggml<float>(guided_hint_output_ggml),
-                                                          4);
-        }
+        guided_hint_cached = get_cache_tensor_by_name(guided_hint_cache_name()) != nullptr;
        controls.clear();
        controls.reserve(control_outputs_ggml.size());
        for (ggml_tensor* control : control_outputs_ggml) {
@@ -457,7 +435,6 @@ struct ControlNet : public GGMLRunner {
            GGML_ASSERT(!control_host.empty());
            controls.push_back(std::move(control_host));
        }
-        guided_hint_cached = true;
        return controls;
    }
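
Review note on the guided-hint change above: the `guided_hint_cached` flag is now treated as a hint that is re-validated against the runner's named cache, so a cleared cache falls back to re-encoding the hint rather than reading a stale tensor. The sketch below shows that control flow only. `NamedCache`, `HintRunner`, and `encode` are hypothetical stand-ins written for this note (the real code works on ggml tensors through `get_cache_tensor_by_name` and `cache`), and the doubling in `encode` is a placeholder computation.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the runner's name-keyed tensor cache.
struct NamedCache {
    std::unordered_map<std::string, std::vector<float>> entries;

    const std::vector<float>* find(const std::string& name) const {
        auto it = entries.find(name);
        return it == entries.end() ? nullptr : &it->second;
    }
};

// Hypothetical runner showing the flag/cache interplay from the patch.
struct HintRunner {
    NamedCache cache;
    bool hint_cached = false;
    int encode_calls = 0;

    static const char* hint_cache_name() {
        return "controlnet.guided_hint";
    }

    // Placeholder for the hint encoder; doubles each value.
    std::vector<float> encode(const std::vector<float>& hint) {
        encode_calls++;
        std::vector<float> out(hint);
        for (float& v : out) {
            v *= 2.0f;
        }
        return out;
    }

    std::vector<float> run(const std::vector<float>& hint) {
        const std::vector<float>* cached = nullptr;
        if (hint_cached) {
            cached = cache.find(hint_cache_name());
            if (cached == nullptr) {
                hint_cached = false;  // flag was stale: the cache no longer holds the entry
            }
        }

        std::vector<float> guided;
        if (cached != nullptr) {
            guided = *cached;  // reuse the cached result, skip the encoder
        } else {
            guided = encode(hint);
            cache.entries[hint_cache_name()] = guided;
        }

        // As in compute(): the flag reflects what the cache actually holds.
        hint_cached = cache.find(hint_cache_name()) != nullptr;
        return guided;
    }
};
```

Calling `run` twice encodes once; clearing `cache.entries` between calls forces a second encode even though the flag was still set.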

--- a/src/model/diffusion/ernie_image.hpp
+++ b/src/model/diffusion/ernie_image.hpp
@@ -162,6 +162,8 @@ namespace ErnieImage {
            int64_t S = x->ne[1];
            int64_t N = x->ne[2];

+            float scale = (sd_backend_is(ctx->backend, "Vulkan") && ctx->flash_attn_enabled) ? 1.0f / 32.0f : 1.0f;
+
            auto q = to_q->forward(ctx, x);
            auto k = to_k->forward(ctx, x);
            auto v = to_v->forward(ctx, x);
@@ -182,7 +184,7 @@ namespace ErnieImage {
            k = ggml_cont(ctx->ggml_ctx, ggml_permute(ctx->ggml_ctx, k, 0, 2, 1, 3));  // [N, heads, S, head_dim]
            k = ggml_reshape_3d(ctx->ggml_ctx, k, k->ne[0], k->ne[1], k->ne[2] * k->ne[3]);

-            x = ggml_ext_attention_ext(ctx->ggml_ctx, ctx->backend, q, k, v, num_heads, attention_mask, true, ctx->flash_attn_enabled);  // [N, S, hidden_size]
+            x = ggml_ext_attention_ext(ctx->ggml_ctx, ctx->backend, q, k, v, num_heads, attention_mask, true, ctx->flash_attn_enabled, scale);  // [N, S, hidden_size]
            x = to_out_0->forward(ctx, x);
            return x;
        }
--- a/src/model/diffusion/flux.hpp
+++ b/src/model/diffusion/flux.hpp
@@ -4,6 +4,7 @@
 #include <memory>
 #include <vector>

+#include "model/adapter/pulid.hpp"
 #include "model/common/rope.hpp"
 #include "model/diffusion/dit.hpp"
 #include "model/diffusion/model.hpp"
@@ -49,6 +50,10 @@ namespace Flux {
        float ref_index_scale     = 1.f;
        ChromaRadianceConfig chroma_radiance_params;

+        bool pulid_enabled        = false;
+        int pulid_double_interval = 2;
+        int pulid_single_interval = 4;
+
        static FluxConfig detect_from_weights(const String2TensorStorage& tensor_storage_map,
                                              const std::string& prefix,
                                              SDVersion version = VERSION_FLUX) {
@@ -138,6 +143,9 @@ namespace Flux {
                if (ends_with(name, "double_blocks.0.txt_attn.norm.key_norm.scale")) {
                    head_dim = tensor_storage.ne[0];
                }
+                if (name.find("pulid_ca.") != std::string::npos) {
+                    config.pulid_enabled = true;
+                }
            }
            if (actual_radiance_patch_size > 0 && actual_radiance_patch_size != config.patch_size) {
                GGML_ASSERT(config.patch_size == 2 * actual_radiance_patch_size);
@@ -957,6 +965,20 @@ namespace Flux {
                blocks["double_stream_modulation_txt"] = std::make_shared<Modulation>(config.hidden_size, true, !config.disable_bias);
                blocks["single_stream_modulation"]     = std::make_shared<Modulation>(config.hidden_size, false, !config.disable_bias);
            }
+
+            if (config.pulid_enabled) {
+                int num_double_ca = (config.depth + config.pulid_double_interval - 1) / config.pulid_double_interval;
+                int num_single_ca = (config.depth_single_blocks + config.pulid_single_interval - 1) / config.pulid_single_interval;
+                int num_ca        = num_double_ca + num_single_ca;
+                for (int i = 0; i < num_ca; i++) {
+                    blocks["pulid_ca." + std::to_string(i)] =
+                        std::shared_ptr<GGMLBlock>(new PuLIDPerceiverAttentionCA(
+                            /*dim=*/config.hidden_size,
+                            /*dim_head=*/PuLIDPerceiverAttentionCA::DEFAULT_DIM_HEAD,
+                            /*heads=*/PuLIDPerceiverAttentionCA::DEFAULT_HEADS,
+                            /*kv_dim=*/PuLIDPerceiverAttentionCA::DEFAULT_KV_DIM));
+                }
+            }
        }

        ggml_tensor* forward_orig(GGMLRunnerContext* ctx,
@ -967,7 +989,9 @@ namespace Flux {
                                  ggml_tensor* guidance,
                                  ggml_tensor* pe,
                                  ggml_tensor* mod_index_arange = nullptr,
-                                  std::vector<int> skip_layers  = {}) {
+                                  std::vector<int> skip_layers  = {},
+                                  ggml_tensor* pulid_id         = nullptr,
+                                  float pulid_id_weight         = 1.0f) {
            auto img_in      = std::dynamic_pointer_cast<Linear>(blocks["img_in"]);
            auto txt_in      = std::dynamic_pointer_cast<Linear>(blocks["txt_in"]);
            auto final_layer = std::dynamic_pointer_cast<LastLayer>(blocks["final_layer"]);
@ -1044,6 +1068,13 @@ namespace Flux {
            sd::ggml_graph_cut::mark_graph_cut(txt, "flux.prelude", "txt");
            sd::ggml_graph_cut::mark_graph_cut(vec, "flux.prelude", "vec");

+            const bool pulid_active = config.pulid_enabled && pulid_id != nullptr;
+            if (pulid_active && !skip_layers.empty()) {
+                LOG_WARN("PuLID + skip_layers is not supported; disabling PuLID for this generation.");
+            }
+            const bool pulid_run = pulid_active && skip_layers.empty();
+            int ca_idx           = 0;
+
            for (int i = 0; i < config.depth; i++) {
                if (skip_layers.size() > 0 && std::find(skip_layers.begin(), skip_layers.end(), i) != skip_layers.end()) {
                    continue;
@ -1056,9 +1087,19 @@ namespace Flux {
                txt          = img_txt.second;  // [N, n_txt_token, hidden_size]
                sd::ggml_graph_cut::mark_graph_cut(img, "flux.double_blocks." + std::to_string(i), "img");
                sd::ggml_graph_cut::mark_graph_cut(txt, "flux.double_blocks." + std::to_string(i), "txt");
+
+                if (pulid_run && (i % config.pulid_double_interval == 0)) {
+                    auto pulid_ca = std::dynamic_pointer_cast<PuLIDPerceiverAttentionCA>(
+                        blocks["pulid_ca." + std::to_string(ca_idx)]);
+                    ggml_tensor* ca_out = pulid_ca->forward(ctx, pulid_id, img);  // [N, n_img_token, hidden_size]
+                    img                 = ggml_add(ctx->ggml_ctx, img, ggml_scale(ctx->ggml_ctx, ca_out, pulid_id_weight));
+                    sd::ggml_graph_cut::mark_graph_cut(img, "flux.pulid_ca." + std::to_string(ca_idx), "img");
+                    ca_idx++;
+                }
            }

-            auto txt_img = ggml_concat(ctx->ggml_ctx, txt, img, 1);  // [N, n_txt_token + n_img_token, hidden_size]
+            auto txt_img            = ggml_concat(ctx->ggml_ctx, txt, img, 1);  // [N, n_txt_token + n_img_token, hidden_size]
+            const int64_t n_txt_tok = txt->ne[1];
            for (int i = 0; i < config.depth_single_blocks; i++) {
                if (skip_layers.size() > 0 && std::find(skip_layers.begin(), skip_layers.end(), i + config.depth) != skip_layers.end()) {
                    continue;
@ -1067,6 +1108,29 @@ namespace Flux {

                txt_img = block->forward(ctx, txt_img, vec, pe, txt_img_mask, ss_mods);
                sd::ggml_graph_cut::mark_graph_cut(txt_img, "flux.single_blocks." + std::to_string(i), "txt_img");
+
+                if (pulid_run && (i % config.pulid_single_interval == 0)) {
+                    auto pulid_ca = std::dynamic_pointer_cast<PuLIDPerceiverAttentionCA>(
+                        blocks["pulid_ca." + std::to_string(ca_idx)]);
+                    ggml_tensor* txt_part = ggml_view_3d(ctx->ggml_ctx, txt_img,
+                                                         txt_img->ne[0], n_txt_tok, txt_img->ne[2],
+                                                         txt_img->nb[1], txt_img->nb[2],
+                                                         0);
+                    ggml_tensor* img_part = ggml_view_3d(ctx->ggml_ctx, txt_img,
+                                                         txt_img->ne[0],
+                                                         txt_img->ne[1] - n_txt_tok,
+                                                         txt_img->ne[2],
+                                                         txt_img->nb[1],
+                                                         txt_img->nb[2],
+                                                         n_txt_tok * txt_img->nb[1]);
+                    txt_part              = ggml_cont(ctx->ggml_ctx, txt_part);
+                    img_part              = ggml_cont(ctx->ggml_ctx, img_part);
+                    ggml_tensor* ca_out   = pulid_ca->forward(ctx, pulid_id, img_part);
+                    img_part              = ggml_add(ctx->ggml_ctx, img_part, ggml_scale(ctx->ggml_ctx, ca_out, pulid_id_weight));
+                    txt_img               = ggml_concat(ctx->ggml_ctx, txt_part, img_part, 1);
+                    sd::ggml_graph_cut::mark_graph_cut(txt_img, "flux.pulid_ca." + std::to_string(ca_idx), "txt_img");
+                    ca_idx++;
+                }
            }

            img = ggml_view_3d(ctx->ggml_ctx,
@ -1105,7 +1169,9 @@ namespace Flux {
                                             ggml_tensor* mod_index_arange         = nullptr,
                                             ggml_tensor* dct                      = nullptr,
                                             std::vector<ggml_tensor*> ref_latents = {},
-                                             std::vector<int> skip_layers          = {}) {
+                                             std::vector<int> skip_layers          = {},
+                                             ggml_tensor* pulid_id                 = nullptr,
+                                             float pulid_id_weight                 = 1.0f) {
            GGML_ASSERT(x->ne[3] == 1);

            int64_t W      = x->ne[0];
@ -1131,7 +1197,8 @@ namespace Flux {
            img = ggml_reshape_3d(ctx->ggml_ctx, img, img->ne[0] * img->ne[1], img->ne[2], img->ne[3]);  // [N, hidden_size, H/patch_size*W/patch_size]
            img = ggml_cont(ctx->ggml_ctx, ggml_ext_torch_permute(ctx->ggml_ctx, img, 1, 0, 2, 3));      // [N, H/patch_size*W/patch_size, hidden_size]

-            auto out = forward_orig(ctx, img, context, timestep, y, guidance, pe, mod_index_arange, skip_layers);  // [N, n_img_token, hidden_size]
+            auto out = forward_orig(ctx, img, context, timestep, y, guidance, pe, mod_index_arange, skip_layers,
+                                    pulid_id, pulid_id_weight);  // [N, n_img_token, hidden_size]

            // nerf decode
            auto nerf_image_embedder   = std::dynamic_pointer_cast<NerfEmbedder>(blocks["nerf_image_embedder"]);
@ -1179,7 +1246,9 @@ namespace Flux {
                                         ggml_tensor* mod_index_arange         = nullptr,
                                         ggml_tensor* dct                      = nullptr,
                                         std::vector<ggml_tensor*> ref_latents = {},
-                                         std::vector<int> skip_layers          = {}) {
+                                         std::vector<int> skip_layers          = {},
+                                         ggml_tensor* pulid_id                 = nullptr,
+                                         float pulid_id_weight                 = 1.0f) {
            GGML_ASSERT(x->ne[3] == 1);

            int64_t W      = x->ne[0];
@ -1226,7 +1295,8 @@ namespace Flux {
                }
            }

-            auto out = forward_orig(ctx, img, context, timestep, y, guidance, pe, mod_index_arange, skip_layers);  // [N, num_tokens, C * patch_size * patch_size]
+            auto out = forward_orig(ctx, img, context, timestep, y, guidance, pe, mod_index_arange, skip_layers,
+                                    pulid_id, pulid_id_weight);  // [N, num_tokens, C * patch_size * patch_size]

            if (out->ne[1] > img_tokens) {
                out = ggml_view_3d(ctx->ggml_ctx, out, out->ne[0], img_tokens, out->ne[2], out->nb[1], out->nb[2], 0);
@ -1248,7 +1318,9 @@ namespace Flux {
                             ggml_tensor* mod_index_arange         = nullptr,
                             ggml_tensor* dct                      = nullptr,
                             std::vector<ggml_tensor*> ref_latents = {},
-                             std::vector<int> skip_layers          = {}) {
+                             std::vector<int> skip_layers          = {},
+                             ggml_tensor* pulid_id                 = nullptr,
+                             float pulid_id_weight                 = 1.0f) {
            // Forward pass of DiT.
            // x: (N, C, H, W) tensor of spatial inputs (images or latent representations of images)
            // timestep: (N,) tensor of diffusion timesteps
@ -1271,7 +1343,9 @@ namespace Flux {
                                               mod_index_arange,
                                               dct,
                                               ref_latents,
-                                               skip_layers);
+                                               skip_layers,
+                                               pulid_id,
+                                               pulid_id_weight);
            } else {
                return forward_flux_chroma(ctx,
                                           x,
@ -1284,7 +1358,9 @@ namespace Flux {
                                           mod_index_arange,
                                           dct,
                                           ref_latents,
-                                           skip_layers);
+                                           skip_layers,
+                                           pulid_id,
+                                           pulid_id_weight);
            }
        }
    };
@ -1384,7 +1460,9 @@ namespace Flux {
                                 const sd::Tensor<float>& guidance_tensor                 = {},
                                 const std::vector<sd::Tensor<float>>& ref_latents_tensor = {},
                                 bool increase_ref_index                                  = false,
-                                 std::vector<int> skip_layers                             = {}) {
+                                 std::vector<int> skip_layers                             = {},
+                                 const sd::Tensor<float>& pulid_id_tensor                 = {},
+                                 float pulid_id_weight                                    = 1.0f) {
            ggml_tensor* x         = make_input(x_tensor);
            ggml_tensor* timesteps = make_input(timesteps_tensor);
            ggml_tensor* context   = make_optional_input(context_tensor);
@ -1461,6 +1539,10 @@ namespace Flux {
                set_backend_tensor_data(dct, dct_vec.data());
            }

+            ggml_tensor* pulid_id = pulid_id_tensor.empty()
+                                        ? nullptr
+                                        : make_input(pulid_id_tensor);
+
            auto runner_ctx = get_context();

            ggml_tensor* out = flux.forward(&runner_ctx,
@ -1474,7 +1556,9 @@ namespace Flux {
                                            mod_index_arange,
                                            dct,
                                            ref_latents,
-                                            skip_layers);
+                                            skip_layers,
+                                            pulid_id,
+                                            pulid_id_weight);

            ggml_build_forward_expand(gf, out);

@ -1490,14 +1574,17 @@ namespace Flux {
                                  const sd::Tensor<float>& guidance                 = {},
                                  const std::vector<sd::Tensor<float>>& ref_latents = {},
                                  bool increase_ref_index                           = false,
-                                  std::vector<int> skip_layers                      = std::vector<int>()) {
+                                  std::vector<int> skip_layers                      = std::vector<int>(),
+                                  const sd::Tensor<float>& pulid_id                 = {},
+                                  float pulid_id_weight                             = 1.0f) {
            // x: [N, in_channels, h, w]
            // timesteps: [N, ]
            // context: [N, max_position, hidden_size]
            // y: [N, adm_in_channels] or [1, adm_in_channels]
            // guidance: [N, ]
+            // pulid_id: empty (no injection) or [N, num_id_tokens=32, kv_dim=2048]
            auto get_graph = [&]() -> ggml_cgraph* {
-                return build_graph(x, timesteps, context, c_concat, y, guidance, ref_latents, increase_ref_index, skip_layers);
+                return build_graph(x, timesteps, context, c_concat, y, guidance, ref_latents, increase_ref_index, skip_layers, pulid_id, pulid_id_weight);
            };

            auto result = restore_trailing_singleton_dims(GGMLRunner::compute<float>(get_graph, n_threads, false, false, false), x.dim());
@ -1520,7 +1607,9 @@ namespace Flux {
                           tensor_or_empty(extra->guidance),
                           diffusion_params.ref_latents ? *diffusion_params.ref_latents : empty_ref_latents,
                           diffusion_params.increase_ref_index,
-                           extra->skip_layers ? *extra->skip_layers : empty_skip_layers);
+                           extra->skip_layers ? *extra->skip_layers : empty_skip_layers,
+                           tensor_or_empty(extra->pulid_id),
+                           extra->pulid_id_weight);
        }

        void test() {
--- a/src/model/diffusion/model.hpp
+++ b/src/model/diffusion/model.hpp
@ -22,6 +22,8 @@ struct SkipLayerDiffusionExtra {
 struct FluxDiffusionExtra {
    const sd::Tensor<float>* guidance   = nullptr;
    const std::vector<int>* skip_layers = nullptr;
+    const sd::Tensor<float>* pulid_id   = nullptr;
+    float pulid_id_weight               = 1.0f;
 };

 struct AnimaDiffusionExtra {
--- a/src/model/te/llm.hpp
+++ b/src/model/te/llm.hpp
@ -79,6 +79,7 @@ namespace LLM {
        int window_size                     = 112;
        int num_position_embeddings         = 0;
        std::set<int> fullatt_block_indexes = {7, 15, 23, 31};
+        bool split_patch_embed              = false;
    };

    struct LLMConfig {
@ -179,7 +180,8 @@ namespace LLM {
                config.num_experts_per_tok     = 4;
            }

-            config.num_layers = 0;
+            config.num_layers          = 0;
+            int detected_vision_layers = 0;
            for (const auto& [name, tensor_storage] : tensor_storage_map) {
                if (!starts_with(name, prefix)) {
                    continue;
@ -190,6 +192,38 @@ namespace LLM {
                    if (contains(name, "attn.q_proj")) {
                        config.llama_cpp_style = true;
                    }
+                    if (contains(name, "visual.patch_embed.proj.1.weight")) {
+                        config.vision.split_patch_embed = true;
+                    }
+                    if (contains(name, "visual.patch_embed.proj.0.weight")) {
+                        config.vision.patch_size  = static_cast<int>(tensor_storage.ne[0]);
+                        config.vision.in_channels = tensor_storage.ne[2];
+                        config.vision.hidden_size = tensor_storage.ne[3];
+                    }
+                    if (contains(name, "visual.patch_embed.bias")) {
+                        config.vision.hidden_size = tensor_storage.ne[0];
+                    }
+                    if (contains(name, "visual.pos_embed.weight")) {
+                        config.vision.hidden_size             = tensor_storage.ne[0];
+                        config.vision.num_position_embeddings = static_cast<int>(tensor_storage.ne[1]);
+                    }
+                    if (contains(name, "visual.blocks.")) {
+                        auto items = split_string(name.substr(pos), '.');
+                        if (items.size() > 2) {
+                            int block_index = atoi(items[2].c_str());
+                            if (block_index + 1 > detected_vision_layers) {
+                                detected_vision_layers = block_index + 1;
+                            }
+                        }
+                    }
+                    if (contains(name, "visual.blocks.0.mlp.linear_fc1.weight") ||
+                        contains(name, "visual.blocks.0.mlp.gate_proj.weight")) {
+                        config.vision.intermediate_size = tensor_storage.ne[1];
+                    }
+                    if (contains(name, "visual.merger.linear_fc2.weight") ||
+                        contains(name, "visual.merger.mlp.2.weight")) {
+                        config.vision.out_hidden_size = tensor_storage.ne[1];
+                    }
                    continue;
                }
                pos = name.find("layers.");
@ -219,6 +253,9 @@ namespace LLM {
            if (arch == LLMArch::QWEN3 && config.num_layers == 28) {
                config.num_heads = 16;
            }
+            if (detected_vision_layers > 0) {
+                config.vision.num_layers = detected_vision_layers;
+            }
            LOG_DEBUG("llm: num_layers = %" PRId64 ", vocab_size = %" PRId64 ", hidden_size = %" PRId64 ", intermediate_size = %" PRId64,
                      config.num_layers,
                      config.vocab_size,
@ -539,40 +576,51 @@ namespace LLM {

    struct VisionPatchEmbed : public GGMLBlock {
    protected:
-        bool llama_cpp_style;
+        bool split_patch_embed;
+        bool bias;
        int patch_size;
        int temporal_patch_size;
        int64_t in_channels;
        int64_t embed_dim;

+        void init_params(ggml_context* ctx,
+                         const String2TensorStorage& tensor_storage_map = {},
+                         const std::string prefix                       = "") override {
+            GGML_UNUSED(tensor_storage_map);
+            GGML_UNUSED(prefix);
+            if (split_patch_embed && bias) {
+                params["bias"] = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, embed_dim);
+            }
+        }
+
    public:
-        VisionPatchEmbed(bool llama_cpp_style,
+        VisionPatchEmbed(bool split_patch_embed,
                         LLMVisionArch arch,
                         int patch_size          = 14,
                         int temporal_patch_size = 2,
                         int64_t in_channels     = 3,
                         int64_t embed_dim       = 1152)
-            : llama_cpp_style(llama_cpp_style),
+            : split_patch_embed(split_patch_embed),
+              bias(arch == LLMVisionArch::QWEN3_VL),
              patch_size(patch_size),
              temporal_patch_size(temporal_patch_size),
              in_channels(in_channels),
              embed_dim(embed_dim) {
-            bool bias = arch == LLMVisionArch::QWEN3_VL;
-            if (llama_cpp_style) {
+            if (split_patch_embed) {
                blocks["proj.0"] = std::shared_ptr<GGMLBlock>(new Conv2d(in_channels,
                                                                         embed_dim,
                                                                         {patch_size, patch_size},
                                                                         {patch_size, patch_size},
                                                                         {0, 0},
                                                                         {1, 1},
-                                                                         bias));
+                                                                         false));
                blocks["proj.1"] = std::shared_ptr<GGMLBlock>(new Conv2d(in_channels,
                                                                         embed_dim,
                                                                         {patch_size, patch_size},
                                                                         {patch_size, patch_size},
                                                                         {0, 0},
                                                                         {1, 1},
-                                                                         bias));
+                                                                         false));
            } else {
                std::tuple<int, int, int> kernel_size = {(int)temporal_patch_size, (int)patch_size, (int)patch_size};
                blocks["proj"]                        = std::shared_ptr<GGMLBlock>(new Conv3d(in_channels,
@ -593,7 +641,7 @@ namespace LLM {
                                temporal_patch_size,
                                ggml_nelements(x) / (temporal_patch_size * patch_size * patch_size));

-            if (llama_cpp_style) {
+            if (split_patch_embed) {
                auto proj_0 = std::dynamic_pointer_cast<Conv2d>(blocks["proj.0"]);
                auto proj_1 = std::dynamic_pointer_cast<Conv2d>(blocks["proj.1"]);

@ -606,6 +654,10 @@ namespace LLM {
                x1      = proj_1->forward(ctx, x1);

                x = ggml_add(ctx->ggml_ctx, x0, x1);
+                if (bias) {
+                    auto b = ggml_reshape_4d(ctx->ggml_ctx, params["bias"], 1, 1, embed_dim, 1);
+                    x      = ggml_add_inplace(ctx->ggml_ctx, x, b);
+                }
            } else {
                auto proj = std::dynamic_pointer_cast<Conv3d>(blocks["proj"]);

@ -798,7 +850,7 @@ namespace LLM {
              spatial_merge_size(vision_params.spatial_merge_size),
              num_grid_per_side(vision_params.num_position_embeddings > 0 ? static_cast<int>(std::sqrt(vision_params.num_position_embeddings)) : 0),
              fullatt_block_indexes(vision_params.fullatt_block_indexes) {
-            blocks["patch_embed"] = std::shared_ptr<GGMLBlock>(new VisionPatchEmbed(llama_cpp_style,
+            blocks["patch_embed"] = std::shared_ptr<GGMLBlock>(new VisionPatchEmbed(vision_params.split_patch_embed,
                                                                                    arch_,
                                                                                    vision_params.patch_size,
                                                                                    vision_params.temporal_patch_size,
--- a/src/model/vae/auto_encoder_kl.hpp
+++ b/src/model/vae/auto_encoder_kl.hpp
@ -682,7 +682,7 @@ struct AutoEncoderKL : public VAE {
        } else if (sd_version_is_sd3(version)) {
            scale_factor = 1.5305f;
            shift_factor = 0.0609f;
-        } else if (sd_version_is_flux(version) || sd_version_is_z_image(version) || sd_version_is_longcat(version)) {
+        } else if (sd_version_uses_flux_vae(version)) {
            scale_factor = 0.3611f;
            shift_factor = 0.1159f;
        } else if (sd_version_uses_flux2_vae(version)) {
--- a/src/model_loader.cpp
+++ b/src/model_loader.cpp
@ -485,6 +485,9 @@ SDVersion ModelLoader::get_sd_version() {
        if (tensor_storage.name.find("model.diffusion_model.cap_embedder.0.weight") != std::string::npos) {
            return VERSION_Z_IMAGE;
        }
+        if (tensor_storage.name.find("double_stream_layers.0.img_instruct_attn.processor.img_to_q.weight") != std::string::npos) {
+            return VERSION_BOOGU_IMAGE;
+        }
        if (tensor_storage.name.find("model.diffusion_model.layers.0.adaLN_sa_ln.weight") != std::string::npos) {
            return VERSION_ERNIE_IMAGE;
        }
@ -1002,6 +1005,7 @@ bool ModelLoader::load_tensors(on_new_tensor_cb_t on_new_tensor_cb,
        std::atomic<size_t> tensor_idx(0);
        std::atomic<bool> failed(false);
        std::vector<std::thread> workers;
+        std::mutex rpc_backend_mutex;

        for (int i = 0; i < n_threads; ++i) {
            workers.emplace_back([&, file_path, is_zip]() {
@ -1158,7 +1162,19 @@ bool ModelLoader::load_tensors(on_new_tensor_cb_t on_new_tensor_cb,

                    if (dst_tensor->buffer != nullptr && !ggml_backend_buffer_is_host(dst_tensor->buffer)) {
                        t0 = ggml_time_ms();
-                        ggml_backend_tensor_set(dst_tensor, convert_buf, 0, ggml_nbytes(dst_tensor));
+
+                        // Loader threads reach this point concurrently; serialize uploads to RPC buffers with a mutex.
+                        const char* buffer_type_name = ggml_backend_buft_name(ggml_backend_buffer_get_type(dst_tensor->buffer));
+                        bool is_rpc_buffer           = buffer_type_name != nullptr &&
+                                             std::string(buffer_type_name).find("RPC") != std::string::npos;
+
+                        if (is_rpc_buffer) {
+                            std::lock_guard<std::mutex> lock(rpc_backend_mutex);
+                            ggml_backend_tensor_set(dst_tensor, convert_buf, 0, ggml_nbytes(dst_tensor));
+                        } else {
+                            ggml_backend_tensor_set(dst_tensor, convert_buf, 0, ggml_nbytes(dst_tensor));
+                        }
+
                        t1 = ggml_time_ms();
                        copy_to_backend_time_ms.fetch_add(t1 - t0);
                    }
--- a/src/model_manager.cpp
+++ b/src/model_manager.cpp
@ -147,6 +147,17 @@ bool ModelManager::register_param_tensors(const std::string& desc,
    return true;
 }

+bool ModelManager::load_all_params_eagerly() {
+    std::vector<TensorState*> all_states;
+    all_states.reserve(tensor_states_.size());
+    for (const auto& s : tensor_states_) {
+        if (s != nullptr) {
+            all_states.push_back(s.get());
+        }
+    }
+    return load_tensors_to_params_backend(all_states);
+}
+
 bool ModelManager::validate_registered_tensors() {
    bool ok = true;
    for (const auto& state : tensor_states_) {
@ -469,7 +480,7 @@ bool ModelManager::mmap_params(const std::vector<TensorState*>& states,
        return true;
    }

-    auto mmap_store = model_loader_.mmap_tensors(mmap_candidates, {}, true);
+    auto mmap_store = model_loader_.mmap_tensors(mmap_candidates, {}, writable_mmap_);
    if (mmap_store.empty()) {
        return true;
    }
@ -577,13 +588,8 @@ bool ModelManager::alloc_params_buffers(const std::vector<TensorState*>& states,
        for (TensorState* state : states) {
            ggml_tensor* tensor = state->tensor;
            size_t tensor_size  = GGML_PAD(ggml_backend_buft_get_alloc_size(params_buft, tensor), alignment);
-            if (max_size > 0 && tensor_size > max_size) {
-                LOG_ERROR("model manager tensor '%s' is too large for params buffer: %zu > %zu",
-                          ggml_get_name(tensor),
-                          tensor_size,
-                          max_size);
-                return false;
-            }
+            // Some backends, e.g. Vulkan, report a preferred chunk size here rather than a
+            // hard per-tensor allocation limit. Oversized tensors are allocated alone.
            if (!chunk.empty() && max_size > 0 && chunk_size + tensor_size > max_size) {
                if (!alloc_chunk(chunk, chunk_size)) {
                    return false;
--- a/src/model_manager.h
+++ b/src/model_manager.h
@ -69,6 +69,7 @@ private:
    uint64_t current_lora_epoch_ = 0;
    int n_threads_               = 0;
    bool enable_mmap_            = false;
+    bool writable_mmap_          = false;

    void finish_compute_backend_usage(const std::vector<TensorState*>& states);
    void release_all();
@ -110,6 +111,7 @@ public:
        model_loader_.set_n_threads(n_threads);
    }
    void set_enable_mmap(bool enable_mmap) { enable_mmap_ = enable_mmap; }
+    void set_writable_mmap(bool writable_mmap) { writable_mmap_ = writable_mmap; }
    void set_common_ignore_tensors(std::set<std::string> ignore_tensors);
    void set_loras(std::vector<LoraSpec> loras, SDVersion version);

@ -158,6 +160,7 @@ public:
    }

    bool validate_registered_tensors();
+    bool load_all_params_eagerly();

    bool prepare_params(const std::vector<ggml_tensor*>& tensors) override;
    void release_compute_backend_params(const std::vector<ggml_tensor*>& tensors) override;
--- a/src/name_conversion.cpp
+++ b/src/name_conversion.cpp
@ -184,6 +184,27 @@ std::string convert_cond_stage_model_name(std::string name, std::string prefix)
    return name;
 }

+std::string convert_qwen3_vl_vision_name(std::string name) {
+    static const std::vector<std::pair<std::string, std::string>> qwen3_vl_vision_name_map{
+        {"mm.0.", "merger.linear_fc1."},
+        {"mm.2.", "merger.linear_fc2."},
+        {"v.post_ln.", "merger.norm."},
+        {"v.position_embd.weight", "pos_embed.weight"},
+        {"v.patch_embd.weight.1", "patch_embed.proj.1.weight"},
+        {"v.patch_embd.weight", "patch_embed.proj.0.weight"},
+        {"v.patch_embd.bias", "patch_embed.bias"},
+        {"v.blk.", "blocks."},
+        {"attn_qkv.", "attn.qkv."},
+        {"attn_out.", "attn.proj."},
+        {"ffn_up.", "mlp.linear_fc1."},
+        {"ffn_down.", "mlp.linear_fc2."},
+        {"ln1.", "norm1."},
+        {"ln2.", "norm2."},
+    };
+    replace_with_name_map(name, qwen3_vl_vision_name_map);
+    return name;
+}
+
 // ref: https://github.com/huggingface/diffusers/blob/main/scripts/convert_diffusers_to_original_stable_diffusion.py
 std::string convert_diffusers_unet_to_original_sd1(std::string name) {
    // (stable-diffusion, HF Diffusers)
@ -1154,6 +1175,10 @@ std::string convert_tensor_name(std::string name, SDVersion version) {

    replace_with_prefix_map(name, prefix_map);

+    if (sd_version_is_boogu_image(version) && starts_with(name, "text_encoders.llm.visual.")) {
+        name = convert_qwen3_vl_vision_name(std::move(name));
+    }
+
    // diffusion model
    {
        for (const auto& prefix : diffuison_model_prefix_vec) {
--- a/src/runtime/guidance.cpp
+++ b/src/runtime/guidance.cpp
@ -3,6 +3,7 @@
 #include <algorithm>
 #include <cmath>
 #include <cstdlib>
+#include <optional>
 #include <string>
 #include <utility>

@ -63,6 +64,82 @@ namespace sd::guidance {
        return uncond;
    }

+    std::vector<float> parse_guidance_schedule_from_spec(std::string spec) {
+        std::vector<float> schedule;
+
+        while (!spec.empty()) {
+            auto sep     = spec.find('+');
+            auto segment = spec.substr(0, sep);
+
+            auto x = segment.find('x');
+            if (x == std::string::npos) {
+                LOG_ERROR("Invalid guidance schedule segment: '%s' (expected <guidance>x<count>)", segment.c_str());
+                return {};
+            }
+
+            float guidance;
+            int count;
+
+            auto guidance_str = segment.substr(0, x);
+            auto count_str    = segment.substr(x + 1);
+
+            try {
+                size_t idx = 0;
+                guidance   = std::stof(guidance_str, &idx);
+                if (idx != guidance_str.size()) {
+                    LOG_ERROR("Invalid guidance value in guidance schedule: '%s'", guidance_str.c_str());
+                    return {};
+                }
+            } catch (const std::exception&) {
+                LOG_ERROR("Invalid guidance value in guidance schedule: '%s'", guidance_str.c_str());
+                return {};
+            }
+
+            try {
+                size_t idx = 0;
+                count      = std::stoi(count_str, &idx);
+                if (idx != count_str.size()) {
+                    LOG_ERROR("Invalid count in guidance schedule: '%s'", count_str.c_str());
+                    return {};
+                }
+            } catch (const std::exception&) {
+                LOG_ERROR("Invalid count in guidance schedule: '%s'", count_str.c_str());
+                return {};
+            }
+
+            if (count <= 0) {
+                LOG_ERROR("Guidance schedule count must be positive");
+                return {};
+            }
+
+            schedule.insert(schedule.end(), count, guidance);
+
+            if (sep == std::string::npos) {
+                break;
+            }
+
+            spec = spec.substr(sep + 1);
+        }
+
+        return schedule;
+    }
+
+    std::vector<float> parse_guidance_schedule(const char* extra_sample_args) {
+        std::vector<float> guidance_schedule;
+        std::string guidance_schedule_str = "";
+        for (const auto& [key, value] : parse_key_value_args(extra_sample_args, "extra sample arg")) {
+            // Only the guidance_schedule key is consumed here; other keys are ignored.
+            if (key == "guidance_schedule") {
+                guidance_schedule_str = value;
+            }
+        }
+
+        if (!guidance_schedule_str.empty()) {
+            guidance_schedule = parse_guidance_schedule_from_spec(guidance_schedule_str);
+        }
+        return guidance_schedule;
+    }
+
    ClassifierFreeGuidance::ClassifierFreeGuidance(float guidance_scale,
                                                   float image_guidance_scale)
        : guidance_scale_(guidance_scale),
@ -70,8 +147,10 @@ namespace sd::guidance {
    }

    GuiderOutput ClassifierFreeGuidance::forward(const GuidanceInput& input,
-                                                 GuiderOutput previous) const {
+                                                 GuiderOutput previous,
+                                                 std::optional<float> scale_override) const {
        (void)previous;
+        float guidance_scale = scale_override.value_or(guidance_scale_);

        GuiderOutput output;
        if (!has_tensor(input.pred_cond)) {
@ -86,14 +165,14 @@ namespace sd::guidance {
                const sd::Tensor<float>& pred_img_uncond = *input.pred_img_uncond;
                output.pred                              = pred_img_uncond +
                              image_guidance_scale_ * (pred_uncond - pred_img_uncond) +
-                              guidance_scale_ * (pred_cond - pred_uncond);
+                              guidance_scale * (pred_cond - pred_uncond);

            } else {
-                output.pred = pred_uncond + guidance_scale_ * (pred_cond - pred_uncond);
+                output.pred = pred_uncond + guidance_scale * (pred_cond - pred_uncond);
            }
        } else if (has_tensor(input.pred_img_uncond)) {
            const sd::Tensor<float>& pred_img_uncond = *input.pred_img_uncond;
-            output.pred                              = pred_img_uncond + guidance_scale_ * (pred_cond - pred_img_uncond);
+            output.pred                              = pred_img_uncond + guidance_scale * (pred_cond - pred_img_uncond);
        }

        return output;
@ -128,8 +207,10 @@ namespace sd::guidance {
    }

    GuiderOutput AdaptiveProjectedGuidance::forward(const GuidanceInput& input,
-                                                    GuiderOutput previous) const {
+                                                    GuiderOutput previous,
+                                                    std::optional<float> scale_override) const {
        (void)previous;
+        float guidance_scale = scale_override.value_or(guidance_scale_);

        GuiderOutput output;
        if (!has_tensor(input.pred_cond)) {
@ -144,13 +225,13 @@ namespace sd::guidance {
                const sd::Tensor<float>& pred_img_uncond = *input.pred_img_uncond;
                output.pred                              = pred_img_uncond +
                              image_guidance_scale_ * (pred_uncond - pred_img_uncond) +
-                              guidance_scale_ * (pred_cond - pred_uncond);
+                              guidance_scale * (pred_cond - pred_uncond);
            } else {
-                output.pred = pred_uncond + guidance_scale_ * (pred_cond - pred_uncond);
+                output.pred = pred_uncond + guidance_scale * (pred_cond - pred_uncond);
            }
        } else if (has_tensor(input.pred_img_uncond)) {
            const sd::Tensor<float>& pred_img_uncond = *input.pred_img_uncond;
-            output.pred                              = pred_img_uncond + guidance_scale_ * (pred_cond - pred_img_uncond);
+            output.pred                              = pred_img_uncond + guidance_scale * (pred_cond - pred_img_uncond);
        }
        if (!has_tensor(input.pred_uncond) && !has_tensor(input.pred_img_uncond)) {
            return output;
@ -162,7 +243,7 @@ namespace sd::guidance {
        sd::Tensor<float> deltas = calculate_guidance_delta(pred_cond,
                                                            pred_uncond,
                                                            pred_img_uncond,
-                                                            guidance_scale_,
+                                                            guidance_scale,
                                                            image_guidance_scale_);
        if (params_.momentum != 0.0f) {
            if (momentum_buffer_.shape() != deltas.shape()) {
@ -239,7 +320,8 @@ namespace sd::guidance {
    }

    GuiderOutput SkipLayerGuidance::forward(const GuidanceInput& input,
-                                            GuiderOutput output) const {
+                                            GuiderOutput output,
+                                            std::optional<float> /*scale_override*/) const {
        if (scale_ == 0.0f || !is_enabled_for_step(input) || !input.predict_skip_layer) {
            return output;
        }
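Review note: the `guidance_schedule` spec parsed above has the form `<guidance>x<count>` with segments joined by `+`, so `7.5x2+3x1` expands to `7.5, 7.5, 3`. Below is a minimal standalone sketch of the same expansion, assuming nothing from the repository: `expand_guidance_spec` is a hypothetical name, and it returns `std::nullopt` on malformed input where the patch logs an error and returns an empty vector.

```cpp
#include <cassert>
#include <cstddef>
#include <exception>
#include <optional>
#include <string>
#include <vector>

// Hypothetical helper (not part of the patch): expands "<guidance>x<count>+..."
// into one guidance value per step, or std::nullopt if any segment is malformed.
std::optional<std::vector<float>> expand_guidance_spec(const std::string& spec) {
    std::vector<float> schedule;
    std::size_t pos = 0;
    while (pos < spec.size()) {
        // Cut out the next '+'-separated segment.
        const std::size_t sep     = spec.find('+', pos);
        const std::size_t len     = (sep == std::string::npos) ? std::string::npos : sep - pos;
        const std::string segment = spec.substr(pos, len);

        // Each segment must contain an 'x' between the value and the repeat count.
        const std::size_t x = segment.find('x');
        if (x == std::string::npos) {
            return std::nullopt;
        }
        assert(x < segment.size());
        const std::string guidance_str = segment.substr(0, x);
        const std::string count_str    = segment.substr(x + 1);

        try {
            std::size_t idx      = 0;
            const float guidance = std::stof(guidance_str, &idx);
            if (idx != guidance_str.size()) {
                return std::nullopt;  // trailing garbage after the number
            }
            const int count = std::stoi(count_str, &idx);
            if (idx != count_str.size() || count <= 0) {
                return std::nullopt;  // count must be a positive integer
            }
            schedule.insert(schedule.end(), static_cast<std::size_t>(count), guidance);
        } catch (const std::exception&) {
            return std::nullopt;  // std::stof / std::stoi rejected the text
        }

        if (sep == std::string::npos) {
            break;
        }
        pos = sep + 1;
    }
    return schedule;
}
```

As in the patch, an empty spec yields an empty schedule and not an error, and a trailing `+` is tolerated.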
--- a/src/runtime/guidance.h
+++ b/src/runtime/guidance.h
@ -3,6 +3,7 @@

 #include <cstddef>
 #include <functional>
+#include <optional>
 #include <vector>

 #include "core/tensor.hpp"
@ -27,6 +28,7 @@ namespace sd::guidance {
    AdaptiveProjectedGuidanceParams parse_adaptive_projected_guidance_args(const char* extra_sample_args);
    bool is_adaptive_projected_guidance_enabled(const AdaptiveProjectedGuidanceParams& params);
    bool parse_skip_layer_guidance_uncond_arg(const char* extra_sample_args);
+    std::vector<float> parse_guidance_schedule(const char* extra_sample_args);

    struct GuidanceInput {
        int step                                 = 0;
@ -40,9 +42,10 @@ namespace sd::guidance {

    class BaseGuidance {
    public:
-        virtual ~BaseGuidance()                                   = default;
+        virtual ~BaseGuidance()                                                                = default;
        virtual GuiderOutput forward(const GuidanceInput& input,
-                                     GuiderOutput previous) const = 0;
+                                     GuiderOutput previous,
+                                     std::optional<float> scale_override = std::nullopt) const = 0;
    };

    class ClassifierFreeGuidance : public BaseGuidance {
@ -54,7 +57,8 @@ namespace sd::guidance {
                               float image_guidance_scale);

        GuiderOutput forward(const GuidanceInput& input,
-                             GuiderOutput previous) const override;
+                             GuiderOutput previous,
+                             std::optional<float> scale_override = std::nullopt) const override;
    };

    class AdaptiveProjectedGuidance : public BaseGuidance {
@ -69,7 +73,8 @@ namespace sd::guidance {
                                  AdaptiveProjectedGuidanceParams params);

        GuiderOutput forward(const GuidanceInput& input,
-                             GuiderOutput previous) const override;
+                             GuiderOutput previous,
+                             std::optional<float> scale_override = std::nullopt) const override;
    };

    class SkipLayerGuidance : public BaseGuidance {
@ -88,7 +93,8 @@ namespace sd::guidance {
        const std::vector<int>& layers() const;

        GuiderOutput forward(const GuidanceInput& input,
-                             GuiderOutput previous) const override;
+                             GuiderOutput previous,
+                             std::optional<float> scale_override = std::nullopt) const override;
    };

 }  // namespace sd::guidance
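Review note: the `scale_override` parameter added to `forward` lets a caller substitute a per-step guidance scale and falls back to the configured scale when it is `std::nullopt`. The sketch below illustrates that pattern on scalars and also shows one way to fit a schedule to the step count. `ToyGuider` and `fit_schedule` are hypothetical names introduced for illustration and are not part of the patch; the real guiders operate on `sd::Tensor<float>`.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical scalar stand-in for a guider with an optional per-call scale.
struct ToyGuider {
    float base_scale;

    // Classifier-free guidance on scalars: uncond + s * (cond - uncond),
    // where s is the override if present, otherwise the configured scale.
    float forward(float cond, float uncond,
                  std::optional<float> scale_override = std::nullopt) const {
        const float s = scale_override.value_or(base_scale);
        return uncond + s * (cond - uncond);
    }
};

// Hypothetical helper: truncates or pads a non-empty schedule to n_steps,
// padding with cfg_scale. An empty schedule means "no schedule" and is kept empty.
std::vector<float> fit_schedule(std::vector<float> schedule,
                                std::size_t n_steps,
                                float cfg_scale) {
    if (schedule.empty()) {
        return schedule;
    }
    schedule.resize(n_steps, cfg_scale);  // shrinks, or appends copies of cfg_scale
    assert(schedule.size() == n_steps);
    return schedule;
}
```

One C++ detail worth keeping in mind when reviewing the header: default arguments of virtual functions are chosen from the static type of the call expression, so repeating the same `= std::nullopt` default on every override, as the patch does, keeps calls through `BaseGuidance` and through the concrete classes consistent.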
--- a/src/stable-diffusion.cpp
+++ b/src/stable-diffusion.cpp
@ -3,6 +3,7 @@
 #include <cstdlib>
 #include <set>
 #include <unordered_set>
+#include <vector>

 #include "core/ggml_extend.hpp"
 #include "core/ggml_graph_cut.h"
@ -19,6 +20,7 @@
 #include "extensions/generation_extension.h"
 #include "model/adapter/lora.hpp"
 #include "model/diffusion/anima.hpp"
+#include "model/diffusion/boogu.hpp"
 #include "model/diffusion/control.hpp"
 #include "model/diffusion/ernie_image.hpp"
 #include "model/diffusion/flux.hpp"
@ -52,6 +54,8 @@
 const char* sd_vae_format_name(enum sd_vae_format_t format);
 static SDVersion sd_vae_format_to_version(enum sd_vae_format_t format, SDVersion fallback);

+#include <atomic>
+
 const char* model_version_to_str[] = {
    "SD 1.x",
    "SD 1.x Inpaint",
@ -84,6 +88,7 @@ const char* model_version_to_str[] = {
    "LTXAV",
    "HiDream O1",
    "Z-Image",
+    "Boogu Image",
    "Ovis Image",
    "Ernie Image",
    "Lens",
@ -121,7 +126,8 @@ static bool sd_version_supports_ref_latent_img_cfg(SDVersion version) {
           sd_version_is_flux2(version) ||
           sd_version_is_qwen_image(version) ||
           sd_version_is_longcat(version) ||
-           sd_version_is_z_image(version);
+           sd_version_is_z_image(version) ||
+           sd_version_is_boogu_image(version);
 }

 static bool sd_version_supports_img_cfg(SDVersion version, bool has_ref_images) {
@ -158,6 +164,9 @@ static float get_cache_reuse_threshold(const sd_cache_params_t& params) {

 /*=============================================== StableDiffusionGGML ================================================*/

+static_assert(std::atomic<sd_cancel_mode_t>::is_always_lock_free,
+              "sd_cancel_mode_t must be lock-free");
+
 class StableDiffusionGGML {
 public:
    SDBackendManager backend_manager;
@ -187,10 +196,10 @@ public:

    std::string taesd_path;
    sd_tiling_params_t vae_tiling_params = {false, false, 0, 0, 0.5f, 0, 0, nullptr};
-    bool offload_params_to_cpu           = false;
    bool enable_mmap                     = false;
-    float max_vram                       = 0.f;
-    bool stream_layers                   = false;
+    sd::ggml_graph_cut::MaxVramAssignment max_vram_assignment;
+    bool stream_layers = false;
+    bool eager_load    = false;
    std::string backend_spec;
    std::string params_backend_spec;

@ -222,6 +231,24 @@ public:
        return module_backend;
    }

+    std::atomic<sd_cancel_mode_t> cancellation_flag = SD_CANCEL_RESET;
+
+    void set_cancel_flag(enum sd_cancel_mode_t flag) {
+        cancellation_flag.store(flag, std::memory_order_release);
+    }
+
+    void reset_cancel_flag() {
+        set_cancel_flag(SD_CANCEL_RESET);
+    }
+
+    enum sd_cancel_mode_t get_cancel_flag() {
+        return cancellation_flag.load(std::memory_order_acquire);
+    }
+
+    size_t max_graph_vram_bytes_for_module(SDBackendModule module) {
+        return max_vram_assignment.bytes_for_backend(backend_for(module));
+    }
+
    bool ensure_backend_pair(SDBackendModule module) {
        if (backend_for(module) == nullptr) {
            return false;
@ -250,13 +277,10 @@ public:
                                                     params_mem_size);
    }

-    bool init_backend(const sd_ctx_params_t* sd_ctx_params) {
+    bool init_backend() {
        std::string error;
-        if (!backend_manager.init(sd_ctx_params->backend,
+        if (!backend_manager.init(backend_spec.c_str(),
                                  params_backend_spec.c_str(),
-                                  sd_ctx_params->keep_clip_on_cpu,
-                                  sd_ctx_params->keep_vae_on_cpu,
-                                  sd_ctx_params->keep_control_net_on_cpu,
                                  &error)) {
            LOG_ERROR("backend config failed: %s", error.c_str());
            return false;
@ -316,21 +340,24 @@ public:
    }

    bool init(const sd_ctx_params_t* sd_ctx_params) {
-        n_threads             = sd_ctx_params->n_threads;
-        offload_params_to_cpu = sd_ctx_params->offload_params_to_cpu;
-        enable_mmap           = sd_ctx_params->enable_mmap;
-        max_vram              = sd_ctx_params->max_vram;
-        stream_layers         = sd_ctx_params->stream_layers;
-        backend_spec          = SAFE_STR(sd_ctx_params->backend);
-        params_backend_spec   = SAFE_STR(sd_ctx_params->params_backend);
-        if (offload_params_to_cpu) {
-            params_backend_spec = params_backend_spec.empty() ? "*=cpu" : "*=cpu," + params_backend_spec;
-        }
-        if (stream_layers && max_vram == 0.f) {
-            LOG_WARN("--stream-layers has no effect without --max-vram set; ignoring");
-            stream_layers = false;
+        n_threads           = sd_ctx_params->n_threads;
+        enable_mmap         = sd_ctx_params->enable_mmap;
+        stream_layers       = sd_ctx_params->stream_layers;
+        eager_load          = sd_ctx_params->eager_load;
+        backend_spec        = SAFE_STR(sd_ctx_params->backend);
+        params_backend_spec = SAFE_STR(sd_ctx_params->params_backend);
+        max_vram_assignment.reset(0.f);
+        {
+            std::string error;
+            if (!max_vram_assignment.parse(SAFE_STR(sd_ctx_params->max_vram), &error)) {
+                LOG_ERROR("%s", error.c_str());
+                return false;
+            }
        }

+        std::string rpc_servers_spec = SAFE_STR(sd_ctx_params->rpc_servers);
+        add_rpc_devices(rpc_servers_spec);
+
        bool use_tae         = false;
        bool use_audio_vae   = false;
        bool use_control_net = false;
@ -344,14 +371,20 @@ public:

        ggml_log_set(ggml_log_callback_default, nullptr);

-        if (!init_backend(sd_ctx_params)) {
+        if (!init_backend()) {
            return false;
        }
+        {
+            std::string error;
+            if (!max_vram_assignment.canonicalize_backend_keys(&error)) {
+                LOG_ERROR("%s", error.c_str());
+                return false;
+            }
+        }
        if (stream_layers && !backend_manager.params_backend_is_cpu(SDBackendModule::DIFFUSION)) {
            LOG_WARN("--stream-layers has no effect unless diffusion params backend is cpu; ignoring");
            stream_layers = false;
        }
-        max_vram = sd::ggml_graph_cut::resolve_max_vram_gib(max_vram, backend_for(SDBackendModule::DIFFUSION));

        model_manager = std::make_shared<ModelManager>();
        model_manager->set_n_threads(n_threads);
@ -419,6 +452,14 @@ public:
            }
        }

+        if (strlen(SAFE_STR(sd_ctx_params->pulid_weights_path)) > 0) {
+            LOG_INFO("loading PuLID weights from '%s'", sd_ctx_params->pulid_weights_path);
+            if (!model_loader.init_from_file(sd_ctx_params->pulid_weights_path,
+                                             "model.diffusion_model.")) {
+                LOG_WARN("loading PuLID weights from '%s' failed", sd_ctx_params->pulid_weights_path);
+            }
+        }
+
        if (strlen(SAFE_STR(sd_ctx_params->llm_path)) > 0) {
            LOG_INFO("loading llm from '%s'", sd_ctx_params->llm_path);
            if (!model_loader.init_from_file(sd_ctx_params->llm_path, "text_encoders.llm.")) {
@ -486,14 +527,11 @@ public:
        auto& tensor_storage_map = model_loader.get_tensor_storage_map();

        LOG_INFO("Version: %s ", model_version_to_str[version]);
-        ggml_type wtype               = (int)sd_ctx_params->wtype < std::min<int>(SD_TYPE_COUNT, GGML_TYPE_COUNT)
-                                            ? (ggml_type)sd_ctx_params->wtype
-                                            : GGML_TYPE_COUNT;
+        ggml_type wtype               = sd_type_to_ggml_type(sd_ctx_params->wtype);
        std::string tensor_type_rules = SAFE_STR(sd_ctx_params->tensor_type_rules);
        if (wtype != GGML_TYPE_COUNT || tensor_type_rules.size() > 0) {
            model_loader.set_wtype_override(wtype, tensor_type_rules);
        }
-        model_loader.process_model_files(enable_mmap, true);

        std::map<ggml_type, uint32_t> wtype_stat                 = model_loader.get_wtype_stat();
        std::map<ggml_type, uint32_t> conditioner_wtype_stat     = model_loader.get_conditioner_wtype_stat();
@ -534,8 +572,8 @@ public:
                }
            }
            // Avoid full-model LoRA merge buffers on constrained setups.
-            const bool streaming_constrained = stream_layers ||
-                                               sd_ctx_params->offload_params_to_cpu;
+            const bool params_offloaded      = params_backend_for(SDBackendModule::DIFFUSION) != backend_for(SDBackendModule::DIFFUSION);
+            const bool streaming_constrained = stream_layers || params_offloaded;
            if (have_quantized_weight || streaming_constrained) {
                apply_lora_immediately = false;
            } else {
@ -547,9 +585,12 @@ public:
            apply_lora_immediately = false;
        }

+        bool needs_writable_mmap = enable_mmap && apply_lora_immediately;
+        model_manager->set_writable_mmap(needs_writable_mmap);
        if (enable_mmap && apply_lora_immediately) {
            LOG_WARN("in mode 'immediately', LoRAs will cause extra memory usage with mmap");
        }
+        model_loader.process_model_files(enable_mmap, needs_writable_mmap);
        load_alphas_cumprod(model_loader);

        size_t text_encoder_params_mem_size = 0;
@ -568,8 +609,6 @@ public:
            LOG_INFO("Using circular padding for convolutions");
        }

-        const size_t max_graph_vram_bytes = sd::ggml_graph_cut::max_vram_gib_to_bytes(max_vram);
-
        {
            if (!ensure_backend_pair(SDBackendModule::TE) ||
                !ensure_backend_pair(SDBackendModule::DIFFUSION)) {
@ -691,7 +730,7 @@ public:
                    clip_vision = std::make_shared<FrozenCLIPVisionEmbedder>(backend_for(SDBackendModule::CLIP_VISION),
                                                                             tensor_storage_map,
                                                                             model_manager);
-                    clip_vision->set_max_graph_vram_bytes(max_graph_vram_bytes);
+                    clip_vision->set_max_graph_vram_bytes(max_graph_vram_bytes_for_module(SDBackendModule::CLIP_VISION));
                    if (!register_runner_params("CLIP vision",
                                                clip_vision,
                                                SDBackendModule::CLIP_VISION)) {
@ -752,6 +791,18 @@ public:
                                                                         "model.diffusion_model",
                                                                         version,
                                                                         model_manager);
+            } else if (sd_version_is_boogu_image(version)) {
+                cond_stage_model = std::make_shared<LLMEmbedder>(backend_for(SDBackendModule::TE),
+                                                                 tensor_storage_map,
+                                                                 version,
+                                                                 "",
+                                                                 true,
+                                                                 model_manager);
+                diffusion_model  = std::make_shared<Boogu::BooguImageRunner>(backend_for(SDBackendModule::DIFFUSION),
+                                                                            tensor_storage_map,
+                                                                            "model.diffusion_model",
+                                                                            version,
+                                                                            model_manager);
            } else if (sd_version_is_ernie_image(version)) {
                cond_stage_model = std::make_shared<LLMEmbedder>(backend_for(SDBackendModule::TE),
                                                                 tensor_storage_map,
@ -795,7 +846,7 @@ public:
                }
            }

-            cond_stage_model->set_max_graph_vram_bytes(max_graph_vram_bytes);
+            cond_stage_model->set_max_graph_vram_bytes(max_graph_vram_bytes_for_module(SDBackendModule::TE));
            if (!register_runner_params("Conditioner model",
                                        cond_stage_model,
                                        SDBackendModule::TE,
@ -803,7 +854,7 @@ public:
                return false;
            }

-            diffusion_model->set_max_graph_vram_bytes(max_graph_vram_bytes);
+            diffusion_model->set_max_graph_vram_bytes(max_graph_vram_bytes_for_module(SDBackendModule::DIFFUSION));
            diffusion_model->set_stream_layers_enabled(stream_layers);
            if (!register_runner_params("Diffusion model",
                                        diffusion_model,
@ -813,7 +864,7 @@ public:
            }

            if (high_noise_diffusion_model) {
-                high_noise_diffusion_model->set_max_graph_vram_bytes(max_graph_vram_bytes);
+                high_noise_diffusion_model->set_max_graph_vram_bytes(max_graph_vram_bytes_for_module(SDBackendModule::DIFFUSION));
                high_noise_diffusion_model->set_stream_layers_enabled(stream_layers);
                if (!register_runner_params("High noise diffusion model",
                                            high_noise_diffusion_model,
@ -912,7 +963,7 @@ public:
            } else if (use_tae && !tae_preview_only) {
                LOG_INFO("using TAE for encoding / decoding");
                first_stage_model = create_tae(false);
-                first_stage_model->set_max_graph_vram_bytes(max_graph_vram_bytes);
+                first_stage_model->set_max_graph_vram_bytes(max_graph_vram_bytes_for_module(SDBackendModule::VAE));
                if (!register_runner_params("VAE",
                                            first_stage_model,
                                            SDBackendModule::VAE,
@ -922,7 +973,7 @@ public:
            } else {
                LOG_INFO("using VAE for encoding / decoding");
                first_stage_model = create_vae();
-                first_stage_model->set_max_graph_vram_bytes(max_graph_vram_bytes);
+                first_stage_model->set_max_graph_vram_bytes(max_graph_vram_bytes_for_module(SDBackendModule::VAE));
                if (!register_runner_params("VAE",
                                            first_stage_model,
                                            SDBackendModule::VAE,
@ -932,7 +983,7 @@ public:
                if (use_tae && tae_preview_only) {
                    LOG_INFO("using TAE for preview");
                    preview_vae = create_tae(true);
-                    preview_vae->set_max_graph_vram_bytes(max_graph_vram_bytes);
+                    preview_vae->set_max_graph_vram_bytes(max_graph_vram_bytes_for_module(SDBackendModule::VAE));
                    if (!register_runner_params("preview VAE",
                                                preview_vae,
                                                SDBackendModule::VAE,
@ -1005,6 +1056,14 @@ public:
                if (photomaker_extension->is_enabled()) {
                    generation_extensions.push_back(photomaker_extension);
                }
+
+                auto pulid_extension = create_pulid_extension();
+                if (!pulid_extension->init(extension_ctx)) {
+                    return false;
+                }
+                if (pulid_extension->is_enabled()) {
+                    generation_extensions.push_back(pulid_extension);
+                }
            }
            for (auto& extension : generation_extensions) {
                if (!register_runner_params(extension->name(),
@ -1098,7 +1157,15 @@ public:
            return false;
        }

-        LOG_DEBUG("model metadata validated; weights will be prepared lazily");
+        if (eager_load) {
+            if (!model_manager->load_all_params_eagerly()) {
+                LOG_ERROR("model params eager load failed");
+                return false;
+            }
+            LOG_DEBUG("model metadata validated; weights pre-loaded to params backend");
+        } else {
+            LOG_DEBUG("model metadata validated; weights will be prepared lazily");
+        }

        {
            size_t total_params_ram_size  = 0;
@ -1180,6 +1247,7 @@ public:
                           sd_version_is_anima(version) ||
                           sd_version_is_ernie_image(version) ||
                           sd_version_is_z_image(version) ||
+                           sd_version_is_boogu_image(version) ||
                           sd_version_is_pid(version) ||
                           sd_version_is_ideogram4(version)) {
                    pred_type = FLOW_PRED;
@ -1191,6 +1259,8 @@ public:
                        default_flow_shift = 1.5f;
                    } else if (sd_version_is_ideogram4(version)) {
                        default_flow_shift = 1.0f;
+                    } else if (sd_version_is_boogu_image(version)) {
+                        default_flow_shift = 3.16f;
                    } else {
                        default_flow_shift = 3.f;
                    }
@ -1515,6 +1585,7 @@ public:
    }

    void prepare_generation_extensions(const sd_pm_params_t& pm_params,
+                                       const sd_pulid_params_t& pulid_params,
                                       ConditionerParams& condition_params,
                                       int total_steps) {
        reset_generation_extensions();
@ -1522,6 +1593,7 @@ public:
            cond_stage_model.get(),
            condition_params,
            pm_params,
+            pulid_params,
            n_threads,
            total_steps,
        };
@ -1649,7 +1721,7 @@ public:
                if (sd_version_is_sd3(version)) {
                    latent_rgb_proj = sd3_latent_rgb_proj;
                    latent_rgb_bias = sd3_latent_rgb_bias;
-                } else if (sd_version_is_flux(version) || sd_version_is_z_image(version) || sd_version_is_longcat(version)) {
+                } else if (sd_version_uses_flux_vae(version)) {
                    latent_rgb_proj = flux_latent_rgb_proj;
                    latent_rgb_bias = flux_latent_rgb_bias;
                } else if (sd_version_is_wan(version) || sd_version_is_qwen_image(version) || sd_version_is_anima(version)) {
@ -1744,6 +1816,9 @@ public:
        if (sd_version_is_anima(version)) {
            return std::vector<float>{t / static_cast<float>(TIMESTEPS)};
        }
+        if (sd_version_is_boogu_image(version)) {
+            return std::vector<float>{t / static_cast<float>(TIMESTEPS)};
+        }
        if (version == VERSION_HIDREAM_O1) {
            return std::vector<float>{1.0f - (t / static_cast<float>(TIMESTEPS))};
        }
@ -1869,6 +1944,32 @@ public:
        float slg_scale     = guidance.slg.scale;
        bool slg_uncond     = sd::guidance::parse_skip_layer_guidance_uncond_arg(extra_sample_args);

+        std::vector<float> guidance_schedule = sd::guidance::parse_guidance_schedule(extra_sample_args);
+        if (!guidance_schedule.empty() && guidance_schedule.size() != sigmas.size() - 1) {
+            if (guidance_schedule.size() > sigmas.size() - 1) {
+                LOG_WARN("guidance_schedule length (%zu) is greater than number of steps (%zu)", guidance_schedule.size(), sigmas.size() - 1);
+                LOG_WARN("truncating guidance_schedule to match step count");
+                guidance_schedule.resize(sigmas.size() - 1);
+            } else {
+                LOG_INFO("padding guidance_schedule with cfg_scale");
+                while (guidance_schedule.size() < sigmas.size() - 1) {
+                    guidance_schedule.push_back(cfg_scale);
+                }
+            }
+        }
+
+        if (!guidance_schedule.empty()) {
+            std::string schedule_str = "[";
+            for (size_t i = 0; i < guidance_schedule.size(); ++i) {
+                schedule_str += std::to_string(guidance_schedule[i]);
+                if (i < guidance_schedule.size() - 1) {
+                    schedule_str += ", ";
+                }
+            }
+            schedule_str += "]";
+            LOG_DEBUG("using guidance schedule: %s", schedule_str.c_str());
+        }
+
        sd_sample::SampleCacheRuntime cache_runtime = sd_sample::init_sample_cache_runtime(version,
                                                                                           cache_params,
                                                                                           denoiser.get(),
@ -1916,6 +2017,11 @@ public:
        SamplePreviewContext preview = prepare_sample_preview_context();

        auto denoise = [&](const sd::Tensor<float>& x, float sigma, int step) -> sd::guidance::GuiderOutput {
+            if (get_cancel_flag() == SD_CANCEL_ALL) {
+                LOG_DEBUG("cancelling generation");
+                return {};
+            }
+
            if (step == 1 || step == -1) {
                pretty_progress(0, (int)steps, 0);
                last_progress_us = ggml_time_us();
@ -2036,6 +2142,10 @@ public:
                    return std::move(cached_output);
                }

+                for (const auto& extension : generation_extensions) {
+                    extension->before_diffusion(diffusion_params, step);
+                }
+
                auto output_opt = work_diffusion_model->compute(n_threads, diffusion_params);
                if (output_opt.empty()) {
                    LOG_ERROR("diffusion model compute failed");
@ -2100,7 +2210,7 @@ public:
            guidance_input.pred_uncond     = uncond_out.empty() ? nullptr : &uncond_out;
            guidance_input.pred_img_uncond = img_uncond_out.empty() ? nullptr : &img_uncond_out;

-            sd::guidance::GuiderOutput guided = primary_guidance.forward(guidance_input, {});
+            sd::guidance::GuiderOutput guided = guidance_schedule.empty() ? primary_guidance.forward(guidance_input, {}) : primary_guidance.forward(guidance_input, {}, guidance_schedule[guidance_schedule.size() - 1 - step]);
            if (guided.pred.empty()) {
                return {};
            }
@ -2615,29 +2725,28 @@ void sd_hires_params_init(sd_hires_params_t* hires_params) {
 }

 void sd_ctx_params_init(sd_ctx_params_t* sd_ctx_params) {
-    *sd_ctx_params                         = {};
-    sd_ctx_params->n_threads               = sd_get_num_physical_cores();
-    sd_ctx_params->wtype                   = SD_TYPE_COUNT;
-    sd_ctx_params->rng_type                = CUDA_RNG;
-    sd_ctx_params->sampler_rng_type        = RNG_TYPE_COUNT;
-    sd_ctx_params->prediction              = PREDICTION_COUNT;
-    sd_ctx_params->lora_apply_mode         = LORA_APPLY_AUTO;
-    sd_ctx_params->offload_params_to_cpu   = false;
-    sd_ctx_params->max_vram                = 0.f;
-    sd_ctx_params->stream_layers           = false;
-    sd_ctx_params->enable_mmap             = false;
-    sd_ctx_params->keep_clip_on_cpu        = false;
-    sd_ctx_params->keep_control_net_on_cpu = false;
-    sd_ctx_params->keep_vae_on_cpu         = false;
-    sd_ctx_params->diffusion_flash_attn    = false;
-    sd_ctx_params->circular_x              = false;
-    sd_ctx_params->circular_y              = false;
-    sd_ctx_params->chroma_use_dit_mask     = true;
-    sd_ctx_params->chroma_use_t5_mask      = false;
-    sd_ctx_params->chroma_t5_mask_pad      = 1;
-    sd_ctx_params->vae_format              = SD_VAE_FORMAT_AUTO;
-    sd_ctx_params->backend                 = nullptr;
-    sd_ctx_params->params_backend          = nullptr;
+    *sd_ctx_params                      = {};
+    sd_ctx_params->n_threads            = sd_get_num_physical_cores();
+    sd_ctx_params->wtype                = SD_TYPE_COUNT;
+    sd_ctx_params->rng_type             = CUDA_RNG;
+    sd_ctx_params->sampler_rng_type     = RNG_TYPE_COUNT;
+    sd_ctx_params->prediction           = PREDICTION_COUNT;
+    sd_ctx_params->lora_apply_mode      = LORA_APPLY_AUTO;
+    sd_ctx_params->max_vram             = nullptr;
+    sd_ctx_params->stream_layers        = false;
+    sd_ctx_params->eager_load           = false;
+    sd_ctx_params->enable_mmap          = false;
+    sd_ctx_params->diffusion_flash_attn = false;
+    sd_ctx_params->circular_x           = false;
+    sd_ctx_params->circular_y           = false;
+    sd_ctx_params->chroma_use_dit_mask  = true;
+    sd_ctx_params->chroma_use_t5_mask   = false;
+    sd_ctx_params->chroma_t5_mask_pad   = 1;
+    sd_ctx_params->vae_format           = SD_VAE_FORMAT_AUTO;
+    sd_ctx_params->backend              = nullptr;
+    sd_ctx_params->params_backend       = nullptr;
+    sd_ctx_params->rpc_servers          = nullptr;
+    sd_ctx_params->pulid_weights_path   = nullptr;
 }

 char* sd_ctx_params_to_str(const sd_ctx_params_t* sd_ctx_params) {
@ -2663,20 +2772,18 @@ char* sd_ctx_params_to_str(const sd_ctx_params_t* sd_ctx_params) {
             "taesd_path: %s\n"
             "control_net_path: %s\n"
             "photo_maker_path: %s\n"
+             "pulid_weights_path: %s\n"
             "tensor_type_rules: %s\n"
             "n_threads: %d\n"
             "wtype: %s\n"
             "rng_type: %s\n"
             "sampler_rng_type: %s\n"
             "prediction: %s\n"
-             "offload_params_to_cpu: %s\n"
-             "max_vram: %.3f\n"
+             "max_vram: %s\n"
             "stream_layers: %s\n"
+             "eager_load: %s\n"
             "backend: %s\n"
             "params_backend: %s\n"
-             "keep_clip_on_cpu: %s\n"
-             "keep_control_net_on_cpu: %s\n"
-             "keep_vae_on_cpu: %s\n"
             "flash_attn: %s\n"
             "diffusion_flash_attn: %s\n"
             "circular_x: %s\n"
@ -2701,20 +2808,18 @@ char* sd_ctx_params_to_str(const sd_ctx_params_t* sd_ctx_params) {
             SAFE_STR(sd_ctx_params->taesd_path),
             SAFE_STR(sd_ctx_params->control_net_path),
             SAFE_STR(sd_ctx_params->photo_maker_path),
+             SAFE_STR(sd_ctx_params->pulid_weights_path),
             SAFE_STR(sd_ctx_params->tensor_type_rules),
             sd_ctx_params->n_threads,
             sd_type_name(sd_ctx_params->wtype),
             sd_rng_type_name(sd_ctx_params->rng_type),
             sd_rng_type_name(sd_ctx_params->sampler_rng_type),
             sd_prediction_name(sd_ctx_params->prediction),
-             BOOL_STR(sd_ctx_params->offload_params_to_cpu),
-             sd_ctx_params->max_vram,
+             SAFE_STR(sd_ctx_params->max_vram),
             BOOL_STR(sd_ctx_params->stream_layers),
+             BOOL_STR(sd_ctx_params->eager_load),
             SAFE_STR(sd_ctx_params->backend),
             SAFE_STR(sd_ctx_params->params_backend),
-             BOOL_STR(sd_ctx_params->keep_clip_on_cpu),
-             BOOL_STR(sd_ctx_params->keep_control_net_on_cpu),
-             BOOL_STR(sd_ctx_params->keep_vae_on_cpu),
             BOOL_STR(sd_ctx_params->flash_attn),
             BOOL_STR(sd_ctx_params->diffusion_flash_attn),
             BOOL_STR(sd_ctx_params->circular_x),
@ -2799,6 +2904,7 @@ void sd_img_gen_params_init(sd_img_gen_params_t* sd_img_gen_params) {
    sd_img_gen_params->batch_count       = 1;
    sd_img_gen_params->control_strength  = 0.9f;
    sd_img_gen_params->pm_params         = {nullptr, 0, nullptr, 20.f};
+    sd_img_gen_params->pulid_params      = {nullptr, 1.0f};
    sd_img_gen_params->vae_tiling_params = {false, false, 0, 0, 0.5f, 0.0f, 0.0f, nullptr};
    sd_cache_params_init(&sd_img_gen_params->cache);
    sd_hires_params_init(&sd_img_gen_params->hires);
@ -2941,6 +3047,15 @@ void free_sd_ctx(sd_ctx_t* sd_ctx) {
    free(sd_ctx);
 }

+SD_API void sd_cancel_generation(sd_ctx_t* sd_ctx, enum sd_cancel_mode_t mode) {
+    if (sd_ctx && sd_ctx->sd) {
+        if (mode < SD_CANCEL_ALL || mode > SD_CANCEL_RESET) {
+            mode = SD_CANCEL_ALL;
+        }
+        sd_ctx->sd->set_cancel_flag(mode);
+    }
+}
+
 static sd_audio_t* waveform_to_sd_audio(const StableDiffusionGGML* sd,
                                        const sd::Tensor<float>& waveform) {
    if (sd == nullptr || waveform.empty()) {
@ -3100,6 +3215,7 @@ struct GenerationRequest {
    sd_guidance_params_t guidance            = {};
    sd_guidance_params_t high_noise_guidance = {};
    sd_pm_params_t pm_params                 = {};
+    sd_pulid_params_t pulid_params           = {};
    sd_hires_params_t hires                  = {};
    int frames                               = -1;
    int requested_frames                     = -1;
@ -3125,6 +3241,7 @@ struct GenerationRequest {
        has_ref_images              = sd_img_gen_params->ref_images_count > 0;
        guidance                    = sd_img_gen_params->sample_params.guidance;
        pm_params                   = sd_img_gen_params->pm_params;
+        pulid_params                = sd_img_gen_params->pulid_params;
        hires                       = sd_img_gen_params->hires;
        cache_params                = &sd_img_gen_params->cache;
        resolve(sd_ctx);
@ -4051,6 +4168,7 @@ static std::optional<ImageGenerationEmbeds> prepare_image_generation_embeds(sd_c
    condition_params.ref_images = &latents->ref_images;

    sd_ctx->sd->prepare_generation_extensions(request->pm_params,
+                                              request->pulid_params,
                                              condition_params,
                                              plan->total_steps);
    int64_t prepare_start_ms         = ggml_time_ms();
@ -4125,15 +4243,29 @@ static std::optional<ImageGenerationEmbeds> prepare_image_generation_embeds(sd_c
 static sd_image_t* decode_image_outputs(sd_ctx_t* sd_ctx,
                                        const GenerationRequest& request,
                                        const std::vector<sd::Tensor<float>>& final_latents) {
-    if (final_latents.size() != static_cast<size_t>(request.batch_count)) {
-        LOG_ERROR("expected %d latents, got %zu", request.batch_count, final_latents.size());
+    if (final_latents.empty()) {
+        LOG_ERROR("no latent images to decode");
        return nullptr;
    }
-    LOG_INFO("decoding %zu latents", final_latents.size());
+    if (final_latents.size() > static_cast<size_t>(request.batch_count)) {
+        LOG_ERROR("expected at most %d latents, got %zu", request.batch_count, final_latents.size());
+        return nullptr;
+    }
+    if (final_latents.size() < static_cast<size_t>(request.batch_count)) {
+        LOG_INFO("decoding %zu/%d latents", final_latents.size(), request.batch_count);
+    } else {
+        LOG_INFO("decoding %zu latents", final_latents.size());
+    }
    std::vector<sd::Tensor<float>> decoded_images;
-    int64_t t0 = ggml_time_ms();
+    int64_t t0     = ggml_time_ms();
+    bool cancelled = false;

    for (size_t i = 0; i < final_latents.size(); i++) {
+        if (sd_ctx->sd->get_cancel_flag() == SD_CANCEL_ALL) {
+            LOG_ERROR("cancelling latent decodings");
+            cancelled = true;
+            break;
+        }
        int64_t t1              = ggml_time_ms();
        sd::Tensor<float> image = sd_ctx->sd->decode_first_stage(final_latents[i]);
        if (image.empty()) {
@ -4147,6 +4279,10 @@ static sd_image_t* decode_image_outputs(sd_ctx_t* sd_ctx,

    int64_t t4 = ggml_time_ms();
    LOG_INFO("decode_first_stage completed, taking %.2fs", (t4 - t0) * 1.0f / 1000);
+    if (decoded_images.empty()) {
+        LOG_ERROR(cancelled ? "cancelled before any latent images were decoded" : "no decoded images");
+        return nullptr;
+    }

    sd_image_t* result_images = (sd_image_t*)calloc(request.batch_count, sizeof(sd_image_t));
    if (result_images == nullptr) {
@ -4165,6 +4301,11 @@ static sd::Tensor<float> upscale_hires_latent(sd_ctx_t* sd_ctx,
                                              const sd::Tensor<float>& latent,
                                              const GenerationRequest& request,
                                              UpscalerGGML* upscaler) {
+    if (sd_ctx->sd->get_cancel_flag() == SD_CANCEL_ALL) {
+        LOG_ERROR("cancelling hires latent upscale");
+        return {};
+    }
+
    auto get_hires_latent_target_shape = [&]() {
        std::vector<int64_t> target_shape = latent.shape();
        if (target_shape.size() < 2) {
@ -4237,6 +4378,10 @@ static sd::Tensor<float> upscale_hires_latent(sd_ctx_t* sd_ctx,
                      sd_hires_upscaler_name(request.hires.upscaler));
            return {};
        }
+        if (sd_ctx->sd->get_cancel_flag() == SD_CANCEL_ALL) {
+            LOG_ERROR("cancelling hires image upscale");
+            return {};
+        }

        sd::Tensor<float> upscaled_tensor;
        if (request.hires.upscaler == SD_HIRES_UPSCALER_MODEL) {
@ -4273,6 +4418,10 @@ static sd::Tensor<float> upscale_hires_latent(sd_ctx_t* sd_ctx,
            upscaled_tensor = sd::ops::clamp(upscaled_tensor, 0.0f, 1.0f);
        }

+        if (sd_ctx->sd->get_cancel_flag() == SD_CANCEL_ALL) {
+            LOG_ERROR("cancelling hires latent encode");
+            return {};
+        }
        sd::Tensor<float> upscaled_latent = sd_ctx->sd->encode_first_stage(upscaled_tensor);
        if (upscaled_latent.empty()) {
            LOG_ERROR("encode_first_stage failed after hires %s upscale",
@ -4337,6 +4486,8 @@ SD_API sd_image_t* generate_image(sd_ctx_t* sd_ctx, const sd_img_gen_params_t* s
        return nullptr;
    }

+    sd_ctx->sd->reset_cancel_flag();
+
    int64_t t0                    = ggml_time_ms();
    sd_ctx->sd->vae_tiling_params = sd_img_gen_params->vae_tiling_params;
    GenerationRequest request(sd_ctx, sd_img_gen_params);
@ -4372,6 +4523,18 @@ SD_API sd_image_t* generate_image(sd_ctx_t* sd_ctx, const sd_img_gen_params_t* s
    std::vector<sd::Tensor<float>> final_latents;
    int64_t denoise_start = ggml_time_ms();
    for (int b = 0; b < request.batch_count; b++) {
+        sd_cancel_mode_t cancel = sd_ctx->sd->get_cancel_flag();
+        if (cancel == SD_CANCEL_ALL) {
+            LOG_ERROR("cancelling generation");
+            return nullptr;
+        }
+        if (cancel == SD_CANCEL_NEW_LATENTS) {
+            LOG_INFO("cancelling new latent generation, returning %zu/%d completed latents",
+                     final_latents.size(),
+                     request.batch_count);
+            break;
+        }
+
        int64_t sampling_start = ggml_time_ms();
        int64_t cur_seed       = request.seed + b;
        LOG_INFO("generating image: %i/%i - seed %" PRId64, b + 1, request.batch_count, cur_seed);
@ -4421,22 +4584,33 @@ SD_API sd_image_t* generate_image(sd_ctx_t* sd_ctx, const sd_img_gen_params_t* s
    LOG_INFO("generating %zu latent images completed, taking %.2fs",
             final_latents.size(),
             (denoise_end - denoise_start) * 1.0f / 1000);
+    if (final_latents.empty()) {
+        LOG_ERROR("no latent images generated");
+        return nullptr;
+    }

    if (request.hires.enabled && request.hires.target_width > 0) {
+        if (sd_ctx->sd->get_cancel_flag() == SD_CANCEL_ALL) {
+            LOG_ERROR("cancelling generation before hires fix");
+            return nullptr;
+        }
        LOG_INFO("hires fix: upscaling to %dx%d", request.hires.target_width, request.hires.target_height);

        std::unique_ptr<UpscalerGGML> hires_upscaler;
        if (request.hires.upscaler == SD_HIRES_UPSCALER_MODEL) {
+            if (sd_ctx->sd->get_cancel_flag() == SD_CANCEL_ALL) {
+                LOG_ERROR("cancelling generation before hires model load");
+                return nullptr;
+            }
            LOG_INFO("hires fix: loading model upscaler from '%s'", request.hires.model_path);
            hires_upscaler                    = std::make_unique<UpscalerGGML>(sd_ctx->sd->n_threads,
                                                            false,
                                                            request.hires.upscale_tile_size,
                                                            sd_ctx->sd->backend_spec,
                                                            sd_ctx->sd->params_backend_spec);
-            const size_t max_graph_vram_bytes = sd::ggml_graph_cut::max_vram_gib_to_bytes(sd_ctx->sd->max_vram);
+            const size_t max_graph_vram_bytes = sd_ctx->sd->max_graph_vram_bytes_for_module(SDBackendModule::UPSCALER);
            hires_upscaler->set_max_graph_vram_bytes(max_graph_vram_bytes);
            if (!hires_upscaler->load_from_file(request.hires.model_path,
-                                                sd_ctx->sd->offload_params_to_cpu,
                                                sd_ctx->sd->n_threads)) {
                LOG_ERROR("load hires model upscaler failed");
                return nullptr;
@ -4461,6 +4635,10 @@ SD_API sd_image_t* generate_image(sd_ctx_t* sd_ctx, const sd_img_gen_params_t* s
        std::vector<sd::Tensor<float>> hires_final_latents;
        int64_t hires_denoise_start = ggml_time_ms();
        for (int b = 0; b < (int)final_latents.size(); b++) {
+            if (sd_ctx->sd->get_cancel_flag() == SD_CANCEL_ALL) {
+                LOG_ERROR("cancelling generation during hires fix");
+                return nullptr;
+            }
            int64_t cur_seed = request.seed + b;
            sd_ctx->sd->rng->manual_seed(cur_seed);
            sd_ctx->sd->sampler_rng->manual_seed(cur_seed);
@ -4891,6 +5069,10 @@ static sd_image_t* decode_video_outputs(sd_ctx_t* sd_ctx,
        LOG_ERROR("no latent video to decode");
        return nullptr;
    }
+    if (sd_ctx->sd->get_cancel_flag() == SD_CANCEL_ALL) {
+        LOG_ERROR("cancelling video decode");
+        return nullptr;
+    }
    sd::Tensor<float> video_latent = final_latent;
    if (sd_version_is_ltxav(sd_ctx->sd->version) &&
        video_latent.shape()[3] > sd_ctx->sd->get_latent_channel()) {
@ -4983,7 +5165,7 @@ static sd::Tensor<float> upscale_ltx_spatial_video_latent(sd_ctx_t* sd_ctx,
        std::make_unique<LTXVUpsampler::LatentUpsamplerRunner>(sd_ctx->sd->backend_for(SDBackendModule::UPSCALER),
                                                               model_loader.get_tensor_storage_map(),
                                                               upsampler_manager);
-    const size_t max_graph_vram_bytes = sd::ggml_graph_cut::max_vram_gib_to_bytes(sd_ctx->sd->max_vram);
+    const size_t max_graph_vram_bytes = sd_ctx->sd->max_graph_vram_bytes_for_module(SDBackendModule::UPSCALER);
    upsampler->set_max_graph_vram_bytes(max_graph_vram_bytes);
    if (upsampler->model == nullptr) {
        LOG_ERROR("init LTX latent upsampler from metadata failed");
@ -5136,6 +5318,9 @@ SD_API bool generate_video(sd_ctx_t* sd_ctx,
    if (audio_out != nullptr) {
        *audio_out = nullptr;
    }
+
+    sd_ctx->sd->reset_cancel_flag();
+
    if (num_frames_out != nullptr) {
        *num_frames_out = 0;
    }
@ -5197,6 +5382,10 @@ SD_API bool generate_video(sd_ctx_t* sd_ctx,
    sd::Tensor<float> noise = sd::Tensor<float>::randn_like(x_t, sd_ctx->sd->rng);

    if (plan.high_noise_sample_steps > 0) {
+        if (sd_ctx->sd->get_cancel_flag() == SD_CANCEL_ALL) {
+            LOG_ERROR("cancelling generation before high-noise sampling");
+            return false;
+        }
        LOG_DEBUG("sample(high noise) %dx%dx%d", W, H, T);

        int64_t sampling_start = ggml_time_ms();
@ -5239,6 +5428,10 @@ SD_API bool generate_video(sd_ctx_t* sd_ctx,
        LOG_INFO("sampling(high noise) completed, taking %.2fs", (sampling_end - sampling_start) * 1.0f / 1000);
    }

+    if (sd_ctx->sd->get_cancel_flag() == SD_CANCEL_ALL) {
+        LOG_ERROR("cancelling generation before sampling");
+        return false;
+    }
    LOG_DEBUG("sample %dx%dx%d", W, H, T);
    int64_t sampling_start         = ggml_time_ms();
    sd::Tensor<float> final_latent = sd_ctx->sd->sample(sd_ctx->sd->diffusion_model,
@ -5275,6 +5468,10 @@ SD_API bool generate_video(sd_ctx_t* sd_ctx,
    LOG_INFO("sampling completed, taking %.2fs", (sampling_end - sampling_start) * 1.0f / 1000);

    if (latent_upscale_enabled) {
+        if (sd_ctx->sd->get_cancel_flag() == SD_CANCEL_ALL) {
+            LOG_ERROR("cancelling generation before latent upscale");
+            return false;
+        }
        int64_t upscale_start             = ggml_time_ms();
        sd::Tensor<float> upscaled_latent = upscale_ltx_spatial_video_latent(sd_ctx,
                                                                             request.hires.model_path,
@ -5334,6 +5531,10 @@ SD_API bool generate_video(sd_ctx_t* sd_ctx,
        }
        sd::Tensor<float> hires_denoise_mask;
        sd::Tensor<float> hires_video_positions;
+        if (sd_ctx->sd->get_cancel_flag() == SD_CANCEL_ALL) {
+            LOG_ERROR("cancelling generation before latent upscale refine");
+            return false;
+        }
        if (!apply_ltxv_refine_image_conditioning(sd_ctx,
                                                  sd_vid_gen_params,
                                                  hires_request,
@ -5413,6 +5614,10 @@ SD_API bool generate_video(sd_ctx_t* sd_ctx,
    if (sd_version_is_ltxav(sd_ctx->sd->version) &&
        latents.audio_length > 0 &&
        sd_ctx->sd->audio_vae_model != nullptr) {
+        if (sd_ctx->sd->get_cancel_flag() == SD_CANCEL_ALL) {
+            LOG_ERROR("cancelling generation before audio decode");
+            return false;
+        }
        int64_t audio_latent_decode_start = ggml_time_ms();

        auto audio_latent = unpack_ltxav_audio_latent(final_latent,
@ -5445,6 +5650,11 @@ SD_API bool generate_video(sd_ctx_t* sd_ctx,
        final_latent = sd::ops::slice(final_latent, 2, latents.ref_image_num, final_latent.shape()[2]);
    }

+    if (sd_ctx->sd->get_cancel_flag() == SD_CANCEL_ALL) {
+        LOG_ERROR("cancelling generation before video decode");
+        free_sd_audio(generated_audio);
+        return false;
+    }
    auto result = decode_video_outputs(sd_ctx, latent_upscale_enabled ? hires_request : request, final_latent, num_frames_out);
    if (result == nullptr) {
        free_sd_audio(generated_audio);
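
Note on the cancellation checks added to `generate_image` above: the batch loop distinguishes between dropping everything (`SD_CANCEL_ALL`) and finishing with whatever latents are already complete (`SD_CANCEL_NEW_LATENTS`). The standalone model below is a minimal sketch of that control flow, not the library's code. `CancelMode`, `CancelFlag`, and `run_batch` are hypothetical names. The use of `std::atomic` is an assumption about how a flag set from another thread might be stored, and treating the reset state as "no cancellation requested" is inferred from `reset_cancel_flag()` being called at the start of each generation.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <optional>
#include <vector>

// Hypothetical mirror of the cancel modes named in the diff.
enum class CancelMode { All, NewLatents, Reset };

// Hypothetical stand-in for the context's cancel flag (assumed atomic).
class CancelFlag {
public:
    void set(CancelMode m) { mode_.store(m, std::memory_order_relaxed); }
    void reset() { set(CancelMode::Reset); }
    CancelMode get() const { return mode_.load(std::memory_order_relaxed); }

private:
    std::atomic<CancelMode> mode_{CancelMode::Reset};
};

// Models the batch loop. Returns std::nullopt when the whole generation is
// cancelled or nothing was produced; otherwise returns the latents finished
// so far, which may be fewer than batch_count.
// `on_item` runs after each item and may request cancellation.
std::optional<std::vector<int>> run_batch(CancelFlag& flag,
                                          int batch_count,
                                          const std::function<void(int)>& on_item) {
    assert(batch_count >= 0);
    flag.reset();  // each generation starts from a clean flag
    std::vector<int> latents;
    for (int b = 0; b < batch_count; b++) {
        CancelMode mode = flag.get();
        if (mode == CancelMode::All) {
            return std::nullopt;  // discard everything, including finished items
        }
        if (mode == CancelMode::NewLatents) {
            break;  // keep finished items, start no new ones
        }
        latents.push_back(b);  // placeholder for one sampled latent
        if (on_item) {
            on_item(b);
        }
    }
    if (latents.empty()) {
        return std::nullopt;
    }
    return latents;
}
```

In this model, a `NewLatents` request raised after the second of four items yields a two-element result, while an `All` request yields no result at all. The real code also re-checks for `SD_CANCEL_ALL` before the hires and decode stages, which the sketch leaves out.
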
--- a/src/tokenizers/bpe_tokenizer.cpp
+++ b/src/tokenizers/bpe_tokenizer.cpp
@ -134,7 +134,8 @@ std::vector<int> BPETokenizer::encode(const std::string& text, on_new_token_cb_t
    std::vector<int32_t> bpe_tokens;
    std::vector<std::string> token_strs;

-    auto splited_texts = split_with_special_tokens(text, special_tokens);
+    std::string normalized_text = normalize_before_split ? normalize(text) : text;
+    auto splited_texts          = split_with_special_tokens(normalized_text, special_tokens);

    for (auto& splited_text : splited_texts) {
        if (is_special_token(splited_text)) {
@ -159,7 +160,7 @@ std::vector<int> BPETokenizer::encode(const std::string& text, on_new_token_cb_t
                }
            }

-            std::string token_str = normalize(token);
+            std::string token_str = normalize_before_split ? token : normalize(token);
            std::u32string utf32_token;
            if (byte_level_bpe) {
                for (int i = 0; i < token_str.length(); i++) {
--- a/src/tokenizers/clip_tokenizer.cpp
+++ b/src/tokenizers/clip_tokenizer.cpp
@ -22,9 +22,10 @@ CLIPTokenizer::CLIPTokenizer(int pad_token_id, const std::string& merges_utf8_st
    EOS_TOKEN_ID = 49407;
    PAD_TOKEN_ID = pad_token_id;

-    end_of_word_suffix = "</w>";
-    add_bos_token      = true;
-    add_eos_token      = true;
+    end_of_word_suffix     = "</w>";
+    add_bos_token          = true;
+    add_eos_token          = true;
+    normalize_before_split = true;

    if (merges_utf8_str.size() > 0) {
        load_from_merges(merges_utf8_str);
--- a/src/tokenizers/tokenizer.h
+++ b/src/tokenizers/tokenizer.h
@ -12,9 +12,10 @@ using on_new_token_cb_t = std::function<bool(std::string&, std::vector<int32_t>&
 class Tokenizer {
 protected:
    std::vector<std::string> special_tokens;
-    bool add_bos_token = false;
-    bool add_eos_token = false;
-    bool pad_left      = false;
+    bool add_bos_token          = false;
+    bool add_eos_token          = false;
+    bool pad_left               = false;
+    bool normalize_before_split = false;
    std::string end_of_word_suffix;

    virtual std::string decode_token(int token_id) const = 0;
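
Note on `normalize_before_split` in the tokenizer changes above: the flag moves normalization from per-token (after special-token splitting) to whole-text (before splitting). The sketch below illustrates the ordering only. `normalize` and `split_with_special` here are simplified stand-ins (ASCII lowercasing and a single special token), not the project's implementations, and `preprocess` is a hypothetical wrapper that skips the regex tokenization the real `encode` performs between the two steps.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Simplified stand-in for normalize(): lowercases ASCII characters only.
std::string normalize(const std::string& s) {
    std::string out = s;
    for (char& c : out) {
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }
    return out;
}

// Simplified stand-in for split_with_special_tokens(): splits on one
// special token and keeps the token itself as a separate piece.
std::vector<std::string> split_with_special(const std::string& text,
                                            const std::string& special) {
    std::vector<std::string> parts;
    size_t pos = 0;
    while (true) {
        size_t hit = text.find(special, pos);
        if (hit == std::string::npos) {
            break;
        }
        if (hit > pos) {
            parts.push_back(text.substr(pos, hit - pos));
        }
        parts.push_back(special);
        pos = hit + special.size();
    }
    if (pos < text.size()) {
        parts.push_back(text.substr(pos));
    }
    return parts;
}

// Hypothetical wrapper showing where normalization happens for each
// setting of the flag.
std::vector<std::string> preprocess(const std::string& text,
                                    const std::string& special,
                                    bool normalize_before_split) {
    std::string input = normalize_before_split ? normalize(text) : text;
    std::vector<std::string> out;
    for (const auto& piece : split_with_special(input, special)) {
        if (piece == special) {
            out.push_back(piece);  // special tokens pass through untouched
        } else {
            out.push_back(normalize_before_split ? piece : normalize(piece));
        }
    }
    return out;
}
```

In this simplified model, an input such as `A Cat<|ENDOFTEXT|>Dog` with the special token `<|endoftext|>` splits into three pieces when the flag is set, because the special token is matched against already-lowercased text. With the flag cleared, the token is not matched and the text stays as one piece.
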
--- a/src/upscaler.cpp
+++ b/src/upscaler.cpp
@ -39,20 +39,12 @@ void UpscalerGGML::set_stream_layers_enabled(bool enabled) {
 }

 bool UpscalerGGML::load_from_file(const std::string& esrgan_path,
-                                  bool offload_params_to_cpu,
                                  int n_threads) {
    ggml_log_set(ggml_log_callback_default, nullptr);

-    std::string effective_params_backend_spec = params_backend_spec;
-    if (offload_params_to_cpu) {
-        effective_params_backend_spec = effective_params_backend_spec.empty() ? "*=cpu" : "*=cpu," + effective_params_backend_spec;
-    }
    std::string error;
    if (!backend_manager.init(backend_spec.c_str(),
-                              effective_params_backend_spec.c_str(),
-                              false,
-                              false,
-                              false,
+                              params_backend_spec.c_str(),
                              &error)) {
        LOG_ERROR("upscaler backend config failed: %s", error.c_str());
        return false;
@ -181,7 +173,6 @@ struct upscaler_ctx_t {
 };

 upscaler_ctx_t* new_upscaler_ctx(const char* esrgan_path_c_str,
-                                 bool offload_params_to_cpu,
                                 bool direct,
                                 int n_threads,
                                 int tile_size,
@ -198,7 +189,7 @@ upscaler_ctx_t* new_upscaler_ctx(const char* esrgan_path_c_str,
        return nullptr;
    }

-    if (!upscaler_ctx->upscaler->load_from_file(esrgan_path, offload_params_to_cpu, n_threads)) {
+    if (!upscaler_ctx->upscaler->load_from_file(esrgan_path, n_threads)) {
        delete upscaler_ctx->upscaler;
        upscaler_ctx->upscaler = nullptr;
        free(upscaler_ctx);
--- a/src/upscaler.h
+++ b/src/upscaler.h
@ -32,7 +32,6 @@ struct UpscalerGGML {
    ~UpscalerGGML();

    bool load_from_file(const std::string& esrgan_path,
-                        bool offload_params_to_cpu,
                        int n_threads);
    void set_max_graph_vram_bytes(size_t max_vram_bytes);
    void set_stream_layers_enabled(bool enabled);
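
Note for callers of `UpscalerGGML::load_from_file` and `new_upscaler_ctx`: the `offload_params_to_cpu` argument is gone, and the removed branch shows what it used to do, namely prepend `*=cpu` to the params backend spec. A caller that relied on the old flag could build the same string before constructing the upscaler. The helper below is a minimal sketch of that composition. `with_cpu_params` is a hypothetical name, and whether the spec grammar still accepts this exact form after the refactor is an assumption not confirmed by the hunks above.

```cpp
#include <string>

// Hypothetical helper: rebuilds the spec string that the removed branch
// produced when offload_params_to_cpu was true.
std::string with_cpu_params(const std::string& params_backend_spec) {
    return params_backend_spec.empty() ? std::string("*=cpu")
                                       : "*=cpu," + params_backend_spec;
}
```

An empty spec becomes `*=cpu`; a non-empty spec keeps its existing rules after the `*=cpu,` prefix, matching the removed code.
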
Author	SHA1	Message	Date
leejet	f440ad9c29	fix: avoid writable mmap for read-only weights (#1698 )	2026-06-23 00:39:31 +08:00
stduhpf	41f7acbfb0	feat: support guidance_schedule (#1684 )	2026-06-23 00:05:55 +08:00
leejet	b395a6972d	refactor: add Flux VAE version helper (#1696 )	2026-06-22 22:39:42 +08:00
Alex Klinkhamer	854bebfe02	feat: add --prompt-file and --negative-prompt-file flags (#1693 )	2026-06-22 22:16:54 +08:00
fszontagh	787d229d84	perf: --eager-load to pre-load params at model-load time (#1687 )	2026-06-22 22:10:09 +08:00
leejet	b12098f5d0	feat: add boogu image support (#1688 )	2026-06-22 00:36:17 +08:00
stduhpf	2bd249c971	feat: concatenate repeated cli arg strings (#1686 )	2026-06-22 00:24:13 +08:00
Daniele	e9e952462f	fix: workaround for Ernie with Vulkan and Flash Attention (#1680 )	2026-06-22 00:21:38 +08:00
Wagner Bruna	e8e012eef2	fix: workaround for Anima with Vulkan and Flash Attention (#1678 )	2026-06-22 00:20:00 +08:00
leejet	7f0e728b7d	fix: normalize CLIP prompts before special-token splitting (#1670 )	2026-06-17 00:33:00 +08:00
leejet	92a3b73cdb	sync: update sdcpp-webui (#1668 )	2026-06-16 23:55:03 +08:00
Wagner Bruna	710bc91c8f	fix: correct conversion from sd_type_t to ggml_type (#1519 )	2026-06-16 23:54:42 +08:00
Wagner Bruna	5a34bc7f6e	feat: support for cancelling generations (#1124 ): support for canceling the ongoing generation; return partial image batches on cancel. Co-authored-by: leejet <leejet714@gmail.com>	2026-06-16 00:36:38 +08:00
leejet	146b6cc49e	fix: simplify PuLID ID extraction setup (#1664 )	2026-06-15 23:55:38 +08:00
RapidMark	93527fda74	feat: add PuLID-Flux identity-injection support (#1595 )	2026-06-15 23:33:50 +08:00
leejet	6e66a1a4a4	fix: allow oversized Vulkan parameter tensors (#1662 )	2026-06-15 23:18:52 +08:00
leejet	bb90bfa00f	feat: support backend-specific max-vram budgets	2026-06-14 22:46:32 +08:00
leejet	517abc777d	sync: update ggml (#1656 )	2026-06-14 20:45:05 +08:00
leejet	6f00939f75	docs: refresh README guide links	2026-06-14 17:58:58 +08:00
stduhpf	c2df4e1228	feat: add RPC support (#1629 )	2026-06-14 17:30:23 +08:00
leejet	9838264c49	refactor: simplify ControlNet output caching (#1655 )	2026-06-14 16:58:37 +08:00
leejet	17d70b91e6	docs: replace example option lists with help commands	2026-06-14 16:55:15 +08:00
leejet	5db680c2c7	refactor: route cpu placement through backend specs (#1654 )	2026-06-14 15:52:24 +08:00