docs: refresh README guide links

feat: add RPC support (#1629)
refactor: simplify ControlNet output caching (#1655)
2026-06-23 14:46:39 +00:00 · 2026-06-14 17:58:58 +08:00 · 2026-06-14 17:30:23 +08:00 · 2026-06-14 16:58:37 +08:00 · 2026-06-14 16:55:15 +08:00 · 2026-06-14 15:52:24 +08:00
35 changed files with 570 additions and 725 deletions
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -204,6 +204,12 @@ if(SD_WEBM)
    endif()
 endif()

+if (SD_RPC)
+    message("-- Use RPC as backend stable-diffusion")
+    set(GGML_RPC ON)
+    add_definitions(-DSD_USE_RPC)
+endif ()
+
 set(SD_LIB stable-diffusion)

 file(GLOB SD_LIB_SOURCES CONFIGURE_DEPENDS
--- a/README.md
+++ b/README.md
@@ -34,8 +34,8 @@ API and command-line option may change frequently.***
 - Super lightweight and without external dependencies
 - Supported models
  - Image Models
-    - SD1.x, SD2.x, [SD-Turbo](https://huggingface.co/stabilityai/sd-turbo)
-    - SDXL, [SDXL-Turbo](https://huggingface.co/stabilityai/sdxl-turbo)
+    - [SD1.x, SD2.x, SD-Turbo](./docs/sd.md)
+    - [SDXL, SDXL-Turbo](./docs/sd.md)
    - [Some SD1.x and SDXL distilled models](./docs/distilled_sd.md)
    - [SD3/SD3.5](./docs/sd3.md)
    - [FLUX.1-dev/FLUX.1-schnell](./docs/flux.md)
@@ -59,12 +59,12 @@ API and command-line option may change frequently.***
  - Video Models
    - [Wan2.1/Wan2.2](./docs/wan.md)
    - [LTX-2.3](./docs/ltx2.md)
-  - [PhotoMaker](https://github.com/TencentARC/PhotoMaker) support.
+  - [PhotoMaker](./docs/photo_maker.md) support.
  - Control Net support with SD 1.5
  - LoRA support, same as [stable-diffusion-webui](https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Features#lora)
  - Latent Consistency Models support (LCM/LCM-LoRA)
-  - Faster and memory efficient latent decoding with [TAESD](https://github.com/madebyollin/taesd)
-  - Upscale images generated with [ESRGAN](https://github.com/xinntao/Real-ESRGAN)
+  - Faster and memory efficient latent decoding with [TAESD](./docs/taesd.md)
+  - Upscale images generated with [ESRGAN](./docs/esrgan.md)
 - Supported backends
  - CPU (AVX, AVX2 and AVX512 support for x86 architectures)
  - CUDA
@@ -133,28 +133,9 @@ For runtime and parameter backend placement, see the [backend selection guide](.
 ## More Guides

 - [Backend selection](./docs/backend.md)
-- [SD1.x/SD2.x/SDXL](./docs/sd.md)
-- [SD3/SD3.5](./docs/sd3.md)
-- [FLUX.1-dev/FLUX.1-schnell](./docs/flux.md)
-- [FLUX.2-dev/FLUX.2-klein](./docs/flux2.md)
-- [FLUX.1-Kontext-dev](./docs/kontext.md)
-- [Chroma](./docs/chroma.md)
-- [🔥Qwen Image](./docs/qwen_image.md)
-- [🔥Qwen Image Edit series](./docs/qwen_image_edit.md)
-- [🔥Wan2.1/Wan2.2](./docs/wan.md)
-- [🔥LTX-2.3](./docs/ltx2.md)
-- [🔥Z-Image](./docs/z_image.md)
-- [Ovis-Image](./docs/ovis_image.md)
-- [Anima](./docs/anima.md)
-- [ERNIE-Image](./docs/ernie_image.md)
-- [HiDream-O1-Image](./docs/hidream_o1_image.md)
-- [Lens](./docs/lens.md)
-- [LongCat Image / LongCat Image Edit](./docs/longcat_image.md)
+- [RPC](./docs/rpc.md)
 - [LoRA](./docs/lora.md)
 - [LCM/LCM-LoRA](./docs/lcm.md)
-- [Using PhotoMaker to personalize image generation](./docs/photo_maker.md)
-- [Using ESRGAN to upscale results](./docs/esrgan.md)
-- [Using TAESD to faster decoding](./docs/taesd.md)
 - [Docker](./docs/docker.md)
 - [Quantization and GGUF](./docs/quantization_and_gguf.md)
 - [Inference acceleration via caching](./docs/caching.md)
--- a/docs/backend.md
+++ b/docs/backend.md
@@ -3,7 +3,7 @@
 `stable-diffusion.cpp` has two backend assignments:

 - `--backend` selects the runtime backend used to execute model graphs.
-- `--params-backend` selects the backend used to allocate model parameters.
+- `--params-backend` selects where model parameters are kept.

 If `--params-backend` is not set, parameters use the same backend as their module runtime backend.

@@ -29,6 +29,12 @@ The same syntax is used for parameter placement:
 sd-cli -m model.safetensors -p "a cat" --backend cuda0 --params-backend te=cpu,vae=cpu
 ```

+`--params-backend` also accepts the special value `disk`:
+
+```shell
+sd-cli -m model.safetensors -p "a cat" --backend cuda0 --params-backend disk
+```
+
 Module names are case-insensitive. Hyphens and underscores in module names are ignored, so `clip_vision`, `clip-vision`, and `clipvision` are equivalent.

 `all=`, `default=`, and `*=` can be used to set the default backend inside a mixed assignment:
@@ -64,9 +70,11 @@ The special values `auto`, `default`, and an empty backend name select the defau

 The special value `gpu` selects the first GPU backend, falling back to the first integrated GPU backend.

+The special value `disk` is accepted only by `--params-backend`. `--backend disk` is invalid because `disk` is a parameter residency mode, not a runtime compute backend.
+
 ## Runtime backend vs. parameter backend

-The runtime backend controls where graph execution runs. The parameter backend controls where model weights are allocated.
+The runtime backend controls where graph execution runs. The parameter backend controls where model weights are allocated or whether they are reloaded from disk on demand.

 For example:

@@ -76,6 +84,16 @@ sd-cli -m model.safetensors -p "a cat" --backend cuda0 --params-backend cpu

 This runs all modules on `cuda0`, but stores parameters in CPU RAM. During execution, parameters are moved to the runtime backend as needed.

+For example:
+
+```shell
+sd-cli -m model.safetensors -p "a cat" --backend cuda0 --params-backend disk
+```
+
+This runs all modules on `cuda0`, reloads parameters from the model file as needed, and releases those parameter buffers after use.
+
+`disk` is never selected implicitly. If `--params-backend` is not set, parameters use the runtime backend.
+
 Per-module assignments can be mixed:

 ```shell
@@ -100,23 +118,27 @@ uses one shared CPU backend for both `te` and `vae` runtime execution.

 Runtime and parameter assignments also share the same backend cache. If `--backend diffusion=cuda0` and `--params-backend diffusion=cuda0` resolve to the same device, both use the same backend instance.

+`--params-backend disk` does not create a separate backend instance. Parameters are loaded lazily using the module runtime backend.
+
 `SDBackendManager` owns the backend instances and frees them when the context or upscaler is destroyed. Model runners receive non-owning runtime and parameter backend pointers and do not free them.

 ## Compatibility flags

-The older CPU placement flags are still supported:
+The example CLI/server still accepts these older CPU placement flags as compatibility aliases:

 - `--clip-on-cpu`
 - `--vae-on-cpu`
 - `--control-net-cpu`
 - `--offload-to-cpu`

-`--clip-on-cpu`, `--vae-on-cpu`, and `--control-net-cpu` affect runtime backend assignment only when `--backend` is not set. They map to `te=cpu`, `vae=cpu`, and `controlnet=cpu`.
+`--clip-on-cpu`, `--vae-on-cpu`, and `--control-net-cpu` are deprecated. The example argument layer prepends `te=cpu`, `vae=cpu`, and `controlnet=cpu` to `--backend` before creating the context.

-`--offload-to-cpu` affects parameter backend assignment only when `--params-backend` is not set. It is equivalent to:
+`--offload-to-cpu` prepends a CPU default to the parameter assignment in the caller before creating the context:

 ```shell
---params-backend cpu
+--params-backend '*=cpu'
 ```

-Explicit `--backend` and `--params-backend` assignments are preferred for new commands.
+Because this default is inserted first, later explicit `--params-backend` entries can still override it. For example, `--offload-to-cpu --params-backend te=disk` keeps non-TE parameters on CPU and reloads TE parameters from disk.
+
+Library callers should set `backend` and `params_backend` directly. The old CPU/offload fields are no longer part of the C API. Explicit `--backend` and `--params-backend` assignments are preferred for new commands.
--- a/docs/performance.md
+++ b/docs/performance.md
@@ -21,6 +21,38 @@ and the compute buffer shrink in the debug log:

 Using `--offload-to-cpu` allows you to offload weights to the CPU, saving VRAM without reducing generation speed.

+## Use params backend to reduce VRAM or RAM usage.
+
+`--params-backend` controls where model parameters are kept. If it is not set, parameters use the same backend as `--backend`, so a GPU runtime backend also keeps parameters in VRAM.
+
+Use CPU params to reduce VRAM usage:
+
+```shell
+--backend cuda0 --params-backend cpu
+```
+
+This keeps model weights in system RAM and moves them to the runtime backend when needed. In the example CLI/server, `--offload-to-cpu` is a compatibility shortcut that prepends `*=cpu` to `--params-backend` before creating the context, so explicit module assignments can still override it:
+
+```shell
+--offload-to-cpu --params-backend te=disk
+```
+
+Use disk params to reduce both VRAM and RAM usage:
+
+```shell
+--backend cuda0 --params-backend disk
+```
+
+This reloads parameters from the model file on demand and releases them after use. It has the lowest memory residency, but can be slower because weights must be read again. `disk` is never selected implicitly; set it explicitly when RAM usage matters more than reload cost.
+
+Per-module assignments can target only the largest modules:
+
+```shell
+--backend cuda0 --params-backend diffusion=disk,te=cpu,vae=cpu
+```
+
+See [backend selection](./backend.md) for full syntax.
+
 ## Use quantization to reduce memory usage.

 [quantization](./quantization_and_gguf.md)
--- a/docs/rpc.md
+++ b/docs/rpc.md
@@ -0,0 +1,220 @@
+# Building and Using the RPC Server with `stable-diffusion.cpp`
+
+This guide covers how to build a version of [the RPC server from `llama.cpp`](https://github.com/ggml-org/llama.cpp/blob/master/tools/rpc/README.md) that is compatible with your version of `stable-diffusion.cpp`, so that you can manage multi-backend setups. RPC allows you to offload specific model components to a remote server.
+
+> **Note on Model Location:** The model files (e.g., `.safetensors` or `.gguf`) remain on the **Client** machine. The client parses the file and transmits the necessary tensor data and computational graphs to the server. The server does not need to store the model files locally.
+
+## 1. Building `stable-diffusion.cpp` with RPC client
+
+First, build the client application from source. Set `SD_RPC=ON` to include the RPC backend in the client.
+
+```bash
+mkdir build
+cd build
+cmake .. \
+    -DSD_RPC=ON \
+    # Add other build flags here (e.g., -DSD_VULKAN=ON)
+cmake --build . --config Release -j $(nproc)
+```
+
+> **Note:** Ensure you add the other flags you would normally use (e.g., `-DSD_VULKAN=ON`, `-DSD_CUDA=ON`, `-DSD_HIPBLAS=ON`, or `-DGGML_METAL=ON`). For more information about building `stable-diffusion.cpp` from source, refer to the [build.md](build.md) documentation.
+
+## 2. Ensure `llama.cpp` is at the correct commit
+
+`stable-diffusion.cpp`'s RPC client is designed to work with a specific version of `llama.cpp` (compatible with the `ggml` submodule) to ensure API compatibility. The commit hash for `llama.cpp` is stored in `ggml/scripts/sync-llama.last`.
+
+> **Start from Root:** Perform these steps from the root of your `stable-diffusion.cpp` directory.
+
+1.  Read the target commit hash from the submodule tracker:
+
+    ```bash
+    # Linux / WSL / MacOS
+    HASH=$(cat ggml/scripts/sync-llama.last)
+
+    # Windows (PowerShell)
+    $HASH = Get-Content -Path "ggml\scripts\sync-llama.last"
+    ```
+
+2.  Clone `llama.cpp` at the target commit:
+    ```bash
+    git clone https://github.com/ggml-org/llama.cpp.git
+    cd llama.cpp
+    git checkout $HASH
+    ```
+    To save on download time and storage, you can use a shallow clone to download only the target commit:
+    ```bash
+    mkdir -p llama.cpp
+    cd llama.cpp
+    git init
+    git remote add origin https://github.com/ggml-org/llama.cpp.git
+    git fetch --depth 1 origin $HASH
+    git checkout FETCH_HEAD
+    ```
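
+As a quick sanity check before building, you can confirm that the checkout matches the tracked commit. This is a minimal sketch, assuming `HASH` is still set in the current shell and that you are inside the `llama.cpp` directory:

+```bash
+# Compare the checked-out commit with the hash read from ggml/scripts/sync-llama.last
+if [ "$(git rev-parse HEAD)" = "$(git rev-parse "$HASH^{commit}")" ]; then
+    echo "llama.cpp is at the expected commit"
+else
+    echo "Commit mismatch: expected $HASH" >&2
+fi
+```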
+
+## 3. Build `llama.cpp` (RPC Server)
+
+The RPC server acts as the worker. You must explicitly enable the **backend** (the hardware interface, such as CUDA for Nvidia, Metal for Apple Silicon, or Vulkan) when building; otherwise, the server will default to using only the CPU.
+
+To find the correct flags for your system, refer to the official documentation for the [`llama.cpp`](https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md) repository.
+
+> **Crucial:** You must include the compiler flag required for API compatibility with `stable-diffusion.cpp` (`-DGGML_MAX_NAME=128`). Without this flag, `GGML_MAX_NAME` defaults to `64` on the server, and data transfers between the client and server will fail. `-DGGML_RPC=ON` must also be set.
+>
+> I recommend disabling the `LLAMA_CURL` flag to avoid unnecessary dependencies, and disabling shared library builds to avoid potential conflicts.
+
+> **Build Target:** We are specifically building the `rpc-server` target. This prevents the build system from compiling the entire `llama.cpp` suite (like `llama-server`), making the build significantly faster.
+
+### Linux / WSL (Vulkan)
+
+```bash
+mkdir build
+cd build
+# Set the flag for your GPU backend (Vulkan here)
+cmake .. -DGGML_RPC=ON \
+    -DGGML_VULKAN=ON \
+    -DGGML_BUILD_SHARED_LIBS=OFF \
+    -DLLAMA_CURL=OFF \
+    -DCMAKE_C_FLAGS=-DGGML_MAX_NAME=128 \
+    -DCMAKE_CXX_FLAGS=-DGGML_MAX_NAME=128
+cmake --build . --config Release --target rpc-server -j $(nproc)
+```
+
+### macOS (Metal)
+
+```bash
+mkdir build
+cd build
+cmake .. -DGGML_RPC=ON \
+    -DGGML_METAL=ON \
+    -DGGML_BUILD_SHARED_LIBS=OFF \
+    -DLLAMA_CURL=OFF \
+    -DCMAKE_C_FLAGS=-DGGML_MAX_NAME=128 \
+    -DCMAKE_CXX_FLAGS=-DGGML_MAX_NAME=128
+cmake --build . --config Release --target rpc-server
+```
+
+### Windows (Visual Studio 2022, Vulkan)
+
+```powershell
+mkdir build
+cd build
+cmake .. -G "Visual Studio 17 2022" -A x64 `
+    -DGGML_RPC=ON `
+    -DGGML_VULKAN=ON `
+    -DGGML_BUILD_SHARED_LIBS=OFF `
+    -DLLAMA_CURL=OFF `
+    -DCMAKE_C_FLAGS=-DGGML_MAX_NAME=128 `
+    -DCMAKE_CXX_FLAGS=-DGGML_MAX_NAME=128
+cmake --build . --config Release --target rpc-server
+```
+
+## 4. Usage
+
+Once both applications are built, you can run the server and the client to manage your GPU allocation.
+
+### Step A: Run the RPC Server
+
+Start the server. It listens for connections on the default address (usually `localhost:50052`). If your server is on a different machine, ensure the server binds to the correct interface and your firewall allows the connection.
+
+**On the Server:**
+If running on the same machine, you can use the default address:
+
+```bash
+./rpc-server
+```
+
+If you want to allow connections from other machines on the network:
+
+```bash
+./rpc-server --host 0.0.0.0
+```
+
+> **Security Warning:** The RPC server does not currently support authentication or encryption. **Only run the server on trusted local networks**. Never expose the RPC server directly to the open internet.
+
+> **Drivers & Hardware:** Ensure the Server machine has the necessary drivers installed and functional (e.g., Nvidia Drivers for CUDA, Vulkan SDK, or Metal). If no devices are found, the server will simply fall back to the CPU.
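
+Before starting the client, it can help to confirm that the server port is reachable. The following is a sketch only, assuming the `nc` (netcat) utility is installed on the client machine and that the server uses the default port. Replace `localhost` with the server's address if it runs elsewhere:

+```bash
+# Probe the RPC port without sending any data (-z); exit status 0 means the port is open
+if nc -z localhost 50052; then
+    echo "RPC server is reachable"
+else
+    echo "Cannot reach the RPC server; check the address, port, and firewall" >&2
+fi
+```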
+
+<!-- ### Step B: Check if the client is able to connect to the server and see the available devices
+
+We're assuming the server is running on your local machine, and listening on the default port `50052`. If it's running on a different machine, you can replace `localhost` with the IP address of the server.
+
+**On the Client:**
+
+```bash
+./sd-cli --rpc-servers localhost:50052 --list-devices
+```
+
+If the server is running and the client is able to connect, you should see `RPC0    localhost:50052` in the list of devices.
+
+Example output:
+(Client built without GPU acceleration, two GPUs available on the server)
+
+```
+List of available GGML devices:
+Name    Description
+-------------------
+CPU     AMD Ryzen 9 5900X 12-Core Processor
+RPC0    localhost:50052
+RPC1    localhost:50052
+``` -->
+
+### Step B: Run with RPC device
+
+If everything is working correctly, you can now run the client while offloading some or all of the work to the RPC server.
+
+Example: set the main backend to the `RPC0` device so that all the work is done on the server.
+
+```bash
+./sd-cli -m models/sd1.5.safetensors -p "A cat" --rpc-servers localhost:50052 --backend RPC0
+```
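
+To offload only part of the work, the per-module syntax described in the [backend selection guide](backend.md) can be combined with an RPC device. The command below is a sketch, assuming RPC device names are accepted in per-module assignments the same way as local backend names. It runs the diffusion model on the server and keeps the text encoder and VAE on the local CPU:

+```bash
+# Assumption: per-module assignments accept RPC device names such as RPC0
+./sd-cli -m models/sd1.5.safetensors -p "A cat" --rpc-servers localhost:50052 --backend diffusion=RPC0,te=cpu,vae=cpu
+```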
+
+---
+
+## 5. Scaling: Multiple RPC Servers
+
+You can connect the client to multiple RPC servers simultaneously to scale out your hardware usage.
+
+Example: A main machine (192.168.1.10) with 3 GPUs, with one GPU running CUDA and the other two running Vulkan, and a second machine (192.168.1.11) with only one GPU.
+
+**On the first machine (Running two server instances):**
+
+**Terminal 1 (CUDA):**
+
+```bash
+# Linux / WSL
+export CUDA_VISIBLE_DEVICES=0
+cd ./build_cuda/bin/Release
+./rpc-server --host 0.0.0.0
+
+# Windows PowerShell
+$env:CUDA_VISIBLE_DEVICES="0"
+cd .\build_cuda\bin\Release
+./rpc-server --host 0.0.0.0
+```
+
+**Terminal 2 (Vulkan):**
+
+```bash
+cd ./build_vulkan/bin/Release
+# ignore the first GPU (used by CUDA server)
+./rpc-server --host 0.0.0.0 --port 50053 -d Vulkan1,Vulkan2
+```
+
+**On the second machine:**
+
+```bash
+cd ./build/bin/Release
+./rpc-server --host 0.0.0.0
+```
+
+**On the Client:**
+Pass multiple server addresses separated by commas.
+
+```bash
+./sd-cli --rpc-servers 192.168.1.10:50052,192.168.1.10:50053,192.168.1.11:50052 [...]
+```
+
+The client will map these servers to sequential device IDs (e.g., RPC0 from the first server, RPC1 and RPC2 from the second, and RPC3 from the third). With this setup, you could for example use RPC0 for the main backend, RPC1 and RPC2 for the text encoders, and RPC3 for the VAE.
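
+As an illustration of such a split, the command below is a sketch that assumes the servers expose their devices in the order listed (RPC0 is the CUDA GPU, RPC1 and RPC2 are the Vulkan GPUs, and RPC3 is the GPU on the second machine) and that RPC devices can be used in per-module assignments. The model path and prompt are placeholders:

+```bash
+# Diffusion model on the CUDA GPU, text encoder on a Vulkan GPU, VAE on the second machine
+./sd-cli -m models/sd1.5.safetensors -p "A cat" \
+    --rpc-servers 192.168.1.10:50052,192.168.1.10:50053,192.168.1.11:50052 \
+    --backend diffusion=RPC0,te=RPC1,vae=RPC3
+```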
+
+---
+
+## 6. Performance Considerations
+
+RPC performance depends heavily on network bandwidth, because large weights and activations must be transferred back and forth over the network. The cost grows with large models and high resolutions. For best results, ensure your network connection is stable and has sufficient bandwidth (>1Gbps recommended). This should not be a concern if you are running the server and client on the same machine, as the data transfer will happen over the loopback interface.
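
+For a rough sense of scale, the time to send a given amount of data can be estimated from the link speed. The figures below are hypothetical (4 GiB of weights over a 1 Gbps link) and ignore protocol overhead and latency, so treat the result as a lower bound:

+```bash
+# transfer time (s) = bytes * 8 / link speed (bits per second)
+awk 'BEGIN { bytes = 4 * 1024^3; bps = 1e9; printf "%.1f s\n", bytes * 8 / bps }'
+# prints "34.4 s"
+```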
--- a/examples/cli/README.md
+++ b/examples/cli/README.md
@@ -1,204 +1,9 @@
-# Run
+# Usage

-```
-usage: ./bin/sd-cli  [options]
+For detailed command-line arguments, run:

-CLI Options:
-  -o, --output <string>         path to write result image to. you can use printf-style %d format specifiers for image
-                                sequences (default: ./output.png) (eg. output_%03d.png). Single-file video outputs
-                                support .avi, .webm, and animated .webp
-  --image <string>              path to the image to inspect (for metadata mode)
-  --metadata-format <string>    metadata output format, one of [text, json] (default: text)
-  --preview-path <string>       path to write preview image to (default: ./preview.png). Multi-frame previews support
-                                .avi, .webm, and animated .webp
-  --preview-interval <int>      interval in denoising steps between consecutive updates of the image preview file
-                                (default is 1, meaning updating at every step)
-  --output-begin-idx <int>      starting index for output image sequence, must be non-negative (default 0 if specified
-                                %d in output path, 1 otherwise)
-  --canny                       apply canny preprocessor (edge detection)
-  --convert-name                convert tensor name (for convert mode)
-  -v, --verbose                 print extra info
-  --color                       colors the logging tags according to level
-  --taesd-preview-only          prevents usage of taesd for decoding the final image. (for use with --preview tae)
-  --preview-noisy               enables previewing noisy inputs of the models rather than the denoised outputs
-  --metadata-raw                include raw hex previews for unparsed metadata payloads
-  --metadata-brief              truncate long metadata text values in text output
-  --metadata-all                include structural/container entries such as IHDR, IDAT, and non-metadata JPEG segments
-  -M, --mode                    run mode, one of [img_gen, vid_gen, upscale, convert, metadata], default: img_gen
-  --preview                     preview method. must be one of the following [none, proj, tae, vae] (default is none)
-  -h, --help                    show this help message and exit
-
-Context Options:
-  -m, --model <string>                     path to full model
-  --clip_l <string>                        path to the clip-l text encoder
-  --clip_g <string>                        path to the clip-g text encoder
-  --clip_vision <string>                   path to the clip-vision encoder
-  --t5xxl <string>                         path to the t5xxl text encoder
-  --llm <string>                           path to the llm text encoder. For example: (qwenvl2.5 for qwen-image,
-                                           mistral-small3.2 for flux2, ...)
-  --llm_vision <string>                    path to the llm vit
-  --qwen2vl <string>                       alias of --llm. Deprecated.
-  --qwen2vl_vision <string>                alias of --llm_vision. Deprecated.
-  --diffusion-model <string>               path to the standalone diffusion model
-  --high-noise-diffusion-model <string>    path to the standalone high noise diffusion model
-  --uncond-diffusion-model <string>        path to the standalone unconditional diffusion model, currently used by
-                                           Ideogram4 CFG
-  --vae <string>                           path to standalone vae model
-  --taesd <string>                         path to taesd. Using Tiny AutoEncoder for fast decoding (low quality)
-  --tae <string>                           alias of --taesd
-  --control-net <string>                   path to control net model
-  --embd-dir <string>                      embeddings directory
-  --lora-model-dir <string>                lora model directory
-  --hires-upscalers-dir <string>           highres fix upscaler model directory
-  --tensor-type-rules <string>             weight type per tensor pattern (example: "^vae\.=f16,model\.=q8_0")
-  --photo-maker <string>                   path to PHOTOMAKER model
-  --upscale-model <string>                 path to esrgan model.
-  -t, --threads <int>                      number of threads to use during computation (default: -1). If threads <= 0,
-                                           then threads will be set to the number of CPU physical cores
-  --chroma-t5-mask-pad <int>               t5 mask pad size of chroma
-  --max-vram <float>                       maximum VRAM budget in GiB for graph-cut segmented execution. 0 disables
-                                           graph splitting; a negative value auto-detects free VRAM, sparing the
-                                           specified value (e.g. -0.5 will keep at least 0.5 GiB free)
-  --force-sdxl-vae-conv-scale              force use of conv scale on sdxl vae
-  --offload-to-cpu                         place the weights in RAM to save VRAM, and automatically load them into VRAM
-                                           when needed
-  --mmap                                   whether to memory-map model
-  --control-net-cpu                        keep controlnet in cpu (for low vram)
-  --clip-on-cpu                            keep clip in cpu (for low vram)
-  --vae-on-cpu                             keep vae in cpu (for low vram)
-  --fa                                     use flash attention
-  --diffusion-fa                           use flash attention in the diffusion model only
-  --diffusion-conv-direct                  use ggml_conv2d_direct in the diffusion model
-  --vae-conv-direct                        use ggml_conv2d_direct in the vae model
-  --circular                               enable circular padding for convolutions
-  --circularx                              enable circular RoPE wrapping on x-axis (width) only
-  --circulary                              enable circular RoPE wrapping on y-axis (height) only
-  --chroma-disable-dit-mask                disable dit mask for chroma
-  --qwen-image-zero-cond-t                 enable zero_cond_t for qwen image
-  --chroma-enable-t5-mask                  enable t5 mask for chroma
-  --type                                   weight type (examples: f32, f16, q4_0, q4_1, q5_0, q5_1, q8_0, q2_K, q3_K,
-                                           q4_K). If not specified, the default is the type of the weight file
-  --rng                                    RNG, one of [std_default, cuda, cpu], default: cuda(sd-webui), cpu(comfyui)
-  --sampler-rng                            sampler RNG, one of [std_default, cuda, cpu]. If not specified, use --rng
-  --prediction                             prediction type override, one of [eps, v, edm_v, sd3_flow, flux_flow,
-                                           flux2_flow]
-  --lora-apply-mode                        the way to apply LoRA, one of [auto, immediately, at_runtime], default is
-                                           auto. In auto mode, if the model weights contain any quantized parameters,
-                                           the at_runtime mode will be used; otherwise, immediately will be used.The
-                                           immediately mode may have precision and compatibility issues with quantized
-                                           parameters, but it usually offers faster inference speed and, in some cases,
-                                           lower memory usage. The at_runtime mode, on the other hand, is exactly the
-                                           opposite.
-
-Generation Options:
-  -p, --prompt <string>                    the prompt to render
-  -n, --negative-prompt <string>           the negative prompt (default: "")
-  -i, --init-img <string>                  path to the init image
-  --end-img <string>                       path to the end image, required by flf2v
-  --mask <string>                          path to the mask image
-  --control-image <string>                 path to control image, control net
-  --control-video <string>                 path to control video frames, It must be a directory path. The video frames
-                                           inside should be stored as images in lexicographical (character) order. For
-                                           example, if the control video path is `frames`, the directory contain images
-                                           such as 00.png, 01.png, ... etc.
-  --pm-id-images-dir <string>              path to PHOTOMAKER input id images dir
-  --pm-id-embed-path <string>              path to PHOTOMAKER v2 id embed
-  --hires-upscaler <string>                highres fix upscaler, Lanczos, Nearest, Latent, Latent (nearest), Latent
-                                           (nearest-exact), Latent (antialiased), Latent (bicubic), Latent (bicubic
-                                           antialiased), or a model name under --hires-upscalers-dir (default: Latent)
-  --extra-sample-args <string>             extra sampler/scheduler/guidance args, key=value list. APG supports apg_eta,
-                                           apg_momentum, apg_norm_threshold, apg_norm_threshold_smoothing; SLG supports
-                                           slg_uncond; lcm supports noise_clip_std, noise_scale_start, noise_scale_end;
-                                           ltx2 supports max_shift, base_shift, stretch, terminal; euler_ge supports gamma
-  --extra-tiling-args <string>             extra VAE tiling args, key=value list. LTX video VAE supports
-                                           temporal_tile_frames (default: 4), temporal_tile_overlap (default: 1)
-  -H, --height <int>                       image height, in pixel space (default: 512)
-  -W, --width <int>                        image width, in pixel space (default: 512)
-  --steps <int>                            number of sample steps (default: 20)
-  --high-noise-steps <int>                 (high noise) number of sample steps (default: -1 = auto)
-  --clip-skip <int>                        ignore last layers of CLIP network; 1 ignores none, 2 ignores one layer
-                                           (default: -1). <= 0 represents unspecified, will be 1 for SD1.x, 2 for SD2.x
-  -b, --batch-count <int>                  batch count
-  --video-frames <int>                     video frames (default: 1)
-  --fps <int>                              fps (default: 24)
-  --timestep-shift <int>                   shift timestep for NitroFusion models (default: 0). recommended N for
-                                           NitroSD-Realism around 250 and 500 for NitroSD-Vibrant
-  --upscale-repeats <int>                  Run the ESRGAN upscaler this many times (default: 1)
-  --upscale-tile-size <int>                tile size for ESRGAN upscaling (default: 128)
-  --hires-width <int>                      highres fix target width, 0 to use --hires-scale (default: 0)
-  --hires-height <int>                     highres fix target height, 0 to use --hires-scale (default: 0)
-  --hires-steps <int>                      highres fix second pass sample steps, 0 to reuse --steps (default: 0)
-  --hires-upscale-tile-size <int>          highres fix upscaler tile size, reserved for model-backed upscalers (default:
-                                           128)
-  --cfg-scale <float>                      unconditional guidance scale: (default: 7.0)
-  --img-cfg-scale <float>                  image guidance scale for inpaint or image edit models: (default: same as
-                                           --cfg-scale)
-  --guidance <float>                       distilled guidance scale for models with guidance input (default: 3.5)
-  --slg-scale <float>                      skip layer guidance (SLG) scale, only for DiT models: (default: 0). 0 means
-                                           disabled, a value of 2.5 is nice for sd3.5 medium
-  --skip-layer-start <float>               SLG enabling point (default: 0.01)
-  --skip-layer-end <float>                 SLG disabling point (default: 0.2)
-  --eta <float>                            noise multiplier (default: 0 for ddim_trailing, tcd, res_multistep and
-                                           res_2s; 1 for euler_a, er_sde and dpm++2s_a)
-  --flow-shift <float>                     shift value for Flow models like SD3.x or WAN (default: auto)
-  --high-noise-cfg-scale <float>           (high noise) unconditional guidance scale: (default: 7.0)
-  --high-noise-img-cfg-scale <float>       (high noise) image guidance scale for inpaint or image edit models (default:
-                                           same as --cfg-scale)
-  --high-noise-guidance <float>            (high noise) distilled guidance scale for models with guidance input
-                                           (default: 3.5)
-  --high-noise-slg-scale <float>           (high noise) skip layer guidance (SLG) scale, only for DiT models: (default:
-                                           0)
-  --high-noise-skip-layer-start <float>    (high noise) SLG enabling point (default: 0.01)
-  --high-noise-skip-layer-end <float>      (high noise) SLG disabling point (default: 0.2)
-  --high-noise-eta <float>                 (high noise) noise multiplier (default: 0 for ddim_trailing, tcd,
-                                           res_multistep and res_2s; 1 for euler_a, er_sde and dpm++2s_a)
-  --strength <float>                       strength for noising/unnoising (default: 0.75)
-  --pm-style-strength <float>
-  --control-strength <float>               strength to apply Control Net (default: 0.9). 1.0 corresponds to full
-                                           destruction of information in init image
-  --moe-boundary <float>                   timestep boundary for Wan2.2 MoE model. (default: 0.875). Only enabled if
-                                           `--high-noise-steps` is set to -1
-  --vace-strength <float>                  wan vace strength
-  --vae-tile-overlap <float>               tile overlap for vae tiling, in fraction of tile size (default: 0.5)
-  --hires-scale <float>                    highres fix scale when target size is not set (default: 2.0)
-  --hires-denoising-strength <float>       highres fix second pass denoising strength (default: 0.7)
-  --increase-ref-index                     automatically increase the indices of references images based on the order
-                                           they are listed (starting with 1).
-  --disable-auto-resize-ref-image          disable auto resize of ref images
-  --disable-image-metadata                 do not embed generation metadata on image files
-  --vae-tiling                             process vae in tiles to reduce memory usage
-  --temporal-tiling                        enable temporal tiling for LTX video VAE decode
-  --hires                                  enable highres fix
-  -s, --seed                               RNG seed (default: 42, use random seed for < 0)
-  --sampling-method                        sampling method, one of [euler, euler_a, heun, dpm2, dpm++2s_a, dpm++2m,
-                                           dpm++2mv2, ipndm, ipndm_v, lcm, ddim_trailing, tcd, res_multistep, res_2s,
-                                           er_sde, euler_cfg_pp, euler_a_cfg_pp] (default: euler for Flux/SD3/Wan, euler_a otherwise)
-  --high-noise-sampling-method             (high noise) sampling method, one of [euler, euler_a, heun, dpm2, dpm++2s_a,
-                                           dpm++2m, dpm++2mv2, ipndm, ipndm_v, lcm, ddim_trailing, tcd, res_multistep,
-                                           res_2s, er_sde, euler_cfg_pp, euler_a_cfg_pp] default: euler for Flux/SD3/Wan, euler_a otherwise
-  --scheduler                              denoiser sigma scheduler, one of [discrete, karras, exponential, ays, gits,
-                                           smoothstep, sgm_uniform, simple, kl_optimal, lcm, bong_tangent, ltx2], default:
-                                           model-specific
-  --sigmas                                 custom sigma values for the sampler, comma-separated (e.g.,
-                                           "14.61,7.8,3.5,0.0").
-  --hires-sigmas                           custom sigma values for the highres fix second pass, comma-separated (e.g.,
-                                           "0.85,0.725,0.421875,0.0").
-  --skip-layers                            layers to skip for SLG steps (default: [7,8,9])
-  --high-noise-skip-layers                 (high noise) layers to skip for SLG steps (default: [7,8,9])
-  -r, --ref-image                          reference image for Flux Kontext models (can be used multiple times)
-  --cache-mode                             caching method: 'easycache' (DiT), 'ucache' (UNET),
-                                           'dbcache'/'taylorseer'/'cache-dit' (DiT block-level), 'spectrum' (UNET/DiT
-                                           Chebyshev+Taylor forecasting)
-  --cache-option                           named cache params (key=value format, comma-separated). easycache/ucache:
-                                           threshold=,start=,end=,decay=,relative=,reset=; dbcache/taylorseer/cache-dit:
-                                           Fn=,Bn=,threshold=,warmup=; spectrum: w=,m=,lam=,window=,flex=,warmup=,stop=.
-                                           Examples: "threshold=0.25" or "threshold=1.5,reset=0"
-  --scm-mask                               SCM steps mask for cache-dit: comma-separated 0/1 (e.g.,
-                                           "1,1,1,0,0,1,0,0,1,0") - 1=compute, 0=can cache
-  --scm-policy                             SCM policy: 'dynamic' (default) or 'static'
-  --vae-tile-size                          tile size for vae tiling, format [X]x[Y] (default: 32x32)
-  --vae-relative-tile-size                 relative tile size for vae tiling, format [X]x[Y], in fraction of image size
-                                           if < 1, in number of tiles per dim if >=1 (overrides --vae-tile-size)
+```bash
+./bin/sd-cli -h
 ```

 Metadata mode inspects PNG/JPEG container metadata without loading any model:
--- a/examples/cli/main.cpp
+++ b/examples/cli/main.cpp
@ -623,8 +623,6 @@ int main(int argc, const char* argv[]) {
        }
    }

-    bool vae_decode_only = true;
-
    auto load_image_and_update_size = [&](const std::string& path,
                                          SDImageOwner& image,
                                          bool resize_image    = true,
@ -646,21 +644,18 @@ int main(int argc, const char* argv[]) {
    };

    if (gen_params.init_image_path.size() > 0) {
-        vae_decode_only = false;
        if (!load_image_and_update_size(gen_params.init_image_path, gen_params.init_image)) {
            return 1;
        }
    }

    if (gen_params.end_image_path.size() > 0) {
-        vae_decode_only = false;
        if (!load_image_and_update_size(gen_params.end_image_path, gen_params.end_image)) {
            return 1;
        }
    }

    if (gen_params.ref_image_paths.size() > 0) {
-        vae_decode_only = false;
        gen_params.ref_images.clear();
        for (auto& path : gen_params.ref_image_paths) {
            SDImageOwner ref_image({0, 0, 3, nullptr});
@ -735,18 +730,7 @@ int main(int argc, const char* argv[]) {
        }
    }

-    if (cli_params.mode == VID_GEN) {
-        vae_decode_only = false;
-    }
-
-    if (gen_params.hires_enabled &&
-        (gen_params.resolved_hires_upscaler == SD_HIRES_UPSCALER_MODEL ||
-         gen_params.resolved_hires_upscaler == SD_HIRES_UPSCALER_LANCZOS ||
-         gen_params.resolved_hires_upscaler == SD_HIRES_UPSCALER_NEAREST)) {
-        vae_decode_only = false;
-    }
-
-    sd_ctx_params_t sd_ctx_params = ctx_params.to_sd_ctx_params_t(vae_decode_only, true, cli_params.taesd_preview);
+    sd_ctx_params_t sd_ctx_params = ctx_params.to_sd_ctx_params_t(cli_params.taesd_preview);

    SDImageVec results;
    int num_results             = 0;
@ -798,12 +782,11 @@ int main(int argc, const char* argv[]) {
    int upscale_factor = 4;  // unused for RealESRGAN_x4plus_anime_6B.pth
    if (ctx_params.esrgan_path.size() > 0 && gen_params.upscale_repeats > 0) {
        UpscalerCtxPtr upscaler_ctx(new_upscaler_ctx(ctx_params.esrgan_path.c_str(),
-                                                     ctx_params.offload_params_to_cpu,
                                                     ctx_params.diffusion_conv_direct,
                                                     ctx_params.n_threads,
                                                     gen_params.upscale_tile_size,
-                                                     ctx_params.backend.c_str(),
-                                                     ctx_params.params_backend.c_str()));
+                                                     sd_ctx_params.backend,
+                                                     sd_ctx_params.params_backend));

        if (upscaler_ctx == nullptr) {
            LOG_ERROR("new_upscaler_ctx failed");
--- a/examples/common/common.cpp
+++ b/examples/common/common.cpp
@ -51,6 +51,10 @@ static sd_vae_format_t str_to_vae_format(const std::string& value) {
    return SD_VAE_FORMAT_COUNT;
 }

+static void prepend_backend_assignment(std::string& spec, const char* assignment) {
+    spec = spec.empty() ? assignment : std::string(assignment) + "," + spec;
+}
+
 #if defined(_WIN32)
 static std::string utf16_to_utf8(const std::wstring& wstr) {
    if (wstr.empty())
@ -421,8 +425,12 @@ ArgOptions SDContextParams::get_options() {
         &backend},
        {"",
         "--params-backend",
-         "parameter backend assignment, e.g. cpu or diffusion=cpu,clip=cpu",
+         "parameter backend assignment, e.g. disk, cpu, or diffusion=disk,clip=cpu",
         &params_backend},
+        {"",
+         "--rpc-servers",
+         "comma-separated list of RPC servers to connect to for offloading, in the format host:port, e.g. localhost:50052,192.168.1.3:50052",
+         &rpc_servers},
    };

    options.int_options = {
@ -463,15 +471,15 @@ ArgOptions SDContextParams::get_options() {
         true, &enable_mmap},
        {"",
         "--control-net-cpu",
-         "keep controlnet in cpu (for low vram)",
+         "deprecated; use --backend controlnet=cpu",
         true, &control_net_cpu},
        {"",
         "--clip-on-cpu",
-         "keep clip in cpu (for low vram)",
+         "deprecated; use --backend te=cpu",
         true, &clip_on_cpu},
        {"",
         "--vae-on-cpu",
-         "keep vae in cpu (for low vram)",
+         "deprecated; use --backend vae=cpu",
         true, &vae_on_cpu},
        {"",
         "--fa",
@ -688,6 +696,25 @@ bool SDContextParams::resolve_and_validate(SDMode mode) {
    return true;
 }

+void SDContextParams::prepare_backend_assignments() {
+    effective_backend        = backend;
+    effective_params_backend = params_backend;
+
+    if (offload_params_to_cpu) {
+        prepend_backend_assignment(effective_params_backend, "*=cpu");
+    }
+
+    if (clip_on_cpu) {
+        prepend_backend_assignment(effective_backend, "te=cpu");
+    }
+    if (vae_on_cpu) {
+        prepend_backend_assignment(effective_backend, "vae=cpu");
+    }
+    if (control_net_cpu) {
+        prepend_backend_assignment(effective_backend, "controlnet=cpu");
+    }
+}
+
 std::string SDContextParams::to_string() const {
    std::ostringstream emb_ss;
    emb_ss << "{\n";
@ -757,7 +784,8 @@ std::string SDContextParams::to_string() const {
    return oss.str();
 }

-sd_ctx_params_t SDContextParams::to_sd_ctx_params_t(bool vae_decode_only, bool free_params_immediately, bool taesd_preview) {
+sd_ctx_params_t SDContextParams::to_sd_ctx_params_t(bool taesd_preview) {
+    prepare_backend_assignments();
    embedding_vec.clear();
    embedding_vec.reserve(embedding_map.size());
    for (const auto& kv : embedding_map) {
@ -767,57 +795,52 @@ sd_ctx_params_t SDContextParams::to_sd_ctx_params_t(bool vae_decode_only, bool f
        embedding_vec.emplace_back(item);
    }

-    sd_ctx_params_t sd_ctx_params = {
-        model_path.c_str(),
-        clip_l_path.c_str(),
-        clip_g_path.c_str(),
-        clip_vision_path.c_str(),
-        t5xxl_path.c_str(),
-        llm_path.c_str(),
-        llm_vision_path.c_str(),
-        diffusion_model_path.c_str(),
-        high_noise_diffusion_model_path.c_str(),
-        uncond_diffusion_model_path.c_str(),
-        embeddings_connectors_path.c_str(),
-        vae_path.c_str(),
-        audio_vae_path.c_str(),
-        taesd_path.c_str(),
-        control_net_path.c_str(),
-        embedding_vec.data(),
-        static_cast<uint32_t>(embedding_vec.size()),
-        photo_maker_path.c_str(),
-        tensor_type_rules.c_str(),
-        vae_decode_only,
-        free_params_immediately,
-        n_threads,
-        wtype,
-        rng_type,
-        sampler_rng_type,
-        prediction,
-        lora_apply_mode,
-        offload_params_to_cpu,
-        enable_mmap,
-        clip_on_cpu,
-        control_net_cpu,
-        vae_on_cpu,
-        flash_attn,
-        diffusion_flash_attn,
-        taesd_preview,
-        diffusion_conv_direct,
-        vae_conv_direct,
-        circular || circular_x,
-        circular || circular_y,
-        force_sdxl_vae_conv_scale,
-        chroma_use_dit_mask,
-        chroma_use_t5_mask,
-        chroma_t5_mask_pad,
-        qwen_image_zero_cond_t,
-        str_to_vae_format(vae_format),
-        max_vram,
-        stream_layers,
-        backend.c_str(),
-        params_backend.c_str(),
-    };
+    sd_ctx_params_t sd_ctx_params;
+    sd_ctx_params_init(&sd_ctx_params);
+    sd_ctx_params.model_path                      = model_path.c_str();
+    sd_ctx_params.clip_l_path                     = clip_l_path.c_str();
+    sd_ctx_params.clip_g_path                     = clip_g_path.c_str();
+    sd_ctx_params.clip_vision_path                = clip_vision_path.c_str();
+    sd_ctx_params.t5xxl_path                      = t5xxl_path.c_str();
+    sd_ctx_params.llm_path                        = llm_path.c_str();
+    sd_ctx_params.llm_vision_path                 = llm_vision_path.c_str();
+    sd_ctx_params.diffusion_model_path            = diffusion_model_path.c_str();
+    sd_ctx_params.high_noise_diffusion_model_path = high_noise_diffusion_model_path.c_str();
+    sd_ctx_params.uncond_diffusion_model_path     = uncond_diffusion_model_path.c_str();
+    sd_ctx_params.embeddings_connectors_path      = embeddings_connectors_path.c_str();
+    sd_ctx_params.vae_path                        = vae_path.c_str();
+    sd_ctx_params.audio_vae_path                  = audio_vae_path.c_str();
+    sd_ctx_params.taesd_path                      = taesd_path.c_str();
+    sd_ctx_params.control_net_path                = control_net_path.c_str();
+    sd_ctx_params.embeddings                      = embedding_vec.data();
+    sd_ctx_params.embedding_count                 = static_cast<uint32_t>(embedding_vec.size());
+    sd_ctx_params.photo_maker_path                = photo_maker_path.c_str();
+    sd_ctx_params.tensor_type_rules               = tensor_type_rules.c_str();
+    sd_ctx_params.n_threads                       = n_threads;
+    sd_ctx_params.wtype                           = wtype;
+    sd_ctx_params.rng_type                        = rng_type;
+    sd_ctx_params.sampler_rng_type                = sampler_rng_type;
+    sd_ctx_params.prediction                      = prediction;
+    sd_ctx_params.lora_apply_mode                 = lora_apply_mode;
+    sd_ctx_params.enable_mmap                     = enable_mmap;
+    sd_ctx_params.flash_attn                      = flash_attn;
+    sd_ctx_params.diffusion_flash_attn            = diffusion_flash_attn;
+    sd_ctx_params.tae_preview_only                = taesd_preview;
+    sd_ctx_params.diffusion_conv_direct           = diffusion_conv_direct;
+    sd_ctx_params.vae_conv_direct                 = vae_conv_direct;
+    sd_ctx_params.circular_x                      = circular || circular_x;
+    sd_ctx_params.circular_y                      = circular || circular_y;
+    sd_ctx_params.force_sdxl_vae_conv_scale       = force_sdxl_vae_conv_scale;
+    sd_ctx_params.chroma_use_dit_mask             = chroma_use_dit_mask;
+    sd_ctx_params.chroma_use_t5_mask              = chroma_use_t5_mask;
+    sd_ctx_params.chroma_t5_mask_pad              = chroma_t5_mask_pad;
+    sd_ctx_params.qwen_image_zero_cond_t          = qwen_image_zero_cond_t;
+    sd_ctx_params.vae_format                      = str_to_vae_format(vae_format);
+    sd_ctx_params.max_vram                        = max_vram;
+    sd_ctx_params.stream_layers                   = stream_layers;
+    sd_ctx_params.backend                         = effective_backend.c_str();
+    sd_ctx_params.params_backend                  = effective_params_backend.c_str();
+    sd_ctx_params.rpc_servers                     = rpc_servers.c_str();
    return sd_ctx_params;
 }

--- a/examples/common/common.h
+++ b/examples/common/common.h
@ -148,6 +148,9 @@ struct SDContextParams {
    bool stream_layers          = false;
    std::string backend;
    std::string params_backend;
+    std::string rpc_servers;
+    std::string effective_backend;
+    std::string effective_params_backend;
    bool enable_mmap           = false;
    bool control_net_cpu       = false;
    bool clip_on_cpu           = false;
@ -175,11 +178,12 @@ struct SDContextParams {
    float flow_shift = INFINITY;
    ArgOptions get_options();
    void build_embedding_map();
+    void prepare_backend_assignments();
    bool resolve(SDMode mode);
    bool validate(SDMode mode);
    bool resolve_and_validate(SDMode mode);
    std::string to_string() const;
-    sd_ctx_params_t to_sd_ctx_params_t(bool vae_decode_only, bool free_params_immediately, bool taesd_preview);
+    sd_ctx_params_t to_sd_ctx_params_t(bool taesd_preview);
 };

 struct SDGenerationParams {
--- a/examples/server/README.md
+++ b/examples/server/README.md
@ -117,188 +117,10 @@ In this case, the server will load and serve the specified `index.html` file ins
 * using a custom UI
 * avoiding rebuilding the binary after frontend modifications

-# Run
+# Usage

-```
-usage: ./bin/sd-server  [options]
-
-Svr Options:
-  -l, --listen-ip <string>      server listen ip (default: 127.0.0.1)
-  --serve-html-path <string>    path to HTML file to serve at root (optional)
-  --listen-port <int>           server listen port (default: 1234)
-  -v, --verbose                 print extra info
-  --color                       colors the logging tags according to level
-  -h, --help                    show this help message and exit
-
-Context Options:
-  -m, --model <string>                     path to full model
-  --clip_l <string>                        path to the clip-l text encoder
-  --clip_g <string>                        path to the clip-g text encoder
-  --clip_vision <string>                   path to the clip-vision encoder
-  --t5xxl <string>                         path to the t5xxl text encoder
-  --llm <string>                           path to the llm text encoder. For example: (qwenvl2.5 for qwen-image,
-                                           mistral-small3.2 for flux2, ...)
-  --llm_vision <string>                    path to the llm vit
-  --qwen2vl <string>                       alias of --llm. Deprecated.
-  --qwen2vl_vision <string>                alias of --llm_vision. Deprecated.
-  --diffusion-model <string>               path to the standalone diffusion model
-  --high-noise-diffusion-model <string>    path to the standalone high noise diffusion model
-  --uncond-diffusion-model <string>        path to the standalone unconditional diffusion model, currently used by
-                                           Ideogram4 CFG
-  --vae <string>                           path to standalone vae model
-  --taesd <string>                         path to taesd. Using Tiny AutoEncoder for fast decoding (low quality)
-  --tae <string>                           alias of --taesd
-  --control-net <string>                   path to control net model
-  --embd-dir <string>                      embeddings directory
-  --lora-model-dir <string>                lora model directory
-  --hires-upscalers-dir <string>           highres fix upscaler model directory
-  --tensor-type-rules <string>             weight type per tensor pattern (example: "^vae\.=f16,model\.=q8_0")
-  --photo-maker <string>                   path to PHOTOMAKER model
-  --upscale-model <string>                 path to esrgan model.
-  -t, --threads <int>                      number of threads to use during computation (default: -1). If threads <= 0,
-                                           then threads will be set to the number of CPU physical cores
-  --chroma-t5-mask-pad <int>               t5 mask pad size of chroma
-  --max-vram <float>                       maximum VRAM budget in GiB for graph-cut segmented execution. 0 disables
-                                           graph splitting; a negative value auto-detects free VRAM, sparing the
-                                           specified value (e.g. -0.5 will keep at least 0.5 GiB free)
-  --force-sdxl-vae-conv-scale              force use of conv scale on sdxl vae
-  --offload-to-cpu                         place the weights in RAM to save VRAM, and automatically load them into VRAM
-                                           when needed
-  --mmap                                   whether to memory-map model
-  --control-net-cpu                        keep controlnet in cpu (for low vram)
-  --clip-on-cpu                            keep clip in cpu (for low vram)
-  --vae-on-cpu                             keep vae in cpu (for low vram)
-  --fa                                     use flash attention
-  --diffusion-fa                           use flash attention in the diffusion model only
-  --diffusion-conv-direct                  use ggml_conv2d_direct in the diffusion model
-  --vae-conv-direct                        use ggml_conv2d_direct in the vae model
-  --circular                               enable circular padding for convolutions
-  --circularx                              enable circular RoPE wrapping on x-axis (width) only
-  --circulary                              enable circular RoPE wrapping on y-axis (height) only
-  --chroma-disable-dit-mask                disable dit mask for chroma
-  --qwen-image-zero-cond-t                 enable zero_cond_t for qwen image
-  --chroma-enable-t5-mask                  enable t5 mask for chroma
-  --type                                   weight type (examples: f32, f16, q4_0, q4_1, q5_0, q5_1, q8_0, q2_K, q3_K,
-                                           q4_K). If not specified, the default is the type of the weight file
-  --rng                                    RNG, one of [std_default, cuda, cpu], default: cuda(sd-webui), cpu(comfyui)
-  --sampler-rng                            sampler RNG, one of [std_default, cuda, cpu]. If not specified, use --rng
-  --prediction                             prediction type override, one of [eps, v, edm_v, sd3_flow, flux_flow,
-                                           flux2_flow]
-  --lora-apply-mode                        the way to apply LoRA, one of [auto, immediately, at_runtime], default is
-                                           auto. In auto mode, if the model weights contain any quantized parameters,
-                                           the at_runtime mode will be used; otherwise, immediately will be used.The
-                                           immediately mode may have precision and compatibility issues with quantized
-                                           parameters, but it usually offers faster inference speed and, in some cases,
-                                           lower memory usage. The at_runtime mode, on the other hand, is exactly the
-                                           opposite.
-
-Default Generation Options:
-  -p, --prompt <string>                    the prompt to render
-  -n, --negative-prompt <string>           the negative prompt (default: "")
-  -i, --init-img <string>                  path to the init image
-  --end-img <string>                       path to the end image, required by flf2v
-  --mask <string>                          path to the mask image
-  --control-image <string>                 path to control image, control net
-  --control-video <string>                 path to control video frames, It must be a directory path. The video frames
-                                           inside should be stored as images in lexicographical (character) order. For
-                                           example, if the control video path is `frames`, the directory contain images
-                                           such as 00.png, 01.png, ... etc.
-  --pm-id-images-dir <string>              path to PHOTOMAKER input id images dir
-  --pm-id-embed-path <string>              path to PHOTOMAKER v2 id embed
-  --hires-upscaler <string>                highres fix upscaler, Lanczos, Nearest, Latent, Latent (nearest), Latent
-                                           (nearest-exact), Latent (antialiased), Latent (bicubic), Latent (bicubic
-                                           antialiased), or a model name under --hires-upscalers-dir (default: Latent)
-  --extra-sample-args <string>             extra sampler/scheduler/guidance args, key=value list. APG supports apg_eta,
-                                           apg_momentum, apg_norm_threshold, apg_norm_threshold_smoothing; SLG supports
-                                           slg_uncond; lcm supports noise_clip_std, noise_scale_start, noise_scale_end;
-                                           ltx2 supports max_shift, base_shift, stretch, terminal; euler_ge supports gamma
-  --extra-tiling-args <string>             extra VAE tiling args, key=value list. LTX video VAE supports
-                                           temporal_tile_frames (default: 4), temporal_tile_overlap (default: 1)
-  -H, --height <int>                       image height, in pixel space (default: 512)
-  -W, --width <int>                        image width, in pixel space (default: 512)
-  --steps <int>                            number of sample steps (default: 20)
-  --high-noise-steps <int>                 (high noise) number of sample steps (default: -1 = auto)
-  --clip-skip <int>                        ignore last layers of CLIP network; 1 ignores none, 2 ignores one layer
-                                           (default: -1). <= 0 represents unspecified, will be 1 for SD1.x, 2 for SD2.x
-  -b, --batch-count <int>                  batch count
-  --video-frames <int>                     video frames (default: 1)
-  --fps <int>                              fps (default: 24)
-  --timestep-shift <int>                   shift timestep for NitroFusion models (default: 0). recommended N for
-                                           NitroSD-Realism around 250 and 500 for NitroSD-Vibrant
-  --upscale-repeats <int>                  Run the ESRGAN upscaler this many times (default: 1)
-  --upscale-tile-size <int>                tile size for ESRGAN upscaling (default: 128)
-  --hires-width <int>                      highres fix target width, 0 to use --hires-scale (default: 0)
-  --hires-height <int>                     highres fix target height, 0 to use --hires-scale (default: 0)
-  --hires-steps <int>                      highres fix second pass sample steps, 0 to reuse --steps (default: 0)
-  --hires-upscale-tile-size <int>          highres fix upscaler tile size, reserved for model-backed upscalers (default:
-                                           128)
-  --cfg-scale <float>                      unconditional guidance scale: (default: 7.0)
-  --img-cfg-scale <float>                  image guidance scale for inpaint or image edit models: (default: same as
-                                           --cfg-scale)
-  --guidance <float>                       distilled guidance scale for models with guidance input (default: 3.5)
-  --slg-scale <float>                      skip layer guidance (SLG) scale, only for DiT models: (default: 0). 0 means
-                                           disabled, a value of 2.5 is nice for sd3.5 medium
-  --skip-layer-start <float>               SLG enabling point (default: 0.01)
-  --skip-layer-end <float>                 SLG disabling point (default: 0.2)
-  --eta <float>                            noise multiplier (default: 0 for ddim_trailing, tcd, res_multistep and
-                                           res_2s; 1 for euler_a, er_sde and dpm++2s_a)
-  --flow-shift <float>                     shift value for Flow models like SD3.x or WAN (default: auto)
-  --high-noise-cfg-scale <float>           (high noise) unconditional guidance scale: (default: 7.0)
-  --high-noise-img-cfg-scale <float>       (high noise) image guidance scale for inpaint or image edit models (default:
-                                           same as --cfg-scale)
-  --high-noise-guidance <float>            (high noise) distilled guidance scale for models with guidance input
-                                           (default: 3.5)
-  --high-noise-slg-scale <float>           (high noise) skip layer guidance (SLG) scale, only for DiT models: (default:
-                                           0)
-  --high-noise-skip-layer-start <float>    (high noise) SLG enabling point (default: 0.01)
-  --high-noise-skip-layer-end <float>      (high noise) SLG disabling point (default: 0.2)
-  --high-noise-eta <float>                 (high noise) noise multiplier (default: 0 for ddim_trailing, tcd,
-                                           res_multistep and res_2s; 1 for euler_a, er_sde and dpm++2s_a)
-  --strength <float>                       strength for noising/unnoising (default: 0.75)
-  --pm-style-strength <float>
-  --control-strength <float>               strength to apply Control Net (default: 0.9). 1.0 corresponds to full
-                                           destruction of information in init image
-  --moe-boundary <float>                   timestep boundary for Wan2.2 MoE model. (default: 0.875). Only enabled if
-                                           `--high-noise-steps` is set to -1
-  --vace-strength <float>                  wan vace strength
-  --vae-tile-overlap <float>               tile overlap for vae tiling, in fraction of tile size (default: 0.5)
-  --hires-scale <float>                    highres fix scale when target size is not set (default: 2.0)
-  --hires-denoising-strength <float>       highres fix second pass denoising strength (default: 0.7)
-  --increase-ref-index                     automatically increase the indices of references images based on the order
-                                           they are listed (starting with 1).
-  --disable-auto-resize-ref-image          disable auto resize of ref images
-  --disable-image-metadata                 do not embed generation metadata on image files
-  --vae-tiling                             process vae in tiles to reduce memory usage
-  --temporal-tiling                        enable temporal tiling for LTX video VAE decode
-  --hires                                  enable highres fix
-  -s, --seed                               RNG seed (default: 42, use random seed for < 0)
-  --sampling-method                        sampling method, one of [euler, euler_a, heun, dpm2, dpm++2s_a, dpm++2m,
-                                           dpm++2mv2, ipndm, ipndm_v, lcm, ddim_trailing, tcd, res_multistep, res_2s,
-                                           er_sde, euler_cfg_pp, euler_a_cfg_pp] (default: euler for Flux/SD3/Wan, euler_a otherwise)
-  --high-noise-sampling-method             (high noise) sampling method, one of [euler, euler_a, heun, dpm2, dpm++2s_a,
-                                           dpm++2m, dpm++2mv2, ipndm, ipndm_v, lcm, ddim_trailing, tcd, res_multistep,
-                                           res_2s, er_sde, euler_cfg_pp, euler_a_cfg_pp] default: euler for Flux/SD3/Wan, euler_a otherwise
-  --scheduler                              denoiser sigma scheduler, one of [discrete, karras, exponential, ays, gits,
-                                           smoothstep, sgm_uniform, simple, kl_optimal, lcm, bong_tangent, ltx2], default:
-                                           model-specific
-  --sigmas                                 custom sigma values for the sampler, comma-separated (e.g.,
-                                           "14.61,7.8,3.5,0.0").
-  --hires-sigmas                           custom sigma values for the highres fix second pass, comma-separated (e.g.,
-                                           "0.85,0.725,0.421875,0.0").
-  --skip-layers                            layers to skip for SLG steps (default: [7,8,9])
-  --high-noise-skip-layers                 (high noise) layers to skip for SLG steps (default: [7,8,9])
-  -r, --ref-image                          reference image for Flux Kontext models (can be used multiple times)
-  --cache-mode                             caching method: 'easycache' (DiT), 'ucache' (UNET),
-                                           'dbcache'/'taylorseer'/'cache-dit' (DiT block-level), 'spectrum' (UNET/DiT
-                                           Chebyshev+Taylor forecasting)
-  --cache-option                           named cache params (key=value format, comma-separated). easycache/ucache:
-                                           threshold=,start=,end=,decay=,relative=,reset=; dbcache/taylorseer/cache-dit:
-                                           Fn=,Bn=,threshold=,warmup=; spectrum: w=,m=,lam=,window=,flex=,warmup=,stop=.
-                                           Examples: "threshold=0.25" or "threshold=1.5,reset=0"
-  --scm-mask                               SCM steps mask for cache-dit: comma-separated 0/1 (e.g.,
-                                           "1,1,1,0,0,1,0,0,1,0") - 1=compute, 0=can cache
-  --scm-policy                             SCM policy: 'dynamic' (default) or 'static'
-  --vae-tile-size                          tile size for vae tiling, format [X]x[Y] (default: 32x32)
-  --vae-relative-tile-size                 relative tile size for vae tiling, format [X]x[Y], in fraction of image size
-                                           if < 1, in number of tiles per dim if >=1 (overrides --vae-tile-size)
+For detailed command-line arguments, run:
+
+```bash
+./bin/sd-server -h
 ```
--- a/examples/server/main.cpp
+++ b/examples/server/main.cpp
@ -85,7 +85,7 @@ int main(int argc, const char** argv) {
    LOG_DEBUG("%s", ctx_params.to_string().c_str());
    LOG_DEBUG("%s", default_gen_params.to_string().c_str());

-    sd_ctx_params_t sd_ctx_params = ctx_params.to_sd_ctx_params_t(false, false, false);
+    sd_ctx_params_t sd_ctx_params = ctx_params.to_sd_ctx_params_t(false);
    SDCtxPtr sd_ctx(new_sd_ctx(&sd_ctx_params));

    if (sd_ctx == nullptr) {
--- a/include/stable-diffusion.h
+++ b/include/stable-diffusion.h
@ -196,19 +196,13 @@ typedef struct {
    uint32_t embedding_count;
    const char* photo_maker_path;
    const char* tensor_type_rules;
-    bool vae_decode_only;
-    bool free_params_immediately;
    int n_threads;
    enum sd_type_t wtype;
    enum rng_type_t rng_type;
    enum rng_type_t sampler_rng_type;
    enum prediction_t prediction;
    enum lora_apply_mode_t lora_apply_mode;
-    bool offload_params_to_cpu;
    bool enable_mmap;
-    bool keep_clip_on_cpu;
-    bool keep_control_net_on_cpu;
-    bool keep_vae_on_cpu;
    bool flash_attn;
    bool diffusion_flash_attn;
    bool tae_preview_only;
@ -226,6 +220,7 @@ typedef struct {
    bool stream_layers;  // Enable residency+prefetch streaming on top of --max-vram (no effect without --max-vram)
    const char* backend;
    const char* params_backend;
+    const char* rpc_servers;
 } sd_ctx_params_t;

 typedef struct {
@ -460,7 +455,6 @@ SD_API bool generate_video(sd_ctx_t* sd_ctx,
 typedef struct upscaler_ctx_t upscaler_ctx_t;

 SD_API upscaler_ctx_t* new_upscaler_ctx(const char* esrgan_path,
-                                        bool offload_params_to_cpu,
                                        bool direct,
                                        int n_threads,
                                        int tile_size,
--- a/src/core/ggml_extend.hpp
+++ b/src/core/ggml_extend.hpp
@ -2007,6 +2007,10 @@ protected:
    }

    bool copy_cache_tensors_to_cache_buffer(const std::unordered_set<std::string>* cache_keep_names = nullptr) {
+        if (cache_tensor_map.empty() && cache_keep_names == nullptr) {
+            return true;
+        }
+
        ggml_context* old_cache_ctx            = cache_ctx;
        ggml_backend_buffer_t old_cache_buffer = cache_buffer;
        cache_ctx                              = nullptr;
--- a/src/core/ggml_extend_backend.cpp
+++ b/src/core/ggml_extend_backend.cpp
@ -45,6 +45,10 @@ static bool is_default_backend_token(const std::string& name) {
    return lower.empty() || lower == "default" || lower == "auto";
 }

+static bool is_disk_backend_token(const std::string& name) {
+    return lower_copy(trim_copy(name)) == "disk";
+}
+
 static bool parse_backend_module(const std::string& raw_name, SDBackendModule* module) {
    std::string name = lower_copy(trim_copy(raw_name));
    name.erase(std::remove(name.begin(), name.end(), '-'), name.end());
@ -200,6 +204,36 @@ void ggml_ext_im_set_f32_1d(const struct ggml_tensor* tensor, int i, float value
    }
 }

+bool add_rpc_devices(const std::string& servers) {
+    const std::string in = trim_copy(servers);
+    if (in.empty()) {
+        return true;
+    }
+    auto rpc_servers = split_copy(in, ',');
+    if (rpc_servers.empty()) {
+        LOG_ERROR("invalid RPC servers specification: '%s'", servers.c_str());
+        return false;
+    }
+    ggml_backend_reg_t rpc_reg = ggml_backend_reg_by_name("RPC");
+    if (!rpc_reg) {
+        LOG_ERROR("RPC backend not found, cannot add RPC servers");
+        return false;
+    }
+    typedef ggml_backend_reg_t (*ggml_backend_rpc_add_server_t)(const char* endpoint);
+    ggml_backend_rpc_add_server_t ggml_backend_rpc_add_server_fn = (ggml_backend_rpc_add_server_t)ggml_backend_reg_get_proc_address(rpc_reg, "ggml_backend_rpc_add_server");
+    if (!ggml_backend_rpc_add_server_fn) {
+        LOG_ERROR("RPC backend does not have ggml_backend_rpc_add_server function, cannot add RPC servers");
+        return false;
+    }
+    for (const auto& server : rpc_servers) {
+        LOG_INFO("Adding RPC server: %s", server.c_str());
+        auto reg = ggml_backend_rpc_add_server_fn(server.c_str());
+        // The add-server call reports no success status; the RPC backend is expected to log an error itself if the server cannot be added.
+        ggml_backend_register(reg);
+    }
+    return true;
+}
+
 static void ggml_backend_load_all_once() {
    // If the registry already has devices and the CPU backend is present,
    // assume either static registration or explicit host-side preloading has
@ -504,6 +538,9 @@ ggml_backend_t SDBackendManager::params_backend(SDBackendModule module) {
    if (name.empty()) {
        return runtime_backend(module);
    }
+    if (is_disk_backend_token(name)) {
+        return runtime_backend(module);
+    }
    return init_cached_backend(name);
 }

@ -515,6 +552,10 @@ bool SDBackendManager::params_backend_is_cpu(SDBackendModule module) {
    return sd_backend_is_cpu(params_backend(module));
 }

+bool SDBackendManager::params_backend_is_disk(SDBackendModule module) const {
+    return is_disk_backend_token(params_assignment_.get(module));
+}
+
 bool SDBackendManager::runtime_backend_supports_host_buffer(SDBackendModule module) {
    ggml_backend_t backend = runtime_backend(module);
    if (backend == nullptr) {
@ -534,10 +575,6 @@ bool SDBackendManager::runtime_backend_supports_host_buffer(SDBackendModule modu

 bool SDBackendManager::init(const char* backend_spec,
                            const char* params_backend_spec,
-                            bool offload_params_to_cpu,
-                            bool keep_clip_on_cpu,
-                            bool keep_vae_on_cpu,
-                            bool keep_control_net_on_cpu,
                            std::string* error) {
    reset();

@ -548,30 +585,20 @@ bool SDBackendManager::init(const char* backend_spec,
        return false;
    }

-    if (runtime_assignment_.empty()) {
-        if (keep_clip_on_cpu) {
-            runtime_assignment_.set_module(SDBackendModule::TE, "cpu");
-        }
-        if (keep_vae_on_cpu) {
-            runtime_assignment_.set_module(SDBackendModule::VAE, "cpu");
-        }
-        if (keep_control_net_on_cpu) {
-            runtime_assignment_.set_module(SDBackendModule::CONTROL_NET, "cpu");
-        }
-    }
-
-    if (params_assignment_.empty() && offload_params_to_cpu) {
-        params_assignment_.set_default("cpu");
-    }
-
    return validate(error);
 }

 bool SDBackendManager::validate(std::string* error) const {
-    auto validate_name = [&](const std::string& name) -> bool {
+    auto validate_runtime_name = [&](const std::string& name) -> bool {
        if (is_default_backend_token(name)) {
            return true;
        }
+        if (is_disk_backend_token(name)) {
+            if (error != nullptr) {
+                *error = "backend 'disk' is only supported by params_backend";
+            }
+            return false;
+        }
        if (!sd_resolve_backend_name(name).empty()) {
            return true;
        }
@ -580,18 +607,24 @@ bool SDBackendManager::validate(std::string* error) const {
        }
        return false;
    };
+    auto validate_params_name = [&](const std::string& name) -> bool {
+        if (is_disk_backend_token(name)) {
+            return true;
+        }
+        return validate_runtime_name(name);
+    };

-    if (!validate_name(runtime_assignment_.default_name) ||
-        !validate_name(params_assignment_.default_name)) {
+    if (!validate_runtime_name(runtime_assignment_.default_name) ||
+        !validate_params_name(params_assignment_.default_name)) {
        return false;
    }
    for (const auto& kv : runtime_assignment_.module_names) {
-        if (!validate_name(kv.second)) {
+        if (!validate_runtime_name(kv.second)) {
            return false;
        }
    }
    for (const auto& kv : params_assignment_.module_names) {
-        if (!validate_name(kv.second)) {
+        if (!validate_params_name(kv.second)) {
            return false;
        }
    }
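
Review note: the validation change above splits one lambda into two, so that the `disk` token is accepted for `params_backend` but rejected for the runtime backend. The following is a minimal standalone sketch of that rule, not the project's code. Every `*_sketch` name is hypothetical, the set of resolvable backends is reduced to `cpu` as an assumption, and the "unknown backend" message is invented because the original text is elided between hunks.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical stand-ins for the project's trim_copy/lower_copy helpers.
static std::string trim_copy_sketch(const std::string& s) {
    const char* ws          = " \t\r\n";
    const std::size_t begin = s.find_first_not_of(ws);
    if (begin == std::string::npos) {
        return "";
    }
    const std::size_t end = s.find_last_not_of(ws);
    return s.substr(begin, end - begin + 1);
}

static std::string lower_copy_sketch(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

static bool is_disk_token_sketch(const std::string& name) {
    return lower_copy_sketch(trim_copy_sketch(name)) == "disk";
}

static bool is_default_token_sketch(const std::string& name) {
    const std::string lower = lower_copy_sketch(trim_copy_sketch(name));
    return lower.empty() || lower == "default" || lower == "auto";
}

// Assumption: only "cpu" resolves to a device in this sketch.
static bool resolves_sketch(const std::string& name) {
    return lower_copy_sketch(trim_copy_sketch(name)) == "cpu";
}

// Runtime backends: default tokens and resolvable names pass, "disk" is rejected.
static bool validate_runtime_name_sketch(const std::string& name, std::string* error) {
    if (is_default_token_sketch(name)) {
        return true;
    }
    if (is_disk_token_sketch(name)) {
        if (error != nullptr) {
            *error = "backend 'disk' is only supported by params_backend";
        }
        return false;
    }
    if (resolves_sketch(name)) {
        return true;
    }
    if (error != nullptr) {
        *error = "unknown backend '" + name + "'";  // invented message
    }
    return false;
}

// Params backends: "disk" is accepted first, everything else follows the runtime rule.
static bool validate_params_name_sketch(const std::string& name, std::string* error) {
    if (is_disk_token_sketch(name)) {
        return true;
    }
    return validate_runtime_name_sketch(name, error);
}
```

Checking `disk` before delegating keeps the runtime rule as the single source of truth for every other name.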
--- a/src/core/ggml_extend_backend.h
+++ b/src/core/ggml_extend_backend.h
@ -51,10 +51,6 @@ public:

    bool init(const char* backend_spec,
              const char* params_backend_spec,
-              bool offload_params_to_cpu,
-              bool keep_clip_on_cpu,
-              bool keep_vae_on_cpu,
-              bool keep_control_net_on_cpu,
              std::string* error);
    void reset();

@ -63,6 +59,7 @@ public:

    bool runtime_backend_is_cpu(SDBackendModule module);
    bool params_backend_is_cpu(SDBackendModule module);
+    bool params_backend_is_disk(SDBackendModule module) const;
    bool runtime_backend_supports_host_buffer(SDBackendModule module);

 private:
@ -76,4 +73,5 @@ ggml_backend_t sd_backend_cpu_init();
 bool sd_backend_cpu_set_n_threads(ggml_backend_t backend_cpu, int n_threads);
 const char* sd_backend_module_name(SDBackendModule module);
 void ggml_ext_im_set_f32_1d(const struct ggml_tensor* tensor, int i, float value);
+bool add_rpc_devices(const std::string& servers);
 #endif  // __SD_CORE_GGML_EXTEND_BACKEND_H__
--- a/src/model/adapter/lora.hpp
+++ b/src/model/adapter/lora.hpp
@ -101,7 +101,7 @@ struct LoraModel : public GGMLRunner {
        if (model_manager == nullptr ||
            !model_manager->register_param_tensors("LoRA",
                                                   std::move(tensors),
-                                                   ModelManager::ResidencyMode::Resident,
+                                                   ModelManager::ResidencyMode::ParamBackend,
                                                   runtime_backend,
                                                   params_backend) ||
            !model_manager->validate_registered_tensors()) {
--- a/src/model/adapter/pmid.hpp
+++ b/src/model/adapter/pmid.hpp
@ -622,7 +622,7 @@ struct PhotoMakerIDEmbed : public GGMLRunner {
        model_loader.load_tensors(on_new_tensor_cb);
        if (!model_manager->register_param_tensors("PhotoMaker ID embeds",
                                                   tensors,
-                                                   ModelManager::ResidencyMode::Resident,
+                                                   ModelManager::ResidencyMode::ParamBackend,
                                                   runtime_backend,
                                                   params_backend) ||
            !model_manager->validate_registered_tensors()) {
--- a/src/model/diffusion/control.hpp
+++ b/src/model/diffusion/control.hpp
@ -312,16 +312,17 @@ struct ControlNet : public GGMLRunner {
    ControlNetBlock control_net;
    std::string weight_prefix;

-    ggml_backend_buffer_t control_buffer = nullptr;
-    ggml_context* control_ctx            = nullptr;
    std::vector<ggml_tensor*> control_outputs_ggml;
    ggml_tensor* guided_hint_output_ggml = nullptr;
    std::vector<sd::Tensor<float>> controls;
-    sd::Tensor<float> guided_hint;
    bool guided_hint_cached = false;
    std::shared_ptr<ModelManager> owned_model_manager;
    ggml_backend_t params_backend = nullptr;

+    static const char* guided_hint_cache_name() {
+        return "controlnet.guided_hint";
+    }
+
    ControlNet(ggml_backend_t backend,
               ggml_backend_t params_backend_,
               const String2TensorStorage& tensor_storage_map      = {},
@ -336,44 +337,12 @@ struct ControlNet : public GGMLRunner {
        free_control_ctx();
    }

-    void alloc_control_ctx(std::vector<ggml_tensor*> outs) {
-        ggml_init_params params;
-        params.mem_size   = static_cast<size_t>(outs.size() * ggml_tensor_overhead()) + 1024 * 1024;
-        params.mem_buffer = nullptr;
-        params.no_alloc   = true;
-        control_ctx       = ggml_init(params);
-
-        control_outputs_ggml.resize(outs.size() - 1);
-
-        size_t control_buffer_size = 0;
-
-        guided_hint_output_ggml = ggml_dup_tensor(control_ctx, outs[0]);
-        control_buffer_size += ggml_nbytes(guided_hint_output_ggml);
-
-        for (int i = 0; i < outs.size() - 1; i++) {
-            control_outputs_ggml[i] = ggml_dup_tensor(control_ctx, outs[i + 1]);
-            control_buffer_size += ggml_nbytes(control_outputs_ggml[i]);
-        }
-
-        control_buffer = ggml_backend_alloc_ctx_tensors(control_ctx, runtime_backend);
-
-        LOG_DEBUG("control buffer size %.2fMB", control_buffer_size * 1.f / 1024.f / 1024.f);
-    }
-
    void free_control_ctx() {
-        if (control_buffer != nullptr) {
-            ggml_backend_buffer_free(control_buffer);
-            control_buffer = nullptr;
-        }
-        if (control_ctx != nullptr) {
-            ggml_free(control_ctx);
-            control_ctx = nullptr;
-        }
        guided_hint_output_ggml = nullptr;
        guided_hint_cached      = false;
-        guided_hint             = {};
        control_outputs_ggml.clear();
        controls.clear();
+        free_cache_ctx_and_buffer();
    }

    std::string get_desc() override {
@ -397,11 +366,17 @@ struct ControlNet : public GGMLRunner {
        ggml_tensor* context   = make_optional_input(context_tensor);
        ggml_tensor* y         = make_optional_input(y_tensor);

+        guided_hint_output_ggml = nullptr;
+        control_outputs_ggml.clear();
+
        ggml_tensor* guided_hint_input = nullptr;
-        if (guided_hint_cached && !guided_hint.empty()) {
-            guided_hint_input = make_input(guided_hint);
-            hint              = nullptr;
-        } else {
+        if (guided_hint_cached) {
+            guided_hint_input = get_cache_tensor_by_name(guided_hint_cache_name());
+            if (guided_hint_input == nullptr) {
+                guided_hint_cached = false;
+            }
+        }
+        if (guided_hint_input == nullptr) {
            hint = make_input(hint_tensor);
        }

@ -415,13 +390,19 @@ struct ControlNet : public GGMLRunner {
                                        context,
                                        y);

-        if (control_ctx == nullptr) {
-            alloc_control_ctx(outs);
+        if (guided_hint_input == nullptr && !outs.empty()) {
+            guided_hint_output_ggml = outs[0];
+            ggml_set_output(guided_hint_output_ggml);
+            cache(guided_hint_cache_name(), guided_hint_output_ggml);
+            ggml_build_forward_expand(gf, guided_hint_output_ggml);
        }

-        ggml_build_forward_expand(gf, ggml_cpy(compute_ctx, outs[0], guided_hint_output_ggml));
-        for (int i = 0; i < outs.size() - 1; i++) {
-            ggml_build_forward_expand(gf, ggml_cpy(compute_ctx, outs[i + 1], control_outputs_ggml[i]));
+        control_outputs_ggml.reserve(outs.size() > 0 ? outs.size() - 1 : 0);
+        for (size_t i = 1; i < outs.size(); i++) {
+            ggml_tensor* control_output = outs[i];
+            ggml_set_output(control_output);
+            ggml_build_forward_expand(gf, control_output);
+            control_outputs_ggml.push_back(control_output);
        }

        return gf;
@ -441,15 +422,12 @@ struct ControlNet : public GGMLRunner {
            return build_graph(x, hint, timesteps, context, y);
        };

-        auto compute_result = GGMLRunner::compute<float>(get_graph, n_threads, false, false, false);
+        auto compute_result = GGMLRunner::compute<float>(get_graph, n_threads, false, false, false, true);
        if (!compute_result.has_value()) {
            return std::nullopt;
        }

-        if (guided_hint_output_ggml != nullptr) {
-            guided_hint = restore_trailing_singleton_dims(sd::make_sd_tensor_from_ggml<float>(guided_hint_output_ggml),
-                                                          4);
-        }
+        guided_hint_cached = get_cache_tensor_by_name(guided_hint_cache_name()) != nullptr;
        controls.clear();
        controls.reserve(control_outputs_ggml.size());
        for (ggml_tensor* control : control_outputs_ggml) {
@ -457,7 +435,6 @@ struct ControlNet : public GGMLRunner {
            GGML_ASSERT(!control_host.empty());
            controls.push_back(std::move(control_host));
        }
-        guided_hint_cached = true;
        return controls;
    }

@ -482,7 +459,7 @@ struct ControlNet : public GGMLRunner {
        manager->set_n_threads(n_threads);
        if (!manager->register_param_tensors("ControlNet",
                                             std::move(tensors),
-                                             ModelManager::ResidencyMode::Resident,
+                                             ModelManager::ResidencyMode::ParamBackend,
                                             runtime_backend,
                                             params_backend) ||
            !manager->validate_registered_tensors()) {
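
Review note: the ControlNet change above replaces the hand-managed `control_ctx`/`control_buffer` pair with a lookup by cache name, and drops the `guided_hint_cached` flag whenever the named entry is missing. The sketch below illustrates only that control flow with plain standard-library types. `HintCacheSketch`, its members, and the doubling "computation" are hypothetical placeholders, and the real code works on `ggml_tensor*` graph nodes rather than vectors.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

struct HintCacheSketch {
    // Stand-in for the runner's named cache tensors.
    std::unordered_map<std::string, std::vector<float>> entries;
    bool hint_cached  = false;
    int compute_calls = 0;

    static const char* hint_name() {
        return "controlnet.guided_hint";
    }

    const std::vector<float>* find(const std::string& name) const {
        auto it = entries.find(name);
        return it == entries.end() ? nullptr : &it->second;
    }

    std::vector<float> guided_hint(const std::vector<float>& hint) {
        const std::vector<float>* cached = nullptr;
        if (hint_cached) {
            cached = find(hint_name());
            if (cached == nullptr) {
                hint_cached = false;  // the flag was stale, fall back to recomputing
            }
        }
        if (cached != nullptr) {
            return *cached;
        }
        ++compute_calls;
        std::vector<float> out(hint);
        for (float& v : out) {
            v *= 2.0f;  // placeholder for the real hint network
        }
        entries[hint_name()] = out;
        // Mirror the diff: the flag is derived from the cache, not set unconditionally.
        hint_cached = find(hint_name()) != nullptr;
        return out;
    }
};
```

Deriving the flag from the lookup means a freed or evicted cache cannot leave the runner reading a hint that no longer exists.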
--- a/src/model/diffusion/flux.hpp
+++ b/src/model/diffusion/flux.hpp
@ -1609,7 +1609,7 @@ namespace Flux {
            if (!model_manager->register_runner_params("Flux test",
                                                       *flux,
                                                       "model.diffusion_model",
-                                                       ModelManager::ResidencyMode::Resident,
+                                                       ModelManager::ResidencyMode::ParamBackend,
                                                       backend,
                                                       backend) ||
                !model_manager->validate_registered_tensors()) {
--- a/src/model/diffusion/ltxv.hpp
+++ b/src/model/diffusion/ltxv.hpp
@ -2048,7 +2048,7 @@ namespace LTXV {
            if (!model_manager->register_runner_params("LTXAV test",
                                                       *ltxav,
                                                       "model.diffusion_model",
-                                                       ModelManager::ResidencyMode::Resident,
+                                                       ModelManager::ResidencyMode::ParamBackend,
                                                       backend,
                                                       backend) ||
                !model_manager->validate_registered_tensors()) {
--- a/src/model/diffusion/mmdit.hpp
+++ b/src/model/diffusion/mmdit.hpp
@ -1015,7 +1015,7 @@ struct MMDiTRunner : public DiffusionModelRunner {
            if (!model_manager->register_runner_params("MMDiT test",
                                                       *mmdit,
                                                       "model.diffusion_model",
-                                                       ModelManager::ResidencyMode::Resident,
+                                                       ModelManager::ResidencyMode::ParamBackend,
                                                       backend,
                                                       backend) ||
                !model_manager->validate_registered_tensors()) {
--- a/src/model/diffusion/qwen_image.hpp
+++ b/src/model/diffusion/qwen_image.hpp
@ -715,7 +715,7 @@ namespace Qwen {
            if (!model_manager->register_runner_params("Qwen image test",
                                                       *qwen_image,
                                                       "model.diffusion_model",
-                                                       ModelManager::ResidencyMode::Resident,
+                                                       ModelManager::ResidencyMode::ParamBackend,
                                                       backend,
                                                       backend) ||
                !model_manager->validate_registered_tensors()) {
--- a/src/model/diffusion/wan.hpp
+++ b/src/model/diffusion/wan.hpp
@ -1040,7 +1040,7 @@ namespace WAN {
            if (!model_manager->register_runner_params("Wan test",
                                                       *wan,
                                                       "model.diffusion_model",
-                                                       ModelManager::ResidencyMode::Resident,
+                                                       ModelManager::ResidencyMode::ParamBackend,
                                                       backend,
                                                       backend) ||
                !model_manager->validate_registered_tensors()) {
--- a/src/model/diffusion/z_image.hpp
+++ b/src/model/diffusion/z_image.hpp
@ -723,7 +723,7 @@ namespace ZImage {
            if (!model_manager->register_runner_params("ZImage test",
                                                       *z_image,
                                                       "model.diffusion_model",
-                                                       ModelManager::ResidencyMode::Resident,
+                                                       ModelManager::ResidencyMode::ParamBackend,
                                                       backend,
                                                       backend) ||
                !model_manager->validate_registered_tensors()) {
--- a/src/model/te/llm.hpp
+++ b/src/model/te/llm.hpp
@ -2084,7 +2084,7 @@ namespace LLM {
            if (!model_manager->register_runner_params("LLM test",
                                                       *llm,
                                                       "text_encoders.llm",
-                                                       ModelManager::ResidencyMode::Resident,
+                                                       ModelManager::ResidencyMode::ParamBackend,
                                                       backend,
                                                       backend) ||
                !model_manager->validate_registered_tensors()) {
--- a/src/model/te/t5.hpp
+++ b/src/model/te/t5.hpp
@ -592,7 +592,7 @@ struct T5Embedder {
        if (!model_manager->register_runner_params("T5 test",
                                                   *t5,
                                                   "",
-                                                   ModelManager::ResidencyMode::Resident,
+                                                   ModelManager::ResidencyMode::ParamBackend,
                                                   backend,
                                                   backend) ||
            !model_manager->validate_registered_tensors()) {
--- a/src/model/vae/ltx_audio_vae.hpp
+++ b/src/model/vae/ltx_audio_vae.hpp
@ -1082,7 +1082,7 @@ namespace LTXV {

            if (!model_manager->register_runner_params("LTX audio VAE test",
                                                       *ltx_audio_vae,
-                                                       ModelManager::ResidencyMode::Resident,
+                                                       ModelManager::ResidencyMode::ParamBackend,
                                                       backend,
                                                       backend) ||
                !model_manager->validate_registered_tensors()) {
--- a/src/model/vae/ltx_vae.hpp
+++ b/src/model/vae/ltx_vae.hpp
@ -1426,7 +1426,7 @@ struct LTXVideoVAE : public VAE {
                               const sd::Tensor<float>& z,
                               bool decode_graph) override {
        if (!decode_graph && decode_only) {
-            LOG_ERROR("LTX video VAE encode requires encoder weights; create the context with vae_decode_only=false");
+            LOG_ERROR("LTX video VAE encode requires encoder weights");
            return {};
        }
        sd::Tensor<float> input = z;
@ -1538,7 +1538,7 @@ struct LTXVideoVAE : public VAE {

        if (!model_manager->register_runner_params("LTX VAE test",
                                                   *vae,
-                                                   ModelManager::ResidencyMode::Resident,
+                                                   ModelManager::ResidencyMode::ParamBackend,
                                                   backend,
                                                   backend) ||
            !model_manager->validate_registered_tensors()) {
--- a/src/model/vae/wan_vae.hpp
+++ b/src/model/vae/wan_vae.hpp
@ -1340,7 +1340,7 @@ namespace WAN {

                if (!model_manager->register_runner_params("Wan VAE test",
                                                           *vae,
-                                                           ModelManager::ResidencyMode::Resident,
+                                                           ModelManager::ResidencyMode::ParamBackend,
                                                           backend,
                                                           backend) ||
                    !model_manager->validate_registered_tensors()) {
--- a/src/model_loader.cpp
+++ b/src/model_loader.cpp
@ -1002,6 +1002,7 @@ bool ModelLoader::load_tensors(on_new_tensor_cb_t on_new_tensor_cb,
        std::atomic<size_t> tensor_idx(0);
        std::atomic<bool> failed(false);
        std::vector<std::thread> workers;
+        std::mutex rpc_backend_mutex;

        for (int i = 0; i < n_threads; ++i) {
            workers.emplace_back([&, file_path, is_zip]() {
@ -1158,7 +1159,19 @@ bool ModelLoader::load_tensors(on_new_tensor_cb_t on_new_tensor_cb,

                    if (dst_tensor->buffer != nullptr && !ggml_backend_buffer_is_host(dst_tensor->buffer)) {
                        t0 = ggml_time_ms();
-                        ggml_backend_tensor_set(dst_tensor, convert_buf, 0, ggml_nbytes(dst_tensor));
+
+                        // RPC backends require serialized access to prevent concurrency issues
+                        const char* buffer_type_name = ggml_backend_buft_name(ggml_backend_buffer_get_type(dst_tensor->buffer));
+                        bool is_rpc_buffer           = buffer_type_name != nullptr &&
+                                             std::string(buffer_type_name).find("RPC") != std::string::npos;
+
+                        if (is_rpc_buffer) {
+                            std::lock_guard<std::mutex> lock(rpc_backend_mutex);
+                            ggml_backend_tensor_set(dst_tensor, convert_buf, 0, ggml_nbytes(dst_tensor));
+                        } else {
+                            ggml_backend_tensor_set(dst_tensor, convert_buf, 0, ggml_nbytes(dst_tensor));
+                        }
+
                        t1 = ggml_time_ms();
                        copy_to_backend_time_ms.fetch_add(t1 - t0);
                    }
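
Review note: the hunk above repeats the `ggml_backend_tensor_set` call in both branches so that only RPC buffers take the mutex. One possible way to express the same behavior with a single call site is a deferred `std::unique_lock`, sketched below. The helper names are hypothetical, and matching on the substring `"RPC"` simply mirrors the check in the diff.

```cpp
#include <cassert>
#include <mutex>
#include <string>

// Mirrors the diff's check: a buffer type counts as RPC if its name contains "RPC".
static bool is_rpc_buffer_name_sketch(const char* buffer_type_name) {
    return buffer_type_name != nullptr &&
           std::string(buffer_type_name).find("RPC") != std::string::npos;
}

// Runs `write` exactly once, holding `mutex` for its duration only when
// `serialize` is true. The lock is released when `lock` goes out of scope.
template <typename Fn>
static void write_maybe_serialized_sketch(std::mutex& mutex, bool serialize, Fn&& write) {
    std::unique_lock<std::mutex> lock(mutex, std::defer_lock);
    if (serialize) {
        lock.lock();
    }
    write();
}
```

With such a helper, the worker would pass `is_rpc_buffer` as `serialize` and wrap the tensor upload in the callable, so non-RPC uploads stay lock-free.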
--- a/src/model_manager.cpp
+++ b/src/model_manager.cpp
@ -492,7 +492,7 @@ bool ModelManager::mmap_params(const std::vector<TensorState*>& states,
 }

 bool ModelManager::can_mmap_storage(const TensorState& state) const {
-    if (!enable_mmap_ || state.residency_mode != ResidencyMode::Resident) {
+    if (!enable_mmap_ || state.residency_mode != ResidencyMode::ParamBackend) {
        return false;
    }
    if (state.compute_backend == nullptr || state.params_backend == nullptr) {
--- a/src/model_manager.h
+++ b/src/model_manager.h
@ -16,7 +16,7 @@ class ModelManager : public RunnerWeightManager {
 public:
    enum class ResidencyMode {
        Disk,
-        Resident,
+        ParamBackend,
    };

    struct LoraSpec {
@ -33,7 +33,7 @@ private:
        ggml_tensor* tensor = nullptr;
        std::string desc;

-        ResidencyMode residency_mode   = ResidencyMode::Resident;
+        ResidencyMode residency_mode   = ResidencyMode::ParamBackend;
        ggml_backend_t compute_backend = nullptr;
        ggml_backend_t params_backend  = nullptr;
        bool metadata_validated        = false;
--- a/src/stable-diffusion.cpp
+++ b/src/stable-diffusion.cpp
@ -163,9 +163,7 @@ public:
    SDBackendManager backend_manager;

    SDVersion version;
-    bool vae_decode_only         = false;
    bool external_vae_is_invalid = false;
-    bool free_params_immediately = false;

    bool circular_x = false;
    bool circular_y = false;
@ -189,7 +187,6 @@ public:

    std::string taesd_path;
    sd_tiling_params_t vae_tiling_params = {false, false, 0, 0, 0.5f, 0, 0, nullptr};
-    bool offload_params_to_cpu           = false;
    bool enable_mmap                     = false;
    float max_vram                       = 0.f;
    bool stream_layers                   = false;
@ -246,20 +243,16 @@ public:
        }
        return model_manager->register_param_tensors(desc,
                                                     std::move(group_tensors),
-                                                     free_params_immediately ? ModelManager::ResidencyMode::Disk : ModelManager::ResidencyMode::Resident,
+                                                     backend_manager.params_backend_is_disk(module) ? ModelManager::ResidencyMode::Disk : ModelManager::ResidencyMode::ParamBackend,
                                                     backend_for(module),
                                                     params_backend_for(module),
                                                     params_mem_size);
    }

-    bool init_backend(const sd_ctx_params_t* sd_ctx_params) {
+    bool init_backend() {
        std::string error;
-        if (!backend_manager.init(sd_ctx_params->backend,
-                                  sd_ctx_params->params_backend,
-                                  offload_params_to_cpu,
-                                  sd_ctx_params->keep_clip_on_cpu,
-                                  sd_ctx_params->keep_vae_on_cpu,
-                                  sd_ctx_params->keep_control_net_on_cpu,
+        if (!backend_manager.init(backend_spec.c_str(),
+                                  params_backend_spec.c_str(),
                                  &error)) {
            LOG_ERROR("backend config failed: %s", error.c_str());
            return false;
@ -319,24 +312,20 @@ public:
    }

    bool init(const sd_ctx_params_t* sd_ctx_params) {
-        n_threads               = sd_ctx_params->n_threads;
-        vae_decode_only         = sd_ctx_params->vae_decode_only;
-        free_params_immediately = sd_ctx_params->free_params_immediately;
-        offload_params_to_cpu   = sd_ctx_params->offload_params_to_cpu;
-        enable_mmap             = sd_ctx_params->enable_mmap;
-        max_vram                = sd_ctx_params->max_vram;
-        stream_layers           = sd_ctx_params->stream_layers;
-        backend_spec            = SAFE_STR(sd_ctx_params->backend);
-        params_backend_spec     = SAFE_STR(sd_ctx_params->params_backend);
+        n_threads           = sd_ctx_params->n_threads;
+        enable_mmap         = sd_ctx_params->enable_mmap;
+        max_vram            = sd_ctx_params->max_vram;
+        stream_layers       = sd_ctx_params->stream_layers;
+        backend_spec        = SAFE_STR(sd_ctx_params->backend);
+        params_backend_spec = SAFE_STR(sd_ctx_params->params_backend);
+
+        std::string rpc_servers_spec = SAFE_STR(sd_ctx_params->rpc_servers);
+        add_rpc_devices(rpc_servers_spec);
+
        if (stream_layers && max_vram == 0.f) {
            LOG_WARN("--stream-layers has no effect without --max-vram set; ignoring");
            stream_layers = false;
        }
-        if (stream_layers && !offload_params_to_cpu && params_backend_spec.empty()) {
-            // Streaming needs CPU-resident params.
-            LOG_WARN("--stream-layers has no effect without --offload-to-cpu (or --params-backend); ignoring");
-            stream_layers = false;
-        }

        bool use_tae         = false;
        bool use_audio_vae   = false;
@ -351,9 +340,13 @@ public:

        ggml_log_set(ggml_log_callback_default, nullptr);

-        if (!init_backend(sd_ctx_params)) {
+        if (!init_backend()) {
            return false;
        }
+        if (stream_layers && !backend_manager.params_backend_is_cpu(SDBackendModule::DIFFUSION)) {
+            LOG_WARN("--stream-layers has no effect unless diffusion params backend is cpu; ignoring");
+            stream_layers = false;
+        }
        max_vram = sd::ggml_graph_cut::resolve_max_vram_gib(max_vram, backend_for(SDBackendModule::DIFFUSION));

        model_manager = std::make_shared<ModelManager>();
@ -537,8 +530,8 @@ public:
                }
            }
            // Avoid full-model LoRA merge buffers on constrained setups.
-            const bool streaming_constrained = stream_layers ||
-                                               sd_ctx_params->offload_params_to_cpu;
+            const bool params_offloaded      = params_backend_for(SDBackendModule::DIFFUSION) != backend_for(SDBackendModule::DIFFUSION);
+            const bool streaming_constrained = stream_layers || params_offloaded;
            if (have_quantized_weight || streaming_constrained) {
                apply_lora_immediately = false;
            } else {
@ -561,10 +554,6 @@ public:
        size_t control_net_params_mem_size  = 0;
        size_t extension_params_mem_size    = 0;

-        if (sd_version_is_control(version)) {
-            // Might need vae encode for control cond
-            vae_decode_only = false;
-        }
        bool tae_preview_only = sd_ctx_params->tae_preview_only;
        if (version == VERSION_SDXS_512_DS || version == VERSION_SDXS_09) {
            tae_preview_only = false;
@ -592,7 +581,6 @@ public:
                                                                "model.diffusion_model",
                                                                model_manager);
            } else if (sd_version_is_pid(version)) {
-                vae_decode_only  = false;
                cond_stage_model = std::make_shared<LLMEmbedder>(backend_for(SDBackendModule::TE),
                                                                 tensor_storage_map,
                                                                 version,
@ -707,15 +695,11 @@ public:
                    }
                }
            } else if (sd_version_is_qwen_image(version)) {
-                bool enable_vision = false;
-                if (!vae_decode_only) {
-                    enable_vision = true;
-                }
                cond_stage_model = std::make_shared<LLMEmbedder>(backend_for(SDBackendModule::TE),
                                                                 tensor_storage_map,
                                                                 version,
                                                                 "",
-                                                                 enable_vision,
+                                                                 true,
                                                                 model_manager);
                diffusion_model  = std::make_shared<Qwen::QwenImageRunner>(backend_for(SDBackendModule::DIFFUSION),
                                                                          tensor_storage_map,
@ -724,15 +708,11 @@ public:
                                                                          sd_ctx_params->qwen_image_zero_cond_t,
                                                                          model_manager);
            } else if (sd_version_is_longcat(version)) {
-                bool enable_vision = false;
-                if (!vae_decode_only) {
-                    enable_vision = true;
-                }
                cond_stage_model = std::make_shared<LLMEmbedder>(backend_for(SDBackendModule::TE),
                                                                 tensor_storage_map,
                                                                 version,
                                                                 "",
-                                                                 enable_vision,
+                                                                 true,
                                                                 model_manager);
                diffusion_model  = std::make_shared<Flux::FluxRunner>(backend_for(SDBackendModule::DIFFUSION),
                                                                     tensor_storage_map,
@ -828,10 +808,6 @@ public:
                return false;
            }

-            if (sd_version_is_unet_edit(version)) {
-                vae_decode_only = false;
-            }
-
            if (high_noise_diffusion_model) {
                high_noise_diffusion_model->set_max_graph_vram_bytes(max_graph_vram_bytes);
                high_noise_diffusion_model->set_stream_layers_enabled(stream_layers);
@ -847,7 +823,7 @@ public:
                return false;
            }

-            auto create_tae = [&]() -> std::shared_ptr<VAE> {
+            auto create_tae = [&](bool decode_only) -> std::shared_ptr<VAE> {
                if (sd_version_is_wan(version) ||
                    sd_version_is_qwen_image(version) ||
                    sd_version_is_anima(version) ||
@ -855,7 +831,7 @@ public:
                    return std::make_shared<TinyVideoAutoEncoder>(backend_for(SDBackendModule::VAE),
                                                                  tensor_storage_map,
                                                                  "decoder",
-                                                                  vae_decode_only,
+                                                                  decode_only,
                                                                  version,
                                                                  model_manager);

@ -863,7 +839,7 @@ public:
                    auto model = std::make_shared<TinyImageAutoEncoder>(backend_for(SDBackendModule::VAE),
                                                                        tensor_storage_map,
                                                                        "decoder.layers",
-                                                                        vae_decode_only,
+                                                                        decode_only,
                                                                        version,
                                                                        model_manager);
                    return model;
@ -885,7 +861,7 @@ public:
                    return std::make_shared<LTXVideoVAE>(backend_for(SDBackendModule::VAE),
                                                         tensor_storage_map,
                                                         "first_stage_model",
-                                                         vae_decode_only,
+                                                         false,
                                                         version,
                                                         model_manager);
                } else if (sd_version_is_wan(version) ||
@ -894,14 +870,14 @@ public:
                    return std::make_shared<WAN::WanVAERunner>(backend_for(SDBackendModule::VAE),
                                                               tensor_storage_map,
                                                               "first_stage_model",
-                                                               vae_decode_only,
+                                                               false,
                                                               version,
                                                               model_manager);
                } else {
                    auto model = std::make_shared<AutoEncoderKL>(backend_for(SDBackendModule::VAE),
                                                                 tensor_storage_map,
                                                                 "first_stage_model",
-                                                                 vae_decode_only,
+                                                                 false,
                                                                 false,
                                                                 vae_version,
                                                                 model_manager);
@ -931,7 +907,7 @@ public:
                }
            } else if (use_tae && !tae_preview_only) {
                LOG_INFO("using TAE for encoding / decoding");
-                first_stage_model = create_tae();
+                first_stage_model = create_tae(false);
                first_stage_model->set_max_graph_vram_bytes(max_graph_vram_bytes);
                if (!register_runner_params("VAE",
                                            first_stage_model,
@ -951,7 +927,7 @@ public:
                }
                if (use_tae && tae_preview_only) {
                    LOG_INFO("using TAE for preview");
-                    preview_vae = create_tae();
+                    preview_vae = create_tae(true);
                    preview_vae->set_max_graph_vram_bytes(max_graph_vram_bytes);
                    if (!register_runner_params("preview VAE",
                                                preview_vae,
@ -1081,13 +1057,6 @@ public:
        ignore_tensors.insert("model.diffusion_model.__32x32__");
        ignore_tensors.insert("model.diffusion_model.__index_timestep_zero__");

-        if (vae_decode_only) {
-            ignore_tensors.insert("first_stage_model.encoder");
-            ignore_tensors.insert("first_stage_model.conv1");
-            ignore_tensors.insert("first_stage_model.quant");
-            ignore_tensors.insert("tae.encoder");
-            ignore_tensors.insert("text_encoders.llm.visual.");
-        }
        if (audio_vae_model) {
            ignore_tensors.insert("audio_vae.encoder");
        }
@ -2642,31 +2611,25 @@ void sd_hires_params_init(sd_hires_params_t* hires_params) {
 }

 void sd_ctx_params_init(sd_ctx_params_t* sd_ctx_params) {
-    *sd_ctx_params                         = {};
-    sd_ctx_params->vae_decode_only         = true;
-    sd_ctx_params->free_params_immediately = true;
-    sd_ctx_params->n_threads               = sd_get_num_physical_cores();
-    sd_ctx_params->wtype                   = SD_TYPE_COUNT;
-    sd_ctx_params->rng_type                = CUDA_RNG;
-    sd_ctx_params->sampler_rng_type        = RNG_TYPE_COUNT;
-    sd_ctx_params->prediction              = PREDICTION_COUNT;
-    sd_ctx_params->lora_apply_mode         = LORA_APPLY_AUTO;
-    sd_ctx_params->offload_params_to_cpu   = false;
-    sd_ctx_params->max_vram                = 0.f;
-    sd_ctx_params->stream_layers           = false;
-    sd_ctx_params->enable_mmap             = false;
-    sd_ctx_params->keep_clip_on_cpu        = false;
-    sd_ctx_params->keep_control_net_on_cpu = false;
-    sd_ctx_params->keep_vae_on_cpu         = false;
-    sd_ctx_params->diffusion_flash_attn    = false;
-    sd_ctx_params->circular_x              = false;
-    sd_ctx_params->circular_y              = false;
-    sd_ctx_params->chroma_use_dit_mask     = true;
-    sd_ctx_params->chroma_use_t5_mask      = false;
-    sd_ctx_params->chroma_t5_mask_pad      = 1;
-    sd_ctx_params->vae_format              = SD_VAE_FORMAT_AUTO;
-    sd_ctx_params->backend                 = nullptr;
-    sd_ctx_params->params_backend          = nullptr;
+    *sd_ctx_params                      = {};
+    sd_ctx_params->n_threads            = sd_get_num_physical_cores();
+    sd_ctx_params->wtype                = SD_TYPE_COUNT;
+    sd_ctx_params->rng_type             = CUDA_RNG;
+    sd_ctx_params->sampler_rng_type     = RNG_TYPE_COUNT;
+    sd_ctx_params->prediction           = PREDICTION_COUNT;
+    sd_ctx_params->lora_apply_mode      = LORA_APPLY_AUTO;
+    sd_ctx_params->max_vram             = 0.f;
+    sd_ctx_params->stream_layers        = false;
+    sd_ctx_params->enable_mmap          = false;
+    sd_ctx_params->diffusion_flash_attn = false;
+    sd_ctx_params->circular_x           = false;
+    sd_ctx_params->circular_y           = false;
+    sd_ctx_params->chroma_use_dit_mask  = true;
+    sd_ctx_params->chroma_use_t5_mask   = false;
+    sd_ctx_params->chroma_t5_mask_pad   = 1;
+    sd_ctx_params->vae_format           = SD_VAE_FORMAT_AUTO;
+    sd_ctx_params->backend              = nullptr;
+    sd_ctx_params->params_backend       = nullptr;
 }

 char* sd_ctx_params_to_str(const sd_ctx_params_t* sd_ctx_params) {
@ -2693,21 +2656,15 @@ char* sd_ctx_params_to_str(const sd_ctx_params_t* sd_ctx_params) {
             "control_net_path: %s\n"
             "photo_maker_path: %s\n"
             "tensor_type_rules: %s\n"
-             "vae_decode_only: %s\n"
-             "free_params_immediately: %s\n"
             "n_threads: %d\n"
             "wtype: %s\n"
             "rng_type: %s\n"
             "sampler_rng_type: %s\n"
             "prediction: %s\n"
-             "offload_params_to_cpu: %s\n"
             "max_vram: %.3f\n"
             "stream_layers: %s\n"
             "backend: %s\n"
             "params_backend: %s\n"
-             "keep_clip_on_cpu: %s\n"
-             "keep_control_net_on_cpu: %s\n"
-             "keep_vae_on_cpu: %s\n"
             "flash_attn: %s\n"
             "diffusion_flash_attn: %s\n"
             "circular_x: %s\n"
@ -2733,21 +2690,15 @@ char* sd_ctx_params_to_str(const sd_ctx_params_t* sd_ctx_params) {
             SAFE_STR(sd_ctx_params->control_net_path),
             SAFE_STR(sd_ctx_params->photo_maker_path),
             SAFE_STR(sd_ctx_params->tensor_type_rules),
-             BOOL_STR(sd_ctx_params->vae_decode_only),
-             BOOL_STR(sd_ctx_params->free_params_immediately),
             sd_ctx_params->n_threads,
             sd_type_name(sd_ctx_params->wtype),
             sd_rng_type_name(sd_ctx_params->rng_type),
             sd_rng_type_name(sd_ctx_params->sampler_rng_type),
             sd_prediction_name(sd_ctx_params->prediction),
-             BOOL_STR(sd_ctx_params->offload_params_to_cpu),
             sd_ctx_params->max_vram,
             BOOL_STR(sd_ctx_params->stream_layers),
             SAFE_STR(sd_ctx_params->backend),
             SAFE_STR(sd_ctx_params->params_backend),
-             BOOL_STR(sd_ctx_params->keep_clip_on_cpu),
-             BOOL_STR(sd_ctx_params->keep_control_net_on_cpu),
-             BOOL_STR(sd_ctx_params->keep_vae_on_cpu),
             BOOL_STR(sd_ctx_params->flash_attn),
             BOOL_STR(sd_ctx_params->diffusion_flash_attn),
             BOOL_STR(sd_ctx_params->circular_x),
@ -3917,7 +3868,7 @@ static std::optional<ImageGenerationLatents> prepare_image_generation_latents(sd
        }
    }

-    if (!control_image_tensor.empty() && !sd_ctx->sd->vae_decode_only) {
+    if (!control_image_tensor.empty()) {
        control_latent = sd_ctx->sd->encode_first_stage(control_image_tensor);
        if (control_latent.empty()) {
            LOG_ERROR("failed to encode control image");
@ -4259,11 +4210,6 @@ static sd::Tensor<float> upscale_hires_latent(sd_ctx_t* sd_ctx,
    } else if (request.hires.upscaler == SD_HIRES_UPSCALER_MODEL ||
               request.hires.upscaler == SD_HIRES_UPSCALER_LANCZOS ||
               request.hires.upscaler == SD_HIRES_UPSCALER_NEAREST) {
-        if (sd_ctx->sd->vae_decode_only) {
-            LOG_ERROR("hires %s upscaler requires VAE encoder weights; create the context with vae_decode_only=false",
-                      sd_hires_upscaler_name(request.hires.upscaler));
-            return {};
-        }
        if (request.hires.upscaler == SD_HIRES_UPSCALER_MODEL && upscaler == nullptr) {
            LOG_ERROR("hires model upscaler context is null");
            return {};
@ -4474,7 +4420,6 @@ SD_API sd_image_t* generate_image(sd_ctx_t* sd_ctx, const sd_img_gen_params_t* s
            const size_t max_graph_vram_bytes = sd::ggml_graph_cut::max_vram_gib_to_bytes(sd_ctx->sd->max_vram);
            hires_upscaler->set_max_graph_vram_bytes(max_graph_vram_bytes);
            if (!hires_upscaler->load_from_file(request.hires.model_path,
-                                                sd_ctx->sd->offload_params_to_cpu,
                                                sd_ctx->sd->n_threads)) {
                LOG_ERROR("load hires model upscaler failed");
                return nullptr;
@ -4611,11 +4556,6 @@ static std::optional<ImageGenerationLatents> prepare_video_generation_latents(sd
        }

        if (!start_image.empty() || !end_image.empty()) {
-            if (sd_ctx->sd->vae_decode_only) {
-                LOG_ERROR("LTXAV image conditioning requires VAE encoder weights; create the context with vae_decode_only=false");
-                return std::nullopt;
-            }
-
            if (!start_image.empty() && !end_image.empty()) {
                LOG_INFO("FLF2V");
            } else if (!start_image.empty()) {
@ -5037,7 +4977,7 @@ static sd::Tensor<float> upscale_ltx_spatial_video_latent(sd_ctx_t* sd_ctx,
    upsampler->get_param_tensors(tensors);
    if (!upsampler_manager->register_param_tensors("LTX latent upsampler",
                                                   std::move(tensors),
-                                                   ModelManager::ResidencyMode::Resident,
+                                                   ModelManager::ResidencyMode::ParamBackend,
                                                   sd_ctx->sd->backend_for(SDBackendModule::UPSCALER),
                                                   sd_ctx->sd->params_backend_for(SDBackendModule::UPSCALER)) ||
        !upsampler_manager->validate_registered_tensors()) {
@ -5080,11 +5020,6 @@ static bool apply_ltxv_refine_image_conditioning(sd_ctx_t* sd_ctx,
        sd_vid_gen_params->end_image.data == nullptr) {
        return true;
    }
-    if (sd_ctx->sd->vae_decode_only) {
-        LOG_ERROR("LTXV refine image conditioning requires VAE encoder weights; create the context with vae_decode_only=false");
-        return false;
-    }
-
    constexpr float conditioning_strength = 1.f;
    int latent_channels                   = sd_ctx->sd->get_latent_channel();
    sd::Tensor<float> video_latent        = *latent;
--- a/src/upscaler.cpp
+++ b/src/upscaler.cpp
@ -39,17 +39,12 @@ void UpscalerGGML::set_stream_layers_enabled(bool enabled) {
 }

 bool UpscalerGGML::load_from_file(const std::string& esrgan_path,
-                                  bool offload_params_to_cpu,
                                  int n_threads) {
    ggml_log_set(ggml_log_callback_default, nullptr);

    std::string error;
    if (!backend_manager.init(backend_spec.c_str(),
                              params_backend_spec.c_str(),
-                              offload_params_to_cpu,
-                              false,
-                              false,
-                              false,
                              &error)) {
        LOG_ERROR("upscaler backend config failed: %s", error.c_str());
        return false;
@ -106,7 +101,7 @@ bool UpscalerGGML::load_from_file(const std::string& esrgan_path,
    esrgan_upscaler->get_param_tensors(tensors);
    if (!model_manager->register_param_tensors("ESRGAN",
                                               std::move(tensors),
-                                               ModelManager::ResidencyMode::Resident,
+                                               backend_manager.params_backend_is_disk(SDBackendModule::UPSCALER) ? ModelManager::ResidencyMode::Disk : ModelManager::ResidencyMode::ParamBackend,
                                               backend_for(SDBackendModule::UPSCALER),
                                               params_backend_for(SDBackendModule::UPSCALER)) ||
        !model_manager->validate_registered_tensors()) {
@ -178,7 +173,6 @@ struct upscaler_ctx_t {
 };

 upscaler_ctx_t* new_upscaler_ctx(const char* esrgan_path_c_str,
-                                 bool offload_params_to_cpu,
                                 bool direct,
                                 int n_threads,
                                 int tile_size,
@ -195,7 +189,7 @@ upscaler_ctx_t* new_upscaler_ctx(const char* esrgan_path_c_str,
        return nullptr;
    }

-    if (!upscaler_ctx->upscaler->load_from_file(esrgan_path, offload_params_to_cpu, n_threads)) {
+    if (!upscaler_ctx->upscaler->load_from_file(esrgan_path, n_threads)) {
        delete upscaler_ctx->upscaler;
        upscaler_ctx->upscaler = nullptr;
        free(upscaler_ctx);
--- a/src/upscaler.h
+++ b/src/upscaler.h
@ -32,7 +32,6 @@ struct UpscalerGGML {
    ~UpscalerGGML();

    bool load_from_file(const std::string& esrgan_path,
-                        bool offload_params_to_cpu,
                        int n_threads);
    void set_max_graph_vram_bytes(size_t max_vram_bytes);
    void set_stream_layers_enabled(bool enabled);
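
Review note: the hunks above replace the per-module CPU flags (`offload_params_to_cpu`, `keep_clip_on_cpu`, `keep_vae_on_cpu`, `keep_control_net_on_cpu`) with decisions derived from the backend specs. The sketch below is a simplified, hypothetical model of three of those decisions: the `Disk` versus `ParamBackend` residency choice, the post-init `--stream-layers` check, and the `params_offloaded` test used for LoRA. `PlacementSketch`, the string-valued specs, and every helper name are assumptions made for illustration. The real code queries `backend_manager` and compares backend handles, and the fallback of an empty params spec to the compute backend is also assumed here.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for ModelManager::ResidencyMode; only the two values
// that appear in the diff are modelled.
enum class ResidencyMode { ParamBackend, Disk };

// Hypothetical placement record. The project resolves specs per module via
// backend_manager; plain strings are an assumption made for this sketch.
struct PlacementSketch {
    std::string backend;         // where graphs execute, e.g. "cuda0"
    std::string params_backend;  // where weights rest, e.g. "cpu", "disk", or ""
};

// Assumption: an empty params spec means the weights stay on the compute backend.
inline std::string effective_params_backend(const PlacementSketch& p) {
    return p.params_backend.empty() ? p.backend : p.params_backend;
}

// Models the ternary passed to register_param_tensors:
//   params_backend_is_disk(module) ? Disk : ParamBackend
inline ResidencyMode residency_for(const PlacementSketch& p) {
    return effective_params_backend(p) == "disk" ? ResidencyMode::Disk
                                                 : ResidencyMode::ParamBackend;
}

// Models the two --stream-layers checks in init(): the flag is ignored when
// max_vram is unset, or when the diffusion params backend is not the CPU.
inline bool stream_layers_effective(bool requested, float max_vram,
                                    const PlacementSketch& diffusion) {
    if (!requested || max_vram == 0.f) {
        return false;
    }
    return effective_params_backend(diffusion) == "cpu";
}

// Models `params_offloaded`: weights rest somewhere other than where they run.
inline bool params_offloaded(const PlacementSketch& p) {
    return effective_params_backend(p) != p.backend;
}
```

Under these assumptions, a placement of `{"cuda0", "cpu"}` keeps streaming enabled when `max_vram` is non-zero and counts as offloaded, so LoRA would not be applied immediately. A placement of `{"cuda0", ""}` is neither streamed nor offloaded.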
Author	SHA1	Message	Date
leejet	6f00939f75	docs: refresh README guide links	2026-06-14 17:58:58 +08:00
stduhpf	c2df4e1228	feat: add RPC support (#1629)	2026-06-14 17:30:23 +08:00
leejet	9838264c49	refactor: simplify ControlNet output caching (#1655)	2026-06-14 16:58:37 +08:00
leejet	17d70b91e6	docs: replace example option lists with help commands	2026-06-14 16:55:15 +08:00
leejet	5db680c2c7	refactor: route cpu placement through backend specs (#1654)	2026-06-14 15:52:24 +08:00
leejet	749186c0eb	refactor: remove vae_decode_only context flag (#1653)	2026-06-14 15:23:29 +08:00
leejet	bdb431ad95	feat: support disk params backend (#1651)	2026-06-14 14:48:50 +08:00