Compare commits

...

7 Commits

35 changed files with 570 additions and 725 deletions

View File

@ -204,6 +204,12 @@ if(SD_WEBM)
endif()
endif()
if (SD_RPC)
message("-- Use RPC as backend stable-diffusion")
set(GGML_RPC ON)
add_definitions(-DSD_USE_RPC)
endif ()
set(SD_LIB stable-diffusion)
file(GLOB SD_LIB_SOURCES CONFIGURE_DEPENDS

View File

@ -34,8 +34,8 @@ API and command-line option may change frequently.***
- Super lightweight and without external dependencies
- Supported models
- Image Models
- SD1.x, SD2.x, [SD-Turbo](https://huggingface.co/stabilityai/sd-turbo)
- SDXL, [SDXL-Turbo](https://huggingface.co/stabilityai/sdxl-turbo)
- [SD1.x, SD2.x, SD-Turbo](./docs/sd.md)
- [SDXL, SDXL-Turbo](./docs/sd.md)
- [Some SD1.x and SDXL distilled models](./docs/distilled_sd.md)
- [SD3/SD3.5](./docs/sd3.md)
- [FLUX.1-dev/FLUX.1-schnell](./docs/flux.md)
@ -59,12 +59,12 @@ API and command-line option may change frequently.***
- Video Models
- [Wan2.1/Wan2.2](./docs/wan.md)
- [LTX-2.3](./docs/ltx2.md)
- [PhotoMaker](https://github.com/TencentARC/PhotoMaker) support.
- [PhotoMaker](./docs/photo_maker.md) support.
- Control Net support with SD 1.5
- LoRA support, same as [stable-diffusion-webui](https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Features#lora)
- Latent Consistency Models support (LCM/LCM-LoRA)
- Faster and memory efficient latent decoding with [TAESD](https://github.com/madebyollin/taesd)
- Upscale images generated with [ESRGAN](https://github.com/xinntao/Real-ESRGAN)
- Faster and memory efficient latent decoding with [TAESD](./docs/taesd.md)
- Upscale images generated with [ESRGAN](./docs/esrgan.md)
- Supported backends
- CPU (AVX, AVX2 and AVX512 support for x86 architectures)
- CUDA
@ -133,28 +133,9 @@ For runtime and parameter backend placement, see the [backend selection guide](.
## More Guides
- [Backend selection](./docs/backend.md)
- [SD1.x/SD2.x/SDXL](./docs/sd.md)
- [SD3/SD3.5](./docs/sd3.md)
- [FLUX.1-dev/FLUX.1-schnell](./docs/flux.md)
- [FLUX.2-dev/FLUX.2-klein](./docs/flux2.md)
- [FLUX.1-Kontext-dev](./docs/kontext.md)
- [Chroma](./docs/chroma.md)
- [🔥Qwen Image](./docs/qwen_image.md)
- [🔥Qwen Image Edit series](./docs/qwen_image_edit.md)
- [🔥Wan2.1/Wan2.2](./docs/wan.md)
- [🔥LTX-2.3](./docs/ltx2.md)
- [🔥Z-Image](./docs/z_image.md)
- [Ovis-Image](./docs/ovis_image.md)
- [Anima](./docs/anima.md)
- [ERNIE-Image](./docs/ernie_image.md)
- [HiDream-O1-Image](./docs/hidream_o1_image.md)
- [Lens](./docs/lens.md)
- [LongCat Image / LongCat Image Edit](./docs/longcat_image.md)
- [RPC](./docs/rpc.md)
- [LoRA](./docs/lora.md)
- [LCM/LCM-LoRA](./docs/lcm.md)
- [Using PhotoMaker to personalize image generation](./docs/photo_maker.md)
- [Using ESRGAN to upscale results](./docs/esrgan.md)
- [Using TAESD to faster decoding](./docs/taesd.md)
- [Docker](./docs/docker.md)
- [Quantization and GGUF](./docs/quantization_and_gguf.md)
- [Inference acceleration via caching](./docs/caching.md)

View File

@ -3,7 +3,7 @@
`stable-diffusion.cpp` has two backend assignments:
- `--backend` selects the runtime backend used to execute model graphs.
- `--params-backend` selects the backend used to allocate model parameters.
- `--params-backend` selects where model parameters are kept.
If `--params-backend` is not set, parameters use the same backend as their module runtime backend.
@ -29,6 +29,12 @@ The same syntax is used for parameter placement:
sd-cli -m model.safetensors -p "a cat" --backend cuda0 --params-backend te=cpu,vae=cpu
```
`--params-backend` also accepts the special value `disk`:
```shell
sd-cli -m model.safetensors -p "a cat" --backend cuda0 --params-backend disk
```
Module names are case-insensitive. Hyphens and underscores in module names are ignored, so `clip_vision`, `clip-vision`, and `clipvision` are equivalent.
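The normalization rule can be sketched in C++ as follows. `normalize_module_name` is a hypothetical helper written for illustration, not the library's actual function; it only assumes the rule stated above (case-insensitive, hyphens and underscores ignored).

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: lower-case the module name and drop '-' and '_',
// so that equivalent spellings compare equal.
std::string normalize_module_name(const std::string& name) {
    std::string out;
    out.reserve(name.size());
    for (unsigned char c : name) {
        if (c == '-' || c == '_') {
            continue;  // separators are ignored
        }
        out.push_back(static_cast<char>(std::tolower(c)));
    }
    return out;
}
```

With this sketch, `clip_vision`, `clip-vision`, and `CLIPVision` all normalize to `clipvision`.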
`all=`, `default=`, and `*=` can be used to set the default backend inside a mixed assignment:
@ -64,9 +70,11 @@ The special values `auto`, `default`, and an empty backend name select the defau
The special value `gpu` selects the first GPU backend, falling back to the first integrated GPU backend.
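A minimal sketch of that fallback order, assuming a simple device list. `DeviceKind`, `Device`, and `pick_gpu` are hypothetical names used only for illustration; the real device enumeration is not shown in this document.

```cpp
#include <string>
#include <vector>

// Hypothetical device description for illustration only.
enum class DeviceKind { Cpu, Gpu, IntegratedGpu };

struct Device {
    std::string name;
    DeviceKind kind;
};

// Return the first discrete GPU; if there is none, the first integrated GPU;
// if there is neither, an empty string.
std::string pick_gpu(const std::vector<Device>& devices) {
    for (const auto& d : devices) {
        if (d.kind == DeviceKind::Gpu) {
            return d.name;
        }
    }
    for (const auto& d : devices) {
        if (d.kind == DeviceKind::IntegratedGpu) {
            return d.name;
        }
    }
    return "";
}
```

The two passes keep the preference explicit: a discrete GPU later in the list still wins over an integrated GPU listed earlier.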
The special value `disk` is accepted only by `--params-backend`. `--backend disk` is invalid because `disk` is a parameter residency mode, not a runtime compute backend.
## Runtime backend vs. parameter backend
The runtime backend controls where graph execution runs. The parameter backend controls where model weights are allocated.
The runtime backend controls where graph execution runs. The parameter backend controls where model weights are allocated or whether they are reloaded from disk on demand.
For example:
@ -76,6 +84,16 @@ sd-cli -m model.safetensors -p "a cat" --backend cuda0 --params-backend cpu
This runs all modules on `cuda0`, but stores parameters in CPU RAM. During execution, parameters are moved to the runtime backend as needed.
For example:
```shell
sd-cli -m model.safetensors -p "a cat" --backend cuda0 --params-backend disk
```
This runs all modules on `cuda0`, reloads parameters from the model file as needed, and releases those parameter buffers after use.
`disk` is never selected implicitly. If `--params-backend` is not set, parameters use the runtime backend.
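The two rules above (no implicit `disk`, and `disk` rejected as a runtime backend) can be sketched as a small resolver. `resolve_params_backend` is a hypothetical function, and the use of an exception for the invalid case is an assumption made for this example only.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical resolver: decide where parameters live for one module.
// - `disk` is a parameter residency mode, so it is rejected as a runtime backend.
// - An unset parameter backend falls back to the runtime backend, never to disk.
std::string resolve_params_backend(const std::string& runtime_backend,
                                   const std::string& params_backend) {
    if (runtime_backend == "disk") {
        throw std::invalid_argument("disk is not a runtime compute backend");
    }
    if (params_backend.empty()) {
        return runtime_backend;
    }
    return params_backend;
}
```

For example, `resolve_params_backend("cuda0", "")` yields `cuda0`, while `resolve_params_backend("cuda0", "disk")` yields `disk` only because it was requested explicitly.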
Per-module assignments can be mixed:
```shell
@ -100,23 +118,27 @@ uses one shared CPU backend for both `te` and `vae` runtime execution.
Runtime and parameter assignments also share the same backend cache. If `--backend diffusion=cuda0` and `--params-backend diffusion=cuda0` resolve to the same device, both use the same backend instance.
`--params-backend disk` does not create a separate backend instance. Parameters are loaded lazily using the module runtime backend.
`SDBackendManager` owns the backend instances and frees them when the context or upscaler is destroyed. Model runners receive non-owning runtime and parameter backend pointers and do not free them.
## Compatibility flags
The older CPU placement flags are still supported:
The example CLI/server still accepts these older CPU placement flags as compatibility aliases:
- `--clip-on-cpu`
- `--vae-on-cpu`
- `--control-net-cpu`
- `--offload-to-cpu`
`--clip-on-cpu`, `--vae-on-cpu`, and `--control-net-cpu` affect runtime backend assignment only when `--backend` is not set. They map to `te=cpu`, `vae=cpu`, and `controlnet=cpu`.
`--clip-on-cpu`, `--vae-on-cpu`, and `--control-net-cpu` are deprecated. The example argument layer prepends `te=cpu`, `vae=cpu`, and `controlnet=cpu` to `--backend` before creating the context.
`--offload-to-cpu` affects parameter backend assignment only when `--params-backend` is not set. It is equivalent to:
`--offload-to-cpu` prepends a CPU default to the parameter assignment in the caller before creating the context:
```shell
--params-backend cpu
--params-backend '*=cpu'
```
Explicit `--backend` and `--params-backend` assignments are preferred for new commands.
Because this default is inserted first, later explicit `--params-backend` entries can still override it, for example `--offload-to-cpu --params-backend te=disk` keeps non-TE parameters on CPU and reloads TE parameters from disk.
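The ordering rule can be sketched in C++. `prepend_backend_assignment` mirrors the helper in the example argument layer; `Placement` and `parse_assignments` are hypothetical, and the precedence used here (a module entry wins over the default, and a later entry for the same key replaces an earlier one) is an assumption for illustration, not the library's actual resolver.

```cpp
#include <map>
#include <sstream>
#include <string>

// Same logic as the helper in the example argument layer.
void prepend_backend_assignment(std::string& spec, const char* assignment) {
    spec = spec.empty() ? assignment : std::string(assignment) + "," + spec;
}

// Hypothetical resolved view of an assignment string.
struct Placement {
    std::string fallback;                           // default backend
    std::map<std::string, std::string> per_module;  // module -> backend

    std::string lookup(const std::string& module) const {
        auto it = per_module.find(module);
        return it != per_module.end() ? it->second : fallback;
    }
};

// Hypothetical parser: entries are processed left to right.
Placement parse_assignments(const std::string& spec) {
    Placement p;
    std::stringstream ss(spec);
    std::string entry;
    while (std::getline(ss, entry, ',')) {
        if (entry.empty()) {
            continue;
        }
        const auto eq = entry.find('=');
        if (eq == std::string::npos) {
            p.fallback = entry;  // bare value, e.g. "cpu"
            continue;
        }
        const std::string key = entry.substr(0, eq);
        const std::string value = entry.substr(eq + 1);
        if (key == "*" || key == "all" || key == "default") {
            p.fallback = value;
        } else {
            p.per_module[key] = value;
        }
    }
    return p;
}
```

Starting from `te=disk`, prepending `*=cpu` gives `*=cpu,te=disk`; in this sketch `lookup("te")` then returns `disk` and `lookup("vae")` returns `cpu`.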
Library callers should set `backend` and `params_backend` directly. The old CPU/offload fields are no longer part of the C API. Explicit `--backend` and `--params-backend` assignments are preferred for new commands.

View File

@ -21,6 +21,38 @@ and the compute buffer shrink in the debug log:
Using `--offload-to-cpu` allows you to offload weights to the CPU, saving VRAM without reducing generation speed.
## Use params backend to reduce VRAM or RAM usage.
`--params-backend` controls where model parameters are kept. If it is not set, parameters use the same backend as `--backend`, so a GPU runtime backend also keeps parameters in VRAM.
Use CPU params to reduce VRAM usage:
```shell
--backend cuda0 --params-backend cpu
```
This keeps model weights in system RAM and moves them to the runtime backend when needed. In the example CLI/server, `--offload-to-cpu` is a compatibility shortcut that prepends `*=cpu` to `--params-backend` before creating the context, so explicit module assignments can still override it:
```shell
--offload-to-cpu --params-backend te=disk
```
Use disk params to reduce both VRAM and RAM usage:
```shell
--backend cuda0 --params-backend disk
```
This reloads parameters from the model file on demand and releases them after use. It has the lowest memory residency, but can be slower because weights must be read again. `disk` is never selected implicitly; set it explicitly when RAM usage matters more than reload cost.
Per-module assignments can target only the largest modules:
```shell
--backend cuda0 --params-backend diffusion=disk,te=cpu,vae=cpu
```
See [backend selection](./backend.md) for full syntax.
## Use quantization to reduce memory usage.
[quantization](./quantization_and_gguf.md)

docs/rpc.md Normal file
View File

@ -0,0 +1,220 @@
# Building and Using the RPC Server with `stable-diffusion.cpp`
This guide covers how to build a version of [the RPC server from `llama.cpp`](https://github.com/ggml-org/llama.cpp/blob/master/tools/rpc/README.md) that is compatible with your version of `stable-diffusion.cpp`, so you can manage multi-backend setups. RPC allows you to offload specific model components to a remote server.
> **Note on Model Location:** The model files (e.g., `.safetensors` or `.gguf`) remain on the **Client** machine. The client parses the file and transmits the necessary tensor data and computational graphs to the server. The server does not need to store the model files locally.
## 1. Building `stable-diffusion.cpp` with RPC client
First, build the client application from source. It requires `SD_RPC=ON` to include the RPC backend in your client.
```bash
mkdir build
cd build
cmake .. \
-DSD_RPC=ON \
# Add other build flags here (e.g., -DSD_VULKAN=ON)
cmake --build . --config Release -j $(nproc)
```
> **Note:** Add the other flags you would normally use (e.g., `-DSD_VULKAN=ON`, `-DSD_CUDA=ON`, `-DSD_HIPBLAS=ON`, or `-DGGML_METAL=ON`). For more information about building `stable-diffusion.cpp` from source, refer to the [build.md](build.md) documentation.
## 2. Ensure `llama.cpp` is at the correct commit
`stable-diffusion.cpp`'s RPC client is designed to work with a specific version of `llama.cpp` (compatible with the `ggml` submodule) to ensure API compatibility. The commit hash for `llama.cpp` is stored in `ggml/scripts/sync-llama.last`.
> **Start from Root:** Perform these steps from the root of your `stable-diffusion.cpp` directory.
1. Read the target commit hash from the submodule tracker:
```bash
# Linux / WSL / MacOS
HASH=$(cat ggml/scripts/sync-llama.last)
# Windows (PowerShell)
$HASH = Get-Content -Path "ggml\scripts\sync-llama.last"
```
2. Clone `llama.cpp` at the target commit.
```bash
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
git checkout $HASH
```
To save on download time and storage, you can use a shallow clone to download only the target commit:
```bash
mkdir -p llama.cpp
cd llama.cpp
git init
git remote add origin https://github.com/ggml-org/llama.cpp.git
git fetch --depth 1 origin $HASH
git checkout FETCH_HEAD
```
## 3. Build `llama.cpp` (RPC Server)
The RPC server acts as the worker. You must explicitly enable the **backend** (the hardware interface, such as CUDA for Nvidia, Metal for Apple Silicon, or Vulkan) when building, otherwise the server will default to using only the CPU.
To find the correct flags for your system, refer to the official documentation for the [`llama.cpp`](https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md) repository.
> **Crucial:** You must include the compiler flag required for API compatibility with `stable-diffusion.cpp` (`-DGGML_MAX_NAME=128`). Without this flag, `GGML_MAX_NAME` defaults to `64` on the server, and data transfers between the client and server will fail. `-DGGML_RPC=ON` must also be set.
>
> I recommend disabling the `LLAMA_CURL` flag to avoid unnecessary dependencies, and disabling shared library builds to avoid potential conflicts.
> **Build Target:** We are specifically building the `rpc-server` target. This prevents the build system from compiling the entire `llama.cpp` suite (like `llama-server`), making the build significantly faster.
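To see why a mismatched `GGML_MAX_NAME` can break transfers, consider a record with a fixed-size name field. The sketch below is a simplified illustration under that assumption; `WireTensor` is a hypothetical type, not the actual RPC tensor layout, which carries more fields.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical wire record: an id followed by a fixed-size name buffer.
// The buffer length plays the role of GGML_MAX_NAME.
template <std::size_t MaxName>
struct WireTensor {
    std::uint64_t id;
    char name[MaxName];
};

// A client built with 128 and a server built with 64 disagree on the record
// size, so every field after `name` would be read at the wrong offset.
static_assert(sizeof(WireTensor<64>) != sizeof(WireTensor<128>),
              "different name lengths give different record sizes");
```

Building both sides with the same value keeps the record sizes identical, which is what the `-DGGML_MAX_NAME=128` flag in the commands below is for.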
### Linux / WSL (Vulkan)
```bash
mkdir build
cd build
# Ensure the backend flag (here -DGGML_VULKAN=ON) is enabled
cmake .. -DGGML_RPC=ON \
-DGGML_VULKAN=ON \
-DGGML_BUILD_SHARED_LIBS=OFF \
-DLLAMA_CURL=OFF \
-DCMAKE_C_FLAGS=-DGGML_MAX_NAME=128 \
-DCMAKE_CXX_FLAGS=-DGGML_MAX_NAME=128
cmake --build . --config Release --target rpc-server -j $(nproc)
```
### macOS (Metal)
```bash
mkdir build
cd build
cmake .. -DGGML_RPC=ON \
-DGGML_METAL=ON \
-DGGML_BUILD_SHARED_LIBS=OFF \
-DLLAMA_CURL=OFF \
-DCMAKE_C_FLAGS=-DGGML_MAX_NAME=128 \
-DCMAKE_CXX_FLAGS=-DGGML_MAX_NAME=128
cmake --build . --config Release --target rpc-server
```
### Windows (Visual Studio 2022, Vulkan)
```powershell
mkdir build
cd build
cmake .. -G "Visual Studio 17 2022" -A x64 `
-DGGML_RPC=ON `
-DGGML_VULKAN=ON `
-DGGML_BUILD_SHARED_LIBS=OFF `
-DLLAMA_CURL=OFF `
-DCMAKE_C_FLAGS=-DGGML_MAX_NAME=128 `
-DCMAKE_CXX_FLAGS=-DGGML_MAX_NAME=128
cmake --build . --config Release --target rpc-server
```
## 4. Usage
Once both applications are built, you can run the server and the client to manage your GPU allocation.
### Step A: Run the RPC Server
Start the server. It listens for connections on the default address (usually `localhost:50052`). If your server is on a different machine, ensure the server binds to the correct interface and your firewall allows the connection.
**On the Server:**
If running on the same machine, you can use the default address:
```bash
./rpc-server
```
If you want to allow connections from other machines on the network:
```bash
./rpc-server --host 0.0.0.0
```
> **Security Warning:** The RPC server does not currently support authentication or encryption. **Only run the server on trusted local networks**. Never expose the RPC server directly to the open internet.
> **Drivers & Hardware:** Ensure the Server machine has the necessary drivers installed and functional (e.g., Nvidia Drivers for CUDA, Vulkan SDK, or Metal). If no devices are found, the server will simply fall back to the CPU.
<!-- ### Step B: Check if the client is able to connect to the server and see the available devices
We're assuming the server is running on your local machine, and listening on the default port `50052`. If it's running on a different machine, you can replace `localhost` with the IP address of the server.
**On the Client:**
```bash
./sd-cli --rpc-servers localhost:50052 --list-devices
```
If the server is running and the client is able to connect, you should see `RPC0 localhost:50052` in the list of devices.
Example output:
(Client built without GPU acceleration, two GPUs available on the server)
```
List of available GGML devices:
Name Description
-------------------
CPU AMD Ryzen 9 5900X 12-Core Processor
RPC0 localhost:50052
RPC1 localhost:50052
``` -->
### Step B: Run with RPC device
If everything is working correctly, you can now run the client while offloading some or all of the work to the RPC server.
Example: set the main backend to the `RPC0` device so that all the work runs on the server.
```bash
./sd-cli -m models/sd1.5.safetensors -p "A cat" --rpc-servers localhost:50052 --backend RPC0
```
---
## 5. Scaling: Multiple RPC Servers
You can connect the client to multiple RPC servers simultaneously to scale out your hardware usage.
Example: A main machine (192.168.1.10) with 3 GPUs, one running CUDA and the other two running Vulkan, and a second machine (192.168.1.11) with only one GPU.
**On the first machine (Running two server instances):**
**Terminal 1 (CUDA):**
```bash
# Linux / WSL
export CUDA_VISIBLE_DEVICES=0
cd ./build_cuda/bin/Release
./rpc-server --host 0.0.0.0
# Windows PowerShell
$env:CUDA_VISIBLE_DEVICES="0"
cd .\build_cuda\bin\Release
./rpc-server --host 0.0.0.0
```
**Terminal 2 (Vulkan):**
```bash
cd ./build_vulkan/bin/Release
# ignore the first GPU (used by CUDA server)
./rpc-server --host 0.0.0.0 --port 50053 -d Vulkan1,Vulkan2
```
**On the second machine:**
```bash
cd ./build/bin/Release
./rpc-server --host 0.0.0.0
```
**On the Client:**
Pass multiple server addresses separated by commas.
```bash
./sd-cli --rpc-servers 192.168.1.10:50052,192.168.1.10:50053,192.168.1.11:50052 [...]
```
The client will map these servers to sequential device IDs (e.g., RPC0 from the first server, RPC1 and RPC2 from the second, and RPC3 from the third). With this setup, you could for example use RPC0 for the main backend, RPC1 and RPC2 for the text encoders, and RPC3 for the VAE.
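The sequential numbering can be sketched as follows, assuming each server reports how many devices it exposes. `RpcServer` and `assign_rpc_ids` are hypothetical names for illustration; the real client discovers the device count from the server rather than taking it as input.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical description of one RPC server: its address and the number
// of devices it exposes.
struct RpcServer {
    std::string endpoint;
    int device_count;
};

// Assign RPC0, RPC1, ... in the order the servers were listed.
// Returns (device id, endpoint) pairs.
std::vector<std::pair<std::string, std::string>> assign_rpc_ids(
    const std::vector<RpcServer>& servers) {
    std::vector<std::pair<std::string, std::string>> ids;
    int next = 0;
    for (const auto& server : servers) {
        for (int i = 0; i < server.device_count; ++i) {
            ids.emplace_back("RPC" + std::to_string(next++), server.endpoint);
        }
    }
    return ids;
}
```

For the three servers in this example (one, two, and one device respectively), the sketch yields RPC0 on `192.168.1.10:50052`, RPC1 and RPC2 on `192.168.1.10:50053`, and RPC3 on `192.168.1.11:50052`.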
---
## 6. Performance Considerations
RPC performance depends heavily on network bandwidth, because large weights and activations must be transferred back and forth over the network, especially for large models or high resolutions. For best results, ensure your network connection is stable and has sufficient bandwidth (>1Gbps recommended). This should not be a concern if you are running the server and client on the same machine, as the data transfer will happen over the loopback interface.
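As a rough back-of-the-envelope check, the time to move a given amount of data can be estimated from the link speed. `transfer_seconds` is a hypothetical helper; it assumes the link runs at its nominal rate and ignores protocol overhead and latency, so real transfers will be slower.

```cpp
#include <cstdint>

// Idealized transfer time: payload size in bytes over a link rated in
// gigabits per second (1 Gbps = 1e9 bits per second).
double transfer_seconds(std::uint64_t bytes, double link_gbps) {
    const double bits = static_cast<double>(bytes) * 8.0;
    return bits / (link_gbps * 1e9);
}
```

For instance, 2 GiB of weights (2147483648 bytes) over a 1 Gbps link takes roughly 17 seconds under these assumptions, and about a tenth of that over 10 Gbps.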

View File

@ -1,204 +1,9 @@
# Run
# Usage
```
usage: ./bin/sd-cli [options]
For detailed command-line arguments, run:
CLI Options:
-o, --output <string> path to write result image to. you can use printf-style %d format specifiers for image
sequences (default: ./output.png) (eg. output_%03d.png). Single-file video outputs
support .avi, .webm, and animated .webp
--image <string> path to the image to inspect (for metadata mode)
--metadata-format <string> metadata output format, one of [text, json] (default: text)
--preview-path <string> path to write preview image to (default: ./preview.png). Multi-frame previews support
.avi, .webm, and animated .webp
--preview-interval <int> interval in denoising steps between consecutive updates of the image preview file
(default is 1, meaning updating at every step)
--output-begin-idx <int> starting index for output image sequence, must be non-negative (default 0 if specified
%d in output path, 1 otherwise)
--canny apply canny preprocessor (edge detection)
--convert-name convert tensor name (for convert mode)
-v, --verbose print extra info
--color colors the logging tags according to level
--taesd-preview-only prevents usage of taesd for decoding the final image. (for use with --preview tae)
--preview-noisy enables previewing noisy inputs of the models rather than the denoised outputs
--metadata-raw include raw hex previews for unparsed metadata payloads
--metadata-brief truncate long metadata text values in text output
--metadata-all include structural/container entries such as IHDR, IDAT, and non-metadata JPEG segments
-M, --mode run mode, one of [img_gen, vid_gen, upscale, convert, metadata], default: img_gen
--preview preview method. must be one of the following [none, proj, tae, vae] (default is none)
-h, --help show this help message and exit
Context Options:
-m, --model <string> path to full model
--clip_l <string> path to the clip-l text encoder
--clip_g <string> path to the clip-g text encoder
--clip_vision <string> path to the clip-vision encoder
--t5xxl <string> path to the t5xxl text encoder
--llm <string> path to the llm text encoder. For example: (qwenvl2.5 for qwen-image,
mistral-small3.2 for flux2, ...)
--llm_vision <string> path to the llm vit
--qwen2vl <string> alias of --llm. Deprecated.
--qwen2vl_vision <string> alias of --llm_vision. Deprecated.
--diffusion-model <string> path to the standalone diffusion model
--high-noise-diffusion-model <string> path to the standalone high noise diffusion model
--uncond-diffusion-model <string> path to the standalone unconditional diffusion model, currently used by
Ideogram4 CFG
--vae <string> path to standalone vae model
--taesd <string> path to taesd. Using Tiny AutoEncoder for fast decoding (low quality)
--tae <string> alias of --taesd
--control-net <string> path to control net model
--embd-dir <string> embeddings directory
--lora-model-dir <string> lora model directory
--hires-upscalers-dir <string> highres fix upscaler model directory
--tensor-type-rules <string> weight type per tensor pattern (example: "^vae\.=f16,model\.=q8_0")
--photo-maker <string> path to PHOTOMAKER model
--upscale-model <string> path to esrgan model.
-t, --threads <int> number of threads to use during computation (default: -1). If threads <= 0,
then threads will be set to the number of CPU physical cores
--chroma-t5-mask-pad <int> t5 mask pad size of chroma
--max-vram <float> maximum VRAM budget in GiB for graph-cut segmented execution. 0 disables
graph splitting; a negative value auto-detects free VRAM, sparing the
specified value (e.g. -0.5 will keep at least 0.5 GiB free)
--force-sdxl-vae-conv-scale force use of conv scale on sdxl vae
--offload-to-cpu place the weights in RAM to save VRAM, and automatically load them into VRAM
when needed
--mmap whether to memory-map model
--control-net-cpu keep controlnet in cpu (for low vram)
--clip-on-cpu keep clip in cpu (for low vram)
--vae-on-cpu keep vae in cpu (for low vram)
--fa use flash attention
--diffusion-fa use flash attention in the diffusion model only
--diffusion-conv-direct use ggml_conv2d_direct in the diffusion model
--vae-conv-direct use ggml_conv2d_direct in the vae model
--circular enable circular padding for convolutions
--circularx enable circular RoPE wrapping on x-axis (width) only
--circulary enable circular RoPE wrapping on y-axis (height) only
--chroma-disable-dit-mask disable dit mask for chroma
--qwen-image-zero-cond-t enable zero_cond_t for qwen image
--chroma-enable-t5-mask enable t5 mask for chroma
--type weight type (examples: f32, f16, q4_0, q4_1, q5_0, q5_1, q8_0, q2_K, q3_K,
q4_K). If not specified, the default is the type of the weight file
--rng RNG, one of [std_default, cuda, cpu], default: cuda(sd-webui), cpu(comfyui)
--sampler-rng sampler RNG, one of [std_default, cuda, cpu]. If not specified, use --rng
--prediction prediction type override, one of [eps, v, edm_v, sd3_flow, flux_flow,
flux2_flow]
--lora-apply-mode the way to apply LoRA, one of [auto, immediately, at_runtime], default is
auto. In auto mode, if the model weights contain any quantized parameters,
the at_runtime mode will be used; otherwise, immediately will be used.The
immediately mode may have precision and compatibility issues with quantized
parameters, but it usually offers faster inference speed and, in some cases,
lower memory usage. The at_runtime mode, on the other hand, is exactly the
opposite.
Generation Options:
-p, --prompt <string> the prompt to render
-n, --negative-prompt <string> the negative prompt (default: "")
-i, --init-img <string> path to the init image
--end-img <string> path to the end image, required by flf2v
--mask <string> path to the mask image
--control-image <string> path to control image, control net
--control-video <string> path to control video frames, It must be a directory path. The video frames
inside should be stored as images in lexicographical (character) order. For
example, if the control video path is `frames`, the directory contain images
such as 00.png, 01.png, ... etc.
--pm-id-images-dir <string> path to PHOTOMAKER input id images dir
--pm-id-embed-path <string> path to PHOTOMAKER v2 id embed
--hires-upscaler <string> highres fix upscaler, Lanczos, Nearest, Latent, Latent (nearest), Latent
(nearest-exact), Latent (antialiased), Latent (bicubic), Latent (bicubic
antialiased), or a model name under --hires-upscalers-dir (default: Latent)
--extra-sample-args <string> extra sampler/scheduler/guidance args, key=value list. APG supports apg_eta,
apg_momentum, apg_norm_threshold, apg_norm_threshold_smoothing; SLG supports
slg_uncond; lcm supports noise_clip_std, noise_scale_start, noise_scale_end;
ltx2 supports max_shift, base_shift, stretch, terminal; euler_ge supports gamma
--extra-tiling-args <string> extra VAE tiling args, key=value list. LTX video VAE supports
temporal_tile_frames (default: 4), temporal_tile_overlap (default: 1)
-H, --height <int> image height, in pixel space (default: 512)
-W, --width <int> image width, in pixel space (default: 512)
--steps <int> number of sample steps (default: 20)
--high-noise-steps <int> (high noise) number of sample steps (default: -1 = auto)
--clip-skip <int> ignore last layers of CLIP network; 1 ignores none, 2 ignores one layer
(default: -1). <= 0 represents unspecified, will be 1 for SD1.x, 2 for SD2.x
-b, --batch-count <int> batch count
--video-frames <int> video frames (default: 1)
--fps <int> fps (default: 24)
--timestep-shift <int> shift timestep for NitroFusion models (default: 0). recommended N for
NitroSD-Realism around 250 and 500 for NitroSD-Vibrant
--upscale-repeats <int> Run the ESRGAN upscaler this many times (default: 1)
--upscale-tile-size <int> tile size for ESRGAN upscaling (default: 128)
--hires-width <int> highres fix target width, 0 to use --hires-scale (default: 0)
--hires-height <int> highres fix target height, 0 to use --hires-scale (default: 0)
--hires-steps <int> highres fix second pass sample steps, 0 to reuse --steps (default: 0)
--hires-upscale-tile-size <int> highres fix upscaler tile size, reserved for model-backed upscalers (default:
128)
--cfg-scale <float> unconditional guidance scale: (default: 7.0)
--img-cfg-scale <float> image guidance scale for inpaint or image edit models: (default: same as
--cfg-scale)
--guidance <float> distilled guidance scale for models with guidance input (default: 3.5)
--slg-scale <float> skip layer guidance (SLG) scale, only for DiT models: (default: 0). 0 means
disabled, a value of 2.5 is nice for sd3.5 medium
--skip-layer-start <float> SLG enabling point (default: 0.01)
--skip-layer-end <float> SLG disabling point (default: 0.2)
--eta <float> noise multiplier (default: 0 for ddim_trailing, tcd, res_multistep and
res_2s; 1 for euler_a, er_sde and dpm++2s_a)
--flow-shift <float> shift value for Flow models like SD3.x or WAN (default: auto)
--high-noise-cfg-scale <float> (high noise) unconditional guidance scale: (default: 7.0)
--high-noise-img-cfg-scale <float> (high noise) image guidance scale for inpaint or image edit models (default:
same as --cfg-scale)
--high-noise-guidance <float> (high noise) distilled guidance scale for models with guidance input
(default: 3.5)
--high-noise-slg-scale <float> (high noise) skip layer guidance (SLG) scale, only for DiT models: (default:
0)
--high-noise-skip-layer-start <float> (high noise) SLG enabling point (default: 0.01)
--high-noise-skip-layer-end <float> (high noise) SLG disabling point (default: 0.2)
--high-noise-eta <float> (high noise) noise multiplier (default: 0 for ddim_trailing, tcd,
res_multistep and res_2s; 1 for euler_a, er_sde and dpm++2s_a)
--strength <float> strength for noising/unnoising (default: 0.75)
--pm-style-strength <float>
--control-strength <float> strength to apply Control Net (default: 0.9). 1.0 corresponds to full
destruction of information in init image
--moe-boundary <float> timestep boundary for Wan2.2 MoE model. (default: 0.875). Only enabled if
`--high-noise-steps` is set to -1
--vace-strength <float> wan vace strength
--vae-tile-overlap <float> tile overlap for vae tiling, in fraction of tile size (default: 0.5)
--hires-scale <float> highres fix scale when target size is not set (default: 2.0)
--hires-denoising-strength <float> highres fix second pass denoising strength (default: 0.7)
--increase-ref-index automatically increase the indices of references images based on the order
they are listed (starting with 1).
--disable-auto-resize-ref-image disable auto resize of ref images
--disable-image-metadata do not embed generation metadata on image files
--vae-tiling process vae in tiles to reduce memory usage
--temporal-tiling enable temporal tiling for LTX video VAE decode
--hires enable highres fix
-s, --seed RNG seed (default: 42, use random seed for < 0)
--sampling-method sampling method, one of [euler, euler_a, heun, dpm2, dpm++2s_a, dpm++2m,
dpm++2mv2, ipndm, ipndm_v, lcm, ddim_trailing, tcd, res_multistep, res_2s,
er_sde, euler_cfg_pp, euler_a_cfg_pp] (default: euler for Flux/SD3/Wan, euler_a otherwise)
--high-noise-sampling-method (high noise) sampling method, one of [euler, euler_a, heun, dpm2, dpm++2s_a,
dpm++2m, dpm++2mv2, ipndm, ipndm_v, lcm, ddim_trailing, tcd, res_multistep,
res_2s, er_sde, euler_cfg_pp, euler_a_cfg_pp] default: euler for Flux/SD3/Wan, euler_a otherwise
--scheduler denoiser sigma scheduler, one of [discrete, karras, exponential, ays, gits,
smoothstep, sgm_uniform, simple, kl_optimal, lcm, bong_tangent, ltx2], default:
model-specific
--sigmas custom sigma values for the sampler, comma-separated (e.g.,
"14.61,7.8,3.5,0.0").
--hires-sigmas custom sigma values for the highres fix second pass, comma-separated (e.g.,
"0.85,0.725,0.421875,0.0").
--skip-layers layers to skip for SLG steps (default: [7,8,9])
--high-noise-skip-layers (high noise) layers to skip for SLG steps (default: [7,8,9])
-r, --ref-image reference image for Flux Kontext models (can be used multiple times)
--cache-mode caching method: 'easycache' (DiT), 'ucache' (UNET),
'dbcache'/'taylorseer'/'cache-dit' (DiT block-level), 'spectrum' (UNET/DiT
Chebyshev+Taylor forecasting)
--cache-option named cache params (key=value format, comma-separated). easycache/ucache:
threshold=,start=,end=,decay=,relative=,reset=; dbcache/taylorseer/cache-dit:
Fn=,Bn=,threshold=,warmup=; spectrum: w=,m=,lam=,window=,flex=,warmup=,stop=.
Examples: "threshold=0.25" or "threshold=1.5,reset=0"
--scm-mask SCM steps mask for cache-dit: comma-separated 0/1 (e.g.,
"1,1,1,0,0,1,0,0,1,0") - 1=compute, 0=can cache
--scm-policy SCM policy: 'dynamic' (default) or 'static'
--vae-tile-size tile size for vae tiling, format [X]x[Y] (default: 32x32)
--vae-relative-tile-size relative tile size for vae tiling, format [X]x[Y], in fraction of image size
if < 1, in number of tiles per dim if >=1 (overrides --vae-tile-size)
```bash
./bin/sd-cli -h
```
Metadata mode inspects PNG/JPEG container metadata without loading any model:

View File

@ -623,8 +623,6 @@ int main(int argc, const char* argv[]) {
}
}
bool vae_decode_only = true;
auto load_image_and_update_size = [&](const std::string& path,
SDImageOwner& image,
bool resize_image = true,
@ -646,21 +644,18 @@ int main(int argc, const char* argv[]) {
};
if (gen_params.init_image_path.size() > 0) {
vae_decode_only = false;
if (!load_image_and_update_size(gen_params.init_image_path, gen_params.init_image)) {
return 1;
}
}
if (gen_params.end_image_path.size() > 0) {
vae_decode_only = false;
if (!load_image_and_update_size(gen_params.end_image_path, gen_params.end_image)) {
return 1;
}
}
if (gen_params.ref_image_paths.size() > 0) {
vae_decode_only = false;
gen_params.ref_images.clear();
for (auto& path : gen_params.ref_image_paths) {
SDImageOwner ref_image({0, 0, 3, nullptr});
@ -735,18 +730,7 @@ int main(int argc, const char* argv[]) {
}
}
if (cli_params.mode == VID_GEN) {
vae_decode_only = false;
}
if (gen_params.hires_enabled &&
(gen_params.resolved_hires_upscaler == SD_HIRES_UPSCALER_MODEL ||
gen_params.resolved_hires_upscaler == SD_HIRES_UPSCALER_LANCZOS ||
gen_params.resolved_hires_upscaler == SD_HIRES_UPSCALER_NEAREST)) {
vae_decode_only = false;
}
sd_ctx_params_t sd_ctx_params = ctx_params.to_sd_ctx_params_t(vae_decode_only, true, cli_params.taesd_preview);
sd_ctx_params_t sd_ctx_params = ctx_params.to_sd_ctx_params_t(cli_params.taesd_preview);
SDImageVec results;
int num_results = 0;
@ -798,12 +782,11 @@ int main(int argc, const char* argv[]) {
int upscale_factor = 4; // unused for RealESRGAN_x4plus_anime_6B.pth
if (ctx_params.esrgan_path.size() > 0 && gen_params.upscale_repeats > 0) {
UpscalerCtxPtr upscaler_ctx(new_upscaler_ctx(ctx_params.esrgan_path.c_str(),
ctx_params.offload_params_to_cpu,
ctx_params.diffusion_conv_direct,
ctx_params.n_threads,
gen_params.upscale_tile_size,
ctx_params.backend.c_str(),
ctx_params.params_backend.c_str()));
sd_ctx_params.backend,
sd_ctx_params.params_backend));
if (upscaler_ctx == nullptr) {
LOG_ERROR("new_upscaler_ctx failed");

View File

@ -51,6 +51,10 @@ static sd_vae_format_t str_to_vae_format(const std::string& value) {
return SD_VAE_FORMAT_COUNT;
}
static void prepend_backend_assignment(std::string& spec, const char* assignment) {
spec = spec.empty() ? assignment : std::string(assignment) + "," + spec;
}
#if defined(_WIN32)
static std::string utf16_to_utf8(const std::wstring& wstr) {
if (wstr.empty())
@ -421,8 +425,12 @@ ArgOptions SDContextParams::get_options() {
&backend},
{"",
"--params-backend",
"parameter backend assignment, e.g. cpu or diffusion=cpu,clip=cpu",
"parameter backend assignment, e.g. disk, cpu, or diffusion=disk,clip=cpu",
&params_backend},
{"",
"--rpc-servers",
"comma-separated list of RPC servers to connect to for offloading, in the format host:port, e.g. localhost:50052,192.168.1.3:50052",
&rpc_servers},
};
options.int_options = {
@ -463,15 +471,15 @@ ArgOptions SDContextParams::get_options() {
true, &enable_mmap},
{"",
"--control-net-cpu",
"keep controlnet in cpu (for low vram)",
"deprecated; use --backend controlnet=cpu",
true, &control_net_cpu},
{"",
"--clip-on-cpu",
"keep clip in cpu (for low vram)",
"deprecated; use --backend te=cpu",
true, &clip_on_cpu},
{"",
"--vae-on-cpu",
"keep vae in cpu (for low vram)",
"deprecated; use --backend vae=cpu",
true, &vae_on_cpu},
{"",
"--fa",
@ -688,6 +696,25 @@ bool SDContextParams::resolve_and_validate(SDMode mode) {
return true;
}
void SDContextParams::prepare_backend_assignments() {
effective_backend = backend;
effective_params_backend = params_backend;
if (offload_params_to_cpu) {
prepend_backend_assignment(effective_params_backend, "*=cpu");
}
if (clip_on_cpu) {
prepend_backend_assignment(effective_backend, "te=cpu");
}
if (vae_on_cpu) {
prepend_backend_assignment(effective_backend, "vae=cpu");
}
if (control_net_cpu) {
prepend_backend_assignment(effective_backend, "controlnet=cpu");
}
}
std::string SDContextParams::to_string() const {
std::ostringstream emb_ss;
emb_ss << "{\n";
@ -757,7 +784,8 @@ std::string SDContextParams::to_string() const {
return oss.str();
}
sd_ctx_params_t SDContextParams::to_sd_ctx_params_t(bool vae_decode_only, bool free_params_immediately, bool taesd_preview) {
sd_ctx_params_t SDContextParams::to_sd_ctx_params_t(bool taesd_preview) {
prepare_backend_assignments();
embedding_vec.clear();
embedding_vec.reserve(embedding_map.size());
for (const auto& kv : embedding_map) {
@ -767,57 +795,52 @@ sd_ctx_params_t SDContextParams::to_sd_ctx_params_t(bool vae_decode_only, bool f
embedding_vec.emplace_back(item);
}
sd_ctx_params_t sd_ctx_params = {
model_path.c_str(),
clip_l_path.c_str(),
clip_g_path.c_str(),
clip_vision_path.c_str(),
t5xxl_path.c_str(),
llm_path.c_str(),
llm_vision_path.c_str(),
diffusion_model_path.c_str(),
high_noise_diffusion_model_path.c_str(),
uncond_diffusion_model_path.c_str(),
embeddings_connectors_path.c_str(),
vae_path.c_str(),
audio_vae_path.c_str(),
taesd_path.c_str(),
control_net_path.c_str(),
embedding_vec.data(),
static_cast<uint32_t>(embedding_vec.size()),
photo_maker_path.c_str(),
tensor_type_rules.c_str(),
vae_decode_only,
free_params_immediately,
n_threads,
wtype,
rng_type,
sampler_rng_type,
prediction,
lora_apply_mode,
offload_params_to_cpu,
enable_mmap,
clip_on_cpu,
control_net_cpu,
vae_on_cpu,
flash_attn,
diffusion_flash_attn,
taesd_preview,
diffusion_conv_direct,
vae_conv_direct,
circular || circular_x,
circular || circular_y,
force_sdxl_vae_conv_scale,
chroma_use_dit_mask,
chroma_use_t5_mask,
chroma_t5_mask_pad,
qwen_image_zero_cond_t,
str_to_vae_format(vae_format),
max_vram,
stream_layers,
backend.c_str(),
params_backend.c_str(),
};
sd_ctx_params_t sd_ctx_params;
sd_ctx_params_init(&sd_ctx_params);
sd_ctx_params.model_path = model_path.c_str();
sd_ctx_params.clip_l_path = clip_l_path.c_str();
sd_ctx_params.clip_g_path = clip_g_path.c_str();
sd_ctx_params.clip_vision_path = clip_vision_path.c_str();
sd_ctx_params.t5xxl_path = t5xxl_path.c_str();
sd_ctx_params.llm_path = llm_path.c_str();
sd_ctx_params.llm_vision_path = llm_vision_path.c_str();
sd_ctx_params.diffusion_model_path = diffusion_model_path.c_str();
sd_ctx_params.high_noise_diffusion_model_path = high_noise_diffusion_model_path.c_str();
sd_ctx_params.uncond_diffusion_model_path = uncond_diffusion_model_path.c_str();
sd_ctx_params.embeddings_connectors_path = embeddings_connectors_path.c_str();
sd_ctx_params.vae_path = vae_path.c_str();
sd_ctx_params.audio_vae_path = audio_vae_path.c_str();
sd_ctx_params.taesd_path = taesd_path.c_str();
sd_ctx_params.control_net_path = control_net_path.c_str();
sd_ctx_params.embeddings = embedding_vec.data();
sd_ctx_params.embedding_count = static_cast<uint32_t>(embedding_vec.size());
sd_ctx_params.photo_maker_path = photo_maker_path.c_str();
sd_ctx_params.tensor_type_rules = tensor_type_rules.c_str();
sd_ctx_params.n_threads = n_threads;
sd_ctx_params.wtype = wtype;
sd_ctx_params.rng_type = rng_type;
sd_ctx_params.sampler_rng_type = sampler_rng_type;
sd_ctx_params.prediction = prediction;
sd_ctx_params.lora_apply_mode = lora_apply_mode;
sd_ctx_params.enable_mmap = enable_mmap;
sd_ctx_params.flash_attn = flash_attn;
sd_ctx_params.diffusion_flash_attn = diffusion_flash_attn;
sd_ctx_params.tae_preview_only = taesd_preview;
sd_ctx_params.diffusion_conv_direct = diffusion_conv_direct;
sd_ctx_params.vae_conv_direct = vae_conv_direct;
sd_ctx_params.circular_x = circular || circular_x;
sd_ctx_params.circular_y = circular || circular_y;
sd_ctx_params.force_sdxl_vae_conv_scale = force_sdxl_vae_conv_scale;
sd_ctx_params.chroma_use_dit_mask = chroma_use_dit_mask;
sd_ctx_params.chroma_use_t5_mask = chroma_use_t5_mask;
sd_ctx_params.chroma_t5_mask_pad = chroma_t5_mask_pad;
sd_ctx_params.qwen_image_zero_cond_t = qwen_image_zero_cond_t;
sd_ctx_params.vae_format = str_to_vae_format(vae_format);
sd_ctx_params.max_vram = max_vram;
sd_ctx_params.stream_layers = stream_layers;
sd_ctx_params.backend = effective_backend.c_str();
sd_ctx_params.params_backend = effective_params_backend.c_str();
sd_ctx_params.rpc_servers = rpc_servers.c_str();
return sd_ctx_params;
}
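
The hunks above replace the legacy boolean flags with entries prepended to the backend spec strings. The following is a minimal sketch of that translation, not the project's code: `DemoContextParams` is a hypothetical stand-in that models only the relevant fields, and the device name `cuda0` is a placeholder. The effective strings are kept as members, presumably so that `c_str()` pointers handed to the C API remain valid while the object lives. How the spec parser resolves conflicting entries is not shown in this diff.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the CLI-side parameter holder; only the fields
// relevant to backend assignment are modelled here.
struct DemoContextParams {
    std::string backend;         // value of --backend
    std::string params_backend;  // value of --params-backend
    bool offload_params_to_cpu = false;
    bool clip_on_cpu           = false;
    bool vae_on_cpu            = false;
    bool control_net_cpu       = false;

    // Owned by the struct so that c_str() pointers stay valid for as long
    // as this object lives and these strings are not modified.
    std::string effective_backend;
    std::string effective_params_backend;

    // Puts `assignment` in front of `spec`, comma-separated when needed.
    static void prepend(std::string& spec, const char* assignment) {
        assert(assignment != nullptr);
        spec = spec.empty() ? assignment : std::string(assignment) + "," + spec;
    }

    // Translates the legacy flags into spec entries, in the same order as
    // the diff: params offload first, then te, vae, controlnet.
    void prepare() {
        effective_backend        = backend;
        effective_params_backend = params_backend;
        if (offload_params_to_cpu) prepend(effective_params_backend, "*=cpu");
        if (clip_on_cpu)           prepend(effective_backend, "te=cpu");
        if (vae_on_cpu)            prepend(effective_backend, "vae=cpu");
        if (control_net_cpu)       prepend(effective_backend, "controlnet=cpu");
    }
};
```

Because each flag is prepended, the entries derived from legacy flags always end up in front of whatever the user passed explicitly.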

View File

@ -148,6 +148,9 @@ struct SDContextParams {
bool stream_layers = false;
std::string backend;
std::string params_backend;
std::string rpc_servers;
std::string effective_backend;
std::string effective_params_backend;
bool enable_mmap = false;
bool control_net_cpu = false;
bool clip_on_cpu = false;
@ -175,11 +178,12 @@ struct SDContextParams {
float flow_shift = INFINITY;
ArgOptions get_options();
void build_embedding_map();
void prepare_backend_assignments();
bool resolve(SDMode mode);
bool validate(SDMode mode);
bool resolve_and_validate(SDMode mode);
std::string to_string() const;
sd_ctx_params_t to_sd_ctx_params_t(bool vae_decode_only, bool free_params_immediately, bool taesd_preview);
sd_ctx_params_t to_sd_ctx_params_t(bool taesd_preview);
};
struct SDGenerationParams {
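
The signature change above goes together with a switch from positional aggregate initialization to an init call followed by named field assignments. The sketch below illustrates why that pattern tolerates added or removed fields better. Everything in it is hypothetical: `demo_ctx_params_t`, `demo_ctx_params_init`, and the defaults are illustrative and are not taken from the real `sd_ctx_params_t`.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Hypothetical C-style parameter struct; field names and defaults are
// illustrative only.
struct demo_ctx_params_t {
    const char* model_path;
    int n_threads;
    bool enable_mmap;
    const char* backend;
    const char* rpc_servers;
};

// Hypothetical init function: puts every field into a defined default state
// so callers only need to set the fields they care about.
void demo_ctx_params_init(demo_ctx_params_t* p) {
    assert(p != nullptr);
    *p             = demo_ctx_params_t{};
    p->model_path  = "";
    p->n_threads   = -1;
    p->backend     = "";
    p->rpc_servers = "";
}

// Named assignment: adding, removing, or reordering struct fields cannot
// silently shift a value into the wrong slot, unlike a positional
// initializer list. The strings must outlive the returned struct because
// only their c_str() pointers are stored.
demo_ctx_params_t make_params(const std::string& model_path, const std::string& backend) {
    demo_ctx_params_t p;
    demo_ctx_params_init(&p);
    p.model_path = model_path.c_str();
    p.backend    = backend.c_str();
    return p;
}
```

Fields that are never assigned, such as `rpc_servers` here, keep the default from the init function instead of an indeterminate value.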

View File

@ -117,188 +117,10 @@ In this case, the server will load and serve the specified `index.html` file ins
* using a custom UI
* avoiding rebuilding the binary after frontend modifications
# Run
# Usage
```
usage: ./bin/sd-server [options]
Svr Options:
-l, --listen-ip <string> server listen ip (default: 127.0.0.1)
--serve-html-path <string> path to HTML file to serve at root (optional)
--listen-port <int> server listen port (default: 1234)
-v, --verbose print extra info
--color colors the logging tags according to level
-h, --help show this help message and exit
Context Options:
-m, --model <string> path to full model
--clip_l <string> path to the clip-l text encoder
--clip_g <string> path to the clip-g text encoder
--clip_vision <string> path to the clip-vision encoder
--t5xxl <string> path to the t5xxl text encoder
--llm <string> path to the llm text encoder. For example: (qwenvl2.5 for qwen-image,
mistral-small3.2 for flux2, ...)
--llm_vision <string> path to the llm vit
--qwen2vl <string> alias of --llm. Deprecated.
--qwen2vl_vision <string> alias of --llm_vision. Deprecated.
--diffusion-model <string> path to the standalone diffusion model
--high-noise-diffusion-model <string> path to the standalone high noise diffusion model
--uncond-diffusion-model <string> path to the standalone unconditional diffusion model, currently used by
Ideogram4 CFG
--vae <string> path to standalone vae model
--taesd <string> path to taesd. Using Tiny AutoEncoder for fast decoding (low quality)
--tae <string> alias of --taesd
--control-net <string> path to control net model
--embd-dir <string> embeddings directory
--lora-model-dir <string> lora model directory
--hires-upscalers-dir <string> highres fix upscaler model directory
--tensor-type-rules <string> weight type per tensor pattern (example: "^vae\.=f16,model\.=q8_0")
--photo-maker <string> path to PHOTOMAKER model
--upscale-model <string> path to esrgan model.
-t, --threads <int> number of threads to use during computation (default: -1). If threads <= 0,
then threads will be set to the number of CPU physical cores
--chroma-t5-mask-pad <int> t5 mask pad size of chroma
--max-vram <float> maximum VRAM budget in GiB for graph-cut segmented execution. 0 disables
graph splitting; a negative value auto-detects free VRAM, sparing the
specified value (e.g. -0.5 will keep at least 0.5 GiB free)
--force-sdxl-vae-conv-scale force use of conv scale on sdxl vae
--offload-to-cpu place the weights in RAM to save VRAM, and automatically load them into VRAM
when needed
--mmap whether to memory-map model
--control-net-cpu keep controlnet in cpu (for low vram)
--clip-on-cpu keep clip in cpu (for low vram)
--vae-on-cpu keep vae in cpu (for low vram)
--fa use flash attention
--diffusion-fa use flash attention in the diffusion model only
--diffusion-conv-direct use ggml_conv2d_direct in the diffusion model
--vae-conv-direct use ggml_conv2d_direct in the vae model
--circular enable circular padding for convolutions
--circularx enable circular RoPE wrapping on x-axis (width) only
--circulary enable circular RoPE wrapping on y-axis (height) only
--chroma-disable-dit-mask disable dit mask for chroma
--qwen-image-zero-cond-t enable zero_cond_t for qwen image
--chroma-enable-t5-mask enable t5 mask for chroma
--type weight type (examples: f32, f16, q4_0, q4_1, q5_0, q5_1, q8_0, q2_K, q3_K,
q4_K). If not specified, the default is the type of the weight file
--rng RNG, one of [std_default, cuda, cpu], default: cuda(sd-webui), cpu(comfyui)
--sampler-rng sampler RNG, one of [std_default, cuda, cpu]. If not specified, use --rng
--prediction prediction type override, one of [eps, v, edm_v, sd3_flow, flux_flow,
flux2_flow]
--lora-apply-mode the way to apply LoRA, one of [auto, immediately, at_runtime], default is
auto. In auto mode, if the model weights contain any quantized parameters,
the at_runtime mode will be used; otherwise, immediately will be used. The
immediately mode may have precision and compatibility issues with quantized
parameters, but it usually offers faster inference speed and, in some cases,
lower memory usage. The at_runtime mode, on the other hand, is exactly the
opposite.
Default Generation Options:
-p, --prompt <string> the prompt to render
-n, --negative-prompt <string> the negative prompt (default: "")
-i, --init-img <string> path to the init image
--end-img <string> path to the end image, required by flf2v
--mask <string> path to the mask image
--control-image <string> path to control image, control net
--control-video <string> path to control video frames. It must be a directory path. The video frames
inside should be stored as images in lexicographical (character) order. For
example, if the control video path is `frames`, the directory contains images
such as 00.png, 01.png, ... etc.
--pm-id-images-dir <string> path to PHOTOMAKER input id images dir
--pm-id-embed-path <string> path to PHOTOMAKER v2 id embed
--hires-upscaler <string> highres fix upscaler, Lanczos, Nearest, Latent, Latent (nearest), Latent
(nearest-exact), Latent (antialiased), Latent (bicubic), Latent (bicubic
antialiased), or a model name under --hires-upscalers-dir (default: Latent)
--extra-sample-args <string> extra sampler/scheduler/guidance args, key=value list. APG supports apg_eta,
apg_momentum, apg_norm_threshold, apg_norm_threshold_smoothing; SLG supports
slg_uncond; lcm supports noise_clip_std, noise_scale_start, noise_scale_end;
ltx2 supports max_shift, base_shift, stretch, terminal; euler_ge supports gamma
--extra-tiling-args <string> extra VAE tiling args, key=value list. LTX video VAE supports
temporal_tile_frames (default: 4), temporal_tile_overlap (default: 1)
-H, --height <int> image height, in pixel space (default: 512)
-W, --width <int> image width, in pixel space (default: 512)
--steps <int> number of sample steps (default: 20)
--high-noise-steps <int> (high noise) number of sample steps (default: -1 = auto)
--clip-skip <int> ignore last layers of CLIP network; 1 ignores none, 2 ignores one layer
(default: -1). <= 0 represents unspecified, will be 1 for SD1.x, 2 for SD2.x
-b, --batch-count <int> batch count
--video-frames <int> video frames (default: 1)
--fps <int> fps (default: 24)
--timestep-shift <int> shift timestep for NitroFusion models (default: 0). recommended N for
NitroSD-Realism around 250 and 500 for NitroSD-Vibrant
--upscale-repeats <int> Run the ESRGAN upscaler this many times (default: 1)
--upscale-tile-size <int> tile size for ESRGAN upscaling (default: 128)
--hires-width <int> highres fix target width, 0 to use --hires-scale (default: 0)
--hires-height <int> highres fix target height, 0 to use --hires-scale (default: 0)
--hires-steps <int> highres fix second pass sample steps, 0 to reuse --steps (default: 0)
--hires-upscale-tile-size <int> highres fix upscaler tile size, reserved for model-backed upscalers (default:
128)
--cfg-scale <float> unconditional guidance scale: (default: 7.0)
--img-cfg-scale <float> image guidance scale for inpaint or image edit models: (default: same as
--cfg-scale)
--guidance <float> distilled guidance scale for models with guidance input (default: 3.5)
--slg-scale <float> skip layer guidance (SLG) scale, only for DiT models: (default: 0). 0 means
disabled, a value of 2.5 is nice for sd3.5 medium
--skip-layer-start <float> SLG enabling point (default: 0.01)
--skip-layer-end <float> SLG disabling point (default: 0.2)
--eta <float> noise multiplier (default: 0 for ddim_trailing, tcd, res_multistep and
res_2s; 1 for euler_a, er_sde and dpm++2s_a)
--flow-shift <float> shift value for Flow models like SD3.x or WAN (default: auto)
--high-noise-cfg-scale <float> (high noise) unconditional guidance scale: (default: 7.0)
--high-noise-img-cfg-scale <float> (high noise) image guidance scale for inpaint or image edit models (default:
same as --cfg-scale)
--high-noise-guidance <float> (high noise) distilled guidance scale for models with guidance input
(default: 3.5)
--high-noise-slg-scale <float> (high noise) skip layer guidance (SLG) scale, only for DiT models: (default:
0)
--high-noise-skip-layer-start <float> (high noise) SLG enabling point (default: 0.01)
--high-noise-skip-layer-end <float> (high noise) SLG disabling point (default: 0.2)
--high-noise-eta <float> (high noise) noise multiplier (default: 0 for ddim_trailing, tcd,
res_multistep and res_2s; 1 for euler_a, er_sde and dpm++2s_a)
--strength <float> strength for noising/unnoising (default: 0.75)
--pm-style-strength <float>
--control-strength <float> strength to apply Control Net (default: 0.9). 1.0 corresponds to full
destruction of information in init image
--moe-boundary <float> timestep boundary for Wan2.2 MoE model. (default: 0.875). Only enabled if
`--high-noise-steps` is set to -1
--vace-strength <float> wan vace strength
--vae-tile-overlap <float> tile overlap for vae tiling, in fraction of tile size (default: 0.5)
--hires-scale <float> highres fix scale when target size is not set (default: 2.0)
--hires-denoising-strength <float> highres fix second pass denoising strength (default: 0.7)
--increase-ref-index automatically increase the indices of reference images based on the order
they are listed (starting with 1).
--disable-auto-resize-ref-image disable auto resize of ref images
--disable-image-metadata do not embed generation metadata on image files
--vae-tiling process vae in tiles to reduce memory usage
--temporal-tiling enable temporal tiling for LTX video VAE decode
--hires enable highres fix
-s, --seed RNG seed (default: 42, use random seed for < 0)
--sampling-method sampling method, one of [euler, euler_a, heun, dpm2, dpm++2s_a, dpm++2m,
dpm++2mv2, ipndm, ipndm_v, lcm, ddim_trailing, tcd, res_multistep, res_2s,
er_sde, euler_cfg_pp, euler_a_cfg_pp] (default: euler for Flux/SD3/Wan, euler_a otherwise)
--high-noise-sampling-method (high noise) sampling method, one of [euler, euler_a, heun, dpm2, dpm++2s_a,
dpm++2m, dpm++2mv2, ipndm, ipndm_v, lcm, ddim_trailing, tcd, res_multistep,
res_2s, er_sde, euler_cfg_pp, euler_a_cfg_pp] default: euler for Flux/SD3/Wan, euler_a otherwise
--scheduler denoiser sigma scheduler, one of [discrete, karras, exponential, ays, gits,
smoothstep, sgm_uniform, simple, kl_optimal, lcm, bong_tangent, ltx2], default:
model-specific
--sigmas custom sigma values for the sampler, comma-separated (e.g.,
"14.61,7.8,3.5,0.0").
--hires-sigmas custom sigma values for the highres fix second pass, comma-separated (e.g.,
"0.85,0.725,0.421875,0.0").
--skip-layers layers to skip for SLG steps (default: [7,8,9])
--high-noise-skip-layers (high noise) layers to skip for SLG steps (default: [7,8,9])
-r, --ref-image reference image for Flux Kontext models (can be used multiple times)
--cache-mode caching method: 'easycache' (DiT), 'ucache' (UNET),
'dbcache'/'taylorseer'/'cache-dit' (DiT block-level), 'spectrum' (UNET/DiT
Chebyshev+Taylor forecasting)
--cache-option named cache params (key=value format, comma-separated). easycache/ucache:
threshold=,start=,end=,decay=,relative=,reset=; dbcache/taylorseer/cache-dit:
Fn=,Bn=,threshold=,warmup=; spectrum: w=,m=,lam=,window=,flex=,warmup=,stop=.
Examples: "threshold=0.25" or "threshold=1.5,reset=0"
--scm-mask SCM steps mask for cache-dit: comma-separated 0/1 (e.g.,
"1,1,1,0,0,1,0,0,1,0") - 1=compute, 0=can cache
--scm-policy SCM policy: 'dynamic' (default) or 'static'
--vae-tile-size tile size for vae tiling, format [X]x[Y] (default: 32x32)
--vae-relative-tile-size relative tile size for vae tiling, format [X]x[Y], in fraction of image size
if < 1, in number of tiles per dim if >=1 (overrides --vae-tile-size)
For detailed command-line arguments, run:
```bash
./bin/sd-server -h
```

View File

@ -85,7 +85,7 @@ int main(int argc, const char** argv) {
LOG_DEBUG("%s", ctx_params.to_string().c_str());
LOG_DEBUG("%s", default_gen_params.to_string().c_str());
sd_ctx_params_t sd_ctx_params = ctx_params.to_sd_ctx_params_t(false, false, false);
sd_ctx_params_t sd_ctx_params = ctx_params.to_sd_ctx_params_t(false);
SDCtxPtr sd_ctx(new_sd_ctx(&sd_ctx_params));
if (sd_ctx == nullptr) {

View File

@ -196,19 +196,13 @@ typedef struct {
uint32_t embedding_count;
const char* photo_maker_path;
const char* tensor_type_rules;
bool vae_decode_only;
bool free_params_immediately;
int n_threads;
enum sd_type_t wtype;
enum rng_type_t rng_type;
enum rng_type_t sampler_rng_type;
enum prediction_t prediction;
enum lora_apply_mode_t lora_apply_mode;
bool offload_params_to_cpu;
bool enable_mmap;
bool keep_clip_on_cpu;
bool keep_control_net_on_cpu;
bool keep_vae_on_cpu;
bool flash_attn;
bool diffusion_flash_attn;
bool tae_preview_only;
@ -226,6 +220,7 @@ typedef struct {
bool stream_layers; // Enable residency+prefetch streaming on top of --max-vram (no effect without --max-vram)
const char* backend;
const char* params_backend;
const char* rpc_servers;
} sd_ctx_params_t;
typedef struct {
@ -460,7 +455,6 @@ SD_API bool generate_video(sd_ctx_t* sd_ctx,
typedef struct upscaler_ctx_t upscaler_ctx_t;
SD_API upscaler_ctx_t* new_upscaler_ctx(const char* esrgan_path,
bool offload_params_to_cpu,
bool direct,
int n_threads,
int tile_size,

View File

@ -2007,6 +2007,10 @@ protected:
}
bool copy_cache_tensors_to_cache_buffer(const std::unordered_set<std::string>* cache_keep_names = nullptr) {
if (cache_tensor_map.empty() && cache_keep_names == nullptr) {
return true;
}
ggml_context* old_cache_ctx = cache_ctx;
ggml_backend_buffer_t old_cache_buffer = cache_buffer;
cache_ctx = nullptr;

View File

@ -45,6 +45,10 @@ static bool is_default_backend_token(const std::string& name) {
return lower.empty() || lower == "default" || lower == "auto";
}
static bool is_disk_backend_token(const std::string& name) {
return lower_copy(trim_copy(name)) == "disk";
}
static bool parse_backend_module(const std::string& raw_name, SDBackendModule* module) {
std::string name = lower_copy(trim_copy(raw_name));
name.erase(std::remove(name.begin(), name.end(), '-'), name.end());
@ -200,6 +204,36 @@ void ggml_ext_im_set_f32_1d(const struct ggml_tensor* tensor, int i, float value
}
}
bool add_rpc_devices(const std::string& servers) {
const std::string in = trim_copy(servers);
if (in.empty()) {
return true;
}
auto rpc_servers = split_copy(in, ',');
if (rpc_servers.empty()) {
LOG_ERROR("invalid RPC servers specification: '%s'", servers.c_str());
return false;
}
ggml_backend_reg_t rpc_reg = ggml_backend_reg_by_name("RPC");
if (!rpc_reg) {
LOG_ERROR("RPC backend not found, cannot add RPC servers");
return false;
}
typedef ggml_backend_reg_t (*ggml_backend_rpc_add_server_t)(const char* endpoint);
ggml_backend_rpc_add_server_t ggml_backend_rpc_add_server_fn = (ggml_backend_rpc_add_server_t)ggml_backend_reg_get_proc_address(rpc_reg, "ggml_backend_rpc_add_server");
if (!ggml_backend_rpc_add_server_fn) {
LOG_ERROR("RPC backend does not have ggml_backend_rpc_add_server function, cannot add RPC servers");
return false;
}
for (const auto& server : rpc_servers) {
LOG_INFO("Adding RPC server: %s", server.c_str());
auto reg = ggml_backend_rpc_add_server_fn(server.c_str());
// The call reports no explicit success status; the RPC backend is expected to log its own error if the server cannot be added.
ggml_backend_register(reg);
}
return true;
}
static void ggml_backend_load_all_once() {
// If the registry already has devices and the CPU backend is present,
// assume either static registration or explicit host-side preloading has
@ -504,6 +538,9 @@ ggml_backend_t SDBackendManager::params_backend(SDBackendModule module) {
if (name.empty()) {
return runtime_backend(module);
}
if (is_disk_backend_token(name)) {
return runtime_backend(module);
}
return init_cached_backend(name);
}
@ -515,6 +552,10 @@ bool SDBackendManager::params_backend_is_cpu(SDBackendModule module) {
return sd_backend_is_cpu(params_backend(module));
}
bool SDBackendManager::params_backend_is_disk(SDBackendModule module) const {
return is_disk_backend_token(params_assignment_.get(module));
}
bool SDBackendManager::runtime_backend_supports_host_buffer(SDBackendModule module) {
ggml_backend_t backend = runtime_backend(module);
if (backend == nullptr) {
@ -534,10 +575,6 @@ bool SDBackendManager::runtime_backend_supports_host_buffer(SDBackendModule modu
bool SDBackendManager::init(const char* backend_spec,
const char* params_backend_spec,
bool offload_params_to_cpu,
bool keep_clip_on_cpu,
bool keep_vae_on_cpu,
bool keep_control_net_on_cpu,
std::string* error) {
reset();
@ -548,30 +585,20 @@ bool SDBackendManager::init(const char* backend_spec,
return false;
}
if (runtime_assignment_.empty()) {
if (keep_clip_on_cpu) {
runtime_assignment_.set_module(SDBackendModule::TE, "cpu");
}
if (keep_vae_on_cpu) {
runtime_assignment_.set_module(SDBackendModule::VAE, "cpu");
}
if (keep_control_net_on_cpu) {
runtime_assignment_.set_module(SDBackendModule::CONTROL_NET, "cpu");
}
}
if (params_assignment_.empty() && offload_params_to_cpu) {
params_assignment_.set_default("cpu");
}
return validate(error);
}
bool SDBackendManager::validate(std::string* error) const {
auto validate_name = [&](const std::string& name) -> bool {
auto validate_runtime_name = [&](const std::string& name) -> bool {
if (is_default_backend_token(name)) {
return true;
}
if (is_disk_backend_token(name)) {
if (error != nullptr) {
*error = "backend 'disk' is only supported by params_backend";
}
return false;
}
if (!sd_resolve_backend_name(name).empty()) {
return true;
}
@ -580,18 +607,24 @@ bool SDBackendManager::validate(std::string* error) const {
}
return false;
};
auto validate_params_name = [&](const std::string& name) -> bool {
if (is_disk_backend_token(name)) {
return true;
}
return validate_runtime_name(name);
};
if (!validate_name(runtime_assignment_.default_name) ||
!validate_name(params_assignment_.default_name)) {
if (!validate_runtime_name(runtime_assignment_.default_name) ||
!validate_params_name(params_assignment_.default_name)) {
return false;
}
for (const auto& kv : runtime_assignment_.module_names) {
if (!validate_name(kv.second)) {
if (!validate_runtime_name(kv.second)) {
return false;
}
}
for (const auto& kv : params_assignment_.module_names) {
if (!validate_name(kv.second)) {
if (!validate_params_name(kv.second)) {
return false;
}
}
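
The validation split above accepts `disk` only in the parameter-backend spec. A minimal sketch of that rule follows, assuming a hypothetical `known` set in place of the real backend registry lookup. The "unknown backend" message is invented for illustration because the corresponding lines are elided in the diff.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <set>
#include <string>

enum class SpecKind { Runtime, Params };

// Trims surrounding whitespace and lowercases the name.
std::string to_lower_trimmed(const std::string& s) {
    size_t b = 0, e = s.size();
    while (b < e && std::isspace(static_cast<unsigned char>(s[b]))) ++b;
    while (e > b && std::isspace(static_cast<unsigned char>(s[e - 1]))) --e;
    std::string out = s.substr(b, e - b);
    std::transform(out.begin(), out.end(), out.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return out;
}

// Validates one backend name. `known` is a hypothetical stand-in for the
// registry of resolvable backend names.
bool validate_backend_name(const std::string& raw,
                           SpecKind kind,
                           const std::set<std::string>& known,
                           std::string* error) {
    const std::string name = to_lower_trimmed(raw);
    if (name.empty() || name == "default" || name == "auto") {
        return true;  // default tokens are always accepted
    }
    if (name == "disk") {
        if (kind == SpecKind::Params) {
            return true;  // parameters may stay on disk
        }
        if (error != nullptr) {
            *error = "backend 'disk' is only supported by params_backend";
        }
        return false;  // computation cannot run on "disk"
    }
    if (known.count(name) > 0) {
        return true;
    }
    if (error != nullptr) {
        *error = "unknown backend '" + raw + "'";  // illustrative message
    }
    return false;
}
```

Routing the params check through the runtime check, as the diff does, keeps the two rule sets from drifting apart: only the `disk` exception differs.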

View File

@ -51,10 +51,6 @@ public:
bool init(const char* backend_spec,
const char* params_backend_spec,
bool offload_params_to_cpu,
bool keep_clip_on_cpu,
bool keep_vae_on_cpu,
bool keep_control_net_on_cpu,
std::string* error);
void reset();
@ -63,6 +59,7 @@ public:
bool runtime_backend_is_cpu(SDBackendModule module);
bool params_backend_is_cpu(SDBackendModule module);
bool params_backend_is_disk(SDBackendModule module) const;
bool runtime_backend_supports_host_buffer(SDBackendModule module);
private:
@ -76,4 +73,5 @@ ggml_backend_t sd_backend_cpu_init();
bool sd_backend_cpu_set_n_threads(ggml_backend_t backend_cpu, int n_threads);
const char* sd_backend_module_name(SDBackendModule module);
void ggml_ext_im_set_f32_1d(const struct ggml_tensor* tensor, int i, float value);
bool add_rpc_devices(const std::string& servers);
#endif // __SD_CORE_GGML_EXTEND_BACKEND_H__
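
`add_rpc_devices` takes the comma-separated `host:port` list from `--rpc-servers`. The sketch below shows one way such a list could be split and sanity-checked before any server is registered. It is an assumption-laden illustration: `parse_rpc_servers` and `trim` are hypothetical helpers, and the per-entry `host:port` check is stricter than what the diff shows, which only rejects an empty split result.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Removes leading and trailing whitespace.
std::string trim(const std::string& s) {
    size_t b = 0, e = s.size();
    while (b < e && std::isspace(static_cast<unsigned char>(s[b]))) ++b;
    while (e > b && std::isspace(static_cast<unsigned char>(s[e - 1]))) --e;
    return s.substr(b, e - b);
}

// Splits "host:port,host:port" into endpoints. An empty spec is valid and
// yields no endpoints. Returns false if any entry lacks a host or a port.
bool parse_rpc_servers(const std::string& spec, std::vector<std::string>* out) {
    assert(out != nullptr);
    out->clear();
    const std::string in = trim(spec);
    if (in.empty()) {
        return true;  // nothing requested
    }
    size_t start = 0;
    while (start <= in.size()) {
        size_t comma = in.find(',', start);
        if (comma == std::string::npos) {
            comma = in.size();
        }
        const std::string entry = trim(in.substr(start, comma - start));
        const size_t colon      = entry.rfind(':');
        if (entry.empty() || colon == std::string::npos ||
            colon == 0 || colon + 1 == entry.size()) {
            out->clear();
            return false;  // malformed entry, e.g. "localhost" or a trailing comma
        }
        out->push_back(entry);
        start = comma + 1;
    }
    return true;
}
```

Validating the whole list up front means a typo in the last entry is reported before the first server has been registered.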

View File

@ -101,7 +101,7 @@ struct LoraModel : public GGMLRunner {
if (model_manager == nullptr ||
!model_manager->register_param_tensors("LoRA",
std::move(tensors),
ModelManager::ResidencyMode::Resident,
ModelManager::ResidencyMode::ParamBackend,
runtime_backend,
params_backend) ||
!model_manager->validate_registered_tensors()) {

View File

@ -622,7 +622,7 @@ struct PhotoMakerIDEmbed : public GGMLRunner {
model_loader.load_tensors(on_new_tensor_cb);
if (!model_manager->register_param_tensors("PhotoMaker ID embeds",
tensors,
ModelManager::ResidencyMode::Resident,
ModelManager::ResidencyMode::ParamBackend,
runtime_backend,
params_backend) ||
!model_manager->validate_registered_tensors()) {

View File

@ -312,16 +312,17 @@ struct ControlNet : public GGMLRunner {
ControlNetBlock control_net;
std::string weight_prefix;
ggml_backend_buffer_t control_buffer = nullptr;
ggml_context* control_ctx = nullptr;
std::vector<ggml_tensor*> control_outputs_ggml;
ggml_tensor* guided_hint_output_ggml = nullptr;
std::vector<sd::Tensor<float>> controls;
sd::Tensor<float> guided_hint;
bool guided_hint_cached = false;
std::shared_ptr<ModelManager> owned_model_manager;
ggml_backend_t params_backend = nullptr;
static const char* guided_hint_cache_name() {
return "controlnet.guided_hint";
}
ControlNet(ggml_backend_t backend,
ggml_backend_t params_backend_,
const String2TensorStorage& tensor_storage_map = {},
@ -336,44 +337,12 @@ struct ControlNet : public GGMLRunner {
free_control_ctx();
}
void alloc_control_ctx(std::vector<ggml_tensor*> outs) {
ggml_init_params params;
params.mem_size = static_cast<size_t>(outs.size() * ggml_tensor_overhead()) + 1024 * 1024;
params.mem_buffer = nullptr;
params.no_alloc = true;
control_ctx = ggml_init(params);
control_outputs_ggml.resize(outs.size() - 1);
size_t control_buffer_size = 0;
guided_hint_output_ggml = ggml_dup_tensor(control_ctx, outs[0]);
control_buffer_size += ggml_nbytes(guided_hint_output_ggml);
for (int i = 0; i < outs.size() - 1; i++) {
control_outputs_ggml[i] = ggml_dup_tensor(control_ctx, outs[i + 1]);
control_buffer_size += ggml_nbytes(control_outputs_ggml[i]);
}
control_buffer = ggml_backend_alloc_ctx_tensors(control_ctx, runtime_backend);
LOG_DEBUG("control buffer size %.2fMB", control_buffer_size * 1.f / 1024.f / 1024.f);
}
void free_control_ctx() {
if (control_buffer != nullptr) {
ggml_backend_buffer_free(control_buffer);
control_buffer = nullptr;
}
if (control_ctx != nullptr) {
ggml_free(control_ctx);
control_ctx = nullptr;
}
guided_hint_output_ggml = nullptr;
guided_hint_cached = false;
guided_hint = {};
control_outputs_ggml.clear();
controls.clear();
free_cache_ctx_and_buffer();
}
std::string get_desc() override {
@ -397,11 +366,17 @@ struct ControlNet : public GGMLRunner {
ggml_tensor* context = make_optional_input(context_tensor);
ggml_tensor* y = make_optional_input(y_tensor);
guided_hint_output_ggml = nullptr;
control_outputs_ggml.clear();
ggml_tensor* guided_hint_input = nullptr;
if (guided_hint_cached && !guided_hint.empty()) {
guided_hint_input = make_input(guided_hint);
hint = nullptr;
} else {
if (guided_hint_cached) {
guided_hint_input = get_cache_tensor_by_name(guided_hint_cache_name());
if (guided_hint_input == nullptr) {
guided_hint_cached = false;
}
}
if (guided_hint_input == nullptr) {
hint = make_input(hint_tensor);
}
@ -415,13 +390,19 @@ struct ControlNet : public GGMLRunner {
context,
y);
if (control_ctx == nullptr) {
alloc_control_ctx(outs);
if (guided_hint_input == nullptr && !outs.empty()) {
guided_hint_output_ggml = outs[0];
ggml_set_output(guided_hint_output_ggml);
cache(guided_hint_cache_name(), guided_hint_output_ggml);
ggml_build_forward_expand(gf, guided_hint_output_ggml);
}
ggml_build_forward_expand(gf, ggml_cpy(compute_ctx, outs[0], guided_hint_output_ggml));
for (int i = 0; i < outs.size() - 1; i++) {
ggml_build_forward_expand(gf, ggml_cpy(compute_ctx, outs[i + 1], control_outputs_ggml[i]));
control_outputs_ggml.reserve(outs.size() > 0 ? outs.size() - 1 : 0);
for (size_t i = 1; i < outs.size(); i++) {
ggml_tensor* control_output = outs[i];
ggml_set_output(control_output);
ggml_build_forward_expand(gf, control_output);
control_outputs_ggml.push_back(control_output);
}
return gf;
@ -441,15 +422,12 @@ struct ControlNet : public GGMLRunner {
return build_graph(x, hint, timesteps, context, y);
};
auto compute_result = GGMLRunner::compute<float>(get_graph, n_threads, false, false, false);
auto compute_result = GGMLRunner::compute<float>(get_graph, n_threads, false, false, false, true);
if (!compute_result.has_value()) {
return std::nullopt;
}
if (guided_hint_output_ggml != nullptr) {
guided_hint = restore_trailing_singleton_dims(sd::make_sd_tensor_from_ggml<float>(guided_hint_output_ggml),
4);
}
guided_hint_cached = get_cache_tensor_by_name(guided_hint_cache_name()) != nullptr;
controls.clear();
controls.reserve(control_outputs_ggml.size());
for (ggml_tensor* control : control_outputs_ggml) {
@ -457,7 +435,6 @@ struct ControlNet : public GGMLRunner {
GGML_ASSERT(!control_host.empty());
controls.push_back(std::move(control_host));
}
guided_hint_cached = true;
return controls;
}
@ -482,7 +459,7 @@ struct ControlNet : public GGMLRunner {
manager->set_n_threads(n_threads);
if (!manager->register_param_tensors("ControlNet",
std::move(tensors),
ModelManager::ResidencyMode::Resident,
ModelManager::ResidencyMode::ParamBackend,
runtime_backend,
params_backend) ||
!manager->validate_registered_tensors()) {
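
The ControlNet hunks above move the guided hint from a dedicated buffer to a named cache entry, with a fallback when the entry is missing. The lookup-with-fallback logic can be sketched in isolation as follows. `DemoRunner` is a hypothetical stand-in that uses plain vectors instead of ggml tensors, and the halving step is a placeholder for the real hint computation. Unlike `free_control_ctx` in the diff, `free_cache` here deliberately leaves the flag stale to exercise the fallback path.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for a runner with a named cache.
class DemoRunner {
public:
    int hint_computations = 0;  // counts how often the hint was recomputed

    static const char* hint_name() { return "controlnet.guided_hint"; }

    // Returns the cached hint, recomputing it when the cache entry is gone.
    const std::vector<float>& guided_hint(const std::vector<float>& raw_hint) {
        const std::vector<float>* cached = nullptr;
        if (hint_cached_) {
            cached = lookup(hint_name());
            if (cached == nullptr) {
                hint_cached_ = false;  // flag was stale; fall back to recomputing
            }
        }
        if (cached == nullptr) {
            std::vector<float> computed = raw_hint;
            for (float& v : computed) {
                v *= 0.5f;  // placeholder for the real computation
            }
            ++hint_computations;
            cache_[hint_name()] = std::move(computed);
        }
        // The flag is derived from the cache contents rather than set blindly.
        hint_cached_ = lookup(hint_name()) != nullptr;
        return cache_.at(hint_name());
    }

    // Drops all cache entries without touching the flag.
    void free_cache() { cache_.clear(); }

private:
    const std::vector<float>* lookup(const std::string& name) const {
        auto it = cache_.find(name);
        return it == cache_.end() ? nullptr : &it->second;
    }

    std::unordered_map<std::string, std::vector<float>> cache_;
    bool hint_cached_ = false;
};
```

Deriving the flag from the cache contents after each run mirrors the change from `guided_hint_cached = true` to a lookup by name in the diff.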

View File

@ -1609,7 +1609,7 @@ namespace Flux {
if (!model_manager->register_runner_params("Flux test",
*flux,
"model.diffusion_model",
ModelManager::ResidencyMode::Resident,
ModelManager::ResidencyMode::ParamBackend,
backend,
backend) ||
!model_manager->validate_registered_tensors()) {

View File

@ -2048,7 +2048,7 @@ namespace LTXV {
if (!model_manager->register_runner_params("LTXAV test",
*ltxav,
"model.diffusion_model",
ModelManager::ResidencyMode::Resident,
ModelManager::ResidencyMode::ParamBackend,
backend,
backend) ||
!model_manager->validate_registered_tensors()) {

@ -1015,7 +1015,7 @@ struct MMDiTRunner : public DiffusionModelRunner {
if (!model_manager->register_runner_params("MMDiT test",
*mmdit,
"model.diffusion_model",
ModelManager::ResidencyMode::Resident,
ModelManager::ResidencyMode::ParamBackend,
backend,
backend) ||
!model_manager->validate_registered_tensors()) {

@ -715,7 +715,7 @@ namespace Qwen {
if (!model_manager->register_runner_params("Qwen image test",
*qwen_image,
"model.diffusion_model",
ModelManager::ResidencyMode::Resident,
ModelManager::ResidencyMode::ParamBackend,
backend,
backend) ||
!model_manager->validate_registered_tensors()) {

@ -1040,7 +1040,7 @@ namespace WAN {
if (!model_manager->register_runner_params("Wan test",
*wan,
"model.diffusion_model",
ModelManager::ResidencyMode::Resident,
ModelManager::ResidencyMode::ParamBackend,
backend,
backend) ||
!model_manager->validate_registered_tensors()) {

@ -723,7 +723,7 @@ namespace ZImage {
if (!model_manager->register_runner_params("ZImage test",
*z_image,
"model.diffusion_model",
ModelManager::ResidencyMode::Resident,
ModelManager::ResidencyMode::ParamBackend,
backend,
backend) ||
!model_manager->validate_registered_tensors()) {

@ -2084,7 +2084,7 @@ namespace LLM {
if (!model_manager->register_runner_params("LLM test",
*llm,
"text_encoders.llm",
ModelManager::ResidencyMode::Resident,
ModelManager::ResidencyMode::ParamBackend,
backend,
backend) ||
!model_manager->validate_registered_tensors()) {

@ -592,7 +592,7 @@ struct T5Embedder {
if (!model_manager->register_runner_params("T5 test",
*t5,
"",
ModelManager::ResidencyMode::Resident,
ModelManager::ResidencyMode::ParamBackend,
backend,
backend) ||
!model_manager->validate_registered_tensors()) {

@ -1082,7 +1082,7 @@ namespace LTXV {
if (!model_manager->register_runner_params("LTX audio VAE test",
*ltx_audio_vae,
ModelManager::ResidencyMode::Resident,
ModelManager::ResidencyMode::ParamBackend,
backend,
backend) ||
!model_manager->validate_registered_tensors()) {

@ -1426,7 +1426,7 @@ struct LTXVideoVAE : public VAE {
const sd::Tensor<float>& z,
bool decode_graph) override {
if (!decode_graph && decode_only) {
LOG_ERROR("LTX video VAE encode requires encoder weights; create the context with vae_decode_only=false");
LOG_ERROR("LTX video VAE encode requires encoder weights");
return {};
}
sd::Tensor<float> input = z;
@ -1538,7 +1538,7 @@ struct LTXVideoVAE : public VAE {
if (!model_manager->register_runner_params("LTX VAE test",
*vae,
ModelManager::ResidencyMode::Resident,
ModelManager::ResidencyMode::ParamBackend,
backend,
backend) ||
!model_manager->validate_registered_tensors()) {

@ -1340,7 +1340,7 @@ namespace WAN {
if (!model_manager->register_runner_params("Wan VAE test",
*vae,
ModelManager::ResidencyMode::Resident,
ModelManager::ResidencyMode::ParamBackend,
backend,
backend) ||
!model_manager->validate_registered_tensors()) {

@ -1002,6 +1002,7 @@ bool ModelLoader::load_tensors(on_new_tensor_cb_t on_new_tensor_cb,
std::atomic<size_t> tensor_idx(0);
std::atomic<bool> failed(false);
std::vector<std::thread> workers;
std::mutex rpc_backend_mutex;
for (int i = 0; i < n_threads; ++i) {
workers.emplace_back([&, file_path, is_zip]() {
@ -1158,7 +1159,19 @@ bool ModelLoader::load_tensors(on_new_tensor_cb_t on_new_tensor_cb,
if (dst_tensor->buffer != nullptr && !ggml_backend_buffer_is_host(dst_tensor->buffer)) {
t0 = ggml_time_ms();
ggml_backend_tensor_set(dst_tensor, convert_buf, 0, ggml_nbytes(dst_tensor));
// RPC buffers need serialized writes: loader threads must not call ggml_backend_tensor_set on them concurrently
const char* buffer_type_name = ggml_backend_buft_name(ggml_backend_buffer_get_type(dst_tensor->buffer));
bool is_rpc_buffer = buffer_type_name != nullptr &&
std::string(buffer_type_name).find("RPC") != std::string::npos;
if (is_rpc_buffer) {
std::lock_guard<std::mutex> lock(rpc_backend_mutex);
ggml_backend_tensor_set(dst_tensor, convert_buf, 0, ggml_nbytes(dst_tensor));
} else {
ggml_backend_tensor_set(dst_tensor, convert_buf, 0, ggml_nbytes(dst_tensor));
}
t1 = ggml_time_ms();
copy_to_backend_time_ms.fetch_add(t1 - t0);
}
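
The hunk above takes a mutex only when the destination buffer's type name contains "RPC", so uploads to other backends keep running in parallel. A minimal sketch of that pattern follows, assuming a stand-in buffer type. `FakeBuffer`, `upload`, and `parallel_upload` are hypothetical names for illustration and are not part of ggml or this change; the buffer name used in the usage note is illustrative as well.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for a backend buffer; not a ggml type.
struct FakeBuffer {
    std::string type_name;  // buffer type name, e.g. something containing "RPC"
    std::vector<int> log;   // records uploaded tensor ids; not thread-safe by itself
};

// Same test as the hunk: a buffer counts as RPC if its type name contains "RPC".
inline bool is_rpc_buffer(const FakeBuffer& buf) {
    return buf.type_name.find("RPC") != std::string::npos;
}

// Hypothetical upload step: only RPC buffers pay for the lock.
inline void upload(FakeBuffer& buf, int tensor_id, std::mutex& rpc_mutex) {
    if (is_rpc_buffer(buf)) {
        std::lock_guard<std::mutex> lock(rpc_mutex);
        buf.log.push_back(tensor_id);
    } else {
        // The real code relies on the backend being safe for concurrent writes here;
        // this fake is not, so the sketch only uses this path with one thread.
        buf.log.push_back(tensor_id);
    }
}

// Workers pull tensor indices from a shared atomic counter, as the loader does.
inline std::size_t parallel_upload(FakeBuffer& buf, int n_threads, int n_tensors) {
    assert(n_threads > 0);
    std::atomic<int> next(0);
    std::mutex rpc_mutex;
    std::vector<std::thread> workers;
    for (int i = 0; i < n_threads; ++i) {
        workers.emplace_back([&]() {
            for (;;) {
                const int id = next.fetch_add(1);
                if (id >= n_tensors) {
                    break;
                }
                upload(buf, id, rpc_mutex);
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    return buf.log.size();
}
```

With a buffer named, say, `"RPC[127.0.0.1:50052]"`, four workers uploading 1000 tensors should leave exactly 1000 entries in `log`, because every write goes through the lock. Matching on the type name keeps the loader free of backend-specific headers, at the cost of depending on a naming convention.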

@ -492,7 +492,7 @@ bool ModelManager::mmap_params(const std::vector<TensorState*>& states,
}
bool ModelManager::can_mmap_storage(const TensorState& state) const {
if (!enable_mmap_ || state.residency_mode != ResidencyMode::Resident) {
if (!enable_mmap_ || state.residency_mode != ResidencyMode::ParamBackend) {
return false;
}
if (state.compute_backend == nullptr || state.params_backend == nullptr) {

@ -16,7 +16,7 @@ class ModelManager : public RunnerWeightManager {
public:
enum class ResidencyMode {
Disk,
Resident,
ParamBackend,
};
struct LoraSpec {
@ -33,7 +33,7 @@ private:
ggml_tensor* tensor = nullptr;
std::string desc;
ResidencyMode residency_mode = ResidencyMode::Resident;
ResidencyMode residency_mode = ResidencyMode::ParamBackend;
ggml_backend_t compute_backend = nullptr;
ggml_backend_t params_backend = nullptr;
bool metadata_validated = false;
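
Across this change, `ResidencyMode::Resident` becomes `ResidencyMode::ParamBackend`, and callers pick `Disk` only when the params backend for a module is the disk backend. The sketch below restates that selection and the first two guards of the mmap check in isolation. It is an illustration under stated assumptions: `TensorStateSketch`, `select_residency`, and `can_mmap_storage_sketch` are hypothetical names, the backends are reduced to opaque pointers, and the real `can_mmap_storage` continues with checks that this diff does not show.

```cpp
#include <cassert>

// Mirror of the enum after the rename shown in the diff.
enum class ResidencyMode {
    Disk,
    ParamBackend,
};

// Hypothetical reduction of a tensor's state to the fields the guards read.
struct TensorStateSketch {
    ResidencyMode residency_mode = ResidencyMode::ParamBackend;
    const void* compute_backend  = nullptr;  // opaque stand-in for a backend handle
    const void* params_backend   = nullptr;  // opaque stand-in for a backend handle
};

// Selection used at the registration call sites: Disk only for a disk params backend.
inline ResidencyMode select_residency(bool params_backend_is_disk) {
    return params_backend_is_disk ? ResidencyMode::Disk : ResidencyMode::ParamBackend;
}

// First two guards of the mmap check; later checks in the real function are omitted.
inline bool can_mmap_storage_sketch(bool enable_mmap, const TensorStateSketch& state) {
    if (!enable_mmap || state.residency_mode != ResidencyMode::ParamBackend) {
        return false;
    }
    if (state.compute_backend == nullptr || state.params_backend == nullptr) {
        return false;
    }
    return true;
}
```

Under these assumptions a tensor registered as `Disk` is never mmap-eligible, even with mmap enabled and both backends set.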

@ -163,9 +163,7 @@ public:
SDBackendManager backend_manager;
SDVersion version;
bool vae_decode_only = false;
bool external_vae_is_invalid = false;
bool free_params_immediately = false;
bool circular_x = false;
bool circular_y = false;
@ -189,7 +187,6 @@ public:
std::string taesd_path;
sd_tiling_params_t vae_tiling_params = {false, false, 0, 0, 0.5f, 0, 0, nullptr};
bool offload_params_to_cpu = false;
bool enable_mmap = false;
float max_vram = 0.f;
bool stream_layers = false;
@ -246,20 +243,16 @@ public:
}
return model_manager->register_param_tensors(desc,
std::move(group_tensors),
free_params_immediately ? ModelManager::ResidencyMode::Disk : ModelManager::ResidencyMode::Resident,
backend_manager.params_backend_is_disk(module) ? ModelManager::ResidencyMode::Disk : ModelManager::ResidencyMode::ParamBackend,
backend_for(module),
params_backend_for(module),
params_mem_size);
}
bool init_backend(const sd_ctx_params_t* sd_ctx_params) {
bool init_backend() {
std::string error;
if (!backend_manager.init(sd_ctx_params->backend,
sd_ctx_params->params_backend,
offload_params_to_cpu,
sd_ctx_params->keep_clip_on_cpu,
sd_ctx_params->keep_vae_on_cpu,
sd_ctx_params->keep_control_net_on_cpu,
if (!backend_manager.init(backend_spec.c_str(),
params_backend_spec.c_str(),
&error)) {
LOG_ERROR("backend config failed: %s", error.c_str());
return false;
@ -319,24 +312,20 @@ public:
}
bool init(const sd_ctx_params_t* sd_ctx_params) {
n_threads = sd_ctx_params->n_threads;
vae_decode_only = sd_ctx_params->vae_decode_only;
free_params_immediately = sd_ctx_params->free_params_immediately;
offload_params_to_cpu = sd_ctx_params->offload_params_to_cpu;
enable_mmap = sd_ctx_params->enable_mmap;
max_vram = sd_ctx_params->max_vram;
stream_layers = sd_ctx_params->stream_layers;
backend_spec = SAFE_STR(sd_ctx_params->backend);
params_backend_spec = SAFE_STR(sd_ctx_params->params_backend);
n_threads = sd_ctx_params->n_threads;
enable_mmap = sd_ctx_params->enable_mmap;
max_vram = sd_ctx_params->max_vram;
stream_layers = sd_ctx_params->stream_layers;
backend_spec = SAFE_STR(sd_ctx_params->backend);
params_backend_spec = SAFE_STR(sd_ctx_params->params_backend);
std::string rpc_servers_spec = SAFE_STR(sd_ctx_params->rpc_servers);
add_rpc_devices(rpc_servers_spec);
if (stream_layers && max_vram == 0.f) {
LOG_WARN("--stream-layers has no effect without --max-vram set; ignoring");
stream_layers = false;
}
if (stream_layers && !offload_params_to_cpu && params_backend_spec.empty()) {
// Streaming needs CPU-resident params.
LOG_WARN("--stream-layers has no effect without --offload-to-cpu (or --params-backend); ignoring");
stream_layers = false;
}
bool use_tae = false;
bool use_audio_vae = false;
@ -351,9 +340,13 @@ public:
ggml_log_set(ggml_log_callback_default, nullptr);
if (!init_backend(sd_ctx_params)) {
if (!init_backend()) {
return false;
}
if (stream_layers && !backend_manager.params_backend_is_cpu(SDBackendModule::DIFFUSION)) {
LOG_WARN("--stream-layers has no effect unless diffusion params backend is cpu; ignoring");
stream_layers = false;
}
max_vram = sd::ggml_graph_cut::resolve_max_vram_gib(max_vram, backend_for(SDBackendModule::DIFFUSION));
model_manager = std::make_shared<ModelManager>();
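
After this change, `stream_layers` survives only if `max_vram` is set and the diffusion params backend turns out to be CPU; the second condition is checked after `init_backend()` because it needs the resolved backends. A compact way to read the two checks together is sketched below. `StreamLayersDecision` and `resolve_stream_layers` are hypothetical names introduced for illustration, and the real code performs the checks at two separate points in `init` rather than in one function.

```cpp
#include <cassert>

// Hypothetical result type: the final flag plus a short reason for logging.
struct StreamLayersDecision {
    bool enabled;
    const char* reason;
};

// Combines the two gates from the diff into one pure function.
inline StreamLayersDecision resolve_stream_layers(bool requested,
                                                  float max_vram,
                                                  bool diffusion_params_on_cpu) {
    if (!requested) {
        return {false, "not requested"};
    }
    if (max_vram == 0.f) {
        // First gate: checked before the backends are initialized.
        return {false, "max_vram is not set"};
    }
    if (!diffusion_params_on_cpu) {
        // Second gate: checked after backend initialization.
        return {false, "diffusion params backend is not cpu"};
    }
    return {true, "ok"};
}
```

Keeping the decision in a pure function like this makes both warning paths easy to unit-test without constructing a backend.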
@ -537,8 +530,8 @@ public:
}
}
// Avoid full-model LoRA merge buffers on constrained setups.
const bool streaming_constrained = stream_layers ||
sd_ctx_params->offload_params_to_cpu;
const bool params_offloaded = params_backend_for(SDBackendModule::DIFFUSION) != backend_for(SDBackendModule::DIFFUSION);
const bool streaming_constrained = stream_layers || params_offloaded;
if (have_quantized_weight || streaming_constrained) {
apply_lora_immediately = false;
} else {
@ -561,10 +554,6 @@ public:
size_t control_net_params_mem_size = 0;
size_t extension_params_mem_size = 0;
if (sd_version_is_control(version)) {
// Might need vae encode for control cond
vae_decode_only = false;
}
bool tae_preview_only = sd_ctx_params->tae_preview_only;
if (version == VERSION_SDXS_512_DS || version == VERSION_SDXS_09) {
tae_preview_only = false;
@ -592,7 +581,6 @@ public:
"model.diffusion_model",
model_manager);
} else if (sd_version_is_pid(version)) {
vae_decode_only = false;
cond_stage_model = std::make_shared<LLMEmbedder>(backend_for(SDBackendModule::TE),
tensor_storage_map,
version,
@ -707,15 +695,11 @@ public:
}
}
} else if (sd_version_is_qwen_image(version)) {
bool enable_vision = false;
if (!vae_decode_only) {
enable_vision = true;
}
cond_stage_model = std::make_shared<LLMEmbedder>(backend_for(SDBackendModule::TE),
tensor_storage_map,
version,
"",
enable_vision,
true,
model_manager);
diffusion_model = std::make_shared<Qwen::QwenImageRunner>(backend_for(SDBackendModule::DIFFUSION),
tensor_storage_map,
@ -724,15 +708,11 @@ public:
sd_ctx_params->qwen_image_zero_cond_t,
model_manager);
} else if (sd_version_is_longcat(version)) {
bool enable_vision = false;
if (!vae_decode_only) {
enable_vision = true;
}
cond_stage_model = std::make_shared<LLMEmbedder>(backend_for(SDBackendModule::TE),
tensor_storage_map,
version,
"",
enable_vision,
true,
model_manager);
diffusion_model = std::make_shared<Flux::FluxRunner>(backend_for(SDBackendModule::DIFFUSION),
tensor_storage_map,
@ -828,10 +808,6 @@ public:
return false;
}
if (sd_version_is_unet_edit(version)) {
vae_decode_only = false;
}
if (high_noise_diffusion_model) {
high_noise_diffusion_model->set_max_graph_vram_bytes(max_graph_vram_bytes);
high_noise_diffusion_model->set_stream_layers_enabled(stream_layers);
@ -847,7 +823,7 @@ public:
return false;
}
auto create_tae = [&]() -> std::shared_ptr<VAE> {
auto create_tae = [&](bool decode_only) -> std::shared_ptr<VAE> {
if (sd_version_is_wan(version) ||
sd_version_is_qwen_image(version) ||
sd_version_is_anima(version) ||
@ -855,7 +831,7 @@ public:
return std::make_shared<TinyVideoAutoEncoder>(backend_for(SDBackendModule::VAE),
tensor_storage_map,
"decoder",
vae_decode_only,
decode_only,
version,
model_manager);
@ -863,7 +839,7 @@ public:
auto model = std::make_shared<TinyImageAutoEncoder>(backend_for(SDBackendModule::VAE),
tensor_storage_map,
"decoder.layers",
vae_decode_only,
decode_only,
version,
model_manager);
return model;
@ -885,7 +861,7 @@ public:
return std::make_shared<LTXVideoVAE>(backend_for(SDBackendModule::VAE),
tensor_storage_map,
"first_stage_model",
vae_decode_only,
false,
version,
model_manager);
} else if (sd_version_is_wan(version) ||
@ -894,14 +870,14 @@ public:
return std::make_shared<WAN::WanVAERunner>(backend_for(SDBackendModule::VAE),
tensor_storage_map,
"first_stage_model",
vae_decode_only,
false,
version,
model_manager);
} else {
auto model = std::make_shared<AutoEncoderKL>(backend_for(SDBackendModule::VAE),
tensor_storage_map,
"first_stage_model",
vae_decode_only,
false,
false,
vae_version,
model_manager);
@ -931,7 +907,7 @@ public:
}
} else if (use_tae && !tae_preview_only) {
LOG_INFO("using TAE for encoding / decoding");
first_stage_model = create_tae();
first_stage_model = create_tae(false);
first_stage_model->set_max_graph_vram_bytes(max_graph_vram_bytes);
if (!register_runner_params("VAE",
first_stage_model,
@ -951,7 +927,7 @@ public:
}
if (use_tae && tae_preview_only) {
LOG_INFO("using TAE for preview");
preview_vae = create_tae();
preview_vae = create_tae(true);
preview_vae->set_max_graph_vram_bytes(max_graph_vram_bytes);
if (!register_runner_params("preview VAE",
preview_vae,
@ -1081,13 +1057,6 @@ public:
ignore_tensors.insert("model.diffusion_model.__32x32__");
ignore_tensors.insert("model.diffusion_model.__index_timestep_zero__");
if (vae_decode_only) {
ignore_tensors.insert("first_stage_model.encoder");
ignore_tensors.insert("first_stage_model.conv1");
ignore_tensors.insert("first_stage_model.quant");
ignore_tensors.insert("tae.encoder");
ignore_tensors.insert("text_encoders.llm.visual.");
}
if (audio_vae_model) {
ignore_tensors.insert("audio_vae.encoder");
}
@ -2642,31 +2611,25 @@ void sd_hires_params_init(sd_hires_params_t* hires_params) {
}
void sd_ctx_params_init(sd_ctx_params_t* sd_ctx_params) {
*sd_ctx_params = {};
sd_ctx_params->vae_decode_only = true;
sd_ctx_params->free_params_immediately = true;
sd_ctx_params->n_threads = sd_get_num_physical_cores();
sd_ctx_params->wtype = SD_TYPE_COUNT;
sd_ctx_params->rng_type = CUDA_RNG;
sd_ctx_params->sampler_rng_type = RNG_TYPE_COUNT;
sd_ctx_params->prediction = PREDICTION_COUNT;
sd_ctx_params->lora_apply_mode = LORA_APPLY_AUTO;
sd_ctx_params->offload_params_to_cpu = false;
sd_ctx_params->max_vram = 0.f;
sd_ctx_params->stream_layers = false;
sd_ctx_params->enable_mmap = false;
sd_ctx_params->keep_clip_on_cpu = false;
sd_ctx_params->keep_control_net_on_cpu = false;
sd_ctx_params->keep_vae_on_cpu = false;
sd_ctx_params->diffusion_flash_attn = false;
sd_ctx_params->circular_x = false;
sd_ctx_params->circular_y = false;
sd_ctx_params->chroma_use_dit_mask = true;
sd_ctx_params->chroma_use_t5_mask = false;
sd_ctx_params->chroma_t5_mask_pad = 1;
sd_ctx_params->vae_format = SD_VAE_FORMAT_AUTO;
sd_ctx_params->backend = nullptr;
sd_ctx_params->params_backend = nullptr;
*sd_ctx_params = {};
sd_ctx_params->n_threads = sd_get_num_physical_cores();
sd_ctx_params->wtype = SD_TYPE_COUNT;
sd_ctx_params->rng_type = CUDA_RNG;
sd_ctx_params->sampler_rng_type = RNG_TYPE_COUNT;
sd_ctx_params->prediction = PREDICTION_COUNT;
sd_ctx_params->lora_apply_mode = LORA_APPLY_AUTO;
sd_ctx_params->max_vram = 0.f;
sd_ctx_params->stream_layers = false;
sd_ctx_params->enable_mmap = false;
sd_ctx_params->diffusion_flash_attn = false;
sd_ctx_params->circular_x = false;
sd_ctx_params->circular_y = false;
sd_ctx_params->chroma_use_dit_mask = true;
sd_ctx_params->chroma_use_t5_mask = false;
sd_ctx_params->chroma_t5_mask_pad = 1;
sd_ctx_params->vae_format = SD_VAE_FORMAT_AUTO;
sd_ctx_params->backend = nullptr;
sd_ctx_params->params_backend = nullptr;
}
char* sd_ctx_params_to_str(const sd_ctx_params_t* sd_ctx_params) {
@ -2693,21 +2656,15 @@ char* sd_ctx_params_to_str(const sd_ctx_params_t* sd_ctx_params) {
"control_net_path: %s\n"
"photo_maker_path: %s\n"
"tensor_type_rules: %s\n"
"vae_decode_only: %s\n"
"free_params_immediately: %s\n"
"n_threads: %d\n"
"wtype: %s\n"
"rng_type: %s\n"
"sampler_rng_type: %s\n"
"prediction: %s\n"
"offload_params_to_cpu: %s\n"
"max_vram: %.3f\n"
"stream_layers: %s\n"
"backend: %s\n"
"params_backend: %s\n"
"keep_clip_on_cpu: %s\n"
"keep_control_net_on_cpu: %s\n"
"keep_vae_on_cpu: %s\n"
"flash_attn: %s\n"
"diffusion_flash_attn: %s\n"
"circular_x: %s\n"
@ -2733,21 +2690,15 @@ char* sd_ctx_params_to_str(const sd_ctx_params_t* sd_ctx_params) {
SAFE_STR(sd_ctx_params->control_net_path),
SAFE_STR(sd_ctx_params->photo_maker_path),
SAFE_STR(sd_ctx_params->tensor_type_rules),
BOOL_STR(sd_ctx_params->vae_decode_only),
BOOL_STR(sd_ctx_params->free_params_immediately),
sd_ctx_params->n_threads,
sd_type_name(sd_ctx_params->wtype),
sd_rng_type_name(sd_ctx_params->rng_type),
sd_rng_type_name(sd_ctx_params->sampler_rng_type),
sd_prediction_name(sd_ctx_params->prediction),
BOOL_STR(sd_ctx_params->offload_params_to_cpu),
sd_ctx_params->max_vram,
BOOL_STR(sd_ctx_params->stream_layers),
SAFE_STR(sd_ctx_params->backend),
SAFE_STR(sd_ctx_params->params_backend),
BOOL_STR(sd_ctx_params->keep_clip_on_cpu),
BOOL_STR(sd_ctx_params->keep_control_net_on_cpu),
BOOL_STR(sd_ctx_params->keep_vae_on_cpu),
BOOL_STR(sd_ctx_params->flash_attn),
BOOL_STR(sd_ctx_params->diffusion_flash_attn),
BOOL_STR(sd_ctx_params->circular_x),
@ -3917,7 +3868,7 @@ static std::optional<ImageGenerationLatents> prepare_image_generation_latents(sd
}
}
if (!control_image_tensor.empty() && !sd_ctx->sd->vae_decode_only) {
if (!control_image_tensor.empty()) {
control_latent = sd_ctx->sd->encode_first_stage(control_image_tensor);
if (control_latent.empty()) {
LOG_ERROR("failed to encode control image");
@ -4259,11 +4210,6 @@ static sd::Tensor<float> upscale_hires_latent(sd_ctx_t* sd_ctx,
} else if (request.hires.upscaler == SD_HIRES_UPSCALER_MODEL ||
request.hires.upscaler == SD_HIRES_UPSCALER_LANCZOS ||
request.hires.upscaler == SD_HIRES_UPSCALER_NEAREST) {
if (sd_ctx->sd->vae_decode_only) {
LOG_ERROR("hires %s upscaler requires VAE encoder weights; create the context with vae_decode_only=false",
sd_hires_upscaler_name(request.hires.upscaler));
return {};
}
if (request.hires.upscaler == SD_HIRES_UPSCALER_MODEL && upscaler == nullptr) {
LOG_ERROR("hires model upscaler context is null");
return {};
@ -4474,7 +4420,6 @@ SD_API sd_image_t* generate_image(sd_ctx_t* sd_ctx, const sd_img_gen_params_t* s
const size_t max_graph_vram_bytes = sd::ggml_graph_cut::max_vram_gib_to_bytes(sd_ctx->sd->max_vram);
hires_upscaler->set_max_graph_vram_bytes(max_graph_vram_bytes);
if (!hires_upscaler->load_from_file(request.hires.model_path,
sd_ctx->sd->offload_params_to_cpu,
sd_ctx->sd->n_threads)) {
LOG_ERROR("load hires model upscaler failed");
return nullptr;
@ -4611,11 +4556,6 @@ static std::optional<ImageGenerationLatents> prepare_video_generation_latents(sd
}
if (!start_image.empty() || !end_image.empty()) {
if (sd_ctx->sd->vae_decode_only) {
LOG_ERROR("LTXAV image conditioning requires VAE encoder weights; create the context with vae_decode_only=false");
return std::nullopt;
}
if (!start_image.empty() && !end_image.empty()) {
LOG_INFO("FLF2V");
} else if (!start_image.empty()) {
@ -5037,7 +4977,7 @@ static sd::Tensor<float> upscale_ltx_spatial_video_latent(sd_ctx_t* sd_ctx,
upsampler->get_param_tensors(tensors);
if (!upsampler_manager->register_param_tensors("LTX latent upsampler",
std::move(tensors),
ModelManager::ResidencyMode::Resident,
ModelManager::ResidencyMode::ParamBackend,
sd_ctx->sd->backend_for(SDBackendModule::UPSCALER),
sd_ctx->sd->params_backend_for(SDBackendModule::UPSCALER)) ||
!upsampler_manager->validate_registered_tensors()) {
@ -5080,11 +5020,6 @@ static bool apply_ltxv_refine_image_conditioning(sd_ctx_t* sd_ctx,
sd_vid_gen_params->end_image.data == nullptr) {
return true;
}
if (sd_ctx->sd->vae_decode_only) {
LOG_ERROR("LTXV refine image conditioning requires VAE encoder weights; create the context with vae_decode_only=false");
return false;
}
constexpr float conditioning_strength = 1.f;
int latent_channels = sd_ctx->sd->get_latent_channel();
sd::Tensor<float> video_latent = *latent;

@ -39,17 +39,12 @@ void UpscalerGGML::set_stream_layers_enabled(bool enabled) {
}
bool UpscalerGGML::load_from_file(const std::string& esrgan_path,
bool offload_params_to_cpu,
int n_threads) {
ggml_log_set(ggml_log_callback_default, nullptr);
std::string error;
if (!backend_manager.init(backend_spec.c_str(),
params_backend_spec.c_str(),
offload_params_to_cpu,
false,
false,
false,
&error)) {
LOG_ERROR("upscaler backend config failed: %s", error.c_str());
return false;
@ -106,7 +101,7 @@ bool UpscalerGGML::load_from_file(const std::string& esrgan_path,
esrgan_upscaler->get_param_tensors(tensors);
if (!model_manager->register_param_tensors("ESRGAN",
std::move(tensors),
ModelManager::ResidencyMode::Resident,
backend_manager.params_backend_is_disk(SDBackendModule::UPSCALER) ? ModelManager::ResidencyMode::Disk : ModelManager::ResidencyMode::ParamBackend,
backend_for(SDBackendModule::UPSCALER),
params_backend_for(SDBackendModule::UPSCALER)) ||
!model_manager->validate_registered_tensors()) {
@ -178,7 +173,6 @@ struct upscaler_ctx_t {
};
upscaler_ctx_t* new_upscaler_ctx(const char* esrgan_path_c_str,
bool offload_params_to_cpu,
bool direct,
int n_threads,
int tile_size,
@ -195,7 +189,7 @@ upscaler_ctx_t* new_upscaler_ctx(const char* esrgan_path_c_str,
return nullptr;
}
if (!upscaler_ctx->upscaler->load_from_file(esrgan_path, offload_params_to_cpu, n_threads)) {
if (!upscaler_ctx->upscaler->load_from_file(esrgan_path, n_threads)) {
delete upscaler_ctx->upscaler;
upscaler_ctx->upscaler = nullptr;
free(upscaler_ctx);

@ -32,7 +32,6 @@ struct UpscalerGGML {
~UpscalerGGML();
bool load_from_file(const std::string& esrgan_path,
bool offload_params_to_cpu,
int n_threads);
void set_max_graph_vram_bytes(size_t max_vram_bytes);
void set_stream_layers_enabled(bool enabled);