mirror of
https://github.com/leejet/stable-diffusion.cpp.git
synced 2026-06-17 11:46:38 +00:00
59 lines
2.2 KiB
Markdown
59 lines
2.2 KiB
Markdown
## Use Flash Attention to save memory and improve speed.
|
|
|
|
Enabling flash attention for the diffusion model reduces memory usage by varying amounts of MB.
|
|
eg.:
|
|
- flux 768x768 ~600mb
|
|
- SD2 768x768 ~1400mb
|
|
|
|
For most backends, it slows things down, but for cuda it generally speeds it up too.
|
|
At the moment, it is only supported for some models and some backends (like cpu, cuda/rocm, metal).
|
|
|
|
Run by adding `--diffusion-fa` to the arguments and watch for:
|
|
```
|
|
[INFO ] stable-diffusion.cpp:312 - Using flash attention in the diffusion model
|
|
```
|
|
and the compute buffer shrink in the debug log:
|
|
```
|
|
[DEBUG] ggml_extend.hpp:1004 - flux compute buffer size: 650.00 MB(VRAM)
|
|
```
|
|
|
|
## Offload weights to the CPU to save VRAM without reducing generation speed.
|
|
|
|
Using `--offload-to-cpu` allows you to offload weights to the CPU, saving VRAM without reducing generation speed.
|
|
|
|
## Use params backend to reduce VRAM or RAM usage.
|
|
|
|
`--params-backend` controls where model parameters are kept. If it is not set, parameters use the same backend as `--backend`, so a GPU runtime backend also keeps parameters in VRAM.
|
|
|
|
Use CPU params to reduce VRAM usage:
|
|
|
|
```shell
|
|
--backend cuda0 --params-backend cpu
|
|
```
|
|
|
|
This keeps model weights in system RAM and moves them to the runtime backend when needed. In the example CLI/server, `--offload-to-cpu` is a compatibility shortcut that prepends `*=cpu` to `--params-backend` before creating the context, so explicit module assignments can still override it:
|
|
|
|
```shell
|
|
--offload-to-cpu --params-backend te=disk
|
|
```
|
|
|
|
Use disk params to reduce both VRAM and RAM usage:
|
|
|
|
```shell
|
|
--backend cuda0 --params-backend disk
|
|
```
|
|
|
|
This reloads parameters from the model file on demand and releases them after use. It has the lowest memory residency, but can be slower because weights must be read again. `disk` is never selected implicitly; set it explicitly when RAM usage matters more than reload cost.
|
|
|
|
Per-module assignments can target only the largest modules:
|
|
|
|
```shell
|
|
--backend cuda0 --params-backend diffusion=disk,te=cpu,vae=cpu
|
|
```
|
|
|
|
See [backend selection](./backend.md) for full syntax.
|
|
|
|
## Use quantization to reduce memory usage.
|
|
|
|
[quantization](./quantization_and_gguf.md)
|