feat: add wan2.1/2.2 support (#778)

* add wan vae support
* add wan model support
* add umt5 support
* add wan2.1 t2i support
* make flash attn work with wan
* make wan a little faster
* add wan2.1 t2v support
* add wan gguf support
* add offload params to cpu support
* add wan2.1 i2v support
* crop image before resize
* set default fps to 16
* add diff lora support
* fix wan2.1 i2v
* introduce sd_sample_params_t
* add wan2.2 t2v support
* add wan2.2 14B i2v support
* add wan2.2 ti2v support
* add high noise lora support
* sync: update ggml submodule url
* avoid build failure on linux
* avoid build failure
* update ggml
* update ggml
* fix sd_version_is_wan
* update ggml, fix cpu im2col_3d
* fix ggml_nn_attention_ext mask
* add cache support to ggml runner
* fix the issue of illegal memory access
* unify image loading processing
* add wan2.1/2.2 FLF2V support
* fix end_image mask
* update to latest ggml
* add GGUFReader
* update docs
2026-06-23 22:56:42 +00:00 · 2025-09-06 18:08:03 +08:00 · 2025-09-06 18:08:03 +08:00 · cb1d975e96
commit cb1d975e96
parent 2eb3845df5
46 changed files with 768088 additions and 1427 deletions
--- a/.gitmodules
+++ b/.gitmodules
@ -1,3 +1,3 @@
 [submodule "ggml"]
    path = ggml
-	url = https://github.com/ggerganov/ggml.git
+	url = https://github.com/ggml-org/ggml.git
--- a/README.md
+++ b/README.md
@ -4,19 +4,33 @@
 # stable-diffusion.cpp
-Inference of Stable Diffusion and Flux in pure C/C++
+Diffusion model(SD,Flux,Wan,...) inference in pure C/C++
 ***Note that this project is under active development. \
 API and command-line parameters may change frequently.***
 ## Features
 - Plain C/C++ implementation based on [ggml](https://github.com/ggerganov/ggml), working in the same way as [llama.cpp](https://github.com/ggerganov/llama.cpp)
 - Super lightweight and without external dependencies
-- SD1.x, SD2.x, SDXL and [SD3/SD3.5](./docs/sd3.md) support
-    - !!!The VAE in SDXL encounters NaN issues under FP16, but unfortunately, the ggml_conv_2d only operates under FP16. Hence, a parameter is needed to specify the VAE that has fixed the FP16 NaN issue. You can find it here: [SDXL VAE FP16 Fix](https://huggingface.co/madebyollin/sdxl-vae-fp16-fix/blob/main/sdxl_vae.safetensors).
-- [Flux-dev/Flux-schnell Support](./docs/flux.md)
-- [FLUX.1-Kontext-dev](./docs/kontext.md)
-- [Chroma](./docs/chroma.md)
-- [SD-Turbo](https://huggingface.co/stabilityai/sd-turbo) and [SDXL-Turbo](https://huggingface.co/stabilityai/sdxl-turbo) support
-- [PhotoMaker](https://github.com/TencentARC/PhotoMaker) support.
+- Supported models
+  - Image Models
+    - SD1.x, SD2.x, [SD-Turbo](https://huggingface.co/stabilityai/sd-turbo)
+    - SDXL, [SDXL-Turbo](https://huggingface.co/stabilityai/sdxl-turbo)
+      - !!!The VAE in SDXL encounters NaN issues under FP16, but unfortunately, the ggml_conv_2d only operates under FP16. Hence, a parameter is needed to specify the VAE that has fixed the FP16 NaN issue. You can find it here: [SDXL VAE FP16 Fix](https://huggingface.co/madebyollin/sdxl-vae-fp16-fix/blob/main/sdxl_vae.safetensors).
+    - [SD3/SD3.5](./docs/sd3.md)
+    - [Flux-dev/Flux-schnell](./docs/flux.md)
+    - [Chroma](./docs/chroma.md)
+  - Image Edit Models
+    - [FLUX.1-Kontext-dev](./docs/kontext.md)
+  - Video Models
+    - [Wan2.1/Wan2.2](./docs/wan.md)
+  - [PhotoMaker](https://github.com/TencentARC/PhotoMaker) support.
+  - Control Net support with SD 1.5
+  - LoRA support, same as [stable-diffusion-webui](https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Features#lora)
+  - Latent Consistency Models support (LCM/LCM-LoRA)
+  - Faster and memory efficient latent decoding with [TAESD](https://github.com/madebyollin/taesd)
+  - Upscale images generated with [ESRGAN](https://github.com/xinntao/Real-ESRGAN)
 - 16-bit, 32-bit float support
 - 2-bit, 3-bit, 4-bit, 5-bit and 8-bit integer quantization support
 - Accelerated memory-efficient CPU inference
@ -26,15 +40,9 @@ Inference of Stable Diffusion and Flux in pure C/C++
 - Can load ckpt, safetensors and diffusers models/checkpoints. Standalone VAEs models
    - No need to convert to `.ggml` or `.gguf` anymore!
 - Flash Attention for memory usage optimization
 - Original `txt2img` and `img2img` mode
 - Negative prompt
 - [stable-diffusion-webui](https://github.com/AUTOMATIC1111/stable-diffusion-webui) style tokenizer (not all the features, only token weighting for now)
-- LoRA support, same as [stable-diffusion-webui](https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Features#lora)
-- Latent Consistency Models support (LCM/LCM-LoRA)
-- Faster and memory efficient latent decoding with [TAESD](https://github.com/madebyollin/taesd)
-- Upscale images generated with [ESRGAN](https://github.com/xinntao/Real-ESRGAN)
 - VAE tiling processing for reduce memory usage
-- Control Net support with SD 1.5
 - Sampling method
    - `Euler A`
    - `Euler`
@ -287,8 +295,10 @@ arguments:
                                     If threads <= 0, then threads will be set to the number of CPU physical cores
  -m, --model [MODEL]                path to full model
  --diffusion-model                  path to the standalone diffusion model
+  --high-noise-diffusion-model       path to the standalone high noise diffusion model
  --clip_l                           path to the clip-l text encoder
  --clip_g                           path to the clip-g text encoder
+  --clip_vision                      path to the clip-vision encoder
  --t5xxl                            path to the t5xxl text encoder
  --vae [VAE]                        path to vae
  --taesd [TAESD_PATH]               path to taesd. Using Tiny AutoEncoder for fast decoding (low quality)
@ -303,8 +313,9 @@ arguments:
                                     If not specified, the default is the type of the weight file
  --tensor-type-rules [EXPRESSION]   weight type per tensor pattern (example: "^vae\.=f16,model\.=q8_0")
  --lora-model-dir [DIR]             lora model directory
-  -i, --init-img [IMAGE]             path to the input image, required by img2img
+  -i, --init-img [IMAGE]             path to the init image, required by img2img
  --mask [MASK]                      path to the mask image, required by img2img with mask
+  --end-img [IMAGE]                  path to the end image, required by flf2v
  --control-image [IMAGE]            path to image condition, control net
  -r, --ref-image [PATH]             reference image for Flux Kontext models (can be used multiple times)
  -o, --output OUTPUT                path to write result image to (default: ./output.png)
@ -319,6 +330,23 @@ arguments:
  --skip-layers LAYERS               Layers to skip for SLG steps: (default: [7,8,9])
  --skip-layer-start START           SLG enabling point: (default: 0.01)
  --skip-layer-end END               SLG disabling point: (default: 0.2)
+  --scheduler {discrete, karras, exponential, ays, gits} Denoiser sigma scheduler (default: discrete)
+  --sampling-method {euler, euler_a, heun, dpm2, dpm++2s_a, dpm++2m, dpm++2mv2, ipndm, ipndm_v, lcm, ddim_trailing, tcd}
+                                     sampling method (default: "euler_a")
+  --steps  STEPS                     number of sample steps (default: 20)
+  --high-noise-cfg-scale SCALE       (high noise) unconditional guidance scale: (default: 7.0)
+  --high-noise-img-cfg-scale SCALE   (high noise) image guidance scale for inpaint or instruct-pix2pix models: (default: same as --cfg-scale)
+  --high-noise-guidance SCALE        (high noise) distilled guidance scale for models with guidance input (default: 3.5)
+  --high-noise-slg-scale SCALE       (high noise) skip layer guidance (SLG) scale, only for DiT models: (default: 0)
+                                     0 means disabled, a value of 2.5 is nice for sd3.5 medium
+  --high-noise-eta SCALE             (high noise) eta in DDIM, only for DDIM and TCD: (default: 0)
+  --high-noise-skip-layers LAYERS    (high noise) Layers to skip for SLG steps: (default: [7,8,9])
+  --high-noise-skip-layer-start START (high noise) SLG enabling point: (default: 0.01)
+  --high-noise-skip-layer-end END    (high noise) SLG disabling point: (default: 0.2)
+  --high-noise-scheduler {discrete, karras, exponential, ays, gits} Denoiser sigma scheduler (default: discrete)
+  --high-noise-sampling-method {euler, euler_a, heun, dpm2, dpm++2s_a, dpm++2m, dpm++2mv2, ipndm, ipndm_v, lcm, ddim_trailing, tcd}
+                                     (high noise) sampling method (default: "euler_a")
+  --high-noise-steps  STEPS          (high noise) number of sample steps (default: 20)
                                     SLG will be enabled at step int([STEPS]*[START]) and disabled at int([STEPS]*[END])
  --strength STRENGTH                strength for noising/unnoising (default: 0.75)
  --style-ratio STYLE-RATIO          strength for keeping input identity (default: 20)
@ -326,14 +354,10 @@ arguments:
                                     1.0 corresponds to full destruction of information in init image
  -H, --height H                     image height, in pixel space (default: 512)
  -W, --width W                      image width, in pixel space (default: 512)
-  --sampling-method {euler, euler_a, heun, dpm2, dpm++2s_a, dpm++2m, dpm++2mv2, ipndm, ipndm_v, lcm, ddim_trailing, tcd}
-                                     sampling method (default: "euler_a")
-  --steps  STEPS                     number of sample steps (default: 20)
  --rng {std_default, cuda}          RNG (default: cuda)
  -s SEED, --seed SEED               RNG seed (default: 42, use random seed for < 0)
  -b, --batch-count COUNT            number of images to generate
-  --schedule {discrete, karras, exponential, ays, gits} Denoiser sigma schedule (default: discrete)
  --clip-skip N                      ignore last layers of CLIP network; 1 ignores none, 2 ignores one layer (default: -1)
                                     <= 0 represents unspecified, will be 1 for SD1.x, 2 for SD2.x
  --vae-tiling                       process vae in tiles to reduce memory usage
  --vae-on-cpu                       keep vae in cpu (for low vram)
@ -351,6 +375,8 @@ arguments:
  --chroma-disable-dit-mask          disable dit mask for chroma
  --chroma-enable-t5-mask            enable t5 mask for chroma
  --chroma-t5-mask-pad  PAD_SIZE     t5 mask pad size of chroma
+  --video-frames                     video frames (default: 1)
+  --fps                              fps (default: 24)
  -v, --verbose                      print extra info
 ```
@ -438,3 +464,5 @@ Thank you to all the people who have already contributed to stable-diffusion.cpp
 - [latent-consistency-model](https://github.com/luosiallen/latent-consistency-model)
 - [generative-models](https://github.com/Stability-AI/generative-models/)
 - [PhotoMaker](https://github.com/TencentARC/PhotoMaker)
+- [Wan2.1](https://github.com/Wan-Video/Wan2.1)
+- [Wan2.2](https://github.com/Wan-Video/Wan2.2)
--- a/assets/wan/Wan2.1_1.3B_t2v.mp4
+++ b/assets/wan/Wan2.1_1.3B_t2v.mp4
--- a/assets/wan/Wan2.1_14B_flf2v.mp4
+++ b/assets/wan/Wan2.1_14B_flf2v.mp4
--- a/assets/wan/Wan2.1_14B_i2v.mp4
+++ b/assets/wan/Wan2.1_14B_i2v.mp4
--- a/assets/wan/Wan2.1_14B_t2v.mp4
+++ b/assets/wan/Wan2.1_14B_t2v.mp4
--- a/assets/wan/Wan2.2_14B_flf2v.mp4
+++ b/assets/wan/Wan2.2_14B_flf2v.mp4
--- a/assets/wan/Wan2.2_14B_i2v.mp4
+++ b/assets/wan/Wan2.2_14B_i2v.mp4
--- a/assets/wan/Wan2.2_14B_t2i.png
+++ b/assets/wan/Wan2.2_14B_t2i.png
--- a/assets/wan/Wan2.2_14B_t2v.mp4
+++ b/assets/wan/Wan2.2_14B_t2v.mp4
--- a/assets/wan/Wan2.2_14B_t2v_lora.mp4
+++ b/assets/wan/Wan2.2_14B_t2v_lora.mp4
--- a/assets/wan/Wan2.2_5B_i2v.mp4
+++ b/assets/wan/Wan2.2_5B_i2v.mp4
--- a/assets/wan/Wan2.2_5B_t2v.mp4
+++ b/assets/wan/Wan2.2_5B_t2v.mp4
--- a/clip.hpp
+++ b/clip.hpp
@ -179,9 +179,9 @@ public:
        auto it = encoder.find(utf8_to_utf32("img</w>"));
        if (it != encoder.end()) {
-            LOG_DEBUG(" trigger word img already in vocab");
+            LOG_DEBUG("trigger word img already in vocab");
        } else {
-            LOG_DEBUG(" trigger word img not in vocab yet");
+            LOG_DEBUG("trigger word img not in vocab yet");
        }
        int rank = 0;
@ -733,7 +733,7 @@ public:
            if (text_projection != NULL) {
                pooled = ggml_nn_linear(ctx, pooled, text_projection, NULL);
            } else {
-                LOG_DEBUG("Missing text_projection matrix, assuming identity...");
+                LOG_DEBUG("identity projection");
            }
            return pooled;  // [hidden_size, 1, 1]
        }
@ -774,7 +774,10 @@ public:
        blocks["post_layernorm"] = std::shared_ptr<GGMLBlock>(new LayerNorm(hidden_size));
    }
-    struct ggml_tensor* forward(struct ggml_context* ctx, struct ggml_tensor* pixel_values, bool return_pooled = true) {
+    struct ggml_tensor* forward(struct ggml_context* ctx,
+                                struct ggml_tensor* pixel_values,
+                                bool return_pooled = true,
+                                int clip_skip      = -1) {
        // pixel_values: [N, num_channels, image_size, image_size]
        auto embeddings     = std::dynamic_pointer_cast<CLIPVisionEmbeddings>(blocks["embeddings"]);
        auto pre_layernorm  = std::dynamic_pointer_cast<LayerNorm>(blocks["pre_layernorm"]);
@ -783,7 +786,7 @@ public:
        auto x = embeddings->forward(ctx, pixel_values);  // [N, num_positions, embed_dim]
        x      = pre_layernorm->forward(ctx, x);
-        x      = encoder->forward(ctx, x, -1, false);
+        x      = encoder->forward(ctx, x, clip_skip, false);
        // print_ggml_tensor(x, true, "ClipVisionModel x: ");
        auto last_hidden_state = x;
        x                      = post_layernorm->forward(ctx, x);  // [N, n_token, hidden_size]
@ -851,16 +854,22 @@ public:
        blocks["visual_projection"] = std::shared_ptr<GGMLBlock>(new CLIPProjection(hidden_size, projection_dim, transpose_proj_w));
    }
-    struct ggml_tensor* forward(struct ggml_context* ctx, struct ggml_tensor* pixel_values) {
+    struct ggml_tensor* forward(struct ggml_context* ctx,
+                                struct ggml_tensor* pixel_values,
+                                bool return_pooled = true,
+                                int clip_skip      = -1) {
        // pixel_values: [N, num_channels, image_size, image_size]
-        // return: [N, projection_dim]
+        // return: [N, projection_dim] if return_pooled else [N, n_token, hidden_size]
        auto vision_model      = std::dynamic_pointer_cast<CLIPVisionModel>(blocks["vision_model"]);
        auto visual_projection = std::dynamic_pointer_cast<CLIPProjection>(blocks["visual_projection"]);
-        auto x = vision_model->forward(ctx, pixel_values);  // [N, hidden_size]
-        x      = visual_projection->forward(ctx, x);        // [N, projection_dim]
-        return x;  // [N, projection_dim]
+        auto x = vision_model->forward(ctx, pixel_values, return_pooled, clip_skip);  // [N, hidden_size] or [N, n_token, hidden_size]
+        if (return_pooled) {
+            x = visual_projection->forward(ctx, x);  // [N, projection_dim]
+        }
+        return x;
    }
 };
@ -868,12 +877,13 @@ struct CLIPTextModelRunner : public GGMLRunner {
    CLIPTextModel model;
    CLIPTextModelRunner(ggml_backend_t backend,
+                        bool offload_params_to_cpu,
                        const String2GGMLType& tensor_types,
                        const std::string prefix,
                        CLIPVersion version = OPENAI_CLIP_VIT_L_14,
                        bool with_final_ln  = true,
                        int clip_skip_value = -1)
-        : GGMLRunner(backend), model(version, with_final_ln, clip_skip_value) {
+        : GGMLRunner(backend, offload_params_to_cpu), model(version, with_final_ln, clip_skip_value) {
        model.init(params_ctx, tensor_types, prefix);
    }
--- a/conditioner.hpp
+++ b/conditioner.hpp
@ -21,12 +21,12 @@ struct Conditioner {
                                              int clip_skip,
                                              int width,
                                              int height,
-                                              int adm_in_channels        = -1,
+                                              int adm_in_channels  = -1,
-                                              bool force_zero_embeddings = false)                                             = 0;
+                                              bool zero_out_masked = false)                                             = 0;
-    virtual void alloc_params_buffer()                                                                                        = 0;
+    virtual void alloc_params_buffer()                                                                                  = 0;
-    virtual void free_params_buffer()                                                                                         = 0;
+    virtual void free_params_buffer()                                                                                   = 0;
-    virtual void get_param_tensors(std::map<std::string, struct ggml_tensor*>& tensors)                                       = 0;
+    virtual void get_param_tensors(std::map<std::string, struct ggml_tensor*>& tensors)                                 = 0;
-    virtual size_t get_params_buffer_size()                                                                                   = 0;
+    virtual size_t get_params_buffer_size()                                                                             = 0;
    virtual std::tuple<SDCondition, std::vector<bool>> get_learned_condition_with_trigger(ggml_context* work_ctx,
                                                                                          int n_threads,
                                                                                          const std::string& text,
@ -34,10 +34,10 @@ struct Conditioner {
                                                                                          int width,
                                                                                          int height,
                                                                                          int num_input_imgs,
-                                                                                          int adm_in_channels        = -1,
+                                                                                          int adm_in_channels  = -1,
-                                                                                          bool force_zero_embeddings = false) = 0;
+                                                                                          bool zero_out_masked = false) = 0;
    virtual std::string remove_trigger_from_prompt(ggml_context* work_ctx,
-                                                   const std::string& prompt)                                                 = 0;
+                                                   const std::string& prompt)                                           = 0;
 };
 // ldm.modules.encoders.modules.FrozenCLIPEmbedder
@ -57,6 +57,7 @@ struct FrozenCLIPEmbedderWithCustomWords : public Conditioner {
    std::vector<std::string> readed_embeddings;
    FrozenCLIPEmbedderWithCustomWords(ggml_backend_t backend,
+                                      bool offload_params_to_cpu,
                                      const String2GGMLType& tensor_types,
                                      const std::string& embd_dir,
                                      SDVersion version = VERSION_SD1,
@ -64,12 +65,12 @@ struct FrozenCLIPEmbedderWithCustomWords : public Conditioner {
                                      int clip_skip     = -1)
        : version(version), pm_version(pv), tokenizer(sd_version_is_sd2(version) ? 0 : 49407), embd_dir(embd_dir) {
        if (sd_version_is_sd1(version)) {
-            text_model = std::make_shared<CLIPTextModelRunner>(backend, tensor_types, "cond_stage_model.transformer.text_model", OPENAI_CLIP_VIT_L_14);
+            text_model = std::make_shared<CLIPTextModelRunner>(backend, offload_params_to_cpu, tensor_types, "cond_stage_model.transformer.text_model", OPENAI_CLIP_VIT_L_14);
        } else if (sd_version_is_sd2(version)) {
-            text_model = std::make_shared<CLIPTextModelRunner>(backend, tensor_types, "cond_stage_model.transformer.text_model", OPEN_CLIP_VIT_H_14);
+            text_model = std::make_shared<CLIPTextModelRunner>(backend, offload_params_to_cpu, tensor_types, "cond_stage_model.transformer.text_model", OPEN_CLIP_VIT_H_14);
        } else if (sd_version_is_sdxl(version)) {
-            text_model  = std::make_shared<CLIPTextModelRunner>(backend, tensor_types, "cond_stage_model.transformer.text_model", OPENAI_CLIP_VIT_L_14, false);
+            text_model  = std::make_shared<CLIPTextModelRunner>(backend, offload_params_to_cpu, tensor_types, "cond_stage_model.transformer.text_model", OPENAI_CLIP_VIT_L_14, false);
-            text_model2 = std::make_shared<CLIPTextModelRunner>(backend, tensor_types, "cond_stage_model.1.transformer.text_model", OPEN_CLIP_VIT_BIGG_14, false);
+            text_model2 = std::make_shared<CLIPTextModelRunner>(backend, offload_params_to_cpu, tensor_types, "cond_stage_model.1.transformer.text_model", OPEN_CLIP_VIT_BIGG_14, false);
        }
        set_clip_skip(clip_skip);
    }
@ -154,7 +155,7 @@ struct FrozenCLIPEmbedderWithCustomWords : public Conditioner {
            }
            return true;
        };
-        model_loader.load_tensors(on_load, NULL);
+        model_loader.load_tensors(on_load);
        readed_embeddings.push_back(embd_name);
        if (embd) {
            int64_t hidden_size = text_model->model.hidden_size;
@ -409,8 +410,8 @@ struct FrozenCLIPEmbedderWithCustomWords : public Conditioner {
                                             int clip_skip,
                                             int width,
                                             int height,
-                                             int adm_in_channels        = -1,
+                                             int adm_in_channels  = -1,
-                                             bool force_zero_embeddings = false) {
+                                             bool zero_out_masked = false) {
        set_clip_skip(clip_skip);
        int64_t t0                               = ggml_time_ms();
        struct ggml_tensor* hidden_states        = NULL;  // [N, n_token, hidden_size]
@ -499,7 +500,7 @@ struct FrozenCLIPEmbedderWithCustomWords : public Conditioner {
                float new_mean = ggml_tensor_mean(result);
                ggml_tensor_scale(result, (original_mean / new_mean));
            }
-            if (force_zero_embeddings) {
+            if (zero_out_masked) {
                float* vec = (float*)result->data;
                for (int i = 0; i < ggml_nelements(result); i++) {
                    vec[i] = 0;
@ -562,8 +563,8 @@ struct FrozenCLIPEmbedderWithCustomWords : public Conditioner {
                                       int width,
                                       int height,
                                       int num_input_imgs,
-                                       int adm_in_channels        = -1,
+                                       int adm_in_channels  = -1,
-                                       bool force_zero_embeddings = false) {
+                                       bool zero_out_masked = false) {
        auto image_tokens = convert_token_to_id(trigger_word);
        // if(image_tokens.size() == 1){
        //     printf(" image token id is: %d \n", image_tokens[0]);
@ -584,7 +585,7 @@ struct FrozenCLIPEmbedderWithCustomWords : public Conditioner {
        // for(int i = 0; i < clsm.size(); ++i)
        //    printf("%d ", clsm[i]?1:0);
        // printf("\n");
-        auto cond = get_learned_condition_common(work_ctx, n_threads, tokens, weights, clip_skip, width, height, adm_in_channels, force_zero_embeddings);
+        auto cond = get_learned_condition_common(work_ctx, n_threads, tokens, weights, clip_skip, width, height, adm_in_channels, zero_out_masked);
        return std::make_tuple(cond, clsm);
    }
@ -606,20 +607,22 @@ struct FrozenCLIPEmbedderWithCustomWords : public Conditioner {
                                      int clip_skip,
                                      int width,
                                      int height,
-                                      int adm_in_channels        = -1,
+                                      int adm_in_channels  = -1,
-                                      bool force_zero_embeddings = false) {
+                                      bool zero_out_masked = false) {
        auto tokens_and_weights     = tokenize(text, true);
        std::vector<int>& tokens    = tokens_and_weights.first;
        std::vector<float>& weights = tokens_and_weights.second;
-        return get_learned_condition_common(work_ctx, n_threads, tokens, weights, clip_skip, width, height, adm_in_channels, force_zero_embeddings);
+        return get_learned_condition_common(work_ctx, n_threads, tokens, weights, clip_skip, width, height, adm_in_channels, zero_out_masked);
    }
 };
 struct FrozenCLIPVisionEmbedder : public GGMLRunner {
    CLIPVisionModelProjection vision_model;
-    FrozenCLIPVisionEmbedder(ggml_backend_t backend, const String2GGMLType& tensor_types = {})
-        : vision_model(OPEN_CLIP_VIT_H_14, true), GGMLRunner(backend) {
+    FrozenCLIPVisionEmbedder(ggml_backend_t backend,
+                             bool offload_params_to_cpu,
+                             const String2GGMLType& tensor_types = {})
+        : vision_model(OPEN_CLIP_VIT_H_14), GGMLRunner(backend, offload_params_to_cpu) {
        vision_model.init(params_ctx, tensor_types, "cond_stage_model.transformer");
    }
@ -631,12 +634,12 @@ struct FrozenCLIPVisionEmbedder : public GGMLRunner {
        vision_model.get_param_tensors(tensors, "cond_stage_model.transformer");
    }
-    struct ggml_cgraph* build_graph(struct ggml_tensor* pixel_values) {
+    struct ggml_cgraph* build_graph(struct ggml_tensor* pixel_values, bool return_pooled, int clip_skip) {
        struct ggml_cgraph* gf = ggml_new_graph(compute_ctx);
        pixel_values = to_backend(pixel_values);
-        struct ggml_tensor* hidden_states = vision_model.forward(compute_ctx, pixel_values);
+        struct ggml_tensor* hidden_states = vision_model.forward(compute_ctx, pixel_values, return_pooled, clip_skip);
        ggml_build_forward_expand(gf, hidden_states);
@ -645,10 +648,12 @@ struct FrozenCLIPVisionEmbedder : public GGMLRunner {
    void compute(const int n_threads,
                 ggml_tensor* pixel_values,
+                 bool return_pooled,
+                 int clip_skip,
                 ggml_tensor** output,
                 ggml_context* output_ctx) {
        auto get_graph = [&]() -> struct ggml_cgraph* {
-            return build_graph(pixel_values);
+            return build_graph(pixel_values, return_pooled, clip_skip);
        };
        GGMLRunner::compute(get_graph, n_threads, true, output, output_ctx);
    }
@ -663,12 +668,13 @@ struct SD3CLIPEmbedder : public Conditioner {
    std::shared_ptr<T5Runner> t5;
    SD3CLIPEmbedder(ggml_backend_t backend,
+                    bool offload_params_to_cpu,
                    const String2GGMLType& tensor_types = {},
                    int clip_skip                       = -1)
        : clip_g_tokenizer(0) {
-        clip_l = std::make_shared<CLIPTextModelRunner>(backend, tensor_types, "text_encoders.clip_l.transformer.text_model", OPENAI_CLIP_VIT_L_14, false);
+        clip_l = std::make_shared<CLIPTextModelRunner>(backend, offload_params_to_cpu, tensor_types, "text_encoders.clip_l.transformer.text_model", OPENAI_CLIP_VIT_L_14, false);
-        clip_g = std::make_shared<CLIPTextModelRunner>(backend, tensor_types, "text_encoders.clip_g.transformer.text_model", OPEN_CLIP_VIT_BIGG_14, false);
+        clip_g = std::make_shared<CLIPTextModelRunner>(backend, offload_params_to_cpu, tensor_types, "text_encoders.clip_g.transformer.text_model", OPEN_CLIP_VIT_BIGG_14, false);
-        t5     = std::make_shared<T5Runner>(backend, tensor_types, "text_encoders.t5xxl.transformer");
+        t5     = std::make_shared<T5Runner>(backend, offload_params_to_cpu, tensor_types, "text_encoders.t5xxl.transformer");
        set_clip_skip(clip_skip);
    }
@ -773,7 +779,7 @@ struct SD3CLIPEmbedder : public Conditioner {
                                             int n_threads,
                                             std::vector<std::pair<std::vector<int>, std::vector<float>>> token_and_weights,
                                             int clip_skip,
-                                             bool force_zero_embeddings = false) {
+                                             bool zero_out_masked = false) {
        set_clip_skip(clip_skip);
        auto& clip_l_tokens  = token_and_weights[0].first;
        auto& clip_l_weights = token_and_weights[0].second;
@ -952,7 +958,7 @@ struct SD3CLIPEmbedder : public Conditioner {
            int64_t t1 = ggml_time_ms();
            LOG_DEBUG("computing condition graph completed, taking %" PRId64 " ms", t1 - t0);
-            if (force_zero_embeddings) {
+            if (zero_out_masked) {
                float* vec = (float*)chunk_hidden_states->data;
                for (int i = 0; i < ggml_nelements(chunk_hidden_states); i++) {
                    vec[i] = 0;
@ -978,10 +984,10 @@ struct SD3CLIPEmbedder : public Conditioner {
                                      int clip_skip,
                                      int width,
                                      int height,
-                                      int adm_in_channels        = -1,
+                                      int adm_in_channels  = -1,
-                                      bool force_zero_embeddings = false) {
+                                      bool zero_out_masked = false) {
        auto tokens_and_weights = tokenize(text, 77, true);
-        return get_learned_condition_common(work_ctx, n_threads, tokens_and_weights, clip_skip, force_zero_embeddings);
+        return get_learned_condition_common(work_ctx, n_threads, tokens_and_weights, clip_skip, zero_out_masked);
    }
    std::tuple<SDCondition, std::vector<bool>> get_learned_condition_with_trigger(ggml_context* work_ctx,
@ -991,8 +997,8 @@ struct SD3CLIPEmbedder : public Conditioner {
                                                                                  int width,
                                                                                  int height,
                                                                                  int num_input_imgs,
-                                                                                  int adm_in_channels        = -1,
+                                                                                  int adm_in_channels  = -1,
-                                                                                  bool force_zero_embeddings = false) {
+                                                                                  bool zero_out_masked = false) {
        GGML_ASSERT(0 && "Not implemented yet!");
    }
@ -1010,10 +1016,11 @@ struct FluxCLIPEmbedder : public Conditioner {
    size_t chunk_len = 256;
    FluxCLIPEmbedder(ggml_backend_t backend,
+                     bool offload_params_to_cpu,
                     const String2GGMLType& tensor_types = {},
                     int clip_skip                       = -1) {
-        clip_l = std::make_shared<CLIPTextModelRunner>(backend, tensor_types, "text_encoders.clip_l.transformer.text_model", OPENAI_CLIP_VIT_L_14, true);
+        clip_l = std::make_shared<CLIPTextModelRunner>(backend, offload_params_to_cpu, tensor_types, "text_encoders.clip_l.transformer.text_model", OPENAI_CLIP_VIT_L_14, true);
-        t5     = std::make_shared<T5Runner>(backend, tensor_types, "text_encoders.t5xxl.transformer");
+        t5     = std::make_shared<T5Runner>(backend, offload_params_to_cpu, tensor_types, "text_encoders.t5xxl.transformer");
        set_clip_skip(clip_skip);
    }
@ -1101,7 +1108,7 @@ struct FluxCLIPEmbedder : public Conditioner {
                                             int n_threads,
                                             std::vector<std::pair<std::vector<int>, std::vector<float>>> token_and_weights,
                                             int clip_skip,
-                                             bool force_zero_embeddings = false) {
+                                             bool zero_out_masked = false) {
        set_clip_skip(clip_skip);
        auto& clip_l_tokens  = token_and_weights[0].first;
        auto& clip_l_weights = token_and_weights[0].second;
@ -1173,7 +1180,7 @@ struct FluxCLIPEmbedder : public Conditioner {
            int64_t t1 = ggml_time_ms();
            LOG_DEBUG("computing condition graph completed, taking %" PRId64 " ms", t1 - t0);
-            if (force_zero_embeddings) {
+            if (zero_out_masked) {
                float* vec = (float*)chunk_hidden_states->data;
                for (int i = 0; i < ggml_nelements(chunk_hidden_states); i++) {
                    vec[i] = 0;
@ -1199,10 +1206,10 @@ struct FluxCLIPEmbedder : public Conditioner {
                                      int clip_skip,
                                      int width,
                                      int height,
-                                      int adm_in_channels        = -1,
+                                      int adm_in_channels  = -1,
-                                      bool force_zero_embeddings = false) {
+                                      bool zero_out_masked = false) {
        auto tokens_and_weights = tokenize(text, chunk_len, true);
-        return get_learned_condition_common(work_ctx, n_threads, tokens_and_weights, clip_skip, force_zero_embeddings);
+        return get_learned_condition_common(work_ctx, n_threads, tokens_and_weights, clip_skip, zero_out_masked);
    }
    std::tuple<SDCondition, std::vector<bool>> get_learned_condition_with_trigger(ggml_context* work_ctx,
@ -1212,8 +1219,8 @@ struct FluxCLIPEmbedder : public Conditioner {
                                                                                  int width,
                                                                                  int height,
                                                                                  int num_input_imgs,
-                                                                                  int adm_in_channels        = -1,
+                                                                                  int adm_in_channels  = -1,
-                                                                                  bool force_zero_embeddings = false) {
+                                                                                  bool zero_out_masked = false) {
        GGML_ASSERT(0 && "Not implemented yet!");
    }
@ -1223,20 +1230,23 @@ struct FluxCLIPEmbedder : public Conditioner {
    }
 };
-struct PixArtCLIPEmbedder : public Conditioner {
+struct T5CLIPEmbedder : public Conditioner {
    T5UniGramTokenizer t5_tokenizer;
    std::shared_ptr<T5Runner> t5;
    size_t chunk_len = 512;
    bool use_mask    = false;
    int mask_pad     = 1;
    bool is_umt5     = false;
-    PixArtCLIPEmbedder(ggml_backend_t backend,
+    T5CLIPEmbedder(ggml_backend_t backend,
-                       const String2GGMLType& tensor_types = {},
+                   bool offload_params_to_cpu,
-                       int clip_skip                       = -1,
+                   const String2GGMLType& tensor_types = {},
-                       bool use_mask                       = false,
+                   int clip_skip                       = -1,
-                       int mask_pad                        = 1)
+                   bool use_mask                       = false,
-        : use_mask(use_mask), mask_pad(mask_pad) {
+                   int mask_pad                        = 1,
-        t5 = std::make_shared<T5Runner>(backend, tensor_types, "text_encoders.t5xxl.transformer");
+                   bool is_umt5                        = false)
        : use_mask(use_mask), mask_pad(mask_pad), t5_tokenizer(is_umt5) {
        t5 = std::make_shared<T5Runner>(backend, offload_params_to_cpu, tensor_types, "text_encoders.t5xxl.transformer", is_umt5);
    }
    void set_clip_skip(int clip_skip) {
@ -1317,16 +1327,16 @@ struct PixArtCLIPEmbedder : public Conditioner {
                                             int n_threads,
                                             std::tuple<std::vector<int>, std::vector<float>, std::vector<float>> token_and_weights,
                                             int clip_skip,
-                                             bool force_zero_embeddings = false) {
+                                             bool zero_out_masked = false) {
        auto& t5_tokens        = std::get<0>(token_and_weights);
        auto& t5_weights       = std::get<1>(token_and_weights);
        auto& t5_attn_mask_vec = std::get<2>(token_and_weights);
        int64_t t0                              = ggml_time_ms();
-        struct ggml_tensor* hidden_states       = NULL;                                               // [N, n_token, 4096]
+        struct ggml_tensor* hidden_states       = NULL;  // [N, n_token, 4096]
-        struct ggml_tensor* chunk_hidden_states = NULL;                                               // [n_token, 4096]
+        struct ggml_tensor* chunk_hidden_states = NULL;  // [n_token, 4096]
-        struct ggml_tensor* pooled              = NULL;                                               // [768,]
+        struct ggml_tensor* pooled              = NULL;
-        struct ggml_tensor* t5_attn_mask        = vector_to_ggml_tensor(work_ctx, t5_attn_mask_vec);  // [768,]
+        struct ggml_tensor* t5_attn_mask        = vector_to_ggml_tensor(work_ctx, t5_attn_mask_vec);  // [n_token]
        std::vector<float> hidden_states_vec;
@ -1367,10 +1377,16 @@ struct PixArtCLIPEmbedder : public Conditioner {
            int64_t t1 = ggml_time_ms();
            LOG_DEBUG("computing condition graph completed, taking %" PRId64 " ms", t1 - t0);
-            if (force_zero_embeddings) {
+            if (zero_out_masked) {
-                float* vec = (float*)chunk_hidden_states->data;
+                auto tensor = chunk_hidden_states;
-                for (int i = 0; i < ggml_nelements(chunk_hidden_states); i++) {
+                for (int i2 = 0; i2 < tensor->ne[2]; i2++) {
-                    vec[i] = 0;
+                    for (int i1 = 0; i1 < tensor->ne[1]; i1++) {
                        for (int i0 = 0; i0 < tensor->ne[0]; i0++) {
                            if (chunk_mask[i1] < 0.f) {
                                ggml_tensor_set_f32(tensor, 0.f, i0, i1, i2);
                            }
                        }
                    }
                }
            }
@ -1379,16 +1395,12 @@ struct PixArtCLIPEmbedder : public Conditioner {
                                     ((float*)chunk_hidden_states->data) + ggml_nelements(chunk_hidden_states));
        }
-        if (hidden_states_vec.size() > 0) {
+        GGML_ASSERT(hidden_states_vec.size() > 0);
-            hidden_states = vector_to_ggml_tensor(work_ctx, hidden_states_vec);
+        hidden_states = vector_to_ggml_tensor(work_ctx, hidden_states_vec);
-            hidden_states = ggml_reshape_2d(work_ctx,
+        hidden_states = ggml_reshape_2d(work_ctx,
-                                            hidden_states,
+                                        hidden_states,
-                                            chunk_hidden_states->ne[0],
+                                        chunk_hidden_states->ne[0],
-                                            ggml_nelements(hidden_states) / chunk_hidden_states->ne[0]);
+                                        ggml_nelements(hidden_states) / chunk_hidden_states->ne[0]);
-        } else {
-            hidden_states = ggml_new_tensor_2d(work_ctx, GGML_TYPE_F32, 4096, 256);
-            ggml_set_f32(hidden_states, 0.f);
-        }
        modify_mask_to_attend_padding(t5_attn_mask, ggml_nelements(t5_attn_mask), mask_pad);
@ -1401,10 +1413,10 @@ struct PixArtCLIPEmbedder : public Conditioner {
                                      int clip_skip,
                                      int width,
                                      int height,
-                                      int adm_in_channels        = -1,
+                                      int adm_in_channels  = -1,
-                                      bool force_zero_embeddings = false) {
+                                      bool zero_out_masked = false) {
        auto tokens_and_weights = tokenize(text, chunk_len, true);
-        return get_learned_condition_common(work_ctx, n_threads, tokens_and_weights, clip_skip, force_zero_embeddings);
+        return get_learned_condition_common(work_ctx, n_threads, tokens_and_weights, clip_skip, zero_out_masked);
    }
    std::tuple<SDCondition, std::vector<bool>> get_learned_condition_with_trigger(ggml_context* work_ctx,
@ -1414,8 +1426,8 @@ struct PixArtCLIPEmbedder : public Conditioner {
                                                                                  int width,
                                                                                  int height,
                                                                                  int num_input_imgs,
-                                                                                  int adm_in_channels        = -1,
+                                                                                  int adm_in_channels  = -1,
-                                                                                  bool force_zero_embeddings = false) {
+                                                                                  bool zero_out_masked = false) {
        GGML_ASSERT(0 && "Not implemented yet!");
    }
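The `zero_out_masked` hunk in `T5CLIPEmbedder::get_learned_condition_common` above changes the behavior from zeroing the whole chunk to zeroing only the token rows whose mask entry is negative. The sketch below shows the same idea on a plain `std::vector<float>` instead of a ggml tensor. The function name `zero_out_masked_rows` and the flat row-major `[n_token, hidden_size]` layout are assumptions made for illustration, not code from this commit.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Zeroes every token row of a row-major [n_token, hidden_size] buffer whose
// mask entry is negative; rows with a non-negative mask entry are untouched.
// Hypothetical helper for illustration only.
void zero_out_masked_rows(std::vector<float>& hidden,
                          std::size_t hidden_size,
                          const std::vector<float>& mask) {
    assert(hidden.size() == mask.size() * hidden_size);
    for (std::size_t token = 0; token < mask.size(); ++token) {
        if (mask[token] < 0.f) {
            for (std::size_t i = 0; i < hidden_size; ++i) {
                hidden[token * hidden_size + i] = 0.f;
            }
        }
    }
}
```

With three tokens of width two and a mask of `{0, -1, 0}`, only the middle row is cleared, whereas the old `force_zero_embeddings` loop would have cleared all six values.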
--- a/control.hpp
+++ b/control.hpp
@ -317,9 +317,10 @@ struct ControlNet : public GGMLRunner {
    bool guided_hint_cached         = false;
    ControlNet(ggml_backend_t backend,
               bool offload_params_to_cpu,
               const String2GGMLType& tensor_types = {},
               SDVersion version                   = VERSION_SD1)
-        : GGMLRunner(backend), control_net(version) {
+        : GGMLRunner(backend, offload_params_to_cpu), control_net(version) {
        control_net.init(params_ctx, tensor_types, "");
    }
@ -357,7 +358,7 @@ struct ControlNet : public GGMLRunner {
            control_buffer_size += ggml_nbytes(controls[i]);
        }
-        control_buffer = ggml_backend_alloc_ctx_tensors(control_ctx, backend);
+        control_buffer = ggml_backend_alloc_ctx_tensors(control_ctx, runtime_backend);
        LOG_DEBUG("control buffer size %.2fMB", control_buffer_size * 1.f / 1024.f / 1024.f);
    }
@ -454,7 +455,7 @@ struct ControlNet : public GGMLRunner {
            return false;
        }
-        bool success = model_loader.load_tensors(tensors, backend, ignore_tensors);
+        bool success = model_loader.load_tensors(tensors, ignore_tensors);
        if (!success) {
            LOG_ERROR("load control net tensors from model loader failed");
--- a/denoiser.hpp
+++ b/denoiser.hpp
@ -252,7 +252,7 @@ struct KarrasSchedule : SigmaSchedule {
 };
 struct Denoiser {
-    std::shared_ptr<SigmaSchedule> schedule                                                  = std::make_shared<DiscreteSchedule>();
+    std::shared_ptr<SigmaSchedule> scheduler                                                 = std::make_shared<DiscreteSchedule>();
    virtual float sigma_min()                                                                = 0;
    virtual float sigma_max()                                                                = 0;
    virtual float sigma_to_t(float sigma)                                                    = 0;
@ -263,7 +263,7 @@ struct Denoiser {
    virtual std::vector<float> get_sigmas(uint32_t n) {
        auto bound_t_to_sigma = std::bind(&Denoiser::t_to_sigma, this, std::placeholders::_1);
-        return schedule->get_sigmas(n, sigma_min(), sigma_max(), bound_t_to_sigma);
+        return scheduler->get_sigmas(n, sigma_min(), sigma_max(), bound_t_to_sigma);
    }
 };
@ -349,7 +349,7 @@ struct EDMVDenoiser : public CompVisVDenoiser {
    EDMVDenoiser(float min_sigma = 0.002, float max_sigma = 120.0)
        : min_sigma(min_sigma), max_sigma(max_sigma) {
-        schedule = std::make_shared<ExponentialSchedule>();
+        scheduler = std::make_shared<ExponentialSchedule>();
    }
    float t_to_sigma(float t) {
--- a/diffusion_model.hpp
+++ b/diffusion_model.hpp
@ -4,8 +4,10 @@
 #include "flux.hpp"
 #include "mmdit.hpp"
 #include "unet.hpp"
 #include "wan.hpp"
 struct DiffusionModel {
    virtual std::string get_desc()                                                      = 0;
    virtual void compute(int n_threads,
                         struct ggml_tensor* x,
                         struct ggml_tensor* timesteps,
@ -32,10 +34,15 @@ struct UNetModel : public DiffusionModel {
    UNetModelRunner unet;
    UNetModel(ggml_backend_t backend,
              bool offload_params_to_cpu,
              const String2GGMLType& tensor_types = {},
              SDVersion version                   = VERSION_SD1,
              bool flash_attn                     = false)
-        : unet(backend, tensor_types, "model.diffusion_model", version, flash_attn) {
+        : unet(backend, offload_params_to_cpu, tensor_types, "model.diffusion_model", version, flash_attn) {
    }
    std::string get_desc() {
        return unet.get_desc();
    }
    void alloc_params_buffer() {
@ -85,8 +92,13 @@ struct MMDiTModel : public DiffusionModel {
    MMDiTRunner mmdit;
    MMDiTModel(ggml_backend_t backend,
               bool offload_params_to_cpu,
               const String2GGMLType& tensor_types = {})
-        : mmdit(backend, tensor_types, "model.diffusion_model") {
+        : mmdit(backend, offload_params_to_cpu, tensor_types, "model.diffusion_model") {
    }
    std::string get_desc() {
        return mmdit.get_desc();
    }
    void alloc_params_buffer() {
@ -135,11 +147,16 @@ struct FluxModel : public DiffusionModel {
    Flux::FluxRunner flux;
    FluxModel(ggml_backend_t backend,
              bool offload_params_to_cpu,
              const String2GGMLType& tensor_types = {},
              SDVersion version                   = VERSION_FLUX,
              bool flash_attn                     = false,
              bool use_mask                       = false)
-        : flux(backend, tensor_types, "model.diffusion_model", version, flash_attn, use_mask) {
+        : flux(backend, offload_params_to_cpu, tensor_types, "model.diffusion_model", version, flash_attn, use_mask) {
    }
    std::string get_desc() {
        return flux.get_desc();
    }
    void alloc_params_buffer() {
@ -184,4 +201,63 @@ struct FluxModel : public DiffusionModel {
    }
 };
 struct WanModel : public DiffusionModel {
    std::string prefix;
    WAN::WanRunner wan;
    WanModel(ggml_backend_t backend,
             bool offload_params_to_cpu,
             const String2GGMLType& tensor_types = {},
             const std::string prefix            = "model.diffusion_model",
             SDVersion version                   = VERSION_WAN2,
             bool flash_attn                     = false)
        : prefix(prefix), wan(backend, offload_params_to_cpu, tensor_types, prefix, version, flash_attn) {
    }
    std::string get_desc() {
        return wan.get_desc();
    }
    void alloc_params_buffer() {
        wan.alloc_params_buffer();
    }
    void free_params_buffer() {
        wan.free_params_buffer();
    }
    void free_compute_buffer() {
        wan.free_compute_buffer();
    }
    void get_param_tensors(std::map<std::string, struct ggml_tensor*>& tensors) {
        wan.get_param_tensors(tensors, prefix);
    }
    size_t get_params_buffer_size() {
        return wan.get_params_buffer_size();
    }
    int64_t get_adm_in_channels() {
        return 768;
    }
    void compute(int n_threads,
                 struct ggml_tensor* x,
                 struct ggml_tensor* timesteps,
                 struct ggml_tensor* context,
                 struct ggml_tensor* c_concat,
                 struct ggml_tensor* y,
                 struct ggml_tensor* guidance,
                 std::vector<ggml_tensor*> ref_latents     = {},
                 int num_video_frames                      = -1,
                 std::vector<struct ggml_tensor*> controls = {},
                 float control_strength                    = 0.f,
                 struct ggml_tensor** output               = NULL,
                 struct ggml_context* output_ctx           = NULL,
                 std::vector<int> skip_layers              = std::vector<int>()) {
        return wan.compute(n_threads, x, timesteps, context, y, c_concat, NULL, output, output_ctx);
    }
 };
 #endif
--- a/docs/wan.md
+++ b/docs/wan.md
@ -0,0 +1,141 @@
 # How to Use
 ## Download weights
 - Download Wan
    - Wan2.1
        - Wan2.1 T2V 1.3B
            - safetensors: https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/tree/main/split_files/diffusion_models
        - Wan2.1 T2V 14B
            - safetensors: https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/tree/main/split_files/diffusion_models
            - gguf: https://huggingface.co/city96/Wan2.1-T2V-14B-gguf/tree/main
        - Wan2.1 I2V 14B 480P
            - safetensors: https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/tree/main/split_files/diffusion_models
            - gguf: https://huggingface.co/city96/Wan2.1-I2V-14B-480P-gguf/tree/main
        - Wan2.1 I2V 14B 720P
            - safetensors: https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/tree/main/split_files/diffusion_models
            - gguf: https://huggingface.co/city96/Wan2.1-I2V-14B-720P-gguf/tree/main
        - Wan2.1 FLF2V 14B 720P
            - safetensors: https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/tree/main/split_files/diffusion_models
            - gguf: https://huggingface.co/city96/Wan2.1-FLF2V-14B-720P-gguf/tree/main
    - Wan2.2
        - Wan2.2 TI2V 5B
            - safetensors: https://huggingface.co/Comfy-Org/Wan_2.2_ComfyUI_Repackaged/tree/main/split_files/diffusion_models
            - gguf: https://huggingface.co/QuantStack/Wan2.2-TI2V-5B-GGUF/tree/main
        - Wan2.2 T2V A14B
            - safetensors: https://huggingface.co/Comfy-Org/Wan_2.2_ComfyUI_Repackaged/tree/main/split_files/diffusion_models
            - gguf: https://huggingface.co/QuantStack/Wan2.2-T2V-A14B-GGUF/tree/main
        - Wan2.2 I2V A14B
            - safetensors: https://huggingface.co/Comfy-Org/Wan_2.2_ComfyUI_Repackaged/tree/main/split_files/diffusion_models
            - gguf: https://huggingface.co/QuantStack/Wan2.2-I2V-A14B-GGUF/tree/main
 - Download VAE
    - wan_2.1_vae (for all Wan models except Wan2.2 TI2V 5B)
        - safetensors: https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/blob/main/split_files/vae/wan_2.1_vae.safetensors
    - wan2.2_vae (for Wan2.2 TI2V 5B only)
        - safetensors: https://huggingface.co/Comfy-Org/Wan_2.2_ComfyUI_Repackaged/blob/main/split_files/vae/wan2.2_vae.safetensors
 - Download umt5_xxl
    - safetensors: https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/blob/main/split_files/text_encoders/umt5_xxl_fp16.safetensors
    - gguf: https://huggingface.co/city96/umt5-xxl-encoder-gguf/tree/main
 - Download clip_vision_h (for Wan2.1 I2V/FLF2V only)
    - safetensors: https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/blob/main/split_files/clip_vision/clip_vision_h.safetensors
 ## Examples
 Since GitHub does not support AVI files, the example videos shown here were converted from AVI to MP4.
 ### Wan2.1 T2V 1.3B
 ```
 .\bin\Release\sd.exe -M vid_gen --diffusion-model  ..\..\ComfyUI\models\diffusion_models\wan2.1_t2v_1.3B_fp16.safetensors --vae ..\..\ComfyUI\models\vae\wan_2.1_vae.safetensors --t5xxl ..\..\ComfyUI\models\text_encoders\umt5-xxl-encoder-Q8_0.gguf  -p "a lovely cat" --cfg-scale 6.0 --sampling-method euler -v -n "色调艳丽，过曝，静态，细节模糊不清，字幕，风格，作品，画作，画面，静止，整体发灰，最差质量，低质量，JPEG压缩残留，丑陋的，残缺的，多余的手指，画得不好的手部，画得不好的脸部， 畸形的，毁容的，形态畸形的肢体，手指融合，静止不动的画面，杂乱的背景，三条腿，背景人很多，倒着走" -W 832 -H 480 --diffusion-fa --video-frames 33
 ```
 <video src=../assets/wan/Wan2.1_1.3B_t2v.mp4 controls="controls" muted="muted" type="video/mp4"></video>
 ### Wan2.1 T2V 14B
 ```
 .\bin\Release\sd.exe -M vid_gen --diffusion-model  ..\..\ComfyUI\models\diffusion_models\wan2.1-t2v-14b-Q8_0.gguf --vae ..\..\ComfyUI\models\vae\wan_2.1_vae.safetensors --t5xxl ..\..\ComfyUI\models\text_encoders\umt5-xxl-encoder-Q8_0.gguf  -p "a lovely cat" --cfg-scale 6.0 --sampling-method euler -v -n "色调艳丽，过曝，静态，细节模糊不清，字幕，风格，作品，画作，画面，静止，整体发灰，最差质量，低质量，JPEG压缩残留，丑陋的，残缺的，多余的手指，画得不好的手部，画得不好的脸部，畸形的，毁容的，形态畸形的肢体，手指融合，静止不动的画面，杂乱的背景，三条腿，背景人很多，倒着走" -W 832 -H 480 --diffusion-fa  --offload-to-cpu --video-frames 33
 ```
 <video src=../assets/wan/Wan2.1_14B_t2v.mp4 controls="controls" muted="muted" type="video/mp4"></video>
 ### Wan2.1 I2V 14B
 ```
 .\bin\Release\sd.exe -M vid_gen --diffusion-model  ..\..\ComfyUI\models\diffusion_models\wan2.1-i2v-14b-480p-Q8_0.gguf --vae ..\..\ComfyUI\models\vae\wan_2.1_vae.safetensors --t5xxl ..\..\ComfyUI\models\text_encoders\umt5-xxl-encoder-Q8_0.gguf --clip_vision ..\..\ComfyUI\models\clip_vision\clip_vision_h.safetensors -p "a lovely cat" --cfg-scale 6.0 --sampling-method euler -v -n "色调艳丽，过曝，静态，细节模糊不清，字幕，风格，作品，画作，画面，静止，整体发灰，最差质量，低质量，JPEG压缩残留，丑陋的，残缺的，多余的手指，画得不好的手部，画得不好的脸部，畸形的，毁容的，形态畸形的肢体，手指融合，静止不动的画面，杂乱的背景，三条腿，背景人很多，倒着走" -W 480 -H 832 --diffusion-fa --video-frames 33 --offload-to-cpu -i ..\assets\cat_with_sd_cpp_42.png
 ```
 <video src=../assets/wan/Wan2.1_14B_i2v.mp4 controls="controls" muted="muted" type="video/mp4"></video>
 ### Wan2.2 T2V A14B
 ```
 .\bin\Release\sd.exe -M vid_gen --diffusion-model  ..\..\ComfyUI\models\diffusion_models\Wan2.2-T2V-A14B-LowNoise-Q8_0.gguf --high-noise-diffusion-model  ..\..\ComfyUI\models\diffusion_models\Wan2.2-T2V-A14B-HighNoise-Q8_0.gguf --vae ..\..\ComfyUI\models\vae\wan_2.1_vae.safetensors --t5xxl ..\..\ComfyUI\models\text_encoders\umt5-xxl-encoder-Q8_0.gguf  -p "a lovely cat" --cfg-scale 3.5 --sampling-method euler --steps 10 --high-noise-cfg-scale 3.5 --high-noise-sampling-method euler --high-noise-steps 8 -v -n "色调艳丽，过曝，静态，细节模糊不清，字幕，风格，作品，画作，画面，静止，整体发灰，最差质量，低质量，JPEG压缩残留，丑陋的，残缺的，多余的手指，画得不好的手部，画得不好的脸部，畸形的，毁容的，形态畸形的肢体，手指融合，静止不动的画面，杂乱的背景，三条腿，背景人很多，倒着走" -W 832 -H 480 --diffusion-fa --offload-to-cpu --video-frames 33
 ```
 <video src=../assets/wan/Wan2.2_14B_t2v.mp4 controls="controls" muted="muted" type="video/mp4"></video>
 ### Wan2.2 I2V A14B
 ```
 .\bin\Release\sd.exe -M vid_gen --diffusion-model  ..\..\ComfyUI\models\diffusion_models\Wan2.2-I2V-A14B-LowNoise-Q8_0.gguf --high-noise-diffusion-model  ..\..\ComfyUI\models\diffusion_models\Wan2.2-I2V-A14B-HighNoise-Q8_0.gguf --vae ..\..\ComfyUI\models\vae\wan_2.1_vae.safetensors --t5xxl ..\..\ComfyUI\models\text_encoders\umt5-xxl-encoder-Q8_0.gguf  -p "a lovely cat" --cfg-scale 3.5 --sampling-method euler --steps 10 --high-noise-cfg-scale 3.5 --high-noise-sampling-method euler --high-noise-steps 8 -v -n "色调艳丽，过曝，静态，细节模糊不清，字幕，风格，作品，画作，画面，静止，整体发灰，最差质量，低质量，JPEG压缩残留，丑陋的，残缺的，多余的手指，画得不好的手部，画得不好的脸部，畸形的，毁容的，形态畸形的肢体，手指融合，静止不动的画面，杂乱的背景，三条腿，背景人很多，倒着走" -W 832 -H 480 --diffusion-fa --offload-to-cpu --video-frames 33 -i ..\assets\cat_with_sd_cpp_42.png
 ```
 <video src=../assets/wan/Wan2.2_14B_i2v.mp4 controls="controls" muted="muted" type="video/mp4"></video>
 ### Wan2.2 T2V A14B T2I
 ```
 .\bin\Release\sd.exe -M vid_gen --diffusion-model  ..\..\ComfyUI\models\diffusion_models\Wan2.2-T2V-A14B-LowNoise-Q8_0.gguf --high-noise-diffusion-model  ..\..\ComfyUI\models\diffusion_models\Wan2.2-T2V-A14B-HighNoise-Q8_0.gguf --vae ..\..\ComfyUI\models\vae\wan_2.1_vae.safetensors --t5xxl ..\..\ComfyUI\models\text_encoders\umt5-xxl-encoder-Q8_0.gguf  -p "a lovely cat" --cfg-scale 3.5 --sampling-method euler --steps 10 --high-noise-cfg-scale 3.5 --high-noise-sampling-method euler --high-noise-steps 8 -v -n "色调艳丽，过曝，静态，细节模糊不清，字幕，风格，作品，画作，画面，静止，整体发灰，最差质量，低质量，JPEG压缩残留，丑陋的，残缺的，多余的手指，画得不好的手部，画得不好的脸部，畸形的，毁容的，形态畸形的肢体，手指融合，静止不动的画面，杂乱的背景，三条腿，背景人很多，倒着走" -W 832 -H 480 --diffusion-fa --offload-to-cpu
 ```
 <img width="832" height="480" alt="Wan2 2_14B_t2i" src="../assets/wan/Wan2.2_14B_t2i.png" />
 ### Wan2.2 T2V A14B with LoRA
 ```
 .\bin\Release\sd.exe -M vid_gen --diffusion-model  ..\..\ComfyUI\models\diffusion_models\Wan2.2-T2V-A14B-LowNoise-Q8_0.gguf --high-noise-diffusion-model  ..\..\ComfyUI\models\diffusion_models\Wan2.2-T2V-A14B-HighNoise-Q8_0.gguf --vae ..\..\ComfyUI\models\vae\wan_2.1_vae.safetensors --t5xxl ..\..\ComfyUI\models\text_encoders\umt5-xxl-encoder-Q8_0.gguf  -p "a lovely cat<lora:wan2.2_t2v_lightx2v_4steps_lora_v1.1_low_noise:1><lora:|high_noise|wan2.2_t2v_lightx2v_4steps_lora_v1.1_high_noise:1>" --cfg-scale 3.5 --sampling-method euler --steps 4 --high-noise-cfg-scale 3.5 --high-noise-sampling-method euler --high-noise-steps 4 -v -n "色调艳丽，过曝，静态，细节模糊不清，字幕，风格，作品，画作，画面，静止，整体发灰，最差质量，低质量，JPEG压缩残留，丑陋的，残缺的，多余的手指，画得不好的手部，画得不好的脸部，畸形的，毁容的，形态畸形的肢体，手指融合，静止不动的画面，杂乱的背景，三条腿，背景人很多，倒着走" -W 832 -H 480 --diffusion-fa --offload-to-cpu --lora-model-dir ..\..\ComfyUI\models\loras --video-frames 33
 ```
 <video src=../assets/wan/Wan2.2_14B_t2v_lora.mp4 controls="controls" muted="muted" type="video/mp4"></video>
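The prompt in the command above embeds LoRA selections as `<lora:NAME:WEIGHT>` tags, and the second tag prefixes the name with `|high_noise|`, which appears to target the high-noise diffusion model. The following is a minimal sketch of how such tags could be extracted from a prompt. `LoraTag` and `parse_lora_tags` are hypothetical names chosen for illustration, and this is not the parser that stable-diffusion.cpp itself uses.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Hypothetical record for one parsed tag; not a type from stable-diffusion.cpp.
struct LoraTag {
    std::string name;
    float multiplier;
    bool high_noise;
};

// Extracts every <lora:NAME:WEIGHT> tag from `prompt`.
// A NAME that starts with "|high_noise|" is flagged and the prefix is stripped.
// std::stof throws if WEIGHT is not a number; a real parser would handle that.
std::vector<LoraTag> parse_lora_tags(const std::string& prompt) {
    static const std::regex re("<lora:([^:>]+):([^>]+)>");
    static const std::string marker = "|high_noise|";
    std::vector<LoraTag> tags;
    for (std::sregex_iterator it(prompt.begin(), prompt.end(), re), end; it != end; ++it) {
        std::string name = (*it)[1].str();
        bool high_noise  = name.compare(0, marker.size(), marker) == 0;
        if (high_noise) {
            name.erase(0, marker.size());
        }
        tags.push_back({name, std::stof((*it)[2].str()), high_noise});
    }
    return tags;
}
```

Applied to a prompt such as `a lovely cat<lora:foo_low:1><lora:|high_noise|foo_high:0.5>` (placeholder LoRA names), the sketch yields two entries, the second one flagged as high-noise.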
 ### Wan2.2 TI2V 5B
 #### T2V
 ```
 .\bin\Release\sd.exe -M vid_gen --diffusion-model  ..\..\ComfyUI\models\diffusion_models\wan2.2_ti2v_5B_fp16.safetensors --vae ..\..\ComfyUI\models\vae\wan2.2_vae.safetensors --t5xxl ..\..\ComfyUI\models\text_encoders\umt5-xxl-encoder-Q8_0.gguf  -p "a lovely cat" --cfg-scale 6.0 --sampling-method euler -v -n "色调艳丽，过曝，静态，细节模糊不清，字幕，风格，作品，画作，画面，静止，整体发灰，最差质量，低质量，JPEG压缩残留，丑陋的，残缺的，多余的手指，画得不好的手部，画得不好的脸部，畸形的，毁容的，形态畸形的肢体，手指融合，静止不动的画面，杂乱的背景，三条腿，背景人很多，倒着走" -W 480 -H 832 --diffusion-fa --offload-to-cpu --video-frames 33
 ```
 <video src=../assets/wan/Wan2.2_5B_t2v.mp4 controls="controls" muted="muted" type="video/mp4"></video>
 #### I2V
 ```
 .\bin\Release\sd.exe -M vid_gen --diffusion-model  ..\..\ComfyUI\models\diffusion_models\wan2.2_ti2v_5B_fp16.safetensors --vae ..\..\ComfyUI\models\vae\wan2.2_vae.safetensors --t5xxl ..\..\ComfyUI\models\text_encoders\umt5-xxl-encoder-Q8_0.gguf  -p "a lovely cat" --cfg-scale 6.0 --sampling-method euler -v -n "色调艳丽，过曝，静态，细节模糊不清，字幕，风格，作品，画作，画面，静止，整体发灰，最差质量，低质量，JPEG压缩残留，丑陋的，残缺的，多余的手指，画得不好的手部，画得不好的脸部，畸形的，毁容的，形态畸形的肢体，手指融合，静止不动的画面，杂乱的背景，三条腿，背景人很多，倒着走" -W 480 -H 832 --diffusion-fa --offload-to-cpu --video-frames 33 -i ..\assets\cat_with_sd_cpp_42.png
 ```
 <video src=../assets/wan/Wan2.2_5B_i2v.mp4 controls="controls" muted="muted" type="video/mp4"></video>
 ### Wan2.1 FLF2V 14B
 ```
 .\bin\Release\sd.exe -M vid_gen --diffusion-model  ..\..\ComfyUI\models\diffusion_models\wan2.1-flf2v-14b-720p-Q8_0.gguf --vae ..\..\ComfyUI\models\vae\wan_2.1_vae.safetensors --t5xxl ..\..\ComfyUI\models\text_encoders\umt5-xxl-encoder-Q8_0.gguf --clip_vision ..\..\ComfyUI\models\clip_vision\clip_vision_h.safetensors -p "glass flower blossom" --cfg-scale 6.0 --sampling-method euler -v -n "色调艳丽，过曝，静态，细节模糊不清，字幕，风格，作品，画作，画面，静止，整体发灰，最差质量，低质量，JPEG压缩残留，丑陋的，残缺的，多余的手指，画得不好的手部，画得不好的脸部，畸形的，毁容的，形态畸形的肢体，手指融合，静止不动的画面，杂乱的背景，三条腿，背景人很多，倒着走" -W 480 -H 832 --diffusion-fa --video-frames 33 --offload-to-cpu --init-img ..\..\ComfyUI\input\start_image.png --end-img ..\..\ComfyUI\input\end_image.png
 ```
 <video src=../assets/wan/Wan2.1_14B_flf2v.mp4 controls="controls" muted="muted" type="video/mp4"></video>
 ### Wan2.2 FLF2V A14B
 ```
 .\bin\Release\sd.exe -M vid_gen --diffusion-model  ..\..\ComfyUI\models\diffusion_models\Wan2.2-I2V-A14B-LowNoise-Q8_0.gguf --high-noise-diffusion-model  ..\..\ComfyUI\models\diffusion_models\Wan2.2-I2V-A14B-HighNoise-Q8_0.gguf --vae ..\..\ComfyUI\models\vae\wan_2.1_vae.safetensors --t5xxl ..\..\ComfyUI\models\text_encoders\umt5-xxl-encoder-Q8_0.gguf --cfg-scale 3.5 --sampling-method euler --steps 10 --high-noise-cfg-scale 3.5 --high-noise-sampling-method euler --high-noise-steps 8 -v -p "glass flower blossom" -n "色调艳丽，过曝，静态，细节模糊不清，字幕，风格，作品，画作，画面，静止，整体发灰，最差质量，低质量，JPEG压缩残留，丑陋的，残缺的，多余的手指，画得不好的手部，画得不好的脸部，畸形的，毁容的，形态畸形的肢体，手指融合，静止不动的画面，杂乱的背景，三条腿，背景人很多，倒着走" -W 480 -H 832 --diffusion-fa --video-frames 33 --offload-to-cpu --init-img ..\..\ComfyUI\input\start_image.png --end-img ..\..\ComfyUI\input\end_image.png
 ```
 <video src=../assets/wan/Wan2.2_14B_flf2v.mp4 controls="controls" muted="muted" type="video/mp4"></video>
--- a/esrgan.hpp
+++ b/esrgan.hpp
@ -142,8 +142,10 @@ struct ESRGAN : public GGMLRunner {
    int scale     = 4;
    int tile_size = 128;  // avoid cuda OOM for 4gb VRAM
-    ESRGAN(ggml_backend_t backend, const String2GGMLType& tensor_types = {})
+    ESRGAN(ggml_backend_t backend,
-        : GGMLRunner(backend) {
+           bool offload_params_to_cpu,
           const String2GGMLType& tensor_types = {})
        : GGMLRunner(backend, offload_params_to_cpu) {
        rrdb_net.init(params_ctx, tensor_types, "");
    }
@ -175,7 +177,7 @@ struct ESRGAN : public GGMLRunner {
            return false;
        }
-        bool success = model_loader.load_tensors(esrgan_tensors, backend);
+        bool success = model_loader.load_tensors(esrgan_tensors);
        if (!success) {
            LOG_ERROR("load esrgan tensors from model loader failed");
--- a/examples/cli/avi_writer.h
+++ b/examples/cli/avi_writer.h
@ -0,0 +1,217 @@
 #ifndef __AVI_WRITER_H__
 #define __AVI_WRITER_H__
 #include <stdint.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
 #include "stable-diffusion.h"
 #ifndef INCLUDE_STB_IMAGE_WRITE_H
 #include "stb_image_write.h"
 #endif
 typedef struct {
    uint32_t offset;
    uint32_t size;
 } avi_index_entry;
 // Write 32-bit little-endian integer
 void write_u32_le(FILE* f, uint32_t val) {
    fwrite(&val, 4, 1, f);
 }
 // Write 16-bit little-endian integer
 void write_u16_le(FILE* f, uint16_t val) {
    fwrite(&val, 2, 1, f);
 }
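A note on the two helpers above: `fwrite(&val, 4, 1, f)` writes the bytes of `val` in host byte order, which matches the little-endian layout that RIFF/AVI requires only on little-endian hosts. The sketch below shows one way to make the byte order explicit. The names `encode_u32_le`, `encode_u16_le`, `write_u32_le_portable` and `write_u16_le_portable` are hypothetical and are not part of this commit.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>

// Hypothetical helpers for illustration; not part of avi_writer.h.
// Encode `val` least-significant byte first, independent of host byte order.
inline void encode_u32_le(uint32_t val, unsigned char out[4]) {
    for (int i = 0; i < 4; ++i) {
        out[i] = static_cast<unsigned char>((val >> (8 * i)) & 0xFF);
    }
}

inline void encode_u16_le(uint16_t val, unsigned char out[2]) {
    out[0] = static_cast<unsigned char>(val & 0xFF);
    out[1] = static_cast<unsigned char>((val >> 8) & 0xFF);
}

// Returns true when all 4 bytes were written to `f`.
inline bool write_u32_le_portable(FILE* f, uint32_t val) {
    unsigned char bytes[4];
    encode_u32_le(val, bytes);
    return fwrite(bytes, 1, sizeof(bytes), f) == sizeof(bytes);
}

// Returns true when both bytes were written to `f`.
inline bool write_u16_le_portable(FILE* f, uint16_t val) {
    unsigned char bytes[2];
    encode_u16_le(val, bytes);
    return fwrite(bytes, 1, sizeof(bytes), f) == sizeof(bytes);
}
```

Returning a success flag also lets a caller detect short writes, which the `void` helpers in the header do not report.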
 /**
 * Create an MJPG AVI file from an array of sd_image_t images.
 * Images are encoded to JPEG using stb_image_write.
 *
 * @param filename Output AVI file name.
 * @param images Array of input images.
 * @param num_images Number of images in the array.
 * @param fps Frames per second for the video.
 * @param quality JPEG quality (0-100).
 * @return 0 on success, -1 on failure.
 */
 int create_mjpg_avi_from_sd_images(const char* filename, sd_image_t* images, int num_images, int fps, int quality = 90) {
    if (num_images == 0) {
        fprintf(stderr, "Error: Image array is empty.\n");
        return -1;
    }
    FILE* f = fopen(filename, "wb");
    if (!f) {
        perror("Error opening file for writing");
        return -1;
    }
    uint32_t width    = images[0].width;
    uint32_t height   = images[0].height;
    uint32_t channels = images[0].channel;
    if (channels != 3 && channels != 4) {
        fprintf(stderr, "Error: Unsupported channel count: %u\n", channels);
        fclose(f);
        return -1;
    }
    // --- RIFF AVI Header ---
    fwrite("RIFF", 4, 1, f);
    long riff_size_pos = ftell(f);
    write_u32_le(f, 0);  // Placeholder for file size
    fwrite("AVI ", 4, 1, f);
    // 'hdrl' LIST (header list)
    fwrite("LIST", 4, 1, f);
    write_u32_le(f, 4 + 8 + 56 + 8 + 4 + 8 + 56 + 8 + 40);
    fwrite("hdrl", 4, 1, f);
    // 'avih' chunk (AVI main header)
    fwrite("avih", 4, 1, f);
    write_u32_le(f, 56);
    write_u32_le(f, 1000000 / fps);       // Microseconds per frame
    write_u32_le(f, 0);                   // Max bytes per second
    write_u32_le(f, 0);                   // Padding granularity
    write_u32_le(f, 0x110);               // Flags (HASINDEX | ISINTERLEAVED)
    write_u32_le(f, num_images);          // Total frames
    write_u32_le(f, 0);                   // Initial frames
    write_u32_le(f, 1);                   // Number of streams
    write_u32_le(f, width * height * 3);  // Suggested buffer size
    write_u32_le(f, width);
    write_u32_le(f, height);
    write_u32_le(f, 0);  // Reserved
    write_u32_le(f, 0);  // Reserved
    write_u32_le(f, 0);  // Reserved
    write_u32_le(f, 0);  // Reserved
    // 'strl' LIST (stream list)
    fwrite("LIST", 4, 1, f);
    write_u32_le(f, 4 + 8 + 56 + 8 + 40);
    fwrite("strl", 4, 1, f);
    // 'strh' chunk (stream header)
    fwrite("strh", 4, 1, f);
    write_u32_le(f, 56);
    fwrite("vids", 4, 1, f);              // Stream type: video
    fwrite("MJPG", 4, 1, f);              // Codec: Motion JPEG
    write_u32_le(f, 0);                   // Flags
    write_u16_le(f, 0);                   // Priority
    write_u16_le(f, 0);                   // Language
    write_u32_le(f, 0);                   // Initial frames
    write_u32_le(f, 1);                   // Scale
    write_u32_le(f, fps);                 // Rate
    write_u32_le(f, 0);                   // Start
    write_u32_le(f, num_images);          // Length
    write_u32_le(f, width * height * 3);  // Suggested buffer size
    write_u32_le(f, (uint32_t)-1);        // Quality
    write_u32_le(f, 0);                   // Sample size
    write_u16_le(f, 0);                   // rcFrame.left
    write_u16_le(f, 0);                   // rcFrame.top
    write_u16_le(f, 0);                   // rcFrame.right
    write_u16_le(f, 0);                   // rcFrame.bottom
    // 'strf' chunk (stream format: BITMAPINFOHEADER)
    fwrite("strf", 4, 1, f);
    write_u32_le(f, 40);
    write_u32_le(f, 40);  // biSize
    write_u32_le(f, width);
    write_u32_le(f, height);
    write_u16_le(f, 1);                   // biPlanes
    write_u16_le(f, 24);                  // biBitCount
    fwrite("MJPG", 4, 1, f);              // biCompression (FOURCC)
    write_u32_le(f, width * height * 3);  // biSizeImage
    write_u32_le(f, 0);                   // XPelsPerMeter
    write_u32_le(f, 0);                   // YPelsPerMeter
    write_u32_le(f, 0);                   // Colors used
    write_u32_le(f, 0);                   // Colors important
    // 'movi' LIST (video frames)
    long movi_list_pos = ftell(f);
    fwrite("LIST", 4, 1, f);
    long movi_size_pos = ftell(f);
    write_u32_le(f, 0);  // Placeholder for movi size
    fwrite("movi", 4, 1, f);
    avi_index_entry* index = (avi_index_entry*)malloc(sizeof(avi_index_entry) * num_images);
    if (!index) {
        fclose(f);
        return -1;
    }
    // Encode and write each frame as JPEG
    struct {
        uint8_t* buf;
        size_t size;
    } jpeg_data;
    for (int i = 0; i < num_images; i++) {
        jpeg_data.buf  = NULL;
        jpeg_data.size = 0;
        // Callback function to collect JPEG data into memory
        auto write_to_buf = [](void* context, void* data, int size) {
            auto jd = (decltype(jpeg_data)*)context;
            jd->buf = (uint8_t*)realloc(jd->buf, jd->size + size);
            memcpy(jd->buf + jd->size, data, size);
            jd->size += size;
        };
        // Encode to JPEG in memory
        stbi_write_jpg_to_func(
            write_to_buf,
            &jpeg_data,
            images[i].width,
            images[i].height,
            channels,
            images[i].data,
            quality);
        // Write '00dc' chunk (video frame)
        fwrite("00dc", 4, 1, f);
        write_u32_le(f, jpeg_data.size);
        index[i].offset = ftell(f) - 8;
        index[i].size   = jpeg_data.size;
        fwrite(jpeg_data.buf, 1, jpeg_data.size, f);
        // Align to even byte size
        if (jpeg_data.size % 2)
            fputc(0, f);
        free(jpeg_data.buf);
    }
    // Finalize 'movi' size
    long cur_pos   = ftell(f);
    long movi_size = cur_pos - movi_size_pos - 4;
    fseek(f, movi_size_pos, SEEK_SET);
    write_u32_le(f, movi_size);
    fseek(f, cur_pos, SEEK_SET);
    // Write 'idx1' index
    fwrite("idx1", 4, 1, f);
    write_u32_le(f, num_images * 16);
    for (int i = 0; i < num_images; i++) {
        fwrite("00dc", 4, 1, f);
        write_u32_le(f, 0x10);
        write_u32_le(f, index[i].offset);
        write_u32_le(f, index[i].size);
    }
    // Finalize RIFF size
    cur_pos        = ftell(f);
    long file_size = cur_pos - riff_size_pos - 4;
    fseek(f, riff_size_pos, SEEK_SET);
    write_u32_le(f, file_size);
    fseek(f, cur_pos, SEEK_SET);
    fclose(f);
    free(index);
    return 0;
 }
 #endif  // __AVI_WRITER_H__
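Note on the `write_u32_le` / `write_u16_le` helpers in `avi_writer.h` above: `fwrite(&val, 4, 1, f)` emits the bytes in host order, which is little-endian only on a little-endian host. The following is a minimal sketch of a byte-wise alternative, not code from this commit; `put_u16_le` and `put_u32_le` are hypothetical names chosen for illustration.

```cpp
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

// Hypothetical helpers (not part of avi_writer.h): emit the value one byte at a
// time, least significant byte first, so the on-disk layout does not depend on
// the byte order of the host.
static inline void put_u16_le(FILE* f, uint16_t val) {
    assert(f != NULL);
    const uint8_t bytes[2] = {
        (uint8_t)(val & 0xFF),
        (uint8_t)((val >> 8) & 0xFF),
    };
    fwrite(bytes, 1, sizeof(bytes), f);
}

static inline void put_u32_le(FILE* f, uint32_t val) {
    assert(f != NULL);
    const uint8_t bytes[4] = {
        (uint8_t)(val & 0xFF),
        (uint8_t)((val >> 8) & 0xFF),
        (uint8_t)((val >> 16) & 0xFF),
        (uint8_t)((val >> 24) & 0xFF),
    };
    fwrite(bytes, 1, sizeof(bytes), f);
}
```

On x86 and most ARM targets both variants produce the same file; the byte-wise form only matters if the CLI is ever built for a big-endian target.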
--- a/examples/cli/main.cpp
+++ b/examples/cli/main.cpp
--- a/flux.hpp
+++ b/flux.hpp
@ -5,6 +5,7 @@
 #include "ggml_extend.hpp"
 #include "model.h"
 #include "rope.hpp"
 #define FLUX_GRAPH_SIZE 10240
@ -610,179 +611,11 @@ namespace Flux {
    };
    struct Flux : public GGMLBlock {
    public:
        std::vector<float> linspace(float start, float end, int num) {
            std::vector<float> result(num);
            float step = (end - start) / (num - 1);
            for (int i = 0; i < num; ++i) {
                result[i] = start + i * step;
            }
            return result;
        }
        std::vector<std::vector<float>> transpose(const std::vector<std::vector<float>>& mat) {
            int rows = mat.size();
            int cols = mat[0].size();
            std::vector<std::vector<float>> transposed(cols, std::vector<float>(rows));
            for (int i = 0; i < rows; ++i) {
                for (int j = 0; j < cols; ++j) {
                    transposed[j][i] = mat[i][j];
                }
            }
            return transposed;
        }
        std::vector<float> flatten(const std::vector<std::vector<float>>& vec) {
            std::vector<float> flat_vec;
            for (const auto& sub_vec : vec) {
                flat_vec.insert(flat_vec.end(), sub_vec.begin(), sub_vec.end());
            }
            return flat_vec;
        }
        std::vector<std::vector<float>> rope(const std::vector<float>& pos, int dim, int theta) {
            assert(dim % 2 == 0);
            int half_dim = dim / 2;
            std::vector<float> scale = linspace(0, (dim * 1.0f - 2) / dim, half_dim);
            std::vector<float> omega(half_dim);
            for (int i = 0; i < half_dim; ++i) {
                omega[i] = 1.0 / std::pow(theta, scale[i]);
            }
            int pos_size = pos.size();
            std::vector<std::vector<float>> out(pos_size, std::vector<float>(half_dim));
            for (int i = 0; i < pos_size; ++i) {
                for (int j = 0; j < half_dim; ++j) {
                    out[i][j] = pos[i] * omega[j];
                }
            }
            std::vector<std::vector<float>> result(pos_size, std::vector<float>(half_dim * 4));
            for (int i = 0; i < pos_size; ++i) {
                for (int j = 0; j < half_dim; ++j) {
                    result[i][4 * j]     = std::cos(out[i][j]);
                    result[i][4 * j + 1] = -std::sin(out[i][j]);
                    result[i][4 * j + 2] = std::sin(out[i][j]);
                    result[i][4 * j + 3] = std::cos(out[i][j]);
                }
            }
            return result;
        }
        // Generate IDs for image patches and text
        std::vector<std::vector<float>> gen_txt_ids(int bs, int context_len) {
            return std::vector<std::vector<float>>(bs * context_len, std::vector<float>(3, 0.0));
        }
        std::vector<std::vector<float>> gen_img_ids(int h, int w, int patch_size, int bs, int index = 0, int h_offset = 0, int w_offset = 0) {
            int h_len = (h + (patch_size / 2)) / patch_size;
            int w_len = (w + (patch_size / 2)) / patch_size;
            std::vector<std::vector<float>> img_ids(h_len * w_len, std::vector<float>(3, 0.0));
            std::vector<float> row_ids = linspace(h_offset, h_len - 1 + h_offset, h_len);
            std::vector<float> col_ids = linspace(w_offset, w_len - 1 + w_offset, w_len);
            for (int i = 0; i < h_len; ++i) {
                for (int j = 0; j < w_len; ++j) {
                    img_ids[i * w_len + j][0] = index;
                    img_ids[i * w_len + j][1] = row_ids[i];
                    img_ids[i * w_len + j][2] = col_ids[j];
                }
            }
            std::vector<std::vector<float>> img_ids_repeated(bs * img_ids.size(), std::vector<float>(3));
            for (int i = 0; i < bs; ++i) {
                for (int j = 0; j < img_ids.size(); ++j) {
                    img_ids_repeated[i * img_ids.size() + j] = img_ids[j];
                }
            }
            return img_ids_repeated;
        }
        std::vector<std::vector<float>> concat_ids(const std::vector<std::vector<float>>& a,
                                                   const std::vector<std::vector<float>>& b,
                                                   int bs) {
            size_t a_len = a.size() / bs;
            size_t b_len = b.size() / bs;
            std::vector<std::vector<float>> ids(a.size() + b.size(), std::vector<float>(3));
            for (int i = 0; i < bs; ++i) {
                for (int j = 0; j < a_len; ++j) {
                    ids[i * (a_len + b_len) + j] = a[i * a_len + j];
                }
                for (int j = 0; j < b_len; ++j) {
                    ids[i * (a_len + b_len) + a_len + j] = b[i * b_len + j];
                }
            }
            return ids;
        }
        std::vector<std::vector<float>> gen_ids(int h, int w, int patch_size, int bs, int context_len, std::vector<ggml_tensor*> ref_latents) {
            auto txt_ids = gen_txt_ids(bs, context_len);
            auto img_ids = gen_img_ids(h, w, patch_size, bs);
            auto ids               = concat_ids(txt_ids, img_ids, bs);
            uint64_t curr_h_offset = 0;
            uint64_t curr_w_offset = 0;
            for (ggml_tensor* ref : ref_latents) {
                uint64_t h_offset = 0;
                uint64_t w_offset = 0;
                if (ref->ne[1] + curr_h_offset > ref->ne[0] + curr_w_offset) {
                    w_offset = curr_w_offset;
                } else {
                    h_offset = curr_h_offset;
                }
                auto ref_ids = gen_img_ids(ref->ne[1], ref->ne[0], patch_size, bs, 1, h_offset, w_offset);
                ids          = concat_ids(ids, ref_ids, bs);
                curr_h_offset = std::max(curr_h_offset, ref->ne[1] + h_offset);
                curr_w_offset = std::max(curr_w_offset, ref->ne[0] + w_offset);
            }
            return ids;
        }
        // Generate positional embeddings
        std::vector<float> gen_pe(int h, int w, int patch_size, int bs, int context_len, std::vector<ggml_tensor*> ref_latents, int theta, const std::vector<int>& axes_dim) {
            std::vector<std::vector<float>> ids       = gen_ids(h, w, patch_size, bs, context_len, ref_latents);
            std::vector<std::vector<float>> trans_ids = transpose(ids);
            size_t pos_len                            = ids.size();
            int num_axes                              = axes_dim.size();
            for (int i = 0; i < pos_len; i++) {
                // std::cout << trans_ids[0][i] << " " << trans_ids[1][i] << " " << trans_ids[2][i] << std::endl;
            }
            int emb_dim = 0;
            for (int d : axes_dim)
                emb_dim += d / 2;
            std::vector<std::vector<float>> emb(bs * pos_len, std::vector<float>(emb_dim * 2 * 2, 0.0));
            int offset = 0;
            for (int i = 0; i < num_axes; ++i) {
                std::vector<std::vector<float>> rope_emb = rope(trans_ids[i], axes_dim[i], theta);  // [bs*pos_len, axes_dim[i]/2 * 2 * 2]
                for (int b = 0; b < bs; ++b) {
                    for (int j = 0; j < pos_len; ++j) {
                        for (int k = 0; k < rope_emb[0].size(); ++k) {
                            emb[b * pos_len + j][offset + k] = rope_emb[j][k];
                        }
                    }
                }
                offset += rope_emb[0].size();
            }
            return flatten(emb);
        }
    public:
        FluxParams params;
        Flux() {}
        Flux(FluxParams params)
            : params(params) {
            int64_t pe_dim = params.hidden_size / params.num_heads;
            blocks["img_in"] = std::shared_ptr<GGMLBlock>(new Linear(params.in_channels, params.hidden_size, true));
            if (params.is_chroma) {
                blocks["distilled_guidance_layer"] = std::shared_ptr<GGMLBlock>(new ChromaApproximator(params.in_channels, params.hidden_size));
@ -1048,12 +881,13 @@ namespace Flux {
        bool use_mask = false;
        FluxRunner(ggml_backend_t backend,
                   bool offload_params_to_cpu,
                   const String2GGMLType& tensor_types = {},
                   const std::string prefix            = "",
                   SDVersion version                   = VERSION_FLUX,
                   bool flash_attn                     = false,
                   bool use_mask                       = false)
-            : GGMLRunner(backend), use_mask(use_mask) {
+            : GGMLRunner(backend, offload_params_to_cpu), use_mask(use_mask) {
            flux_params.flash_attn          = flash_attn;
            flux_params.guidance_embed      = false;
            flux_params.depth               = 0;
@ -1063,7 +897,7 @@ namespace Flux {
            }
            for (auto pair : tensor_types) {
                std::string tensor_name = pair.first;
-                if (tensor_name.find("model.diffusion_model.") == std::string::npos)
+                if (!starts_with(tensor_name, prefix))
                    continue;
                if (tensor_name.find("guidance_in.in_layer.weight") != std::string::npos) {
                    // not schnell
@ -1150,7 +984,14 @@ namespace Flux {
                ref_latents[i] = to_backend(ref_latents[i]);
            }
-            pe_vec      = flux.gen_pe(x->ne[1], x->ne[0], 2, x->ne[3], context->ne[1], ref_latents, flux_params.theta, flux_params.axes_dim);
+            pe_vec      = Rope::gen_flux_pe(x->ne[1],
                                            x->ne[0],
                                            2,
                                            x->ne[3],
                                            context->ne[1],
                                            ref_latents,
                                            flux_params.theta,
                                            flux_params.axes_dim);
            int pos_len = pe_vec.size() / flux_params.axes_dim_sum / 2;
            // LOG_DEBUG("pos_len %d", pos_len);
            auto pe = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, 2, 2, flux_params.axes_dim_sum / 2, pos_len);
@ -1245,7 +1086,7 @@ namespace Flux {
            // ggml_backend_t backend    = ggml_backend_cuda_init(0);
            ggml_backend_t backend           = ggml_backend_cpu_init();
            ggml_type model_data_type        = GGML_TYPE_Q8_0;
-            std::shared_ptr<FluxRunner> flux = std::shared_ptr<FluxRunner>(new FluxRunner(backend));
+            std::shared_ptr<FluxRunner> flux = std::shared_ptr<FluxRunner>(new FluxRunner(backend, false));
            {
                LOG_INFO("loading from '%s'", file_path.c_str());
@ -1259,7 +1100,7 @@ namespace Flux {
                    return;
                }
-                bool success = model_loader.load_tensors(tensors, backend);
+                bool success = model_loader.load_tensors(tensors);
                if (!success) {
                    LOG_ERROR("load tensors from model loader failed");
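For readers following the positional-embedding change in `flux.hpp`: the `rope` helper shown in the large hunk above builds, for every position and frequency index, a 2x2 rotation stored as `[cos, -sin, sin, cos]`, and the call site now uses `Rope::gen_flux_pe` instead of `flux.gen_pe`. The sketch below restates that per-axis table in standalone form. `rope_table` is a hypothetical name, and the exponent is written as `2j/dim`, which should match the original `linspace`-based scale whenever `dim > 2` while also staying well defined for `dim == 2`.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Illustrative restatement of the per-axis RoPE table: one row per position,
// four floats per frequency index laid out as [cos, -sin, sin, cos].
static std::vector<std::vector<float>> rope_table(const std::vector<float>& pos, int dim, float theta) {
    assert(dim % 2 == 0 && dim >= 2);
    const int half_dim = dim / 2;
    std::vector<std::vector<float>> result(pos.size(), std::vector<float>(half_dim * 4));
    for (size_t i = 0; i < pos.size(); ++i) {
        for (int j = 0; j < half_dim; ++j) {
            // Frequency for index j: theta^(-2j/dim).
            const float omega = 1.0f / std::pow(theta, (2.0f * j) / dim);
            const float angle = pos[i] * omega;
            result[i][4 * j]     = std::cos(angle);
            result[i][4 * j + 1] = -std::sin(angle);
            result[i][4 * j + 2] = std::sin(angle);
            result[i][4 * j + 3] = std::cos(angle);
        }
    }
    return result;
}
```

Position 0 always maps to the identity rotation, which is why the all-zero text IDs from `gen_txt_ids` leave the text tokens unrotated.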
--- a/format-code.sh
+++ b/format-code.sh
@ -1,2 +1,5 @@
-clang-format -style=file -i *.cpp *.h *.hpp
-clang-format -style=file -i examples/cli/*.cpp
+for f in *.cpp *.h *.hpp examples/cli/*.cpp examples/cli/*.h; do
+  [[ "$f" == vocab* ]] && continue
+  echo "formatting '$f'"
+  clang-format -style=file -i "$f"
+done
--- a/ggml
+++ b/ggml
@ -1 +1 @@
-Subproject commit 7dee1d6a1e7611f238d09be96738388da97c88ed
+Subproject commit 5fdc78fff274094e2a1b155928131983362d8a71
--- a/ggml_extend.hpp
+++ b/ggml_extend.hpp
@ -212,7 +212,7 @@ __STATIC_INLINE__ void print_ggml_tensor(struct ggml_tensor* tensor, bool shape_
                    if (tensor->type == GGML_TYPE_F32) {
                        printf("  [%d, %d, %d, %d] = %f\n", i, j, k, l, ggml_tensor_get_f32(tensor, l, k, j, i));
                    } else if (tensor->type == GGML_TYPE_F16) {
-                        printf("  [%d, %d, %d, %d] = %i\n", i, j, k, l, ggml_tensor_get_f16(tensor, l, k, j, i));
+                        printf("  [%d, %d, %d, %d] = %f\n", i, j, k, l, ggml_fp16_to_fp32(ggml_tensor_get_f16(tensor, l, k, j, i)));
                    } else if (tensor->type == GGML_TYPE_I32) {
                        printf("  [%d, %d, %d, %d] = %i\n", i, j, k, l, ggml_tensor_get_i32(tensor, l, k, j, i));
                    }
@ -237,6 +237,8 @@ __STATIC_INLINE__ ggml_tensor* load_tensor_from_file(ggml_context* ctx, const st
    file.read(reinterpret_cast<char*>(&length), sizeof(length));
    file.read(reinterpret_cast<char*>(&ttype), sizeof(ttype));
    LOG_DEBUG("load_tensor_from_file %d %d %d", n_dims, length, ttype);
    if (file.eof()) {
        LOG_ERROR("incomplete file '%s'", file_path.c_str());
        return NULL;
@ -325,17 +327,27 @@ __STATIC_INLINE__ uint8_t* sd_tensor_to_image(struct ggml_tensor* input) {
    return image_data;
 }
-__STATIC_INLINE__ uint8_t* sd_tensor_to_mul_image(struct ggml_tensor* input, int idx) {
-    int64_t width    = input->ne[0];
-    int64_t height   = input->ne[1];
-    int64_t channels = input->ne[2];
+__STATIC_INLINE__ uint8_t* sd_tensor_to_image(struct ggml_tensor* input, int idx, bool video = false) {
+    int64_t width  = input->ne[0];
+    int64_t height = input->ne[1];
+    int64_t channels;
+    if (video) {
+        channels = input->ne[3];
+    } else {
+        channels = input->ne[2];
+    }
    GGML_ASSERT(channels == 3 && input->type == GGML_TYPE_F32);
    uint8_t* image_data = (uint8_t*)malloc(width * height * channels);
-    for (int iy = 0; iy < height; iy++) {
-        for (int ix = 0; ix < width; ix++) {
-            for (int k = 0; k < channels; k++) {
-                float value                                               = ggml_tensor_get_f32(input, ix, iy, k, idx);
-                *(image_data + iy * width * channels + ix * channels + k) = (uint8_t)(value * 255.0f);
+    for (int ih = 0; ih < height; ih++) {
+        for (int iw = 0; iw < width; iw++) {
+            for (int ic = 0; ic < channels; ic++) {
+                float value;
+                if (video) {
+                    value = ggml_tensor_get_f32(input, iw, ih, idx, ic);
+                } else {
+                    value = ggml_tensor_get_f32(input, iw, ih, ic, idx);
+                }
+                *(image_data + ih * width * channels + iw * channels + ic) = (uint8_t)(value * 255.0f);
            }
        }
    }
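The `video` flag in `sd_tensor_to_image` above only changes which tensor axes hold the channel and the frame index: images are laid out as `[W, H, C, N]`, video as `[W, H, T, C]`. The sketch below reproduces that indexing without ggml. `Tensor4` and `to_rgb8` are hypothetical stand-ins, and the clamp to `[0, 1]` is an addition of this sketch; the function in the diff casts the value directly.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a contiguous F32 ggml tensor: ne[0] varies fastest.
struct Tensor4 {
    int64_t ne[4];
    std::vector<float> data;
    float at(int64_t i0, int64_t i1, int64_t i2, int64_t i3) const {
        return data[((i3 * ne[2] + i2) * ne[1] + i1) * ne[0] + i0];
    }
};

// Extract image/frame `idx` as interleaved 8-bit RGB (row-major, 3 bytes per pixel).
static std::vector<uint8_t> to_rgb8(const Tensor4& t, int64_t idx, bool video) {
    const int64_t width    = t.ne[0];
    const int64_t height   = t.ne[1];
    const int64_t channels = video ? t.ne[3] : t.ne[2];
    assert(channels == 3);
    std::vector<uint8_t> out(width * height * channels);
    for (int64_t ih = 0; ih < height; ih++) {
        for (int64_t iw = 0; iw < width; iw++) {
            for (int64_t ic = 0; ic < channels; ic++) {
                // Video: axis 2 is the frame, axis 3 the channel. Image: the reverse.
                float v = video ? t.at(iw, ih, idx, ic) : t.at(iw, ih, ic, idx);
                // Clamp before the cast so out-of-range values cannot overflow uint8_t.
                v = v < 0.0f ? 0.0f : (v > 1.0f ? 1.0f : v);
                out[(ih * width + iw) * channels + ic] = (uint8_t)(v * 255.0f);
            }
        }
    }
    return out;
}
```

Calling `to_rgb8(t, frame, true)` once per frame yields buffers that could be handed to a writer such as `create_mjpg_avi_from_sd_images`, assuming they are first wrapped in `sd_image_t`.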
@ -581,7 +593,7 @@ __STATIC_INLINE__ struct ggml_tensor* ggml_tensor_concat(struct ggml_context* ct
 }
 // convert values from [0, 1] to [-1, 1]
-__STATIC_INLINE__ void ggml_tensor_scale_input(struct ggml_tensor* src) {
+__STATIC_INLINE__ void process_vae_input_tensor(struct ggml_tensor* src) {
    int64_t nelements = ggml_nelements(src);
    float* data       = (float*)src->data;
    for (int i = 0; i < nelements; i++) {
@ -591,7 +603,7 @@ __STATIC_INLINE__ void ggml_tensor_scale_input(struct ggml_tensor* src) {
 }
 // convert values from [-1, 1] to [0, 1]
-__STATIC_INLINE__ void ggml_tensor_scale_output(struct ggml_tensor* src) {
+__STATIC_INLINE__ void process_vae_output_tensor(struct ggml_tensor* src) {
    int64_t nelements = ggml_nelements(src);
    float* data       = (float*)src->data;
    for (int i = 0; i < nelements; i++) {
@ -600,6 +612,125 @@ __STATIC_INLINE__ void ggml_tensor_scale_output(struct ggml_tensor* src) {
    }
 }
 __STATIC_INLINE__ struct ggml_tensor* ggml_nn_cont(struct ggml_context* ctx,
                                                   struct ggml_tensor* x) {
    if (ggml_is_contiguous(x)) {
        return x;
    }
    return ggml_cont(ctx, x);
 }
 // torch like permute
 __STATIC_INLINE__ struct ggml_tensor* ggml_torch_permute(struct ggml_context* ctx,
                                                         struct ggml_tensor* x,
                                                         int axis0,
                                                         int axis1,
                                                         int axis2,
                                                         int axis3) {
    int torch_axes[4] = {axis0, axis1, axis2, axis3};
    int ggml_axes[4] = {0};
    for (int i = 0; i < 4; ++i) {
        int found = 0;
        for (int j = 0; j < 4; ++j) {
            if (torch_axes[j] == i) {
                ggml_axes[i] = j;
                found        = 1;
                break;
            }
        }
        GGML_ASSERT(found && "Invalid permute input: must be a permutation of 0-3");
    }
    return ggml_permute(ctx, x, ggml_axes[0], ggml_axes[1], ggml_axes[2], ggml_axes[3]);
 }
 __STATIC_INLINE__ struct ggml_tensor* ggml_slice(struct ggml_context* ctx,
                                                 struct ggml_tensor* x,
                                                 int64_t dim,
                                                 int64_t start,
                                                 int64_t end) {
    GGML_ASSERT(dim >= 0 && dim < 4);
    if (x->ne[dim] == 1) {
        return x;
    }
    while (start < 0) {
        start = x->ne[dim] + start;
    }
    while (end < 0) {
        end = x->ne[dim] + end;
    }
    GGML_ASSERT(end > start);
    GGML_ASSERT(start >= 0 && start < x->ne[dim]);
    GGML_ASSERT(end > start && end <= x->ne[dim]);
    int perm[4] = {0, 1, 2, 3};
    for (int i = dim; i < 3; ++i)
        perm[i] = perm[i + 1];
    perm[3] = dim;
    int inv_perm[4];
    for (int i = 0; i < 4; ++i)
        inv_perm[perm[i]] = i;
    if (dim != 3) {
        x = ggml_torch_permute(ctx, x, perm[0], perm[1], perm[2], perm[3]);
        x = ggml_cont(ctx, x);
    }
    x = ggml_view_4d(
        ctx, x,
        x->ne[0], x->ne[1], x->ne[2], end - start,
        x->nb[1], x->nb[2], x->nb[3], x->nb[3] * start);
    if (dim != 3) {
        x = ggml_torch_permute(ctx, x, inv_perm[0], inv_perm[1], inv_perm[2], inv_perm[3]);
        x = ggml_cont(ctx, x);
    }
    return x;
 }
 // example: [N, 3*C, H, W] => ([N, C, H, W], [N, C, H, W], [N, C, H, W])
 __STATIC_INLINE__ std::vector<struct ggml_tensor*> ggml_chunk(struct ggml_context* ctx,
                                                              struct ggml_tensor* x,
                                                              int num,
                                                              int64_t dim) {
    GGML_ASSERT(dim >= 0 && dim < 4);
    GGML_ASSERT(x->ne[dim] % num == 0);
    int perm[4] = {0, 1, 2, 3};
    for (int i = dim; i < 3; ++i)
        perm[i] = perm[i + 1];
    perm[3] = dim;
    int inv_perm[4];
    for (int i = 0; i < 4; ++i)
        inv_perm[perm[i]] = i;
    if (dim != 3) {
        x = ggml_torch_permute(ctx, x, perm[0], perm[1], perm[2], perm[3]);
        x = ggml_cont(ctx, x);
    }
    std::vector<struct ggml_tensor*> chunks;
    int64_t chunk_size = x->ne[3] / num;
    for (int i = 0; i < num; i++) {
        auto chunk = ggml_view_4d(
            ctx, x,
            x->ne[0], x->ne[1], x->ne[2], chunk_size,
            x->nb[1], x->nb[2], x->nb[3], x->nb[3] * i * chunk_size);
        if (dim != 3) {
            chunk = ggml_torch_permute(ctx, chunk, inv_perm[0], inv_perm[1], inv_perm[2], inv_perm[3]);
            chunk = ggml_cont(ctx, chunk);
        }
        chunks.push_back(chunk);
    }
    return chunks;
 }
 typedef std::function<void(ggml_tensor*, ggml_tensor*, bool)> on_tile_process;
 // Tiling
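The index bookkeeping in `ggml_torch_permute`, `ggml_slice` and `ggml_chunk` above comes down to two small operations: inverting a permutation of the four axes, and building the permutation that moves a chosen axis to the last position. The sketch below isolates both so they can be checked without ggml; `invert_permutation` and `move_dim_last` are hypothetical names and do not appear in the commit.

```cpp
#include <array>
#include <cassert>

// result[i] is the index j with axes[j] == i, i.e. the inverse permutation.
// This mirrors the search loop in ggml_torch_permute.
static std::array<int, 4> invert_permutation(const std::array<int, 4>& axes) {
    std::array<int, 4> inv{};
    for (int i = 0; i < 4; ++i) {
        bool found = false;
        for (int j = 0; j < 4; ++j) {
            if (axes[j] == i) {
                inv[i] = j;
                found  = true;
                break;
            }
        }
        assert(found && "input must be a permutation of 0-3");
    }
    return inv;
}

// Permutation that keeps the other axes in order and moves `dim` to position 3,
// as done in ggml_slice and ggml_chunk before taking a view along the last axis.
static std::array<int, 4> move_dim_last(int dim) {
    assert(dim >= 0 && dim < 4);
    std::array<int, 4> perm = {0, 1, 2, 3};
    for (int i = dim; i < 3; ++i) {
        perm[i] = perm[i + 1];
    }
    perm[3] = dim;
    return perm;
}
```

For `dim == 3` the permutation is the identity, which is consistent with both functions skipping the permute and `ggml_cont` steps in that case.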
@ -680,7 +811,7 @@ __STATIC_INLINE__ struct ggml_tensor* ggml_nn_linear(struct ggml_context* ctx,
                                                     struct ggml_tensor* b) {
    x = ggml_mul_mat(ctx, w, x);
    if (b != NULL) {
-        x = ggml_add(ctx, x, b);
+        x = ggml_add_inplace(ctx, x, b);
    }
    return x;
 }
@ -703,11 +834,13 @@ __STATIC_INLINE__ struct ggml_tensor* ggml_nn_conv_2d(struct ggml_context* ctx,
    if (b != NULL) {
        b = ggml_reshape_4d(ctx, b, 1, 1, b->ne[0], 1);
        // b = ggml_repeat(ctx, b, x);
-        x = ggml_add(ctx, x, b);
+        x = ggml_add_inplace(ctx, x, b);
    }
    return x;
 }
 // w: [OC*IC, KD, KH, KW]
 // x: [N*IC, ID, IH, IW]
 __STATIC_INLINE__ struct ggml_tensor* ggml_nn_conv_2d_direct(struct ggml_context* ctx,
                                                             struct ggml_tensor* x,
                                                             struct ggml_tensor* w,
@ -730,35 +863,30 @@ __STATIC_INLINE__ struct ggml_tensor* ggml_nn_conv_2d_direct(struct ggml_context
 // w: [OC, IC, KD, 1 * 1]
 // x: [N, IC, IH, IW]
 // b: [OC,]
-// result: [N, OC, OH, OW]
-__STATIC_INLINE__ struct ggml_tensor* ggml_nn_conv_3d_nx1x1_bak(struct ggml_context* ctx,
-                                                                struct ggml_tensor* x,
-                                                                struct ggml_tensor* w,
-                                                                struct ggml_tensor* b,
-                                                                int s2 = 1,
-                                                                int p2 = 1,
-                                                                int d2 = 1) {
-    GGML_ASSERT(w->ne[0] == 1);
-    // timesteps = x.shape[0]
-    // x = rearrange(x, "(b t) c h w -> b c t h w", t=timesteps)
-    // x = conv3d(x)
-    // return rearrange(x, "b c t h w -> (b t) c h w")
-    int64_t T = x->ne[3];
-    int64_t B = x->ne[3] / T;
-    int64_t C = x->ne[2];
-    int64_t H = x->ne[1];
-    int64_t W = x->ne[0];
-    x = ggml_reshape_4d(ctx, x, W * H, C, T, B);           // (b t) c h w -> b t c (h w)
-    x = ggml_cont(ctx, ggml_permute(ctx, x, 0, 2, 1, 3));  // b t c (h w) -> b c t (h w)
-    x = ggml_conv_2d(ctx, w, x, 1, s2, 0, p2, 1, d2);      // [B, OC, T, OH * OW]
+// result: [N*OC, OD, OH, OW]
+__STATIC_INLINE__ struct ggml_tensor* ggml_nn_conv_3d(struct ggml_context* ctx,
+                                                      struct ggml_tensor* x,
+                                                      struct ggml_tensor* w,
+                                                      struct ggml_tensor* b,
+                                                      int64_t IC,
+                                                      int s0 = 1,
+                                                      int s1 = 1,
+                                                      int s2 = 1,
+                                                      int p0 = 0,
+                                                      int p1 = 0,
+                                                      int p2 = 0,
+                                                      int d0 = 1,
+                                                      int d1 = 1,
+                                                      int d2 = 1) {
+    int64_t OC = w->ne[3] / IC;
+    int64_t N  = x->ne[3] / IC;
+    x          = ggml_conv_3d(ctx, w, x, IC, s0, s1, s2, p0, p1, p2, d0, d1, d2);
    if (b != NULL) {
-        b = ggml_reshape_4d(ctx, b, 1, 1, b->ne[0], 1);
-        x = ggml_add(ctx, x, b);
+        b = ggml_reshape_4d(ctx, b, 1, 1, 1, b->ne[0]);  // [OC, 1, 1, 1]
+        x = ggml_add_inplace(ctx, x, b);
    }
-    x = ggml_cont(ctx, ggml_permute(ctx, x, 0, 2, 1, 3));  // b c t (h w) -> b t c (h w)
-    x = ggml_reshape_4d(ctx, x, W, H, C, T * B);           // b t c (h w) -> (b t) c h w
-    return x;                                              // [B*T, OC, OH, OW]
+    return x;
 }
 // w: [OC, IC, KD, 1 * 1]
@ -794,6 +922,54 @@ __STATIC_INLINE__ std::vector<struct ggml_tensor*> split_qkv(struct ggml_context
    return {q, k, v};
 }
 // qkv: [N, 3*C, H, W]
 // return: ([N, C, H, W], [N, C, H, W], [N, C, H, W])
 __STATIC_INLINE__ std::vector<struct ggml_tensor*> split_image_qkv(struct ggml_context* ctx,
                                                                   struct ggml_tensor* qkv) {
    int64_t W   = qkv->ne[0];
    int64_t H   = qkv->ne[1];
    int64_t C   = qkv->ne[2] / 3;
    int64_t N   = qkv->ne[3];
    int64_t nb1 = qkv->nb[1];
    int64_t nb2 = qkv->nb[2];
    qkv         = ggml_reshape_4d(ctx, qkv, W * H, C, 3, N);                 // [N, 3, C, H*W]
    qkv         = ggml_cont(ctx, ggml_torch_permute(ctx, qkv, 0, 1, 3, 2));  // [3, N, C, H*W]
    int64_t offset = qkv->nb[2] * qkv->ne[2];
    auto q         = ggml_view_4d(ctx, qkv, W, H, C, N, nb1, nb2, qkv->nb[3], offset * 0);  // [N, C, H, W]
    auto k         = ggml_view_4d(ctx, qkv, W, H, C, N, nb1, nb2, qkv->nb[3], offset * 1);  // [N, C, H, W]
    auto v         = ggml_view_4d(ctx, qkv, W, H, C, N, nb1, nb2, qkv->nb[3], offset * 2);  // [N, C, H, W]
    return {q, k, v};
 }
 __STATIC_INLINE__ struct ggml_tensor* ggml_full(struct ggml_context* ctx,
                                                float value,
                                                int64_t ne0,
                                                int64_t ne1,
                                                int64_t ne2,
                                                int64_t ne3) {
    auto one = ggml_get_tensor(ctx, "ggml_runner_build_in_tensor:one");
    auto t   = ggml_scale(ctx, one, value);                 // [1,]
    t        = ggml_repeat_4d(ctx, t, ne0, ne1, ne2, ne3);  // [ne0, ne1, ne2, ne3]
    return t;
 }
 __STATIC_INLINE__ struct ggml_tensor* ggml_zeros(struct ggml_context* ctx,
                                                 int64_t ne0,
                                                 int64_t ne1,
                                                 int64_t ne2,
                                                 int64_t ne3) {
    return ggml_full(ctx, 0.f, ne0, ne1, ne2, ne3);
 }
 __STATIC_INLINE__ struct ggml_tensor* ggml_ones(struct ggml_context* ctx,
                                                int64_t ne0,
                                                int64_t ne1,
                                                int64_t ne2,
                                                int64_t ne3) {
    return ggml_full(ctx, 1.f, ne0, ne1, ne2, ne3);
 }
 // q: [N * n_head, n_token, d_head]
 // k: [N * n_head, n_k, d_head]
 // v: [N * n_head, d_head, n_k]
@ -821,6 +997,7 @@ __STATIC_INLINE__ struct ggml_tensor* ggml_nn_attention(struct ggml_context* ctx
 // q: [N, L_q, C] or [N*n_head, L_q, d_head]
 // k: [N, L_k, C] or [N*n_head, L_k, d_head]
 // v: [N, L_k, C] or [N, L_k, n_head, d_head]
 // mask: [N, L_q, L_k]
 // return: [N, L_q, C]
 __STATIC_INLINE__ struct ggml_tensor* ggml_nn_attention_ext(struct ggml_context* ctx,
                                                            struct ggml_tensor* q,
@ -842,13 +1019,13 @@ __STATIC_INLINE__ struct ggml_tensor* ggml_nn_attention_ext(struct ggml_context*
        C      = q->ne[0];
        N      = q->ne[2];
        d_head = C / n_head;
-        q      = ggml_reshape_4d(ctx, q, d_head, n_head, L_q, N);   // [N, L_q, n_head, d_head]
+        q      = ggml_reshape_4d(ctx, q, d_head, n_head, L_q, N);      // [N, L_q, n_head, d_head]
-        q      = ggml_cont(ctx, ggml_permute(ctx, q, 0, 2, 1, 3));  // [N, n_head, L_q, d_head]
+        q      = ggml_nn_cont(ctx, ggml_permute(ctx, q, 0, 2, 1, 3));  // [N, n_head, L_q, d_head]
-        q      = ggml_reshape_3d(ctx, q, d_head, L_q, n_head * N);  // [N * n_head, L_q, d_head]
+        q      = ggml_reshape_3d(ctx, q, d_head, L_q, n_head * N);     // [N * n_head, L_q, d_head]
-        k = ggml_reshape_4d(ctx, k, d_head, n_head, L_k, N);   // [N, L_k, n_head, d_head]
+        k = ggml_reshape_4d(ctx, k, d_head, n_head, L_k, N);      // [N, L_k, n_head, d_head]
-        k = ggml_cont(ctx, ggml_permute(ctx, k, 0, 2, 1, 3));  // [N, n_head, L_k, d_head]
+        k = ggml_nn_cont(ctx, ggml_permute(ctx, k, 0, 2, 1, 3));  // [N, n_head, L_k, d_head]
-        k = ggml_reshape_3d(ctx, k, d_head, L_k, n_head * N);  // [N * n_head, L_k, d_head]
+        k = ggml_reshape_3d(ctx, k, d_head, L_k, n_head * N);     // [N * n_head, L_k, d_head]
        v = ggml_reshape_4d(ctx, v, d_head, n_head, L_k, N);  // [N, L_k, n_head, d_head]
    } else {
@ -862,43 +1039,25 @@ __STATIC_INLINE__ struct ggml_tensor* ggml_nn_attention_ext(struct ggml_context*
    float scale = (1.0f / sqrt((float)d_head));
    int kv_pad = 0;
-    // if (flash_attn) {
+    if (flash_attn) {
-    //     LOG_DEBUG("attention_ext L_q:%d L_k:%d n_head:%d C:%d d_head:%d N:%d", L_q, L_k, n_head, C, d_head, N);
+        // LOG_DEBUG("attention_ext L_q:%d L_k:%d n_head:%d C:%d d_head:%d N:%d", L_q, L_k, n_head, C, d_head, N);
-    // }
+        bool can_use_flash_attn = true;
-    //   is there anything oddly shaped?? ping Green-Sky if you can trip this assert
+        if (can_use_flash_attn && L_k % 256 != 0) {
    GGML_ASSERT(((L_k % 256 == 0) && L_q == L_k) || !(L_k % 256 == 0));
    bool can_use_flash_attn = true;
    can_use_flash_attn      = can_use_flash_attn && (d_head == 64 ||
                                                d_head == 80 ||
                                                d_head == 96 ||
                                                d_head == 112 ||
                                                d_head == 128 ||
                                                d_head == 256);
 #if 0
    can_use_flash_attn      = can_use_flash_attn && L_k % 256 == 0;
 #else
    if (can_use_flash_attn && L_k % 256 != 0) {
        // TODO(Green-Sky): might be worth just padding by default
        if (L_k == 77 || L_k == 4208 || L_k == 3952) {
            kv_pad = GGML_PAD(L_k, 256) - L_k;
-        } else {
+        }
-            can_use_flash_attn = false;
+
        if (mask != nullptr) {
            // TODO(Green-Sky): figure out if we can bend t5 to work too
            can_use_flash_attn = can_use_flash_attn && mask->ne[3] == 1;
        }
        if (!can_use_flash_attn) {
            flash_attn = false;
        }
    }
 #endif
    if (mask != nullptr) {
        // TODO(Green-Sky): figure out if we can bend t5 to work too
        can_use_flash_attn = can_use_flash_attn && mask->ne[2] == 1;
        can_use_flash_attn = can_use_flash_attn && mask->ne[3] == 1;
    }
    // TODO(Green-Sky): more pad or disable for funny tensor shapes
    ggml_tensor* kqv = nullptr;
-    // GGML_ASSERT((flash_attn && can_use_flash_attn) || !flash_attn);
+    if (flash_attn) {
    if (can_use_flash_attn && flash_attn) {
        // LOG_DEBUG(" uses flash attention");
        if (kv_pad != 0) {
            // LOG_DEBUG(" padding k and v dim1 by %d", kv_pad);
@ -906,8 +1065,8 @@ __STATIC_INLINE__ struct ggml_tensor* ggml_nn_attention_ext(struct ggml_context*
        }
        k = ggml_cast(ctx, k, GGML_TYPE_F16);
-        v = ggml_cont(ctx, ggml_permute(ctx, v, 0, 2, 1, 3));  // [N, n_head, L_k, d_head]
+        v = ggml_nn_cont(ctx, ggml_permute(ctx, v, 0, 2, 1, 3));  // [N, n_head, L_k, d_head]
-        v = ggml_reshape_3d(ctx, v, d_head, L_k, n_head * N);  // [N * n_head, L_k, d_head]
+        v = ggml_reshape_3d(ctx, v, d_head, L_k, n_head * N);     // [N * n_head, L_k, d_head]
        if (kv_pad != 0) {
            v = ggml_pad(ctx, v, 0, kv_pad, 0, 0);
        }
@ -915,14 +1074,25 @@ __STATIC_INLINE__ struct ggml_tensor* ggml_nn_attention_ext(struct ggml_context*
        if (mask != nullptr) {
            mask = ggml_transpose(ctx, mask);
-
+        } else {
-            if (mask->ne[1] < GGML_PAD(q->ne[1], GGML_KQ_MASK_PAD)) {
+            if (kv_pad > 0) {
-                LOG_DEBUG("mask dims %ld, %ld, %ld, %ld\n", mask->ne[0], mask->ne[1], mask->ne[2], mask->ne[3]);
+                mask            = ggml_zeros(ctx, L_k, L_q, 1, 1);               // [L_q, L_k]
-                LOG_DEBUG("needs padding, padding from %ld to %ld\n", mask->ne[1], GGML_PAD(q->ne[1], GGML_KQ_MASK_PAD));
+                auto pad_tensor = ggml_full(ctx, -INFINITY, kv_pad, L_q, 1, 1);  // [L_q, kv_pad]
-                mask = ggml_pad(ctx, mask, 0, GGML_PAD(q->ne[1], GGML_KQ_MASK_PAD) - mask->ne[1], 0, 0);
+                mask            = ggml_concat(ctx, mask, pad_tensor, 0);         // [L_q, L_k + kv_pad]
            }
        }
        // mask pad
        if (mask != nullptr) {
            int mask_pad = 0;
            if (mask->ne[1] % GGML_KQ_MASK_PAD != 0) {
                mask_pad = GGML_PAD(L_q, GGML_KQ_MASK_PAD) - mask->ne[1];
            }
            if (mask_pad > 0) {
                mask = ggml_pad(ctx, mask, 0, mask_pad, 0, 0);  // [L_q + mask_pad, L_k + kv_pad]
            }
            mask = ggml_cast(ctx, mask, GGML_TYPE_F16);
            // LOG_DEBUG("L_k: %ld, L_q: %ld, mask->ne[1]: %ld, mask_pad: %d, kv_pad: %d", L_k, L_q, mask->ne[1], mask_pad, kv_pad);
        }
        kqv = ggml_flash_attn_ext(ctx, q, k, v, mask, scale, 0, 0);
@ -931,8 +1101,8 @@ __STATIC_INLINE__ struct ggml_tensor* ggml_nn_attention_ext(struct ggml_context*
        // kqv = ggml_view_3d(ctx, kqv, d_head, n_head, L_k, kqv->nb[1], kqv->nb[2], 0);
        kqv = ggml_view_3d(ctx, kqv, d_head, n_head, L_q, kqv->nb[1], kqv->nb[2], 0);
    } else {
-        v = ggml_cont(ctx, ggml_permute(ctx, v, 1, 2, 0, 3));  // [N, n_head, d_head, L_k]
+        v = ggml_nn_cont(ctx, ggml_permute(ctx, v, 1, 2, 0, 3));  // [N, n_head, d_head, L_k]
-        v = ggml_reshape_3d(ctx, v, L_k, d_head, n_head * N);  // [N * n_head, d_head, L_k]
+        v = ggml_reshape_3d(ctx, v, L_k, d_head, n_head * N);     // [N * n_head, d_head, L_k]
        auto kq = ggml_mul_mat(ctx, k, q);  // [N * n_head, L_q, L_k]
        kq      = ggml_scale_inplace(ctx, kq, scale);
@ -950,7 +1120,7 @@ __STATIC_INLINE__ struct ggml_tensor* ggml_nn_attention_ext(struct ggml_context*
        kqv = ggml_permute(ctx, kqv, 0, 2, 1, 3);                 // [N, L_q, n_head, d_head]
    }
-    kqv = ggml_cont(ctx, kqv);
+    kqv = ggml_nn_cont(ctx, kqv);
    kqv = ggml_reshape_3d(ctx, kqv, d_head * n_head, L_q, N);  // [N, L_q, C]
    return kqv;
@ -963,9 +1133,9 @@ __STATIC_INLINE__ struct ggml_tensor* ggml_nn_layer_norm(struct ggml_context* ct
                                                         float eps = EPS) {
    x = ggml_norm(ctx, x, eps);
    if (w != NULL) {
-        x = ggml_mul(ctx, x, w);
+        x = ggml_mul_inplace(ctx, x, w);
        if (b != NULL) {
-            x = ggml_add(ctx, x, b);
+            x = ggml_add_inplace(ctx, x, b);
        }
    }
    return x;
@ -984,9 +1154,9 @@ __STATIC_INLINE__ struct ggml_tensor* ggml_nn_group_norm(struct ggml_context* ct
    const float eps = 1e-6f;  // default eps parameter
    x               = ggml_group_norm(ctx, x, num_groups, eps);
    if (w != NULL && b != NULL) {
-        x = ggml_mul(ctx, x, w);
+        x = ggml_mul_inplace(ctx, x, w);
        // b = ggml_repeat(ctx, b, x);
-        x = ggml_add(ctx, x, b);
+        x = ggml_add_inplace(ctx, x, b);
    }
    return x;
 }
@ -1005,14 +1175,18 @@ __STATIC_INLINE__ void ggml_backend_tensor_get_and_sync(ggml_backend_t backend,
 }
 __STATIC_INLINE__ float ggml_backend_tensor_get_f32(ggml_tensor* tensor) {
-    GGML_ASSERT(tensor->type == GGML_TYPE_F32 || tensor->type == GGML_TYPE_F16);
+    GGML_ASSERT(tensor->type == GGML_TYPE_F32 || tensor->type == GGML_TYPE_F16 || tensor->type == GGML_TYPE_I32);
    float value;
    if (tensor->type == GGML_TYPE_F32) {
        ggml_backend_tensor_get(tensor, &value, 0, sizeof(value));
-    } else {  // GGML_TYPE_F16
+    } else if (tensor->type == GGML_TYPE_F16) {
        ggml_fp16_t f16_value;
        ggml_backend_tensor_get(tensor, &f16_value, 0, sizeof(f16_value));
        value = ggml_fp16_to_fp32(f16_value);
    } else {  // GGML_TYPE_I32
        int32_t int32_value;
        ggml_backend_tensor_get(tensor, &int32_value, 0, sizeof(int32_value));
        value = (float)int32_value;
    }
    return value;
 }
@ -1116,7 +1290,7 @@ __STATIC_INLINE__ size_t ggml_tensor_num(ggml_context* ctx) {
 /* SDXL with LoRA requires more space */
 #define MAX_PARAMS_TENSOR_NUM 32768
-#define MAX_GRAPH_SIZE 32768
+#define MAX_GRAPH_SIZE 327680
 typedef std::map<std::string, enum ggml_type> String2GGMLType;
@ -1124,15 +1298,27 @@ struct GGMLRunner {
 protected:
    typedef std::function<struct ggml_cgraph*()> get_graph_cb_t;
-    struct ggml_context* params_ctx     = NULL;
+    ggml_backend_t params_backend  = NULL;
-    ggml_backend_buffer_t params_buffer = NULL;
+    ggml_backend_t runtime_backend = NULL;
    struct ggml_context* params_ctx             = NULL;
    ggml_backend_buffer_t params_buffer         = NULL;
    struct ggml_context* offload_ctx            = NULL;
    ggml_backend_buffer_t runtime_params_buffer = NULL;
    bool params_on_runtime_backend              = false;
    struct ggml_context* cache_ctx     = NULL;
    ggml_backend_buffer_t cache_buffer = NULL;
    struct ggml_context* compute_ctx    = NULL;
    struct ggml_gallocr* compute_allocr = NULL;
-    std::map<struct ggml_tensor*, const void*> backend_tensor_data_map;
+    std::vector<float> one_vec = {1.f};
    ggml_tensor* one_tensor    = NULL;
-    ggml_backend_t backend = NULL;
+    std::map<struct ggml_tensor*, const void*> backend_tensor_data_map;
    std::map<std::string, struct ggml_tensor*> cache_tensor_map;  // name -> tensor
    const std::string final_result_name = "ggml_runner_final_result_tensor";
    void alloc_params_ctx() {
        struct ggml_init_params params;
@ -1142,6 +1328,10 @@ protected:
        params_ctx = ggml_init(params);
        GGML_ASSERT(params_ctx != NULL);
        if (params_backend != runtime_backend) {
            offload_ctx = ggml_init(params);
            GGML_ASSERT(offload_ctx != NULL);
        }
    }
    void free_params_ctx() {
@ -1149,6 +1339,27 @@ protected:
            ggml_free(params_ctx);
            params_ctx = NULL;
        }
        if (offload_ctx != NULL) {
            ggml_free(offload_ctx);
            offload_ctx = NULL;
        }
    }
    void alloc_cache_ctx() {
        struct ggml_init_params params;
        params.mem_size   = static_cast<size_t>(MAX_PARAMS_TENSOR_NUM * ggml_tensor_overhead());
        params.mem_buffer = NULL;
        params.no_alloc   = true;
        cache_ctx = ggml_init(params);
        GGML_ASSERT(cache_ctx != NULL);
    }
    void free_cache_ctx() {
        if (cache_ctx != NULL) {
            ggml_free(cache_ctx);
            cache_ctx = NULL;
        }
    }
    void alloc_compute_ctx() {
@ -1168,14 +1379,33 @@ protected:
        }
    }
    void prepare_build_in_tensor_before() {
        one_tensor = ggml_new_tensor_1d(compute_ctx, GGML_TYPE_F32, 1);
        ggml_set_name(one_tensor, "ggml_runner_build_in_tensor:one");
        set_backend_tensor_data(one_tensor, one_vec.data());
    }
    void prepare_build_in_tensor_after(struct ggml_cgraph* gf) {
        ggml_build_forward_expand(gf, one_tensor);
    }
    struct ggml_cgraph* get_compute_graph(get_graph_cb_t get_graph) {
        prepare_build_in_tensor_before();
        struct ggml_cgraph* gf = get_graph();
        auto result            = ggml_graph_node(gf, -1);
        ggml_set_name(result, final_result_name.c_str());
        prepare_build_in_tensor_after(gf);
        return gf;
    }
    bool alloc_compute_buffer(get_graph_cb_t get_graph) {
        if (compute_allocr != NULL) {
            return true;
        }
        reset_compute_ctx();
-        struct ggml_cgraph* gf = get_graph();
+        struct ggml_cgraph* gf = get_compute_graph(get_graph);
        backend_tensor_data_map.clear();
-        compute_allocr = ggml_gallocr_new(ggml_backend_get_default_buffer_type(backend));
+        compute_allocr = ggml_gallocr_new(ggml_backend_get_default_buffer_type(runtime_backend));
        if (!ggml_gallocr_reserve(compute_allocr, gf)) {
            // failed to allocate the compute buffer
@ -1189,11 +1419,47 @@ protected:
        LOG_DEBUG("%s compute buffer size: %.2f MB(%s)",
                  get_desc().c_str(),
                  compute_buffer_size / 1024.0 / 1024.0,
-                  ggml_backend_is_cpu(backend) ? "RAM" : "VRAM");
+                  ggml_backend_is_cpu(runtime_backend) ? "RAM" : "VRAM");
        return true;
    }
-    void cpy_data_to_backend_tensor() {
+    void free_cache_buffer() {
        if (cache_buffer != NULL) {
            ggml_backend_buffer_free(cache_buffer);
            cache_buffer = NULL;
        }
    }
    void copy_cache_tensors_to_cache_buffer() {
        if (cache_tensor_map.size() == 0) {
            return;
        }
        free_cache_ctx_and_buffer();
        alloc_cache_ctx();
        GGML_ASSERT(cache_buffer == NULL);
        std::map<ggml_tensor*, ggml_tensor*> runtime_tensor_to_cache_tensor;
        for (const auto& kv : cache_tensor_map) {
            auto cache_tensor = ggml_dup_tensor(cache_ctx, kv.second);
            ggml_set_name(cache_tensor, kv.first.c_str());
            runtime_tensor_to_cache_tensor[kv.second] = cache_tensor;
        }
        size_t num_tensors = ggml_tensor_num(cache_ctx);
        cache_buffer       = ggml_backend_alloc_ctx_tensors(cache_ctx, runtime_backend);
        GGML_ASSERT(cache_buffer != NULL);
        for (const auto& kv : runtime_tensor_to_cache_tensor) {
            ggml_backend_tensor_copy(kv.first, kv.second);
        }
        ggml_backend_synchronize(runtime_backend);
        cache_tensor_map.clear();
        size_t cache_buffer_size = ggml_backend_buffer_get_size(cache_buffer);
        LOG_DEBUG("%s cache backend buffer size = % 6.2f MB(%s) (%zu tensors)",
                  get_desc().c_str(),
                  cache_buffer_size / (1024.f * 1024.f),
                  ggml_backend_is_cpu(runtime_backend) ? "RAM" : "VRAM",
                  num_tensors);
    }
    void copy_data_to_backend_tensor() {
        for (auto& kv : backend_tensor_data_map) {
            auto tensor = kv.first;
            auto data   = kv.second;
@ -1204,12 +1470,96 @@ protected:
        backend_tensor_data_map.clear();
    }
    bool offload_params_to_runtime_backend() {
        if (params_backend == runtime_backend) {
            return true;
        }
        if (params_on_runtime_backend) {
            return true;
        }
        GGML_ASSERT(runtime_params_buffer == NULL);
        int64_t t0         = ggml_time_ms();
        size_t num_tensors = ggml_tensor_num(offload_ctx);
        if (num_tensors == 0) {
            for (ggml_tensor* t = ggml_get_first_tensor(params_ctx); t != NULL; t = ggml_get_next_tensor(params_ctx, t)) {
                GGML_ASSERT(t->view_src == NULL);
                ggml_dup_tensor(offload_ctx, t);
            }
        }
        num_tensors = ggml_tensor_num(offload_ctx);
        GGML_ASSERT(num_tensors == ggml_tensor_num(params_ctx));
        runtime_params_buffer = ggml_backend_alloc_ctx_tensors(offload_ctx, runtime_backend);
        if (runtime_params_buffer == NULL) {
            LOG_ERROR("%s alloc runtime params backend buffer failed, num_tensors = %zu",
                      get_desc().c_str(),
                      num_tensors);
            return false;
        }
        ggml_tensor* t         = ggml_get_first_tensor(params_ctx);
        ggml_tensor* offload_t = ggml_get_first_tensor(offload_ctx);
        while (t != NULL && offload_t != NULL) {
            ggml_backend_tensor_copy(t, offload_t);
            std::swap(t->buffer, offload_t->buffer);
            std::swap(t->data, offload_t->data);
            t         = ggml_get_next_tensor(params_ctx, t);
            offload_t = ggml_get_next_tensor(offload_ctx, offload_t);
        }
        int64_t t1 = ggml_time_ms();
        size_t params_buffer_size = ggml_backend_buffer_get_size(runtime_params_buffer);
        LOG_INFO("%s offload params (%6.2f MB, %zu tensors) to runtime backend (%s), taking %.2fs",
                 get_desc().c_str(),
                 params_buffer_size / (1024.f * 1024.f),
                 num_tensors,
                 ggml_backend_name(runtime_backend),
                 (t1 - t0) * 1.0f / 1000);
        params_on_runtime_backend = true;
        return true;
    }
    void offload_params_to_params_backend() {
        if (!params_on_runtime_backend) {
            return;
        }
        ggml_tensor* t         = ggml_get_first_tensor(params_ctx);
        ggml_tensor* offload_t = ggml_get_first_tensor(offload_ctx);
        while (t != NULL && offload_t != NULL) {
            t->buffer         = offload_t->buffer;
            t->data           = offload_t->data;
            offload_t->buffer = NULL;
            offload_t->data   = NULL;
            t         = ggml_get_next_tensor(params_ctx, t);
            offload_t = ggml_get_next_tensor(offload_ctx, offload_t);
        }
        if (runtime_params_buffer != NULL) {
            ggml_backend_buffer_free(runtime_params_buffer);
            runtime_params_buffer = NULL;
        }
        params_on_runtime_backend = false;
    }
 public:
    virtual std::string get_desc() = 0;
-    GGMLRunner(ggml_backend_t backend)
+    GGMLRunner(ggml_backend_t backend, bool offload_params_to_cpu = false)
-        : backend(backend) {
+        : runtime_backend(backend) {
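        if (!ggml_backend_is_cpu(runtime_backend) && offload_params_to_cpu) {
            params_backend = ggml_backend_cpu_init();
        } else {
            params_backend = runtime_backend;
        }
        // Select params_backend first: alloc_params_ctx() compares it with
        // runtime_backend to decide whether offload_ctx is needed.
        alloc_params_ctx();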
    }
    virtual ~GGMLRunner() {
@ -1217,6 +1567,10 @@ public:
        free_compute_buffer();
        free_params_ctx();
        free_compute_ctx();
        if (params_backend != runtime_backend) {
            ggml_backend_free(params_backend);
        }
        free_cache_ctx_and_buffer();
    }
    void reset_compute_ctx() {
@ -1226,7 +1580,7 @@ public:
    bool alloc_params_buffer() {
        size_t num_tensors = ggml_tensor_num(params_ctx);
-        params_buffer      = ggml_backend_alloc_ctx_tensors(params_ctx, backend);
+        params_buffer      = ggml_backend_alloc_ctx_tensors(params_ctx, params_backend);
        if (params_buffer == NULL) {
            LOG_ERROR("%s alloc params backend buffer failed, num_tensors = %i",
                      get_desc().c_str(),
@ -1236,14 +1590,9 @@ public:
        size_t params_buffer_size = ggml_backend_buffer_get_size(params_buffer);
        LOG_DEBUG("%s params backend buffer size = % 6.2f MB(%s) (%i tensors)",
                  get_desc().c_str(),
-                  params_buffer_size / (1024.0 * 1024.0),
+                  params_buffer_size / (1024.f * 1024.f),
-                  ggml_backend_is_cpu(backend) ? "RAM" : "VRAM",
+                  ggml_backend_is_cpu(params_backend) ? "RAM" : "VRAM",
                  num_tensors);
        // printf("%s params backend buffer size = % 6.2f MB(%s) (%i tensors)\n",
        //           get_desc().c_str(),
        //           params_buffer_size / (1024.0 * 1024.0),
        //           ggml_backend_is_cpu(backend) ? "RAM" : "VRAM",
        //           num_tensors);
        return true;
    }
@ -1261,11 +1610,17 @@ public:
        return 0;
    }
    void free_cache_ctx_and_buffer() {
        free_cache_buffer();
        free_cache_ctx();
    }
    void free_compute_buffer() {
        if (compute_allocr != NULL) {
            ggml_gallocr_free(compute_allocr);
            compute_allocr = NULL;
        }
        offload_params_to_params_backend();
    }
    // do copy after alloc graph
@ -1279,7 +1634,7 @@ public:
            return NULL;
        }
        // it's performing a compute, check if backend isn't cpu
-        if (!ggml_backend_is_cpu(backend) && (tensor->buffer == NULL || ggml_backend_buffer_is_host(tensor->buffer))) {
+        if (!ggml_backend_is_cpu(runtime_backend) && (tensor->buffer == NULL || ggml_backend_buffer_is_host(tensor->buffer))) {
            // pass input tensors to gpu memory
            auto backend_tensor = ggml_dup_tensor(compute_ctx, tensor);
@ -1290,31 +1645,47 @@ public:
        }
    }
    void cache(const std::string name, struct ggml_tensor* tensor) {
        cache_tensor_map[name] = tensor;
    }
    struct ggml_tensor* get_cache_tensor_by_name(const std::string& name) {
        if (cache_ctx == NULL) {
            return NULL;
        }
        return ggml_get_tensor(cache_ctx, name.c_str());
    }
    void compute(get_graph_cb_t get_graph,
                 int n_threads,
                 bool free_compute_buffer_immediately = true,
                 struct ggml_tensor** output          = NULL,
                 struct ggml_context* output_ctx      = NULL) {
        if (!offload_params_to_runtime_backend()) {
            LOG_ERROR("%s offload params to runtime backend failed", get_desc().c_str());
            return;
        }
        alloc_compute_buffer(get_graph);
        reset_compute_ctx();
-        struct ggml_cgraph* gf = get_graph();
+        struct ggml_cgraph* gf = get_compute_graph(get_graph);
        GGML_ASSERT(ggml_gallocr_alloc_graph(compute_allocr, gf));
-        cpy_data_to_backend_tensor();
+        copy_data_to_backend_tensor();
-        if (ggml_backend_is_cpu(backend)) {
+        if (ggml_backend_is_cpu(runtime_backend)) {
-            ggml_backend_cpu_set_n_threads(backend, n_threads);
+            ggml_backend_cpu_set_n_threads(runtime_backend, n_threads);
        }
-        ggml_backend_graph_compute(backend, gf);
+        ggml_backend_graph_compute(runtime_backend, gf);
 #ifdef GGML_PERF
        ggml_graph_print(gf);
 #endif
        copy_cache_tensors_to_cache_buffer();
        if (output != NULL) {
-            auto result = ggml_graph_node(gf, -1);
+            auto result = ggml_get_tensor(compute_ctx, final_result_name.c_str());
            if (*output == NULL && output_ctx != NULL) {
                *output = ggml_dup_tensor(output_ctx, result);
            }
            if (*output != NULL) {
-                ggml_backend_tensor_get_and_sync(backend, result, (*output)->data, 0, ggml_nbytes(*output));
+                ggml_backend_tensor_get_and_sync(runtime_backend, result, (*output)->data, 0, ggml_nbytes(*output));
            }
        }
@ -1416,6 +1787,13 @@ public:
    virtual struct ggml_tensor* forward(struct ggml_context* ctx, struct ggml_tensor* x) = 0;
 };
 class Identity : public UnaryBlock {
 public:
    struct ggml_tensor* forward(struct ggml_context* ctx, struct ggml_tensor* x) {
        return x;
    }
 };
 class Linear : public UnaryBlock {
 protected:
    int64_t in_features;
@ -1430,7 +1808,7 @@ protected:
        }
        params["weight"] = ggml_new_tensor_2d(ctx, wtype, in_features, out_features);
        if (bias) {
-            enum ggml_type wtype = GGML_TYPE_F32;  //(tensor_types.ypes.find(prefix + "bias") != tensor_types.end()) ? tensor_types[prefix + "bias"] : GGML_TYPE_F32;
+            enum ggml_type wtype = GGML_TYPE_F32;
            params["bias"]       = ggml_new_tensor_1d(ctx, wtype, out_features);
        }
    }
@ -1594,6 +1972,58 @@ public:
    }
 };
 class Conv3d : public UnaryBlock {
 protected:
    int64_t in_channels;
    int64_t out_channels;
    std::tuple<int, int, int> kernel_size;
    std::tuple<int, int, int> stride;
    std::tuple<int, int, int> padding;
    std::tuple<int, int, int> dilation;
    bool bias;
    void init_params(struct ggml_context* ctx, const String2GGMLType& tensor_types, const std::string prefix = "") {
        enum ggml_type wtype = GGML_TYPE_F16;
        params["weight"]     = ggml_new_tensor_4d(ctx,
                                                  wtype,
                                                  std::get<2>(kernel_size),
                                                  std::get<1>(kernel_size),
                                                  std::get<0>(kernel_size),
                                                  in_channels * out_channels);
        if (bias) {
            params["bias"] = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, out_channels);
        }
    }
 public:
    Conv3d(int64_t in_channels,
           int64_t out_channels,
           std::tuple<int, int, int> kernel_size,
           std::tuple<int, int, int> stride   = {1, 1, 1},
           std::tuple<int, int, int> padding  = {0, 0, 0},
           std::tuple<int, int, int> dilation = {1, 1, 1},
           bool bias                          = true)
        : in_channels(in_channels),
          out_channels(out_channels),
          kernel_size(kernel_size),
          stride(stride),
          padding(padding),
          dilation(dilation),
          bias(bias) {}
    struct ggml_tensor* forward(struct ggml_context* ctx, struct ggml_tensor* x) {
        struct ggml_tensor* w = params["weight"];
        struct ggml_tensor* b = NULL;
        if (bias) {
            b = params["bias"];
        }
        return ggml_nn_conv_3d(ctx, x, w, b, in_channels,
                               std::get<2>(stride), std::get<1>(stride), std::get<0>(stride),
                               std::get<2>(padding), std::get<1>(padding), std::get<0>(padding),
                               std::get<2>(dilation), std::get<1>(dilation), std::get<0>(dilation));
    }
 };
 class LayerNorm : public UnaryBlock {
 protected:
    int64_t normalized_shape;
@ -1679,6 +2109,30 @@ public:
        : GroupNorm(32, num_channels, 1e-06f) {}
 };
 class RMSNorm : public UnaryBlock {
 protected:
    int64_t hidden_size;
    float eps;
    void init_params(struct ggml_context* ctx, const String2GGMLType& tensor_types = {}, std::string prefix = "") {
        enum ggml_type wtype = GGML_TYPE_F32;
        params["weight"]     = ggml_new_tensor_1d(ctx, wtype, hidden_size);
    }
 public:
    RMSNorm(int64_t hidden_size,
            float eps = 1e-06f)
        : hidden_size(hidden_size),
          eps(eps) {}
    struct ggml_tensor* forward(struct ggml_context* ctx, struct ggml_tensor* x) {
        struct ggml_tensor* w = params["weight"];
        x                     = ggml_rms_norm(ctx, x, eps);
        x                     = ggml_mul_inplace(ctx, x, w);
        return x;
    }
 };
 class MultiheadAttention : public GGMLBlock {
 protected:
    int64_t embed_dim;
--- a/gguf_reader.hpp
+++ b/gguf_reader.hpp
@ -0,0 +1,231 @@
 #ifndef __GGUF_READER_HPP__
 #define __GGUF_READER_HPP__
 #include <cstdint>
 #include <cstring>
 #include <fstream>
 #include <stdexcept>
 #include <string>
 #include <vector>
 #include "ggml.h"
 #include "util.h"
 struct GGUFTensorInfo {
    std::string name;
    ggml_type type;
    std::vector<int64_t> shape;
    size_t offset;
 };
 enum class GGUFMetadataType : uint32_t {
    UINT8   = 0,
    INT8    = 1,
    UINT16  = 2,
    INT16   = 3,
    UINT32  = 4,
    INT32   = 5,
    FLOAT32 = 6,
    BOOL    = 7,
    STRING  = 8,
    ARRAY   = 9,
    UINT64  = 10,
    INT64   = 11,
    FLOAT64 = 12,
 };
 class GGUFReader {
 private:
    std::vector<GGUFTensorInfo> tensors_;
    size_t data_offset_;
    size_t alignment_ = 32;  // default alignment is 32
    template <typename T>
    bool safe_read(std::ifstream& fin, T& value) {
        fin.read(reinterpret_cast<char*>(&value), sizeof(T));
        return fin.good();
    }
    bool safe_read(std::ifstream& fin, char* buffer, size_t size) {
        fin.read(buffer, size);
        return fin.good();
    }
    bool safe_seek(std::ifstream& fin, std::streamoff offset, std::ios::seekdir dir) {
        fin.seekg(offset, dir);
        return fin.good();
    }
    bool read_metadata(std::ifstream& fin) {
        uint64_t key_len = 0;
        if (!safe_read(fin, key_len))
            return false;
        std::string key(key_len, '\0');
        if (!safe_read(fin, (char*)key.data(), key_len))
            return false;
        uint32_t type = 0;
        if (!safe_read(fin, type))
            return false;
        if (key == "general.alignment") {
            uint32_t align_val = 0;
            if (!safe_read(fin, align_val))
                return false;
            if (align_val != 0 && (align_val & (align_val - 1)) == 0) {
                alignment_ = align_val;
                LOG_DEBUG("Found alignment: %zu", alignment_);
            } else {
                LOG_ERROR("Invalid alignment value %u, fallback to default %zu", align_val, alignment_);
            }
            return true;
        }
        switch (static_cast<GGUFMetadataType>(type)) {
            case GGUFMetadataType::UINT8:
            case GGUFMetadataType::INT8:
            case GGUFMetadataType::BOOL:
                return safe_seek(fin, 1, std::ios::cur);
            case GGUFMetadataType::UINT16:
            case GGUFMetadataType::INT16:
                return safe_seek(fin, 2, std::ios::cur);
            case GGUFMetadataType::UINT32:
            case GGUFMetadataType::INT32:
            case GGUFMetadataType::FLOAT32:
                return safe_seek(fin, 4, std::ios::cur);
            case GGUFMetadataType::UINT64:
            case GGUFMetadataType::INT64:
            case GGUFMetadataType::FLOAT64:
                return safe_seek(fin, 8, std::ios::cur);
            case GGUFMetadataType::STRING: {
                uint64_t len = 0;
                if (!safe_read(fin, len))
                    return false;
                return safe_seek(fin, len, std::ios::cur);
            }
            case GGUFMetadataType::ARRAY: {
                uint32_t elem_type = 0;
                uint64_t len       = 0;
                if (!safe_read(fin, elem_type))
                    return false;
                if (!safe_read(fin, len))
                    return false;
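                // NOTE: the GGUF spec stores array elements as bare values of elem_type,
                // with no key or type prefix. Recursing into read_metadata() treats each
                // element as a full key/value pair, so array-valued metadata may be misparsed.
                for (uint64_t i = 0; i < len; i++) {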
                    if (!read_metadata(fin))
                        return false;
                }
                return true;
            }
            default:
                LOG_ERROR("Unknown metadata type=%u", type);
                return false;
        }
    }
    GGUFTensorInfo read_tensor_info(std::ifstream& fin) {
        GGUFTensorInfo info;
        uint64_t name_len;
        if (!safe_read(fin, name_len))
            throw std::runtime_error("read tensor name length failed");
        info.name.resize(name_len);
        if (!safe_read(fin, (char*)info.name.data(), name_len))
            throw std::runtime_error("read tensor name failed");
        uint32_t n_dims;
        if (!safe_read(fin, n_dims))
            throw std::runtime_error("read tensor dims failed");
        info.shape.resize(n_dims);
        for (uint32_t i = 0; i < n_dims; i++) {
            if (!safe_read(fin, info.shape[i]))
                throw std::runtime_error("read tensor shape failed");
        }
        if (n_dims > GGML_MAX_DIMS) {
            for (uint32_t i = GGML_MAX_DIMS; i < n_dims; i++) {
                info.shape[GGML_MAX_DIMS - 1] *= info.shape[i];  // stack to last dim;
            }
            info.shape.resize(GGML_MAX_DIMS);
            n_dims = GGML_MAX_DIMS;
        }
        uint32_t type;
        if (!safe_read(fin, type))
            throw std::runtime_error("read tensor type failed");
        info.type = static_cast<ggml_type>(type);
        if (!safe_read(fin, info.offset))
            throw std::runtime_error("read tensor offset failed");
        return info;
    }
 public:
    bool load(const std::string& file_path) {
        std::ifstream fin(file_path, std::ios::binary);
        if (!fin) {
            LOG_ERROR("failed to open '%s'", file_path.c_str());
            return false;
        }
        // --- Header ---
        char magic[4];
        if (!safe_read(fin, magic, 4) || strncmp(magic, "GGUF", 4) != 0) {
            LOG_ERROR("not a valid GGUF file");
            return false;
        }
        uint32_t version;
        if (!safe_read(fin, version))
            return false;
        uint64_t tensor_count, metadata_kv_count;
        if (!safe_read(fin, tensor_count))
            return false;
        if (!safe_read(fin, metadata_kv_count))
            return false;
        LOG_DEBUG("GGUF v%u, tensor_count=%llu, metadata_kv_count=%llu",
                  version, (unsigned long long)tensor_count, (unsigned long long)metadata_kv_count);
        // --- Read Metadata ---
        for (uint64_t i = 0; i < metadata_kv_count; i++) {
            if (!read_metadata(fin)) {
                LOG_ERROR("read meta data failed");
                return false;
            }
        }
        // --- Tensor Infos ---
        tensors_.clear();
        try {
            for (uint64_t i = 0; i < tensor_count; i++) {
                tensors_.push_back(read_tensor_info(fin));
            }
        } catch (const std::runtime_error& e) {
            LOG_ERROR("%s", e.what());
            return false;
        }
        data_offset_ = static_cast<size_t>(fin.tellg());
        if ((data_offset_ % alignment_) != 0) {
            data_offset_ = ((data_offset_ + alignment_ - 1) / alignment_) * alignment_;
        }
        fin.close();
        return true;
    }
    const std::vector<GGUFTensorInfo>& tensors() const { return tensors_; }
    size_t data_offset() const { return data_offset_; }
 };
 #endif  // __GGUF_READER_HPP__
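
Review note: the two pieces of arithmetic in GGUFReader (folding extra tensor dimensions into the last kept one, and rounding the data offset up to the alignment) can be tried in isolation. The following is a minimal sketch, not code from this PR. The helper names `align_up` and `collapse_dims` are hypothetical, the element type `uint64_t` is an assumption about `GGUFTensorInfo::shape`, and any concrete values for `alignment_` or `GGML_MAX_DIMS` are supplied by the caller rather than taken from the diff.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Round `offset` up to the next multiple of `alignment`.
// Hypothetical helper mirroring the data_offset_ adjustment at the end of load().
inline size_t align_up(size_t offset, size_t alignment) {
    assert(alignment > 0);
    return ((offset + alignment - 1) / alignment) * alignment;
}

// Fold every dimension at index >= max_dims into dimension max_dims - 1,
// then truncate the shape. The total element count is unchanged.
// Hypothetical helper mirroring the n_dims > GGML_MAX_DIMS branch.
inline std::vector<uint64_t> collapse_dims(std::vector<uint64_t> shape, size_t max_dims) {
    assert(max_dims > 0);
    if (shape.size() > max_dims) {
        for (size_t i = max_dims; i < shape.size(); i++) {
            shape[max_dims - 1] *= shape[i];
        }
        shape.resize(max_dims);
    }
    return shape;
}
```

With an assumed alignment of 32, `align_up(100, 32)` yields 128 and an already aligned offset is returned unchanged. With an assumed limit of 4 dimensions, a shape of {2, 3, 4, 5, 6} becomes {2, 3, 4, 30}.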
--- a/lora.hpp
+++ b/lora.hpp
@ -92,6 +92,7 @@ struct LoraModel : public GGMLRunner {
    float multiplier = 1.0f;
    std::map<std::string, struct ggml_tensor*> lora_tensors;
    std::map<ggml_tensor*, ggml_tensor*> original_tensor_to_final_tensor;
    std::string file_path;
    ModelLoader model_loader;
    bool load_failed                = false;
@ -103,7 +104,7 @@ struct LoraModel : public GGMLRunner {
    LoraModel(ggml_backend_t backend,
              const std::string& file_path = "",
              const std::string prefix     = "")
-        : file_path(file_path), GGMLRunner(backend) {
+        : file_path(file_path), GGMLRunner(backend, false) {
        if (!model_loader.init_from_file(file_path, prefix)) {
            load_failed = true;
        }
@ -129,7 +130,7 @@ struct LoraModel : public GGMLRunner {
                // LOG_INFO("skipping LoRA tensor '%s'", name.c_str());
                return true;
            }
-            // LOG_INFO("%s", name.c_str());
+            // LOG_INFO("lora_tensor %s", name.c_str());
            for (int i = 0; i < LORA_TYPE_COUNT; i++) {
                if (name.find(type_fingerprints[i]) != std::string::npos) {
                    type = (lora_t)i;
@ -151,11 +152,11 @@ struct LoraModel : public GGMLRunner {
            return true;
        };
-        model_loader.load_tensors(on_new_tensor_cb, backend);
+        model_loader.load_tensors(on_new_tensor_cb);
        alloc_params_buffer();
        // exit(0);
        dry_run = false;
-        model_loader.load_tensors(on_new_tensor_cb, backend);
+        model_loader.load_tensors(on_new_tensor_cb);
        LOG_DEBUG("lora type: \"%s\"/\"%s\"", lora_downs[type].c_str(), lora_ups[type].c_str());
@ -167,6 +168,7 @@ struct LoraModel : public GGMLRunner {
        auto out = ggml_reshape_1d(ctx, a, ggml_nelements(a));
        out      = ggml_get_rows(ctx, out, zero_index);
        out      = ggml_reshape(ctx, out, a);
        // auto out = ggml_cast(ctx, a, GGML_TYPE_F32);
        return out;
    }
@ -245,14 +247,22 @@ struct LoraModel : public GGMLRunner {
        set_backend_tensor_data(zero_index, zero_index_vec.data());
        ggml_build_forward_expand(gf, zero_index);
        original_tensor_to_final_tensor.clear();
        std::set<std::string> applied_lora_tensors;
        for (auto it : model_tensors) {
-            std::string k_tensor       = it.first;
+            std::string model_tensor_name    = it.first;
-            struct ggml_tensor* weight = model_tensors[it.first];
+            struct ggml_tensor* model_tensor = model_tensors[it.first];
-            std::vector<std::string> keys = to_lora_keys(k_tensor, version);
+            std::vector<std::string> keys = to_lora_keys(model_tensor_name, version);
-            if (keys.size() == 0)
+            bool is_bias                  = ends_with(model_tensor_name, ".bias");
-                continue;
+            if (keys.size() == 0) {
                if (is_bias) {
                    keys.push_back(model_tensor_name.substr(0, model_tensor_name.size() - 5));  // remove .bias
                } else {
                    continue;
                }
            }
            for (auto& key : keys) {
                bool is_qkv_split = starts_with(key, "SPLIT|");
@ -265,8 +275,22 @@ struct LoraModel : public GGMLRunner {
                }
                struct ggml_tensor* updown = NULL;
                float scale_value          = 1.0f;
-                std::string fk             = lora_pre[type] + key;
+                std::string full_key       = lora_pre[type] + key;
-                if (lora_tensors.find(fk + ".hada_w1_a") != lora_tensors.end()) {
+                if (is_bias) {
                    if (lora_tensors.find(full_key + ".diff_b") != lora_tensors.end()) {
                        std::string diff_name = full_key + ".diff_b";
                        ggml_tensor* diff     = lora_tensors[diff_name];
                        updown                = to_f32(compute_ctx, diff);
                        applied_lora_tensors.insert(diff_name);
                    } else {
                        continue;
                    }
                } else if (lora_tensors.find(full_key + ".diff") != lora_tensors.end()) {
                    std::string diff_name = full_key + ".diff";
                    ggml_tensor* diff     = lora_tensors[diff_name];
                    updown                = to_f32(compute_ctx, diff);
                    applied_lora_tensors.insert(diff_name);
                } else if (lora_tensors.find(full_key + ".hada_w1_a") != lora_tensors.end()) {
                    // LoHa mode
                    // TODO: split qkv convention for LoHas (is it ever used?)
@ -292,9 +316,9 @@ struct LoraModel : public GGMLRunner {
                    std::string hada_2_down_name = "";
                    std::string hada_2_up_name   = "";
-                    hada_1_down_name = fk + ".hada_w1_b";
+                    hada_1_down_name = full_key + ".hada_w1_b";
-                    hada_1_up_name   = fk + ".hada_w1_a";
+                    hada_1_up_name   = full_key + ".hada_w1_a";
-                    hada_1_mid_name  = fk + ".hada_t1";
+                    hada_1_mid_name  = full_key + ".hada_t1";
                    if (lora_tensors.find(hada_1_down_name) != lora_tensors.end()) {
                        hada_1_down = to_f32(compute_ctx, lora_tensors[hada_1_down_name]);
                    }
@ -307,9 +331,9 @@ struct LoraModel : public GGMLRunner {
                        hada_1_up = ggml_cont(compute_ctx, ggml_transpose(compute_ctx, hada_1_up));
                    }
-                    hada_2_down_name = fk + ".hada_w2_b";
+                    hada_2_down_name = full_key + ".hada_w2_b";
-                    hada_2_up_name   = fk + ".hada_w2_a";
+                    hada_2_up_name   = full_key + ".hada_w2_a";
-                    hada_2_mid_name  = fk + ".hada_t2";
+                    hada_2_mid_name  = full_key + ".hada_t2";
                    if (lora_tensors.find(hada_2_down_name) != lora_tensors.end()) {
                        hada_2_down = to_f32(compute_ctx, lora_tensors[hada_2_down_name]);
                    }
@ -322,7 +346,7 @@ struct LoraModel : public GGMLRunner {
                        hada_2_up = ggml_cont(compute_ctx, ggml_transpose(compute_ctx, hada_2_up));
                    }
-                    alpha_name = fk + ".alpha";
+                    alpha_name = full_key + ".alpha";
                    applied_lora_tensors.insert(hada_1_down_name);
                    applied_lora_tensors.insert(hada_1_up_name);
@ -345,7 +369,7 @@ struct LoraModel : public GGMLRunner {
                        float alpha = ggml_backend_tensor_get_f32(lora_tensors[alpha_name]);
                        scale_value = alpha / rank;
                    }
-                } else if (lora_tensors.find(fk + ".lokr_w1") != lora_tensors.end() || lora_tensors.find(fk + ".lokr_w1_a") != lora_tensors.end()) {
+                } else if (lora_tensors.find(full_key + ".lokr_w1") != lora_tensors.end() || lora_tensors.find(full_key + ".lokr_w1_a") != lora_tensors.end()) {
                    // LoKr mode
                    // TODO: split qkv convention for LoKrs (is it ever used?)
@ -354,7 +378,7 @@ struct LoraModel : public GGMLRunner {
                        break;
                    }
-                    std::string alpha_name = fk + ".alpha";
+                    std::string alpha_name = full_key + ".alpha";
                    ggml_tensor* lokr_w1 = NULL;
                    ggml_tensor* lokr_w2 = NULL;
@ -362,8 +386,8 @@ struct LoraModel : public GGMLRunner {
                    std::string lokr_w1_name = "";
                    std::string lokr_w2_name = "";
-                    lokr_w1_name = fk + ".lokr_w1";
+                    lokr_w1_name = full_key + ".lokr_w1";
-                    lokr_w2_name = fk + ".lokr_w2";
+                    lokr_w2_name = full_key + ".lokr_w2";
                    if (lora_tensors.find(lokr_w1_name) != lora_tensors.end()) {
                        lokr_w1 = to_f32(compute_ctx, lora_tensors[lokr_w1_name]);
@ -435,29 +459,29 @@ struct LoraModel : public GGMLRunner {
                    if (is_qkv_split) {
                        std::string suffix  = "";
-                        auto split_q_d_name = fk + "q" + suffix + lora_downs[type] + ".weight";
+                        auto split_q_d_name = full_key + "q" + suffix + lora_downs[type] + ".weight";
                        if (lora_tensors.find(split_q_d_name) == lora_tensors.end()) {
                            suffix         = "_proj";
-                            split_q_d_name = fk + "q" + suffix + lora_downs[type] + ".weight";
+                            split_q_d_name = full_key + "q" + suffix + lora_downs[type] + ".weight";
                        }
                        if (lora_tensors.find(split_q_d_name) != lora_tensors.end()) {
                            // print_ggml_tensor(it.second, true);  //[3072, 21504, 1, 1]
                            // find qkv and mlp up parts in LoRA model
-                            auto split_k_d_name = fk + "k" + suffix + lora_downs[type] + ".weight";
+                            auto split_k_d_name = full_key + "k" + suffix + lora_downs[type] + ".weight";
-                            auto split_v_d_name = fk + "v" + suffix + lora_downs[type] + ".weight";
+                            auto split_v_d_name = full_key + "v" + suffix + lora_downs[type] + ".weight";
-                            auto split_q_u_name = fk + "q" + suffix + lora_ups[type] + ".weight";
+                            auto split_q_u_name = full_key + "q" + suffix + lora_ups[type] + ".weight";
-                            auto split_k_u_name = fk + "k" + suffix + lora_ups[type] + ".weight";
+                            auto split_k_u_name = full_key + "k" + suffix + lora_ups[type] + ".weight";
-                            auto split_v_u_name = fk + "v" + suffix + lora_ups[type] + ".weight";
+                            auto split_v_u_name = full_key + "v" + suffix + lora_ups[type] + ".weight";
-                            auto split_q_scale_name = fk + "q" + suffix + ".scale";
+                            auto split_q_scale_name = full_key + "q" + suffix + ".scale";
-                            auto split_k_scale_name = fk + "k" + suffix + ".scale";
+                            auto split_k_scale_name = full_key + "k" + suffix + ".scale";
-                            auto split_v_scale_name = fk + "v" + suffix + ".scale";
+                            auto split_v_scale_name = full_key + "v" + suffix + ".scale";
-                            auto split_q_alpha_name = fk + "q" + suffix + ".alpha";
+                            auto split_q_alpha_name = full_key + "q" + suffix + ".alpha";
-                            auto split_k_alpha_name = fk + "k" + suffix + ".alpha";
+                            auto split_k_alpha_name = full_key + "k" + suffix + ".alpha";
-                            auto split_v_alpha_name = fk + "v" + suffix + ".alpha";
+                            auto split_v_alpha_name = full_key + "v" + suffix + ".alpha";
                            ggml_tensor* lora_q_down = NULL;
                            ggml_tensor* lora_q_up   = NULL;
@ -571,29 +595,29 @@ struct LoraModel : public GGMLRunner {
                            applied_lora_tensors.insert(split_v_d_name);
                        }
                    } else if (is_qkvm_split) {
-                        auto split_q_d_name = fk + "attn.to_q" + lora_downs[type] + ".weight";
+                        auto split_q_d_name = full_key + "attn.to_q" + lora_downs[type] + ".weight";
                        if (lora_tensors.find(split_q_d_name) != lora_tensors.end()) {
                            // print_ggml_tensor(it.second, true);  //[3072, 21504, 1, 1]
                            // find qkv and mlp up parts in LoRA model
-                            auto split_k_d_name = fk + "attn.to_k" + lora_downs[type] + ".weight";
+                            auto split_k_d_name = full_key + "attn.to_k" + lora_downs[type] + ".weight";
-                            auto split_v_d_name = fk + "attn.to_v" + lora_downs[type] + ".weight";
+                            auto split_v_d_name = full_key + "attn.to_v" + lora_downs[type] + ".weight";
-                            auto split_q_u_name = fk + "attn.to_q" + lora_ups[type] + ".weight";
+                            auto split_q_u_name = full_key + "attn.to_q" + lora_ups[type] + ".weight";
-                            auto split_k_u_name = fk + "attn.to_k" + lora_ups[type] + ".weight";
+                            auto split_k_u_name = full_key + "attn.to_k" + lora_ups[type] + ".weight";
-                            auto split_v_u_name = fk + "attn.to_v" + lora_ups[type] + ".weight";
+                            auto split_v_u_name = full_key + "attn.to_v" + lora_ups[type] + ".weight";
-                            auto split_m_d_name = fk + "proj_mlp" + lora_downs[type] + ".weight";
+                            auto split_m_d_name = full_key + "proj_mlp" + lora_downs[type] + ".weight";
-                            auto split_m_u_name = fk + "proj_mlp" + lora_ups[type] + ".weight";
+                            auto split_m_u_name = full_key + "proj_mlp" + lora_ups[type] + ".weight";
-                            auto split_q_scale_name = fk + "attn.to_q" + ".scale";
+                            auto split_q_scale_name = full_key + "attn.to_q" + ".scale";
-                            auto split_k_scale_name = fk + "attn.to_k" + ".scale";
+                            auto split_k_scale_name = full_key + "attn.to_k" + ".scale";
-                            auto split_v_scale_name = fk + "attn.to_v" + ".scale";
+                            auto split_v_scale_name = full_key + "attn.to_v" + ".scale";
-                            auto split_m_scale_name = fk + "proj_mlp" + ".scale";
+                            auto split_m_scale_name = full_key + "proj_mlp" + ".scale";
-                            auto split_q_alpha_name = fk + "attn.to_q" + ".alpha";
+                            auto split_q_alpha_name = full_key + "attn.to_q" + ".alpha";
-                            auto split_k_alpha_name = fk + "attn.to_k" + ".alpha";
+                            auto split_k_alpha_name = full_key + "attn.to_k" + ".alpha";
-                            auto split_v_alpha_name = fk + "attn.to_v" + ".alpha";
+                            auto split_v_alpha_name = full_key + "attn.to_v" + ".alpha";
-                            auto split_m_alpha_name = fk + "proj_mlp" + ".alpha";
+                            auto split_m_alpha_name = full_key + "proj_mlp" + ".alpha";
                            ggml_tensor* lora_q_down = NULL;
                            ggml_tensor* lora_q_up   = NULL;
@ -748,30 +772,27 @@ struct LoraModel : public GGMLRunner {
                            applied_lora_tensors.insert(split_m_d_name);
                        }
                    } else {
-                        lora_up_name   = fk + lora_ups[type] + ".weight";
+                        lora_up_name   = full_key + lora_ups[type] + ".weight";
-                        lora_down_name = fk + lora_downs[type] + ".weight";
+                        lora_down_name = full_key + lora_downs[type] + ".weight";
-                        lora_mid_name  = fk + ".lora_mid.weight";
+                        lora_mid_name  = full_key + ".lora_mid.weight";
-                        alpha_name = fk + ".alpha";
+                        alpha_name = full_key + ".alpha";
-                        scale_name = fk + ".scale";
+                        scale_name = full_key + ".scale";
                        if (lora_tensors.find(lora_up_name) != lora_tensors.end()) {
                            lora_up = to_f32(compute_ctx, lora_tensors[lora_up_name]);
                            applied_lora_tensors.insert(lora_up_name);
                        }
                        if (lora_tensors.find(lora_down_name) != lora_tensors.end()) {
                            lora_down = to_f32(compute_ctx, lora_tensors[lora_down_name]);
                            applied_lora_tensors.insert(lora_down_name);
                        }
                        if (lora_tensors.find(lora_mid_name) != lora_tensors.end()) {
                            lora_mid = to_f32(compute_ctx, lora_tensors[lora_mid_name]);
                            applied_lora_tensors.insert(lora_mid_name);
                        }
                        applied_lora_tensors.insert(lora_up_name);
                        applied_lora_tensors.insert(lora_down_name);
                        applied_lora_tensors.insert(alpha_name);
                        applied_lora_tensors.insert(scale_name);
                    }
                    if (lora_up == NULL || lora_down == NULL) {
@ -782,29 +803,37 @@ struct LoraModel : public GGMLRunner {
                    int64_t rank = lora_down->ne[ggml_n_dims(lora_down) - 1];
                    if (lora_tensors.find(scale_name) != lora_tensors.end()) {
                        scale_value = ggml_backend_tensor_get_f32(lora_tensors[scale_name]);
                        applied_lora_tensors.insert(scale_name);
                    } else if (lora_tensors.find(alpha_name) != lora_tensors.end()) {
                        float alpha = ggml_backend_tensor_get_f32(lora_tensors[alpha_name]);
                        scale_value = alpha / rank;
                        // LOG_DEBUG("rank %s %ld %.2f %.2f", alpha_name.c_str(), rank, alpha, scale_value);
                        applied_lora_tensors.insert(alpha_name);
                    }
                    updown = ggml_merge_lora(compute_ctx, lora_down, lora_up, lora_mid);
                }
                scale_value *= multiplier;
-                updown = ggml_reshape(compute_ctx, updown, weight);
+                ggml_tensor* original_tensor = model_tensor;
-                GGML_ASSERT(ggml_nelements(updown) == ggml_nelements(weight));
+                if (!ggml_backend_is_cpu(runtime_backend) && ggml_backend_buffer_is_host(original_tensor->buffer)) {
-                updown = ggml_scale_inplace(compute_ctx, updown, scale_value);
+                    model_tensor = ggml_dup_tensor(compute_ctx, model_tensor);
-                ggml_tensor* final_weight;
+                    set_backend_tensor_data(model_tensor, original_tensor->data);
-                if (weight->type != GGML_TYPE_F32 && weight->type != GGML_TYPE_F16) {
+                }
-                    // final_weight = ggml_new_tensor(compute_ctx, GGML_TYPE_F32, ggml_n_dims(weight), weight->ne);
+                updown = ggml_reshape(compute_ctx, updown, model_tensor);
-                    // final_weight = ggml_cpy(compute_ctx, weight, final_weight);
+                GGML_ASSERT(ggml_nelements(updown) == ggml_nelements(model_tensor));
-                    final_weight = to_f32(compute_ctx, weight);
+                updown = ggml_scale_inplace(compute_ctx, updown, scale_value);
-                    final_weight = ggml_add_inplace(compute_ctx, final_weight, updown);
+                ggml_tensor* final_tensor;
-                    final_weight = ggml_cpy(compute_ctx, final_weight, weight);
+                if (model_tensor->type != GGML_TYPE_F32 && model_tensor->type != GGML_TYPE_F16) {
-                } else {
+                    final_tensor = to_f32(compute_ctx, model_tensor);
-                    final_weight = ggml_add_inplace(compute_ctx, weight, updown);
+                    final_tensor = ggml_add_inplace(compute_ctx, final_tensor, updown);
                    final_tensor = ggml_cpy(compute_ctx, final_tensor, model_tensor);
                } else {
                    final_tensor = ggml_add_inplace(compute_ctx, model_tensor, updown);
                }
                ggml_build_forward_expand(gf, final_tensor);
                if (!ggml_backend_is_cpu(runtime_backend) && ggml_backend_buffer_is_host(original_tensor->buffer)) {
                    original_tensor_to_final_tensor[original_tensor] = final_tensor;
                }
-                // final_weight = ggml_add_inplace(compute_ctx, weight, updown);  // apply directly
-                ggml_build_forward_expand(gf, final_weight);
                break;
            }
        }
@ -825,10 +854,10 @@ struct LoraModel : public GGMLRunner {
         * this function is called once to calculate the required buffer size
         * and then again to actually generate a graph to be used */
        if (applied_lora_tensors_count != total_lora_tensors_count) {
-            LOG_WARN("Only (%lu / %lu) LoRA tensors have been applied",
+            LOG_WARN("Only (%lu / %lu) LoRA tensors will be applied",
                     applied_lora_tensors_count, total_lora_tensors_count);
        } else {
-            LOG_DEBUG("(%lu / %lu) LoRA tensors applied successfully",
+            LOG_DEBUG("(%lu / %lu) LoRA tensors will be applied",
                      applied_lora_tensors_count, total_lora_tensors_count);
        }
@ -839,7 +868,15 @@ struct LoraModel : public GGMLRunner {
        auto get_graph = [&]() -> struct ggml_cgraph* {
            return build_lora_graph(model_tensors, version);
        };
-        GGMLRunner::compute(get_graph, n_threads, true);
+        GGMLRunner::compute(get_graph, n_threads, false);
        for (auto item : original_tensor_to_final_tensor) {
            ggml_tensor* original_tensor = item.first;
            ggml_tensor* final_tensor    = item.second;
            ggml_backend_tensor_copy(final_tensor, original_tensor);
        }
        original_tensor_to_final_tensor.clear();
        GGMLRunner::free_compute_buffer();
    }
 };
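
Review note: the scale selection and the new bias fallback in build_lora_graph can be summarised outside of ggml. This is a rough sketch under stated assumptions, not the PR's implementation. `resolve_lora_scale` and `bias_lookup_key` are hypothetical names, null pointers stand in for "tensor not present in `lora_tensors`", and the `lora_pre[type]` prefix that the real code prepends to the key is omitted.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Effective factor applied to the LoRA delta (hypothetical helper).
// An explicit ".scale" value wins, otherwise ".alpha" / rank is used,
// otherwise 1.0. The user-supplied multiplier is applied last.
inline float resolve_lora_scale(const float* scale,
                                const float* alpha,
                                int64_t rank,
                                float multiplier) {
    assert(rank > 0);
    float scale_value = 1.0f;
    if (scale != nullptr) {
        scale_value = *scale;
    } else if (alpha != nullptr) {
        scale_value = *alpha / static_cast<float>(rank);
    }
    return scale_value * multiplier;
}

// Key under which a ".diff_b" tensor would be looked up for a bias
// (hypothetical helper): the model tensor name with ".bias" removed.
// Names that do not end in ".bias" are returned unchanged.
inline std::string bias_lookup_key(const std::string& model_tensor_name) {
    const std::string suffix = ".bias";
    if (model_tensor_name.size() >= suffix.size() &&
        model_tensor_name.compare(model_tensor_name.size() - suffix.size(),
                                  suffix.size(), suffix) == 0) {
        return model_tensor_name.substr(0, model_tensor_name.size() - suffix.size());
    }
    return model_tensor_name;
}
```

For example, an alpha of 16 with rank 32 and multiplier 1.0 gives 0.5. In the diff the bias fallback only applies when `to_lora_keys` returns no keys for the tensor name; the sketch leaves that condition to the caller.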
--- a/ltxv.hpp
+++ b/ltxv.hpp
@ -0,0 +1,74 @@
 #ifndef __LTXV_HPP__
 #define __LTXV_HPP__
 #include "common.hpp"
 #include "ggml_extend.hpp"
 namespace LTXV {
    class CausalConv3d : public GGMLBlock {
    protected:
        int time_kernel_size;
    public:
        CausalConv3d(int64_t in_channels,
                     int64_t out_channels,
                     int kernel_size        = 3,
                     std::tuple<int, int, int> stride = {1, 1, 1},
                     int dilation           = 1,
                     bool bias              = true) {
            time_kernel_size = kernel_size / 2;
            blocks["conv"]   = std::shared_ptr<GGMLBlock>(new Conv3d(in_channels,
                                                                     out_channels,
                                                                     {kernel_size, kernel_size, kernel_size},
                                                                     stride,
                                                                     {0, kernel_size / 2, kernel_size / 2},
                                                                     {dilation, 1, 1},
                                                                     bias));
        }
        struct ggml_tensor* forward(struct ggml_context* ctx,
                                    struct ggml_tensor* x,
                                    bool causal = true) {
            // x: [N*IC, ID, IH, IW]
            // result: [N*OC, OD, OH, OW]
            auto conv = std::dynamic_pointer_cast<Conv3d>(blocks["conv"]);
            if (causal) {
                auto h               = ggml_cont(ctx, ggml_permute(ctx, x, 0, 1, 3, 2));                                                  // [ID, N*IC, IH, IW]
                auto first_frame     = ggml_view_3d(ctx, h, h->ne[0], h->ne[1], h->ne[2], h->nb[1], h->nb[2], 0);                         // [N*IC, IH, IW]
                first_frame          = ggml_reshape_4d(ctx, first_frame, first_frame->ne[0], first_frame->ne[1], 1, first_frame->ne[2]);  // [N*IC, 1, IH, IW]
                auto first_frame_pad = first_frame;
                for (int i = 1; i < time_kernel_size - 1; i++) {
                    first_frame_pad = ggml_concat(ctx, first_frame_pad, first_frame, 2);
                }
                x = ggml_concat(ctx, first_frame_pad, x, 2);
            } else {
                auto h         = ggml_cont(ctx, ggml_permute(ctx, x, 0, 1, 3, 2));  // [ID, N*IC, IH, IW]
                int64_t offset = h->nb[2] * h->ne[2];
                auto first_frame     = ggml_view_3d(ctx, h, h->ne[0], h->ne[1], h->ne[2], h->nb[1], h->nb[2], 0);                         // [N*IC, IH, IW]
                first_frame          = ggml_reshape_4d(ctx, first_frame, first_frame->ne[0], first_frame->ne[1], 1, first_frame->ne[2]);  // [N*IC, 1, IH, IW]
                auto first_frame_pad = first_frame;
                for (int i = 1; i < (time_kernel_size - 1) / 2; i++) {
                    first_frame_pad = ggml_concat(ctx, first_frame_pad, first_frame, 2);
                }
                auto last_frame     = ggml_view_3d(ctx, h, h->ne[0], h->ne[1], h->ne[2], h->nb[1], h->nb[2], offset * (h->ne[3] - 1));  // [N*IC, IH, IW]
                last_frame          = ggml_reshape_4d(ctx, last_frame, last_frame->ne[0], last_frame->ne[1], 1, last_frame->ne[2]);     // [N*IC, 1, IH, IW]
                auto last_frame_pad = last_frame;
                for (int i = 1; i < (time_kernel_size - 1) / 2; i++) {
                    last_frame_pad = ggml_concat(ctx, last_frame_pad, last_frame, 2);
                }
                x = ggml_concat(ctx, first_frame_pad, x, 2);
                x = ggml_concat(ctx, x, last_frame_pad, 2);
            }
            x = conv->forward(ctx, x);
            return x;
        }
    };
 };
 #endif
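
Review note: the padding idea in CausalConv3d::forward (repeat the first frame before the clip in causal mode, and repeat both the first and last frame in non-causal mode) is easier to check on a plain list of frames. The sketch below illustrates only that idea. `pad_causal` and `pad_symmetric` are hypothetical names, frames are any copyable type, and the pad counts are passed in by the caller; how they would be derived from `kernel_size` is deliberately not reproduced from the loop bounds in the diff.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Causal padding: prepend `pad` copies of the first frame, so a temporal
// convolution never sees frames later than the one it is producing.
template <typename Frame>
std::vector<Frame> pad_causal(const std::vector<Frame>& frames, size_t pad) {
    if (frames.empty()) {
        return frames;
    }
    std::vector<Frame> out(pad, frames.front());
    out.insert(out.end(), frames.begin(), frames.end());
    return out;
}

// Non-causal padding: repeat the first frame before the clip and the
// last frame after it, `pad_each_side` times each.
template <typename Frame>
std::vector<Frame> pad_symmetric(const std::vector<Frame>& frames, size_t pad_each_side) {
    if (frames.empty()) {
        return frames;
    }
    std::vector<Frame> out(pad_each_side, frames.front());
    out.insert(out.end(), frames.begin(), frames.end());
    out.insert(out.end(), pad_each_side, frames.back());
    return out;
}
```

With frames {1, 2, 3}, `pad_causal(frames, 2)` gives {1, 1, 1, 2, 3} and `pad_symmetric(frames, 1)` gives {1, 1, 2, 3, 3}.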
--- a/mmdit.hpp
+++ b/mmdit.hpp
@ -142,30 +142,6 @@ public:
    }
 };
 class RMSNorm : public UnaryBlock {
 protected:
    int64_t hidden_size;
    float eps;
    void init_params(struct ggml_context* ctx, const String2GGMLType& tensor_types = {}, std::string prefix = "") {
        enum ggml_type wtype = GGML_TYPE_F32;
        params["weight"]     = ggml_new_tensor_1d(ctx, wtype, hidden_size);
    }
 public:
    RMSNorm(int64_t hidden_size,
            float eps = 1e-06f)
        : hidden_size(hidden_size),
          eps(eps) {}
    struct ggml_tensor* forward(struct ggml_context* ctx, struct ggml_tensor* x) {
        struct ggml_tensor* w = params["weight"];
        x                     = ggml_rms_norm(ctx, x, eps);
        x                     = ggml_mul(ctx, x, w);
        return x;
    }
 };
 class SelfAttention : public GGMLBlock {
 public:
    int64_t num_heads;
@ -870,9 +846,10 @@ struct MMDiTRunner : public GGMLRunner {
    MMDiT mmdit;
    MMDiTRunner(ggml_backend_t backend,
                bool offload_params_to_cpu,
                const String2GGMLType& tensor_types = {},
                const std::string prefix            = "")
-        : GGMLRunner(backend), mmdit(tensor_types) {
+        : GGMLRunner(backend, offload_params_to_cpu), mmdit(tensor_types) {
        mmdit.init(params_ctx, tensor_types, prefix);
    }
@ -970,7 +947,7 @@ struct MMDiTRunner : public GGMLRunner {
        // ggml_backend_t backend    = ggml_backend_cuda_init(0);
        ggml_backend_t backend             = ggml_backend_cpu_init();
        ggml_type model_data_type          = GGML_TYPE_F16;
-        std::shared_ptr<MMDiTRunner> mmdit = std::shared_ptr<MMDiTRunner>(new MMDiTRunner(backend));
+        std::shared_ptr<MMDiTRunner> mmdit = std::shared_ptr<MMDiTRunner>(new MMDiTRunner(backend, false));
        {
            LOG_INFO("loading from '%s'", file_path.c_str());
@ -984,7 +961,7 @@ struct MMDiTRunner : public GGMLRunner {
                return;
            }
-            bool success = model_loader.load_tensors(tensors, backend);
+            bool success = model_loader.load_tensors(tensors);
            if (!success) {
                LOG_ERROR("load tensors from model loader failed");
--- a/model.cpp
+++ b/model.cpp
@ -6,10 +6,12 @@
 #include <unordered_map>
 #include <vector>
 #include "gguf_reader.hpp"
 #include "model.h"
 #include "stable-diffusion.h"
 #include "util.h"
 #include "vocab.hpp"
 #include "vocab_umt5.hpp"
 #include "ggml-alloc.h"
 #include "ggml-backend.h"
@ -88,6 +90,7 @@ const char* unused_tensors[] = {
    "posterior_mean_coef1",
    "posterior_mean_coef2",
    "cond_stage_model.transformer.text_model.embeddings.position_ids",
    "cond_stage_model.transformer.vision_model.embeddings.position_ids",
    "cond_stage_model.model.logit_scale",
    "cond_stage_model.model.text_projection",
    "conditioner.embedders.0.transformer.text_model.embeddings.position_ids",
@ -141,6 +144,11 @@ std::unordered_map<std::string, std::string> open_clip_to_hk_clip_resblock = {
    {"mlp.c_proj.weight", "mlp.fc2.weight"},
 };
 std::unordered_map<std::string, std::string> cond_model_name_map = {
    {"transformer.vision_model.pre_layrnorm.weight", "transformer.vision_model.pre_layernorm.weight"},
    {"transformer.vision_model.pre_layrnorm.bias", "transformer.vision_model.pre_layernorm.bias"},
 };
 std::unordered_map<std::string, std::string> vae_decoder_name_map = {
    {"first_stage_model.decoder.mid.attn_1.to_k.bias", "first_stage_model.decoder.mid.attn_1.k.bias"},
    {"first_stage_model.decoder.mid.attn_1.to_k.weight", "first_stage_model.decoder.mid.attn_1.k.weight"},
@ -179,7 +187,7 @@ std::unordered_map<std::string, std::string> pmid_v2_name_map = {
     "pmid.qformer_perceiver.token_proj.fc2.weight"},
 };
-std::string convert_open_clip_to_hf_clip(const std::string& name) {
+std::string convert_cond_model_name(const std::string& name) {
    std::string new_name = name;
    std::string prefix;
    if (contains(new_name, ".enc.")) {
@ -268,6 +276,10 @@ std::string convert_open_clip_to_hf_clip(const std::string& name) {
        new_name = open_clip_to_hf_clip_model[new_name];
    }
    if (cond_model_name_map.find(new_name) != cond_model_name_map.end()) {
        new_name = cond_model_name_map[new_name];
    }
    std::string open_clip_resblock_prefix = "model.transformer.resblocks.";
    std::string hf_clip_resblock_prefix   = "transformer.text_model.encoder.layers.";
@ -563,7 +575,7 @@ std::string convert_tensor_name(std::string name) {
    // }
    std::string new_name = name;
    if (starts_with(name, "cond_stage_model.") || starts_with(name, "conditioner.embedders.") || starts_with(name, "text_encoders.") || ends_with(name, ".vision_model.visual_projection.weight")) {
-        new_name = convert_open_clip_to_hf_clip(name);
+        new_name = convert_cond_model_name(name);
    } else if (starts_with(name, "first_stage_model.decoder")) {
        new_name = convert_vae_decoder_name(name);
    } else if (starts_with(name, "pmid.qformer_perceiver")) {
@ -592,9 +604,11 @@ std::string convert_tensor_name(std::string name) {
        } else {
            new_name = name;
        }
+    } else if (ends_with(name, ".diff") || ends_with(name, ".diff_b")) {
+        new_name = "lora." + name;
    } else if (contains(name, "lora_up") || contains(name, "lora_down") ||
               contains(name, "lora.up") || contains(name, "lora.down") ||
-               contains(name, "lora_linear")) {
+               contains(name, "lora_linear") || ends_with(name, ".alpha")) {
        size_t pos = new_name.find(".processor");
        if (pos != std::string::npos) {
            new_name.replace(pos, strlen(".processor"), "");
@ -602,7 +616,11 @@ std::string convert_tensor_name(std::string name) {
        // if (starts_with(new_name, "transformer.transformer_blocks") || starts_with(new_name, "transformer.single_transformer_blocks")) {
        //     new_name = "model.diffusion_model." + new_name;
        // }
-        pos = new_name.rfind("lora");
+        if (ends_with(name, ".alpha")) {
+            pos = new_name.rfind("alpha");
+        } else {
+            pos = new_name.rfind("lora");
+        }
        if (pos != std::string::npos) {
            std::string name_without_network_parts = new_name.substr(0, pos - 1);
            std::string network_part               = new_name.substr(pos);
@ -684,6 +702,13 @@ void preprocess_tensor(TensorStorage tensor_storage,
        tensor_storage.unsqueeze();
    }
+    // wan vae
+    if (ends_with(new_name, "gamma")) {
+        tensor_storage.reverse_ne();
+        tensor_storage.n_dims = 1;
+        tensor_storage.reverse_ne();
+    }
    tensor_storage.name = new_name;
    if (new_name.find("cond_stage_model") != std::string::npos &&
@ -1030,10 +1055,38 @@ bool ModelLoader::init_from_gguf_file(const std::string& file_path, const std::s
    gguf_context* ctx_gguf_ = NULL;
    ggml_context* ctx_meta_ = NULL;
-    ctx_gguf_               = gguf_init_from_file(file_path.c_str(), {true, &ctx_meta_});
+
+    ctx_gguf_ = gguf_init_from_file(file_path.c_str(), {true, &ctx_meta_});
    if (!ctx_gguf_) {
-        LOG_ERROR("failed to open '%s'", file_path.c_str());
-        return false;
+        LOG_ERROR("failed to open '%s' with gguf_init_from_file. Try to open it with GGUFReader.", file_path.c_str());
+        GGUFReader gguf_reader;
+        if (!gguf_reader.load(file_path)) {
+            LOG_ERROR("failed to open '%s' with GGUFReader.", file_path.c_str());
+            return false;
+        }
+        size_t data_offset = gguf_reader.data_offset();
+        for (const auto& gguf_tensor_info : gguf_reader.tensors()) {
+            std::string name = gguf_tensor_info.name;
+            if (!starts_with(name, prefix)) {
+                name = prefix + name;
+            }
+            TensorStorage tensor_storage(
+                name,
+                gguf_tensor_info.type,
+                gguf_tensor_info.shape.data(),
+                gguf_tensor_info.shape.size(),
+                file_index,
+                data_offset + gguf_tensor_info.offset);
+            // LOG_DEBUG("%s %s", name.c_str(), tensor_storage.to_string().c_str());
+            tensor_storages.push_back(tensor_storage);
+            add_preprocess_tensor_storage_types(tensor_storages_types, tensor_storage.name, tensor_storage.type);
+        }
+        return true;
    }
    int n_tensors = gguf_get_n_tensors(ctx_gguf_);
@ -1047,7 +1100,11 @@ bool ModelLoader::init_from_gguf_file(const std::string& file_path, const std::s
        // LOG_DEBUG("%s", name.c_str());
-        TensorStorage tensor_storage(prefix + name, dummy->type, dummy->ne, ggml_n_dims(dummy), file_index, offset);
+        if (!starts_with(name, prefix)) {
+            name = prefix + name;
+        }
+        TensorStorage tensor_storage(name, dummy->type, dummy->ne, ggml_n_dims(dummy), file_index, offset);
        GGML_ASSERT(ggml_nbytes(dummy) == tensor_storage.nbytes());
@ -1085,7 +1142,7 @@ ggml_type str_to_ggml_type(const std::string& dtype) {
 // https://huggingface.co/docs/safetensors/index
 bool ModelLoader::init_from_safetensors_file(const std::string& file_path, const std::string& prefix) {
-    LOG_DEBUG("init from '%s'", file_path.c_str());
+    LOG_DEBUG("init from '%s', prefix = '%s'", file_path.c_str(), prefix.c_str());
    file_paths_.push_back(file_path);
    size_t file_index = file_paths_.size() - 1;
    std::ifstream file(file_path, std::ios::binary);
@ -1150,6 +1207,10 @@ bool ModelLoader::init_from_safetensors_file(const std::string& file_path, const
        std::string dtype    = tensor_info["dtype"];
        nlohmann::json shape = tensor_info["shape"];
+        if (dtype == "U8") {
+            continue;
+        }
        size_t begin = tensor_info["data_offsets"][0].get<size_t>();
        size_t end   = tensor_info["data_offsets"][1].get<size_t>();
@ -1171,12 +1232,11 @@ bool ModelLoader::init_from_safetensors_file(const std::string& file_path, const
        }
        if (n_dims == 5) {
-            if (ne[3] == 1 && ne[4] == 1) {
-                n_dims = 4;
-            } else {
-                LOG_ERROR("invalid tensor '%s'", name.c_str());
-                return false;
-            }
+            n_dims = 4;
+            ne[0]  = ne[0] * ne[1];
+            ne[1]  = ne[2];
+            ne[2]  = ne[3];
+            ne[3]  = ne[4];
        }
        // ggml_n_dims returns 1 for scalars
@ -1184,7 +1244,11 @@ bool ModelLoader::init_from_safetensors_file(const std::string& file_path, const
            n_dims = 1;
        }
-        TensorStorage tensor_storage(prefix + name, type, ne, n_dims, file_index, ST_HEADER_SIZE_LEN + header_size_ + begin);
+        if (!starts_with(name, prefix)) {
+            name = prefix + name;
+        }
+        TensorStorage tensor_storage(name, type, ne, n_dims, file_index, ST_HEADER_SIZE_LEN + header_size_ + begin);
        tensor_storage.reverse_ne();
        size_t tensor_data_size = end - begin;
@ -1569,7 +1633,11 @@ bool ModelLoader::parse_data_pkl(uint8_t* buffer,
                        reader.tensor_storage.file_index = file_index;
                        // if(strcmp(prefix.c_str(), "scarlett") == 0)
                        // printf(" ZIP got tensor %s \n ", reader.tensor_storage.name.c_str());
-                        reader.tensor_storage.name = prefix + reader.tensor_storage.name;
+                        std::string name = reader.tensor_storage.name;
+                        if (!starts_with(name, prefix)) {
+                            name = prefix + name;
+                        }
+                        reader.tensor_storage.name = name;
                        tensor_storages.push_back(reader.tensor_storage);
                        add_preprocess_tensor_storage_types(tensor_storages_types, reader.tensor_storage.name, reader.tensor_storage.type);
@ -1641,12 +1709,14 @@ SDVersion ModelLoader::get_sd_version() {
    bool has_multiple_encoders = false;
    bool is_unet               = false;
-    bool is_xl   = false;
-    bool is_flux = false;
-#define found_family (is_xl || is_flux)
+    bool is_xl                       = false;
+    bool is_flux                     = false;
+    bool is_wan                      = false;
+    int64_t patch_embedding_channels = 0;
+    bool has_img_emb                 = false;
    for (auto& tensor_storage : tensor_storages) {
-        if (!found_family) {
+        if (!(is_xl || is_flux)) {
            if (tensor_storage.name.find("model.diffusion_model.double_blocks.") != std::string::npos) {
                is_flux = true;
                if (input_block_checked) {
@ -1656,6 +1726,15 @@ SDVersion ModelLoader::get_sd_version() {
            if (tensor_storage.name.find("model.diffusion_model.joint_blocks.") != std::string::npos) {
                return VERSION_SD3;
            }
+            if (tensor_storage.name.find("model.diffusion_model.blocks.0.cross_attn.norm_k.weight") != std::string::npos) {
+                is_wan = true;
+            }
+            if (tensor_storage.name.find("model.diffusion_model.patch_embedding.weight") != std::string::npos) {
+                patch_embedding_channels = tensor_storage.ne[3];
+            }
+            if (tensor_storage.name.find("model.diffusion_model.img_emb") != std::string::npos) {
+                has_img_emb = true;
+            }
            if (tensor_storage.name.find("model.diffusion_model.input_blocks.") != std::string::npos || tensor_storage.name.find("unet.down_blocks.") != std::string::npos) {
                is_unet = true;
                if (has_multiple_encoders) {
@ -1690,11 +1769,21 @@ SDVersion ModelLoader::get_sd_version() {
        if (tensor_storage.name == "model.diffusion_model.input_blocks.0.0.weight" || tensor_storage.name == "model.diffusion_model.img_in.weight" || tensor_storage.name == "unet.conv_in.weight") {
            input_block_weight  = tensor_storage;
            input_block_checked = true;
-            if (found_family) {
+            if (is_xl || is_flux) {
                break;
            }
        }
    }
+    if (is_wan) {
+        // patch_embedding_channels is int64_t, so print it through a long long cast
+        LOG_DEBUG("patch_embedding_channels %lld", (long long)patch_embedding_channels);
+        if (patch_embedding_channels == 184320 && !has_img_emb) {
+            return VERSION_WAN2_2_I2V;
+        }
+        if (patch_embedding_channels == 147456 && !has_img_emb) {
+            return VERSION_WAN2_2_TI2V;
+        }
+        return VERSION_WAN2;
+    }
    bool is_inpaint = input_block_weight.ne[2] == 9;
    bool is_ip2p    = input_block_weight.ne[2] == 8;
    if (is_xl) {
@ -1850,6 +1939,11 @@ std::string ModelLoader::load_t5_tokenizer_json() {
    return json_str;
 }
+std::string ModelLoader::load_umt5_tokenizer_json() {
+    std::string json_str(reinterpret_cast<const char*>(umt5_tokenizer_json_str), sizeof(umt5_tokenizer_json_str));
+    return json_str;
+}
 std::vector<TensorStorage> remove_duplicates(const std::vector<TensorStorage>& vec) {
    std::vector<TensorStorage> res;
    std::unordered_map<std::string, size_t> name_to_index_map;
@ -1871,7 +1965,7 @@ std::vector<TensorStorage> remove_duplicates(const std::vector<TensorStorage>& v
    return res;
 }
-bool ModelLoader::load_tensors(on_new_tensor_cb_t on_new_tensor_cb, ggml_backend_t backend) {
+bool ModelLoader::load_tensors(on_new_tensor_cb_t on_new_tensor_cb) {
    std::vector<TensorStorage> processed_tensor_storages;
    for (auto& tensor_storage : tensor_storages) {
        // LOG_DEBUG("%s", name.c_str());
@ -2080,7 +2174,6 @@ bool ModelLoader::load_tensors(on_new_tensor_cb_t on_new_tensor_cb, ggml_backend
 }
 bool ModelLoader::load_tensors(std::map<std::string, struct ggml_tensor*>& tensors,
-                               ggml_backend_t backend,
                               std::set<std::string> ignore_tensors) {
    std::set<std::string> tensor_names_in_file;
    auto on_new_tensor_cb = [&](const TensorStorage& tensor_storage, ggml_tensor** dst_tensor) -> bool {
@ -2120,7 +2213,7 @@ bool ModelLoader::load_tensors(std::map<std::string, struct ggml_tensor*>& tenso
        return true;
    };
-    bool success = load_tensors(on_new_tensor_cb, backend);
+    bool success = load_tensors(on_new_tensor_cb);
    if (!success) {
        LOG_ERROR("load tensors from file failed");
        return false;
@ -2151,7 +2244,7 @@ bool ModelLoader::load_tensors(std::map<std::string, struct ggml_tensor*>& tenso
 std::vector<std::pair<std::string, ggml_type>> parse_tensor_type_rules(const std::string& tensor_type_rules) {
    std::vector<std::pair<std::string, ggml_type>> result;
-    for (const auto& item : splitString(tensor_type_rules, ',')) {
+    for (const auto& item : split_string(tensor_type_rules, ',')) {
        if (item.size() == 0)
            continue;
        std::string::size_type pos = item.find('=');
@ -2264,7 +2357,7 @@ bool ModelLoader::save_to_gguf_file(const std::string& file_path, ggml_type type
        return true;
    };
-    bool success = load_tensors(on_new_tensor_cb, backend);
+    bool success = load_tensors(on_new_tensor_cb);
    ggml_backend_free(backend);
    LOG_INFO("load tensors done");
    LOG_INFO("trying to save tensors to %s", file_path.c_str());
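
Note on the 5-D handling in `init_from_safetensors_file` above: instead of rejecting 5-D tensors, the loader now merges the two outermost dimensions into one before `reverse_ne()` is applied. A minimal standalone sketch of that shape arithmetic follows. `fold_5d_shape` is a hypothetical helper that is not part of this commit, the example shape is illustrative only, and the sketch assumes `reverse_ne()` simply reverses the dimension order.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper (not in model.cpp): fold a 5-D shape, given outermost
// dimension first as in a safetensors header, into 4-D by merging the two
// outermost dimensions, then reverse it into innermost-first order.
static std::array<int64_t, 4> fold_5d_shape(const std::vector<int64_t>& shape) {
    assert(shape.size() == 5);
    // Same arithmetic as the patched loader: ne[0] = ne[0] * ne[1], rest shift down.
    const std::array<int64_t, 4> ne = {{shape[0] * shape[1], shape[2], shape[3], shape[4]}};
    // Assumed behaviour of TensorStorage::reverse_ne(): reverse the order.
    return {{ne[3], ne[2], ne[1], ne[0]}};
}
```

Under these assumptions, an illustrative shape of `{5120, 36, 1, 2, 2}` folds to `ne = {2, 2, 1, 184320}`, so `ne[3]` holds the merged product, which is the kind of value `get_sd_version` compares `patch_embedding_channels` against.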
--- a/model.h
+++ b/model.h
@ -31,23 +31,12 @@ enum SDVersion {
    VERSION_SD3,
    VERSION_FLUX,
    VERSION_FLUX_FILL,
+    VERSION_WAN2,
+    VERSION_WAN2_2_I2V,
+    VERSION_WAN2_2_TI2V,
    VERSION_COUNT,
 };
-static inline bool sd_version_is_flux(SDVersion version) {
-    if (version == VERSION_FLUX || version == VERSION_FLUX_FILL) {
-        return true;
-    }
-    return false;
-}
-static inline bool sd_version_is_sd3(SDVersion version) {
-    if (version == VERSION_SD3) {
-        return true;
-    }
-    return false;
-}
 static inline bool sd_version_is_sd1(SDVersion version) {
    if (version == VERSION_SD1 || version == VERSION_SD1_INPAINT || version == VERSION_SD1_PIX2PIX) {
        return true;
@ -69,6 +58,27 @@ static inline bool sd_version_is_sdxl(SDVersion version) {
    return false;
 }
+static inline bool sd_version_is_sd3(SDVersion version) {
+    if (version == VERSION_SD3) {
+        return true;
+    }
+    return false;
+}
+static inline bool sd_version_is_flux(SDVersion version) {
+    if (version == VERSION_FLUX || version == VERSION_FLUX_FILL) {
+        return true;
+    }
+    return false;
+}
+static inline bool sd_version_is_wan(SDVersion version) {
+    if (version == VERSION_WAN2 || version == VERSION_WAN2_2_I2V || version == VERSION_WAN2_2_TI2V) {
+        return true;
+    }
+    return false;
+}
 static inline bool sd_version_is_inpaint(SDVersion version) {
    if (version == VERSION_SD1_INPAINT || version == VERSION_SD2_INPAINT || version == VERSION_SDXL_INPAINT || version == VERSION_FLUX_FILL) {
        return true;
@ -77,7 +87,7 @@ static inline bool sd_version_is_inpaint(SDVersion version) {
 }
 static inline bool sd_version_is_dit(SDVersion version) {
-    if (sd_version_is_flux(version) || sd_version_is_sd3(version)) {
+    if (sd_version_is_flux(version) || sd_version_is_sd3(version) || sd_version_is_wan(version)) {
        return true;
    }
    return false;
@ -113,7 +123,7 @@ struct TensorStorage {
    TensorStorage() = default;
-    TensorStorage(const std::string& name, ggml_type type, int64_t* ne, int n_dims, size_t file_index, size_t offset = 0)
+    TensorStorage(const std::string& name, ggml_type type, const int64_t* ne, int n_dims, size_t file_index, size_t offset = 0)
        : name(name), type(type), n_dims(n_dims), file_index(file_index), offset(offset) {
        for (int i = 0; i < n_dims; i++) {
            this->ne[i] = ne[i];
@ -237,9 +247,8 @@ public:
    ggml_type get_diffusion_model_wtype();
    ggml_type get_vae_wtype();
    void set_wtype_override(ggml_type wtype, std::string prefix = "");
-    bool load_tensors(on_new_tensor_cb_t on_new_tensor_cb, ggml_backend_t backend);
+    bool load_tensors(on_new_tensor_cb_t on_new_tensor_cb);
    bool load_tensors(std::map<std::string, struct ggml_tensor*>& tensors,
-                      ggml_backend_t backend,
                      std::set<std::string> ignore_tensors = {});
    bool save_to_gguf_file(const std::string& file_path, ggml_type type, const std::string& tensor_type_rules);
@ -249,6 +258,7 @@ public:
    static std::string load_merges();
    static std::string load_t5_tokenizer_json();
+    static std::string load_umt5_tokenizer_json();
 };
 #endif  // __MODEL_H__
--- a/pmid.hpp
+++ b/pmid.hpp
@ -624,12 +624,13 @@ public:
 public:
    PhotoMakerIDEncoder(ggml_backend_t backend,
+                        bool offload_params_to_cpu,
                        const String2GGMLType& tensor_types,
                        const std::string prefix,
                        SDVersion version = VERSION_SDXL,
                        PMVersion pm_v    = PM_VERSION_1,
                        float sty         = 20.f)
-        : GGMLRunner(backend),
+        : GGMLRunner(backend, offload_params_to_cpu),
          version(version),
          pm_version(pm_v),
          style_strength(sty) {
@ -785,10 +786,11 @@ struct PhotoMakerIDEmbed : public GGMLRunner {
    bool applied     = false;
    PhotoMakerIDEmbed(ggml_backend_t backend,
+                      bool offload_params_to_cpu,
                      ModelLoader* ml,
                      const std::string& file_path = "",
                      const std::string& prefix    = "")
-        : file_path(file_path), GGMLRunner(backend), model_loader(ml) {
+        : file_path(file_path), GGMLRunner(backend, offload_params_to_cpu), model_loader(ml) {
        if (!model_loader->init_from_file(file_path, prefix)) {
            load_failed = true;
        }
@ -828,11 +830,11 @@ struct PhotoMakerIDEmbed : public GGMLRunner {
            return true;
        };
-        model_loader->load_tensors(on_new_tensor_cb, backend);
+        model_loader->load_tensors(on_new_tensor_cb);
        alloc_params_buffer();
        dry_run = false;
-        model_loader->load_tensors(on_new_tensor_cb, backend);
+        model_loader->load_tensors(on_new_tensor_cb);
        LOG_DEBUG("finished loading PhotoMaker ID Embeds ");
        return true;
--- a/rope.hpp
+++ b/rope.hpp
@ -0,0 +1,252 @@
 #ifndef __ROPE_HPP__
 #define __ROPE_HPP__
 #include <vector>
 #include "ggml_extend.hpp"
 struct Rope {
    template <class T>
    static std::vector<T> linspace(T start, T end, int num) {
        std::vector<T> result(num);
        if (num == 1) {
            result[0] = start;
            return result;
        }
        T step = (end - start) / (num - 1);
        for (int i = 0; i < num; ++i) {
            result[i] = start + i * step;
        }
        return result;
    }
    static std::vector<std::vector<float>> transpose(const std::vector<std::vector<float>>& mat) {
        int rows = mat.size();
        int cols = mat[0].size();
        std::vector<std::vector<float>> transposed(cols, std::vector<float>(rows));
        for (int i = 0; i < rows; ++i) {
            for (int j = 0; j < cols; ++j) {
                transposed[j][i] = mat[i][j];
            }
        }
        return transposed;
    }
    static std::vector<float> flatten(const std::vector<std::vector<float>>& vec) {
        std::vector<float> flat_vec;
        for (const auto& sub_vec : vec) {
            flat_vec.insert(flat_vec.end(), sub_vec.begin(), sub_vec.end());
        }
        return flat_vec;
    }
    static std::vector<std::vector<float>> rope(const std::vector<float>& pos, int dim, int theta) {
        assert(dim % 2 == 0);
        int half_dim = dim / 2;
        std::vector<float> scale = linspace(0.f, (dim * 1.f - 2) / dim, half_dim);
        std::vector<float> omega(half_dim);
        for (int i = 0; i < half_dim; ++i) {
            omega[i] = 1.0 / std::pow(theta, scale[i]);
        }
        int pos_size = pos.size();
        std::vector<std::vector<float>> out(pos_size, std::vector<float>(half_dim));
        for (int i = 0; i < pos_size; ++i) {
            for (int j = 0; j < half_dim; ++j) {
                out[i][j] = pos[i] * omega[j];
            }
        }
        std::vector<std::vector<float>> result(pos_size, std::vector<float>(half_dim * 4));
        for (int i = 0; i < pos_size; ++i) {
            for (int j = 0; j < half_dim; ++j) {
                result[i][4 * j]     = std::cos(out[i][j]);
                result[i][4 * j + 1] = -std::sin(out[i][j]);
                result[i][4 * j + 2] = std::sin(out[i][j]);
                result[i][4 * j + 3] = std::cos(out[i][j]);
            }
        }
        return result;
    }
    // Generate IDs for image patches and text
    static std::vector<std::vector<float>> gen_txt_ids(int bs, int context_len) {
        return std::vector<std::vector<float>>(bs * context_len, std::vector<float>(3, 0.0));
    }
    static std::vector<std::vector<float>> gen_img_ids(int h, int w, int patch_size, int bs, int index = 0, int h_offset = 0, int w_offset = 0) {
        int h_len = (h + (patch_size / 2)) / patch_size;
        int w_len = (w + (patch_size / 2)) / patch_size;
        std::vector<std::vector<float>> img_ids(h_len * w_len, std::vector<float>(3, 0.0));
        std::vector<float> row_ids = linspace<float>(h_offset, h_len - 1 + h_offset, h_len);
        std::vector<float> col_ids = linspace<float>(w_offset, w_len - 1 + w_offset, w_len);
        for (int i = 0; i < h_len; ++i) {
            for (int j = 0; j < w_len; ++j) {
                img_ids[i * w_len + j][0] = index;
                img_ids[i * w_len + j][1] = row_ids[i];
                img_ids[i * w_len + j][2] = col_ids[j];
            }
        }
        std::vector<std::vector<float>> img_ids_repeated(bs * img_ids.size(), std::vector<float>(3));
        for (int i = 0; i < bs; ++i) {
            for (int j = 0; j < img_ids.size(); ++j) {
                img_ids_repeated[i * img_ids.size() + j] = img_ids[j];
            }
        }
        return img_ids_repeated;
    }
    static std::vector<std::vector<float>> concat_ids(const std::vector<std::vector<float>>& a,
                                                      const std::vector<std::vector<float>>& b,
                                                      int bs) {
        size_t a_len = a.size() / bs;
        size_t b_len = b.size() / bs;
        std::vector<std::vector<float>> ids(a.size() + b.size(), std::vector<float>(3));
        for (int i = 0; i < bs; ++i) {
            for (int j = 0; j < a_len; ++j) {
                ids[i * (a_len + b_len) + j] = a[i * a_len + j];
            }
            for (int j = 0; j < b_len; ++j) {
                ids[i * (a_len + b_len) + a_len + j] = b[i * b_len + j];
            }
        }
        return ids;
    }
    static std::vector<float> embed_nd(const std::vector<std::vector<float>>& ids,
                                       int bs,
                                       int theta,
                                       const std::vector<int>& axes_dim) {
        std::vector<std::vector<float>> trans_ids = transpose(ids);
        size_t pos_len                            = ids.size() / bs;
        int num_axes                              = axes_dim.size();
        // for (int i = 0; i < pos_len; i++) {
        //     std::cout << trans_ids[0][i] << " " << trans_ids[1][i] << " " << trans_ids[2][i] << std::endl;
        // }
        int emb_dim = 0;
        for (int d : axes_dim)
            emb_dim += d / 2;
        std::vector<std::vector<float>> emb(bs * pos_len, std::vector<float>(emb_dim * 2 * 2, 0.0));
        int offset = 0;
        for (int i = 0; i < num_axes; ++i) {
            std::vector<std::vector<float>> rope_emb = rope(trans_ids[i], axes_dim[i], theta);  // [bs*pos_len, axes_dim[i]/2 * 2 * 2]
            for (int b = 0; b < bs; ++b) {
                for (int j = 0; j < pos_len; ++j) {
                    for (int k = 0; k < rope_emb[0].size(); ++k) {
                        emb[b * pos_len + j][offset + k] = rope_emb[j][k];
                    }
                }
            }
            offset += rope_emb[0].size();
        }
        return flatten(emb);
    }
    static std::vector<std::vector<float>> gen_flux_ids(int h,
                                                        int w,
                                                        int patch_size,
                                                        int bs,
                                                        int context_len,
                                                        std::vector<ggml_tensor*> ref_latents) {
        auto txt_ids = gen_txt_ids(bs, context_len);
        auto img_ids = gen_img_ids(h, w, patch_size, bs);
        auto ids               = concat_ids(txt_ids, img_ids, bs);
        uint64_t curr_h_offset = 0;
        uint64_t curr_w_offset = 0;
        for (ggml_tensor* ref : ref_latents) {
            uint64_t h_offset = 0;
            uint64_t w_offset = 0;
            if (ref->ne[1] + curr_h_offset > ref->ne[0] + curr_w_offset) {
                w_offset = curr_w_offset;
            } else {
                h_offset = curr_h_offset;
            }
            auto ref_ids = gen_img_ids(ref->ne[1], ref->ne[0], patch_size, bs, 1, h_offset, w_offset);
            ids          = concat_ids(ids, ref_ids, bs);
            curr_h_offset = std::max(curr_h_offset, ref->ne[1] + h_offset);
            curr_w_offset = std::max(curr_w_offset, ref->ne[0] + w_offset);
        }
        return ids;
    }
    // Generate flux positional embeddings
    static std::vector<float> gen_flux_pe(int h,
                                          int w,
                                          int patch_size,
                                          int bs,
                                          int context_len,
                                          std::vector<ggml_tensor*> ref_latents,
                                          int theta,
                                          const std::vector<int>& axes_dim) {
        std::vector<std::vector<float>> ids = gen_flux_ids(h, w, patch_size, bs, context_len, ref_latents);
        return embed_nd(ids, bs, theta, axes_dim);
    }
    static std::vector<std::vector<float>> gen_vid_ids(int t,
                                                       int h,
                                                       int w,
                                                       int pt,
                                                       int ph,
                                                       int pw,
                                                       int bs,
                                                       int t_offset = 0,
                                                       int h_offset = 0,
                                                       int w_offset = 0) {
        int t_len = (t + (pt / 2)) / pt;
        int h_len = (h + (ph / 2)) / ph;
        int w_len = (w + (pw / 2)) / pw;
        std::vector<std::vector<float>> vid_ids(t_len * h_len * w_len, std::vector<float>(3, 0.0));
        std::vector<float> t_ids = linspace<float>(t_offset, t_len - 1 + t_offset, t_len);
        std::vector<float> h_ids = linspace<float>(h_offset, h_len - 1 + h_offset, h_len);
        std::vector<float> w_ids = linspace<float>(w_offset, w_len - 1 + w_offset, w_len);
        for (int i = 0; i < t_len; ++i) {
            for (int j = 0; j < h_len; ++j) {
                for (int k = 0; k < w_len; ++k) {
                    int idx         = i * h_len * w_len + j * w_len + k;
                    vid_ids[idx][0] = t_ids[i];
                    vid_ids[idx][1] = h_ids[j];
                    vid_ids[idx][2] = w_ids[k];
                }
            }
        }
        std::vector<std::vector<float>> vid_ids_repeated(bs * vid_ids.size(), std::vector<float>(3));
        for (int i = 0; i < bs; ++i) {
            for (int j = 0; j < vid_ids.size(); ++j) {
                vid_ids_repeated[i * vid_ids.size() + j] = vid_ids[j];
            }
        }
        return vid_ids_repeated;
    }
    // Generate wan positional embeddings
    static std::vector<float> gen_wan_pe(int t,
                                         int h,
                                         int w,
                                         int pt,
                                         int ph,
                                         int pw,
                                         int bs,
                                         int theta,
                                         const std::vector<int>& axes_dim) {
        std::vector<std::vector<float>> ids = gen_vid_ids(t, h, w, pt, ph, pw, bs);
        return embed_nd(ids, bs, theta, axes_dim);
    }
 };  // struct Rope
 #endif  // __ROPE_HPP__
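
Note on `Rope::rope` above: for each position and each frequency index it stores a 2x2 rotation matrix in the order `[cos, -sin, sin, cos]`, and the `linspace` call yields exponents `0, 2/dim, ..., (dim - 2)/dim`. The sketch below reproduces that table for a single axis without the helper functions. `rope_table` is a hypothetical name used only for illustration and is not part of `rope.hpp`.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical standalone version of the per-axis table built by Rope::rope.
// Row i holds, for every frequency index j, the rotation matrix
// [cos, -sin, sin, cos] of the angle pos[i] * theta^(-2j/dim).
static std::vector<std::vector<float>> rope_table(const std::vector<float>& pos, int dim, float theta) {
    assert(dim > 0 && dim % 2 == 0);
    const int half_dim = dim / 2;
    std::vector<std::vector<float>> table(pos.size(), std::vector<float>(half_dim * 4));
    for (size_t i = 0; i < pos.size(); ++i) {
        for (int j = 0; j < half_dim; ++j) {
            // Exponent 2j/dim matches linspace(0, (dim - 2) / dim, half_dim).
            const float omega = 1.0f / std::pow(theta, (2.0f * j) / dim);
            const float angle = pos[i] * omega;
            table[i][4 * j + 0] = std::cos(angle);
            table[i][4 * j + 1] = -std::sin(angle);
            table[i][4 * j + 2] = std::sin(angle);
            table[i][4 * j + 3] = std::cos(angle);
        }
    }
    return table;
}
```

Position 0 maps to identity matrices for every frequency, and each row has `dim * 2` entries, which is why `embed_nd` sizes its output rows as `emb_dim * 2 * 2`.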
--- a/stable-diffusion.cpp
+++ b/stable-diffusion.cpp
--- a/stable-diffusion.h
+++ b/stable-diffusion.h
@ -50,7 +50,7 @@ enum sample_method_t {
    SAMPLE_METHOD_COUNT
 };
-enum schedule_t {
+enum scheduler_t {
    DEFAULT,
    DISCRETE,
    KARRAS,
@ -101,7 +101,8 @@ enum sd_type_t {
    // SD_TYPE_IQ4_NL_4_4 = 36,
    // SD_TYPE_IQ4_NL_4_8 = 37,
    // SD_TYPE_IQ4_NL_8_8 = 38,
-    SD_TYPE_COUNT = 39,
+    SD_TYPE_MXFP4 = 39,  // MXFP4 (1 block)
+    SD_TYPE_COUNT = 40,
 };
 enum sd_log_level_t {
@ -115,8 +116,10 @@ typedef struct {
    const char* model_path;
    const char* clip_l_path;
    const char* clip_g_path;
+    const char* clip_vision_path;
    const char* t5xxl_path;
    const char* diffusion_model_path;
+    const char* high_noise_diffusion_model_path;
    const char* vae_path;
    const char* taesd_path;
    const char* control_net_path;
@ -129,7 +132,7 @@ typedef struct {
    int n_threads;
    enum sd_type_t wtype;
    enum rng_type_t rng_type;
-    enum schedule_t schedule;
+    bool offload_params_to_cpu;
    bool keep_clip_on_cpu;
    bool keep_control_net_on_cpu;
    bool keep_vae_on_cpu;
@ -159,29 +162,33 @@ typedef struct {
 typedef struct {
    float txt_cfg;
    float img_cfg;
    float min_cfg;
    float distilled_guidance;
    sd_slg_params_t slg;
 } sd_guidance_params_t;
+typedef struct {
+    sd_guidance_params_t guidance;
+    enum scheduler_t scheduler;
+    enum sample_method_t sample_method;
+    int sample_steps;
+    float eta;
+} sd_sample_params_t;
 typedef struct {
    const char* prompt;
    const char* negative_prompt;
    int clip_skip;
    sd_guidance_params_t guidance;
    sd_image_t init_image;
    sd_image_t* ref_images;
    int ref_images_count;
    sd_image_t mask_image;
    int width;
    int height;
-    enum sample_method_t sample_method;
+    sd_sample_params_t sample_params;
    int sample_steps;
    float eta;
    float strength;
    int64_t seed;
    int batch_count;
-    const sd_image_t* control_cond;
+    sd_image_t control_image;
    float control_strength;
    float style_strength;
    bool normalize_input;
@ -189,18 +196,18 @@ typedef struct {
 } sd_img_gen_params_t;
 typedef struct {
    const char* prompt;
    const char* negative_prompt;
    int clip_skip;
    sd_image_t init_image;
    sd_image_t end_image;
    int width;
    int height;
-    sd_guidance_params_t guidance;
+    sd_sample_params_t sample_params;
-    enum sample_method_t sample_method;
+    sd_sample_params_t high_noise_sample_params;
    int sample_steps;
    float strength;
    int64_t seed;
    int video_frames;
    int motion_bucket_id;
    int fps;
    float augmentation_level;
 } sd_vid_gen_params_t;
 typedef struct sd_ctx_t sd_ctx_t;
@ -219,8 +226,8 @@ SD_API const char* sd_rng_type_name(enum rng_type_t rng_type);
 SD_API enum rng_type_t str_to_rng_type(const char* str);
 SD_API const char* sd_sample_method_name(enum sample_method_t sample_method);
 SD_API enum sample_method_t str_to_sample_method(const char* str);
-SD_API const char* sd_schedule_name(enum schedule_t schedule);
+SD_API const char* sd_schedule_name(enum scheduler_t scheduler);
-SD_API enum schedule_t str_to_schedule(const char* str);
+SD_API enum scheduler_t str_to_schedule(const char* str);
 SD_API void sd_ctx_params_init(sd_ctx_params_t* sd_ctx_params);
 SD_API char* sd_ctx_params_to_str(const sd_ctx_params_t* sd_ctx_params);
@ -228,21 +235,27 @@ SD_API char* sd_ctx_params_to_str(const sd_ctx_params_t* sd_ctx_params);
 SD_API sd_ctx_t* new_sd_ctx(const sd_ctx_params_t* sd_ctx_params);
 SD_API void free_sd_ctx(sd_ctx_t* sd_ctx);
+SD_API void sd_sample_params_init(sd_sample_params_t* sample_params);
+SD_API char* sd_sample_params_to_str(const sd_sample_params_t* sample_params);
 SD_API void sd_img_gen_params_init(sd_img_gen_params_t* sd_img_gen_params);
 SD_API char* sd_img_gen_params_to_str(const sd_img_gen_params_t* sd_img_gen_params);
 SD_API sd_image_t* generate_image(sd_ctx_t* sd_ctx, const sd_img_gen_params_t* sd_img_gen_params);
 SD_API void sd_vid_gen_params_init(sd_vid_gen_params_t* sd_vid_gen_params);
-SD_API sd_image_t* generate_video(sd_ctx_t* sd_ctx, const sd_vid_gen_params_t* sd_vid_gen_params);  // broken
+SD_API sd_image_t* generate_video(sd_ctx_t* sd_ctx, const sd_vid_gen_params_t* sd_vid_gen_params, int* num_frames_out);
 typedef struct upscaler_ctx_t upscaler_ctx_t;
 SD_API upscaler_ctx_t* new_upscaler_ctx(const char* esrgan_path,
-                                        int n_threads,
-                                        bool direct);
+                                        bool offload_params_to_cpu,
+                                        bool direct,
+                                        int n_threads);
 SD_API void free_upscaler_ctx(upscaler_ctx_t* upscaler_ctx);
-SD_API sd_image_t upscale(upscaler_ctx_t* upscaler_ctx, sd_image_t input_image, uint32_t upscale_factor);
+SD_API sd_image_t upscale(upscaler_ctx_t* upscaler_ctx,
+                          sd_image_t input_image,
+                          uint32_t upscale_factor);
 SD_API bool convert(const char* input_path,
                    const char* vae_path,
--- a/t5.hpp
+++ b/t5.hpp
@ -124,7 +124,10 @@ protected:
                return;
            }
            std::string piece = item[0];
-            float score       = item[1];
+            if (piece.empty()) {
+                piece = "<empty_token>";
+            }
+            float score = item[1];
            piece_score_pairs.emplace_back(piece, score);
        }
    }
@ -147,6 +150,7 @@ protected:
        std::vector<const char*> key(pieces->size());
        std::vector<int> value(pieces->size());
        for (size_t i = 0; i < pieces->size(); ++i) {
+            // LOG_DEBUG("%s %d", (*pieces)[i].first.c_str(), (*pieces)[i].second);
            key[i]   = (*pieces)[i].first.data();  // sorted piece.
            value[i] = (*pieces)[i].second;        // vocab_id
        }
@ -335,9 +339,9 @@ protected:
    }
 public:
-    explicit T5UniGramTokenizer(const std::string& json_str = "") {
+    explicit T5UniGramTokenizer(bool is_umt5 = false) {
-        if (json_str.size() != 0) {
+        if (is_umt5) {
-            InitializePieces(json_str);
+            InitializePieces(ModelLoader::load_umt5_tokenizer_json());
        } else {
            InitializePieces(ModelLoader::load_t5_tokenizer_json());
        }
@ -673,10 +677,11 @@ public:
            int64_t model_dim,
            int64_t inner_dim,
            int64_t ff_dim,
-            int64_t num_heads)
+            int64_t num_heads,
+            bool relative_attention = true)
        : num_layers(num_layers) {
        for (int i = 0; i < num_layers; i++) {
-            blocks["block." + std::to_string(i)] = std::shared_ptr<GGMLBlock>(new T5Block(model_dim, inner_dim, ff_dim, num_heads, i == 0));
+            blocks["block." + std::to_string(i)] = std::shared_ptr<GGMLBlock>(new T5Block(model_dim, inner_dim, ff_dim, num_heads, (!relative_attention || i == 0)));
        }
        blocks["final_layer_norm"] = std::shared_ptr<GGMLBlock>(new T5LayerNorm(model_dim));
@ -703,15 +708,30 @@ public:
    }
 };
+struct T5Params {
+    int64_t num_layers      = 24;
+    int64_t model_dim       = 4096;
+    int64_t ff_dim          = 10240;
+    int64_t num_heads       = 64;
+    int64_t vocab_size      = 32128;
+    bool relative_attention = true;
+};
 struct T5 : public GGMLBlock {
+    T5Params params;
 public:
-    T5(int64_t num_layers,
-       int64_t model_dim,
-       int64_t ff_dim,
-       int64_t num_heads,
-       int64_t vocab_size) {
-        blocks["encoder"] = std::shared_ptr<GGMLBlock>(new T5Stack(num_layers, model_dim, model_dim, ff_dim, num_heads));
-        blocks["shared"]  = std::shared_ptr<GGMLBlock>(new Embedding(vocab_size, model_dim));
+    T5() {}
+    T5(T5Params params)
+        : params(params) {
+        blocks["encoder"] = std::shared_ptr<GGMLBlock>(new T5Stack(params.num_layers,
+                                                                   params.model_dim,
+                                                                   params.model_dim,
+                                                                   params.ff_dim,
+                                                                   params.num_heads,
+                                                                   params.relative_attention));
+        blocks["shared"]  = std::shared_ptr<GGMLBlock>(new Embedding(params.vocab_size,
+                                                                     params.model_dim));
    }
    struct ggml_tensor* forward(struct ggml_context* ctx,
@ -731,18 +751,21 @@ public:
 };
 struct T5Runner : public GGMLRunner {
+    T5Params params;
    T5 model;
    std::vector<int> relative_position_bucket_vec;
    T5Runner(ggml_backend_t backend,
+             bool offload_params_to_cpu,
             const String2GGMLType& tensor_types,
             const std::string prefix,
-             int64_t num_layers = 24,
-             int64_t model_dim  = 4096,
-             int64_t ff_dim     = 10240,
-             int64_t num_heads  = 64,
-             int64_t vocab_size = 32128)
-        : GGMLRunner(backend), model(num_layers, model_dim, ff_dim, num_heads, vocab_size) {
+             bool is_umt5 = false)
+        : GGMLRunner(backend, offload_params_to_cpu) {
+        if (is_umt5) {
+            params.vocab_size         = 256384;
+            params.relative_attention = false;
+        }
+        model = T5(params);
        model.init(params_ctx, tensor_types, prefix);
    }
@ -769,7 +792,8 @@ struct T5Runner : public GGMLRunner {
                                    struct ggml_tensor* attention_mask = NULL) {
        struct ggml_cgraph* gf = ggml_new_graph(compute_ctx);
-        input_ids = to_backend(input_ids);
+        input_ids      = to_backend(input_ids);
+        attention_mask = to_backend(attention_mask);
        relative_position_bucket_vec = compute_relative_position_bucket(input_ids->ne[0], input_ids->ne[0]);
@ -877,14 +901,11 @@ struct T5Embedder {
    T5Runner model;
    T5Embedder(ggml_backend_t backend,
+               bool offload_params_to_cpu,
               const String2GGMLType& tensor_types = {},
               const std::string prefix            = "",
-               int64_t num_layers                  = 24,
-               int64_t model_dim                   = 4096,
-               int64_t ff_dim                      = 10240,
-               int64_t num_heads                   = 64,
-               int64_t vocab_size                  = 32128)
-        : model(backend, tensor_types, prefix, num_layers, model_dim, ff_dim, num_heads, vocab_size) {
+               bool is_umt5                        = false)
+        : model(backend, offload_params_to_cpu, tensor_types, prefix, is_umt5), tokenizer(is_umt5) {
    }
    void get_param_tensors(std::map<std::string, struct ggml_tensor*>& tensors, const std::string prefix) {
@ -946,25 +967,22 @@ struct T5Embedder {
        GGML_ASSERT(work_ctx != NULL);
        {
            // cpu f16: pass
            // cpu f32: pass
            // cuda f16: nan
            // cuda f32: pass
            // cuda q8_0: nan
            // TODO: fix cuda nan
            std::string text("a lovely cat");
-            auto tokens_and_weights     = tokenize(text, 77, true);
+            // std::string text("一只可爱的猫"); // umt5 chinease test
            auto tokens_and_weights     = tokenize(text, 512, true);
            std::vector<int>& tokens    = std::get<0>(tokens_and_weights);
            std::vector<float>& weights = std::get<1>(tokens_and_weights);
            std::vector<float>& masks   = std::get<2>(tokens_and_weights);
            for (auto token : tokens) {
                printf("%d ", token);
            }
            printf("\n");
            auto input_ids          = vector_to_ggml_tensor_i32(work_ctx, tokens);
            auto attention_mask     = vector_to_ggml_tensor(work_ctx, masks);
            struct ggml_tensor* out = NULL;
            int t0 = ggml_time_ms();
-            model.compute(8, input_ids, NULL, &out, work_ctx);
+            model.compute(8, input_ids, attention_mask, &out, work_ctx);
            int t1 = ggml_time_ms();
            print_ggml_tensor(out);
@ -973,32 +991,43 @@ struct T5Embedder {
    }
    static void load_from_file_and_test(const std::string& file_path) {
-        // ggml_backend_t backend    = ggml_backend_cuda_init(0);
-        ggml_backend_t backend         = ggml_backend_cpu_init();
-        ggml_type model_data_type      = GGML_TYPE_F32;
-        std::shared_ptr<T5Embedder> t5 = std::shared_ptr<T5Embedder>(new T5Embedder(backend));
-        {
-            LOG_INFO("loading from '%s'", file_path.c_str());
-            t5->alloc_params_buffer();
-            std::map<std::string, ggml_tensor*> tensors;
-            t5->get_param_tensors(tensors, "");
-
-            ModelLoader model_loader;
-            if (!model_loader.init_from_file(file_path)) {
-                LOG_ERROR("init model loader from file failed: '%s'", file_path.c_str());
-                return;
-            }
-            bool success = model_loader.load_tensors(tensors, backend);
-            if (!success) {
-                LOG_ERROR("load tensors from model loader failed");
-                return;
-            }
-            LOG_INFO("t5 model loaded");
-        }
+        // cpu f16: pass
+        // cpu f32: pass
+        // cuda f16: pass
+        // cuda f32: pass
+        // cuda q8_0: pass
+        // ggml_backend_t backend = ggml_backend_cuda_init(0);
+        ggml_backend_t backend    = ggml_backend_cpu_init();
+        ggml_type model_data_type = GGML_TYPE_F16;
+        ModelLoader model_loader;
+        if (!model_loader.init_from_file(file_path)) {
+            LOG_ERROR("init model loader from file failed: '%s'", file_path.c_str());
+            return;
+        }
+        auto tensor_types = model_loader.tensor_storages_types;
+        for (auto& item : tensor_types) {
+            // LOG_DEBUG("%s %u", item.first.c_str(), item.second);
+            if (ends_with(item.first, "weight")) {
+                item.second = model_data_type;
+            }
+        }
+        std::shared_ptr<T5Embedder> t5 = std::shared_ptr<T5Embedder>(new T5Embedder(backend, false, tensor_types, "", true));
+        t5->alloc_params_buffer();
+        std::map<std::string, ggml_tensor*> tensors;
+        t5->get_param_tensors(tensors, "");
+        bool success = model_loader.load_tensors(tensors);
+        if (!success) {
+            LOG_ERROR("load tensors from model loader failed");
+            return;
+        }
+        LOG_INFO("t5 model loaded");
        t5->test();
    }
 };
--- a/tae.hpp
+++ b/tae.hpp
@ -196,13 +196,14 @@ struct TinyAutoEncoder : public GGMLRunner {
    bool decode_only = false;
    TinyAutoEncoder(ggml_backend_t backend,
+                    bool offload_params_to_cpu,
                    const String2GGMLType& tensor_types,
                    const std::string prefix,
                    bool decoder_only = true,
                    SDVersion version = VERSION_SD1)
        : decode_only(decoder_only),
          taesd(decoder_only, version),
-          GGMLRunner(backend) {
+          GGMLRunner(backend, offload_params_to_cpu) {
        taesd.init(params_ctx, tensor_types, prefix);
    }
@ -237,7 +238,7 @@ struct TinyAutoEncoder : public GGMLRunner {
            return false;
        }
-        bool success = model_loader.load_tensors(taesd_tensors, backend, ignore_tensors);
+        bool success = model_loader.load_tensors(taesd_tensors, ignore_tensors);
        if (!success) {
            LOG_ERROR("load tae tensors from model loader failed");
--- a/thirdparty/darts.h
+++ b/thirdparty/darts.h
@ -4,6 +4,7 @@
 #include <cstdio>
 #include <exception>
 #include <new>
+#include <iostream>
 #define DARTS_VERSION "0.32"
@ -1140,9 +1141,11 @@ inline void DawgBuilder::insert(const char *key, std::size_t length,
  if (value < 0) {
    DARTS_THROW("failed to insert key: negative value");
  } else if (length == 0) {
+    std::cout << value << std::endl;
    DARTS_THROW("failed to insert key: zero-length key");
  }
  id_type id = 0;
  std::size_t key_pos = 0;
--- a/unet.hpp
+++ b/unet.hpp
@ -538,11 +538,12 @@ struct UNetModelRunner : public GGMLRunner {
    UnetModelBlock unet;
    UNetModelRunner(ggml_backend_t backend,
+                    bool offload_params_to_cpu,
                    const String2GGMLType& tensor_types,
                    const std::string prefix,
                    SDVersion version = VERSION_SD1,
                    bool flash_attn   = false)
-        : GGMLRunner(backend), unet(version, tensor_types, flash_attn) {
+        : GGMLRunner(backend, offload_params_to_cpu), unet(version, tensor_types, flash_attn) {
        unet.init(params_ctx, tensor_types, prefix);
    }
--- a/upscaler.cpp
+++ b/upscaler.cpp
@ -17,7 +17,8 @@ struct UpscalerGGML {
          direct(direct) {
    }
-    bool load_from_file(const std::string& esrgan_path) {
+    bool load_from_file(const std::string& esrgan_path,
+                        bool offload_params_to_cpu) {
 #ifdef SD_USE_CUDA
        LOG_DEBUG("Using CUDA backend");
        backend = ggml_backend_cuda_init(0);
@ -49,7 +50,7 @@ struct UpscalerGGML {
            backend = ggml_backend_cpu_init();
        }
        LOG_INFO("Upscaler weight type: %s", ggml_type_name(model_data_type));
-        esrgan_upscaler = std::make_shared<ESRGAN>(backend, model_loader.tensor_storages_types);
+        esrgan_upscaler = std::make_shared<ESRGAN>(backend, offload_params_to_cpu, model_loader.tensor_storages_types);
        if (direct) {
            esrgan_upscaler->enable_conv2d_direct();
        }
@ -110,8 +111,9 @@ struct upscaler_ctx_t {
 };
 upscaler_ctx_t* new_upscaler_ctx(const char* esrgan_path_c_str,
-                                 int n_threads,
-                                 bool direct = false) {
+                                 bool offload_params_to_cpu,
+                                 bool direct,
+                                 int n_threads) {
    upscaler_ctx_t* upscaler_ctx = (upscaler_ctx_t*)malloc(sizeof(upscaler_ctx_t));
    if (upscaler_ctx == NULL) {
        return NULL;
@ -123,7 +125,7 @@ upscaler_ctx_t* new_upscaler_ctx(const char* esrgan_path_c_str,
        return NULL;
    }
-    if (!upscaler_ctx->upscaler->load_from_file(esrgan_path)) {
+    if (!upscaler_ctx->upscaler->load_from_file(esrgan_path, offload_params_to_cpu)) {
        delete upscaler_ctx->upscaler;
        upscaler_ctx->upscaler = NULL;
        free(upscaler_ctx);
--- a/util.cpp
+++ b/util.cpp
@ -72,6 +72,17 @@ std::string format(const char* fmt, ...) {
    return std::string(buf.data(), size);
 }
+int round_up_to(int value, int base) {
+    if (base <= 0) {
+        return value;
+    }
+    if (value % base == 0) {
+        return value;
+    } else {
+        return ((value / base) + 1) * base;
+    }
+}
 #ifdef _WIN32  // code for windows
 #include <windows.h>
@ -290,7 +301,7 @@ std::string path_join(const std::string& p1, const std::string& p2) {
    return p1 + "/" + p2;
 }
-std::vector<std::string> splitString(const std::string& str, char delimiter) {
+std::vector<std::string> split_string(const std::string& str, char delimiter) {
    std::vector<std::string> result;
    size_t start = 0;
    size_t end   = str.find(delimiter);
--- a/util.h
+++ b/util.h
@ -18,6 +18,8 @@ std::string format(const char* fmt, ...);
 void replace_all_chars(std::string& str, char target, char replacement);
+int round_up_to(int value, int base);
 bool file_exists(const std::string& filename);
 bool is_directory(const std::string& path);
 std::string get_full_path(const std::string& dir, const std::string& filename);
@ -48,7 +50,7 @@ sd_image_f32_t resize_sd_image_f32_t(sd_image_f32_t image, int target_width, int
 sd_image_f32_t clip_preprocess(sd_image_f32_t image, int size);
 std::string path_join(const std::string& p1, const std::string& p2);
-std::vector<std::string> splitString(const std::string& str, char delimiter);
+std::vector<std::string> split_string(const std::string& str, char delimiter);
 void pretty_progress(int step, int steps, float time);
 void log_printf(sd_log_level_t level, const char* file, int line, const char* format, ...);
--- a/vae.hpp
+++ b/vae.hpp
@ -520,17 +520,30 @@ public:
    }
 };
-struct AutoEncoderKL : public GGMLRunner {
+struct VAE : public GGMLRunner {
+    VAE(ggml_backend_t backend, bool offload_params_to_cpu)
+        : GGMLRunner(backend, offload_params_to_cpu) {}
+    virtual void compute(const int n_threads,
+                         struct ggml_tensor* z,
+                         bool decode_graph,
+                         struct ggml_tensor** output,
+                         struct ggml_context* output_ctx)                                                         = 0;
+    virtual void get_param_tensors(std::map<std::string, struct ggml_tensor*>& tensors, const std::string prefix) = 0;
+    virtual void enable_conv2d_direct(){};
+};
+struct AutoEncoderKL : public VAE {
    bool decode_only = true;
    AutoencodingEngine ae;
    AutoEncoderKL(ggml_backend_t backend,
+                  bool offload_params_to_cpu,
                  const String2GGMLType& tensor_types,
                  const std::string prefix,
                  bool decode_only       = false,
                  bool use_video_decoder = false,
                  SDVersion version      = VERSION_SD1)
-        : decode_only(decode_only), ae(decode_only, use_video_decoder, version), GGMLRunner(backend) {
+        : decode_only(decode_only), ae(decode_only, use_video_decoder, version), VAE(backend, offload_params_to_cpu) {
        ae.init(params_ctx, tensor_types, prefix);
    }
--- a/vocab_umt5.hpp
+++ b/vocab_umt5.hpp
--- a/wan.hpp
+++ b/wan.hpp
@ -1 +1 @@
-Subproject commit 7dee1d6a1e7611f238d09be96738388da97c88ed
+Subproject commit 5fdc78fff274094e2a1b155928131983362d8a71