diff --git a/README.md b/README.md index 6f8e62d5..a411ce85 100644 --- a/README.md +++ b/README.md @@ -15,6 +15,7 @@ API and command-line option may change frequently.*** ## ๐Ÿ”ฅImportant News +* **2026/06/04** ๐Ÿš€ stable-diffusion.cpp now supports **Ideogram4** * **2026/05/31** ๐Ÿš€ stable-diffusion.cpp now supports **PiD** * **2026/05/27** ๐Ÿš€ stable-diffusion.cpp now supports **Lens** * **2026/05/17** ๐Ÿš€ stable-diffusion.cpp now supports **LTX-2.3** @@ -50,6 +51,7 @@ API and command-line option may change frequently.*** - [Anima](./docs/anima.md) - [ERNIE-Image](./docs/ernie_image.md) - [HiDream-O1-Image](./docs/hidream_o1_image.md) + - [Ideogram4](./docs/ideogram4.md) - Image Edit Models - [FLUX.1-Kontext-dev](./docs/kontext.md) - [Qwen Image Edit series](./docs/qwen_image_edit.md) diff --git a/assets/ideogram4/example.png b/assets/ideogram4/example.png new file mode 100644 index 00000000..f140c54b Binary files /dev/null and b/assets/ideogram4/example.png differ diff --git a/docs/ideogram4.md b/docs/ideogram4.md new file mode 100644 index 00000000..04864f27 --- /dev/null +++ b/docs/ideogram4.md @@ -0,0 +1,40 @@ +# How to Use + +## Download weights + +- Download Ideogram4 + - safetensors: https://huggingface.co/ideogram-ai/ideogram-4-fp8/tree/main/transformer +- Download Ideogram4 uncond + - safetensors: https://huggingface.co/ideogram-ai/ideogram-4-fp8/tree/main/unconditional_transformer +- Download vae + - safetensors: https://huggingface.co/black-forest-labs/FLUX.2-dev/tree/main +- Download Qwen3-VL-8B-Instruct + - gguf: https://huggingface.co/unsloth/Qwen3-VL-8B-Instruct-GGUF/tree/main + +## Convert weights + +fp8 scale -> bf16 + +``` +python .\convert_fp8_scale_to_bf16.py --input .\ideogram4_fp8.safetensors --output ideogram4_bf16.safetensors +python .\convert_fp8_scale_to_bf16.py --input .\ideogram4_uncond_fp8.safetensors --output ideogram4_uncond_bf16.safetensors +``` + +bf16 -> q8 + +``` +.\bin\Release\sd-cli.exe -M convert -m ideogram4_bf16.safetensors -o ideogram4-Q8_0.gguf --tensor-type-rules "^layers.*adaln_modulation.*weight=q8_0,layers.*attention.o.*weight=q8_0,layers.*attention.qkv.*weight=q8_0,layers.*feed_forward.*weight=q8_0" -v + +.\bin\Release\sd-cli.exe -M convert -m ideogram4_uncond_bf16.safetensors -o ideogram4_uncond-Q8_0.gguf --tensor-type-rules "^layers.*adaln_modulation.*weight=q8_0,layers.*attention.o.*weight=q8_0,layers.*attention.qkv.*weight=q8_0,layers.*feed_forward.*weight=q8_0" -v +``` + +If you want lower VRAM usage, you can change the quantization from q8_0 to a lower-level quantization, such as q4_0. + + +## Examples + +```sh +.\bin\Release\sd-cli.exe --diffusion-model ideogram4-Q8_0.gguf --uncond-diffusion-model ideogram4_uncond-Q8_0.gguf --llm ..\..\llm\Qwen3VL-8B-Instruct-Q4_K_M.gguf --vae ..\..\ComfyUI\models\vae\flux2_ae.safetensors -p '{"high_level_description":"A square 1024 x 1024 luxury fashion magazine cover featuring exactly one short chubby fluffy cat as the main model. The cat sits on a soft ivory studio floor, facing the viewer with a stylish calm expression, wearing tiny black sunglasses, a red silk scarf, and a small gold collar charm. In front of the cat on the floor is a wide horizontal luxury nameplate that clearly reads ideogram4.cpp. The whole design feels premium, fashionable, clean, and editorial.","style_description":{"aesthetics":"luxury fashion magazine cover, high-end pet couture campaign, minimalist editorial design, elegant studio photography, soft paper texture, refined typography, fashionable and polished","lighting":"Soft diffused studio lighting, gentle spotlight on the cat, subtle floor shadow, warm ivory highlights, clean separation between subject and background","photo":"high-resolution fashion editorial photography look, front-facing cat portrait, crisp fur details, glossy sunglasses, clear readable nameplate text, shallow depth of field","medium":"mixed media fashion photography and premium editorial graphic design","color_palette":["#F4EFE7","#111111","#D8B56D","#B73A3A","#FFFFFF","#8A7A6A"]},"compositional_deconstruction":{"canvas":"Square 1024 x 1024 canvas with a normal upright orientation. Do not rotate the poster or any text. Use a clean fashion magazine cover layout.","background":"Warm ivory studio backdrop with subtle paper grain, a soft spotlight gradient, faint floor shadow, and a few minimal gold editorial lines. The background is spacious, premium, and uncluttered.","layout":"Top center has a small elegant headline. Center area features one cat as the main fashion model. Lower foreground has a wide horizontal luxury nameplate placed on the floor in front of the cat. Bottom center has a small footer. All text is horizontal, upright, and readable left to right.","elements":[{"type":"text","desc":"Top center headline reading LOOK WHAT I FOUND in a refined high-fashion serif font. The headline is horizontal, centered, elegant, and secondary to the nameplate text."},{"type":"obj","desc":"Exactly one short chubby fluffy cat sitting in the center like a luxury fashion model. The cat has a large round head, compact body, short legs, soft detailed fur, expressive eyes, and a calm confident pose. The cat is cute and rounded, not tall, not stretched, not duplicated."},{"type":"obj","desc":"Tiny glossy black sunglasses worn naturally by the cat, slightly oversized but still showing the cat face clearly. The sunglasses add a chic fashion-editorial attitude."},{"type":"obj","desc":"A red silk scarf tied neatly around the cat neck, with soft folds and a couture feeling. The scarf must not cover the cat face or the nameplate."},{"type":"obj","desc":"A small gold collar charm or fashion accessory under the scarf, subtle and premium, adding a luxury campaign detail."},{"type":"obj","desc":"In the lower foreground, place a wide horizontal luxury nameplate on the floor in front of the cat. The nameplate is low, flat, landscape-oriented, much wider than tall, like a fashion show seat card or premium display plaque. It is centered, front-facing, level, and fully visible. It must not become vertical, tall, standing, rotated, or side-facing."},{"type":"text","desc":"Print the exact text ideogram4.cpp only on the wide horizontal nameplate. Use clean bold black lettering, perfectly spelled, lowercase, with the number 4 and .cpp extension. The text must fit completely inside the nameplate, stay horizontal, and be readable from left to right."},{"type":"obj","desc":"Add sparse premium editorial accents around the edges: thin gold lines, small code brackets, tiny cursor marks, subtle dots, and minimal geometric details. No extra cats, no stickers, no animal faces, no busy decorations."},{"type":"text","desc":"Bottom center footer reading tiny paws, big compile energy in a small refined monospace or editorial font. The footer is horizontal, centered, understated, and much smaller than the nameplate text."}]}}' --diffusion-fa -v --offload-to-cpu -H 1024 -W 1024 +``` + +ideogram4 image example diff --git a/examples/cli/README.md b/examples/cli/README.md index 4dcfb89b..1b7c2731 100644 --- a/examples/cli/README.md +++ b/examples/cli/README.md @@ -41,6 +41,8 @@ Context Options: --qwen2vl_vision alias of --llm_vision. Deprecated. --diffusion-model path to the standalone diffusion model --high-noise-diffusion-model path to the standalone high noise diffusion model + --uncond-diffusion-model path to the standalone unconditional diffusion model, currently used by + Ideogram4 CFG --vae path to standalone vae model --taesd path to taesd. Using Tiny AutoEncoder for fast decoding (low quality) --tae alias of --taesd diff --git a/examples/common/common.cpp b/examples/common/common.cpp index 6f397aa3..3ae5faba 100644 --- a/examples/common/common.cpp +++ b/examples/common/common.cpp @@ -359,6 +359,10 @@ ArgOptions SDContextParams::get_options() { "--high-noise-diffusion-model", "path to the standalone high noise diffusion model", &high_noise_diffusion_model_path}, + {"", + "--uncond-diffusion-model", + "path to the standalone unconditional diffusion model, currently used by Ideogram4 CFG", + &uncond_diffusion_model_path}, {"", "--embeddings-connectors", "path to LTXAV embeddings connectors", @@ -709,6 +713,7 @@ std::string SDContextParams::to_string() const { << " llm_vision_path: \"" << llm_vision_path << "\",\n" << " diffusion_model_path: \"" << diffusion_model_path << "\",\n" << " high_noise_diffusion_model_path: \"" << high_noise_diffusion_model_path << "\",\n" + << " uncond_diffusion_model_path: \"" << uncond_diffusion_model_path << "\",\n" << " embeddings_connectors_path: \"" << embeddings_connectors_path << "\",\n" << " vae_path: \"" << vae_path << "\",\n" << " vae_format: \"" << vae_format << "\",\n" @@ -772,6 +777,7 @@ sd_ctx_params_t SDContextParams::to_sd_ctx_params_t(bool vae_decode_only, bool f llm_vision_path.c_str(), diffusion_model_path.c_str(), high_noise_diffusion_model_path.c_str(), + uncond_diffusion_model_path.c_str(), embeddings_connectors_path.c_str(), vae_path.c_str(), audio_vae_path.c_str(), @@ -2522,6 +2528,7 @@ std::string build_sdcpp_image_metadata_json(const SDContextParams& ctx_params, set_json_basename_if_not_empty(models, "llm_vision", ctx_params.llm_vision_path); set_json_basename_if_not_empty(models, "diffusion_model", ctx_params.diffusion_model_path); set_json_basename_if_not_empty(models, "high_noise_diffusion_model", ctx_params.high_noise_diffusion_model_path); + set_json_basename_if_not_empty(models, "uncond_diffusion_model", ctx_params.uncond_diffusion_model_path); set_json_basename_if_not_empty(models, "vae", ctx_params.vae_path); set_json_basename_if_not_empty(models, "taesd", ctx_params.taesd_path); set_json_basename_if_not_empty(models, "control_net", ctx_params.control_net_path); @@ -2689,6 +2696,9 @@ std::string get_image_params(const SDContextParams& ctx_params, if (!ctx_params.diffusion_model_path.empty()) { parameter_string += "Unet: " + sd_basename(ctx_params.diffusion_model_path) + ", "; } + if (!ctx_params.uncond_diffusion_model_path.empty()) { + parameter_string += "Uncond Unet: " + sd_basename(ctx_params.uncond_diffusion_model_path) + ", "; + } if (!ctx_params.vae_path.empty()) { parameter_string += "VAE: " + sd_basename(ctx_params.vae_path) + ", "; } diff --git a/examples/common/common.h b/examples/common/common.h index 3a6653bb..a90a3313 100644 --- a/examples/common/common.h +++ b/examples/common/common.h @@ -123,6 +123,7 @@ struct SDContextParams { std::string llm_vision_path; std::string diffusion_model_path; std::string high_noise_diffusion_model_path; + std::string uncond_diffusion_model_path; std::string embeddings_connectors_path; std::string vae_path; std::string vae_format = "auto"; diff --git a/examples/server/README.md b/examples/server/README.md index d971f5fe..16fb393c 100644 --- a/examples/server/README.md +++ b/examples/server/README.md @@ -143,6 +143,8 @@ Context Options: --qwen2vl_vision alias of --llm_vision. Deprecated. --diffusion-model path to the standalone diffusion model --high-noise-diffusion-model path to the standalone high noise diffusion model + --uncond-diffusion-model path to the standalone unconditional diffusion model, currently used by + Ideogram4 CFG --vae path to standalone vae model --taesd path to taesd. Using Tiny AutoEncoder for fast decoding (low quality) --tae alias of --taesd diff --git a/include/stable-diffusion.h b/include/stable-diffusion.h index 8654c01d..17596f84 100644 --- a/include/stable-diffusion.h +++ b/include/stable-diffusion.h @@ -186,6 +186,7 @@ typedef struct { const char* llm_vision_path; const char* diffusion_model_path; const char* high_noise_diffusion_model_path; + const char* uncond_diffusion_model_path; const char* embeddings_connectors_path; const char* vae_path; const char* audio_vae_path; diff --git a/script/convert_fp8_scale_to_bf16.py b/script/convert_fp8_scale_to_bf16.py new file mode 100644 index 00000000..a3eb2acc --- /dev/null +++ b/script/convert_fp8_scale_to_bf16.py @@ -0,0 +1,283 @@ +#!/usr/bin/env python +import argparse +import json +import math +import os +import struct +from collections import Counter +from pathlib import Path + +import torch +from safetensors import safe_open + + +FLOAT_DTYPES = { + "BF16", + "F16", + "F32", + "F64", + "F8_E4M3", + "F8_E4M3FN", + "F8_E5M2", +} + +FP8_DTYPES = { + "F8_E4M3", + "F8_E4M3FN", + "F8_E5M2", +} + +DTYPE_SIZES = { + "BOOL": 1, + "U8": 1, + "I8": 1, + "F8_E4M3": 1, + "F8_E4M3FN": 1, + "F8_E5M2": 1, + "U16": 2, + "I16": 2, + "F16": 2, + "BF16": 2, + "U32": 4, + "I32": 4, + "F32": 4, + "U64": 8, + "I64": 8, + "F64": 8, +} + + +def read_safetensors_header(path: Path): + with path.open("rb") as f: + header_len = struct.unpack(" 0 and scale.ndim == 1: + if first_dim_end is not None and scale.shape[0] >= first_dim_end: + scale = scale[first_dim_start:first_dim_end] + if scale.shape[0] == chunk.shape[0]: + return scale.reshape((scale.shape[0],) + (1,) * (chunk.ndim - 1)) + + return scale + + +def write_scaled_fp8_weight(out, weight, scale, chunk_rows): + if weight.ndim == 0: + result = weight.to(torch.float32) * scale_view_for_chunk(scale, weight) + write_tensor_bytes(out, result.to(torch.bfloat16)) + return + + rows = weight.shape[0] + for start in range(0, rows, chunk_rows): + end = min(start + chunk_rows, rows) + chunk = weight[start:end].to(torch.float32) + scale_view = scale_view_for_chunk(scale, chunk, start, end) + result = chunk * scale_view + write_tensor_bytes(out, result.to(torch.bfloat16)) + + +def write_float_as_bf16(out, tensor, chunk_rows): + if tensor.dtype == torch.bfloat16: + write_tensor_bytes(out, tensor) + return + + if tensor.ndim == 0: + write_tensor_bytes(out, tensor.to(torch.bfloat16)) + return + + rows = tensor.shape[0] + for start in range(0, rows, chunk_rows): + end = min(start + chunk_rows, rows) + write_tensor_bytes(out, tensor[start:end].to(torch.bfloat16)) + + +def convert(input_path: Path, output_path: Path, chunk_rows: int, dry_run: bool): + header = read_safetensors_header(input_path) + plan, output_header, data_size = build_output_plan(header) + + source_counts = Counter(item["source_dtype"] for item in plan) + output_counts = Counter(item["output_dtype"] for item in plan) + scaled_count = sum(item["mode"] == "fp8_scaled_weight" for item in plan) + dropped_scales = sum(item["mode"] == "fp8_scaled_weight" for item in plan) + header_bytes = json.dumps(output_header, separators=(",", ":")).encode("utf-8") + expected_size = 8 + len(header_bytes) + data_size + + print(f"input: {input_path}") + print(f"output: {output_path}") + print(f"tensors written: {len(plan)}") + print(f"scaled fp8 weights dequantized: {scaled_count}") + print(f"weight_scale tensors dropped: {dropped_scales}") + print(f"source dtypes: {dict(sorted(source_counts.items()))}") + print(f"output dtypes: {dict(sorted(output_counts.items()))}") + print(f"expected output size: {expected_size / (1024 ** 3):.2f} GiB") + + if dry_run: + return + + if output_path.exists(): + raise FileExistsError(f"{output_path} already exists; pass --overwrite to replace it") + + tmp_path = output_path.with_suffix(output_path.suffix + ".tmp") + if tmp_path.exists(): + raise FileExistsError(f"{tmp_path} already exists; remove it or choose another output") + + with safe_open(str(input_path), framework="pt", device="cpu") as sf, tmp_path.open("wb") as out: + out.write(struct.pack(" {item['output_dtype']}") + + tensor = sf.get_tensor(name) + if item["mode"] == "fp8_scaled_weight": + scale = sf.get_tensor(item["scale_key"]) + write_scaled_fp8_weight(out, tensor, scale, chunk_rows) + elif item["mode"] == "float_to_bf16": + write_float_as_bf16(out, tensor, chunk_rows) + else: + write_tensor_bytes(out, tensor) + + actual_size = out.tell() + + if actual_size != expected_size: + tmp_path.unlink(missing_ok=True) + raise RuntimeError(f"wrote {actual_size} bytes, expected {expected_size} bytes") + + tmp_path.replace(output_path) + print("done") + + +def main(): + parser = argparse.ArgumentParser( + description="Convert an fp8 safetensors checkpoint with weight_scale tensors to bf16." + ) + parser.add_argument("--input", default="ideogram4_fp8.safetensors", type=Path) + parser.add_argument("--output", default="ideogram4_bf16.safetensors", type=Path) + parser.add_argument("--chunk-rows", default=1024, type=int) + parser.add_argument("--dry-run", action="store_true") + parser.add_argument("--overwrite", action="store_true") + args = parser.parse_args() + + input_path = args.input.resolve() + output_path = args.output.resolve() + + if args.chunk_rows < 1: + raise ValueError("--chunk-rows must be >= 1") + if not input_path.exists(): + raise FileNotFoundError(input_path) + if args.overwrite and output_path.exists(): + output_path.unlink() + + convert(input_path, output_path, args.chunk_rows, args.dry_run) + + +if __name__ == "__main__": + main() diff --git a/src/conditioner.hpp b/src/conditioner.hpp index 157d3906..f0f8e3e4 100644 --- a/src/conditioner.hpp +++ b/src/conditioner.hpp @@ -1759,6 +1759,8 @@ struct LLMEmbedder : public Conditioner { arch = LLM::LLMArch::GPT_OSS_20B; } else if (sd_version_is_pid(version)) { arch = LLM::LLMArch::GEMMA2_2B; + } else if (sd_version_is_ideogram4(version)) { + arch = LLM::LLMArch::QWEN3_VL; } else if (sd_version_is_z_image(version) || version == VERSION_OVIS_IMAGE || version == VERSION_FLUX2_KLEIN) { arch = LLM::LLMArch::QWEN3; } @@ -2101,6 +2103,14 @@ struct LLMEmbedder : public Conditioner { prompt_attn_range.second = static_cast(prompt.size()); prompt += "[/INST]"; + } else if (sd_version_is_ideogram4(version)) { + prompt_template_encode_start_idx = 0; + out_layers = {1, 4, 7, 10, 13, 16, 19, 22, 25, 28, 31, 34, 36}; + + prompt = "<|im_start|>user\n"; + prompt += conditioner_params.text; + prompt += "<|im_end|>\n<|im_start|>assistant\n"; + prompt_attn_range = {0, 0}; } else if (sd_version_is_ernie_image(version)) { prompt_template_encode_start_idx = 0; out_layers = {25}; // -2 diff --git a/src/ggml_extend.hpp b/src/ggml_extend.hpp index f6ac1d65..1f32c9bc 100644 --- a/src/ggml_extend.hpp +++ b/src/ggml_extend.hpp @@ -3347,11 +3347,14 @@ protected: bool bias; bool force_f32; bool force_prec_f32; + bool allow_weight_scale; + bool has_weight_scale = false; float scale; std::string prefix; void init_params(ggml_context* ctx, const String2TensorStorage& tensor_storage_map = {}, const std::string prefix = "") override { this->prefix = prefix; + has_weight_scale = false; enum ggml_type wtype = get_type(prefix + "weight", tensor_storage_map, GGML_TYPE_F32); if (in_features % ggml_blck_size(wtype) != 0 || force_f32) { wtype = GGML_TYPE_F32; @@ -3361,20 +3364,26 @@ protected: enum ggml_type wtype = GGML_TYPE_F32; params["bias"] = ggml_new_tensor_1d(ctx, wtype, out_features); } + if (allow_weight_scale && tensor_storage_map.find(prefix + "weight_scale") != tensor_storage_map.end()) { + params["weight_scale"] = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, out_features); + has_weight_scale = true; + } } public: Linear(int64_t in_features, int64_t out_features, - bool bias = true, - bool force_f32 = false, - bool force_prec_f32 = false, - float scale = 1.f) + bool bias = true, + bool force_f32 = false, + bool force_prec_f32 = false, + float scale = 1.f, + bool allow_weight_scale = false) : in_features(in_features), out_features(out_features), bias(bias), force_f32(force_f32), force_prec_f32(force_prec_f32), + allow_weight_scale(allow_weight_scale), scale(scale) {} void set_scale(float scale_) { @@ -3391,14 +3400,24 @@ public: if (bias) { b = params["bias"]; } + ggml_tensor* linear_bias = has_weight_scale ? nullptr : b; + ggml_tensor* out = nullptr; if (ctx->weight_adapter) { WeightAdapter::ForwardParams forward_params; forward_params.op_type = WeightAdapter::ForwardParams::op_type_t::OP_LINEAR; forward_params.linear.force_prec_f32 = force_prec_f32; forward_params.linear.scale = scale; - return ctx->weight_adapter->forward_with_lora(ctx->ggml_ctx, ctx->backend, x, w, b, prefix, forward_params); + out = ctx->weight_adapter->forward_with_lora(ctx->ggml_ctx, ctx->backend, x, w, linear_bias, prefix, forward_params); + } else { + out = ggml_ext_linear(ctx->ggml_ctx, x, w, linear_bias, force_prec_f32, scale); } - return ggml_ext_linear(ctx->ggml_ctx, x, w, b, force_prec_f32, scale); + if (has_weight_scale) { + out = ggml_mul(ctx->ggml_ctx, out, params["weight_scale"]); + if (b != nullptr) { + out = ggml_add_inplace(ctx->ggml_ctx, out, b); + } + } + return out; } }; diff --git a/src/ideogram4.hpp b/src/ideogram4.hpp new file mode 100644 index 00000000..58cd7638 --- /dev/null +++ b/src/ideogram4.hpp @@ -0,0 +1,527 @@ +#ifndef __IDEOGRAM4_HPP__ +#define __IDEOGRAM4_HPP__ + +#include +#include +#include +#include +#include +#include + +#include "diffusion_model.hpp" +#include "ggml_extend.hpp" +#include "ggml_graph_cut.h" +#include "rope.hpp" + +namespace Ideogram4 { + constexpr int IDEOGRAM4_GRAPH_SIZE = 65536; + constexpr int OUTPUT_IMAGE_INDICATOR = 2; + constexpr int IMAGE_POSITION_OFFSET = 65536; + constexpr int DEFAULT_MROPE_SECTION_T = 24; + constexpr int DEFAULT_MROPE_SECTION_H = 20; + constexpr int DEFAULT_MROPE_SECTION_W = 20; + constexpr int TIMESTEP_MAX_PERIOD = 10000; + constexpr int LLM_HIDDEN_STATE_LAYERS = 13; + + struct Ideogram4Config { + int64_t emb_dim = 4608; + int64_t num_layers = 34; + int64_t num_heads = 18; + int64_t intermediate_size = 12288; + int64_t adanln_dim = 512; + int64_t in_channels = 128; + int64_t llm_features_dim = 53248; + int64_t rope_theta = 5000000; + float norm_eps = 1e-5f; + int patch_size = 2; + int ae_channels = 32; + std::vector mrope_section = {DEFAULT_MROPE_SECTION_T, + DEFAULT_MROPE_SECTION_H, + DEFAULT_MROPE_SECTION_W}; + }; + + __STATIC_INLINE__ ggml_tensor* timestep_embedding_sin_cos(ggml_context* ctx, + ggml_tensor* timesteps, + int dim) { + GGML_ASSERT(dim % 2 == 0); + auto embedding = ggml_ext_timestep_embedding(ctx, timesteps, dim, TIMESTEP_MAX_PERIOD, 10.f); + auto chunks = ggml_ext_chunk(ctx, embedding, 2, 0); + return ggml_concat(ctx, chunks[1], chunks[0], 0); + } + + __STATIC_INLINE__ ggml_tensor* to_token_modulation(ggml_context* ctx, ggml_tensor* x) { + // [N, C] -> [N, 1, C] in PyTorch layout. + if (ggml_n_dims(x) < 3 || x->ne[1] != 1) { + x = ggml_reshape_3d(ctx, x, x->ne[0], 1, x->ne[1]); + } + return x; + } + + __STATIC_INLINE__ ggml_tensor* interleave_hidden_state_layers(ggml_context* ctx, ggml_tensor* x) { + // Match upstream stack(...).permute(1, 2, 3, 0).reshape(...): + // [layers * hidden, tokens, batch] -> [hidden * layers, tokens, batch]. + GGML_ASSERT(x->ne[0] % LLM_HIDDEN_STATE_LAYERS == 0); + const int64_t hidden_size = x->ne[0] / LLM_HIDDEN_STATE_LAYERS; + const int64_t token_count = x->ne[1]; + const int64_t batch_count = x->ne[2]; + + x = ggml_reshape_4d(ctx, x, hidden_size, LLM_HIDDEN_STATE_LAYERS, token_count, batch_count); + x = ggml_cont(ctx, ggml_permute(ctx, x, 1, 0, 2, 3)); + return ggml_reshape_3d(ctx, x, hidden_size * LLM_HIDDEN_STATE_LAYERS, token_count, batch_count); + } + + __STATIC_INLINE__ ggml_tensor* modulate(ggml_context* ctx, ggml_tensor* x, ggml_tensor* scale) { + scale = to_token_modulation(ctx, scale); + return ggml_add(ctx, x, ggml_mul(ctx, x, scale)); + } + + __STATIC_INLINE__ ggml_tensor* patchify(ggml_context* ctx, ggml_tensor* x, const Ideogram4Config& config) { + // x: [N, 128, H, W] with channel order [ae, ph, pw]. + // return: [N, H*W, 128] with token channel order [ph, pw, ae]. + const int64_t W = x->ne[0]; + const int64_t H = x->ne[1]; + const int64_t C = x->ne[2]; + const int64_t N = x->ne[3]; + + GGML_ASSERT(N == 1); + GGML_ASSERT(C == config.ae_channels * config.patch_size * config.patch_size); + + x = ggml_cont(ctx, x); + x = ggml_reshape_4d(ctx, x, W * H, config.patch_size, config.patch_size, config.ae_channels); + x = ggml_cont(ctx, ggml_permute(ctx, x, 3, 1, 2, 0)); + x = ggml_reshape_3d(ctx, x, C, W * H, N); + return x; + } + + __STATIC_INLINE__ ggml_tensor* unpatchify(ggml_context* ctx, + ggml_tensor* x, + int64_t H, + int64_t W, + const Ideogram4Config& config) { + const int64_t C = x->ne[0]; + const int64_t N = x->ne[2]; + + GGML_ASSERT(N == 1); + GGML_ASSERT(C == config.ae_channels * config.patch_size * config.patch_size); + GGML_ASSERT(x->ne[1] == H * W); + + x = ggml_reshape_4d(ctx, x, config.ae_channels, config.patch_size, config.patch_size, H * W); + x = ggml_cont(ctx, ggml_permute(ctx, x, 3, 1, 2, 0)); + x = ggml_reshape_4d(ctx, x, W, H, C, N); + return x; + } + + __STATIC_INLINE__ std::shared_ptr make_linear(int64_t in_features, + int64_t out_features, + bool bias = true) { + return std::make_shared(in_features, out_features, bias, false, false, 1.f, true); + } + + __STATIC_INLINE__ std::vector gen_ideogram4_pe(int grid_h, + int grid_w, + int bs, + int context_len, + int head_dim, + int rope_theta, + const std::vector& mrope_section) { + GGML_ASSERT(bs == 1); + std::vector> ids(static_cast(bs) * (context_len + grid_h * grid_w), + std::vector(3, 0.f)); + + for (int i = 0; i < context_len; ++i) { + ids[i] = {static_cast(i), static_cast(i), static_cast(i)}; + } + + int cursor = context_len; + for (int y = 0; y < grid_h; ++y) { + for (int x = 0; x < grid_w; ++x) { + ids[cursor++] = {static_cast(IMAGE_POSITION_OFFSET), + static_cast(IMAGE_POSITION_OFFSET + y), + static_cast(IMAGE_POSITION_OFFSET + x)}; + } + } + + return Rope::embed_interleaved_mrope(ids, bs, static_cast(rope_theta), head_dim, mrope_section); + } + + class Ideogram4Attention : public GGMLBlock { + protected: + int64_t hidden_size; + int64_t num_heads; + int64_t head_dim; + + public: + Ideogram4Attention(int64_t hidden_size, int64_t num_heads, float eps) + : hidden_size(hidden_size), num_heads(num_heads), head_dim(hidden_size / num_heads) { + GGML_ASSERT(hidden_size % num_heads == 0); + blocks["qkv"] = make_linear(hidden_size, hidden_size * 3, false); + blocks["norm_q"] = std::make_shared(head_dim, eps); + blocks["norm_k"] = std::make_shared(head_dim, eps); + blocks["o"] = make_linear(hidden_size, hidden_size, false); + } + + ggml_tensor* forward(GGMLRunnerContext* ctx, + ggml_tensor* x, + ggml_tensor* pe, + ggml_tensor* mask = nullptr) { + int64_t n_token = x->ne[1]; + int64_t N = x->ne[2]; + + auto qkv_proj = std::dynamic_pointer_cast(blocks["qkv"]); + auto norm_q = std::dynamic_pointer_cast(blocks["norm_q"]); + auto norm_k = std::dynamic_pointer_cast(blocks["norm_k"]); + auto out_proj = std::dynamic_pointer_cast(blocks["o"]); + + auto qkv = qkv_proj->forward(ctx, x); + auto qkv_vec = split_qkv(ctx->ggml_ctx, qkv); + auto q = ggml_reshape_4d(ctx->ggml_ctx, qkv_vec[0], head_dim, num_heads, n_token, N); + auto k = ggml_reshape_4d(ctx->ggml_ctx, qkv_vec[1], head_dim, num_heads, n_token, N); + auto v = ggml_reshape_4d(ctx->ggml_ctx, qkv_vec[2], head_dim, num_heads, n_token, N); + + q = norm_q->forward(ctx, q); + k = norm_k->forward(ctx, k); + + x = Rope::attention(ctx, q, k, v, pe, mask, 1.f / 128.f, false); + x = out_proj->forward(ctx, x); + return x; + } + }; + + class Ideogram4MLP : public GGMLBlock { + public: + Ideogram4MLP(int64_t dim, int64_t hidden_dim) { + blocks["w1"] = make_linear(dim, hidden_dim, false); + blocks["w2"] = make_linear(hidden_dim, dim, false); + blocks["w3"] = make_linear(dim, hidden_dim, false); + } + + ggml_tensor* forward(GGMLRunnerContext* ctx, ggml_tensor* x) { + auto w1 = std::dynamic_pointer_cast(blocks["w1"]); + auto w2 = std::dynamic_pointer_cast(blocks["w2"]); + auto w3 = std::dynamic_pointer_cast(blocks["w3"]); + + auto x1 = ggml_silu(ctx->ggml_ctx, w1->forward(ctx, x)); + auto x3 = w3->forward(ctx, x); + x = ggml_mul(ctx->ggml_ctx, x1, x3); + x = w2->forward(ctx, x); + return x; + } + }; + + class Ideogram4TransformerBlock : public GGMLBlock { + public: + Ideogram4TransformerBlock(const Ideogram4Config& config) { + blocks["attention"] = std::make_shared(config.emb_dim, config.num_heads, config.norm_eps); + blocks["feed_forward"] = std::make_shared(config.emb_dim, config.intermediate_size); + blocks["attention_norm1"] = std::make_shared(config.emb_dim, config.norm_eps); + blocks["ffn_norm1"] = std::make_shared(config.emb_dim, config.norm_eps); + blocks["attention_norm2"] = std::make_shared(config.emb_dim, config.norm_eps); + blocks["ffn_norm2"] = std::make_shared(config.emb_dim, config.norm_eps); + blocks["adaln_modulation"] = make_linear(config.adanln_dim, 4 * config.emb_dim, true); + } + + ggml_tensor* forward(GGMLRunnerContext* ctx, + ggml_tensor* x, + ggml_tensor* pe, + ggml_tensor* adaln_input, + ggml_tensor* mask = nullptr) { + auto attention = std::dynamic_pointer_cast(blocks["attention"]); + auto feed_forward = std::dynamic_pointer_cast(blocks["feed_forward"]); + auto attention_norm1 = std::dynamic_pointer_cast(blocks["attention_norm1"]); + auto ffn_norm1 = std::dynamic_pointer_cast(blocks["ffn_norm1"]); + auto attention_norm2 = std::dynamic_pointer_cast(blocks["attention_norm2"]); + auto ffn_norm2 = std::dynamic_pointer_cast(blocks["ffn_norm2"]); + auto adaln_modulation = std::dynamic_pointer_cast(blocks["adaln_modulation"]); + + auto mod = adaln_modulation->forward(ctx, adaln_input); + auto mods = ggml_ext_chunk(ctx->ggml_ctx, mod, 4, 0); + auto scale_msa = mods[0]; + auto gate_msa = to_token_modulation(ctx->ggml_ctx, ggml_tanh(ctx->ggml_ctx, mods[1])); + auto scale_mlp = mods[2]; + auto gate_mlp = to_token_modulation(ctx->ggml_ctx, ggml_tanh(ctx->ggml_ctx, mods[3])); + + auto attn_out = attention_norm1->forward(ctx, x); + attn_out = modulate(ctx->ggml_ctx, attn_out, scale_msa); + attn_out = attention->forward(ctx, attn_out, pe, mask); + attn_out = attention_norm2->forward(ctx, attn_out); + x = ggml_add(ctx->ggml_ctx, x, ggml_mul(ctx->ggml_ctx, attn_out, gate_msa)); + + auto ffn_out = ffn_norm1->forward(ctx, x); + ffn_out = modulate(ctx->ggml_ctx, ffn_out, scale_mlp); + ffn_out = feed_forward->forward(ctx, ffn_out); + ffn_out = ffn_norm2->forward(ctx, ffn_out); + x = ggml_add(ctx->ggml_ctx, x, ggml_mul(ctx->ggml_ctx, ffn_out, gate_mlp)); + + return x; + } + }; + + class Ideogram4EmbedScalar : public GGMLBlock { + protected: + int64_t dim; + + public: + Ideogram4EmbedScalar(int64_t dim) + : dim(dim) { + blocks["mlp_in"] = make_linear(dim, dim, true); + blocks["mlp_out"] = make_linear(dim, dim, true); + } + + ggml_tensor* forward(GGMLRunnerContext* ctx, ggml_tensor* x) { + auto mlp_in = std::dynamic_pointer_cast(blocks["mlp_in"]); + auto mlp_out = std::dynamic_pointer_cast(blocks["mlp_out"]); + + x = timestep_embedding_sin_cos(ctx->ggml_ctx, x, static_cast(dim)); + x = ggml_silu(ctx->ggml_ctx, mlp_in->forward(ctx, x)); + x = mlp_out->forward(ctx, x); + return x; + } + }; + + class Ideogram4FinalLayer : public GGMLBlock { + public: + Ideogram4FinalLayer(const Ideogram4Config& config) { + blocks["norm_final"] = std::make_shared(config.emb_dim, 1e-6f, false); + blocks["linear"] = make_linear(config.emb_dim, config.in_channels, true); + blocks["adaln_modulation"] = make_linear(config.adanln_dim, config.emb_dim, true); + } + + ggml_tensor* forward(GGMLRunnerContext* ctx, ggml_tensor* x, ggml_tensor* c) { + auto norm_final = std::dynamic_pointer_cast(blocks["norm_final"]); + auto linear = std::dynamic_pointer_cast(blocks["linear"]); + auto adaln_modulation = std::dynamic_pointer_cast(blocks["adaln_modulation"]); + + auto scale = adaln_modulation->forward(ctx, ggml_silu(ctx->ggml_ctx, c)); + x = norm_final->forward(ctx, x); + x = modulate(ctx->ggml_ctx, x, scale); + x = linear->forward(ctx, x); + return x; + } + }; + + class Ideogram4Transformer : public GGMLBlock { + protected: + Ideogram4Config config; + + public: + Ideogram4Transformer() = default; + explicit Ideogram4Transformer(Ideogram4Config config) + : config(std::move(config)) { + blocks["input_proj"] = make_linear(this->config.in_channels, this->config.emb_dim, true); + blocks["llm_cond_norm"] = std::make_shared(this->config.llm_features_dim, 1e-6f); + blocks["llm_cond_proj"] = make_linear(this->config.llm_features_dim, this->config.emb_dim, true); + blocks["t_embedding"] = std::make_shared(this->config.emb_dim); + blocks["adaln_proj"] = make_linear(this->config.emb_dim, this->config.adanln_dim, true); + blocks["embed_image_indicator"] = std::make_shared(2, this->config.emb_dim); + + for (int i = 0; i < this->config.num_layers; ++i) { + blocks["layers." + std::to_string(i)] = std::make_shared(this->config); + } + blocks["final_layer"] = std::make_shared(this->config); + } + + ggml_tensor* forward(GGMLRunnerContext* ctx, + ggml_tensor* x, + ggml_tensor* timestep, + ggml_tensor* context, + ggml_tensor* pe, + ggml_tensor* image_indicator_ids) { + int64_t W = x->ne[0]; + int64_t H = x->ne[1]; + int64_t N = x->ne[3]; + GGML_ASSERT(N == 1); + + auto input_proj = std::dynamic_pointer_cast(blocks["input_proj"]); + auto llm_cond_norm = std::dynamic_pointer_cast(blocks["llm_cond_norm"]); + auto llm_cond_proj = std::dynamic_pointer_cast(blocks["llm_cond_proj"]); + auto t_embedding = std::dynamic_pointer_cast(blocks["t_embedding"]); + auto adaln_proj = std::dynamic_pointer_cast(blocks["adaln_proj"]); + auto embed_image_indicator = std::dynamic_pointer_cast(blocks["embed_image_indicator"]); + auto final_layer = std::dynamic_pointer_cast(blocks["final_layer"]); + + auto img = patchify(ctx->ggml_ctx, x, config); + img = input_proj->forward(ctx, img); + + ggml_tensor* h = img; + int64_t context_len = 0; + if (context != nullptr) { + if (ggml_n_dims(context) < 3) { + context = ggml_reshape_3d(ctx->ggml_ctx, context, context->ne[0], context->ne[1], 1); + } + context = interleave_hidden_state_layers(ctx->ggml_ctx, context); + context_len = context->ne[1]; + auto txt = llm_cond_norm->forward(ctx, context); + txt = llm_cond_proj->forward(ctx, txt); + h = ggml_concat(ctx->ggml_ctx, txt, img, 1); + } + + auto indicator_embedding = embed_image_indicator->forward(ctx, image_indicator_ids); + h = ggml_add(ctx->ggml_ctx, h, indicator_embedding); + + auto t_cond = t_embedding->forward(ctx, timestep); + auto adaln_input = ggml_silu(ctx->ggml_ctx, adaln_proj->forward(ctx, t_cond)); + + for (int i = 0; i < config.num_layers; ++i) { + auto block = std::dynamic_pointer_cast(blocks["layers." + std::to_string(i)]); + h = block->forward(ctx, h, pe, adaln_input, nullptr); + sd::ggml_graph_cut::mark_graph_cut(h, "ideogram4.layers." + std::to_string(i), "hidden"); + } + + h = final_layer->forward(ctx, h, adaln_input); + if (context_len > 0) { + h = ggml_ext_slice(ctx->ggml_ctx, h, 1, context_len, h->ne[1]); + } + + h = unpatchify(ctx->ggml_ctx, h, H, W, config); + h = ggml_ext_scale(ctx->ggml_ctx, h, -1.f); + return h; + } + }; + + class Ideogram4Runner : public DiffusionModelRunner { + protected: + static int64_t detect_num_layers(const String2TensorStorage& tensor_storage_map, + const std::string& prefix) { + int64_t detected_layers = 0; + std::string layer_prefix = prefix.empty() ? "layers." : prefix + ".layers."; + for (const auto& pair : tensor_storage_map) { + const std::string& name = pair.first; + if (name.find(layer_prefix) != 0) { + continue; + } + std::string tail = name.substr(layer_prefix.size()); + size_t dot = tail.find('.'); + if (dot == std::string::npos) { + continue; + } + int layer_idx = std::atoi(tail.substr(0, dot).c_str()); + detected_layers = std::max(detected_layers, layer_idx + 1); + } + return detected_layers; + } + + bool should_use_uncond_model(const DiffusionParams& diffusion_params) const { + return has_uncond_model && + diffusion_params.context == nullptr && + diffusion_params.y != nullptr && + !diffusion_params.y->empty(); + } + + public: + Ideogram4Config config; + Ideogram4Transformer model; + Ideogram4Transformer uncond_model; + bool has_uncond_model = false; + std::string uncond_prefix; + std::vector pe_vec; + std::vector image_indicator_vec; + + Ideogram4Runner(ggml_backend_t backend, + ggml_backend_t params_backend, + const String2TensorStorage& tensor_storage_map = {}, + const std::string prefix = "") + : DiffusionModelRunner(backend, params_backend, prefix), + uncond_prefix(prefix + ".uncond") { + int64_t detected_layers = detect_num_layers(tensor_storage_map, prefix); + if (detected_layers > 0) { + config.num_layers = detected_layers; + } + + model = Ideogram4Transformer(config); + model.init(params_ctx, tensor_storage_map, prefix); + for (const auto& pair : tensor_storage_map) { + const std::string& name = pair.first; + if (starts_with(name, uncond_prefix)) { + has_uncond_model = true; + break; + } + } + if (has_uncond_model) { + LOG_DEBUG("using uncond model"); + uncond_model = Ideogram4Transformer(config); + uncond_model.init(params_ctx, tensor_storage_map, uncond_prefix); + } + } + + std::string get_desc() override { + return "ideogram4"; + } + + void get_param_tensors(std::map& tensors, const std::string& prefix) override { + model.get_param_tensors(tensors, prefix); + if (has_uncond_model) { + uncond_model.get_param_tensors(tensors, this->uncond_prefix); + } + } + + ggml_cgraph* build_graph(const sd::Tensor& x_tensor, + const sd::Tensor& timesteps_tensor, + const sd::Tensor& context_tensor, + bool use_uncond_model = false) { + ggml_cgraph* gf = new_graph_custom(IDEOGRAM4_GRAPH_SIZE); + ggml_tensor* x = make_input(x_tensor); + ggml_tensor* timesteps = make_input(timesteps_tensor); + GGML_ASSERT(x->ne[3] == 1); + Ideogram4Transformer& active_model = use_uncond_model ? uncond_model : model; + + ggml_tensor* context = nullptr; + int64_t context_len = 0; + if (!context_tensor.empty()) { + context = make_input(context_tensor); + context_len = context->ne[1]; + } + + int64_t grid_w = x->ne[0]; + int64_t grid_h = x->ne[1]; + int64_t pos_len = context_len + grid_h * grid_w; + int64_t head_dim = config.emb_dim / config.num_heads; + + pe_vec = gen_ideogram4_pe(static_cast(grid_h), + static_cast(grid_w), + static_cast(x->ne[3]), + static_cast(context_len), + static_cast(head_dim), + static_cast(config.rope_theta), + config.mrope_section); + auto pe = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, 2, 2, head_dim / 2, pos_len); + set_backend_tensor_data(pe, pe_vec.data()); + + image_indicator_vec.assign(static_cast(pos_len), 1); + for (int64_t i = 0; i < context_len; ++i) { + image_indicator_vec[static_cast(i)] = 0; + } + auto indicator = ggml_new_tensor_2d(compute_ctx, GGML_TYPE_I32, pos_len, x->ne[3]); + set_backend_tensor_data(indicator, image_indicator_vec.data()); + + auto runner_ctx = get_context(); + ggml_tensor* out = active_model.forward(&runner_ctx, x, timesteps, context, pe, indicator); + ggml_build_forward_expand(gf, out); + return gf; + } + + sd::Tensor compute(int n_threads, + const sd::Tensor& x, + const sd::Tensor& timesteps, + const sd::Tensor& context, + bool use_uncond_model = false) { + auto get_graph = [&]() -> ggml_cgraph* { + return build_graph(x, timesteps, context, use_uncond_model); + }; + return restore_trailing_singleton_dims(GGMLRunner::compute(get_graph, n_threads, false), x.dim()); + } + + sd::Tensor compute(int n_threads, + const DiffusionParams& diffusion_params) override { + GGML_ASSERT(diffusion_params.x != nullptr); + GGML_ASSERT(diffusion_params.timesteps != nullptr); + bool use_uncond_model = should_use_uncond_model(diffusion_params); + return compute(n_threads, + *diffusion_params.x, + *diffusion_params.timesteps, + tensor_or_empty(diffusion_params.context), + use_uncond_model); + } + }; +} // namespace Ideogram4 + +#endif // __IDEOGRAM4_HPP__ diff --git a/src/llm.hpp b/src/llm.hpp index 0dbd37d9..e3614b22 100644 --- a/src/llm.hpp +++ b/src/llm.hpp @@ -1460,13 +1460,18 @@ namespace LLM { params.num_kv_heads = 8; params.qkv_bias = false; params.rms_norm_eps = 1e-5f; - } else if (arch == LLMArch::QWEN3) { + } else if (arch == LLMArch::QWEN3 || arch == LLMArch::QWEN3_VL) { params.head_dim = 128; params.num_heads = 32; params.num_kv_heads = 8; params.qkv_bias = false; params.qk_norm = true; params.rms_norm_eps = 1e-6f; + if (arch == LLMArch::QWEN3_VL) { + params.max_position_embeddings = 262144; + params.rope_thetas = {5000000.f}; + params.vision.arch = LLMVisionArch::QWEN3_VL; + } } else if (arch == LLMArch::GEMMA3_12B) { params.head_dim = 256; params.num_heads = 16; diff --git a/src/model.cpp b/src/model.cpp index 404d73a1..e451b126 100644 --- a/src/model.cpp +++ b/src/model.cpp @@ -435,6 +435,9 @@ SDVersion ModelLoader::get_sd_version() { if (tensor_storage.name.find("model.diffusion_model.net.lq_proj.latent_proj.0.weight") != std::string::npos) { return VERSION_PID; } + if (tensor_storage.name.find("embed_image_indicator.weight") != std::string::npos) { + return VERSION_IDEOGRAM4; + } if (tensor_storage.name.find("model.diffusion_model.nerf_final_layer_conv.") != std::string::npos) { return VERSION_CHROMA_RADIANCE; } @@ -1254,6 +1257,8 @@ bool ModelLoader::tensor_should_be_converted(const TensorStorage& tensor_storage // Pass, do not convert } else if (ends_with(name, ".scale")) { // Pass, do not convert + } else if (ends_with(name, ".weight_scale")) { + // Pass, do not convert } else if (contains(name, "img_in.") || contains(name, "txt_in.") || contains(name, "time_in.") || diff --git a/src/model.h b/src/model.h index 9c49d82e..322a8614 100644 --- a/src/model.h +++ b/src/model.h @@ -50,6 +50,7 @@ enum SDVersion { VERSION_LENS, VERSION_LONGCAT, VERSION_PID, + VERSION_IDEOGRAM4, VERSION_COUNT, }; @@ -172,8 +173,15 @@ static inline bool sd_version_is_pid(SDVersion version) { return false; } +static inline bool sd_version_is_ideogram4(SDVersion version) { + if (version == VERSION_IDEOGRAM4) { + return true; + } + return false; +} + static inline bool sd_version_uses_flux2_vae(SDVersion version) { - if (sd_version_is_flux2(version) || sd_version_is_ernie_image(version) || sd_version_is_lens(version)) { + if (sd_version_is_flux2(version) || sd_version_is_ernie_image(version) || sd_version_is_lens(version) || sd_version_is_ideogram4(version)) { return true; } return false; @@ -203,7 +211,8 @@ static inline bool sd_version_is_dit(SDVersion version) { sd_version_is_ernie_image(version) || sd_version_is_lens(version) || sd_version_is_longcat(version) || - sd_version_is_pid(version)) { + sd_version_is_pid(version) || + sd_version_is_ideogram4(version)) { return true; } return false; diff --git a/src/rope.hpp b/src/rope.hpp index 45f2d95a..fb3e4ac5 100644 --- a/src/rope.hpp +++ b/src/rope.hpp @@ -249,6 +249,40 @@ namespace Rope { return embed_nd(ids, bs, axis_thetas, axes_dim, wrap_dims, layout); } + __STATIC_INLINE__ std::vector embed_interleaved_mrope(const std::vector>& ids, + int bs, + float theta, + int head_dim, + const std::vector& mrope_section) { + GGML_ASSERT(bs > 0); + GGML_ASSERT(head_dim % 2 == 0); + GGML_ASSERT(mrope_section.size() >= 3); + + std::vector> trans_ids = transpose(ids); + size_t pos_len = ids.size() / bs; + int half_dim = head_dim / 2; + + std::vector>> axis_embs; + axis_embs.reserve(3); + for (int axis = 0; axis < 3; ++axis) { + axis_embs.push_back(rope(trans_ids[axis], head_dim, theta)); + } + + std::vector> emb = axis_embs[0]; + for (int axis = 1; axis < 3; ++axis) { + int length = std::min(mrope_section[axis] * 3, half_dim); + for (int freq_idx = axis; freq_idx < length; freq_idx += 3) { + for (size_t pos_idx = 0; pos_idx < bs * pos_len; ++pos_idx) { + for (int k = 0; k < 4; ++k) { + emb[pos_idx][4 * freq_idx + k] = axis_embs[axis][pos_idx][4 * freq_idx + k]; + } + } + } + } + + return flatten(emb); + } + __STATIC_INLINE__ std::vector embed_2d_interleaved(int height, int width, int dim, diff --git a/src/stable-diffusion.cpp b/src/stable-diffusion.cpp index 11e36fda..3080ca0f 100644 --- a/src/stable-diffusion.cpp +++ b/src/stable-diffusion.cpp @@ -23,6 +23,7 @@ #include "flux.hpp" #include "guidance.h" #include "hidream_o1.hpp" +#include "ideogram4.hpp" #include "lens.hpp" #include "lora.hpp" #include "ltx_audio_vae.h" @@ -84,6 +85,7 @@ const char* model_version_to_str[] = { "Lens", "Longcat-Image", "PiD", + "Ideogram 4", }; const char* sampling_methods_str[] = { @@ -315,6 +317,13 @@ public: } } + if (strlen(SAFE_STR(sd_ctx_params->uncond_diffusion_model_path)) > 0) { + LOG_INFO("loading unconditional diffusion model from '%s'", sd_ctx_params->uncond_diffusion_model_path); + if (!model_loader.init_from_file(sd_ctx_params->uncond_diffusion_model_path, "model.diffusion_model.uncond.")) { + LOG_WARN("loading unconditional diffusion model from '%s' failed", sd_ctx_params->uncond_diffusion_model_path); + } + } + bool is_unet = sd_version_is_unet(model_loader.get_sd_version()); if (strlen(SAFE_STR(sd_ctx_params->clip_l_path)) > 0) { @@ -547,6 +556,17 @@ public: params_backend_for(SDBackendModule::DIFFUSION), tensor_storage_map, "model.diffusion_model.net"); + } else if (sd_version_is_ideogram4(version)) { + cond_stage_model = std::make_shared(backend_for(SDBackendModule::TE), + params_backend_for(SDBackendModule::TE), + tensor_storage_map, + version, + "", + false); + diffusion_model = std::make_shared(backend_for(SDBackendModule::DIFFUSION), + params_backend_for(SDBackendModule::DIFFUSION), + tensor_storage_map, + "model.diffusion_model"); } else if (sd_version_is_flux(version)) { bool is_chroma = false; for (auto pair : tensor_storage_map) { @@ -1024,6 +1044,12 @@ public: ignore_tensors.insert("text_encoders.llm.model.layers.0.mlp.experts.gate_up_proj.weight_scale_2"); ignore_tensors.insert("text_encoders.llm.model.layers.0.mlp.experts.down_proj.weight_scale_2"); } + if (sd_version_is_ideogram4(version)) { + ignore_tensors.insert("text_encoders.llm.lm_head."); + ignore_tensors.insert("text_encoders.llm.visual."); + ignore_tensors.insert("text_encoders.llm.vision_model."); + ignore_tensors.insert("text_encoders.llm.tokenizer_json"); + } if (version == VERSION_HIDREAM_O1) { ignore_tensors.insert("lm_head."); ignore_tensors.insert("model.visual.deepstack_merger_list."); @@ -1199,7 +1225,8 @@ public: sd_version_is_anima(version) || sd_version_is_ernie_image(version) || sd_version_is_z_image(version) || - sd_version_is_pid(version)) { + sd_version_is_pid(version) || + sd_version_is_ideogram4(version)) { pred_type = FLOW_PRED; if (sd_version_is_wan(version)) { default_flow_shift = 5.f; @@ -1207,6 +1234,8 @@ public: default_flow_shift = 4.f; } else if (sd_version_is_pid(version)) { default_flow_shift = 1.5f; + } else if (sd_version_is_ideogram4(version)) { + default_flow_shift = 1.0f; } else { default_flow_shift = 3.f; } @@ -1869,7 +1898,7 @@ public: if (version == VERSION_HIDREAM_O1) { return std::vector{1.0f - (t / static_cast(TIMESTEPS))}; } - if (sd_version_is_z_image(version)) { + if (sd_version_is_z_image(version) || sd_version_is_ideogram4(version)) { return std::vector{1000.f - t}; } return std::vector{t}; @@ -2771,6 +2800,7 @@ char* sd_ctx_params_to_str(const sd_ctx_params_t* sd_ctx_params) { "llm_vision_path: %s\n" "diffusion_model_path: %s\n" "high_noise_diffusion_model_path: %s\n" + "uncond_diffusion_model_path: %s\n" "embeddings_connectors_path: %s\n" "vae_path: %s\n" "audio_vae_path: %s\n" @@ -2810,6 +2840,7 @@ char* sd_ctx_params_to_str(const sd_ctx_params_t* sd_ctx_params) { SAFE_STR(sd_ctx_params->llm_vision_path), SAFE_STR(sd_ctx_params->diffusion_model_path), SAFE_STR(sd_ctx_params->high_noise_diffusion_model_path), + SAFE_STR(sd_ctx_params->uncond_diffusion_model_path), SAFE_STR(sd_ctx_params->embeddings_connectors_path), SAFE_STR(sd_ctx_params->vae_path), SAFE_STR(sd_ctx_params->audio_vae_path), @@ -4178,16 +4209,20 @@ static std::optional prepare_image_generation_embeds(sd_c SDCondition uncond; if (request->use_uncond || request->use_high_noise_uncond) { - bool zero_out_masked = false; - if (sd_version_is_sdxl(sd_ctx->sd->version) && - request->negative_prompt.empty() && - !sd_ctx->sd->is_using_edm_v_parameterization) { - zero_out_masked = true; + if (sd_version_is_ideogram4(sd_ctx->sd->version)) { + uncond.c_vector = sd::Tensor::from_vector({1.0f}); + } else { + bool zero_out_masked = false; + if (sd_version_is_sdxl(sd_ctx->sd->version) && + request->negative_prompt.empty() && + !sd_ctx->sd->is_using_edm_v_parameterization) { + zero_out_masked = true; + } + condition_params.text = request->negative_prompt; + condition_params.zero_out_masked = zero_out_masked; + uncond = sd_ctx->sd->cond_stage_model->get_learned_condition(sd_ctx->sd->n_threads, + condition_params); } - condition_params.text = request->negative_prompt; - condition_params.zero_out_masked = zero_out_masked; - uncond = sd_ctx->sd->cond_stage_model->get_learned_condition(sd_ctx->sd->n_threads, - condition_params); if (uncond.c_concat.empty()) { uncond.c_concat = latents->concat_latent; // TODO: optimize }