Compare commits

..

24 Commits

Author SHA1 Message Date
DN6 ba2ef8bc40 update 2025-01-21 06:19:32 +05:30
baymax591 75a636da48 bugfix for npu not support float64 (#10123)
* bugfix for npu not support float64

* is_mps is_npu

---------

Co-authored-by: 白超 <baichao19@huawei.com>
Co-authored-by: hlky <hlky@hlky.ac>
2025-01-20 09:35:24 -10:00
sunxunle 4842f5d8de chore: remove redundant words (#10609)
Signed-off-by: sunxunle <sunxunle@ampere.tech>
2025-01-20 08:15:26 -10:00
Sayak Paul 328e0d20a7 [training] set rest of the blocks with requires_grad False. (#10607)
set rest of the blocks with requires_grad False.
2025-01-19 19:34:53 +05:30
Shenghai Yuan 23b467c79c [core] ConsisID (#10140)
* Update __init__.py

* add consisid

* update consisid

* update consisid

* make style

* make_style

* Update src/diffusers/pipelines/consisid/pipeline_consisid.py

Co-authored-by: hlky <hlky@hlky.ac>

* Update src/diffusers/pipelines/consisid/pipeline_consisid.py

Co-authored-by: hlky <hlky@hlky.ac>

* Update src/diffusers/pipelines/consisid/pipeline_consisid.py

Co-authored-by: hlky <hlky@hlky.ac>

* Update src/diffusers/pipelines/consisid/pipeline_consisid.py

Co-authored-by: hlky <hlky@hlky.ac>

* Update src/diffusers/pipelines/consisid/pipeline_consisid.py

Co-authored-by: hlky <hlky@hlky.ac>

* Update src/diffusers/pipelines/consisid/pipeline_consisid.py

Co-authored-by: hlky <hlky@hlky.ac>

* add doc

* make style

* Rename consisid .md to consisid.md

* Update geodiff_molecule_conformation.ipynb

* Update geodiff_molecule_conformation.ipynb

* Update geodiff_molecule_conformation.ipynb

* Update demo.ipynb

* Update pipeline_consisid.py

* make fix-copies

* Update docs/source/en/using-diffusers/consisid.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update src/diffusers/pipelines/consisid/pipeline_consisid.py

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update src/diffusers/pipelines/consisid/pipeline_consisid.py

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/using-diffusers/consisid.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/using-diffusers/consisid.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* update doc & pipeline code

* fix typo

* make style

* update example

* Update docs/source/en/using-diffusers/consisid.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* update example

* update example

* Update src/diffusers/pipelines/consisid/pipeline_consisid.py

Co-authored-by: hlky <hlky@hlky.ac>

* Update src/diffusers/pipelines/consisid/pipeline_consisid.py

Co-authored-by: hlky <hlky@hlky.ac>

* update

* add test and update

* remove some changes from docs

* refactor

* fix

* undo changes to examples

* remove save/load and fuse methods

* update

* link hf-doc-img & make test extremely small

* update

* add lora

* fix test

* update

* update

* change expected_diff_max to 0.4

* fix typo

* fix link

* fix typo

* update docs

* update

* remove consisid lora tests

---------

Co-authored-by: hlky <hlky@hlky.ac>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Co-authored-by: Aryan <aryan@huggingface.co>
2025-01-19 13:10:08 +05:30
Juan Acevedo aeac0a00f8 implementing flux on TPUs with ptxla (#10515)
* implementing flux on TPUs with ptxla

* add xla flux attention class

* run make style/quality

* Update src/diffusers/models/attention_processor.py

Co-authored-by: YiYi Xu <yixu310@gmail.com>

* Update src/diffusers/models/attention_processor.py

Co-authored-by: YiYi Xu <yixu310@gmail.com>

* run style and quality

---------

Co-authored-by: Juan Acevedo <jfacevedo@google.com>
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Co-authored-by: YiYi Xu <yixu310@gmail.com>
2025-01-16 08:46:02 -10:00
Leo Jiang cecada5280 NPU adaption for RMSNorm (#10534)
* NPU adaption for RMSNorm

* NPU adaption for RMSNorm

---------

Co-authored-by: J石页 <jiangshuo9@h-partners.com>
2025-01-16 08:45:29 -10:00
C 17d99c4d22 [Docs] Add documentation about using ParaAttention to optimize FLUX and HunyuanVideo (#10544)
* add para_attn_flux.md and para_attn_hunyuan_video.md

* add enable_sequential_cpu_offload in para_attn_hunyuan_video.md

* add comment

* refactor

* fix

* fix

* Update docs/source/en/optimization/para_attn.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/optimization/para_attn.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/optimization/para_attn.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/optimization/para_attn.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/optimization/para_attn.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/optimization/para_attn.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/optimization/para_attn.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/optimization/para_attn.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/optimization/para_attn.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/optimization/para_attn.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* fix

* update links

* Update docs/source/en/optimization/para_attn.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/optimization/para_attn.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/optimization/para_attn.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* fix

* Update docs/source/en/optimization/para_attn.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/optimization/para_attn.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/optimization/para_attn.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/optimization/para_attn.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/optimization/para_attn.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/optimization/para_attn.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/optimization/para_attn.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/optimization/para_attn.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
2025-01-16 10:05:13 -08:00
hlky 08e62fe0c2 Scheduling fixes on MPS (#10549)
* use np.int32 in scheduling

* test_add_noise_device

* -np.int32, fixes
2025-01-16 07:45:03 -10:00
Daniel Regado 9e1b8a0017 [Docs] Update SD3 ip_adapter model_id to diffusers checkpoint (#10597)
Update to diffusers ip_adapter ckpt
2025-01-16 07:43:29 -10:00
hlky 0b065c099a Move buffers to device (#10523)
* Move buffers to device

* add test

* named_buffers
2025-01-16 07:42:56 -10:00
Junyu Chen b785ddb654 [DC-AE, SANA] fix SanaMultiscaleLinearAttention apply_quadratic_attention bf16 (#10595)
* autoencoder_dc tiling

* add tiling and slicing support in SANA pipelines

* create variables for padding length because the line becomes too long

* add tiling and slicing support in pag SANA pipelines

* revert changes to tile size

* make style

* add vae tiling test

* fix SanaMultiscaleLinearAttention apply_quadratic_attention bf16

---------

Co-authored-by: Aryan <aryan@huggingface.co>
2025-01-16 16:49:02 +05:30
Daniel Regado e8114bd068 IP-Adapter for StableDiffusion3Img2ImgPipeline (#10589)
Added support for IP-Adapter
2025-01-16 09:46:22 +00:00
Leo Jiang b0c8973834 [Sana 4K] Add vae tiling option to avoid OOM (#10583)
Co-authored-by: J石页 <jiangshuo9@h-partners.com>
2025-01-16 02:06:07 +05:30
Sayak Paul c944f0651f [Chore] fix vae annotation in mochi pipeline (#10585)
fix vae annotation in mochi pipeline
2025-01-15 15:19:51 +05:30
Sayak Paul bba59fb88b [Tests] add: test to check 8bit bnb quantized models work with lora loading. (#10576)
* add: test to check 8bit bnb quantized models work with lora loading.

* Update tests/quantization/bnb/test_mixed_int8.py

Co-authored-by: Dhruv Nair <dhruv.nair@gmail.com>

---------

Co-authored-by: Dhruv Nair <dhruv.nair@gmail.com>
2025-01-15 13:05:26 +05:30
Sayak Paul 2432f80ca3 [LoRA] feat: support loading loras into 4bit quantized Flux models. (#10578)
* feat: support loading loras into 4bit quantized models.

* updates

* update

* remove weight check.
2025-01-15 12:40:40 +05:30
Aryan f9e957f011 Fix offload tests for CogVideoX and CogView3 (#10547)
* update

* update
2025-01-15 12:24:46 +05:30
Daniel Regado 4dec63c18e IP-Adapter for StableDiffusion3InpaintPipeline (#10581)
* Added support for IP-Adapter

* Added joint_attention_kwargs property
2025-01-15 06:52:23 +00:00
Junsong Chen 3d70777379 [Sana-4K] (#10537)
* [Sana 4K]
add 4K support for Sana

* [Sana-4K] fix SanaPAGPipeline

* add VAE automatically tiling function;

* set clean_caption to False;

* add warnings for VAE OOM.

* style

---------

Co-authored-by: yiyixuxu <yixu310@gmail.com>
2025-01-14 11:48:56 -10:00
Teriks 6b727842d7 allow passing hf_token to load_textual_inversion (#10546)
Co-authored-by: Teriks <Teriks@users.noreply.github.com>
2025-01-14 11:48:34 -10:00
Dhruv Nair be62c85cd9 [CI] Update HF Token on Fast GPU Model Tests (#10570)
update
2025-01-14 17:00:32 +05:30
Marc Sun fbff43acc9 [FEAT] DDUF format (#10037)
* load and save dduf archive

* style

* switch to zip uncompressed

* updates

* Update src/diffusers/pipelines/pipeline_utils.py

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

* Update src/diffusers/pipelines/pipeline_utils.py

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

* first draft

* remove print

* switch to dduf_file for consistency

* switch to huggingface hub api

* fix log

* add a basic test

* Update src/diffusers/configuration_utils.py

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

* Update src/diffusers/pipelines/pipeline_utils.py

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

* Update src/diffusers/pipelines/pipeline_utils.py

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

* fix

* fix variant

* change saving logic

* DDUF - Load transformers components manually (#10171)

* update hfh version

* Load transformers components manually

* load encoder from_pretrained with state_dict

* working version with transformers and tokenizer !

* add generation_config case

* fix tests

* remove saving for now

* typing

* need next version from transformers

* Update src/diffusers/configuration_utils.py

Co-authored-by: Lucain <lucain@huggingface.co>

* check path corectly

* Apply suggestions from code review

Co-authored-by: Lucain <lucain@huggingface.co>

* udapte

* typing

* remove check for subfolder

* quality

* revert setup changes

* oups

* more readable condition

* add loading from the hub test

* add basic docs.

* Apply suggestions from code review

Co-authored-by: Lucain <lucain@huggingface.co>

* add example

* add

* make functions private

* Apply suggestions from code review

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* minor.

* fixes

* fix

* change the precdence of parameterized.

* error out when custom pipeline is passed with dduf_file.

* updates

* fix

* updates

* fixes

* updates

* fix xfail condition.

* fix xfail

* fixes

* sharded checkpoint compat

* add test for sharded checkpoint

* add suggestions

* Update src/diffusers/models/model_loading_utils.py

Co-authored-by: YiYi Xu <yixu310@gmail.com>

* from suggestions

* add class attributes to flag dduf tests

* last one

* fix logic

* remove comment

* revert changes

---------

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Co-authored-by: Lucain <lucain@huggingface.co>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Co-authored-by: YiYi Xu <yixu310@gmail.com>
2025-01-14 13:21:42 +05:30
Dhruv Nair 3279751bf9 [CI] Update HF Token in Fast GPU Tests (#10568)
update
2025-01-14 13:04:26 +05:30
142 changed files with 5123 additions and 167 deletions
+3 -3
View File
@@ -265,7 +265,7 @@ jobs:
- name: Run PyTorch CUDA tests
env:
HF_TOKEN: ${{ secrets.HF_TOKEN }}
HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }}
# https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms
CUBLAS_WORKSPACE_CONFIG: :16:8
run: |
@@ -505,7 +505,7 @@ jobs:
# shell: arch -arch arm64 bash {0}
# env:
# HF_HOME: /System/Volumes/Data/mnt/cache
# HF_TOKEN: ${{ secrets.HF_TOKEN }}
# HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }}
# run: |
# ${CONDA_RUN} python -m pytest -n 1 -s -v --make-reports=tests_torch_mps \
# --report-log=tests_torch_mps.log \
@@ -561,7 +561,7 @@ jobs:
# shell: arch -arch arm64 bash {0}
# env:
# HF_HOME: /System/Volumes/Data/mnt/cache
# HF_TOKEN: ${{ secrets.HF_TOKEN }}
# HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }}
# run: |
# ${CONDA_RUN} python -m pytest -n 1 -s -v --make-reports=tests_torch_mps \
# --report-log=tests_torch_mps.log \
+6 -6
View File
@@ -137,7 +137,7 @@ jobs:
- name: Run PyTorch CUDA tests
env:
HF_TOKEN: ${{ secrets.HF_TOKEN }}
HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }}
# https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms
CUBLAS_WORKSPACE_CONFIG: :16:8
run: |
@@ -187,7 +187,7 @@ jobs:
- name: Run Flax TPU tests
env:
HF_TOKEN: ${{ secrets.HF_TOKEN }}
HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }}
run: |
python -m pytest -n 0 \
-s -v -k "Flax" \
@@ -235,7 +235,7 @@ jobs:
- name: Run ONNXRuntime CUDA tests
env:
HF_TOKEN: ${{ secrets.HF_TOKEN }}
HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }}
run: |
python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile \
-s -v -k "Onnx" \
@@ -283,7 +283,7 @@ jobs:
python utils/print_env.py
- name: Run example tests on GPU
env:
HF_TOKEN: ${{ secrets.HF_TOKEN }}
HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }}
RUN_COMPILE: yes
run: |
python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile -s -v -k "compile" --make-reports=tests_torch_compile_cuda tests/
@@ -326,7 +326,7 @@ jobs:
python utils/print_env.py
- name: Run example tests on GPU
env:
HF_TOKEN: ${{ secrets.HF_TOKEN }}
HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }}
run: |
python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile -s -v -k "xformers" --make-reports=tests_torch_xformers_cuda tests/
- name: Failure short reports
@@ -372,7 +372,7 @@ jobs:
- name: Run example tests on GPU
env:
HF_TOKEN: ${{ secrets.HF_TOKEN }}
HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }}
run: |
python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH"
python -m uv pip install timm
+8 -8
View File
@@ -81,7 +81,7 @@ jobs:
python utils/print_env.py
- name: Slow PyTorch CUDA checkpoint tests on Ubuntu
env:
HF_TOKEN: ${{ secrets.HF_TOKEN }}
HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }}
# https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms
CUBLAS_WORKSPACE_CONFIG: :16:8
run: |
@@ -135,7 +135,7 @@ jobs:
- name: Run PyTorch CUDA tests
env:
HF_TOKEN: ${{ secrets.HF_TOKEN }}
HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }}
# https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms
CUBLAS_WORKSPACE_CONFIG: :16:8
run: |
@@ -186,7 +186,7 @@ jobs:
- name: Run PyTorch CUDA tests
env:
HF_TOKEN: ${{ secrets.HF_TOKEN }}
HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }}
# https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms
CUBLAS_WORKSPACE_CONFIG: :16:8
run: |
@@ -241,7 +241,7 @@ jobs:
- name: Run slow Flax TPU tests
env:
HF_TOKEN: ${{ secrets.HF_TOKEN }}
HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }}
run: |
python -m pytest -n 0 \
-s -v -k "Flax" \
@@ -289,7 +289,7 @@ jobs:
- name: Run slow ONNXRuntime CUDA tests
env:
HF_TOKEN: ${{ secrets.HF_TOKEN }}
HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }}
run: |
python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile \
-s -v -k "Onnx" \
@@ -337,7 +337,7 @@ jobs:
python utils/print_env.py
- name: Run example tests on GPU
env:
HF_TOKEN: ${{ secrets.HF_TOKEN }}
HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }}
RUN_COMPILE: yes
run: |
python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile -s -v -k "compile" --make-reports=tests_torch_compile_cuda tests/
@@ -380,7 +380,7 @@ jobs:
python utils/print_env.py
- name: Run example tests on GPU
env:
HF_TOKEN: ${{ secrets.HF_TOKEN }}
HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }}
run: |
python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile -s -v -k "xformers" --make-reports=tests_torch_xformers_cuda tests/
- name: Failure short reports
@@ -426,7 +426,7 @@ jobs:
- name: Run example tests on GPU
env:
HF_TOKEN: ${{ secrets.HF_TOKEN }}
HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }}
run: |
python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH"
python -m uv pip install timm
+8
View File
@@ -79,6 +79,8 @@
- sections:
- local: using-diffusers/cogvideox
title: CogVideoX
- local: using-diffusers/consisid
title: ConsisID
- local: using-diffusers/sdxl
title: Stable Diffusion XL
- local: using-diffusers/sdxl_turbo
@@ -179,6 +181,8 @@
title: TGATE
- local: optimization/xdit
title: xDiT
- local: optimization/para_attn
title: ParaAttention
- sections:
- local: using-diffusers/stable_diffusion_jax_how_to
title: JAX/Flax
@@ -268,6 +272,8 @@
title: AuraFlowTransformer2DModel
- local: api/models/cogvideox_transformer3d
title: CogVideoXTransformer3DModel
- local: api/models/consisid_transformer3d
title: ConsisIDTransformer3DModel
- local: api/models/cogview3plus_transformer2d
title: CogView3PlusTransformer2DModel
- local: api/models/dit_transformer2d
@@ -370,6 +376,8 @@
title: CogVideoX
- local: api/pipelines/cogview3
title: CogView3
- local: api/pipelines/consisid
title: ConsisID
- local: api/pipelines/consistency_models
title: Consistency Models
- local: api/pipelines/controlnet
@@ -0,0 +1,30 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->
# ConsisIDTransformer3DModel
A Diffusion Transformer model for 3D data from [ConsisID](https://github.com/PKU-YuanGroup/ConsisID) was introduced in [Identity-Preserving Text-to-Video Generation by Frequency Decomposition](https://arxiv.org/pdf/2411.17440) by Peking University & University of Rochester & etc.
The model can be loaded with the following code snippet.
```python
from diffusers import ConsisIDTransformer3DModel
transformer = ConsisIDTransformer3DModel.from_pretrained("BestWishYsh/ConsisID-preview", subfolder="transformer", torch_dtype=torch.bfloat16).to("cuda")
```
## ConsisIDTransformer3DModel
[[autodoc]] ConsisIDTransformer3DModel
## Transformer2DModelOutput
[[autodoc]] models.modeling_outputs.Transformer2DModelOutput
+60
View File
@@ -0,0 +1,60 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
-->
# ConsisID
[Identity-Preserving Text-to-Video Generation by Frequency Decomposition](https://arxiv.org/abs/2411.17440) from Peking University & University of Rochester & etc, by Shenghai Yuan, Jinfa Huang, Xianyi He, Yunyang Ge, Yujun Shi, Liuhan Chen, Jiebo Luo, Li Yuan.
The abstract from the paper is:
*Identity-preserving text-to-video (IPT2V) generation aims to create high-fidelity videos with consistent human identity. It is an important task in video generation but remains an open problem for generative models. This paper pushes the technical frontier of IPT2V in two directions that have not been resolved in the literature: (1) A tuning-free pipeline without tedious case-by-case finetuning, and (2) A frequency-aware heuristic identity-preserving Diffusion Transformer (DiT)-based control scheme. To achieve these goals, we propose **ConsisID**, a tuning-free DiT-based controllable IPT2V model to keep human-**id**entity **consis**tent in the generated video. Inspired by prior findings in frequency analysis of vision/diffusion transformers, it employs identity-control signals in the frequency domain, where facial features can be decomposed into low-frequency global features (e.g., profile, proportions) and high-frequency intrinsic features (e.g., identity markers that remain unaffected by pose changes). First, from a low-frequency perspective, we introduce a global facial extractor, which encodes the reference image and facial key points into a latent space, generating features enriched with low-frequency information. These features are then integrated into the shallow layers of the network to alleviate training challenges associated with DiT. Second, from a high-frequency perspective, we design a local facial extractor to capture high-frequency details and inject them into the transformer blocks, enhancing the model's ability to preserve fine-grained features. To leverage the frequency information for identity preservation, we propose a hierarchical training strategy, transforming a vanilla pre-trained video generation model into an IPT2V model. Extensive experiments demonstrate that our frequency-aware heuristic scheme provides an optimal control solution for DiT-based models. Thanks to this scheme, our **ConsisID** achieves excellent results in generating high-quality, identity-preserving videos, making strides towards more effective IPT2V. The model weight of ConsID is publicly available at https://github.com/PKU-YuanGroup/ConsisID.*
<Tip>
Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers.md) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading.md#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.
</Tip>
This pipeline was contributed by [SHYuanBest](https://github.com/SHYuanBest). The original codebase can be found [here](https://github.com/PKU-YuanGroup/ConsisID). The original weights can be found under [hf.co/BestWishYsh](https://huggingface.co/BestWishYsh).
There are two official ConsisID checkpoints for identity-preserving text-to-video.
| checkpoints | recommended inference dtype |
|:---:|:---:|
| [`BestWishYsh/ConsisID-preview`](https://huggingface.co/BestWishYsh/ConsisID-preview) | torch.bfloat16 |
| [`BestWishYsh/ConsisID-1.5`](https://huggingface.co/BestWishYsh/ConsisID-preview) | torch.bfloat16 |
### Memory optimization
ConsisID requires about 44 GB of GPU memory to decode 49 frames (6 seconds of video at 8 FPS) with output resolution 720x480 (W x H), which makes it not possible to run on consumer GPUs or free-tier T4 Colab. The following memory optimizations could be used to reduce the memory footprint. For replication, you can refer to [this](https://gist.github.com/SHYuanBest/bc4207c36f454f9e969adbb50eaf8258) script.
| Feature (overlay the previous) | Max Memory Allocated | Max Memory Reserved |
| :----------------------------- | :------------------- | :------------------ |
| - | 37 GB | 44 GB |
| enable_model_cpu_offload | 22 GB | 25 GB |
| enable_sequential_cpu_offload | 16 GB | 22 GB |
| vae.enable_slicing | 16 GB | 22 GB |
| vae.enable_tiling | 5 GB | 7 GB |
## ConsisIDPipeline
[[autodoc]] ConsisIDPipeline
- all
- __call__
## ConsisIDPipelineOutput
[[autodoc]] pipelines.consisid.pipeline_output.ConsisIDPipelineOutput
+1 -1
View File
@@ -115,7 +115,7 @@ export_to_video(frames, "mochi.mp4", fps=30)
## Reproducing the results from the Genmo Mochi repo
The [Genmo Mochi implementation](https://github.com/genmoai/mochi/tree/main) uses different precision values for each stage in the inference process. The text encoder and VAE use `torch.float32`, while the DiT uses `torch.bfloat16` with the [attention kernel](https://pytorch.org/docs/stable/generated/torch.nn.attention.sdpa_kernel.html#torch.nn.attention.sdpa_kernel) set to `EFFICIENT_ATTENTION`. Diffusers pipelines currently do not support setting different `dtypes` for different stages of the pipeline. In order to run inference in the same way as the the original implementation, please refer to the following example.
The [Genmo Mochi implementation](https://github.com/genmoai/mochi/tree/main) uses different precision values for each stage in the inference process. The text encoder and VAE use `torch.float32`, while the DiT uses `torch.bfloat16` with the [attention kernel](https://pytorch.org/docs/stable/generated/torch.nn.attention.sdpa_kernel.html#torch.nn.attention.sdpa_kernel) set to `EFFICIENT_ATTENTION`. Diffusers pipelines currently do not support setting different `dtypes` for different stages of the pipeline. In order to run inference in the same way as the original implementation, please refer to the following example.
<Tip>
The original Mochi implementation zeros out empty prompts. However, enabling this option and placing the entire pipeline under autocast can lead to numerical overflows with the T5 text encoder.
@@ -77,7 +77,7 @@ from diffusers import StableDiffusion3Pipeline
from transformers import SiglipVisionModel, SiglipImageProcessor
image_encoder_id = "google/siglip-so400m-patch14-384"
ip_adapter_id = "InstantX/SD3.5-Large-IP-Adapter"
ip_adapter_id = "guiyrt/InstantX-SD3.5-Large-IP-Adapter-diffusers"
feature_extractor = SiglipImageProcessor.from_pretrained(
image_encoder_id,
+497
View File
@@ -0,0 +1,497 @@
# ParaAttention
<div class="flex justify-center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/para-attn/flux-performance.png">
</div>
<div class="flex justify-center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/para-attn/hunyuan-video-performance.png">
</div>
Large image and video generation models, such as [FLUX.1-dev](https://huggingface.co/black-forest-labs/FLUX.1-dev) and [HunyuanVideo](https://huggingface.co/tencent/HunyuanVideo), can be an inference challenge for real-time applications and deployment because of their size.
[ParaAttention](https://github.com/chengzeyi/ParaAttention) is a library that implements **context parallelism** and **first block cache**, and can be combined with other techniques (torch.compile, fp8 dynamic quantization), to accelerate inference.
This guide will show you how to apply ParaAttention to FLUX.1-dev and HunyuanVideo on NVIDIA L20 GPUs.
No optimizations are applied for our baseline benchmark, except for HunyuanVideo to avoid out-of-memory errors.
Our baseline benchmark shows that FLUX.1-dev is able to generate a 1024x1024 resolution image in 28 steps in 26.36 seconds, and HunyuanVideo is able to generate 129 frames at 720p resolution in 30 steps in 3675.71 seconds.
> [!TIP]
> For even faster inference with context parallelism, try using NVIDIA A100 or H100 GPUs (if available) with NVLink support, especially when there is a large number of GPUs.
## First Block Cache
Caching the output of the transformers blocks in the model and reusing them in the next inference steps reduces the computation cost and makes inference faster.
However, it is hard to decide when to reuse the cache to ensure quality generated images or videos. ParaAttention directly uses the **residual difference of the first transformer block output** to approximate the difference among model outputs. When the difference is small enough, the residual difference of previous inference steps is reused. In other words, the denoising step is skipped.
This achieves a 2x speedup on FLUX.1-dev and HunyuanVideo inference with very good quality.
<figure>
<img src="https://huggingface.co/datasets/chengzeyi/documentation-images/resolve/main/diffusers/para-attn/ada-cache.png" alt="Cache in Diffusion Transformer" />
<figcaption>How AdaCache works, First Block Cache is a variant of it</figcaption>
</figure>
<hfoptions id="first-block-cache">
<hfoption id="FLUX-1.dev">
To apply first block cache on FLUX.1-dev, call `apply_cache_on_pipe` as shown below. 0.08 is the default residual difference value for FLUX models.
```python
import time
import torch
from diffusers import FluxPipeline
pipe = FluxPipeline.from_pretrained(
"black-forest-labs/FLUX.1-dev",
torch_dtype=torch.bfloat16,
).to("cuda")
from para_attn.first_block_cache.diffusers_adapters import apply_cache_on_pipe
apply_cache_on_pipe(pipe, residual_diff_threshold=0.08)
# Enable memory savings
# pipe.enable_model_cpu_offload()
# pipe.enable_sequential_cpu_offload()
begin = time.time()
image = pipe(
"A cat holding a sign that says hello world",
num_inference_steps=28,
).images[0]
end = time.time()
print(f"Time: {end - begin:.2f}s")
print("Saving image to flux.png")
image.save("flux.png")
```
| Optimizations | Original | FBCache rdt=0.06 | FBCache rdt=0.08 | FBCache rdt=0.10 | FBCache rdt=0.12 |
| - | - | - | - | - | - |
| Preview | ![Original](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/para-attn/flux-original.png) | ![FBCache rdt=0.06](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/para-attn/flux-fbc-0.06.png) | ![FBCache rdt=0.08](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/para-attn/flux-fbc-0.08.png) | ![FBCache rdt=0.10](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/para-attn/flux-fbc-0.10.png) | ![FBCache rdt=0.12](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/para-attn/flux-fbc-0.12.png) |
| Wall Time (s) | 26.36 | 21.83 | 17.01 | 16.00 | 13.78 |
First Block Cache reduced the inference speed to 17.01 seconds compared to the baseline, or 1.55x faster, while maintaining nearly zero quality loss.
</hfoption>
<hfoption id="HunyuanVideo">
To apply First Block Cache on HunyuanVideo, `apply_cache_on_pipe` as shown below. 0.06 is the default residual difference value for HunyuanVideo models.
```python
import time
import torch
from diffusers import HunyuanVideoPipeline, HunyuanVideoTransformer3DModel
from diffusers.utils import export_to_video
model_id = "tencent/HunyuanVideo"
transformer = HunyuanVideoTransformer3DModel.from_pretrained(
model_id,
subfolder="transformer",
torch_dtype=torch.bfloat16,
revision="refs/pr/18",
)
pipe = HunyuanVideoPipeline.from_pretrained(
model_id,
transformer=transformer,
torch_dtype=torch.float16,
revision="refs/pr/18",
).to("cuda")
from para_attn.first_block_cache.diffusers_adapters import apply_cache_on_pipe
apply_cache_on_pipe(pipe, residual_diff_threshold=0.6)
pipe.vae.enable_tiling()
begin = time.time()
output = pipe(
prompt="A cat walks on the grass, realistic",
height=720,
width=1280,
num_frames=129,
num_inference_steps=30,
).frames[0]
end = time.time()
print(f"Time: {end - begin:.2f}s")
print("Saving video to hunyuan_video.mp4")
export_to_video(output, "hunyuan_video.mp4", fps=15)
```
<video controls>
<source src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/para-attn/hunyuan-video-original.mp4" type="video/mp4">
Your browser does not support the video tag.
</video>
<small> HunyuanVideo without FBCache </small>
<video controls>
<source src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/para-attn/hunyuan-video-fbc.mp4" type="video/mp4">
Your browser does not support the video tag.
</video>
<small> HunyuanVideo with FBCache </small>
First Block Cache reduced the inference speed to 2271.06 seconds compared to the baseline, or 1.62x faster, while maintaining nearly zero quality loss.
</hfoption>
</hfoptions>
## fp8 quantization
fp8 with dynamic quantization further speeds up inference and reduces memory usage. Both the activations and weights must be quantized in order to use the 8-bit [NVIDIA Tensor Cores](https://www.nvidia.com/en-us/data-center/tensor-cores/).
Use `float8_weight_only` and `float8_dynamic_activation_float8_weight` to quantize the text encoder and transformer model.
The default quantization method is per tensor quantization, but if your GPU supports row-wise quantization, you can also try it for better accuracy.
Install [torchao](https://github.com/pytorch/ao/tree/main) with the command below.
```bash
pip3 install -U torch torchao
```
[torch.compile](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) with `mode="max-autotune-no-cudagraphs"` or `mode="max-autotune"` selects the best kernel for performance. Compilation can take a long time if it's the first time the model is called, but it is worth it once the model has been compiled.
This example only quantizes the transformer model, but you can also quantize the text encoder to reduce memory usage even more.
> [!TIP]
> Dynamic quantization can significantly change the distribution of the model output, so you need to change the `residual_diff_threshold` to a larger value for it to take effect.
<hfoptions id="fp8-quantization">
<hfoption id="FLUX-1.dev">
```python
import time
import torch
from diffusers import FluxPipeline
pipe = FluxPipeline.from_pretrained(
"black-forest-labs/FLUX.1-dev",
torch_dtype=torch.bfloat16,
).to("cuda")
from para_attn.first_block_cache.diffusers_adapters import apply_cache_on_pipe
apply_cache_on_pipe(
pipe,
residual_diff_threshold=0.12, # Use a larger value to make the cache take effect
)
from torchao.quantization import quantize_, float8_dynamic_activation_float8_weight, float8_weight_only
quantize_(pipe.text_encoder, float8_weight_only())
quantize_(pipe.transformer, float8_dynamic_activation_float8_weight())
pipe.transformer = torch.compile(
pipe.transformer, mode="max-autotune-no-cudagraphs",
)
# Enable memory savings
# pipe.enable_model_cpu_offload()
# pipe.enable_sequential_cpu_offload()
for i in range(2):
begin = time.time()
image = pipe(
"A cat holding a sign that says hello world",
num_inference_steps=28,
).images[0]
end = time.time()
if i == 0:
print(f"Warm up time: {end - begin:.2f}s")
else:
print(f"Time: {end - begin:.2f}s")
print("Saving image to flux.png")
image.save("flux.png")
```
fp8 dynamic quantization and torch.compile reduced the inference speed to 7.56 seconds compared to the baseline, or 3.48x faster.
</hfoption>
<hfoption id="HunyuanVideo">
```python
import time
import torch
from diffusers import HunyuanVideoPipeline, HunyuanVideoTransformer3DModel
from diffusers.utils import export_to_video
model_id = "tencent/HunyuanVideo"
transformer = HunyuanVideoTransformer3DModel.from_pretrained(
model_id,
subfolder="transformer",
torch_dtype=torch.bfloat16,
revision="refs/pr/18",
)
pipe = HunyuanVideoPipeline.from_pretrained(
model_id,
transformer=transformer,
torch_dtype=torch.float16,
revision="refs/pr/18",
).to("cuda")
from para_attn.first_block_cache.diffusers_adapters import apply_cache_on_pipe
apply_cache_on_pipe(pipe)
from torchao.quantization import quantize_, float8_dynamic_activation_float8_weight, float8_weight_only
quantize_(pipe.text_encoder, float8_weight_only())
quantize_(pipe.transformer, float8_dynamic_activation_float8_weight())
pipe.transformer = torch.compile(
pipe.transformer, mode="max-autotune-no-cudagraphs",
)
# Enable memory savings
pipe.vae.enable_tiling()
# pipe.enable_model_cpu_offload()
# pipe.enable_sequential_cpu_offload()
for i in range(2):
begin = time.time()
output = pipe(
prompt="A cat walks on the grass, realistic",
height=720,
width=1280,
num_frames=129,
num_inference_steps=1 if i == 0 else 30,
).frames[0]
end = time.time()
if i == 0:
print(f"Warm up time: {end - begin:.2f}s")
else:
print(f"Time: {end - begin:.2f}s")
print("Saving video to hunyuan_video.mp4")
export_to_video(output, "hunyuan_video.mp4", fps=15)
```
A NVIDIA L20 GPU only has 48GB memory and could face out-of-memory (OOM) errors after compilation and if `enable_model_cpu_offload` isn't called because HunyuanVideo has very large activation tensors when running with high resolution and large number of frames. For GPUs with less than 80GB of memory, you can try reducing the resolution and number of frames to avoid OOM errors.
Large video generation models are usually bottlenecked by the attention computations rather than the fully connected layers. These models don't significantly benefit from quantization and torch.compile.
</hfoption>
</hfoptions>
## Context Parallelism
Context Parallelism parallelizes inference and scales with multiple GPUs. The ParaAttention compositional design allows you to combine Context Parallelism with First Block Cache and dynamic quantization.
> [!TIP]
> Refer to the [ParaAttention](https://github.com/chengzeyi/ParaAttention/tree/main) repository for detailed instructions and examples of how to scale inference with multiple GPUs.
If the inference process needs to be persistent and serviceable, it is suggested to use [torch.multiprocessing](https://pytorch.org/docs/stable/multiprocessing.html) to write your own inference processor. This can eliminate the overhead of launching the process and loading and recompiling the model.
<hfoptions id="context-parallelism">
<hfoption id="FLUX-1.dev">
The code sample below combines First Block Cache, fp8 dynamic quantization, torch.compile, and Context Parallelism for the fastest inference speed.
```python
import time
import torch
import torch.distributed as dist
from diffusers import FluxPipeline
dist.init_process_group()
torch.cuda.set_device(dist.get_rank())
pipe = FluxPipeline.from_pretrained(
"black-forest-labs/FLUX.1-dev",
torch_dtype=torch.bfloat16,
).to("cuda")
from para_attn.context_parallel import init_context_parallel_mesh
from para_attn.context_parallel.diffusers_adapters import parallelize_pipe
from para_attn.parallel_vae.diffusers_adapters import parallelize_vae
mesh = init_context_parallel_mesh(
pipe.device.type,
max_ring_dim_size=2,
)
parallelize_pipe(
pipe,
mesh=mesh,
)
parallelize_vae(pipe.vae, mesh=mesh._flatten())
from para_attn.first_block_cache.diffusers_adapters import apply_cache_on_pipe
apply_cache_on_pipe(
pipe,
residual_diff_threshold=0.12, # Use a larger value to make the cache take effect
)
from torchao.quantization import quantize_, float8_dynamic_activation_float8_weight, float8_weight_only
quantize_(pipe.text_encoder, float8_weight_only())
quantize_(pipe.transformer, float8_dynamic_activation_float8_weight())
torch._inductor.config.reorder_for_compute_comm_overlap = True
pipe.transformer = torch.compile(
pipe.transformer, mode="max-autotune-no-cudagraphs",
)
# Enable memory savings
# pipe.enable_model_cpu_offload(gpu_id=dist.get_rank())
# pipe.enable_sequential_cpu_offload(gpu_id=dist.get_rank())
for i in range(2):
begin = time.time()
image = pipe(
"A cat holding a sign that says hello world",
num_inference_steps=28,
output_type="pil" if dist.get_rank() == 0 else "pt",
).images[0]
end = time.time()
if dist.get_rank() == 0:
if i == 0:
print(f"Warm up time: {end - begin:.2f}s")
else:
print(f"Time: {end - begin:.2f}s")
if dist.get_rank() == 0:
print("Saving image to flux.png")
image.save("flux.png")
dist.destroy_process_group()
```
Save to `run_flux.py` and launch it with [torchrun](https://pytorch.org/docs/stable/elastic/run.html).
```bash
# Use --nproc_per_node to specify the number of GPUs
torchrun --nproc_per_node=2 run_flux.py
```
Inference speed is reduced to 8.20 seconds compared to the baseline, or 3.21x faster, with 2 NVIDIA L20 GPUs. On 4 L20s, inference speed is 3.90 seconds, or 6.75x faster.
</hfoption>
<hfoption id="HunyuanVideo">
The code sample below combines First Block Cache and Context Parallelism for the fastest inference speed.
```python
import time
import torch
import torch.distributed as dist
from diffusers import HunyuanVideoPipeline, HunyuanVideoTransformer3DModel
from diffusers.utils import export_to_video
dist.init_process_group()
torch.cuda.set_device(dist.get_rank())
model_id = "tencent/HunyuanVideo"
transformer = HunyuanVideoTransformer3DModel.from_pretrained(
model_id,
subfolder="transformer",
torch_dtype=torch.bfloat16,
revision="refs/pr/18",
)
pipe = HunyuanVideoPipeline.from_pretrained(
model_id,
transformer=transformer,
torch_dtype=torch.float16,
revision="refs/pr/18",
).to("cuda")
from para_attn.context_parallel import init_context_parallel_mesh
from para_attn.context_parallel.diffusers_adapters import parallelize_pipe
from para_attn.parallel_vae.diffusers_adapters import parallelize_vae
mesh = init_context_parallel_mesh(
pipe.device.type,
)
parallelize_pipe(
pipe,
mesh=mesh,
)
parallelize_vae(pipe.vae, mesh=mesh._flatten())
from para_attn.first_block_cache.diffusers_adapters import apply_cache_on_pipe
apply_cache_on_pipe(pipe)
# from torchao.quantization import quantize_, float8_dynamic_activation_float8_weight, float8_weight_only
#
# torch._inductor.config.reorder_for_compute_comm_overlap = True
#
# quantize_(pipe.text_encoder, float8_weight_only())
# quantize_(pipe.transformer, float8_dynamic_activation_float8_weight())
# pipe.transformer = torch.compile(
# pipe.transformer, mode="max-autotune-no-cudagraphs",
# )
# Enable memory savings
pipe.vae.enable_tiling()
# pipe.enable_model_cpu_offload(gpu_id=dist.get_rank())
# pipe.enable_sequential_cpu_offload(gpu_id=dist.get_rank())
for i in range(2):
begin = time.time()
output = pipe(
prompt="A cat walks on the grass, realistic",
height=720,
width=1280,
num_frames=129,
num_inference_steps=1 if i == 0 else 30,
output_type="pil" if dist.get_rank() == 0 else "pt",
).frames[0]
end = time.time()
if dist.get_rank() == 0:
if i == 0:
print(f"Warm up time: {end - begin:.2f}s")
else:
print(f"Time: {end - begin:.2f}s")
if dist.get_rank() == 0:
print("Saving video to hunyuan_video.mp4")
export_to_video(output, "hunyuan_video.mp4", fps=15)
dist.destroy_process_group()
```
Save to `run_hunyuan_video.py` and launch it with [torchrun](https://pytorch.org/docs/stable/elastic/run.html).
```bash
# Use --nproc_per_node to specify the number of GPUs
torchrun --nproc_per_node=8 run_hunyuan_video.py
```
Inference speed is reduced to 649.23 seconds compared to the baseline, or 5.66x faster, with 8 NVIDIA L20 GPUs.
</hfoption>
</hfoptions>
## Benchmarks
<hfoptions id="conclusion">
<hfoption id="FLUX-1.dev">
| GPU Type | Number of GPUs | Optimizations | Wall Time (s) | Speedup |
| - | - | - | - | - |
| NVIDIA L20 | 1 | Baseline | 26.36 | 1.00x |
| NVIDIA L20 | 1 | FBCache (rdt=0.08) | 17.01 | 1.55x |
| NVIDIA L20 | 1 | FP8 DQ | 13.40 | 1.96x |
| NVIDIA L20 | 1 | FBCache (rdt=0.12) + FP8 DQ | 7.56 | 3.48x |
| NVIDIA L20 | 2 | FBCache (rdt=0.12) + FP8 DQ + CP | 4.92 | 5.35x |
| NVIDIA L20 | 4 | FBCache (rdt=0.12) + FP8 DQ + CP | 3.90 | 6.75x |
</hfoption>
<hfoption id="HunyuanVideo">
| GPU Type | Number of GPUs | Optimizations | Wall Time (s) | Speedup |
| - | - | - | - | - |
| NVIDIA L20 | 1 | Baseline | 3675.71 | 1.00x |
| NVIDIA L20 | 1 | FBCache | 2271.06 | 1.62x |
| NVIDIA L20 | 2 | FBCache + CP | 1132.90 | 3.24x |
| NVIDIA L20 | 4 | FBCache + CP | 718.15 | 5.12x |
| NVIDIA L20 | 8 | FBCache + CP | 649.23 | 5.66x |
</hfoption>
</hfoptions>
@@ -0,0 +1,96 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# ConsisID
[ConsisID](https://github.com/PKU-YuanGroup/ConsisID) is an identity-preserving text-to-video generation model that keeps the face consistent in the generated video by frequency decomposition. The main features of ConsisID are:
- Frequency decomposition: The characteristics of the DiT architecture are analyzed from the frequency domain perspective, and based on these characteristics, a reasonable control information injection method is designed.
- Consistency training strategy: A coarse-to-fine training strategy, dynamic masking loss, and dynamic cross-face loss further enhance the model's generalization ability and identity preservation performance.
- Inference without finetuning: Previous methods required case-by-case finetuning of the input ID before inference, leading to significant time and computational costs. In contrast, ConsisID is tuning-free.
This guide will walk you through using ConsisID for use cases.
## Load Model Checkpoints
Model weights may be stored in separate subfolders on the Hub or locally, in which case, you should use the [`~DiffusionPipeline.from_pretrained`] method.
```python
# !pip install consisid_eva_clip insightface facexlib
import torch
from diffusers import ConsisIDPipeline
from diffusers.pipelines.consisid.consisid_utils import prepare_face_models, process_face_embeddings_infer
from huggingface_hub import snapshot_download
# Download ckpts
snapshot_download(repo_id="BestWishYsh/ConsisID-preview", local_dir="BestWishYsh/ConsisID-preview")
# Load face helper model to preprocess input face image
face_helper_1, face_helper_2, face_clip_model, face_main_model, eva_transform_mean, eva_transform_std = prepare_face_models("BestWishYsh/ConsisID-preview", device="cuda", dtype=torch.bfloat16)
# Load consisid base model
pipe = ConsisIDPipeline.from_pretrained("BestWishYsh/ConsisID-preview", torch_dtype=torch.bfloat16)
pipe.to("cuda")
```
## Identity-Preserving Text-to-Video
For identity-preserving text-to-video, pass a text prompt and an image contain clear face (e.g., preferably half-body or full-body). By default, ConsisID generates a 720x480 video for the best results.
```python
from diffusers.utils import export_to_video
prompt = "The video captures a boy walking along a city street, filmed in black and white on a classic 35mm camera. His expression is thoughtful, his brow slightly furrowed as if he's lost in contemplation. The film grain adds a textured, timeless quality to the image, evoking a sense of nostalgia. Around him, the cityscape is filled with vintage buildings, cobblestone sidewalks, and softly blurred figures passing by, their outlines faint and indistinct. Streetlights cast a gentle glow, while shadows play across the boy's path, adding depth to the scene. The lighting highlights the boy's subtle smile, hinting at a fleeting moment of curiosity. The overall cinematic atmosphere, complete with classic film still aesthetics and dramatic contrasts, gives the scene an evocative and introspective feel."
image = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/consisid/consisid_input.png?download=true"
id_cond, id_vit_hidden, image, face_kps = process_face_embeddings_infer(face_helper_1, face_clip_model, face_helper_2, eva_transform_mean, eva_transform_std, face_main_model, "cuda", torch.bfloat16, image, is_align_face=True)
video = pipe(image=image, prompt=prompt, num_inference_steps=50, guidance_scale=6.0, use_dynamic_cfg=False, id_vit_hidden=id_vit_hidden, id_cond=id_cond, kps_cond=face_kps, generator=torch.Generator("cuda").manual_seed(42))
export_to_video(video.frames[0], "output.mp4", fps=8)
```
<table>
<tr>
<th style="text-align: center;">Face Image</th>
<th style="text-align: center;">Video</th>
<th style="text-align: center;">Description</th
</tr>
<tr>
<td><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/consisid/consisid_image_0.png?download=true" style="height: auto; width: 600px;"></td>
<td><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/consisid/consisid_output_0.gif?download=true" style="height: auto; width: 2000px;"></td>
<td>The video, in a beautifully crafted animated style, features a confident woman riding a horse through a lush forest clearing. Her expression is focused yet serene as she adjusts her wide-brimmed hat with a practiced hand. She wears a flowy bohemian dress, which moves gracefully with the rhythm of the horse, the fabric flowing fluidly in the animated motion. The dappled sunlight filters through the trees, casting soft, painterly patterns on the forest floor. Her posture is poised, showing both control and elegance as she guides the horse with ease. The animation's gentle, fluid style adds a dreamlike quality to the scene, with the womans calm demeanor and the peaceful surroundings evoking a sense of freedom and harmony.</td>
</tr>
<tr>
<td><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/consisid/consisid_image_1.png?download=true" style="height: auto; width: 600px;"></td>
<td><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/consisid/consisid_output_1.gif?download=true" style="height: auto; width: 2000px;"></td>
<td>The video, in a captivating animated style, shows a woman standing in the center of a snowy forest, her eyes narrowed in concentration as she extends her hand forward. She is dressed in a deep blue cloak, her breath visible in the cold air, which is rendered with soft, ethereal strokes. A faint smile plays on her lips as she summons a wisp of ice magic, watching with focus as the surrounding trees and ground begin to shimmer and freeze, covered in delicate ice crystals. The animations fluid motion brings the magic to life, with the frost spreading outward in intricate, sparkling patterns. The environment is painted with soft, watercolor-like hues, enhancing the magical, dreamlike atmosphere. The overall mood is serene yet powerful, with the quiet winter air amplifying the delicate beauty of the frozen scene.</td>
</tr>
<tr>
<td><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/consisid/consisid_image_2.png?download=true" style="height: auto; width: 600px;"></td>
<td><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/consisid/consisid_output_2.gif?download=true" style="height: auto; width: 2000px;"></td>
<td>The animation features a whimsical portrait of a balloon seller standing in a gentle breeze, captured with soft, hazy brushstrokes that evoke the feel of a serene spring day. His face is framed by a gentle smile, his eyes squinting slightly against the sun, while a few wisps of hair flutter in the wind. He is dressed in a light, pastel-colored shirt, and the balloons around him sway with the wind, adding a sense of playfulness to the scene. The background blurs softly, with hints of a vibrant market or park, enhancing the light-hearted, yet tender mood of the moment.</td>
</tr>
<tr>
<td><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/consisid/consisid_image_3.png?download=true" style="height: auto; width: 600px;"></td>
<td><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/consisid/consisid_output_3.gif?download=true" style="height: auto; width: 2000px;"></td>
<td>The video captures a boy walking along a city street, filmed in black and white on a classic 35mm camera. His expression is thoughtful, his brow slightly furrowed as if he's lost in contemplation. The film grain adds a textured, timeless quality to the image, evoking a sense of nostalgia. Around him, the cityscape is filled with vintage buildings, cobblestone sidewalks, and softly blurred figures passing by, their outlines faint and indistinct. Streetlights cast a gentle glow, while shadows play across the boy's path, adding depth to the scene. The lighting highlights the boy's subtle smile, hinting at a fleeting moment of curiosity. The overall cinematic atmosphere, complete with classic film still aesthetics and dramatic contrasts, gives the scene an evocative and introspective feel.</td>
</tr>
<tr>
<td><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/consisid/consisid_image_4.png?download=true" style="height: auto; width: 600px;"></td>
<td><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/consisid/consisid_output_4.gif?download=true" style="height: auto; width: 2000px;"></td>
<td>The video features a baby wearing a bright superhero cape, standing confidently with arms raised in a powerful pose. The baby has a determined look on their face, with eyes wide and lips pursed in concentration, as if ready to take on a challenge. The setting appears playful, with colorful toys scattered around and a soft rug underfoot, while sunlight streams through a nearby window, highlighting the fluttering cape and adding to the impression of heroism. The overall atmosphere is lighthearted and fun, with the baby's expressions capturing a mix of innocence and an adorable attempt at bravery, as if truly ready to save the day.</td>
</tr>
</table>
## Resources
Learn more about ConsisID with the following resources.
- A [video](https://www.youtube.com/watch?v=PhlgC-bI5SQ) demonstrating ConsisID's main features.
- The research paper, [Identity-Preserving Text-to-Video Generation by Frequency Decomposition](https://hf.co/papers/2411.17440) for more details.
@@ -240,6 +240,46 @@ Benefits of using a single-file layout include:
1. Easy compatibility with diffusion interfaces such as [ComfyUI](https://github.com/comfyanonymous/ComfyUI) or [Automatic1111](https://github.com/AUTOMATIC1111/stable-diffusion-webui) which commonly use a single-file layout.
2. Easier to manage (download and share) a single file.
### DDUF
> [!WARNING]
> DDUF is an experimental file format and APIs related to it can change in the future.
DDUF (**D**DUF **D**iffusion **U**nified **F**ormat) is a file format designed to make storing, distributing, and using diffusion models much easier. Built on the ZIP file format, DDUF offers a standardized, efficient, and flexible way to package all parts of a diffusion model into a single, easy-to-manage file. It provides a balance between Diffusers multi-folder format and the widely popular single-file format.
Learn more details about DDUF on the Hugging Face Hub [documentation](https://huggingface.co/docs/hub/dduf).
Pass a checkpoint to the `dduf_file` parameter to load it in [`DiffusionPipeline`].
```py
from diffusers import DiffusionPipeline
import torch
pipe = DiffusionPipeline.from_pretrained(
"DDUF/FLUX.1-dev-DDUF", dduf_file="FLUX.1-dev.dduf", torch_dtype=torch.bfloat16
).to("cuda")
image = pipe(
"photo a cat holding a sign that says Diffusers", num_inference_steps=50, guidance_scale=3.5
).images[0]
image.save("cat.png")
```
To save a pipeline as a `.dduf` checkpoint, use the [`~huggingface_hub.export_folder_as_dduf`] utility, which takes care of all the necessary file-level validations.
```py
from huggingface_hub import export_folder_as_dduf
from diffusers import DiffusionPipeline
import torch
pipe = DiffusionPipeline.from_pretrained("black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16)
save_folder = "flux-dev"
pipe.save_pretrained("flux-dev")
export_folder_as_dduf("flux-dev.dduf", folder_path=save_folder)
> [!TIP]
> Packaging and loading quantized checkpoints in the DDUF format is supported as long as they respect the multi-folder structure.
## Convert layout and files
Diffusers provides many scripts and methods to convert storage layouts and file formats to enable broader support across the diffusion ecosystem.
+2
View File
@@ -5,6 +5,8 @@
title: 快速入门
- local: stable_diffusion
title: 有效和高效的扩散
- local: consisid
title: 身份保持的文本到视频生成
- local: installation
title: 安装
title: 开始
+100
View File
@@ -0,0 +1,100 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# ConsisID
[ConsisID](https://github.com/PKU-YuanGroup/ConsisID)是一种身份保持的文本到视频生成模型,其通过频率分解在生成的视频中保持面部一致性。它具有以下特点:
- 基于频率分解:将人物ID特征解耦为高频和低频部分,从频域的角度分析DIT架构的特性,并且基于此特性设计合理的控制信息注入方式。
- 一致性训练策略:我们提出粗到细训练策略、动态掩码损失、动态跨脸损失,进一步提高了模型的泛化能力和身份保持效果。
- 推理无需微调:之前的方法在推理前,需要对输入id进行case-by-case微调,时间和算力开销较大,而我们的方法是tuning-free的。
本指南将指导您使用 ConsisID 生成身份保持的视频。
## Load Model Checkpoints
模型权重可以存储在Hub上或本地的单独子文件夹中,在这种情况下,您应该使用 [`~DiffusionPipeline.from_pretrained`] 方法。
```python
# !pip install consisid_eva_clip insightface facexlib
import torch
from diffusers import ConsisIDPipeline
from diffusers.pipelines.consisid.consisid_utils import prepare_face_models, process_face_embeddings_infer
from huggingface_hub import snapshot_download
# Download ckpts
snapshot_download(repo_id="BestWishYsh/ConsisID-preview", local_dir="BestWishYsh/ConsisID-preview")
# Load face helper model to preprocess input face image
face_helper_1, face_helper_2, face_clip_model, face_main_model, eva_transform_mean, eva_transform_std = prepare_face_models("BestWishYsh/ConsisID-preview", device="cuda", dtype=torch.bfloat16)
# Load consisid base model
pipe = ConsisIDPipeline.from_pretrained("BestWishYsh/ConsisID-preview", torch_dtype=torch.bfloat16)
pipe.to("cuda")
```
## Identity-Preserving Text-to-Video
对于身份保持的文本到视频生成,需要输入文本提示和包含清晰面部(例如,最好是半身或全身)的图像。默认情况下,ConsisID 会生成 720x480 的视频以获得最佳效果。
```python
from diffusers.utils import export_to_video
prompt = "The video captures a boy walking along a city street, filmed in black and white on a classic 35mm camera. His expression is thoughtful, his brow slightly furrowed as if he's lost in contemplation. The film grain adds a textured, timeless quality to the image, evoking a sense of nostalgia. Around him, the cityscape is filled with vintage buildings, cobblestone sidewalks, and softly blurred figures passing by, their outlines faint and indistinct. Streetlights cast a gentle glow, while shadows play across the boy's path, adding depth to the scene. The lighting highlights the boy's subtle smile, hinting at a fleeting moment of curiosity. The overall cinematic atmosphere, complete with classic film still aesthetics and dramatic contrasts, gives the scene an evocative and introspective feel."
image = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/consisid/consisid_input.png?download=true"
id_cond, id_vit_hidden, image, face_kps = process_face_embeddings_infer(face_helper_1, face_clip_model, face_helper_2, eva_transform_mean, eva_transform_std, face_main_model, "cuda", torch.bfloat16, image, is_align_face=True)
video = pipe(image=image, prompt=prompt, num_inference_steps=50, guidance_scale=6.0, use_dynamic_cfg=False, id_vit_hidden=id_vit_hidden, id_cond=id_cond, kps_cond=face_kps, generator=torch.Generator("cuda").manual_seed(42))
export_to_video(video.frames[0], "output.mp4", fps=8)
```
<table>
<tr>
<th style="text-align: center;">Face Image</th>
<th style="text-align: center;">Video</th>
<th style="text-align: center;">Description</th
</tr>
<tr>
<td><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/consisid/consisid_image_0.png?download=true" style="height: auto; width: 600px;"></td>
<td><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/consisid/consisid_output_0.gif?download=true" style="height: auto; width: 2000px;"></td>
<td>The video, in a beautifully crafted animated style, features a confident woman riding a horse through a lush forest clearing. Her expression is focused yet serene as she adjusts her wide-brimmed hat with a practiced hand. She wears a flowy bohemian dress, which moves gracefully with the rhythm of the horse, the fabric flowing fluidly in the animated motion. The dappled sunlight filters through the trees, casting soft, painterly patterns on the forest floor. Her posture is poised, showing both control and elegance as she guides the horse with ease. The animation's gentle, fluid style adds a dreamlike quality to the scene, with the womans calm demeanor and the peaceful surroundings evoking a sense of freedom and harmony.</td>
</tr>
<tr>
<td><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/consisid/consisid_image_1.png?download=true" style="height: auto; width: 600px;"></td>
<td><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/consisid/consisid_output_1.gif?download=true" style="height: auto; width: 2000px;"></td>
<td>The video, in a captivating animated style, shows a woman standing in the center of a snowy forest, her eyes narrowed in concentration as she extends her hand forward. She is dressed in a deep blue cloak, her breath visible in the cold air, which is rendered with soft, ethereal strokes. A faint smile plays on her lips as she summons a wisp of ice magic, watching with focus as the surrounding trees and ground begin to shimmer and freeze, covered in delicate ice crystals. The animations fluid motion brings the magic to life, with the frost spreading outward in intricate, sparkling patterns. The environment is painted with soft, watercolor-like hues, enhancing the magical, dreamlike atmosphere. The overall mood is serene yet powerful, with the quiet winter air amplifying the delicate beauty of the frozen scene.</td>
</tr>
<tr>
<td><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/consisid/consisid_image_2.png?download=true" style="height: auto; width: 600px;"></td>
<td><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/consisid/consisid_output_2.gif?download=true" style="height: auto; width: 2000px;"></td>
<td>The animation features a whimsical portrait of a balloon seller standing in a gentle breeze, captured with soft, hazy brushstrokes that evoke the feel of a serene spring day. His face is framed by a gentle smile, his eyes squinting slightly against the sun, while a few wisps of hair flutter in the wind. He is dressed in a light, pastel-colored shirt, and the balloons around him sway with the wind, adding a sense of playfulness to the scene. The background blurs softly, with hints of a vibrant market or park, enhancing the light-hearted, yet tender mood of the moment.</td>
</tr>
<tr>
<td><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/consisid/consisid_image_3.png?download=true" style="height: auto; width: 600px;"></td>
<td><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/consisid/consisid_output_3.gif?download=true" style="height: auto; width: 2000px;"></td>
<td>The video captures a boy walking along a city street, filmed in black and white on a classic 35mm camera. His expression is thoughtful, his brow slightly furrowed as if he's lost in contemplation. The film grain adds a textured, timeless quality to the image, evoking a sense of nostalgia. Around him, the cityscape is filled with vintage buildings, cobblestone sidewalks, and softly blurred figures passing by, their outlines faint and indistinct. Streetlights cast a gentle glow, while shadows play across the boy's path, adding depth to the scene. The lighting highlights the boy's subtle smile, hinting at a fleeting moment of curiosity. The overall cinematic atmosphere, complete with classic film still aesthetics and dramatic contrasts, gives the scene an evocative and introspective feel.</td>
</tr>
<tr>
<td><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/consisid/consisid_image_4.png?download=true" style="height: auto; width: 600px;"></td>
<td><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/consisid/consisid_output_4.gif?download=true" style="height: auto; width: 2000px;"></td>
<td>The video features a baby wearing a bright superhero cape, standing confidently with arms raised in a powerful pose. The baby has a determined look on their face, with eyes wide and lips pursed in concentration, as if ready to take on a challenge. The setting appears playful, with colorful toys scattered around and a soft rug underfoot, while sunlight streams through a nearby window, highlighting the fluttering cape and adding to the impression of heroism. The overall atmosphere is lighthearted and fun, with the baby's expressions capturing a mix of innocence and an adorable attempt at bravery, as if truly ready to save the day.</td>
</tr>
</table>
## Resources
通过以下资源了解有关 ConsisID 的更多信息:
- 一段 [视频](https://www.youtube.com/watch?v=PhlgC-bI5SQ) 演示了 ConsisID 的主要功能;
- 有关更多详细信息,请参阅研究论文 [Identity-Preserving Text-to-Video Generation by Frequency Decomposition](https://hf.co/papers/2411.17440)。
+3 -2
View File
@@ -404,10 +404,11 @@ def my_forward(
# TODO: this requires sync between CPU and GPU. So try to pass timesteps as tensors if you can
# This would be a good case for the `match` statement (Python 3.10+)
is_mps = sample.device.type == "mps"
is_npu = sample.device.type == "npu"
if isinstance(timestep, float):
dtype = torch.float32 if is_mps else torch.float64
dtype = torch.float32 if (is_mps or is_npu) else torch.float64
else:
dtype = torch.int32 if is_mps else torch.int64
dtype = torch.int32 if (is_mps or is_npu) else torch.int64
timesteps = torch.tensor([timesteps], dtype=dtype, device=sample.device)
elif len(timesteps.shape) == 0:
timesteps = timesteps[None].to(sample.device)
+3 -2
View File
@@ -2806,10 +2806,11 @@ class MatryoshkaUNet2DConditionModel(
# TODO: this requires sync between CPU and GPU. So try to pass timesteps as tensors if you can
# This would be a good case for the `match` statement (Python 3.10+)
is_mps = sample.device.type == "mps"
is_npu = sample.device.type == "npu"
if isinstance(timestep, float):
dtype = torch.float32 if is_mps else torch.float64
dtype = torch.float32 if (is_mps or is_npu) else torch.float64
else:
dtype = torch.int32 if is_mps else torch.int64
dtype = torch.int32 if (is_mps or is_npu) else torch.int64
timesteps = torch.tensor([timesteps], dtype=dtype, device=sample.device)
elif len(timesteps.shape) == 0:
timesteps = timesteps[None].to(sample.device)
@@ -158,6 +158,9 @@ def log_validation(
f"Running validation... \n Generating {args.num_validation_images} images with prompt:"
f" {args.validation_prompt}."
)
if args.enable_vae_tiling:
pipeline.vae.enable_tiling(tile_sample_min_height=1024, tile_sample_stride_width=1024)
pipeline.text_encoder = pipeline.text_encoder.to(torch.bfloat16)
pipeline = pipeline.to(accelerator.device)
pipeline.set_progress_bar_config(disable=True)
@@ -597,6 +600,7 @@ def parse_args(input_args=None):
help="Whether to offload the VAE and the text encoder to CPU when they are not used.",
)
parser.add_argument("--local_rank", type=int, default=-1, help="For distributed training: local_rank")
parser.add_argument("--enable_vae_tiling", action="store_true", help="Enabla vae tiling in log validation")
if input_args is not None:
args = parser.parse_args(input_args)
@@ -812,6 +812,8 @@ def main(args):
for name, module in flux_transformer.named_modules():
if "transformer_blocks" in name:
module.requires_grad_(True)
else:
module.requirs_grad_(False)
def unwrap_model(model):
model = accelerator.unwrap_model(model)
@@ -1031,10 +1031,11 @@ class PixArtAlphaControlnetPipeline(DiffusionPipeline):
# TODO: this requires sync between CPU and GPU. So try to pass timesteps as tensors if you can
# This would be a good case for the `match` statement (Python 3.10+)
is_mps = latent_model_input.device.type == "mps"
is_npu = latent_model_input.device.type == "npu"
if isinstance(current_timestep, float):
dtype = torch.float32 if is_mps else torch.float64
dtype = torch.float32 if (is_mps or is_npu) else torch.float64
else:
dtype = torch.int32 if is_mps else torch.int64
dtype = torch.int32 if (is_mps or is_npu) else torch.int64
current_timestep = torch.tensor([current_timestep], dtype=dtype, device=latent_model_input.device)
elif len(current_timestep.shape) == 0:
current_timestep = current_timestep[None].to(latent_model_input.device)
@@ -258,10 +258,11 @@ class PromptDiffusionControlNetModel(ControlNetModel):
# TODO: this requires sync between CPU and GPU. So try to pass timesteps as tensors if you can
# This would be a good case for the `match` statement (Python 3.10+)
is_mps = sample.device.type == "mps"
is_npu = sample.device.type == "npu"
if isinstance(timestep, float):
dtype = torch.float32 if is_mps else torch.float64
dtype = torch.float32 if (is_mps or is_npu) else torch.float64
else:
dtype = torch.int32 if is_mps else torch.int64
dtype = torch.int32 if (is_mps or is_npu) else torch.int64
timesteps = torch.tensor([timesteps], dtype=dtype, device=sample.device)
elif len(timesteps.shape) == 0:
timesteps = timesteps[None].to(sample.device)
@@ -0,0 +1,100 @@
# Generating images using Flux and PyTorch/XLA
The `flux_inference` script shows how to do image generation using Flux on TPU devices using PyTorch/XLA. It uses the pallas kernel for flash attention for faster generation.
It has been tested on [Trillium](https://cloud.google.com/blog/products/compute/introducing-trillium-6th-gen-tpus) TPU versions. No other TPU types have been tested.
## Create TPU
To create a TPU on Google Cloud, follow [this guide](https://cloud.google.com/tpu/docs/v6e)
## Setup TPU environment
SSH into the VM and install Pytorch, Pytorch/XLA
```bash
pip install torch~=2.5.0 torch_xla[tpu]~=2.5.0 -f https://storage.googleapis.com/libtpu-releases/index.html -f https://storage.googleapis.com/libtpu-wheels/index.html
pip install torch_xla[pallas] -f https://storage.googleapis.com/jax-releases/jax_nightly_releases.html -f https://storage.googleapis.com/jax-releases/jaxlib_nightly_releases.html
```
Verify that PyTorch and PyTorch/XLA were installed correctly:
```bash
python3 -c "import torch; import torch_xla;"
```
Install dependencies
```bash
pip install transformers accelerate sentencepiece structlog
pushd ../../..
pip install .
popd
```
## Run the inference job
### Authenticate
Run the following command to authenticate your token in order to download Flux weights.
```bash
huggingface-cli login
```
Then run:
```bash
python flux_inference.py
```
The script loads the text encoders onto the CPU and the Flux transformer and VAE models onto the TPU. The first time the script runs, the compilation time is longer, while the cache stores the compiled programs. On subsequent runs, compilation is much faster and the subsequent passes being the fastest.
On a Trillium v6e-4, you should expect ~9 sec / 4 images or 2.25 sec / image (as devices run generation in parallel):
```bash
WARNING:root:libtpu.so and TPU device found. Setting PJRT_DEVICE=TPU.
Loading checkpoint shards: 100%|███████████████████████████████| 2/2 [00:00<00:00, 7.01it/s]
Loading pipeline components...: 40%|██████████▍ | 2/5 [00:00<00:00, 3.78it/s]You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Loading pipeline components...: 100%|██████████████████████████| 5/5 [00:00<00:00, 6.72it/s]
2025-01-10 00:51:25 [info ] loading flux from black-forest-labs/FLUX.1-dev
2025-01-10 00:51:25 [info ] loading flux from black-forest-labs/FLUX.1-dev
2025-01-10 00:51:26 [info ] loading flux from black-forest-labs/FLUX.1-dev
2025-01-10 00:51:26 [info ] loading flux from black-forest-labs/FLUX.1-dev
Loading pipeline components...: 100%|██████████████████████████| 3/3 [00:00<00:00, 4.29it/s]
Loading pipeline components...: 100%|██████████████████████████| 3/3 [00:00<00:00, 3.26it/s]
Loading pipeline components...: 100%|██████████████████████████| 3/3 [00:00<00:00, 3.27it/s]
Loading pipeline components...: 100%|██████████████████████████| 3/3 [00:00<00:00, 3.25it/s]
2025-01-10 00:51:34 [info ] starting compilation run...
2025-01-10 00:51:35 [info ] starting compilation run...
2025-01-10 00:51:37 [info ] starting compilation run...
2025-01-10 00:51:37 [info ] starting compilation run...
2025-01-10 00:52:52 [info ] compilation took 78.5155531649998 sec.
2025-01-10 00:52:53 [info ] starting inference run...
2025-01-10 00:52:57 [info ] compilation took 79.52986721400157 sec.
2025-01-10 00:52:57 [info ] compilation took 81.91776501700042 sec.
2025-01-10 00:52:57 [info ] compilation took 80.24951512600092 sec.
2025-01-10 00:52:57 [info ] starting inference run...
2025-01-10 00:52:57 [info ] starting inference run...
2025-01-10 00:52:58 [info ] starting inference run...
2025-01-10 00:53:22 [info ] inference time: 25.112665320000815
2025-01-10 00:53:30 [info ] inference time: 7.7019307739992655
2025-01-10 00:53:38 [info ] inference time: 7.693858365000779
2025-01-10 00:53:46 [info ] inference time: 7.690621814001133
2025-01-10 00:53:53 [info ] inference time: 7.679490454000188
2025-01-10 00:54:01 [info ] inference time: 7.68949568500102
2025-01-10 00:54:09 [info ] inference time: 7.686633744000574
2025-01-10 00:54:16 [info ] inference time: 7.696786873999372
2025-01-10 00:54:24 [info ] inference time: 7.691988694999964
2025-01-10 00:54:32 [info ] inference time: 7.700649563999832
2025-01-10 00:54:39 [info ] inference time: 7.684993574001055
2025-01-10 00:54:47 [info ] inference time: 7.68343457499941
2025-01-10 00:54:55 [info ] inference time: 7.667921153999487
2025-01-10 00:55:02 [info ] inference time: 7.683585194001353
2025-01-10 00:55:06 [info ] avg. inference over 15 iterations took 8.61202360273334 sec.
2025-01-10 00:55:07 [info ] avg. inference over 15 iterations took 8.952725123600006 sec.
2025-01-10 00:55:10 [info ] inference time: 7.673799695001435
2025-01-10 00:55:10 [info ] avg. inference over 15 iterations took 8.849190365400379 sec.
2025-01-10 00:55:10 [info ] saved metric information as /tmp/metrics_report.txt
2025-01-10 00:55:12 [info ] avg. inference over 15 iterations took 8.940161458400205 sec.
```
@@ -0,0 +1,120 @@
from argparse import ArgumentParser
from pathlib import Path
from time import perf_counter
import structlog
import torch
import torch_xla.core.xla_model as xm
import torch_xla.debug.metrics as met
import torch_xla.debug.profiler as xp
import torch_xla.distributed.xla_multiprocessing as xmp
import torch_xla.runtime as xr
from diffusers import FluxPipeline
logger = structlog.get_logger()
metrics_filepath = "/tmp/metrics_report.txt"
def _main(index, args, text_pipe, ckpt_id):
cache_path = Path("/tmp/data/compiler_cache_tRiLlium_eXp")
cache_path.mkdir(parents=True, exist_ok=True)
xr.initialize_cache(str(cache_path), readonly=False)
profile_path = Path("/tmp/data/profiler_out_tRiLlium_eXp")
profile_path.mkdir(parents=True, exist_ok=True)
profiler_port = 9012
profile_duration = args.profile_duration
if args.profile:
logger.info(f"starting profiler on port {profiler_port}")
_ = xp.start_server(profiler_port)
device0 = xm.xla_device()
logger.info(f"loading flux from {ckpt_id}")
flux_pipe = FluxPipeline.from_pretrained(
ckpt_id, text_encoder=None, tokenizer=None, text_encoder_2=None, tokenizer_2=None, torch_dtype=torch.bfloat16
).to(device0)
flux_pipe.transformer.enable_xla_flash_attention(partition_spec=("data", None, None, None), is_flux=True)
prompt = "photograph of an electronics chip in the shape of a race car with trillium written on its side"
width = args.width
height = args.height
guidance = args.guidance
n_steps = 4 if args.schnell else 28
logger.info("starting compilation run...")
ts = perf_counter()
with torch.no_grad():
prompt_embeds, pooled_prompt_embeds, text_ids = text_pipe.encode_prompt(
prompt=prompt, prompt_2=None, max_sequence_length=512
)
prompt_embeds = prompt_embeds.to(device0)
pooled_prompt_embeds = pooled_prompt_embeds.to(device0)
image = flux_pipe(
prompt_embeds=prompt_embeds,
pooled_prompt_embeds=pooled_prompt_embeds,
num_inference_steps=28,
guidance_scale=guidance,
height=height,
width=width,
).images[0]
logger.info(f"compilation took {perf_counter() - ts} sec.")
image.save("/tmp/compile_out.png")
base_seed = 4096 if args.seed is None else args.seed
seed_range = 1000
unique_seed = base_seed + index * seed_range
xm.set_rng_state(seed=unique_seed, device=device0)
times = []
logger.info("starting inference run...")
for _ in range(args.itters):
ts = perf_counter()
with torch.no_grad():
prompt_embeds, pooled_prompt_embeds, text_ids = text_pipe.encode_prompt(
prompt=prompt, prompt_2=None, max_sequence_length=512
)
prompt_embeds = prompt_embeds.to(device0)
pooled_prompt_embeds = pooled_prompt_embeds.to(device0)
if args.profile:
xp.trace_detached(f"localhost:{profiler_port}", str(profile_path), duration_ms=profile_duration)
image = flux_pipe(
prompt_embeds=prompt_embeds,
pooled_prompt_embeds=pooled_prompt_embeds,
num_inference_steps=n_steps,
guidance_scale=guidance,
height=height,
width=width,
).images[0]
inference_time = perf_counter() - ts
if index == 0:
logger.info(f"inference time: {inference_time}")
times.append(inference_time)
logger.info(f"avg. inference over {args.itters} iterations took {sum(times)/len(times)} sec.")
image.save(f"/tmp/inference_out-{index}.png")
if index == 0:
metrics_report = met.metrics_report()
with open(metrics_filepath, "w+") as fout:
fout.write(metrics_report)
logger.info(f"saved metric information as {metrics_filepath}")
if __name__ == "__main__":
parser = ArgumentParser()
parser.add_argument("--schnell", action="store_true", help="run flux schnell instead of dev")
parser.add_argument("--width", type=int, default=1024, help="width of the image to generate")
parser.add_argument("--height", type=int, default=1024, help="height of the image to generate")
parser.add_argument("--guidance", type=float, default=3.5, help="gauidance strentgh for dev")
parser.add_argument("--seed", type=int, default=None, help="seed for inference")
parser.add_argument("--profile", action="store_true", help="enable profiling")
parser.add_argument("--profile-duration", type=int, default=10000, help="duration for profiling in msec.")
parser.add_argument("--itters", type=int, default=15, help="tiems to run inference and get avg time in sec.")
args = parser.parse_args()
if args.schnell:
ckpt_id = "black-forest-labs/FLUX.1-schnell"
else:
ckpt_id = "black-forest-labs/FLUX.1-dev"
text_pipe = FluxPipeline.from_pretrained(ckpt_id, transformer=None, vae=None, torch_dtype=torch.bfloat16).to("cpu")
xmp.spawn(_main, args=(args, text_pipe, ckpt_id))
+1 -1
View File
@@ -73,7 +73,7 @@ def _download(url: str, root: str):
loop.update(len(buffer))
if insecure_hashlib.sha256(open(download_target, "rb").read()).hexdigest() != expected_sha256:
raise RuntimeError("Model has been downloaded but the SHA256 checksum does not not match")
raise RuntimeError("Model has been downloaded but the SHA256 checksum does not match")
return download_target
+1 -1
View File
@@ -101,7 +101,7 @@ _deps = [
"filelock",
"flax>=0.4.1",
"hf-doc-builder>=0.3.0",
"huggingface-hub>=0.23.2",
"huggingface-hub>=0.27.0",
"requests-mock==1.10.0",
"importlib_metadata",
"invisible-watermark>=0.2.0",
+4
View File
@@ -92,6 +92,7 @@ else:
"AutoencoderTiny",
"CogVideoXTransformer3DModel",
"CogView3PlusTransformer2DModel",
"ConsisIDTransformer3DModel",
"ConsistencyDecoderVAE",
"ControlNetModel",
"ControlNetUnionModel",
@@ -275,6 +276,7 @@ else:
"CogVideoXPipeline",
"CogVideoXVideoToVideoPipeline",
"CogView3PlusPipeline",
"ConsisIDPipeline",
"CycleDiffusionPipeline",
"FluxControlImg2ImgPipeline",
"FluxControlInpaintPipeline",
@@ -602,6 +604,7 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
AutoencoderTiny,
CogVideoXTransformer3DModel,
CogView3PlusTransformer2DModel,
ConsisIDTransformer3DModel,
ConsistencyDecoderVAE,
ControlNetModel,
ControlNetUnionModel,
@@ -764,6 +767,7 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
CogVideoXPipeline,
CogVideoXVideoToVideoPipeline,
CogView3PlusPipeline,
ConsisIDPipeline,
CycleDiffusionPipeline,
FluxControlImg2ImgPipeline,
FluxControlInpaintPipeline,
+35 -10
View File
@@ -24,10 +24,10 @@ import os
import re
from collections import OrderedDict
from pathlib import Path
from typing import Any, Dict, Tuple, Union
from typing import Any, Dict, Optional, Tuple, Union
import numpy as np
from huggingface_hub import create_repo, hf_hub_download
from huggingface_hub import DDUFEntry, create_repo, hf_hub_download
from huggingface_hub.utils import (
EntryNotFoundError,
RepositoryNotFoundError,
@@ -347,6 +347,7 @@ class ConfigMixin:
_ = kwargs.pop("mirror", None)
subfolder = kwargs.pop("subfolder", None)
user_agent = kwargs.pop("user_agent", {})
dduf_entries: Optional[Dict[str, DDUFEntry]] = kwargs.pop("dduf_entries", None)
user_agent = {**user_agent, "file_type": "config"}
user_agent = http_user_agent(user_agent)
@@ -358,8 +359,15 @@ class ConfigMixin:
"`self.config_name` is not defined. Note that one should not load a config from "
"`ConfigMixin`. Please make sure to define `config_name` in a class inheriting from `ConfigMixin`"
)
if os.path.isfile(pretrained_model_name_or_path):
# Custom path for now
if dduf_entries:
if subfolder is not None:
raise ValueError(
"DDUF file only allow for 1 level of directory (e.g transformer/model1/model.safetentors is not allowed). "
"Please check the DDUF structure"
)
config_file = cls._get_config_file_from_dduf(pretrained_model_name_or_path, dduf_entries)
elif os.path.isfile(pretrained_model_name_or_path):
config_file = pretrained_model_name_or_path
elif os.path.isdir(pretrained_model_name_or_path):
if subfolder is not None and os.path.isfile(
@@ -426,10 +434,8 @@ class ConfigMixin:
f"Otherwise, make sure '{pretrained_model_name_or_path}' is the correct path to a directory "
f"containing a {cls.config_name} file"
)
try:
# Load config dict
config_dict = cls._dict_from_json_file(config_file)
config_dict = cls._dict_from_json_file(config_file, dduf_entries=dduf_entries)
commit_hash = extract_commit_hash(config_file)
except (json.JSONDecodeError, UnicodeDecodeError):
@@ -552,9 +558,14 @@ class ConfigMixin:
return init_dict, unused_kwargs, hidden_config_dict
@classmethod
def _dict_from_json_file(cls, json_file: Union[str, os.PathLike]):
with open(json_file, "r", encoding="utf-8") as reader:
text = reader.read()
def _dict_from_json_file(
cls, json_file: Union[str, os.PathLike], dduf_entries: Optional[Dict[str, DDUFEntry]] = None
):
if dduf_entries:
text = dduf_entries[json_file].read_text()
else:
with open(json_file, "r", encoding="utf-8") as reader:
text = reader.read()
return json.loads(text)
def __repr__(self):
@@ -616,6 +627,20 @@ class ConfigMixin:
with open(json_file_path, "w", encoding="utf-8") as writer:
writer.write(self.to_json_string())
@classmethod
def _get_config_file_from_dduf(cls, pretrained_model_name_or_path: str, dduf_entries: Dict[str, DDUFEntry]):
# paths inside a DDUF file must always be "/"
config_file = (
cls.config_name
if pretrained_model_name_or_path == ""
else "/".join([pretrained_model_name_or_path, cls.config_name])
)
if config_file not in dduf_entries:
raise ValueError(
f"We did not manage to find the file {config_file} in the dduf file. We only have the following files {dduf_entries.keys()}"
)
return config_file
def register_to_config(init):
r"""
+1 -1
View File
@@ -9,7 +9,7 @@ deps = {
"filelock": "filelock",
"flax": "flax>=0.4.1",
"hf-doc-builder": "hf-doc-builder>=0.3.0",
"huggingface-hub": "huggingface-hub>=0.23.2",
"huggingface-hub": "huggingface-hub>=0.27.0",
"requests-mock": "requests-mock==1.10.0",
"importlib_metadata": "importlib_metadata",
"invisible-watermark": "invisible-watermark>=0.2.0",
+36 -3
View File
@@ -21,6 +21,7 @@ from huggingface_hub.utils import validate_hf_hub_args
from ..utils import (
USE_PEFT_BACKEND,
deprecate,
get_submodule_by_name,
is_peft_available,
is_peft_version,
is_torch_version,
@@ -1981,10 +1982,17 @@ class FluxLoraLoaderMixin(LoraBaseMixin):
in_features = state_dict[lora_A_weight_name].shape[1]
out_features = state_dict[lora_B_weight_name].shape[0]
# Model maybe loaded with different quantization schemes which may flatten the params.
# `bitsandbytes`, for example, flatten the weights when using 4bit. 8bit bnb models
# preserve weight shape.
module_weight_shape = cls._calculate_module_shape(model=transformer, base_module=module)
# This means there's no need for an expansion in the params, so we simply skip.
if tuple(module_weight.shape) == (out_features, in_features):
if tuple(module_weight_shape) == (out_features, in_features):
continue
# TODO (sayakpaul): We still need to consider if the module we're expanding is
# quantized and handle it accordingly if that is the case.
module_out_features, module_in_features = module_weight.shape
debug_message = ""
if in_features > module_in_features:
@@ -2080,13 +2088,16 @@ class FluxLoraLoaderMixin(LoraBaseMixin):
base_weight_param = transformer_state_dict[base_param_name]
lora_A_param = lora_state_dict[f"{prefix}{k}.lora_A.weight"]
if base_weight_param.shape[1] > lora_A_param.shape[1]:
# TODO (sayakpaul): Handle the cases when we actually need to expand when using quantization.
base_module_shape = cls._calculate_module_shape(model=transformer, base_weight_param_name=base_param_name)
if base_module_shape[1] > lora_A_param.shape[1]:
shape = (lora_A_param.shape[0], base_weight_param.shape[1])
expanded_state_dict_weight = torch.zeros(shape, device=base_weight_param.device)
expanded_state_dict_weight[:, : lora_A_param.shape[1]].copy_(lora_A_param)
lora_state_dict[f"{prefix}{k}.lora_A.weight"] = expanded_state_dict_weight
expanded_module_names.add(k)
elif base_weight_param.shape[1] < lora_A_param.shape[1]:
elif base_module_shape[1] < lora_A_param.shape[1]:
raise NotImplementedError(
f"This LoRA param ({k}.lora_A.weight) has an incompatible shape {lora_A_param.shape}. Please open an issue to file for a feature request - https://github.com/huggingface/diffusers/issues/new."
)
@@ -2098,6 +2109,28 @@ class FluxLoraLoaderMixin(LoraBaseMixin):
return lora_state_dict
@staticmethod
def _calculate_module_shape(
model: "torch.nn.Module",
base_module: "torch.nn.Linear" = None,
base_weight_param_name: str = None,
) -> "torch.Size":
def _get_weight_shape(weight: torch.Tensor):
return weight.quant_state.shape if weight.__class__.__name__ == "Params4bit" else weight.shape
if base_module is not None:
return _get_weight_shape(base_module.weight)
elif base_weight_param_name is not None:
if not base_weight_param_name.endswith(".weight"):
raise ValueError(
f"Invalid `base_weight_param_name` passed as it does not end with '.weight' {base_weight_param_name=}."
)
module_path = base_weight_param_name.rsplit(".weight", 1)[0]
submodule = get_submodule_by_name(model, module_path)
return _get_weight_shape(submodule.weight)
raise ValueError("Either `base_module` or `base_weight_param_name` must be provided.")
# The reason why we subclass from `StableDiffusionLoraLoaderMixin` here is because Amused initially
# relied on `StableDiffusionLoraLoaderMixin` for its LoRA support.
+1
View File
@@ -47,6 +47,7 @@ _SET_ADAPTER_SCALE_FN_MAPPING = {
"SD3Transformer2DModel": lambda model_cls, weights: weights,
"FluxTransformer2DModel": lambda model_cls, weights: weights,
"CogVideoXTransformer3DModel": lambda model_cls, weights: weights,
"ConsisIDTransformer3DModel": lambda model_cls, weights: weights,
"MochiTransformer3DModel": lambda model_cls, weights: weights,
"HunyuanVideoTransformer3DModel": lambda model_cls, weights: weights,
"LTXVideoTransformer3DModel": lambda model_cls, weights: weights,
@@ -362,6 +362,7 @@ class FromOriginalModelMixin:
if is_accelerate_available():
param_device = torch.device(device) if device else torch.device("cpu")
named_buffers = model.named_buffers()
unexpected_keys = load_model_dict_into_meta(
model,
diffusers_format_checkpoint,
@@ -369,6 +370,7 @@ class FromOriginalModelMixin:
device=param_device,
hf_quantizer=hf_quantizer,
keep_in_fp32_modules=keep_in_fp32_modules,
named_buffers=named_buffers,
)
else:
+4 -4
View File
@@ -40,7 +40,7 @@ def load_textual_inversion_state_dicts(pretrained_model_name_or_paths, **kwargs)
force_download = kwargs.pop("force_download", False)
proxies = kwargs.pop("proxies", None)
local_files_only = kwargs.pop("local_files_only", None)
token = kwargs.pop("token", None)
hf_token = kwargs.pop("hf_token", None)
revision = kwargs.pop("revision", None)
subfolder = kwargs.pop("subfolder", None)
weight_name = kwargs.pop("weight_name", None)
@@ -73,7 +73,7 @@ def load_textual_inversion_state_dicts(pretrained_model_name_or_paths, **kwargs)
force_download=force_download,
proxies=proxies,
local_files_only=local_files_only,
token=token,
token=hf_token,
revision=revision,
subfolder=subfolder,
user_agent=user_agent,
@@ -93,7 +93,7 @@ def load_textual_inversion_state_dicts(pretrained_model_name_or_paths, **kwargs)
force_download=force_download,
proxies=proxies,
local_files_only=local_files_only,
token=token,
token=hf_token,
revision=revision,
subfolder=subfolder,
user_agent=user_agent,
@@ -312,7 +312,7 @@ class TextualInversionLoaderMixin:
local_files_only (`bool`, *optional*, defaults to `False`):
Whether to only load local model weights and configuration files or not. If set to `True`, the model
won't be downloaded from the Hub.
token (`str` or *bool*, *optional*):
hf_token (`str` or *bool*, *optional*):
The token to use as HTTP bearer authorization for remote files. If `True`, the token generated from
`diffusers-cli login` (stored in `~/.huggingface`) is used.
revision (`str`, *optional*, defaults to `"main"`):
+2
View File
@@ -54,6 +54,7 @@ if is_torch_available():
_import_structure["modeling_utils"] = ["ModelMixin"]
_import_structure["transformers.auraflow_transformer_2d"] = ["AuraFlowTransformer2DModel"]
_import_structure["transformers.cogvideox_transformer_3d"] = ["CogVideoXTransformer3DModel"]
_import_structure["transformers.consisid_transformer_3d"] = ["ConsisIDTransformer3DModel"]
_import_structure["transformers.dit_transformer_2d"] = ["DiTTransformer2DModel"]
_import_structure["transformers.dual_transformer_2d"] = ["DualTransformer2DModel"]
_import_structure["transformers.hunyuan_transformer_2d"] = ["HunyuanDiT2DModel"]
@@ -129,6 +130,7 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
AuraFlowTransformer2DModel,
CogVideoXTransformer3DModel,
CogView3PlusTransformer2DModel,
ConsisIDTransformer3DModel,
DiTTransformer2DModel,
DualTransformer2DModel,
FluxTransformer2DModel,
+112 -6
View File
@@ -297,7 +297,10 @@ class Attention(nn.Module):
self.set_processor(processor)
def set_use_xla_flash_attention(
self, use_xla_flash_attention: bool, partition_spec: Optional[Tuple[Optional[str], ...]] = None
self,
use_xla_flash_attention: bool,
partition_spec: Optional[Tuple[Optional[str], ...]] = None,
is_flux=False,
) -> None:
r"""
Set whether to use xla flash attention from `torch_xla` or not.
@@ -316,7 +319,10 @@ class Attention(nn.Module):
elif is_spmd() and is_torch_xla_version("<", "2.4"):
raise "flash attention pallas kernel using SPMD is supported from torch_xla version 2.4"
else:
processor = XLAFlashAttnProcessor2_0(partition_spec)
if is_flux:
processor = XLAFluxFlashAttnProcessor2_0(partition_spec)
else:
processor = XLAFlashAttnProcessor2_0(partition_spec)
else:
processor = (
AttnProcessor2_0() if hasattr(F, "scaled_dot_product_attention") and self.scale_qk else AttnProcessor()
@@ -899,7 +905,7 @@ class SanaMultiscaleLinearAttention(nn.Module):
scores = torch.matmul(key.transpose(-1, -2), query)
scores = scores.to(dtype=torch.float32)
scores = scores / (torch.sum(scores, dim=2, keepdim=True) + self.eps)
hidden_states = torch.matmul(value, scores)
hidden_states = torch.matmul(value, scores.to(value.dtype))
return hidden_states
def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
@@ -2318,9 +2324,8 @@ class FluxAttnProcessor2_0:
query = apply_rotary_emb(query, image_rotary_emb)
key = apply_rotary_emb(key, image_rotary_emb)
hidden_states = F.scaled_dot_product_attention(
query, key, value, attn_mask=attention_mask, dropout_p=0.0, is_causal=False
)
hidden_states = F.scaled_dot_product_attention(query, key, value, dropout_p=0.0, is_causal=False)
hidden_states = hidden_states.transpose(1, 2).reshape(batch_size, -1, attn.heads * head_dim)
hidden_states = hidden_states.to(query.dtype)
@@ -2522,6 +2527,7 @@ class FusedFluxAttnProcessor2_0:
key = apply_rotary_emb(key, image_rotary_emb)
hidden_states = F.scaled_dot_product_attention(query, key, value, dropout_p=0.0, is_causal=False)
hidden_states = hidden_states.transpose(1, 2).reshape(batch_size, -1, attn.heads * head_dim)
hidden_states = hidden_states.to(query.dtype)
@@ -3422,6 +3428,106 @@ class XLAFlashAttnProcessor2_0:
return hidden_states
class XLAFluxFlashAttnProcessor2_0:
r"""
Processor for implementing scaled dot-product attention with pallas flash attention kernel if using `torch_xla`.
"""
def __init__(self, partition_spec: Optional[Tuple[Optional[str], ...]] = None):
if not hasattr(F, "scaled_dot_product_attention"):
raise ImportError(
"XLAFlashAttnProcessor2_0 requires PyTorch 2.0, to use it, please upgrade PyTorch to 2.0."
)
if is_torch_xla_version("<", "2.3"):
raise ImportError("XLA flash attention requires torch_xla version >= 2.3.")
if is_spmd() and is_torch_xla_version("<", "2.4"):
raise ImportError("SPMD support for XLA flash attention needs torch_xla version >= 2.4.")
self.partition_spec = partition_spec
def __call__(
self,
attn: Attention,
hidden_states: torch.FloatTensor,
encoder_hidden_states: torch.FloatTensor = None,
attention_mask: Optional[torch.FloatTensor] = None,
image_rotary_emb: Optional[torch.Tensor] = None,
) -> torch.FloatTensor:
batch_size, _, _ = hidden_states.shape if encoder_hidden_states is None else encoder_hidden_states.shape
# `sample` projections.
query = attn.to_q(hidden_states)
key = attn.to_k(hidden_states)
value = attn.to_v(hidden_states)
inner_dim = key.shape[-1]
head_dim = inner_dim // attn.heads
query = query.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
key = key.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
value = value.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
if attn.norm_q is not None:
query = attn.norm_q(query)
if attn.norm_k is not None:
key = attn.norm_k(key)
# the attention in FluxSingleTransformerBlock does not use `encoder_hidden_states`
if encoder_hidden_states is not None:
# `context` projections.
encoder_hidden_states_query_proj = attn.add_q_proj(encoder_hidden_states)
encoder_hidden_states_key_proj = attn.add_k_proj(encoder_hidden_states)
encoder_hidden_states_value_proj = attn.add_v_proj(encoder_hidden_states)
encoder_hidden_states_query_proj = encoder_hidden_states_query_proj.view(
batch_size, -1, attn.heads, head_dim
).transpose(1, 2)
encoder_hidden_states_key_proj = encoder_hidden_states_key_proj.view(
batch_size, -1, attn.heads, head_dim
).transpose(1, 2)
encoder_hidden_states_value_proj = encoder_hidden_states_value_proj.view(
batch_size, -1, attn.heads, head_dim
).transpose(1, 2)
if attn.norm_added_q is not None:
encoder_hidden_states_query_proj = attn.norm_added_q(encoder_hidden_states_query_proj)
if attn.norm_added_k is not None:
encoder_hidden_states_key_proj = attn.norm_added_k(encoder_hidden_states_key_proj)
# attention
query = torch.cat([encoder_hidden_states_query_proj, query], dim=2)
key = torch.cat([encoder_hidden_states_key_proj, key], dim=2)
value = torch.cat([encoder_hidden_states_value_proj, value], dim=2)
if image_rotary_emb is not None:
from .embeddings import apply_rotary_emb
query = apply_rotary_emb(query, image_rotary_emb)
key = apply_rotary_emb(key, image_rotary_emb)
query /= math.sqrt(head_dim)
hidden_states = flash_attention(query, key, value, causal=False)
hidden_states = hidden_states.transpose(1, 2).reshape(batch_size, -1, attn.heads * head_dim)
hidden_states = hidden_states.to(query.dtype)
if encoder_hidden_states is not None:
encoder_hidden_states, hidden_states = (
hidden_states[:, : encoder_hidden_states.shape[1]],
hidden_states[:, encoder_hidden_states.shape[1] :],
)
# linear proj
hidden_states = attn.to_out[0](hidden_states)
# dropout
hidden_states = attn.to_out[1](hidden_states)
encoder_hidden_states = attn.to_add_out(encoder_hidden_states)
return hidden_states, encoder_hidden_states
else:
return hidden_states
class MochiVaeAttnProcessor2_0:
r"""
Attention processor used in Mochi VAE.
@@ -740,10 +740,11 @@ class ControlNetModel(ModelMixin, ConfigMixin, FromOriginalModelMixin):
# TODO: this requires sync between CPU and GPU. So try to pass timesteps as tensors if you can
# This would be a good case for the `match` statement (Python 3.10+)
is_mps = sample.device.type == "mps"
is_npu = sample.device.type == "npu"
if isinstance(timestep, float):
dtype = torch.float32 if is_mps else torch.float64
dtype = torch.float32 if (is_mps or is_npu) else torch.float64
else:
dtype = torch.int32 if is_mps else torch.int64
dtype = torch.int32 if (is_mps or is_npu) else torch.int64
timesteps = torch.tensor([timesteps], dtype=dtype, device=sample.device)
elif len(timesteps.shape) == 0:
timesteps = timesteps[None].to(sample.device)
@@ -671,10 +671,11 @@ class SparseControlNetModel(ModelMixin, ConfigMixin, FromOriginalModelMixin):
# TODO: this requires sync between CPU and GPU. So try to pass timesteps as tensors if you can
# This would be a good case for the `match` statement (Python 3.10+)
is_mps = sample.device.type == "mps"
is_npu = sample.device.type == "npu"
if isinstance(timestep, float):
dtype = torch.float32 if is_mps else torch.float64
dtype = torch.float32 if (is_mps or is_npu) else torch.float64
else:
dtype = torch.int32 if is_mps else torch.int64
dtype = torch.int32 if (is_mps or is_npu) else torch.int64
timesteps = torch.tensor([timesteps], dtype=dtype, device=sample.device)
elif len(timesteps.shape) == 0:
timesteps = timesteps[None].to(sample.device)
@@ -681,10 +681,11 @@ class ControlNetUnionModel(ModelMixin, ConfigMixin, FromOriginalModelMixin):
# TODO: this requires sync between CPU and GPU. So try to pass timesteps as tensors if you can
# This would be a good case for the `match` statement (Python 3.10+)
is_mps = sample.device.type == "mps"
is_npu = sample.device.type == "npu"
if isinstance(timestep, float):
dtype = torch.float32 if is_mps else torch.float64
dtype = torch.float32 if (is_mps or is_npu) else torch.float64
else:
dtype = torch.int32 if is_mps else torch.int64
dtype = torch.int32 if (is_mps or is_npu) else torch.int64
timesteps = torch.tensor([timesteps], dtype=dtype, device=sample.device)
elif len(timesteps.shape) == 0:
timesteps = timesteps[None].to(sample.device)
@@ -1088,10 +1088,11 @@ class UNetControlNetXSModel(ModelMixin, ConfigMixin):
# TODO: this requires sync between CPU and GPU. So try to pass timesteps as tensors if you can
# This would be a good case for the `match` statement (Python 3.10+)
is_mps = sample.device.type == "mps"
is_npu = sample.device.type == "npu"
if isinstance(timestep, float):
dtype = torch.float32 if is_mps else torch.float64
dtype = torch.float32 if (is_mps or is_npu) else torch.float64
else:
dtype = torch.int32 if is_mps else torch.int64
dtype = torch.int32 if (is_mps or is_npu) else torch.int64
timesteps = torch.tensor([timesteps], dtype=dtype, device=sample.device)
elif len(timesteps.shape) == 0:
timesteps = timesteps[None].to(sample.device)
+49 -10
View File
@@ -20,10 +20,11 @@ import os
from array import array
from collections import OrderedDict
from pathlib import Path
from typing import List, Optional, Union
from typing import Dict, Iterator, List, Optional, Tuple, Union
import safetensors
import torch
from huggingface_hub import DDUFEntry
from huggingface_hub.utils import EntryNotFoundError
from ..utils import (
@@ -132,7 +133,10 @@ def _fetch_remapped_cls_from_config(config, old_class):
def load_state_dict(
checkpoint_file: Union[str, os.PathLike], variant: Optional[str] = None, disable_mmap: bool = False
checkpoint_file: Union[str, os.PathLike],
variant: Optional[str] = None,
dduf_entries: Optional[Dict[str, DDUFEntry]] = None,
disable_mmap: bool = False,
):
"""
Reads a checkpoint file, returning properly formatted errors if they arise.
@@ -144,6 +148,10 @@ def load_state_dict(
try:
file_extension = os.path.basename(checkpoint_file).split(".")[-1]
if file_extension == SAFETENSORS_FILE_EXTENSION:
if dduf_entries:
# tensors are loaded on cpu
with dduf_entries[checkpoint_file].as_mmap() as mm:
return safetensors.torch.load(mm)
if disable_mmap:
return safetensors.torch.load(open(checkpoint_file, "rb").read())
else:
@@ -185,6 +193,7 @@ def load_model_dict_into_meta(
model_name_or_path: Optional[str] = None,
hf_quantizer=None,
keep_in_fp32_modules=None,
named_buffers: Optional[Iterator[Tuple[str, torch.Tensor]]] = None,
) -> List[str]:
if device is not None and not isinstance(device, (str, torch.device)):
raise ValueError(f"Expected device to have type `str` or `torch.device`, but got {type(device)=}.")
@@ -246,6 +255,20 @@ def load_model_dict_into_meta(
else:
set_module_tensor_to_device(model, param_name, device, value=param)
if named_buffers is None:
return unexpected_keys
for param_name, param in named_buffers:
if is_quantized and (
hf_quantizer.check_if_quantized_param(model, param, param_name, state_dict, param_device=device)
):
hf_quantizer.create_quantized_param(model, param, param_name, device, state_dict, unexpected_keys)
else:
if accepts_dtype:
set_module_tensor_to_device(model, param_name, device, value=param, **set_module_kwargs)
else:
set_module_tensor_to_device(model, param_name, device, value=param)
return unexpected_keys
@@ -284,6 +307,7 @@ def _fetch_index_file(
revision,
user_agent,
commit_hash,
dduf_entries: Optional[Dict[str, DDUFEntry]] = None,
):
if is_local:
index_file = Path(
@@ -309,8 +333,10 @@ def _fetch_index_file(
subfolder=None,
user_agent=user_agent,
commit_hash=commit_hash,
dduf_entries=dduf_entries,
)
index_file = Path(index_file)
if not dduf_entries:
index_file = Path(index_file)
except (EntryNotFoundError, EnvironmentError):
index_file = None
@@ -319,7 +345,9 @@ def _fetch_index_file(
# Adapted from
# https://github.com/bghira/SimpleTuner/blob/cea2457ab063f6dedb9e697830ae68a96be90641/helpers/training/save_hooks.py#L64
def _merge_sharded_checkpoints(sharded_ckpt_cached_folder, sharded_metadata):
def _merge_sharded_checkpoints(
sharded_ckpt_cached_folder, sharded_metadata, dduf_entries: Optional[Dict[str, DDUFEntry]] = None
):
weight_map = sharded_metadata.get("weight_map", None)
if weight_map is None:
raise KeyError("'weight_map' key not found in the shard index file.")
@@ -332,14 +360,23 @@ def _merge_sharded_checkpoints(sharded_ckpt_cached_folder, sharded_metadata):
# Load tensors from each unique file
for file_name in files_to_load:
part_file_path = os.path.join(sharded_ckpt_cached_folder, file_name)
if not os.path.exists(part_file_path):
raise FileNotFoundError(f"Part file {file_name} not found.")
if dduf_entries:
if part_file_path not in dduf_entries:
raise FileNotFoundError(f"Part file {file_name} not found.")
else:
if not os.path.exists(part_file_path):
raise FileNotFoundError(f"Part file {file_name} not found.")
if is_safetensors:
with safetensors.safe_open(part_file_path, framework="pt", device="cpu") as f:
for tensor_key in f.keys():
if tensor_key in weight_map:
merged_state_dict[tensor_key] = f.get_tensor(tensor_key)
if dduf_entries:
with dduf_entries[part_file_path].as_mmap() as mm:
tensors = safetensors.torch.load(mm)
merged_state_dict.update(tensors)
else:
with safetensors.safe_open(part_file_path, framework="pt", device="cpu") as f:
for tensor_key in f.keys():
if tensor_key in weight_map:
merged_state_dict[tensor_key] = f.get_tensor(tensor_key)
else:
merged_state_dict.update(torch.load(part_file_path, weights_only=True, map_location="cpu"))
@@ -360,6 +397,7 @@ def _fetch_index_file_legacy(
revision,
user_agent,
commit_hash,
dduf_entries: Optional[Dict[str, DDUFEntry]] = None,
):
if is_local:
index_file = Path(
@@ -400,6 +438,7 @@ def _fetch_index_file_legacy(
subfolder=None,
user_agent=user_agent,
commit_hash=commit_hash,
dduf_entries=dduf_entries,
)
index_file = Path(index_file)
deprecation_message = f"This serialization format is now deprecated to standardize the serialization format between `transformers` and `diffusers`. We recommend you to remove the existing files associated with the current variant ({variant}) and re-obtain them by running a `save_pretrained()`."
+27 -11
View File
@@ -23,11 +23,11 @@ import re
from collections import OrderedDict
from functools import partial, wraps
from pathlib import Path
from typing import Any, Callable, List, Optional, Tuple, Union
from typing import Any, Callable, Dict, List, Optional, Tuple, Union
import safetensors
import torch
from huggingface_hub import create_repo, split_torch_state_dict_into_shards
from huggingface_hub import DDUFEntry, create_repo, split_torch_state_dict_into_shards
from huggingface_hub.utils import validate_hf_hub_args
from torch import Tensor, nn
@@ -227,14 +227,14 @@ class ModelMixin(torch.nn.Module, PushToHubMixin):
self.set_use_npu_flash_attention(False)
def set_use_xla_flash_attention(
self, use_xla_flash_attention: bool, partition_spec: Optional[Callable] = None
self, use_xla_flash_attention: bool, partition_spec: Optional[Callable] = None, **kwargs
) -> None:
# Recursively walk through all the children.
# Any children which exposes the set_use_xla_flash_attention method
# gets the message
def fn_recursive_set_flash_attention(module: torch.nn.Module):
if hasattr(module, "set_use_xla_flash_attention"):
module.set_use_xla_flash_attention(use_xla_flash_attention, partition_spec)
module.set_use_xla_flash_attention(use_xla_flash_attention, partition_spec, **kwargs)
for child in module.children():
fn_recursive_set_flash_attention(child)
@@ -243,11 +243,11 @@ class ModelMixin(torch.nn.Module, PushToHubMixin):
if isinstance(module, torch.nn.Module):
fn_recursive_set_flash_attention(module)
def enable_xla_flash_attention(self, partition_spec: Optional[Callable] = None):
def enable_xla_flash_attention(self, partition_spec: Optional[Callable] = None, **kwargs):
r"""
Enable the flash attention pallals kernel for torch_xla.
"""
self.set_use_xla_flash_attention(True, partition_spec)
self.set_use_xla_flash_attention(True, partition_spec, **kwargs)
def disable_xla_flash_attention(self):
r"""
@@ -607,6 +607,7 @@ class ModelMixin(torch.nn.Module, PushToHubMixin):
variant = kwargs.pop("variant", None)
use_safetensors = kwargs.pop("use_safetensors", None)
quantization_config = kwargs.pop("quantization_config", None)
dduf_entries: Optional[Dict[str, DDUFEntry]] = kwargs.pop("dduf_entries", None)
disable_mmap = kwargs.pop("disable_mmap", False)
allow_pickle = False
@@ -700,6 +701,7 @@ class ModelMixin(torch.nn.Module, PushToHubMixin):
revision=revision,
subfolder=subfolder,
user_agent=user_agent,
dduf_entries=dduf_entries,
**kwargs,
)
# no in-place modification of the original config.
@@ -776,13 +778,14 @@ class ModelMixin(torch.nn.Module, PushToHubMixin):
"revision": revision,
"user_agent": user_agent,
"commit_hash": commit_hash,
"dduf_entries": dduf_entries,
}
index_file = _fetch_index_file(**index_file_kwargs)
# In case the index file was not found we still have to consider the legacy format.
# this becomes applicable when the variant is not None.
if variant is not None and (index_file is None or not os.path.exists(index_file)):
index_file = _fetch_index_file_legacy(**index_file_kwargs)
if index_file is not None and index_file.is_file():
if index_file is not None and (dduf_entries or index_file.is_file()):
is_sharded = True
if is_sharded and from_flax:
@@ -811,6 +814,7 @@ class ModelMixin(torch.nn.Module, PushToHubMixin):
model = load_flax_checkpoint_in_pytorch_model(model, model_file)
else:
# in the case it is sharded, we have already the index
if is_sharded:
sharded_ckpt_cached_folder, sharded_metadata = _get_checkpoint_shard_files(
pretrained_model_name_or_path,
@@ -822,10 +826,13 @@ class ModelMixin(torch.nn.Module, PushToHubMixin):
user_agent=user_agent,
revision=revision,
subfolder=subfolder or "",
dduf_entries=dduf_entries,
)
# TODO: https://github.com/huggingface/diffusers/issues/10013
if hf_quantizer is not None:
model_file = _merge_sharded_checkpoints(sharded_ckpt_cached_folder, sharded_metadata)
if hf_quantizer is not None or dduf_entries:
model_file = _merge_sharded_checkpoints(
sharded_ckpt_cached_folder, sharded_metadata, dduf_entries=dduf_entries
)
logger.info("Merged sharded checkpoints as `hf_quantizer` is not None.")
is_sharded = False
@@ -843,6 +850,7 @@ class ModelMixin(torch.nn.Module, PushToHubMixin):
subfolder=subfolder,
user_agent=user_agent,
commit_hash=commit_hash,
dduf_entries=dduf_entries,
)
except IOError as e:
@@ -866,6 +874,7 @@ class ModelMixin(torch.nn.Module, PushToHubMixin):
subfolder=subfolder,
user_agent=user_agent,
commit_hash=commit_hash,
dduf_entries=dduf_entries,
)
if low_cpu_mem_usage:
@@ -887,7 +896,9 @@ class ModelMixin(torch.nn.Module, PushToHubMixin):
# TODO (sayakpaul, SunMarc): remove this after model loading refactor
else:
param_device = torch.device(torch.cuda.current_device())
state_dict = load_state_dict(model_file, variant=variant, disable_mmap=disable_mmap)
state_dict = load_state_dict(
model_file, variant=variant, dduf_entries=dduf_entries, disable_mmap=disable_mmap
)
model._convert_deprecated_attention_blocks(state_dict)
# move the params from meta device to cpu
@@ -902,6 +913,8 @@ class ModelMixin(torch.nn.Module, PushToHubMixin):
" those weights or else make sure your checkpoint file is correct."
)
named_buffers = model.named_buffers()
unexpected_keys = load_model_dict_into_meta(
model,
state_dict,
@@ -910,6 +923,7 @@ class ModelMixin(torch.nn.Module, PushToHubMixin):
model_name_or_path=pretrained_model_name_or_path,
hf_quantizer=hf_quantizer,
keep_in_fp32_modules=keep_in_fp32_modules,
named_buffers=named_buffers,
)
if cls._keys_to_ignore_on_load_unexpected is not None:
@@ -983,7 +997,9 @@ class ModelMixin(torch.nn.Module, PushToHubMixin):
else:
model = cls.from_config(config, **unused_kwargs)
state_dict = load_state_dict(model_file, variant=variant, disable_mmap=disable_mmap)
state_dict = load_state_dict(
model_file, variant=variant, dduf_entries=dduf_entries, disable_mmap=disable_mmap
)
model._convert_deprecated_attention_blocks(state_dict)
model, missing_keys, unexpected_keys, mismatched_keys, error_msgs = cls._load_pretrained_model(
+21 -10
View File
@@ -20,7 +20,7 @@ import torch
import torch.nn as nn
import torch.nn.functional as F
from ..utils import is_torch_version
from ..utils import is_torch_npu_available, is_torch_version
from .activations import get_activation
from .embeddings import CombinedTimestepLabelEmbeddings, PixArtAlphaCombinedTimestepSizeEmbeddings
@@ -505,19 +505,30 @@ class RMSNorm(nn.Module):
self.bias = nn.Parameter(torch.zeros(dim))
def forward(self, hidden_states):
input_dtype = hidden_states.dtype
variance = hidden_states.to(torch.float32).pow(2).mean(-1, keepdim=True)
hidden_states = hidden_states * torch.rsqrt(variance + self.eps)
if is_torch_npu_available():
import torch_npu
if self.weight is not None:
# convert into half-precision if necessary
if self.weight.dtype in [torch.float16, torch.bfloat16]:
hidden_states = hidden_states.to(self.weight.dtype)
hidden_states = hidden_states * self.weight
if self.weight is not None:
# convert into half-precision if necessary
if self.weight.dtype in [torch.float16, torch.bfloat16]:
hidden_states = hidden_states.to(self.weight.dtype)
hidden_states = torch_npu.npu_rms_norm(hidden_states, self.weight, epsilon=self.eps)[0]
if self.bias is not None:
hidden_states = hidden_states + self.bias
else:
hidden_states = hidden_states.to(input_dtype)
input_dtype = hidden_states.dtype
variance = hidden_states.to(torch.float32).pow(2).mean(-1, keepdim=True)
hidden_states = hidden_states * torch.rsqrt(variance + self.eps)
if self.weight is not None:
# convert into half-precision if necessary
if self.weight.dtype in [torch.float16, torch.bfloat16]:
hidden_states = hidden_states.to(self.weight.dtype)
hidden_states = hidden_states * self.weight
if self.bias is not None:
hidden_states = hidden_states + self.bias
else:
hidden_states = hidden_states.to(input_dtype)
return hidden_states
@@ -4,6 +4,7 @@ from ...utils import is_torch_available
if is_torch_available():
from .auraflow_transformer_2d import AuraFlowTransformer2DModel
from .cogvideox_transformer_3d import CogVideoXTransformer3DModel
from .consisid_transformer_3d import ConsisIDTransformer3DModel
from .dit_transformer_2d import DiTTransformer2DModel
from .dual_transformer_2d import DualTransformer2DModel
from .hunyuan_transformer_2d import HunyuanDiT2DModel
@@ -0,0 +1,801 @@
# Copyright 2024 ConsisID Authors and The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import math
from typing import Any, Dict, List, Optional, Tuple, Union
import torch
from torch import nn
from ...configuration_utils import ConfigMixin, register_to_config
from ...loaders import PeftAdapterMixin
from ...utils import USE_PEFT_BACKEND, is_torch_version, logging, scale_lora_layers, unscale_lora_layers
from ...utils.torch_utils import maybe_allow_in_graph
from ..attention import Attention, FeedForward
from ..attention_processor import AttentionProcessor, CogVideoXAttnProcessor2_0
from ..embeddings import CogVideoXPatchEmbed, TimestepEmbedding, Timesteps
from ..modeling_outputs import Transformer2DModelOutput
from ..modeling_utils import ModelMixin
from ..normalization import AdaLayerNorm, CogVideoXLayerNormZero
logger = logging.get_logger(__name__) # pylint: disable=invalid-name
class PerceiverAttention(nn.Module):
def __init__(self, dim: int, dim_head: int = 64, heads: int = 8, kv_dim: Optional[int] = None):
super().__init__()
self.scale = dim_head**-0.5
self.dim_head = dim_head
self.heads = heads
inner_dim = dim_head * heads
self.norm1 = nn.LayerNorm(dim if kv_dim is None else kv_dim)
self.norm2 = nn.LayerNorm(dim)
self.to_q = nn.Linear(dim, inner_dim, bias=False)
self.to_kv = nn.Linear(dim if kv_dim is None else kv_dim, inner_dim * 2, bias=False)
self.to_out = nn.Linear(inner_dim, dim, bias=False)
def forward(self, image_embeds: torch.Tensor, latents: torch.Tensor) -> torch.Tensor:
# Apply normalization
image_embeds = self.norm1(image_embeds)
latents = self.norm2(latents)
batch_size, seq_len, _ = latents.shape # Get batch size and sequence length
# Compute query, key, and value matrices
query = self.to_q(latents)
kv_input = torch.cat((image_embeds, latents), dim=-2)
key, value = self.to_kv(kv_input).chunk(2, dim=-1)
# Reshape the tensors for multi-head attention
query = query.reshape(query.size(0), -1, self.heads, self.dim_head).transpose(1, 2)
key = key.reshape(key.size(0), -1, self.heads, self.dim_head).transpose(1, 2)
value = value.reshape(value.size(0), -1, self.heads, self.dim_head).transpose(1, 2)
# attention
scale = 1 / math.sqrt(math.sqrt(self.dim_head))
weight = (query * scale) @ (key * scale).transpose(-2, -1) # More stable with f16 than dividing afterwards
weight = torch.softmax(weight.float(), dim=-1).type(weight.dtype)
output = weight @ value
# Reshape and return the final output
output = output.permute(0, 2, 1, 3).reshape(batch_size, seq_len, -1)
return self.to_out(output)
class LocalFacialExtractor(nn.Module):
def __init__(
self,
id_dim: int = 1280,
vit_dim: int = 1024,
depth: int = 10,
dim_head: int = 64,
heads: int = 16,
num_id_token: int = 5,
num_queries: int = 32,
output_dim: int = 2048,
ff_mult: int = 4,
num_scale: int = 5,
):
super().__init__()
# Storing identity token and query information
self.num_id_token = num_id_token
self.vit_dim = vit_dim
self.num_queries = num_queries
assert depth % num_scale == 0
self.depth = depth // num_scale
self.num_scale = num_scale
scale = vit_dim**-0.5
# Learnable latent query embeddings
self.latents = nn.Parameter(torch.randn(1, num_queries, vit_dim) * scale)
# Projection layer to map the latent output to the desired dimension
self.proj_out = nn.Parameter(scale * torch.randn(vit_dim, output_dim))
# Attention and ConsisIDFeedForward layer stack
self.layers = nn.ModuleList([])
for _ in range(depth):
self.layers.append(
nn.ModuleList(
[
PerceiverAttention(dim=vit_dim, dim_head=dim_head, heads=heads), # Perceiver Attention layer
nn.Sequential(
nn.LayerNorm(vit_dim),
nn.Linear(vit_dim, vit_dim * ff_mult, bias=False),
nn.GELU(),
nn.Linear(vit_dim * ff_mult, vit_dim, bias=False),
), # ConsisIDFeedForward layer
]
)
)
# Mappings for each of the 5 different ViT features
for i in range(num_scale):
setattr(
self,
f"mapping_{i}",
nn.Sequential(
nn.Linear(vit_dim, vit_dim),
nn.LayerNorm(vit_dim),
nn.LeakyReLU(),
nn.Linear(vit_dim, vit_dim),
nn.LayerNorm(vit_dim),
nn.LeakyReLU(),
nn.Linear(vit_dim, vit_dim),
),
)
# Mapping for identity embedding vectors
self.id_embedding_mapping = nn.Sequential(
nn.Linear(id_dim, vit_dim),
nn.LayerNorm(vit_dim),
nn.LeakyReLU(),
nn.Linear(vit_dim, vit_dim),
nn.LayerNorm(vit_dim),
nn.LeakyReLU(),
nn.Linear(vit_dim, vit_dim * num_id_token),
)
def forward(self, id_embeds: torch.Tensor, vit_hidden_states: List[torch.Tensor]) -> torch.Tensor:
# Repeat latent queries for the batch size
latents = self.latents.repeat(id_embeds.size(0), 1, 1)
# Map the identity embedding to tokens
id_embeds = self.id_embedding_mapping(id_embeds)
id_embeds = id_embeds.reshape(-1, self.num_id_token, self.vit_dim)
# Concatenate identity tokens with the latent queries
latents = torch.cat((latents, id_embeds), dim=1)
# Process each of the num_scale visual feature inputs
for i in range(self.num_scale):
vit_feature = getattr(self, f"mapping_{i}")(vit_hidden_states[i])
ctx_feature = torch.cat((id_embeds, vit_feature), dim=1)
# Pass through the PerceiverAttention and ConsisIDFeedForward layers
for attn, ff in self.layers[i * self.depth : (i + 1) * self.depth]:
latents = attn(ctx_feature, latents) + latents
latents = ff(latents) + latents
# Retain only the query latents
latents = latents[:, : self.num_queries]
# Project the latents to the output dimension
latents = latents @ self.proj_out
return latents
class PerceiverCrossAttention(nn.Module):
def __init__(self, dim: int = 3072, dim_head: int = 128, heads: int = 16, kv_dim: int = 2048):
super().__init__()
self.scale = dim_head**-0.5
self.dim_head = dim_head
self.heads = heads
inner_dim = dim_head * heads
# Layer normalization to stabilize training
self.norm1 = nn.LayerNorm(dim if kv_dim is None else kv_dim)
self.norm2 = nn.LayerNorm(dim)
# Linear transformations to produce queries, keys, and values
self.to_q = nn.Linear(dim, inner_dim, bias=False)
self.to_kv = nn.Linear(dim if kv_dim is None else kv_dim, inner_dim * 2, bias=False)
self.to_out = nn.Linear(inner_dim, dim, bias=False)
def forward(self, image_embeds: torch.Tensor, hidden_states: torch.Tensor) -> torch.Tensor:
# Apply layer normalization to the input image and latent features
image_embeds = self.norm1(image_embeds)
hidden_states = self.norm2(hidden_states)
batch_size, seq_len, _ = hidden_states.shape
# Compute queries, keys, and values
query = self.to_q(hidden_states)
key, value = self.to_kv(image_embeds).chunk(2, dim=-1)
# Reshape tensors to split into attention heads
query = query.reshape(query.size(0), -1, self.heads, self.dim_head).transpose(1, 2)
key = key.reshape(key.size(0), -1, self.heads, self.dim_head).transpose(1, 2)
value = value.reshape(value.size(0), -1, self.heads, self.dim_head).transpose(1, 2)
# Compute attention weights
scale = 1 / math.sqrt(math.sqrt(self.dim_head))
weight = (query * scale) @ (key * scale).transpose(-2, -1) # More stable scaling than post-division
weight = torch.softmax(weight.float(), dim=-1).type(weight.dtype)
# Compute the output via weighted combination of values
out = weight @ value
# Reshape and permute to prepare for final linear transformation
out = out.permute(0, 2, 1, 3).reshape(batch_size, seq_len, -1)
return self.to_out(out)
@maybe_allow_in_graph
class ConsisIDBlock(nn.Module):
r"""
Transformer block used in [ConsisID](https://github.com/PKU-YuanGroup/ConsisID) model.
Parameters:
dim (`int`):
The number of channels in the input and output.
num_attention_heads (`int`):
The number of heads to use for multi-head attention.
attention_head_dim (`int`):
The number of channels in each head.
time_embed_dim (`int`):
The number of channels in timestep embedding.
dropout (`float`, defaults to `0.0`):
The dropout probability to use.
activation_fn (`str`, defaults to `"gelu-approximate"`):
Activation function to be used in feed-forward.
attention_bias (`bool`, defaults to `False`):
Whether or not to use bias in attention projection layers.
qk_norm (`bool`, defaults to `True`):
Whether or not to use normalization after query and key projections in Attention.
norm_elementwise_affine (`bool`, defaults to `True`):
Whether to use learnable elementwise affine parameters for normalization.
norm_eps (`float`, defaults to `1e-5`):
Epsilon value for normalization layers.
final_dropout (`bool` defaults to `False`):
Whether to apply a final dropout after the last feed-forward layer.
ff_inner_dim (`int`, *optional*, defaults to `None`):
Custom hidden dimension of Feed-forward layer. If not provided, `4 * dim` is used.
ff_bias (`bool`, defaults to `True`):
Whether or not to use bias in Feed-forward layer.
attention_out_bias (`bool`, defaults to `True`):
Whether or not to use bias in Attention output projection layer.
"""
def __init__(
self,
dim: int,
num_attention_heads: int,
attention_head_dim: int,
time_embed_dim: int,
dropout: float = 0.0,
activation_fn: str = "gelu-approximate",
attention_bias: bool = False,
qk_norm: bool = True,
norm_elementwise_affine: bool = True,
norm_eps: float = 1e-5,
final_dropout: bool = True,
ff_inner_dim: Optional[int] = None,
ff_bias: bool = True,
attention_out_bias: bool = True,
):
super().__init__()
# 1. Self Attention
self.norm1 = CogVideoXLayerNormZero(time_embed_dim, dim, norm_elementwise_affine, norm_eps, bias=True)
self.attn1 = Attention(
query_dim=dim,
dim_head=attention_head_dim,
heads=num_attention_heads,
qk_norm="layer_norm" if qk_norm else None,
eps=1e-6,
bias=attention_bias,
out_bias=attention_out_bias,
processor=CogVideoXAttnProcessor2_0(),
)
# 2. Feed Forward
self.norm2 = CogVideoXLayerNormZero(time_embed_dim, dim, norm_elementwise_affine, norm_eps, bias=True)
self.ff = FeedForward(
dim,
dropout=dropout,
activation_fn=activation_fn,
final_dropout=final_dropout,
inner_dim=ff_inner_dim,
bias=ff_bias,
)
def forward(
self,
hidden_states: torch.Tensor,
encoder_hidden_states: torch.Tensor,
temb: torch.Tensor,
image_rotary_emb: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
) -> torch.Tensor:
text_seq_length = encoder_hidden_states.size(1)
# norm & modulate
norm_hidden_states, norm_encoder_hidden_states, gate_msa, enc_gate_msa = self.norm1(
hidden_states, encoder_hidden_states, temb
)
# attention
attn_hidden_states, attn_encoder_hidden_states = self.attn1(
hidden_states=norm_hidden_states,
encoder_hidden_states=norm_encoder_hidden_states,
image_rotary_emb=image_rotary_emb,
)
hidden_states = hidden_states + gate_msa * attn_hidden_states
encoder_hidden_states = encoder_hidden_states + enc_gate_msa * attn_encoder_hidden_states
# norm & modulate
norm_hidden_states, norm_encoder_hidden_states, gate_ff, enc_gate_ff = self.norm2(
hidden_states, encoder_hidden_states, temb
)
# feed-forward
norm_hidden_states = torch.cat([norm_encoder_hidden_states, norm_hidden_states], dim=1)
ff_output = self.ff(norm_hidden_states)
hidden_states = hidden_states + gate_ff * ff_output[:, text_seq_length:]
encoder_hidden_states = encoder_hidden_states + enc_gate_ff * ff_output[:, :text_seq_length]
return hidden_states, encoder_hidden_states
class ConsisIDTransformer3DModel(ModelMixin, ConfigMixin, PeftAdapterMixin):
"""
A Transformer model for video-like data in [ConsisID](https://github.com/PKU-YuanGroup/ConsisID).
Parameters:
num_attention_heads (`int`, defaults to `30`):
The number of heads to use for multi-head attention.
attention_head_dim (`int`, defaults to `64`):
The number of channels in each head.
in_channels (`int`, defaults to `16`):
The number of channels in the input.
out_channels (`int`, *optional*, defaults to `16`):
The number of channels in the output.
flip_sin_to_cos (`bool`, defaults to `True`):
Whether to flip the sin to cos in the time embedding.
time_embed_dim (`int`, defaults to `512`):
Output dimension of timestep embeddings.
text_embed_dim (`int`, defaults to `4096`):
Input dimension of text embeddings from the text encoder.
num_layers (`int`, defaults to `30`):
The number of layers of Transformer blocks to use.
dropout (`float`, defaults to `0.0`):
The dropout probability to use.
attention_bias (`bool`, defaults to `True`):
Whether to use bias in the attention projection layers.
sample_width (`int`, defaults to `90`):
The width of the input latents.
sample_height (`int`, defaults to `60`):
The height of the input latents.
sample_frames (`int`, defaults to `49`):
The number of frames in the input latents. Note that this parameter was incorrectly initialized to 49
instead of 13 because ConsisID processed 13 latent frames at once in its default and recommended settings,
but cannot be changed to the correct value to ensure backwards compatibility. To create a transformer with
K latent frames, the correct value to pass here would be: ((K - 1) * temporal_compression_ratio + 1).
patch_size (`int`, defaults to `2`):
The size of the patches to use in the patch embedding layer.
temporal_compression_ratio (`int`, defaults to `4`):
The compression ratio across the temporal dimension. See documentation for `sample_frames`.
max_text_seq_length (`int`, defaults to `226`):
The maximum sequence length of the input text embeddings.
activation_fn (`str`, defaults to `"gelu-approximate"`):
Activation function to use in feed-forward.
timestep_activation_fn (`str`, defaults to `"silu"`):
Activation function to use when generating the timestep embeddings.
norm_elementwise_affine (`bool`, defaults to `True`):
Whether to use elementwise affine in normalization layers.
norm_eps (`float`, defaults to `1e-5`):
The epsilon value to use in normalization layers.
spatial_interpolation_scale (`float`, defaults to `1.875`):
Scaling factor to apply in 3D positional embeddings across spatial dimensions.
temporal_interpolation_scale (`float`, defaults to `1.0`):
Scaling factor to apply in 3D positional embeddings across temporal dimensions.
is_train_face (`bool`, defaults to `False`):
Whether to use enable the identity-preserving module during the training process. When set to `True`, the
model will focus on identity-preserving tasks.
is_kps (`bool`, defaults to `False`):
Whether to enable keypoint for global facial extractor. If `True`, keypoints will be in the model.
cross_attn_interval (`int`, defaults to `2`):
The interval between cross-attention layers in the Transformer architecture. A larger value may reduce the
frequency of cross-attention computations, which can help reduce computational overhead.
cross_attn_dim_head (`int`, optional, defaults to `128`):
The dimensionality of each attention head in the cross-attention layers of the Transformer architecture. A
larger value increases the capacity to attend to more complex patterns, but also increases memory and
computation costs.
cross_attn_num_heads (`int`, optional, defaults to `16`):
The number of attention heads in the cross-attention layers. More heads allow for more parallel attention
mechanisms, capturing diverse relationships between different components of the input, but can also
increase computational requirements.
LFE_id_dim (`int`, optional, defaults to `1280`):
The dimensionality of the identity vector used in the Local Facial Extractor (LFE). This vector represents
the identity features of a face, which are important for tasks like face recognition and identity
preservation across different frames.
LFE_vit_dim (`int`, optional, defaults to `1024`):
The dimension of the vision transformer (ViT) output used in the Local Facial Extractor (LFE). This value
dictates the size of the transformer-generated feature vectors that will be processed for facial feature
extraction.
LFE_depth (`int`, optional, defaults to `10`):
The number of layers in the Local Facial Extractor (LFE). Increasing the depth allows the model to capture
more complex representations of facial features, but also increases the computational load.
LFE_dim_head (`int`, optional, defaults to `64`):
The dimensionality of each attention head in the Local Facial Extractor (LFE). This parameter affects how
finely the model can process and focus on different parts of the facial features during the extraction
process.
LFE_num_heads (`int`, optional, defaults to `16`):
The number of attention heads in the Local Facial Extractor (LFE). More heads can improve the model's
ability to capture diverse facial features, but at the cost of increased computational complexity.
LFE_num_id_token (`int`, optional, defaults to `5`):
The number of identity tokens used in the Local Facial Extractor (LFE). This defines how many
identity-related tokens the model will process to ensure face identity preservation during feature
extraction.
LFE_num_querie (`int`, optional, defaults to `32`):
The number of query tokens used in the Local Facial Extractor (LFE). These tokens are used to capture
high-frequency face-related information that aids in accurate facial feature extraction.
LFE_output_dim (`int`, optional, defaults to `2048`):
The output dimension of the Local Facial Extractor (LFE). This dimension determines the size of the feature
vectors produced by the LFE module, which will be used for subsequent tasks such as face recognition or
tracking.
LFE_ff_mult (`int`, optional, defaults to `4`):
The multiplication factor applied to the feed-forward network's hidden layer size in the Local Facial
Extractor (LFE). A higher value increases the model's capacity to learn more complex facial feature
transformations, but also increases the computation and memory requirements.
LFE_num_scale (`int`, optional, defaults to `5`):
The number of different scales visual feature. A higher value increases the model's capacity to learn more
complex facial feature transformations, but also increases the computation and memory requirements.
local_face_scale (`float`, defaults to `1.0`):
A scaling factor used to adjust the importance of local facial features in the model. This can influence
how strongly the model focuses on high frequency face-related content.
"""
_supports_gradient_checkpointing = True
@register_to_config
def __init__(
self,
num_attention_heads: int = 30,
attention_head_dim: int = 64,
in_channels: int = 16,
out_channels: Optional[int] = 16,
flip_sin_to_cos: bool = True,
freq_shift: int = 0,
time_embed_dim: int = 512,
text_embed_dim: int = 4096,
num_layers: int = 30,
dropout: float = 0.0,
attention_bias: bool = True,
sample_width: int = 90,
sample_height: int = 60,
sample_frames: int = 49,
patch_size: int = 2,
temporal_compression_ratio: int = 4,
max_text_seq_length: int = 226,
activation_fn: str = "gelu-approximate",
timestep_activation_fn: str = "silu",
norm_elementwise_affine: bool = True,
norm_eps: float = 1e-5,
spatial_interpolation_scale: float = 1.875,
temporal_interpolation_scale: float = 1.0,
use_rotary_positional_embeddings: bool = False,
use_learned_positional_embeddings: bool = False,
is_train_face: bool = False,
is_kps: bool = False,
cross_attn_interval: int = 2,
cross_attn_dim_head: int = 128,
cross_attn_num_heads: int = 16,
LFE_id_dim: int = 1280,
LFE_vit_dim: int = 1024,
LFE_depth: int = 10,
LFE_dim_head: int = 64,
LFE_num_heads: int = 16,
LFE_num_id_token: int = 5,
LFE_num_querie: int = 32,
LFE_output_dim: int = 2048,
LFE_ff_mult: int = 4,
LFE_num_scale: int = 5,
local_face_scale: float = 1.0,
):
super().__init__()
inner_dim = num_attention_heads * attention_head_dim
if not use_rotary_positional_embeddings and use_learned_positional_embeddings:
raise ValueError(
"There are no ConsisID checkpoints available with disable rotary embeddings and learned positional "
"embeddings. If you're using a custom model and/or believe this should be supported, please open an "
"issue at https://github.com/huggingface/diffusers/issues."
)
# 1. Patch embedding
self.patch_embed = CogVideoXPatchEmbed(
patch_size=patch_size,
in_channels=in_channels,
embed_dim=inner_dim,
text_embed_dim=text_embed_dim,
bias=True,
sample_width=sample_width,
sample_height=sample_height,
sample_frames=sample_frames,
temporal_compression_ratio=temporal_compression_ratio,
max_text_seq_length=max_text_seq_length,
spatial_interpolation_scale=spatial_interpolation_scale,
temporal_interpolation_scale=temporal_interpolation_scale,
use_positional_embeddings=not use_rotary_positional_embeddings,
use_learned_positional_embeddings=use_learned_positional_embeddings,
)
self.embedding_dropout = nn.Dropout(dropout)
# 2. Time embeddings
self.time_proj = Timesteps(inner_dim, flip_sin_to_cos, freq_shift)
self.time_embedding = TimestepEmbedding(inner_dim, time_embed_dim, timestep_activation_fn)
# 3. Define spatio-temporal transformers blocks
self.transformer_blocks = nn.ModuleList(
[
ConsisIDBlock(
dim=inner_dim,
num_attention_heads=num_attention_heads,
attention_head_dim=attention_head_dim,
time_embed_dim=time_embed_dim,
dropout=dropout,
activation_fn=activation_fn,
attention_bias=attention_bias,
norm_elementwise_affine=norm_elementwise_affine,
norm_eps=norm_eps,
)
for _ in range(num_layers)
]
)
self.norm_final = nn.LayerNorm(inner_dim, norm_eps, norm_elementwise_affine)
# 4. Output blocks
self.norm_out = AdaLayerNorm(
embedding_dim=time_embed_dim,
output_dim=2 * inner_dim,
norm_elementwise_affine=norm_elementwise_affine,
norm_eps=norm_eps,
chunk_dim=1,
)
self.proj_out = nn.Linear(inner_dim, patch_size * patch_size * out_channels)
self.is_train_face = is_train_face
self.is_kps = is_kps
# 5. Define identity-preserving config
if is_train_face:
# LFE configs
self.LFE_id_dim = LFE_id_dim
self.LFE_vit_dim = LFE_vit_dim
self.LFE_depth = LFE_depth
self.LFE_dim_head = LFE_dim_head
self.LFE_num_heads = LFE_num_heads
self.LFE_num_id_token = LFE_num_id_token
self.LFE_num_querie = LFE_num_querie
self.LFE_output_dim = LFE_output_dim
self.LFE_ff_mult = LFE_ff_mult
self.LFE_num_scale = LFE_num_scale
# cross configs
self.inner_dim = inner_dim
self.cross_attn_interval = cross_attn_interval
self.num_cross_attn = num_layers // cross_attn_interval
self.cross_attn_dim_head = cross_attn_dim_head
self.cross_attn_num_heads = cross_attn_num_heads
self.cross_attn_kv_dim = int(self.inner_dim / 3 * 2)
self.local_face_scale = local_face_scale
# face modules
self._init_face_inputs()
self.gradient_checkpointing = False
def _set_gradient_checkpointing(self, module, value=False):
self.gradient_checkpointing = value
def _init_face_inputs(self):
self.local_facial_extractor = LocalFacialExtractor(
id_dim=self.LFE_id_dim,
vit_dim=self.LFE_vit_dim,
depth=self.LFE_depth,
dim_head=self.LFE_dim_head,
heads=self.LFE_num_heads,
num_id_token=self.LFE_num_id_token,
num_queries=self.LFE_num_querie,
output_dim=self.LFE_output_dim,
ff_mult=self.LFE_ff_mult,
num_scale=self.LFE_num_scale,
)
self.perceiver_cross_attention = nn.ModuleList(
[
PerceiverCrossAttention(
dim=self.inner_dim,
dim_head=self.cross_attn_dim_head,
heads=self.cross_attn_num_heads,
kv_dim=self.cross_attn_kv_dim,
)
for _ in range(self.num_cross_attn)
]
)
@property
# Copied from diffusers.models.unets.unet_2d_condition.UNet2DConditionModel.attn_processors
def attn_processors(self) -> Dict[str, AttentionProcessor]:
r"""
Returns:
`dict` of attention processors: A dictionary containing all attention processors used in the model with
indexed by its weight name.
"""
# set recursively
processors = {}
def fn_recursive_add_processors(name: str, module: torch.nn.Module, processors: Dict[str, AttentionProcessor]):
if hasattr(module, "get_processor"):
processors[f"{name}.processor"] = module.get_processor()
for sub_name, child in module.named_children():
fn_recursive_add_processors(f"{name}.{sub_name}", child, processors)
return processors
for name, module in self.named_children():
fn_recursive_add_processors(name, module, processors)
return processors
# Copied from diffusers.models.unets.unet_2d_condition.UNet2DConditionModel.set_attn_processor
def set_attn_processor(self, processor: Union[AttentionProcessor, Dict[str, AttentionProcessor]]):
r"""
Sets the attention processor to use to compute attention.
Parameters:
processor (`dict` of `AttentionProcessor` or only `AttentionProcessor`):
The instantiated processor class or a dictionary of processor classes that will be set as the processor
for **all** `Attention` layers.
If `processor` is a dict, the key needs to define the path to the corresponding cross attention
processor. This is strongly recommended when setting trainable attention processors.
"""
count = len(self.attn_processors.keys())
if isinstance(processor, dict) and len(processor) != count:
raise ValueError(
f"A dict of processors was passed, but the number of processors {len(processor)} does not match the"
f" number of attention layers: {count}. Please make sure to pass {count} processor classes."
)
def fn_recursive_attn_processor(name: str, module: torch.nn.Module, processor):
if hasattr(module, "set_processor"):
if not isinstance(processor, dict):
module.set_processor(processor)
else:
module.set_processor(processor.pop(f"{name}.processor"))
for sub_name, child in module.named_children():
fn_recursive_attn_processor(f"{name}.{sub_name}", child, processor)
for name, module in self.named_children():
fn_recursive_attn_processor(name, module, processor)
def forward(
self,
hidden_states: torch.Tensor,
encoder_hidden_states: torch.Tensor,
timestep: Union[int, float, torch.LongTensor],
timestep_cond: Optional[torch.Tensor] = None,
image_rotary_emb: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
attention_kwargs: Optional[Dict[str, Any]] = None,
id_cond: Optional[torch.Tensor] = None,
id_vit_hidden: Optional[torch.Tensor] = None,
return_dict: bool = True,
):
if attention_kwargs is not None:
attention_kwargs = attention_kwargs.copy()
lora_scale = attention_kwargs.pop("scale", 1.0)
else:
lora_scale = 1.0
if USE_PEFT_BACKEND:
# weight the lora layers by setting `lora_scale` for each PEFT layer
scale_lora_layers(self, lora_scale)
else:
if attention_kwargs is not None and attention_kwargs.get("scale", None) is not None:
logger.warning(
"Passing `scale` via `attention_kwargs` when not using the PEFT backend is ineffective."
)
# fuse clip and insightface
valid_face_emb = None
if self.is_train_face:
id_cond = id_cond.to(device=hidden_states.device, dtype=hidden_states.dtype)
id_vit_hidden = [
tensor.to(device=hidden_states.device, dtype=hidden_states.dtype) for tensor in id_vit_hidden
]
valid_face_emb = self.local_facial_extractor(
id_cond, id_vit_hidden
) # torch.Size([1, 1280]), list[5](torch.Size([1, 577, 1024])) -> torch.Size([1, 32, 2048])
batch_size, num_frames, channels, height, width = hidden_states.shape
# 1. Time embedding
timesteps = timestep
t_emb = self.time_proj(timesteps)
# timesteps does not contain any weights and will always return f32 tensors
# but time_embedding might actually be running in fp16. so we need to cast here.
# there might be better ways to encapsulate this.
t_emb = t_emb.to(dtype=hidden_states.dtype)
emb = self.time_embedding(t_emb, timestep_cond)
# 2. Patch embedding
# torch.Size([1, 226, 4096]) torch.Size([1, 13, 32, 60, 90])
hidden_states = self.patch_embed(encoder_hidden_states, hidden_states) # torch.Size([1, 17776, 3072])
hidden_states = self.embedding_dropout(hidden_states) # torch.Size([1, 17776, 3072])
text_seq_length = encoder_hidden_states.shape[1]
encoder_hidden_states = hidden_states[:, :text_seq_length] # torch.Size([1, 226, 3072])
hidden_states = hidden_states[:, text_seq_length:] # torch.Size([1, 17550, 3072])
# 3. Transformer blocks
ca_idx = 0
for i, block in enumerate(self.transformer_blocks):
if self.training and self.gradient_checkpointing:
def create_custom_forward(module):
def custom_forward(*inputs):
return module(*inputs)
return custom_forward
ckpt_kwargs: Dict[str, Any] = {"use_reentrant": False} if is_torch_version(">=", "1.11.0") else {}
hidden_states, encoder_hidden_states = torch.utils.checkpoint.checkpoint(
create_custom_forward(block),
hidden_states,
encoder_hidden_states,
emb,
image_rotary_emb,
**ckpt_kwargs,
)
else:
hidden_states, encoder_hidden_states = block(
hidden_states=hidden_states,
encoder_hidden_states=encoder_hidden_states,
temb=emb,
image_rotary_emb=image_rotary_emb,
)
if self.is_train_face:
if i % self.cross_attn_interval == 0 and valid_face_emb is not None:
hidden_states = hidden_states + self.local_face_scale * self.perceiver_cross_attention[ca_idx](
valid_face_emb, hidden_states
) # torch.Size([2, 32, 2048]) torch.Size([2, 17550, 3072])
ca_idx += 1
hidden_states = torch.cat([encoder_hidden_states, hidden_states], dim=1)
hidden_states = self.norm_final(hidden_states)
hidden_states = hidden_states[:, text_seq_length:]
# 4. Final block
hidden_states = self.norm_out(hidden_states, temb=emb)
hidden_states = self.proj_out(hidden_states)
# 5. Unpatchify
# Note: we use `-1` instead of `channels`:
# - It is okay to `channels` use for ConsisID (number of input channels is equal to output channels)
p = self.config.patch_size
output = hidden_states.reshape(batch_size, num_frames, height // p, width // p, -1, p, p)
output = output.permute(0, 1, 4, 2, 5, 3, 6).flatten(5, 6).flatten(3, 4)
if USE_PEFT_BACKEND:
# remove `lora_scale` from each PEFT layer
unscale_lora_layers(self, lora_scale)
if not return_dict:
return (output,)
return Transformer2DModelOutput(sample=output)
@@ -915,10 +915,11 @@ class UNet2DConditionModel(
# TODO: this requires sync between CPU and GPU. So try to pass timesteps as tensors if you can
# This would be a good case for the `match` statement (Python 3.10+)
is_mps = sample.device.type == "mps"
is_npu = sample.device.type == "npu"
if isinstance(timestep, float):
dtype = torch.float32 if is_mps else torch.float64
dtype = torch.float32 if (is_mps or is_npu) else torch.float64
else:
dtype = torch.int32 if is_mps else torch.int64
dtype = torch.int32 if (is_mps or is_npu) else torch.int64
timesteps = torch.tensor([timesteps], dtype=dtype, device=sample.device)
elif len(timesteps.shape) == 0:
timesteps = timesteps[None].to(sample.device)
@@ -624,10 +624,11 @@ class UNet3DConditionModel(ModelMixin, ConfigMixin, UNet2DConditionLoadersMixin)
# TODO: this requires sync between CPU and GPU. So try to pass timesteps as tensors if you can
# This would be a good case for the `match` statement (Python 3.10+)
is_mps = sample.device.type == "mps"
is_npu = sample.device.type == "npu"
if isinstance(timestep, float):
dtype = torch.float32 if is_mps else torch.float64
dtype = torch.float32 if (is_mps or is_npu) else torch.float64
else:
dtype = torch.int32 if is_mps else torch.int64
dtype = torch.int32 if (is_mps or is_npu) else torch.int64
timesteps = torch.tensor([timesteps], dtype=dtype, device=sample.device)
elif len(timesteps.shape) == 0:
timesteps = timesteps[None].to(sample.device)
+3 -2
View File
@@ -575,10 +575,11 @@ class I2VGenXLUNet(ModelMixin, ConfigMixin, UNet2DConditionLoadersMixin):
# TODO: this requires sync between CPU and GPU. So try to pass `timesteps` as tensors if you can
# This would be a good case for the `match` statement (Python 3.10+)
is_mps = sample.device.type == "mps"
is_npu = sample.device.type == "npu"
if isinstance(timesteps, float):
dtype = torch.float32 if is_mps else torch.float64
dtype = torch.float32 if (is_mps or is_npu) else torch.float64
else:
dtype = torch.int32 if is_mps else torch.int64
dtype = torch.int32 if (is_mps or is_npu) else torch.int64
timesteps = torch.tensor([timesteps], dtype=dtype, device=sample.device)
elif len(timesteps.shape) == 0:
timesteps = timesteps[None].to(sample.device)
@@ -2114,10 +2114,11 @@ class UNetMotionModel(ModelMixin, ConfigMixin, UNet2DConditionLoadersMixin, Peft
# TODO: this requires sync between CPU and GPU. So try to pass timesteps as tensors if you can
# This would be a good case for the `match` statement (Python 3.10+)
is_mps = sample.device.type == "mps"
is_npu = sample.device.type == "npu"
if isinstance(timestep, float):
dtype = torch.float32 if is_mps else torch.float64
dtype = torch.float32 if (is_mps or is_npu) else torch.float64
else:
dtype = torch.int32 if is_mps else torch.int64
dtype = torch.int32 if (is_mps or is_npu) else torch.int64
timesteps = torch.tensor([timesteps], dtype=dtype, device=sample.device)
elif len(timesteps.shape) == 0:
timesteps = timesteps[None].to(sample.device)
@@ -402,10 +402,11 @@ class UNetSpatioTemporalConditionModel(ModelMixin, ConfigMixin, UNet2DConditionL
# TODO: this requires sync between CPU and GPU. So try to pass timesteps as tensors if you can
# This would be a good case for the `match` statement (Python 3.10+)
is_mps = sample.device.type == "mps"
is_npu = sample.device.type == "npu"
if isinstance(timestep, float):
dtype = torch.float32 if is_mps else torch.float64
dtype = torch.float32 if (is_mps or is_npu) else torch.float64
else:
dtype = torch.int32 if is_mps else torch.int64
dtype = torch.int32 if (is_mps or is_npu) else torch.int64
timesteps = torch.tensor([timesteps], dtype=dtype, device=sample.device)
elif len(timesteps.shape) == 0:
timesteps = timesteps[None].to(sample.device)
+1 -1
View File
@@ -258,7 +258,7 @@ def get_polynomial_decay_schedule_with_warmup(
lr_init = optimizer.defaults["lr"]
if not (lr_init > lr_end):
raise ValueError(f"lr_end ({lr_end}) must be be smaller than initial lr ({lr_init})")
raise ValueError(f"lr_end ({lr_end}) must be smaller than initial lr ({lr_init})")
def lr_lambda(current_step: int):
if current_step < num_warmup_steps:
+2
View File
@@ -154,6 +154,7 @@ else:
"CogVideoXFunControlPipeline",
]
_import_structure["cogview3"] = ["CogView3PlusPipeline"]
_import_structure["consisid"] = ["ConsisIDPipeline"]
_import_structure["controlnet"].extend(
[
"BlipDiffusionControlNetPipeline",
@@ -496,6 +497,7 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
CogVideoXVideoToVideoPipeline,
)
from .cogview3 import CogView3PlusPipeline
from .consisid import ConsisIDPipeline
from .controlnet import (
BlipDiffusionControlNetPipeline,
StableDiffusionControlNetImg2ImgPipeline,
@@ -768,10 +768,11 @@ class AudioLDM2UNet2DConditionModel(ModelMixin, ConfigMixin, UNet2DConditionLoad
# TODO: this requires sync between CPU and GPU. So try to pass timesteps as tensors if you can
# This would be a good case for the `match` statement (Python 3.10+)
is_mps = sample.device.type == "mps"
is_npu = sample.device.type == "npu"
if isinstance(timestep, float):
dtype = torch.float32 if is_mps else torch.float64
dtype = torch.float32 if (is_mps or is_npu) else torch.float64
else:
dtype = torch.int32 if is_mps else torch.int64
dtype = torch.int32 if (is_mps or is_npu) else torch.int64
timesteps = torch.tensor([timesteps], dtype=dtype, device=sample.device)
elif len(timesteps.shape) == 0:
timesteps = timesteps[None].to(sample.device)
@@ -0,0 +1,48 @@
from typing import TYPE_CHECKING
from ...utils import (
DIFFUSERS_SLOW_IMPORT,
OptionalDependencyNotAvailable,
_LazyModule,
get_objects_from_module,
is_torch_available,
is_transformers_available,
)
_dummy_objects = {}
_import_structure = {}
try:
if not (is_transformers_available() and is_torch_available()):
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
from ...utils import dummy_torch_and_transformers_objects # noqa F403
_dummy_objects.update(get_objects_from_module(dummy_torch_and_transformers_objects))
else:
_import_structure["pipeline_consisid"] = ["ConsisIDPipeline"]
if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
try:
if not (is_transformers_available() and is_torch_available()):
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
from ...utils.dummy_torch_and_transformers_objects import *
else:
from .pipeline_consisid import ConsisIDPipeline
else:
import sys
sys.modules[__name__] = _LazyModule(
__name__,
globals()["__file__"],
_import_structure,
module_spec=__spec__,
)
for name, value in _dummy_objects.items():
setattr(sys.modules[__name__], name, value)
@@ -0,0 +1,355 @@
import importlib.util
import os
import cv2
import numpy as np
import torch
from PIL import Image, ImageOps
from torchvision.transforms import InterpolationMode
from torchvision.transforms.functional import normalize, resize
from ...utils import load_image
_insightface_available = importlib.util.find_spec("insightface") is not None
_consisid_eva_clip_available = importlib.util.find_spec("consisid_eva_clip") is not None
_facexlib_available = importlib.util.find_spec("facexlib") is not None
if _insightface_available:
import insightface
from insightface.app import FaceAnalysis
else:
raise ImportError("insightface is not available. Please install it using 'pip install insightface'.")
if _consisid_eva_clip_available:
from consisid_eva_clip import create_model_and_transforms
from consisid_eva_clip.constants import OPENAI_DATASET_MEAN, OPENAI_DATASET_STD
else:
raise ImportError("consisid_eva_clip is not available. Please install it using 'pip install consisid_eva_clip'.")
if _facexlib_available:
from facexlib.parsing import init_parsing_model
from facexlib.utils.face_restoration_helper import FaceRestoreHelper
else:
raise ImportError("facexlib is not available. Please install it using 'pip install facexlib'.")
def resize_numpy_image_long(image, resize_long_edge=768):
"""
Resize the input image to a specified long edge while maintaining aspect ratio.
Args:
image (numpy.ndarray): Input image (H x W x C or H x W).
resize_long_edge (int): The target size for the long edge of the image. Default is 768.
Returns:
numpy.ndarray: Resized image with the long edge matching `resize_long_edge`, while maintaining the aspect
ratio.
"""
h, w = image.shape[:2]
if max(h, w) <= resize_long_edge:
return image
k = resize_long_edge / max(h, w)
h = int(h * k)
w = int(w * k)
image = cv2.resize(image, (w, h), interpolation=cv2.INTER_LANCZOS4)
return image
def img2tensor(imgs, bgr2rgb=True, float32=True):
"""Numpy array to tensor.
Args:
imgs (list[ndarray] | ndarray): Input images.
bgr2rgb (bool): Whether to change bgr to rgb.
float32 (bool): Whether to change to float32.
Returns:
list[tensor] | tensor: Tensor images. If returned results only have
one element, just return tensor.
"""
def _totensor(img, bgr2rgb, float32):
if img.shape[2] == 3 and bgr2rgb:
if img.dtype == "float64":
img = img.astype("float32")
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
img = torch.from_numpy(img.transpose(2, 0, 1))
if float32:
img = img.float()
return img
if isinstance(imgs, list):
return [_totensor(img, bgr2rgb, float32) for img in imgs]
return _totensor(imgs, bgr2rgb, float32)
def to_gray(img):
"""
Converts an RGB image to grayscale by applying the standard luminosity formula.
Args:
img (torch.Tensor): The input image tensor with shape (batch_size, channels, height, width).
The image is expected to be in RGB format (3 channels).
Returns:
torch.Tensor: The grayscale image tensor with shape (batch_size, 3, height, width).
The grayscale values are replicated across all three channels.
"""
x = 0.299 * img[:, 0:1] + 0.587 * img[:, 1:2] + 0.114 * img[:, 2:3]
x = x.repeat(1, 3, 1, 1)
return x
def process_face_embeddings(
face_helper_1,
clip_vision_model,
face_helper_2,
eva_transform_mean,
eva_transform_std,
app,
device,
weight_dtype,
image,
original_id_image=None,
is_align_face=True,
):
"""
Process face embeddings from an image, extracting relevant features such as face embeddings, landmarks, and parsed
face features using a series of face detection and alignment tools.
Args:
face_helper_1: Face helper object (first helper) for alignment and landmark detection.
clip_vision_model: Pre-trained CLIP vision model used for feature extraction.
face_helper_2: Face helper object (second helper) for embedding extraction.
eva_transform_mean: Mean values for image normalization before passing to EVA model.
eva_transform_std: Standard deviation values for image normalization before passing to EVA model.
app: Application instance used for face detection.
device: Device (CPU or GPU) where the computations will be performed.
weight_dtype: Data type of the weights for precision (e.g., `torch.float32`).
image: Input image in RGB format with pixel values in the range [0, 255].
original_id_image: (Optional) Original image for feature extraction if `is_align_face` is False.
is_align_face: Boolean flag indicating whether face alignment should be performed.
Returns:
Tuple:
- id_cond: Concatenated tensor of Ante face embedding and CLIP vision embedding
- id_vit_hidden: Hidden state of the CLIP vision model, a list of tensors.
- return_face_features_image_2: Processed face features image after normalization and parsing.
- face_kps: Keypoints of the face detected in the image.
"""
face_helper_1.clean_all()
image_bgr = cv2.cvtColor(image, cv2.COLOR_RGB2BGR)
# get antelopev2 embedding
face_info = app.get(image_bgr)
if len(face_info) > 0:
face_info = sorted(face_info, key=lambda x: (x["bbox"][2] - x["bbox"][0]) * (x["bbox"][3] - x["bbox"][1]))[
-1
] # only use the maximum face
id_ante_embedding = face_info["embedding"] # (512,)
face_kps = face_info["kps"]
else:
id_ante_embedding = None
face_kps = None
# using facexlib to detect and align face
face_helper_1.read_image(image_bgr)
face_helper_1.get_face_landmarks_5(only_center_face=True)
if face_kps is None:
face_kps = face_helper_1.all_landmarks_5[0]
face_helper_1.align_warp_face()
if len(face_helper_1.cropped_faces) == 0:
raise RuntimeError("facexlib align face fail")
align_face = face_helper_1.cropped_faces[0] # (512, 512, 3) # RGB
# incase insightface didn't detect face
if id_ante_embedding is None:
print("fail to detect face using insightface, extract embedding on align face")
id_ante_embedding = face_helper_2.get_feat(align_face)
id_ante_embedding = torch.from_numpy(id_ante_embedding).to(device, weight_dtype) # torch.Size([512])
if id_ante_embedding.ndim == 1:
id_ante_embedding = id_ante_embedding.unsqueeze(0) # torch.Size([1, 512])
# parsing
if is_align_face:
input = img2tensor(align_face, bgr2rgb=True).unsqueeze(0) / 255.0 # torch.Size([1, 3, 512, 512])
input = input.to(device)
parsing_out = face_helper_1.face_parse(normalize(input, [0.485, 0.456, 0.406], [0.229, 0.224, 0.225]))[0]
parsing_out = parsing_out.argmax(dim=1, keepdim=True) # torch.Size([1, 1, 512, 512])
bg_label = [0, 16, 18, 7, 8, 9, 14, 15]
bg = sum(parsing_out == i for i in bg_label).bool()
white_image = torch.ones_like(input) # torch.Size([1, 3, 512, 512])
# only keep the face features
return_face_features_image = torch.where(bg, white_image, to_gray(input)) # torch.Size([1, 3, 512, 512])
return_face_features_image_2 = torch.where(bg, white_image, input) # torch.Size([1, 3, 512, 512])
else:
original_image_bgr = cv2.cvtColor(original_id_image, cv2.COLOR_RGB2BGR)
input = img2tensor(original_image_bgr, bgr2rgb=True).unsqueeze(0) / 255.0 # torch.Size([1, 3, 512, 512])
input = input.to(device)
return_face_features_image = return_face_features_image_2 = input
# transform img before sending to eva-clip-vit
face_features_image = resize(
return_face_features_image, clip_vision_model.image_size, InterpolationMode.BICUBIC
) # torch.Size([1, 3, 336, 336])
face_features_image = normalize(face_features_image, eva_transform_mean, eva_transform_std)
id_cond_vit, id_vit_hidden = clip_vision_model(
face_features_image.to(weight_dtype), return_all_features=False, return_hidden=True, shuffle=False
) # torch.Size([1, 768]), list(torch.Size([1, 577, 1024]))
id_cond_vit_norm = torch.norm(id_cond_vit, 2, 1, True)
id_cond_vit = torch.div(id_cond_vit, id_cond_vit_norm)
id_cond = torch.cat(
[id_ante_embedding, id_cond_vit], dim=-1
) # torch.Size([1, 512]), torch.Size([1, 768]) -> torch.Size([1, 1280])
return (
id_cond,
id_vit_hidden,
return_face_features_image_2,
face_kps,
) # torch.Size([1, 1280]), list(torch.Size([1, 577, 1024]))
def process_face_embeddings_infer(
face_helper_1,
clip_vision_model,
face_helper_2,
eva_transform_mean,
eva_transform_std,
app,
device,
weight_dtype,
img_file_path,
is_align_face=True,
):
"""
Process face embeddings from an input image for inference, including alignment, feature extraction, and embedding
concatenation.
Args:
face_helper_1: Face helper object (first helper) for alignment and landmark detection.
clip_vision_model: Pre-trained CLIP vision model used for feature extraction.
face_helper_2: Face helper object (second helper) for embedding extraction.
eva_transform_mean: Mean values for image normalization before passing to EVA model.
eva_transform_std: Standard deviation values for image normalization before passing to EVA model.
app: Application instance used for face detection.
device: Device (CPU or GPU) where the computations will be performed.
weight_dtype: Data type of the weights for precision (e.g., `torch.float32`).
img_file_path: Path to the input image file (string) or a numpy array representing an image.
is_align_face: Boolean flag indicating whether face alignment should be performed (default: True).
Returns:
Tuple:
- id_cond: Concatenated tensor of Ante face embedding and CLIP vision embedding.
- id_vit_hidden: Hidden state of the CLIP vision model, a list of tensors.
- image: Processed face image after feature extraction and alignment.
- face_kps: Keypoints of the face detected in the image.
"""
# Load and preprocess the input image
if isinstance(img_file_path, str):
image = np.array(load_image(image=img_file_path).convert("RGB"))
else:
image = np.array(ImageOps.exif_transpose(Image.fromarray(img_file_path)).convert("RGB"))
# Resize image to ensure the longer side is 1024 pixels
image = resize_numpy_image_long(image, 1024)
original_id_image = image
# Process the image to extract face embeddings and related features
id_cond, id_vit_hidden, align_crop_face_image, face_kps = process_face_embeddings(
face_helper_1,
clip_vision_model,
face_helper_2,
eva_transform_mean,
eva_transform_std,
app,
device,
weight_dtype,
image,
original_id_image,
is_align_face,
)
# Convert the aligned cropped face image (torch tensor) to a numpy array
tensor = align_crop_face_image.cpu().detach()
tensor = tensor.squeeze()
tensor = tensor.permute(1, 2, 0)
tensor = tensor.numpy() * 255
tensor = tensor.astype(np.uint8)
image = ImageOps.exif_transpose(Image.fromarray(tensor))
return id_cond, id_vit_hidden, image, face_kps
def prepare_face_models(model_path, device, dtype):
"""
Prepare all face models for the facial recognition task.
Parameters:
- model_path: Path to the directory containing model files.
- device: The device (e.g., 'cuda', 'cpu') where models will be loaded.
- dtype: Data type (e.g., torch.float32) for model inference.
Returns:
- face_helper_1: First face restoration helper.
- face_helper_2: Second face restoration helper.
- face_clip_model: CLIP model for face extraction.
- eva_transform_mean: Mean value for image normalization.
- eva_transform_std: Standard deviation value for image normalization.
- face_main_model: Main face analysis model.
"""
# get helper model
face_helper_1 = FaceRestoreHelper(
upscale_factor=1,
face_size=512,
crop_ratio=(1, 1),
det_model="retinaface_resnet50",
save_ext="png",
device=device,
model_rootpath=os.path.join(model_path, "face_encoder"),
)
face_helper_1.face_parse = None
face_helper_1.face_parse = init_parsing_model(
model_name="bisenet", device=device, model_rootpath=os.path.join(model_path, "face_encoder")
)
face_helper_2 = insightface.model_zoo.get_model(
f"{model_path}/face_encoder/models/antelopev2/glintr100.onnx", providers=["CUDAExecutionProvider"]
)
face_helper_2.prepare(ctx_id=0)
# get local facial extractor part 1
model, _, _ = create_model_and_transforms(
"EVA02-CLIP-L-14-336",
os.path.join(model_path, "face_encoder", "EVA02_CLIP_L_336_psz14_s6B.pt"),
force_custom_clip=True,
)
face_clip_model = model.visual
eva_transform_mean = getattr(face_clip_model, "image_mean", OPENAI_DATASET_MEAN)
eva_transform_std = getattr(face_clip_model, "image_std", OPENAI_DATASET_STD)
if not isinstance(eva_transform_mean, (list, tuple)):
eva_transform_mean = (eva_transform_mean,) * 3
if not isinstance(eva_transform_std, (list, tuple)):
eva_transform_std = (eva_transform_std,) * 3
eva_transform_mean = eva_transform_mean
eva_transform_std = eva_transform_std
# get local facial extractor part 2
face_main_model = FaceAnalysis(
name="antelopev2", root=os.path.join(model_path, "face_encoder"), providers=["CUDAExecutionProvider"]
)
face_main_model.prepare(ctx_id=0, det_size=(640, 640))
# move face models to device
face_helper_1.face_det.eval()
face_helper_1.face_parse.eval()
face_clip_model.eval()
face_helper_1.face_det.to(device)
face_helper_1.face_parse.to(device)
face_clip_model.to(device, dtype=dtype)
return face_helper_1, face_helper_2, face_clip_model, face_main_model, eva_transform_mean, eva_transform_std
@@ -0,0 +1,966 @@
# Copyright 2024 ConsisID Authors and The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import inspect
import math
from typing import Any, Callable, Dict, List, Optional, Tuple, Union
import cv2
import numpy as np
import PIL
import torch
from transformers import T5EncoderModel, T5Tokenizer
from ...callbacks import MultiPipelineCallbacks, PipelineCallback
from ...image_processor import PipelineImageInput
from ...loaders import CogVideoXLoraLoaderMixin
from ...models import AutoencoderKLCogVideoX, ConsisIDTransformer3DModel
from ...models.embeddings import get_3d_rotary_pos_embed
from ...pipelines.pipeline_utils import DiffusionPipeline
from ...schedulers import CogVideoXDPMScheduler
from ...utils import logging, replace_example_docstring
from ...utils.torch_utils import randn_tensor
from ...video_processor import VideoProcessor
from .pipeline_output import ConsisIDPipelineOutput
logger = logging.get_logger(__name__) # pylint: disable=invalid-name
EXAMPLE_DOC_STRING = """
Examples:
```python
>>> import torch
>>> from diffusers import ConsisIDPipeline
>>> from diffusers.pipelines.consisid.consisid_utils import prepare_face_models, process_face_embeddings_infer
>>> from diffusers.utils import export_to_video
>>> from huggingface_hub import snapshot_download
>>> snapshot_download(repo_id="BestWishYsh/ConsisID-preview", local_dir="BestWishYsh/ConsisID-preview")
>>> face_helper_1, face_helper_2, face_clip_model, face_main_model, eva_transform_mean, eva_transform_std = (
... prepare_face_models("BestWishYsh/ConsisID-preview", device="cuda", dtype=torch.bfloat16)
... )
>>> pipe = ConsisIDPipeline.from_pretrained("BestWishYsh/ConsisID-preview", torch_dtype=torch.bfloat16)
>>> pipe.to("cuda")
>>> # ConsisID works well with long and well-described prompts. Make sure the face in the image is clearly visible (e.g., preferably half-body or full-body).
>>> prompt = "The video captures a boy walking along a city street, filmed in black and white on a classic 35mm camera. His expression is thoughtful, his brow slightly furrowed as if he's lost in contemplation. The film grain adds a textured, timeless quality to the image, evoking a sense of nostalgia. Around him, the cityscape is filled with vintage buildings, cobblestone sidewalks, and softly blurred figures passing by, their outlines faint and indistinct. Streetlights cast a gentle glow, while shadows play across the boy's path, adding depth to the scene. The lighting highlights the boy's subtle smile, hinting at a fleeting moment of curiosity. The overall cinematic atmosphere, complete with classic film still aesthetics and dramatic contrasts, gives the scene an evocative and introspective feel."
>>> image = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/consisid/consisid_input.png?download=true"
>>> id_cond, id_vit_hidden, image, face_kps = process_face_embeddings_infer(
... face_helper_1,
... face_clip_model,
... face_helper_2,
... eva_transform_mean,
... eva_transform_std,
... face_main_model,
... "cuda",
... torch.bfloat16,
... image,
... is_align_face=True,
... )
>>> video = pipe(
... image=image,
... prompt=prompt,
... num_inference_steps=50,
... guidance_scale=6.0,
... use_dynamic_cfg=False,
... id_vit_hidden=id_vit_hidden,
... id_cond=id_cond,
... kps_cond=face_kps,
... generator=torch.Generator("cuda").manual_seed(42),
... )
>>> export_to_video(video.frames[0], "output.mp4", fps=8)
```
"""
def draw_kps(image_pil, kps, color_list=[(255, 0, 0), (0, 255, 0), (0, 0, 255), (255, 255, 0), (255, 0, 255)]):
"""
This function draws keypoints and the limbs connecting them on an image.
Parameters:
- image_pil (PIL.Image): Input image as a PIL object.
- kps (list of tuples): A list of keypoints where each keypoint is a tuple of (x, y) coordinates.
- color_list (list of tuples, optional): List of colors (in RGB format) for each keypoint. Default is a set of five
colors.
Returns:
- PIL.Image: Image with the keypoints and limbs drawn.
"""
stickwidth = 4
limbSeq = np.array([[0, 2], [1, 2], [3, 2], [4, 2]])
kps = np.array(kps)
w, h = image_pil.size
out_img = np.zeros([h, w, 3])
for i in range(len(limbSeq)):
index = limbSeq[i]
color = color_list[index[0]]
x = kps[index][:, 0]
y = kps[index][:, 1]
length = ((x[0] - x[1]) ** 2 + (y[0] - y[1]) ** 2) ** 0.5
angle = math.degrees(math.atan2(y[0] - y[1], x[0] - x[1]))
polygon = cv2.ellipse2Poly(
(int(np.mean(x)), int(np.mean(y))), (int(length / 2), stickwidth), int(angle), 0, 360, 1
)
out_img = cv2.fillConvexPoly(out_img.copy(), polygon, color)
out_img = (out_img * 0.6).astype(np.uint8)
for idx_kp, kp in enumerate(kps):
color = color_list[idx_kp]
x, y = kp
out_img = cv2.circle(out_img.copy(), (int(x), int(y)), 10, color, -1)
out_img_pil = PIL.Image.fromarray(out_img.astype(np.uint8))
return out_img_pil
# Similar to diffusers.pipelines.hunyuandit.pipeline_hunyuandit.get_resize_crop_region_for_grid
def get_resize_crop_region_for_grid(src, tgt_width, tgt_height):
"""
This function calculates the resize and crop region for an image to fit a target width and height while preserving
the aspect ratio.
Parameters:
- src (tuple): A tuple containing the source image's height (h) and width (w).
- tgt_width (int): The target width to resize the image.
- tgt_height (int): The target height to resize the image.
Returns:
- tuple: Two tuples representing the crop region:
1. The top-left coordinates of the crop region.
2. The bottom-right coordinates of the crop region.
"""
tw = tgt_width
th = tgt_height
h, w = src
r = h / w
if r > (th / tw):
resize_height = th
resize_width = int(round(th / h * w))
else:
resize_width = tw
resize_height = int(round(tw / w * h))
crop_top = int(round((th - resize_height) / 2.0))
crop_left = int(round((tw - resize_width) / 2.0))
return (crop_top, crop_left), (crop_top + resize_height, crop_left + resize_width)
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.retrieve_timesteps
def retrieve_timesteps(
scheduler,
num_inference_steps: Optional[int] = None,
device: Optional[Union[str, torch.device]] = None,
timesteps: Optional[List[int]] = None,
sigmas: Optional[List[float]] = None,
**kwargs,
):
r"""
Calls the scheduler's `set_timesteps` method and retrieves timesteps from the scheduler after the call. Handles
custom timesteps. Any kwargs will be supplied to `scheduler.set_timesteps`.
Args:
scheduler (`SchedulerMixin`):
The scheduler to get timesteps from.
num_inference_steps (`int`):
The number of diffusion steps used when generating samples with a pre-trained model. If used, `timesteps`
must be `None`.
device (`str` or `torch.device`, *optional*):
The device to which the timesteps should be moved to. If `None`, the timesteps are not moved.
timesteps (`List[int]`, *optional*):
Custom timesteps used to override the timestep spacing strategy of the scheduler. If `timesteps` is passed,
`num_inference_steps` and `sigmas` must be `None`.
sigmas (`List[float]`, *optional*):
Custom sigmas used to override the timestep spacing strategy of the scheduler. If `sigmas` is passed,
`num_inference_steps` and `timesteps` must be `None`.
Returns:
`Tuple[torch.Tensor, int]`: A tuple where the first element is the timestep schedule from the scheduler and the
second element is the number of inference steps.
"""
if timesteps is not None and sigmas is not None:
raise ValueError("Only one of `timesteps` or `sigmas` can be passed. Please choose one to set custom values")
if timesteps is not None:
accepts_timesteps = "timesteps" in set(inspect.signature(scheduler.set_timesteps).parameters.keys())
if not accepts_timesteps:
raise ValueError(
f"The current scheduler class {scheduler.__class__}'s `set_timesteps` does not support custom"
f" timestep schedules. Please check whether you are using the correct scheduler."
)
scheduler.set_timesteps(timesteps=timesteps, device=device, **kwargs)
timesteps = scheduler.timesteps
num_inference_steps = len(timesteps)
elif sigmas is not None:
accept_sigmas = "sigmas" in set(inspect.signature(scheduler.set_timesteps).parameters.keys())
if not accept_sigmas:
raise ValueError(
f"The current scheduler class {scheduler.__class__}'s `set_timesteps` does not support custom"
f" sigmas schedules. Please check whether you are using the correct scheduler."
)
scheduler.set_timesteps(sigmas=sigmas, device=device, **kwargs)
timesteps = scheduler.timesteps
num_inference_steps = len(timesteps)
else:
scheduler.set_timesteps(num_inference_steps, device=device, **kwargs)
timesteps = scheduler.timesteps
return timesteps, num_inference_steps
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion_img2img.retrieve_latents
def retrieve_latents(
encoder_output: torch.Tensor, generator: Optional[torch.Generator] = None, sample_mode: str = "sample"
):
if hasattr(encoder_output, "latent_dist") and sample_mode == "sample":
return encoder_output.latent_dist.sample(generator)
elif hasattr(encoder_output, "latent_dist") and sample_mode == "argmax":
return encoder_output.latent_dist.mode()
elif hasattr(encoder_output, "latents"):
return encoder_output.latents
else:
raise AttributeError("Could not access latents of provided encoder_output")
class ConsisIDPipeline(DiffusionPipeline, CogVideoXLoraLoaderMixin):
r"""
Pipeline for image-to-video generation using ConsisID.
This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the
library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)
Args:
vae ([`AutoencoderKL`]):
Variational Auto-Encoder (VAE) Model to encode and decode videos to and from latent representations.
text_encoder ([`T5EncoderModel`]):
Frozen text-encoder. ConsisID uses
[T5](https://huggingface.co/docs/transformers/model_doc/t5#transformers.T5EncoderModel); specifically the
[t5-v1_1-xxl](https://huggingface.co/PixArt-alpha/PixArt-alpha/tree/main/t5-v1_1-xxl) variant.
tokenizer (`T5Tokenizer`):
Tokenizer of class
[T5Tokenizer](https://huggingface.co/docs/transformers/model_doc/t5#transformers.T5Tokenizer).
transformer ([`ConsisIDTransformer3DModel`]):
A text conditioned `ConsisIDTransformer3DModel` to denoise the encoded video latents.
scheduler ([`SchedulerMixin`]):
A scheduler to be used in combination with `transformer` to denoise the encoded video latents.
"""
_optional_components = []
model_cpu_offload_seq = "text_encoder->transformer->vae"
_callback_tensor_inputs = [
"latents",
"prompt_embeds",
"negative_prompt_embeds",
]
def __init__(
self,
tokenizer: T5Tokenizer,
text_encoder: T5EncoderModel,
vae: AutoencoderKLCogVideoX,
transformer: ConsisIDTransformer3DModel,
scheduler: CogVideoXDPMScheduler,
):
super().__init__()
self.register_modules(
tokenizer=tokenizer,
text_encoder=text_encoder,
vae=vae,
transformer=transformer,
scheduler=scheduler,
)
self.vae_scale_factor_spatial = (
2 ** (len(self.vae.config.block_out_channels) - 1) if hasattr(self, "vae") and self.vae is not None else 8
)
self.vae_scale_factor_temporal = (
self.vae.config.temporal_compression_ratio if hasattr(self, "vae") and self.vae is not None else 4
)
self.vae_scaling_factor_image = (
self.vae.config.scaling_factor if hasattr(self, "vae") and self.vae is not None else 0.7
)
self.video_processor = VideoProcessor(vae_scale_factor=self.vae_scale_factor_spatial)
# Copied from diffusers.pipelines.cogvideo.pipeline_cogvideox.CogVideoXPipeline._get_t5_prompt_embeds
def _get_t5_prompt_embeds(
self,
prompt: Union[str, List[str]] = None,
num_videos_per_prompt: int = 1,
max_sequence_length: int = 226,
device: Optional[torch.device] = None,
dtype: Optional[torch.dtype] = None,
):
device = device or self._execution_device
dtype = dtype or self.text_encoder.dtype
prompt = [prompt] if isinstance(prompt, str) else prompt
batch_size = len(prompt)
text_inputs = self.tokenizer(
prompt,
padding="max_length",
max_length=max_sequence_length,
truncation=True,
add_special_tokens=True,
return_tensors="pt",
)
text_input_ids = text_inputs.input_ids
untruncated_ids = self.tokenizer(prompt, padding="longest", return_tensors="pt").input_ids
if untruncated_ids.shape[-1] >= text_input_ids.shape[-1] and not torch.equal(text_input_ids, untruncated_ids):
removed_text = self.tokenizer.batch_decode(untruncated_ids[:, max_sequence_length - 1 : -1])
logger.warning(
"The following part of your input was truncated because `max_sequence_length` is set to "
f" {max_sequence_length} tokens: {removed_text}"
)
prompt_embeds = self.text_encoder(text_input_ids.to(device))[0]
prompt_embeds = prompt_embeds.to(dtype=dtype, device=device)
# duplicate text embeddings for each generation per prompt, using mps friendly method
_, seq_len, _ = prompt_embeds.shape
prompt_embeds = prompt_embeds.repeat(1, num_videos_per_prompt, 1)
prompt_embeds = prompt_embeds.view(batch_size * num_videos_per_prompt, seq_len, -1)
return prompt_embeds
# Copied from diffusers.pipelines.cogvideo.pipeline_cogvideox.CogVideoXPipeline.encode_prompt
def encode_prompt(
self,
prompt: Union[str, List[str]],
negative_prompt: Optional[Union[str, List[str]]] = None,
do_classifier_free_guidance: bool = True,
num_videos_per_prompt: int = 1,
prompt_embeds: Optional[torch.Tensor] = None,
negative_prompt_embeds: Optional[torch.Tensor] = None,
max_sequence_length: int = 226,
device: Optional[torch.device] = None,
dtype: Optional[torch.dtype] = None,
):
r"""
Encodes the prompt into text encoder hidden states.
Args:
prompt (`str` or `List[str]`, *optional*):
prompt to be encoded
negative_prompt (`str` or `List[str]`, *optional*):
The prompt or prompts not to guide the image generation. If not defined, one has to pass
`negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is
less than `1`).
do_classifier_free_guidance (`bool`, *optional*, defaults to `True`):
Whether to use classifier free guidance or not.
num_videos_per_prompt (`int`, *optional*, defaults to 1):
Number of videos that should be generated per prompt. torch device to place the resulting embeddings on
prompt_embeds (`torch.Tensor`, *optional*):
Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not
provided, text embeddings will be generated from `prompt` input argument.
negative_prompt_embeds (`torch.Tensor`, *optional*):
Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt
weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input
argument.
device: (`torch.device`, *optional*):
torch device
dtype: (`torch.dtype`, *optional*):
torch dtype
"""
device = device or self._execution_device
prompt = [prompt] if isinstance(prompt, str) else prompt
if prompt is not None:
batch_size = len(prompt)
else:
batch_size = prompt_embeds.shape[0]
if prompt_embeds is None:
prompt_embeds = self._get_t5_prompt_embeds(
prompt=prompt,
num_videos_per_prompt=num_videos_per_prompt,
max_sequence_length=max_sequence_length,
device=device,
dtype=dtype,
)
if do_classifier_free_guidance and negative_prompt_embeds is None:
negative_prompt = negative_prompt or ""
negative_prompt = batch_size * [negative_prompt] if isinstance(negative_prompt, str) else negative_prompt
if prompt is not None and type(prompt) is not type(negative_prompt):
raise TypeError(
f"`negative_prompt` should be the same type to `prompt`, but got {type(negative_prompt)} !="
f" {type(prompt)}."
)
elif batch_size != len(negative_prompt):
raise ValueError(
f"`negative_prompt`: {negative_prompt} has batch size {len(negative_prompt)}, but `prompt`:"
f" {prompt} has batch size {batch_size}. Please make sure that passed `negative_prompt` matches"
" the batch size of `prompt`."
)
negative_prompt_embeds = self._get_t5_prompt_embeds(
prompt=negative_prompt,
num_videos_per_prompt=num_videos_per_prompt,
max_sequence_length=max_sequence_length,
device=device,
dtype=dtype,
)
return prompt_embeds, negative_prompt_embeds
def prepare_latents(
self,
image: torch.Tensor,
batch_size: int = 1,
num_channels_latents: int = 16,
num_frames: int = 13,
height: int = 60,
width: int = 90,
dtype: Optional[torch.dtype] = None,
device: Optional[torch.device] = None,
generator: Optional[torch.Generator] = None,
latents: Optional[torch.Tensor] = None,
kps_cond: Optional[torch.Tensor] = None,
):
if isinstance(generator, list) and len(generator) != batch_size:
raise ValueError(
f"You have passed a list of generators of length {len(generator)}, but requested an effective batch"
f" size of {batch_size}. Make sure the batch size matches the length of the generators."
)
num_frames = (num_frames - 1) // self.vae_scale_factor_temporal + 1
shape = (
batch_size,
num_frames,
num_channels_latents,
height // self.vae_scale_factor_spatial,
width // self.vae_scale_factor_spatial,
)
image = image.unsqueeze(2) # [B, C, F, H, W]
if isinstance(generator, list):
image_latents = [
retrieve_latents(self.vae.encode(image[i].unsqueeze(0)), generator[i]) for i in range(batch_size)
]
if kps_cond is not None:
kps_cond = kps_cond.unsqueeze(2)
kps_cond_latents = [
retrieve_latents(self.vae.encode(kps_cond[i].unsqueeze(0)), generator[i])
for i in range(batch_size)
]
else:
image_latents = [retrieve_latents(self.vae.encode(img.unsqueeze(0)), generator) for img in image]
if kps_cond is not None:
kps_cond = kps_cond.unsqueeze(2)
kps_cond_latents = [retrieve_latents(self.vae.encode(img.unsqueeze(0)), generator) for img in kps_cond]
image_latents = torch.cat(image_latents, dim=0).to(dtype).permute(0, 2, 1, 3, 4) # [B, F, C, H, W]
image_latents = self.vae_scaling_factor_image * image_latents
if kps_cond is not None:
kps_cond_latents = torch.cat(kps_cond_latents, dim=0).to(dtype).permute(0, 2, 1, 3, 4) # [B, F, C, H, W]
kps_cond_latents = self.vae_scaling_factor_image * kps_cond_latents
padding_shape = (
batch_size,
num_frames - 2,
num_channels_latents,
height // self.vae_scale_factor_spatial,
width // self.vae_scale_factor_spatial,
)
else:
padding_shape = (
batch_size,
num_frames - 1,
num_channels_latents,
height // self.vae_scale_factor_spatial,
width // self.vae_scale_factor_spatial,
)
latent_padding = torch.zeros(padding_shape, device=device, dtype=dtype)
if kps_cond is not None:
image_latents = torch.cat([image_latents, kps_cond_latents, latent_padding], dim=1)
else:
image_latents = torch.cat([image_latents, latent_padding], dim=1)
if latents is None:
latents = randn_tensor(shape, generator=generator, device=device, dtype=dtype)
else:
latents = latents.to(device)
# scale the initial noise by the standard deviation required by the scheduler
latents = latents * self.scheduler.init_noise_sigma
return latents, image_latents
# Copied from diffusers.pipelines.cogvideo.pipeline_cogvideox.CogVideoXPipeline.decode_latents
def decode_latents(self, latents: torch.Tensor) -> torch.Tensor:
latents = latents.permute(0, 2, 1, 3, 4) # [batch_size, num_channels, num_frames, height, width]
latents = 1 / self.vae_scaling_factor_image * latents
frames = self.vae.decode(latents).sample
return frames
# Copied from diffusers.pipelines.animatediff.pipeline_animatediff_video2video.AnimateDiffVideoToVideoPipeline.get_timesteps
def get_timesteps(self, num_inference_steps, timesteps, strength, device):
# get the original timestep using init_timestep
init_timestep = min(int(num_inference_steps * strength), num_inference_steps)
t_start = max(num_inference_steps - init_timestep, 0)
timesteps = timesteps[t_start * self.scheduler.order :]
return timesteps, num_inference_steps - t_start
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.prepare_extra_step_kwargs
def prepare_extra_step_kwargs(self, generator, eta):
# prepare extra kwargs for the scheduler step, since not all schedulers have the same signature
# eta (η) is only used with the DDIMScheduler, it will be ignored for other schedulers.
# eta corresponds to η in DDIM paper: https://arxiv.org/abs/2010.02502
# and should be between [0, 1]
accepts_eta = "eta" in set(inspect.signature(self.scheduler.step).parameters.keys())
extra_step_kwargs = {}
if accepts_eta:
extra_step_kwargs["eta"] = eta
# check if the scheduler accepts generator
accepts_generator = "generator" in set(inspect.signature(self.scheduler.step).parameters.keys())
if accepts_generator:
extra_step_kwargs["generator"] = generator
return extra_step_kwargs
def check_inputs(
self,
image,
prompt,
height,
width,
negative_prompt,
callback_on_step_end_tensor_inputs,
latents=None,
prompt_embeds=None,
negative_prompt_embeds=None,
):
if (
not isinstance(image, torch.Tensor)
and not isinstance(image, PIL.Image.Image)
and not isinstance(image, list)
):
raise ValueError(
"`image` has to be of type `torch.Tensor` or `PIL.Image.Image` or `List[PIL.Image.Image]` but is"
f" {type(image)}"
)
if height % 8 != 0 or width % 8 != 0:
raise ValueError(f"`height` and `width` have to be divisible by 8 but are {height} and {width}.")
if callback_on_step_end_tensor_inputs is not None and not all(
k in self._callback_tensor_inputs for k in callback_on_step_end_tensor_inputs
):
raise ValueError(
f"`callback_on_step_end_tensor_inputs` has to be in {self._callback_tensor_inputs}, but found {[k for k in callback_on_step_end_tensor_inputs if k not in self._callback_tensor_inputs]}"
)
if prompt is not None and prompt_embeds is not None:
raise ValueError(
f"Cannot forward both `prompt`: {prompt} and `prompt_embeds`: {prompt_embeds}. Please make sure to"
" only forward one of the two."
)
elif prompt is None and prompt_embeds is None:
raise ValueError(
"Provide either `prompt` or `prompt_embeds`. Cannot leave both `prompt` and `prompt_embeds` undefined."
)
elif prompt is not None and (not isinstance(prompt, str) and not isinstance(prompt, list)):
raise ValueError(f"`prompt` has to be of type `str` or `list` but is {type(prompt)}")
if prompt is not None and negative_prompt_embeds is not None:
raise ValueError(
f"Cannot forward both `prompt`: {prompt} and `negative_prompt_embeds`:"
f" {negative_prompt_embeds}. Please make sure to only forward one of the two."
)
if negative_prompt is not None and negative_prompt_embeds is not None:
raise ValueError(
f"Cannot forward both `negative_prompt`: {negative_prompt} and `negative_prompt_embeds`:"
f" {negative_prompt_embeds}. Please make sure to only forward one of the two."
)
if prompt_embeds is not None and negative_prompt_embeds is not None:
if prompt_embeds.shape != negative_prompt_embeds.shape:
raise ValueError(
"`prompt_embeds` and `negative_prompt_embeds` must have the same shape when passed directly, but"
f" got: `prompt_embeds` {prompt_embeds.shape} != `negative_prompt_embeds`"
f" {negative_prompt_embeds.shape}."
)
def _prepare_rotary_positional_embeddings(
self,
height: int,
width: int,
num_frames: int,
device: torch.device,
) -> Tuple[torch.Tensor, torch.Tensor]:
grid_height = height // (self.vae_scale_factor_spatial * self.transformer.config.patch_size)
grid_width = width // (self.vae_scale_factor_spatial * self.transformer.config.patch_size)
base_size_width = self.transformer.config.sample_width // self.transformer.config.patch_size
base_size_height = self.transformer.config.sample_height // self.transformer.config.patch_size
grid_crops_coords = get_resize_crop_region_for_grid(
(grid_height, grid_width), base_size_width, base_size_height
)
freqs_cos, freqs_sin = get_3d_rotary_pos_embed(
embed_dim=self.transformer.config.attention_head_dim,
crops_coords=grid_crops_coords,
grid_size=(grid_height, grid_width),
temporal_size=num_frames,
device=device,
)
return freqs_cos, freqs_sin
@property
def guidance_scale(self):
return self._guidance_scale
@property
def num_timesteps(self):
return self._num_timesteps
@property
def attention_kwargs(self):
return self._attention_kwargs
@property
def interrupt(self):
return self._interrupt
@torch.no_grad()
@replace_example_docstring(EXAMPLE_DOC_STRING)
def __call__(
self,
image: PipelineImageInput,
prompt: Optional[Union[str, List[str]]] = None,
negative_prompt: Optional[Union[str, List[str]]] = None,
height: int = 480,
width: int = 720,
num_frames: int = 49,
num_inference_steps: int = 50,
guidance_scale: float = 6.0,
use_dynamic_cfg: bool = False,
num_videos_per_prompt: int = 1,
eta: float = 0.0,
generator: Optional[Union[torch.Generator, List[torch.Generator]]] = None,
latents: Optional[torch.FloatTensor] = None,
prompt_embeds: Optional[torch.FloatTensor] = None,
negative_prompt_embeds: Optional[torch.FloatTensor] = None,
output_type: str = "pil",
return_dict: bool = True,
attention_kwargs: Optional[Dict[str, Any]] = None,
callback_on_step_end: Optional[
Union[Callable[[int, int, Dict], None], PipelineCallback, MultiPipelineCallbacks]
] = None,
callback_on_step_end_tensor_inputs: List[str] = ["latents"],
max_sequence_length: int = 226,
id_vit_hidden: Optional[torch.Tensor] = None,
id_cond: Optional[torch.Tensor] = None,
kps_cond: Optional[torch.Tensor] = None,
) -> Union[ConsisIDPipelineOutput, Tuple]:
"""
Function invoked when calling the pipeline for generation.
Args:
image (`PipelineImageInput`):
The input image to condition the generation on. Must be an image, a list of images or a `torch.Tensor`.
prompt (`str` or `List[str]`, *optional*):
The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds`.
instead.
negative_prompt (`str` or `List[str]`, *optional*):
The prompt or prompts not to guide the image generation. If not defined, one has to pass
`negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is
less than `1`).
height (`int`, *optional*, defaults to self.transformer.config.sample_height * self.vae_scale_factor_spatial):
The height in pixels of the generated image. This is set to 480 by default for the best results.
width (`int`, *optional*, defaults to self.transformer.config.sample_height * self.vae_scale_factor_spatial):
The width in pixels of the generated image. This is set to 720 by default for the best results.
num_frames (`int`, defaults to `49`):
Number of frames to generate. Must be divisible by self.vae_scale_factor_temporal. Generated video will
contain 1 extra frame because ConsisID is conditioned with (num_seconds * fps + 1) frames where
num_seconds is 6 and fps is 4. However, since videos can be saved at any fps, the only condition that
needs to be satisfied is that of divisibility mentioned above.
num_inference_steps (`int`, *optional*, defaults to 50):
The number of denoising steps. More denoising steps usually lead to a higher quality image at the
expense of slower inference.
guidance_scale (`float`, *optional*, defaults to 6):
Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598).
`guidance_scale` is defined as `w` of equation 2. of [Imagen
Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale >
1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`,
usually at the expense of lower image quality.
use_dynamic_cfg (`bool`, *optional*, defaults to `False`):
If True, dynamically adjusts the guidance scale during inference. This allows the model to use a
progressive guidance scale, improving the balance between text-guided generation and image quality over
the course of the inference steps. Typically, early inference steps use a higher guidance scale for
more faithful image generation, while later steps reduce it for more diverse and natural results.
num_videos_per_prompt (`int`, *optional*, defaults to 1):
The number of videos to generate per prompt.
generator (`torch.Generator` or `List[torch.Generator]`, *optional*):
One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html)
to make generation deterministic.
latents (`torch.FloatTensor`, *optional*):
Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image
generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
tensor will ge generated by sampling using the supplied random `generator`.
prompt_embeds (`torch.FloatTensor`, *optional*):
Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not
provided, text embeddings will be generated from `prompt` input argument.
negative_prompt_embeds (`torch.FloatTensor`, *optional*):
Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt
weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input
argument.
output_type (`str`, *optional*, defaults to `"pil"`):
The output format of the generate image. Choose between
[PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`~pipelines.stable_diffusion_xl.StableDiffusionXLPipelineOutput`] instead
of a plain tuple.
attention_kwargs (`dict`, *optional*):
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
`self.processor` in
[diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
callback_on_step_end (`Callable`, *optional*):
A function that calls at the end of each denoising steps during the inference. The function is called
with the following arguments: `callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int,
callback_kwargs: Dict)`. `callback_kwargs` will include a list of all tensors as specified by
`callback_on_step_end_tensor_inputs`.
callback_on_step_end_tensor_inputs (`List`, *optional*):
The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list
will be passed as `callback_kwargs` argument. You will only be able to include variables listed in the
`._callback_tensor_inputs` attribute of your pipeline class.
max_sequence_length (`int`, defaults to `226`):
Maximum sequence length in encoded prompt. Must be consistent with
`self.transformer.config.max_text_seq_length` otherwise may lead to poor results.
id_vit_hidden (`Optional[torch.Tensor]`, *optional*):
The tensor representing the hidden features extracted from the face model, which are used to condition
the local facial extractor. This is crucial for the model to obtain high-frequency information of the
face. If not provided, the local facial extractor will not run normally.
id_cond (`Optional[torch.Tensor]`, *optional*):
The tensor representing the hidden features extracted from the clip model, which are used to condition
the local facial extractor. This is crucial for the model to edit facial features If not provided, the
local facial extractor will not run normally.
kps_cond (`Optional[torch.Tensor]`, *optional*):
A tensor that determines whether the global facial extractor use keypoint information for conditioning.
If provided, this tensor controls whether facial keypoints such as eyes, nose, and mouth landmarks are
used during the generation process. This helps ensure the model retains more facial low-frequency
information.
Examples:
Returns:
[`~pipelines.consisid.pipeline_output.ConsisIDPipelineOutput`] or `tuple`:
[`~pipelines.consisid.pipeline_output.ConsisIDPipelineOutput`] if `return_dict` is True, otherwise a
`tuple`. When returning a tuple, the first element is a list with the generated images.
"""
if isinstance(callback_on_step_end, (PipelineCallback, MultiPipelineCallbacks)):
callback_on_step_end_tensor_inputs = callback_on_step_end.tensor_inputs
height = height or self.transformer.config.sample_height * self.vae_scale_factor_spatial
width = width or self.transformer.config.sample_width * self.vae_scale_factor_spatial
num_frames = num_frames or self.transformer.config.sample_frames
num_videos_per_prompt = 1
# 1. Check inputs. Raise error if not correct
self.check_inputs(
image=image,
prompt=prompt,
height=height,
width=width,
negative_prompt=negative_prompt,
callback_on_step_end_tensor_inputs=callback_on_step_end_tensor_inputs,
latents=latents,
prompt_embeds=prompt_embeds,
negative_prompt_embeds=negative_prompt_embeds,
)
self._guidance_scale = guidance_scale
self._attention_kwargs = attention_kwargs
self._interrupt = False
# 2. Default call parameters
if prompt is not None and isinstance(prompt, str):
batch_size = 1
elif prompt is not None and isinstance(prompt, list):
batch_size = len(prompt)
else:
batch_size = prompt_embeds.shape[0]
device = self._execution_device
# here `guidance_scale` is defined analog to the guidance weight `w` of equation (2)
# of the Imagen paper: https://arxiv.org/pdf/2205.11487.pdf . `guidance_scale = 1`
# corresponds to doing no classifier free guidance.
do_classifier_free_guidance = guidance_scale > 1.0
# 3. Encode input prompt
prompt_embeds, negative_prompt_embeds = self.encode_prompt(
prompt=prompt,
negative_prompt=negative_prompt,
do_classifier_free_guidance=do_classifier_free_guidance,
num_videos_per_prompt=num_videos_per_prompt,
prompt_embeds=prompt_embeds,
negative_prompt_embeds=negative_prompt_embeds,
max_sequence_length=max_sequence_length,
device=device,
)
if do_classifier_free_guidance:
prompt_embeds = torch.cat([negative_prompt_embeds, prompt_embeds], dim=0)
# 4. Prepare timesteps
timesteps, num_inference_steps = retrieve_timesteps(self.scheduler, num_inference_steps, device)
self._num_timesteps = len(timesteps)
# 5. Prepare latents
is_kps = getattr(self.transformer.config, "is_kps", False)
kps_cond = kps_cond if is_kps else None
if kps_cond is not None:
kps_cond = draw_kps(image, kps_cond)
kps_cond = self.video_processor.preprocess(kps_cond, height=height, width=width).to(
device, dtype=prompt_embeds.dtype
)
image = self.video_processor.preprocess(image, height=height, width=width).to(
device, dtype=prompt_embeds.dtype
)
latent_channels = self.transformer.config.in_channels // 2
latents, image_latents = self.prepare_latents(
image,
batch_size * num_videos_per_prompt,
latent_channels,
num_frames,
height,
width,
prompt_embeds.dtype,
device,
generator,
latents,
kps_cond,
)
# 6. Prepare extra step kwargs. TODO: Logic should ideally just be moved out of the pipeline
extra_step_kwargs = self.prepare_extra_step_kwargs(generator, eta)
# 7. Create rotary embeds if required
image_rotary_emb = (
self._prepare_rotary_positional_embeddings(height, width, latents.size(1), device)
if self.transformer.config.use_rotary_positional_embeddings
else None
)
# 8. Denoising loop
num_warmup_steps = max(len(timesteps) - num_inference_steps * self.scheduler.order, 0)
with self.progress_bar(total=num_inference_steps) as progress_bar:
# for DPM-solver++
old_pred_original_sample = None
timesteps_cpu = timesteps.cpu()
for i, t in enumerate(timesteps):
if self.interrupt:
continue
latent_model_input = torch.cat([latents] * 2) if do_classifier_free_guidance else latents
latent_model_input = self.scheduler.scale_model_input(latent_model_input, t)
latent_image_input = torch.cat([image_latents] * 2) if do_classifier_free_guidance else image_latents
latent_model_input = torch.cat([latent_model_input, latent_image_input], dim=2)
# broadcast to batch dimension in a way that's compatible with ONNX/Core ML
timestep = t.expand(latent_model_input.shape[0])
# predict noise model_output
noise_pred = self.transformer(
hidden_states=latent_model_input,
encoder_hidden_states=prompt_embeds,
timestep=timestep,
image_rotary_emb=image_rotary_emb,
attention_kwargs=attention_kwargs,
return_dict=False,
id_vit_hidden=id_vit_hidden,
id_cond=id_cond,
)[0]
noise_pred = noise_pred.float()
# perform guidance
if use_dynamic_cfg:
self._guidance_scale = 1 + guidance_scale * (
(
1
- math.cos(
math.pi
* ((num_inference_steps - timesteps_cpu[i].item()) / num_inference_steps) ** 5.0
)
)
/ 2
)
if do_classifier_free_guidance:
noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
noise_pred = noise_pred_uncond + self.guidance_scale * (noise_pred_text - noise_pred_uncond)
# compute the previous noisy sample x_t -> x_t-1
if not isinstance(self.scheduler, CogVideoXDPMScheduler):
latents = self.scheduler.step(noise_pred, t, latents, **extra_step_kwargs, return_dict=False)[0]
else:
latents, old_pred_original_sample = self.scheduler.step(
noise_pred,
old_pred_original_sample,
t,
timesteps[i - 1] if i > 0 else None,
latents,
**extra_step_kwargs,
return_dict=False,
)
latents = latents.to(prompt_embeds.dtype)
# call the callback, if provided
if callback_on_step_end is not None:
callback_kwargs = {}
for k in callback_on_step_end_tensor_inputs:
callback_kwargs[k] = locals()[k]
callback_outputs = callback_on_step_end(self, i, t, callback_kwargs)
latents = callback_outputs.pop("latents", latents)
prompt_embeds = callback_outputs.pop("prompt_embeds", prompt_embeds)
negative_prompt_embeds = callback_outputs.pop("negative_prompt_embeds", negative_prompt_embeds)
if i == len(timesteps) - 1 or ((i + 1) > num_warmup_steps and (i + 1) % self.scheduler.order == 0):
progress_bar.update()
if not output_type == "latent":
video = self.decode_latents(latents)
video = self.video_processor.postprocess_video(video=video, output_type=output_type)
else:
video = latents
# Offload all models
self.maybe_free_model_hooks()
if not return_dict:
return (video,)
return ConsisIDPipelineOutput(frames=video)
@@ -0,0 +1,20 @@
from dataclasses import dataclass
import torch
from diffusers.utils import BaseOutput
@dataclass
class ConsisIDPipelineOutput(BaseOutput):
r"""
Output class for ConsisID pipelines.
Args:
frames (`torch.Tensor`, `np.ndarray`, or List[List[PIL.Image.Image]]):
List of video outputs - It can be a nested list of length `batch_size,` with each sub-list containing
denoised PIL image sequences of length `num_frames.` It can also be a NumPy array or Torch tensor of shape
`(batch_size, num_frames, channels, height, width)`.
"""
frames: torch.Tensor
@@ -1163,10 +1163,11 @@ class UNetFlatConditionModel(ModelMixin, ConfigMixin):
# TODO: this requires sync between CPU and GPU. So try to pass timesteps as tensors if you can
# This would be a good case for the `match` statement (Python 3.10+)
is_mps = sample.device.type == "mps"
is_npu = sample.device.type == "npu"
if isinstance(timestep, float):
dtype = torch.float32 if is_mps else torch.float64
dtype = torch.float32 if (is_mps or is_npu) else torch.float64
else:
dtype = torch.int32 if is_mps else torch.int64
dtype = torch.int32 if (is_mps or is_npu) else torch.int64
timesteps = torch.tensor([timesteps], dtype=dtype, device=sample.device)
elif len(timesteps.shape) == 0:
timesteps = timesteps[None].to(sample.device)
+3 -2
View File
@@ -187,10 +187,11 @@ class DiTPipeline(DiffusionPipeline):
# TODO: this requires sync between CPU and GPU. So try to pass timesteps as tensors if you can
# This would be a good case for the `match` statement (Python 3.10+)
is_mps = latent_model_input.device.type == "mps"
is_npu = latent_model_input.device.type == "npu"
if isinstance(timesteps, float):
dtype = torch.float32 if is_mps else torch.float64
dtype = torch.float32 if (is_mps or is_npu) else torch.float64
else:
dtype = torch.int32 if is_mps else torch.int64
dtype = torch.int32 if (is_mps or is_npu) else torch.int64
timesteps = torch.tensor([timesteps], dtype=dtype, device=latent_model_input.device)
elif len(timesteps.shape) == 0:
timesteps = timesteps[None].to(latent_model_input.device)
@@ -798,10 +798,11 @@ class LattePipeline(DiffusionPipeline):
# TODO: this requires sync between CPU and GPU. So try to pass timesteps as tensors if you can
# This would be a good case for the `match` statement (Python 3.10+)
is_mps = latent_model_input.device.type == "mps"
is_npu = latent_model_input.device.type == "npu"
if isinstance(current_timestep, float):
dtype = torch.float32 if is_mps else torch.float64
dtype = torch.float32 if (is_mps or is_npu) else torch.float64
else:
dtype = torch.int32 if is_mps else torch.int64
dtype = torch.int32 if (is_mps or is_npu) else torch.int64
current_timestep = torch.tensor([current_timestep], dtype=dtype, device=latent_model_input.device)
elif len(current_timestep.shape) == 0:
current_timestep = current_timestep[None].to(latent_model_input.device)
@@ -806,10 +806,11 @@ class LuminaText2ImgPipeline(DiffusionPipeline):
# TODO: this requires sync between CPU and GPU. So try to pass timesteps as tensors if you can
# This would be a good case for the `match` statement (Python 3.10+)
is_mps = latent_model_input.device.type == "mps"
is_npu = latent_model_input.device.type == "npu"
if isinstance(current_timestep, float):
dtype = torch.float32 if is_mps else torch.float64
dtype = torch.float32 if (is_mps or is_npu) else torch.float64
else:
dtype = torch.int32 if is_mps else torch.int64
dtype = torch.int32 if (is_mps or is_npu) else torch.int64
current_timestep = torch.tensor(
[current_timestep],
dtype=dtype,
@@ -21,7 +21,7 @@ from transformers import T5EncoderModel, T5TokenizerFast
from ...callbacks import MultiPipelineCallbacks, PipelineCallback
from ...loaders import Mochi1LoraLoaderMixin
from ...models.autoencoders import AutoencoderKL
from ...models.autoencoders import AutoencoderKLMochi
from ...models.transformers import MochiTransformer3DModel
from ...schedulers import FlowMatchEulerDiscreteScheduler
from ...utils import (
@@ -151,8 +151,8 @@ class MochiPipeline(DiffusionPipeline, Mochi1LoraLoaderMixin):
Conditional Transformer architecture to denoise the encoded video latents.
scheduler ([`FlowMatchEulerDiscreteScheduler`]):
A scheduler to be used in combination with `transformer` to denoise the encoded image latents.
vae ([`AutoencoderKL`]):
Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations.
vae ([`AutoencoderKLMochi`]):
Variational Auto-Encoder (VAE) Model to encode and decode videos to and from latent representations.
text_encoder ([`T5EncoderModel`]):
[T5](https://huggingface.co/docs/transformers/en/model_doc/t5#transformers.T5EncoderModel), specifically
the [google/t5-v1_1-xxl](https://huggingface.co/google/t5-v1_1-xxl) variant.
@@ -171,7 +171,7 @@ class MochiPipeline(DiffusionPipeline, Mochi1LoraLoaderMixin):
def __init__(
self,
scheduler: FlowMatchEulerDiscreteScheduler,
vae: AutoencoderKL,
vae: AutoencoderKLMochi,
text_encoder: T5EncoderModel,
tokenizer: T5TokenizerFast,
transformer: MochiTransformer3DModel,
+1 -1
View File
@@ -158,7 +158,7 @@ class PAGMixin:
),
):
r"""
Set the the self-attention layers to apply PAG. Raise ValueError if the input is invalid.
Set the self-attention layers to apply PAG. Raise ValueError if the input is invalid.
Args:
pag_applied_layers (`str` or `List[str]`):
@@ -807,10 +807,11 @@ class PixArtSigmaPAGPipeline(DiffusionPipeline, PAGMixin):
# TODO: this requires sync between CPU and GPU. So try to pass timesteps as tensors if you can
# This would be a good case for the `match` statement (Python 3.10+)
is_mps = latent_model_input.device.type == "mps"
is_npu = latent_model_input.device.type == "npu"
if isinstance(current_timestep, float):
dtype = torch.float32 if is_mps else torch.float64
dtype = torch.float32 if (is_mps or is_npu) else torch.float64
else:
dtype = torch.int32 if is_mps else torch.int64
dtype = torch.int32 if (is_mps or is_npu) else torch.int64
current_timestep = torch.tensor([current_timestep], dtype=dtype, device=latent_model_input.device)
elif len(current_timestep.shape) == 0:
current_timestep = current_timestep[None].to(latent_model_input.device)
@@ -16,6 +16,7 @@ import html
import inspect
import re
import urllib.parse as ul
import warnings
from typing import Callable, Dict, List, Optional, Tuple, Union
import torch
@@ -41,6 +42,7 @@ from ..pixart_alpha.pipeline_pixart_alpha import (
ASPECT_RATIO_1024_BIN,
)
from ..pixart_alpha.pipeline_pixart_sigma import ASPECT_RATIO_2048_BIN
from ..sana.pipeline_sana import ASPECT_RATIO_4096_BIN
from .pag_utils import PAGMixin
@@ -639,7 +641,7 @@ class SanaPAGPipeline(DiffusionPipeline, PAGMixin):
negative_prompt_attention_mask: Optional[torch.Tensor] = None,
output_type: Optional[str] = "pil",
return_dict: bool = True,
clean_caption: bool = True,
clean_caption: bool = False,
use_resolution_binning: bool = True,
callback_on_step_end: Optional[Callable[[int, int, Dict], None]] = None,
callback_on_step_end_tensor_inputs: List[str] = ["latents"],
@@ -755,7 +757,9 @@ class SanaPAGPipeline(DiffusionPipeline, PAGMixin):
callback_on_step_end_tensor_inputs = callback_on_step_end.tensor_inputs
if use_resolution_binning:
if self.transformer.config.sample_size == 64:
if self.transformer.config.sample_size == 128:
aspect_ratio_bin = ASPECT_RATIO_4096_BIN
elif self.transformer.config.sample_size == 64:
aspect_ratio_bin = ASPECT_RATIO_2048_BIN
elif self.transformer.config.sample_size == 32:
aspect_ratio_bin = ASPECT_RATIO_1024_BIN
@@ -912,7 +916,14 @@ class SanaPAGPipeline(DiffusionPipeline, PAGMixin):
image = latents
else:
latents = latents.to(self.vae.dtype)
image = self.vae.decode(latents / self.vae.config.scaling_factor, return_dict=False)[0]
try:
image = self.vae.decode(latents / self.vae.config.scaling_factor, return_dict=False)[0]
except torch.cuda.OutOfMemoryError as e:
warnings.warn(
f"{e}. \n"
f"Try to use VAE tiling for large images. For example: \n"
f"pipe.vae.enable_tiling(tile_sample_min_width=512, tile_sample_min_height=512)"
)
if use_resolution_binning:
image = self.image_processor.resize_and_crop_tensor(image, orig_width, orig_height)
@@ -12,19 +12,19 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import importlib
import os
import re
import warnings
from pathlib import Path
from typing import Any, Dict, List, Optional, Union
from typing import Any, Callable, Dict, List, Optional, Union
import requests
import torch
from huggingface_hub import ModelCard, model_info
from huggingface_hub.utils import validate_hf_hub_args
from huggingface_hub import DDUFEntry, ModelCard, model_info, snapshot_download
from huggingface_hub.utils import OfflineModeIsEnabled, validate_hf_hub_args
from packaging import version
from requests.exceptions import HTTPError
from .. import __version__
from ..utils import (
@@ -38,14 +38,16 @@ from ..utils import (
is_accelerate_available,
is_peft_available,
is_transformers_available,
is_transformers_version,
logging,
)
from ..utils.torch_utils import is_compiled_module
from .transformers_loading_utils import _load_tokenizer_from_dduf, _load_transformers_model_from_dduf
if is_transformers_available():
import transformers
from transformers import PreTrainedModel
from transformers import PreTrainedModel, PreTrainedTokenizerBase
from transformers.utils import FLAX_WEIGHTS_NAME as TRANSFORMERS_FLAX_WEIGHTS_NAME
from transformers.utils import SAFE_WEIGHTS_NAME as TRANSFORMERS_SAFE_WEIGHTS_NAME
from transformers.utils import WEIGHTS_NAME as TRANSFORMERS_WEIGHTS_NAME
@@ -627,6 +629,7 @@ def load_sub_model(
low_cpu_mem_usage: bool,
cached_folder: Union[str, os.PathLike],
use_safetensors: bool,
dduf_entries: Optional[Dict[str, DDUFEntry]],
):
"""Helper method to load the module `name` from `library_name` and `class_name`"""
@@ -663,7 +666,7 @@ def load_sub_model(
f" any of the loading methods defined in {ALL_IMPORTABLE_CLASSES}."
)
load_method = getattr(class_obj, load_method_name)
load_method = _get_load_method(class_obj, load_method_name, is_dduf=dduf_entries is not None)
# add kwargs to loading method
diffusers_module = importlib.import_module(__name__.split(".")[0])
@@ -721,7 +724,10 @@ def load_sub_model(
loading_kwargs["low_cpu_mem_usage"] = False
# check if the module is in a subdirectory
if os.path.isdir(os.path.join(cached_folder, name)):
if dduf_entries:
loading_kwargs["dduf_entries"] = dduf_entries
loaded_sub_model = load_method(name, **loading_kwargs)
elif os.path.isdir(os.path.join(cached_folder, name)):
loaded_sub_model = load_method(os.path.join(cached_folder, name), **loading_kwargs)
else:
# else load from the root directory
@@ -746,6 +752,22 @@ def load_sub_model(
return loaded_sub_model
def _get_load_method(class_obj: object, load_method_name: str, is_dduf: bool) -> Callable:
"""
Return the method to load the sub model.
In practice, this method will return the `"from_pretrained"` (or `load_method_name`) method of the class object
except if loading from a DDUF checkpoint. In that case, transformers models and tokenizers have a specific loading
method that we need to use.
"""
if is_dduf:
if issubclass(class_obj, PreTrainedTokenizerBase):
return lambda *args, **kwargs: _load_tokenizer_from_dduf(class_obj, *args, **kwargs)
if issubclass(class_obj, PreTrainedModel):
return lambda *args, **kwargs: _load_transformers_model_from_dduf(class_obj, *args, **kwargs)
return getattr(class_obj, load_method_name)
def _fetch_class_library_tuple(module):
# import it here to avoid circular import
diffusers_module = importlib.import_module(__name__.split(".")[0])
@@ -968,3 +990,70 @@ def _get_ignore_patterns(
)
return ignore_patterns
def _download_dduf_file(
pretrained_model_name: str,
dduf_file: str,
pipeline_class_name: str,
cache_dir: str,
proxies: str,
local_files_only: bool,
token: str,
revision: str,
):
model_info_call_error = None
if not local_files_only:
try:
info = model_info(pretrained_model_name, token=token, revision=revision)
except (HTTPError, OfflineModeIsEnabled, requests.ConnectionError) as e:
logger.warning(f"Couldn't connect to the Hub: {e}.\nWill try to load from local cache.")
local_files_only = True
model_info_call_error = e # save error to reraise it if model is not cached locally
if (
not local_files_only
and dduf_file is not None
and dduf_file not in (sibling.rfilename for sibling in info.siblings)
):
raise ValueError(f"Requested {dduf_file} file is not available in {pretrained_model_name}.")
try:
user_agent = {"pipeline_class": pipeline_class_name, "dduf": True}
cached_folder = snapshot_download(
pretrained_model_name,
cache_dir=cache_dir,
proxies=proxies,
local_files_only=local_files_only,
token=token,
revision=revision,
allow_patterns=[dduf_file],
user_agent=user_agent,
)
return cached_folder
except FileNotFoundError:
# Means we tried to load pipeline with `local_files_only=True` but the files have not been found in local cache.
# This can happen in two cases:
# 1. If the user passed `local_files_only=True` => we raise the error directly
# 2. If we forced `local_files_only=True` when `model_info` failed => we raise the initial error
if model_info_call_error is None:
# 1. user passed `local_files_only=True`
raise
else:
# 2. we forced `local_files_only=True` when `model_info` failed
raise EnvironmentError(
f"Cannot load model {pretrained_model_name}: model is not cached locally and an error occurred"
" while trying to fetch metadata from the Hub. Please check out the root cause in the stacktrace"
" above."
) from model_info_call_error
def _maybe_raise_error_for_incorrect_transformers(config_dict):
has_transformers_component = False
for k in config_dict:
if isinstance(config_dict[k], list):
has_transformers_component = config_dict[k][0] == "transformers"
if has_transformers_component:
break
if has_transformers_component and not is_transformers_version(">", "4.47.1"):
raise ValueError("Please upgrade your `transformers` installation to the latest version to use DDUF.")
+47 -3
View File
@@ -29,10 +29,12 @@ import PIL.Image
import requests
import torch
from huggingface_hub import (
DDUFEntry,
ModelCard,
create_repo,
hf_hub_download,
model_info,
read_dduf_file,
snapshot_download,
)
from huggingface_hub.utils import OfflineModeIsEnabled, validate_hf_hub_args
@@ -72,6 +74,7 @@ from .pipeline_loading_utils import (
CONNECTED_PIPES_KEYS,
CUSTOM_PIPELINE_FILE_NAME,
LOADABLE_CLASSES,
_download_dduf_file,
_fetch_class_library_tuple,
_get_custom_components_and_folders,
_get_custom_pipeline_class,
@@ -79,6 +82,7 @@ from .pipeline_loading_utils import (
_get_ignore_patterns,
_get_pipeline_class,
_identify_model_variants,
_maybe_raise_error_for_incorrect_transformers,
_maybe_raise_warning_for_inpainting,
_resolve_custom_pipeline_and_cls,
_unwrap_model,
@@ -218,6 +222,7 @@ class DiffusionPipeline(ConfigMixin, PushToHubMixin):
Whether or not to push your model to the Hugging Face model hub after saving it. You can specify the
repository you want to push to with `repo_id` (will default to the name of `save_directory` in your
namespace).
kwargs (`Dict[str, Any]`, *optional*):
Additional keyword arguments passed along to the [`~utils.PushToHubMixin.push_to_hub`] method.
"""
@@ -531,6 +536,7 @@ class DiffusionPipeline(ConfigMixin, PushToHubMixin):
- A path to a *directory* (for example `./my_pipeline_directory/`) containing pipeline weights
saved using
[`~DiffusionPipeline.save_pretrained`].
- A path to a *directory* (for example `./my_pipeline_directory/`) containing a dduf file
torch_dtype (`str` or `torch.dtype`, *optional*):
Override the default `torch.dtype` and load the model with another dtype. If "auto" is passed, the
dtype is automatically derived from the model's weights.
@@ -625,6 +631,8 @@ class DiffusionPipeline(ConfigMixin, PushToHubMixin):
variant (`str`, *optional*):
Load weights from a specified variant filename such as `"fp16"` or `"ema"`. This is ignored when
loading `from_flax`.
dduf_file(`str`, *optional*):
Load weights from the specified dduf file.
<Tip>
@@ -674,6 +682,7 @@ class DiffusionPipeline(ConfigMixin, PushToHubMixin):
offload_state_dict = kwargs.pop("offload_state_dict", False)
low_cpu_mem_usage = kwargs.pop("low_cpu_mem_usage", _LOW_CPU_MEM_USAGE_DEFAULT)
variant = kwargs.pop("variant", None)
dduf_file = kwargs.pop("dduf_file", None)
use_safetensors = kwargs.pop("use_safetensors", None)
use_onnx = kwargs.pop("use_onnx", None)
load_connected_pipeline = kwargs.pop("load_connected_pipeline", False)
@@ -722,6 +731,12 @@ class DiffusionPipeline(ConfigMixin, PushToHubMixin):
" dispatching. Please make sure to set `low_cpu_mem_usage=True`."
)
if dduf_file:
if custom_pipeline:
raise NotImplementedError("Custom pipelines are not supported with DDUF at the moment.")
if load_connected_pipeline:
raise NotImplementedError("Connected pipelines are not supported with DDUF at the moment.")
# 1. Download the checkpoints and configs
# use snapshot download here to get it working from from_pretrained
if not os.path.isdir(pretrained_model_name_or_path):
@@ -744,6 +759,7 @@ class DiffusionPipeline(ConfigMixin, PushToHubMixin):
custom_pipeline=custom_pipeline,
custom_revision=custom_revision,
variant=variant,
dduf_file=dduf_file,
load_connected_pipeline=load_connected_pipeline,
**kwargs,
)
@@ -765,7 +781,17 @@ class DiffusionPipeline(ConfigMixin, PushToHubMixin):
)
logger.warning(warn_msg)
config_dict = cls.load_config(cached_folder)
dduf_entries = None
if dduf_file:
dduf_file_path = os.path.join(cached_folder, dduf_file)
dduf_entries = read_dduf_file(dduf_file_path)
# The reader contains already all the files needed, no need to check it again
cached_folder = ""
config_dict = cls.load_config(cached_folder, dduf_entries=dduf_entries)
if dduf_file:
_maybe_raise_error_for_incorrect_transformers(config_dict)
# pop out "_ignore_files" as it is only needed for download
config_dict.pop("_ignore_files", None)
@@ -943,6 +969,7 @@ class DiffusionPipeline(ConfigMixin, PushToHubMixin):
low_cpu_mem_usage=low_cpu_mem_usage,
cached_folder=cached_folder,
use_safetensors=use_safetensors,
dduf_entries=dduf_entries,
)
logger.info(
f"Loaded {name} as {class_name} from `{name}` subfolder of {pretrained_model_name_or_path}."
@@ -1256,6 +1283,8 @@ class DiffusionPipeline(ConfigMixin, PushToHubMixin):
variant (`str`, *optional*):
Load weights from a specified variant filename such as `"fp16"` or `"ema"`. This is ignored when
loading `from_flax`.
dduf_file(`str`, *optional*):
Load weights from the specified DDUF file.
use_safetensors (`bool`, *optional*, defaults to `None`):
If set to `None`, the safetensors weights are downloaded if they're available **and** if the
safetensors library is installed. If set to `True`, the model is forcibly loaded from safetensors
@@ -1296,6 +1325,23 @@ class DiffusionPipeline(ConfigMixin, PushToHubMixin):
use_onnx = kwargs.pop("use_onnx", None)
load_connected_pipeline = kwargs.pop("load_connected_pipeline", False)
trust_remote_code = kwargs.pop("trust_remote_code", False)
dduf_file: Optional[Dict[str, DDUFEntry]] = kwargs.pop("dduf_file", None)
if dduf_file:
if custom_pipeline:
raise NotImplementedError("Custom pipelines are not supported with DDUF at the moment.")
if load_connected_pipeline:
raise NotImplementedError("Connected pipelines are not supported with DDUF at the moment.")
return _download_dduf_file(
pretrained_model_name=pretrained_model_name,
dduf_file=dduf_file,
pipeline_class_name=cls.__name__,
cache_dir=cache_dir,
proxies=proxies,
local_files_only=local_files_only,
token=token,
revision=revision,
)
allow_pickle = False
if use_safetensors is None:
@@ -1375,7 +1421,6 @@ class DiffusionPipeline(ConfigMixin, PushToHubMixin):
allow_patterns += [f"{custom_pipeline}.py"] if f"{custom_pipeline}.py" in filenames else []
# also allow downloading config.json files with the model
allow_patterns += [os.path.join(k, "config.json") for k in model_folder_names]
allow_patterns += [
SCHEDULER_CONFIG_NAME,
CONFIG_NAME,
@@ -1471,7 +1516,6 @@ class DiffusionPipeline(ConfigMixin, PushToHubMixin):
user_agent=user_agent,
)
# retrieve pipeline class from local file
cls_name = cls.load_config(os.path.join(cached_folder, "model_index.json")).get("_class_name", None)
cls_name = cls_name[4:] if isinstance(cls_name, str) and cls_name.startswith("Flax") else cls_name
@@ -907,10 +907,11 @@ class PixArtAlphaPipeline(DiffusionPipeline):
# TODO: this requires sync between CPU and GPU. So try to pass timesteps as tensors if you can
# This would be a good case for the `match` statement (Python 3.10+)
is_mps = latent_model_input.device.type == "mps"
is_npu = latent_model_input.device.type == "npu"
if isinstance(current_timestep, float):
dtype = torch.float32 if is_mps else torch.float64
dtype = torch.float32 if (is_mps or is_npu) else torch.float64
else:
dtype = torch.int32 if is_mps else torch.int64
dtype = torch.int32 if (is_mps or is_npu) else torch.int64
current_timestep = torch.tensor([current_timestep], dtype=dtype, device=latent_model_input.device)
elif len(current_timestep.shape) == 0:
current_timestep = current_timestep[None].to(latent_model_input.device)
@@ -822,10 +822,11 @@ class PixArtSigmaPipeline(DiffusionPipeline):
# TODO: this requires sync between CPU and GPU. So try to pass timesteps as tensors if you can
# This would be a good case for the `match` statement (Python 3.10+)
is_mps = latent_model_input.device.type == "mps"
is_npu = latent_model_input.device.type == "npu"
if isinstance(current_timestep, float):
dtype = torch.float32 if is_mps else torch.float64
dtype = torch.float32 if (is_mps or is_npu) else torch.float64
else:
dtype = torch.int32 if is_mps else torch.int64
dtype = torch.int32 if (is_mps or is_npu) else torch.int64
current_timestep = torch.tensor([current_timestep], dtype=dtype, device=latent_model_input.device)
elif len(current_timestep.shape) == 0:
current_timestep = current_timestep[None].to(latent_model_input.device)
@@ -16,6 +16,7 @@ import html
import inspect
import re
import urllib.parse as ul
import warnings
from typing import Any, Callable, Dict, List, Optional, Tuple, Union
import torch
@@ -953,7 +954,14 @@ class SanaPipeline(DiffusionPipeline, SanaLoraLoaderMixin):
image = latents
else:
latents = latents.to(self.vae.dtype)
image = self.vae.decode(latents / self.vae.config.scaling_factor, return_dict=False)[0]
try:
image = self.vae.decode(latents / self.vae.config.scaling_factor, return_dict=False)[0]
except torch.cuda.OutOfMemoryError as e:
warnings.warn(
f"{e}. \n"
f"Try to use VAE tiling for large images. For example: \n"
f"pipe.vae.enable_tiling(tile_sample_min_width=512, tile_sample_min_height=512)"
)
if use_resolution_binning:
image = self.image_processor.resize_and_crop_tensor(image, orig_width, orig_height)
@@ -18,14 +18,16 @@ from typing import Any, Callable, Dict, List, Optional, Union
import PIL.Image
import torch
from transformers import (
BaseImageProcessor,
CLIPTextModelWithProjection,
CLIPTokenizer,
PreTrainedModel,
T5EncoderModel,
T5TokenizerFast,
)
from ...image_processor import PipelineImageInput, VaeImageProcessor
from ...loaders import FromSingleFileMixin, SD3LoraLoaderMixin
from ...loaders import FromSingleFileMixin, SD3IPAdapterMixin, SD3LoraLoaderMixin
from ...models.autoencoders import AutoencoderKL
from ...models.transformers import SD3Transformer2DModel
from ...schedulers import FlowMatchEulerDiscreteScheduler
@@ -163,7 +165,7 @@ def retrieve_timesteps(
return timesteps, num_inference_steps
class StableDiffusion3Img2ImgPipeline(DiffusionPipeline, SD3LoraLoaderMixin, FromSingleFileMixin):
class StableDiffusion3Img2ImgPipeline(DiffusionPipeline, SD3LoraLoaderMixin, FromSingleFileMixin, SD3IPAdapterMixin):
r"""
Args:
transformer ([`SD3Transformer2DModel`]):
@@ -197,8 +199,8 @@ class StableDiffusion3Img2ImgPipeline(DiffusionPipeline, SD3LoraLoaderMixin, Fro
[T5Tokenizer](https://huggingface.co/docs/transformers/model_doc/t5#transformers.T5Tokenizer).
"""
model_cpu_offload_seq = "text_encoder->text_encoder_2->text_encoder_3->transformer->vae"
_optional_components = []
model_cpu_offload_seq = "text_encoder->text_encoder_2->text_encoder_3->image_encoder->transformer->vae"
_optional_components = ["image_encoder", "feature_extractor"]
_callback_tensor_inputs = ["latents", "prompt_embeds", "negative_prompt_embeds", "negative_pooled_prompt_embeds"]
def __init__(
@@ -212,6 +214,8 @@ class StableDiffusion3Img2ImgPipeline(DiffusionPipeline, SD3LoraLoaderMixin, Fro
tokenizer_2: CLIPTokenizer,
text_encoder_3: T5EncoderModel,
tokenizer_3: T5TokenizerFast,
image_encoder: PreTrainedModel = None,
feature_extractor: BaseImageProcessor = None,
):
super().__init__()
@@ -225,6 +229,8 @@ class StableDiffusion3Img2ImgPipeline(DiffusionPipeline, SD3LoraLoaderMixin, Fro
tokenizer_3=tokenizer_3,
transformer=transformer,
scheduler=scheduler,
image_encoder=image_encoder,
feature_extractor=feature_extractor,
)
self.vae_scale_factor = 2 ** (len(self.vae.config.block_out_channels) - 1) if getattr(self, "vae", None) else 8
latent_channels = self.vae.config.latent_channels if getattr(self, "vae", None) else 16
@@ -738,6 +744,84 @@ class StableDiffusion3Img2ImgPipeline(DiffusionPipeline, SD3LoraLoaderMixin, Fro
def interrupt(self):
return self._interrupt
# Copied from diffusers.pipelines.stable_diffusion_3.pipeline_stable_diffusion_3.StableDiffusion3Pipeline.encode_image
def encode_image(self, image: PipelineImageInput, device: torch.device) -> torch.Tensor:
"""Encodes the given image into a feature representation using a pre-trained image encoder.
Args:
image (`PipelineImageInput`):
Input image to be encoded.
device: (`torch.device`):
Torch device.
Returns:
`torch.Tensor`: The encoded image feature representation.
"""
if not isinstance(image, torch.Tensor):
image = self.feature_extractor(image, return_tensors="pt").pixel_values
image = image.to(device=device, dtype=self.dtype)
return self.image_encoder(image, output_hidden_states=True).hidden_states[-2]
# Copied from diffusers.pipelines.stable_diffusion_3.pipeline_stable_diffusion_3.StableDiffusion3Pipeline.prepare_ip_adapter_image_embeds
def prepare_ip_adapter_image_embeds(
self,
ip_adapter_image: Optional[PipelineImageInput] = None,
ip_adapter_image_embeds: Optional[torch.Tensor] = None,
device: Optional[torch.device] = None,
num_images_per_prompt: int = 1,
do_classifier_free_guidance: bool = True,
) -> torch.Tensor:
"""Prepares image embeddings for use in the IP-Adapter.
Either `ip_adapter_image` or `ip_adapter_image_embeds` must be passed.
Args:
ip_adapter_image (`PipelineImageInput`, *optional*):
The input image to extract features from for IP-Adapter.
ip_adapter_image_embeds (`torch.Tensor`, *optional*):
Precomputed image embeddings.
device: (`torch.device`, *optional*):
Torch device.
num_images_per_prompt (`int`, defaults to 1):
Number of images that should be generated per prompt.
do_classifier_free_guidance (`bool`, defaults to True):
Whether to use classifier free guidance or not.
"""
device = device or self._execution_device
if ip_adapter_image_embeds is not None:
if do_classifier_free_guidance:
single_negative_image_embeds, single_image_embeds = ip_adapter_image_embeds.chunk(2)
else:
single_image_embeds = ip_adapter_image_embeds
elif ip_adapter_image is not None:
single_image_embeds = self.encode_image(ip_adapter_image, device)
if do_classifier_free_guidance:
single_negative_image_embeds = torch.zeros_like(single_image_embeds)
else:
raise ValueError("Neither `ip_adapter_image_embeds` or `ip_adapter_image_embeds` were provided.")
image_embeds = torch.cat([single_image_embeds] * num_images_per_prompt, dim=0)
if do_classifier_free_guidance:
negative_image_embeds = torch.cat([single_negative_image_embeds] * num_images_per_prompt, dim=0)
image_embeds = torch.cat([negative_image_embeds, image_embeds], dim=0)
return image_embeds.to(device=device)
# Copied from diffusers.pipelines.stable_diffusion_3.pipeline_stable_diffusion_3.StableDiffusion3Pipeline.enable_sequential_cpu_offload
def enable_sequential_cpu_offload(self, *args, **kwargs):
if self.image_encoder is not None and "image_encoder" not in self._exclude_from_cpu_offload:
logger.warning(
"`pipe.enable_sequential_cpu_offload()` might fail for `image_encoder` if it uses "
"`torch.nn.MultiheadAttention`. You can exclude `image_encoder` from CPU offloading by calling "
"`pipe._exclude_from_cpu_offload.append('image_encoder')` before `pipe.enable_sequential_cpu_offload()`."
)
super().enable_sequential_cpu_offload(*args, **kwargs)
@torch.no_grad()
@replace_example_docstring(EXAMPLE_DOC_STRING)
def __call__(
@@ -763,6 +847,8 @@ class StableDiffusion3Img2ImgPipeline(DiffusionPipeline, SD3LoraLoaderMixin, Fro
pooled_prompt_embeds: Optional[torch.FloatTensor] = None,
negative_pooled_prompt_embeds: Optional[torch.FloatTensor] = None,
output_type: Optional[str] = "pil",
ip_adapter_image: Optional[PipelineImageInput] = None,
ip_adapter_image_embeds: Optional[torch.Tensor] = None,
return_dict: bool = True,
joint_attention_kwargs: Optional[Dict[str, Any]] = None,
clip_skip: Optional[int] = None,
@@ -784,9 +870,9 @@ class StableDiffusion3Img2ImgPipeline(DiffusionPipeline, SD3LoraLoaderMixin, Fro
prompt_3 (`str` or `List[str]`, *optional*):
The prompt or prompts to be sent to `tokenizer_3` and `text_encoder_3`. If not defined, `prompt` is
will be used instead
height (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor):
height (`int`, *optional*, defaults to self.transformer.config.sample_size * self.vae_scale_factor):
The height in pixels of the generated image. This is set to 1024 by default for the best results.
width (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor):
width (`int`, *optional*, defaults to self.transformer.config.sample_size * self.vae_scale_factor):
The width in pixels of the generated image. This is set to 1024 by default for the best results.
num_inference_steps (`int`, *optional*, defaults to 50):
The number of denoising steps. More denoising steps usually lead to a higher quality image at the
@@ -834,6 +920,12 @@ class StableDiffusion3Img2ImgPipeline(DiffusionPipeline, SD3LoraLoaderMixin, Fro
Pre-generated negative pooled text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt
weighting. If not provided, pooled negative_prompt_embeds will be generated from `negative_prompt`
input argument.
ip_adapter_image (`PipelineImageInput`, *optional*):
Optional image input to work with IP Adapters.
ip_adapter_image_embeds (`torch.Tensor`, *optional*):
Pre-generated image embeddings for IP-Adapter. Should be a tensor of shape `(batch_size, num_images,
emb_dim)`. It should contain the negative image embedding if `do_classifier_free_guidance` is set to
`True`. If not provided, embeddings are computed from the `ip_adapter_image` input argument.
output_type (`str`, *optional*, defaults to `"pil"`):
The output format of the generate image. Choose between
[PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
@@ -969,7 +1061,22 @@ class StableDiffusion3Img2ImgPipeline(DiffusionPipeline, SD3LoraLoaderMixin, Fro
generator,
)
# 6. Denoising loop
# 6. Prepare image embeddings
if (ip_adapter_image is not None and self.is_ip_adapter_active) or ip_adapter_image_embeds is not None:
ip_adapter_image_embeds = self.prepare_ip_adapter_image_embeds(
ip_adapter_image,
ip_adapter_image_embeds,
device,
batch_size * num_images_per_prompt,
self.do_classifier_free_guidance,
)
if self.joint_attention_kwargs is None:
self._joint_attention_kwargs = {"ip_adapter_image_embeds": ip_adapter_image_embeds}
else:
self._joint_attention_kwargs.update(ip_adapter_image_embeds=ip_adapter_image_embeds)
# 7. Denoising loop
num_warmup_steps = max(len(timesteps) - num_inference_steps * self.scheduler.order, 0)
self._num_timesteps = len(timesteps)
with self.progress_bar(total=num_inference_steps) as progress_bar:
@@ -13,19 +13,21 @@
# limitations under the License.
import inspect
from typing import Callable, Dict, List, Optional, Union
from typing import Any, Callable, Dict, List, Optional, Union
import torch
from transformers import (
BaseImageProcessor,
CLIPTextModelWithProjection,
CLIPTokenizer,
PreTrainedModel,
T5EncoderModel,
T5TokenizerFast,
)
from ...callbacks import MultiPipelineCallbacks, PipelineCallback
from ...image_processor import PipelineImageInput, VaeImageProcessor
from ...loaders import FromSingleFileMixin, SD3LoraLoaderMixin
from ...loaders import FromSingleFileMixin, SD3IPAdapterMixin, SD3LoraLoaderMixin
from ...models.autoencoders import AutoencoderKL
from ...models.transformers import SD3Transformer2DModel
from ...schedulers import FlowMatchEulerDiscreteScheduler
@@ -162,7 +164,7 @@ def retrieve_timesteps(
return timesteps, num_inference_steps
class StableDiffusion3InpaintPipeline(DiffusionPipeline, SD3LoraLoaderMixin, FromSingleFileMixin):
class StableDiffusion3InpaintPipeline(DiffusionPipeline, SD3LoraLoaderMixin, FromSingleFileMixin, SD3IPAdapterMixin):
r"""
Args:
transformer ([`SD3Transformer2DModel`]):
@@ -194,10 +196,14 @@ class StableDiffusion3InpaintPipeline(DiffusionPipeline, SD3LoraLoaderMixin, Fro
tokenizer_3 (`T5TokenizerFast`):
Tokenizer of class
[T5Tokenizer](https://huggingface.co/docs/transformers/model_doc/t5#transformers.T5Tokenizer).
image_encoder (`PreTrainedModel`, *optional*):
Pre-trained Vision Model for IP Adapter.
feature_extractor (`BaseImageProcessor`, *optional*):
Image processor for IP Adapter.
"""
model_cpu_offload_seq = "text_encoder->text_encoder_2->text_encoder_3->transformer->vae"
_optional_components = []
model_cpu_offload_seq = "text_encoder->text_encoder_2->text_encoder_3->image_encoder->transformer->vae"
_optional_components = ["image_encoder", "feature_extractor"]
_callback_tensor_inputs = ["latents", "prompt_embeds", "negative_prompt_embeds", "negative_pooled_prompt_embeds"]
def __init__(
@@ -211,6 +217,8 @@ class StableDiffusion3InpaintPipeline(DiffusionPipeline, SD3LoraLoaderMixin, Fro
tokenizer_2: CLIPTokenizer,
text_encoder_3: T5EncoderModel,
tokenizer_3: T5TokenizerFast,
image_encoder: PreTrainedModel = None,
feature_extractor: BaseImageProcessor = None,
):
super().__init__()
@@ -224,6 +232,8 @@ class StableDiffusion3InpaintPipeline(DiffusionPipeline, SD3LoraLoaderMixin, Fro
tokenizer_3=tokenizer_3,
transformer=transformer,
scheduler=scheduler,
image_encoder=image_encoder,
feature_extractor=feature_extractor,
)
self.vae_scale_factor = 2 ** (len(self.vae.config.block_out_channels) - 1) if getattr(self, "vae", None) else 8
latent_channels = self.vae.config.latent_channels if getattr(self, "vae", None) else 16
@@ -818,6 +828,10 @@ class StableDiffusion3InpaintPipeline(DiffusionPipeline, SD3LoraLoaderMixin, Fro
def do_classifier_free_guidance(self):
return self._guidance_scale > 1
@property
def joint_attention_kwargs(self):
return self._joint_attention_kwargs
@property
def num_timesteps(self):
return self._num_timesteps
@@ -826,6 +840,84 @@ class StableDiffusion3InpaintPipeline(DiffusionPipeline, SD3LoraLoaderMixin, Fro
def interrupt(self):
return self._interrupt
# Copied from diffusers.pipelines.stable_diffusion_3.pipeline_stable_diffusion_3.StableDiffusion3Pipeline.encode_image
def encode_image(self, image: PipelineImageInput, device: torch.device) -> torch.Tensor:
"""Encodes the given image into a feature representation using a pre-trained image encoder.
Args:
image (`PipelineImageInput`):
Input image to be encoded.
device: (`torch.device`):
Torch device.
Returns:
`torch.Tensor`: The encoded image feature representation.
"""
if not isinstance(image, torch.Tensor):
image = self.feature_extractor(image, return_tensors="pt").pixel_values
image = image.to(device=device, dtype=self.dtype)
return self.image_encoder(image, output_hidden_states=True).hidden_states[-2]
# Copied from diffusers.pipelines.stable_diffusion_3.pipeline_stable_diffusion_3.StableDiffusion3Pipeline.prepare_ip_adapter_image_embeds
def prepare_ip_adapter_image_embeds(
self,
ip_adapter_image: Optional[PipelineImageInput] = None,
ip_adapter_image_embeds: Optional[torch.Tensor] = None,
device: Optional[torch.device] = None,
num_images_per_prompt: int = 1,
do_classifier_free_guidance: bool = True,
) -> torch.Tensor:
"""Prepares image embeddings for use in the IP-Adapter.
Either `ip_adapter_image` or `ip_adapter_image_embeds` must be passed.
Args:
ip_adapter_image (`PipelineImageInput`, *optional*):
The input image to extract features from for IP-Adapter.
ip_adapter_image_embeds (`torch.Tensor`, *optional*):
Precomputed image embeddings.
device: (`torch.device`, *optional*):
Torch device.
num_images_per_prompt (`int`, defaults to 1):
Number of images that should be generated per prompt.
do_classifier_free_guidance (`bool`, defaults to True):
Whether to use classifier free guidance or not.
"""
device = device or self._execution_device
if ip_adapter_image_embeds is not None:
if do_classifier_free_guidance:
single_negative_image_embeds, single_image_embeds = ip_adapter_image_embeds.chunk(2)
else:
single_image_embeds = ip_adapter_image_embeds
elif ip_adapter_image is not None:
single_image_embeds = self.encode_image(ip_adapter_image, device)
if do_classifier_free_guidance:
single_negative_image_embeds = torch.zeros_like(single_image_embeds)
else:
raise ValueError("Neither `ip_adapter_image_embeds` or `ip_adapter_image_embeds` were provided.")
image_embeds = torch.cat([single_image_embeds] * num_images_per_prompt, dim=0)
if do_classifier_free_guidance:
negative_image_embeds = torch.cat([single_negative_image_embeds] * num_images_per_prompt, dim=0)
image_embeds = torch.cat([negative_image_embeds, image_embeds], dim=0)
return image_embeds.to(device=device)
# Copied from diffusers.pipelines.stable_diffusion_3.pipeline_stable_diffusion_3.StableDiffusion3Pipeline.enable_sequential_cpu_offload
def enable_sequential_cpu_offload(self, *args, **kwargs):
if self.image_encoder is not None and "image_encoder" not in self._exclude_from_cpu_offload:
logger.warning(
"`pipe.enable_sequential_cpu_offload()` might fail for `image_encoder` if it uses "
"`torch.nn.MultiheadAttention`. You can exclude `image_encoder` from CPU offloading by calling "
"`pipe._exclude_from_cpu_offload.append('image_encoder')` before `pipe.enable_sequential_cpu_offload()`."
)
super().enable_sequential_cpu_offload(*args, **kwargs)
@torch.no_grad()
@replace_example_docstring(EXAMPLE_DOC_STRING)
def __call__(
@@ -853,8 +945,11 @@ class StableDiffusion3InpaintPipeline(DiffusionPipeline, SD3LoraLoaderMixin, Fro
negative_prompt_embeds: Optional[torch.Tensor] = None,
pooled_prompt_embeds: Optional[torch.Tensor] = None,
negative_pooled_prompt_embeds: Optional[torch.Tensor] = None,
ip_adapter_image: Optional[PipelineImageInput] = None,
ip_adapter_image_embeds: Optional[torch.Tensor] = None,
output_type: Optional[str] = "pil",
return_dict: bool = True,
joint_attention_kwargs: Optional[Dict[str, Any]] = None,
clip_skip: Optional[int] = None,
callback_on_step_end: Optional[Callable[[int, int, Dict], None]] = None,
callback_on_step_end_tensor_inputs: List[str] = ["latents"],
@@ -890,9 +985,9 @@ class StableDiffusion3InpaintPipeline(DiffusionPipeline, SD3LoraLoaderMixin, Fro
mask_image_latent (`torch.Tensor`, `List[torch.Tensor]`):
`Tensor` representing an image batch to mask `image` generated by VAE. If not provided, the mask
latents tensor will ge generated by `mask_image`.
height (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor):
height (`int`, *optional*, defaults to self.transformer.config.sample_size * self.vae_scale_factor):
The height in pixels of the generated image. This is set to 1024 by default for the best results.
width (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor):
width (`int`, *optional*, defaults to self.transformer.config.sample_size * self.vae_scale_factor):
The width in pixels of the generated image. This is set to 1024 by default for the best results.
padding_mask_crop (`int`, *optional*, defaults to `None`):
The size of margin in the crop to be applied to the image and masking. If `None`, no crop is applied to
@@ -953,12 +1048,22 @@ class StableDiffusion3InpaintPipeline(DiffusionPipeline, SD3LoraLoaderMixin, Fro
Pre-generated negative pooled text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt
weighting. If not provided, pooled negative_prompt_embeds will be generated from `negative_prompt`
input argument.
ip_adapter_image (`PipelineImageInput`, *optional*):
Optional image input to work with IP Adapters.
ip_adapter_image_embeds (`torch.Tensor`, *optional*):
Pre-generated image embeddings for IP-Adapter. Should be a tensor of shape `(batch_size, num_images,
emb_dim)`. It should contain the negative image embedding if `do_classifier_free_guidance` is set to
`True`. If not provided, embeddings are computed from the `ip_adapter_image` input argument.
output_type (`str`, *optional*, defaults to `"pil"`):
The output format of the generate image. Choose between
[PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`~pipelines.stable_diffusion_3.StableDiffusion3PipelineOutput`] instead of
a plain tuple.
joint_attention_kwargs (`dict`, *optional*):
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
`self.processor` in
[diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
callback_on_step_end (`Callable`, *optional*):
A function that calls at the end of each denoising steps during the inference. The function is called
with the following arguments: `callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int,
@@ -1006,6 +1111,7 @@ class StableDiffusion3InpaintPipeline(DiffusionPipeline, SD3LoraLoaderMixin, Fro
self._guidance_scale = guidance_scale
self._clip_skip = clip_skip
self._joint_attention_kwargs = joint_attention_kwargs
self._interrupt = False
# 2. Define call parameters
@@ -1160,7 +1266,22 @@ class StableDiffusion3InpaintPipeline(DiffusionPipeline, SD3LoraLoaderMixin, Fro
f"The transformer {self.transformer.__class__} should have 16 input channels or 33 input channels, not {self.transformer.config.in_channels}."
)
# 7. Denoising loop
# 7. Prepare image embeddings
if (ip_adapter_image is not None and self.is_ip_adapter_active) or ip_adapter_image_embeds is not None:
ip_adapter_image_embeds = self.prepare_ip_adapter_image_embeds(
ip_adapter_image,
ip_adapter_image_embeds,
device,
batch_size * num_images_per_prompt,
self.do_classifier_free_guidance,
)
if self.joint_attention_kwargs is None:
self._joint_attention_kwargs = {"ip_adapter_image_embeds": ip_adapter_image_embeds}
else:
self._joint_attention_kwargs.update(ip_adapter_image_embeds=ip_adapter_image_embeds)
# 8. Denoising loop
num_warmup_steps = max(len(timesteps) - num_inference_steps * self.scheduler.order, 0)
self._num_timesteps = len(timesteps)
with self.progress_bar(total=num_inference_steps) as progress_bar:
@@ -1181,6 +1302,7 @@ class StableDiffusion3InpaintPipeline(DiffusionPipeline, SD3LoraLoaderMixin, Fro
timestep=timestep,
encoder_hidden_states=prompt_embeds,
pooled_projections=pooled_prompt_embeds,
joint_attention_kwargs=self.joint_attention_kwargs,
return_dict=False,
)[0]
@@ -0,0 +1,121 @@
# coding=utf-8
# Copyright 2024 The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import contextlib
import os
import tempfile
from typing import TYPE_CHECKING, Dict
from huggingface_hub import DDUFEntry
from tqdm import tqdm
from ..utils import is_safetensors_available, is_transformers_available, is_transformers_version
if TYPE_CHECKING:
from transformers import PreTrainedModel, PreTrainedTokenizer
if is_transformers_available():
from transformers import PreTrainedModel, PreTrainedTokenizer
if is_safetensors_available():
import safetensors.torch
def _load_tokenizer_from_dduf(
cls: "PreTrainedTokenizer", name: str, dduf_entries: Dict[str, DDUFEntry], **kwargs
) -> "PreTrainedTokenizer":
"""
Load a tokenizer from a DDUF archive.
In practice, `transformers` do not provide a way to load a tokenizer from a DDUF archive. This function is a
workaround by extracting the tokenizer files from the DDUF archive and loading the tokenizer from the extracted
files. There is an extra cost of extracting the files, but of limited impact as the tokenizer files are usually
small-ish.
"""
with tempfile.TemporaryDirectory() as tmp_dir:
for entry_name, entry in dduf_entries.items():
if entry_name.startswith(name + "/"):
tmp_entry_path = os.path.join(tmp_dir, *entry_name.split("/"))
# need to create intermediary directory if they don't exist
os.makedirs(os.path.dirname(tmp_entry_path), exist_ok=True)
with open(tmp_entry_path, "wb") as f:
with entry.as_mmap() as mm:
f.write(mm)
return cls.from_pretrained(os.path.dirname(tmp_entry_path), **kwargs)
def _load_transformers_model_from_dduf(
cls: "PreTrainedModel", name: str, dduf_entries: Dict[str, DDUFEntry], **kwargs
) -> "PreTrainedModel":
"""
Load a transformers model from a DDUF archive.
In practice, `transformers` do not provide a way to load a model from a DDUF archive. This function is a workaround
by instantiating a model from the config file and loading the weights from the DDUF archive directly.
"""
config_file = dduf_entries.get(f"{name}/config.json")
if config_file is None:
raise EnvironmentError(
f"Could not find a config.json file for component {name} in DDUF file (contains {dduf_entries.keys()})."
)
generation_config = dduf_entries.get(f"{name}/generation_config.json", None)
weight_files = [
entry
for entry_name, entry in dduf_entries.items()
if entry_name.startswith(f"{name}/") and entry_name.endswith(".safetensors")
]
if not weight_files:
raise EnvironmentError(
f"Could not find any weight file for component {name} in DDUF file (contains {dduf_entries.keys()})."
)
if not is_safetensors_available():
raise EnvironmentError(
"Safetensors is not available, cannot load model from DDUF. Please `pip install safetensors`."
)
if is_transformers_version("<", "4.47.0"):
raise ImportError(
"You need to install `transformers>4.47.0` in order to load a transformers model from a DDUF file. "
"You can install it with: `pip install --upgrade transformers`"
)
with tempfile.TemporaryDirectory() as tmp_dir:
from transformers import AutoConfig, GenerationConfig
tmp_config_file = os.path.join(tmp_dir, "config.json")
with open(tmp_config_file, "w") as f:
f.write(config_file.read_text())
config = AutoConfig.from_pretrained(tmp_config_file)
if generation_config is not None:
tmp_generation_config_file = os.path.join(tmp_dir, "generation_config.json")
with open(tmp_generation_config_file, "w") as f:
f.write(generation_config.read_text())
generation_config = GenerationConfig.from_pretrained(tmp_generation_config_file)
state_dict = {}
with contextlib.ExitStack() as stack:
for entry in tqdm(weight_files, desc="Loading state_dict"): # Loop over safetensors files
# Memory-map the safetensors file
mmap = stack.enter_context(entry.as_mmap())
# Load tensors from the memory-mapped file
tensors = safetensors.torch.load(mmap)
# Update the state dictionary with tensors
state_dict.update(tensors)
return cls.from_pretrained(
pretrained_model_name_or_path=None,
config=config,
generation_config=generation_config,
state_dict=state_dict,
**kwargs,
)
@@ -342,7 +342,7 @@ class HeunDiscreteScheduler(SchedulerMixin, ConfigMixin):
timesteps = torch.from_numpy(timesteps)
timesteps = torch.cat([timesteps[:1], timesteps[1:].repeat_interleave(2)])
self.timesteps = timesteps.to(device=device)
self.timesteps = timesteps.to(device=device, dtype=torch.float32)
# empty dt and derivative
self.prev_derivative = None
@@ -311,7 +311,7 @@ class LMSDiscreteScheduler(SchedulerMixin, ConfigMixin):
sigmas = np.concatenate([sigmas, [0.0]]).astype(np.float32)
self.sigmas = torch.from_numpy(sigmas).to(device=device)
self.timesteps = torch.from_numpy(timesteps).to(device=device)
self.timesteps = torch.from_numpy(timesteps).to(device=device, dtype=torch.float32)
self._step_index = None
self._begin_index = None
self.sigmas = self.sigmas.to("cpu") # to avoid too much CPU/GPU communication
+2 -1
View File
@@ -70,6 +70,7 @@ from .import_utils import (
is_gguf_available,
is_gguf_version,
is_google_colab,
is_hf_hub_version,
is_inflect_available,
is_invisible_watermark_available,
is_k_diffusion_available,
@@ -100,7 +101,7 @@ from .import_utils import (
is_xformers_available,
requires_backends,
)
from .loading_utils import get_module_from_name, load_image, load_video
from .loading_utils import get_module_from_name, get_submodule_by_name, load_image, load_video
from .logging import get_logger
from .outputs import BaseOutput
from .peft_utils import (
+15
View File
@@ -227,6 +227,21 @@ class CogView3PlusTransformer2DModel(metaclass=DummyObject):
requires_backends(cls, ["torch"])
class ConsisIDTransformer3DModel(metaclass=DummyObject):
_backends = ["torch"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
@classmethod
def from_config(cls, *args, **kwargs):
requires_backends(cls, ["torch"])
@classmethod
def from_pretrained(cls, *args, **kwargs):
requires_backends(cls, ["torch"])
class ConsistencyDecoderVAE(metaclass=DummyObject):
_backends = ["torch"]
@@ -362,6 +362,21 @@ class CogView3PlusPipeline(metaclass=DummyObject):
requires_backends(cls, ["torch", "transformers"])
class ConsisIDPipeline(metaclass=DummyObject):
_backends = ["torch", "transformers"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch", "transformers"])
@classmethod
def from_config(cls, *args, **kwargs):
requires_backends(cls, ["torch", "transformers"])
@classmethod
def from_pretrained(cls, *args, **kwargs):
requires_backends(cls, ["torch", "transformers"])
class CycleDiffusionPipeline(metaclass=DummyObject):
_backends = ["torch", "transformers"]
+33 -5
View File
@@ -26,6 +26,7 @@ from typing import Dict, List, Optional, Union
from uuid import uuid4
from huggingface_hub import (
DDUFEntry,
ModelCard,
ModelCardData,
create_repo,
@@ -291,9 +292,26 @@ def _get_model_file(
user_agent: Optional[Union[Dict, str]] = None,
revision: Optional[str] = None,
commit_hash: Optional[str] = None,
dduf_entries: Optional[Dict[str, DDUFEntry]] = None,
):
pretrained_model_name_or_path = str(pretrained_model_name_or_path)
if os.path.isfile(pretrained_model_name_or_path):
if dduf_entries:
if subfolder is not None:
raise ValueError(
"DDUF file only allow for 1 level of directory (e.g transformer/model1/model.safetentors is not allowed). "
"Please check the DDUF structure"
)
model_file = (
weights_name
if pretrained_model_name_or_path == ""
else "/".join([pretrained_model_name_or_path, weights_name])
)
if model_file in dduf_entries:
return model_file
else:
raise EnvironmentError(f"Error no file named {weights_name} found in archive {dduf_entries.keys()}.")
elif os.path.isfile(pretrained_model_name_or_path):
return pretrained_model_name_or_path
elif os.path.isdir(pretrained_model_name_or_path):
if os.path.isfile(os.path.join(pretrained_model_name_or_path, weights_name)):
@@ -419,6 +437,7 @@ def _get_checkpoint_shard_files(
user_agent=None,
revision=None,
subfolder="",
dduf_entries: Optional[Dict[str, DDUFEntry]] = None,
):
"""
For a given model:
@@ -430,11 +449,18 @@ def _get_checkpoint_shard_files(
For the description of each arg, see [`PreTrainedModel.from_pretrained`]. `index_filename` is the full path to the
index (downloaded and cached if `pretrained_model_name_or_path` is a model ID on the Hub).
"""
if not os.path.isfile(index_filename):
raise ValueError(f"Can't find a checkpoint index ({index_filename}) in {pretrained_model_name_or_path}.")
if dduf_entries:
if index_filename not in dduf_entries:
raise ValueError(f"Can't find a checkpoint index ({index_filename}) in {pretrained_model_name_or_path}.")
else:
if not os.path.isfile(index_filename):
raise ValueError(f"Can't find a checkpoint index ({index_filename}) in {pretrained_model_name_or_path}.")
with open(index_filename, "r") as f:
index = json.loads(f.read())
if dduf_entries:
index = json.loads(dduf_entries[index_filename].read_text())
else:
with open(index_filename, "r") as f:
index = json.loads(f.read())
original_shard_filenames = sorted(set(index["weight_map"].values()))
sharded_metadata = index["metadata"]
@@ -448,6 +474,8 @@ def _get_checkpoint_shard_files(
pretrained_model_name_or_path, subfolder=subfolder, original_shard_filenames=original_shard_filenames
)
return shards_path, sharded_metadata
elif dduf_entries:
return shards_path, sharded_metadata
# At this stage pretrained_model_name_or_path is a model identifier on the Hub
allow_patterns = original_shard_filenames
+22
View File
@@ -115,6 +115,13 @@ try:
except importlib_metadata.PackageNotFoundError:
_transformers_available = False
_hf_hub_available = importlib.util.find_spec("huggingface_hub") is not None
try:
_hf_hub_version = importlib_metadata.version("huggingface_hub")
logger.debug(f"Successfully imported huggingface_hub version {_hf_hub_version}")
except importlib_metadata.PackageNotFoundError:
_hf_hub_available = False
_inflect_available = importlib.util.find_spec("inflect") is not None
try:
@@ -767,6 +774,21 @@ def is_transformers_version(operation: str, version: str):
return compare_versions(parse(_transformers_version), operation, version)
def is_hf_hub_version(operation: str, version: str):
"""
Compares the current Hugging Face Hub version to a given reference with an operation.
Args:
operation (`str`):
A string representation of an operator, such as `">"` or `"<="`
version (`str`):
A version string
"""
if not _hf_hub_available:
return False
return compare_versions(parse(_hf_hub_version), operation, version)
def is_accelerate_version(operation: str, version: str):
"""
Compares the current Accelerate version to a given reference with an operation.
+12
View File
@@ -148,3 +148,15 @@ def get_module_from_name(module, tensor_name: str) -> Tuple[Any, str]:
module = new_module
tensor_name = splits[-1]
return module, tensor_name
def get_submodule_by_name(root_module, module_path: str):
current = root_module
parts = module_path.split(".")
for part in parts:
if part.isdigit():
idx = int(part)
current = current[idx] # e.g., for nn.ModuleList or nn.Sequential
else:
current = getattr(current, part)
return current
+12
View File
@@ -478,6 +478,18 @@ def require_bitsandbytes_version_greater(bnb_version):
return decorator
def require_hf_hub_version_greater(hf_hub_version):
def decorator(test_case):
correct_hf_hub_version = version.parse(
version.parse(importlib.metadata.version("huggingface_hub")).base_version
) > version.parse(hf_hub_version)
return unittest.skipUnless(
correct_hf_hub_version, f"Test requires huggingface_hub with the version greater than {hf_hub_version}."
)(test_case)
return decorator
def require_gguf_version_greater_or_equal(gguf_version):
def decorator(test_case):
correct_gguf_version = is_gguf_available() and version.parse(
+1 -1
View File
@@ -67,7 +67,7 @@ class VideoProcessor(VaeImageProcessor):
# ensure the input is a list of videos:
# - if it is a batch of videos (5d torch.Tensor or np.ndarray), it is converted to a list of videos (a list of 4d torch.Tensor or np.ndarray)
# - if it is is a single video, it is convereted to a list of one video.
# - if it is a single video, it is convereted to a list of one video.
if isinstance(video, (np.ndarray, torch.Tensor)) and video.ndim == 5:
video = list(video)
elif isinstance(video, list) and is_valid_image(video[0]) or is_valid_image_imagelist(video):
@@ -33,6 +33,7 @@ class CogVideoXTransformerTests(ModelTesterMixin, unittest.TestCase):
model_class = CogVideoXTransformer3DModel
main_input_name = "hidden_states"
uses_custom_attn_processor = True
model_split_percents = [0.7, 0.7, 0.8]
@property
def dummy_input(self):
@@ -33,6 +33,7 @@ class CogView3PlusTransformerTests(ModelTesterMixin, unittest.TestCase):
model_class = CogView3PlusTransformer2DModel
main_input_name = "hidden_states"
uses_custom_attn_processor = True
model_split_percents = [0.7, 0.6, 0.6]
@property
def dummy_input(self):
@@ -0,0 +1,105 @@
# coding=utf-8
# Copyright 2024 HuggingFace Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import unittest
import torch
from diffusers import ConsisIDTransformer3DModel
from diffusers.utils.testing_utils import (
enable_full_determinism,
torch_device,
)
from ..test_modeling_common import ModelTesterMixin
enable_full_determinism()
class ConsisIDTransformerTests(ModelTesterMixin, unittest.TestCase):
model_class = ConsisIDTransformer3DModel
main_input_name = "hidden_states"
uses_custom_attn_processor = True
@property
def dummy_input(self):
batch_size = 2
num_channels = 4
num_frames = 1
height = 8
width = 8
embedding_dim = 8
sequence_length = 8
hidden_states = torch.randn((batch_size, num_frames, num_channels, height, width)).to(torch_device)
encoder_hidden_states = torch.randn((batch_size, sequence_length, embedding_dim)).to(torch_device)
timestep = torch.randint(0, 1000, size=(batch_size,)).to(torch_device)
id_vit_hidden = [torch.ones([batch_size, 2, 2]).to(torch_device)] * 1
id_cond = torch.ones(batch_size, 2).to(torch_device)
return {
"hidden_states": hidden_states,
"encoder_hidden_states": encoder_hidden_states,
"timestep": timestep,
"id_vit_hidden": id_vit_hidden,
"id_cond": id_cond,
}
@property
def input_shape(self):
return (1, 4, 8, 8)
@property
def output_shape(self):
return (1, 4, 8, 8)
def prepare_init_args_and_inputs_for_common(self):
init_dict = {
"num_attention_heads": 2,
"attention_head_dim": 8,
"in_channels": 4,
"out_channels": 4,
"time_embed_dim": 2,
"text_embed_dim": 8,
"num_layers": 1,
"sample_width": 8,
"sample_height": 8,
"sample_frames": 8,
"patch_size": 2,
"temporal_compression_ratio": 4,
"max_text_seq_length": 8,
"cross_attn_interval": 1,
"is_kps": False,
"is_train_face": True,
"cross_attn_dim_head": 1,
"cross_attn_num_heads": 1,
"LFE_id_dim": 2,
"LFE_vit_dim": 2,
"LFE_depth": 5,
"LFE_dim_head": 8,
"LFE_num_heads": 2,
"LFE_num_id_token": 1,
"LFE_num_querie": 1,
"LFE_output_dim": 10,
"LFE_ff_mult": 1,
"LFE_num_scale": 1,
}
inputs_dict = self.dummy_input
return init_dict, inputs_dict
def test_gradient_checkpointing_is_applied(self):
expected_set = {"ConsisIDTransformer3DModel"}
super().test_gradient_checkpointing_is_applied(expected_set=expected_set)
+33
View File
@@ -14,6 +14,8 @@
import gc
import inspect
import os
import tempfile
import unittest
import numpy as np
@@ -24,7 +26,9 @@ from diffusers import AllegroPipeline, AllegroTransformer3DModel, AutoencoderKLA
from diffusers.utils.testing_utils import (
enable_full_determinism,
numpy_cosine_similarity_distance,
require_hf_hub_version_greater,
require_torch_gpu,
require_transformers_version_greater,
slow,
torch_device,
)
@@ -297,6 +301,35 @@ class AllegroPipelineFastTests(PipelineTesterMixin, unittest.TestCase):
"VAE tiling should not affect the inference results",
)
@require_hf_hub_version_greater("0.26.5")
@require_transformers_version_greater("4.47.1")
def test_save_load_dduf(self):
# reimplement because it needs `enable_tiling()` on the loaded pipe.
from huggingface_hub import export_folder_as_dduf
components = self.get_dummy_components()
pipe = self.pipeline_class(**components)
pipe = pipe.to(torch_device)
pipe.set_progress_bar_config(disable=None)
inputs = self.get_dummy_inputs(device="cpu")
inputs.pop("generator")
inputs["generator"] = torch.manual_seed(0)
pipeline_out = pipe(**inputs)[0].cpu()
with tempfile.TemporaryDirectory() as tmpdir:
dduf_filename = os.path.join(tmpdir, f"{pipe.__class__.__name__.lower()}.dduf")
pipe.save_pretrained(tmpdir, safe_serialization=True)
export_folder_as_dduf(dduf_filename, folder_path=tmpdir)
loaded_pipe = self.pipeline_class.from_pretrained(tmpdir, dduf_file=dduf_filename).to(torch_device)
loaded_pipe.vae.enable_tiling()
inputs["generator"] = torch.manual_seed(0)
loaded_pipeline_out = loaded_pipe(**inputs)[0].cpu()
assert np.allclose(pipeline_out, loaded_pipeline_out)
@slow
@require_torch_gpu
@@ -63,6 +63,8 @@ class AudioLDMPipelineFastTests(PipelineTesterMixin, unittest.TestCase):
]
)
supports_dduf = False
def get_dummy_components(self):
torch.manual_seed(0)
unet = UNet2DConditionModel(
@@ -70,6 +70,8 @@ class AudioLDM2PipelineFastTests(PipelineTesterMixin, unittest.TestCase):
]
)
supports_dduf = False
def get_dummy_components(self):
torch.manual_seed(0)
unet = AudioLDM2UNet2DConditionModel(
@@ -60,6 +60,8 @@ class BlipDiffusionPipelineFastTests(PipelineTesterMixin, unittest.TestCase):
"prompt_reps",
]
supports_dduf = False
def get_dummy_components(self):
torch.manual_seed(0)
text_encoder_config = CLIPTextConfig(
+359
View File
@@ -0,0 +1,359 @@
# Copyright 2024 The HuggingFace Team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import gc
import inspect
import unittest
import numpy as np
import torch
from PIL import Image
from transformers import AutoTokenizer, T5EncoderModel
from diffusers import AutoencoderKLCogVideoX, ConsisIDPipeline, ConsisIDTransformer3DModel, DDIMScheduler
from diffusers.utils import load_image
from diffusers.utils.testing_utils import (
enable_full_determinism,
numpy_cosine_similarity_distance,
require_torch_gpu,
slow,
torch_device,
)
from ..pipeline_params import TEXT_TO_IMAGE_BATCH_PARAMS, TEXT_TO_IMAGE_IMAGE_PARAMS, TEXT_TO_IMAGE_PARAMS
from ..test_pipelines_common import (
PipelineTesterMixin,
to_np,
)
enable_full_determinism()
class ConsisIDPipelineFastTests(PipelineTesterMixin, unittest.TestCase):
pipeline_class = ConsisIDPipeline
params = TEXT_TO_IMAGE_PARAMS - {"cross_attention_kwargs"}
batch_params = TEXT_TO_IMAGE_BATCH_PARAMS.union({"image"})
image_params = TEXT_TO_IMAGE_IMAGE_PARAMS
image_latents_params = TEXT_TO_IMAGE_IMAGE_PARAMS
required_optional_params = frozenset(
[
"num_inference_steps",
"generator",
"latents",
"return_dict",
"callback_on_step_end",
"callback_on_step_end_tensor_inputs",
]
)
test_xformers_attention = False
def get_dummy_components(self):
torch.manual_seed(0)
transformer = ConsisIDTransformer3DModel(
num_attention_heads=2,
attention_head_dim=16,
in_channels=8,
out_channels=4,
time_embed_dim=2,
text_embed_dim=32,
num_layers=1,
sample_width=2,
sample_height=2,
sample_frames=9,
patch_size=2,
temporal_compression_ratio=4,
max_text_seq_length=16,
use_rotary_positional_embeddings=True,
use_learned_positional_embeddings=True,
cross_attn_interval=1,
is_kps=False,
is_train_face=True,
cross_attn_dim_head=1,
cross_attn_num_heads=1,
LFE_id_dim=2,
LFE_vit_dim=2,
LFE_depth=5,
LFE_dim_head=8,
LFE_num_heads=2,
LFE_num_id_token=1,
LFE_num_querie=1,
LFE_output_dim=21,
LFE_ff_mult=1,
LFE_num_scale=1,
)
torch.manual_seed(0)
vae = AutoencoderKLCogVideoX(
in_channels=3,
out_channels=3,
down_block_types=(
"CogVideoXDownBlock3D",
"CogVideoXDownBlock3D",
"CogVideoXDownBlock3D",
"CogVideoXDownBlock3D",
),
up_block_types=(
"CogVideoXUpBlock3D",
"CogVideoXUpBlock3D",
"CogVideoXUpBlock3D",
"CogVideoXUpBlock3D",
),
block_out_channels=(8, 8, 8, 8),
latent_channels=4,
layers_per_block=1,
norm_num_groups=2,
temporal_compression_ratio=4,
)
torch.manual_seed(0)
scheduler = DDIMScheduler()
text_encoder = T5EncoderModel.from_pretrained("hf-internal-testing/tiny-random-t5")
tokenizer = AutoTokenizer.from_pretrained("hf-internal-testing/tiny-random-t5")
components = {
"transformer": transformer,
"vae": vae,
"scheduler": scheduler,
"text_encoder": text_encoder,
"tokenizer": tokenizer,
}
return components
def get_dummy_inputs(self, device, seed=0):
if str(device).startswith("mps"):
generator = torch.manual_seed(seed)
else:
generator = torch.Generator(device=device).manual_seed(seed)
image_height = 16
image_width = 16
image = Image.new("RGB", (image_width, image_height))
id_vit_hidden = [torch.ones([1, 2, 2])] * 1
id_cond = torch.ones(1, 2)
inputs = {
"image": image,
"prompt": "dance monkey",
"negative_prompt": "",
"generator": generator,
"num_inference_steps": 2,
"guidance_scale": 6.0,
"height": image_height,
"width": image_width,
"num_frames": 8,
"max_sequence_length": 16,
"id_vit_hidden": id_vit_hidden,
"id_cond": id_cond,
"output_type": "pt",
}
return inputs
def test_inference(self):
device = "cpu"
components = self.get_dummy_components()
pipe = self.pipeline_class(**components)
pipe.to(device)
pipe.set_progress_bar_config(disable=None)
inputs = self.get_dummy_inputs(device)
video = pipe(**inputs).frames
generated_video = video[0]
self.assertEqual(generated_video.shape, (8, 3, 16, 16))
expected_video = torch.randn(8, 3, 16, 16)
max_diff = np.abs(generated_video - expected_video).max()
self.assertLessEqual(max_diff, 1e10)
def test_callback_inputs(self):
sig = inspect.signature(self.pipeline_class.__call__)
has_callback_tensor_inputs = "callback_on_step_end_tensor_inputs" in sig.parameters
has_callback_step_end = "callback_on_step_end" in sig.parameters
if not (has_callback_tensor_inputs and has_callback_step_end):
return
components = self.get_dummy_components()
pipe = self.pipeline_class(**components)
pipe = pipe.to(torch_device)
pipe.set_progress_bar_config(disable=None)
self.assertTrue(
hasattr(pipe, "_callback_tensor_inputs"),
f" {self.pipeline_class} should have `_callback_tensor_inputs` that defines a list of tensor variables its callback function can use as inputs",
)
def callback_inputs_subset(pipe, i, t, callback_kwargs):
# iterate over callback args
for tensor_name, tensor_value in callback_kwargs.items():
# check that we're only passing in allowed tensor inputs
assert tensor_name in pipe._callback_tensor_inputs
return callback_kwargs
def callback_inputs_all(pipe, i, t, callback_kwargs):
for tensor_name in pipe._callback_tensor_inputs:
assert tensor_name in callback_kwargs
# iterate over callback args
for tensor_name, tensor_value in callback_kwargs.items():
# check that we're only passing in allowed tensor inputs
assert tensor_name in pipe._callback_tensor_inputs
return callback_kwargs
inputs = self.get_dummy_inputs(torch_device)
# Test passing in a subset
inputs["callback_on_step_end"] = callback_inputs_subset
inputs["callback_on_step_end_tensor_inputs"] = ["latents"]
output = pipe(**inputs)[0]
# Test passing in a everything
inputs["callback_on_step_end"] = callback_inputs_all
inputs["callback_on_step_end_tensor_inputs"] = pipe._callback_tensor_inputs
output = pipe(**inputs)[0]
def callback_inputs_change_tensor(pipe, i, t, callback_kwargs):
is_last = i == (pipe.num_timesteps - 1)
if is_last:
callback_kwargs["latents"] = torch.zeros_like(callback_kwargs["latents"])
return callback_kwargs
inputs["callback_on_step_end"] = callback_inputs_change_tensor
inputs["callback_on_step_end_tensor_inputs"] = pipe._callback_tensor_inputs
output = pipe(**inputs)[0]
assert output.abs().sum() < 1e10
def test_inference_batch_single_identical(self):
self._test_inference_batch_single_identical(batch_size=3, expected_max_diff=1e-3)
def test_attention_slicing_forward_pass(
self, test_max_difference=True, test_mean_pixel_difference=True, expected_max_diff=1e-3
):
if not self.test_attention_slicing:
return
components = self.get_dummy_components()
pipe = self.pipeline_class(**components)
for component in pipe.components.values():
if hasattr(component, "set_default_attn_processor"):
component.set_default_attn_processor()
pipe.to(torch_device)
pipe.set_progress_bar_config(disable=None)
generator_device = "cpu"
inputs = self.get_dummy_inputs(generator_device)
output_without_slicing = pipe(**inputs)[0]
pipe.enable_attention_slicing(slice_size=1)
inputs = self.get_dummy_inputs(generator_device)
output_with_slicing1 = pipe(**inputs)[0]
pipe.enable_attention_slicing(slice_size=2)
inputs = self.get_dummy_inputs(generator_device)
output_with_slicing2 = pipe(**inputs)[0]
if test_max_difference:
max_diff1 = np.abs(to_np(output_with_slicing1) - to_np(output_without_slicing)).max()
max_diff2 = np.abs(to_np(output_with_slicing2) - to_np(output_without_slicing)).max()
self.assertLess(
max(max_diff1, max_diff2),
expected_max_diff,
"Attention slicing should not affect the inference results",
)
def test_vae_tiling(self, expected_diff_max: float = 0.4):
generator_device = "cpu"
components = self.get_dummy_components()
# The reason to modify it this way is because ConsisID Transformer limits the generation to resolutions used during initalization.
# This limitation comes from using learned positional embeddings which cannot be generated on-the-fly like sincos or RoPE embeddings.
# See the if-statement on "self.use_learned_positional_embeddings" in diffusers/models/embeddings.py
components["transformer"] = ConsisIDTransformer3DModel.from_config(
components["transformer"].config,
sample_height=16,
sample_width=16,
)
pipe = self.pipeline_class(**components)
pipe.to("cpu")
pipe.set_progress_bar_config(disable=None)
# Without tiling
inputs = self.get_dummy_inputs(generator_device)
inputs["height"] = inputs["width"] = 128
output_without_tiling = pipe(**inputs)[0]
# With tiling
pipe.vae.enable_tiling(
tile_sample_min_height=96,
tile_sample_min_width=96,
tile_overlap_factor_height=1 / 12,
tile_overlap_factor_width=1 / 12,
)
inputs = self.get_dummy_inputs(generator_device)
inputs["height"] = inputs["width"] = 128
output_with_tiling = pipe(**inputs)[0]
self.assertLess(
(to_np(output_without_tiling) - to_np(output_with_tiling)).max(),
expected_diff_max,
"VAE tiling should not affect the inference results",
)
@slow
@require_torch_gpu
class ConsisIDPipelineIntegrationTests(unittest.TestCase):
prompt = "A painting of a squirrel eating a burger."
def setUp(self):
super().setUp()
gc.collect()
torch.cuda.empty_cache()
def tearDown(self):
super().tearDown()
gc.collect()
torch.cuda.empty_cache()
def test_consisid(self):
generator = torch.Generator("cpu").manual_seed(0)
pipe = ConsisIDPipeline.from_pretrained("BestWishYsh/ConsisID-preview", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()
prompt = self.prompt
image = load_image("https://github.com/PKU-YuanGroup/ConsisID/blob/main/asserts/example_images/2.png?raw=true")
id_vit_hidden = [torch.ones([1, 2, 2])] * 1
id_cond = torch.ones(1, 2)
videos = pipe(
image=image,
prompt=prompt,
height=480,
width=720,
num_frames=16,
id_vit_hidden=id_vit_hidden,
id_cond=id_cond,
generator=generator,
num_inference_steps=1,
output_type="pt",
).frames
video = videos[0]
expected_video = torch.randn(1, 16, 480, 720, 3).numpy()
max_diff = numpy_cosine_similarity_distance(video, expected_video)
assert max_diff < 1e-3, f"Max diff is too high. got {video}"
@@ -291,6 +291,8 @@ class StableDiffusionMultiControlNetPipelineFastTests(
batch_params = TEXT_TO_IMAGE_BATCH_PARAMS
image_params = frozenset([]) # TO_DO: add image_params once refactored VaeImageProcessor.preprocess
supports_dduf = False
def get_dummy_components(self):
torch.manual_seed(0)
unet = UNet2DConditionModel(
@@ -523,6 +525,8 @@ class StableDiffusionMultiControlNetOneModelPipelineFastTests(
batch_params = TEXT_TO_IMAGE_BATCH_PARAMS
image_params = frozenset([]) # TO_DO: add image_params once refactored VaeImageProcessor.preprocess
supports_dduf = False
def get_dummy_components(self):
torch.manual_seed(0)
unet = UNet2DConditionModel(
@@ -68,6 +68,8 @@ class BlipDiffusionControlNetPipelineFastTests(PipelineTesterMixin, unittest.Tes
"prompt_reps",
]
supports_dduf = False
def get_dummy_components(self):
torch.manual_seed(0)
text_encoder_config = CLIPTextConfig(
@@ -198,6 +198,8 @@ class StableDiffusionMultiControlNetPipelineFastTests(
batch_params = TEXT_GUIDED_IMAGE_VARIATION_BATCH_PARAMS
image_params = frozenset([]) # TO_DO: add image_params once refactored VaeImageProcessor.preprocess
supports_dduf = False
def get_dummy_components(self):
torch.manual_seed(0)
unet = UNet2DConditionModel(
@@ -257,6 +257,8 @@ class MultiControlNetInpaintPipelineFastTests(
params = TEXT_GUIDED_IMAGE_INPAINTING_PARAMS
batch_params = TEXT_GUIDED_IMAGE_INPAINTING_BATCH_PARAMS
supports_dduf = False
def get_dummy_components(self):
torch.manual_seed(0)
unet = UNet2DConditionModel(
@@ -78,6 +78,8 @@ class ControlNetPipelineSDXLFastTests(
}
)
supports_dduf = False
def get_dummy_components(self):
torch.manual_seed(0)
unet = UNet2DConditionModel(
@@ -487,6 +487,8 @@ class StableDiffusionXLMultiControlNetPipelineFastTests(
batch_params = TEXT_TO_IMAGE_BATCH_PARAMS
image_params = frozenset([]) # TO_DO: add image_params once refactored VaeImageProcessor.preprocess
supports_dduf = False
def get_dummy_components(self):
torch.manual_seed(0)
unet = UNet2DConditionModel(
@@ -692,6 +694,8 @@ class StableDiffusionXLMultiControlNetOneModelPipelineFastTests(
batch_params = TEXT_TO_IMAGE_BATCH_PARAMS
image_params = frozenset([]) # TO_DO: add image_params once refactored VaeImageProcessor.preprocess
supports_dduf = False
def get_dummy_components(self):
torch.manual_seed(0)
unet = UNet2DConditionModel(
+7
View File
@@ -26,7 +26,9 @@ from diffusers.utils.import_utils import is_xformers_available
from diffusers.utils.testing_utils import (
load_numpy,
require_accelerator,
require_hf_hub_version_greater,
require_torch_gpu,
require_transformers_version_greater,
skip_mps,
slow,
torch_device,
@@ -89,6 +91,11 @@ class IFPipelineFastTests(PipelineTesterMixin, IFPipelineTesterMixin, unittest.T
def test_xformers_attention_forwardGenerator_pass(self):
self._test_xformers_attention_forwardGenerator_pass(expected_max_diff=1e-3)
@require_hf_hub_version_greater("0.26.5")
@require_transformers_version_greater("4.47.1")
def test_save_load_dduf(self):
super().test_save_load_dduf(atol=1e-2, rtol=1e-2)
@slow
@require_torch_gpu
@@ -26,7 +26,9 @@ from diffusers.utils.testing_utils import (
floats_tensor,
load_numpy,
require_accelerator,
require_hf_hub_version_greater,
require_torch_gpu,
require_transformers_version_greater,
skip_mps,
slow,
torch_device,
@@ -100,6 +102,11 @@ class IFImg2ImgPipelineFastTests(PipelineTesterMixin, IFPipelineTesterMixin, uni
expected_max_diff=1e-2,
)
@require_hf_hub_version_greater("0.26.5")
@require_transformers_version_greater("4.47.1")
def test_save_load_dduf(self):
super().test_save_load_dduf(atol=1e-2, rtol=1e-2)
@slow
@require_torch_gpu
@@ -26,7 +26,9 @@ from diffusers.utils.testing_utils import (
floats_tensor,
load_numpy,
require_accelerator,
require_hf_hub_version_greater,
require_torch_gpu,
require_transformers_version_greater,
skip_mps,
slow,
torch_device,
@@ -97,6 +99,11 @@ class IFImg2ImgSuperResolutionPipelineFastTests(PipelineTesterMixin, IFPipelineT
expected_max_diff=1e-2,
)
@require_hf_hub_version_greater("0.26.5")
@require_transformers_version_greater("4.47.1")
def test_save_load_dduf(self):
super().test_save_load_dduf(atol=1e-2, rtol=1e-2)
@slow
@require_torch_gpu

Some files were not shown because too many files have changed in this diff Show More