Compare commits

...

75 Commits

Author SHA1 Message Date
Kashif Rasul ccfaf0b75f initial script copied from the dpo trainer 2025-02-11 16:19:26 +01:00
hlky 7fb481f840 Add Self type hint to ModelMixin's from_pretrained (#10742) 2025-02-10 09:17:57 -10:00
Sayak Paul 9f5ad1db41 [LoRA] fix peft state dict parsing (#10532)
* fix peft state dict parsing

* updates
2025-02-10 18:47:20 +05:30
hlky 464374fb87 EDMEulerScheduler accept sigmas, add final_sigmas_type (#10734) 2025-02-07 06:53:52 +00:00
hlky d43ce14e2d Quantized Flux with IP-Adapter (#10728) 2025-02-06 07:02:36 -10:00
Leo Jiang cd0a4a82cf [bugfix] NPU Adaption for Sana (#10724)
* NPU Adaption for Sanna

* [bugfix]NPU Adaption for Sanna

---------

Co-authored-by: J石页 <jiangshuo9@h-partners.com>
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
2025-02-06 19:29:58 +05:30
suzukimain 145522cbb7 [Community] Enhanced Model Search (#10417)
* Added `auto_load_textual_inversion` and `auto_load_lora_weights`

* update README.md

* fix

* make quality

* Fix and `make style`
2025-02-05 14:43:53 -10:00
xieofxie 23bc56a02d add provider_options in from_pretrained (#10719)
Co-authored-by: hualxie <hualxie@microsoft.com>
2025-02-05 09:41:41 -10:00
SahilCarterr 5b1dcd1584 [Fix] Type Hint in from_pretrained() to Ensure Correct Type Inference (#10714)
* Update pipeline_utils.py

Added Self in from_pretrained method so  inference will correctly recognize pipeline

* Use typing_extensions

---------

Co-authored-by: hlky <hlky@hlky.ac>
2025-02-04 08:59:31 -10:00
Parag Ekbote dbe0094e86 Notebooks for Community Scripts-6 (#10713)
* Fix Doc Tutorial.

* Add 4 Notebooks and improve their example
scripts.
2025-02-04 10:12:17 -08:00
Nicolas f63d32233f Fix train_text_to_image.py --help (#10711) 2025-02-04 11:26:23 +05:30
Sayak Paul 5e8e6cb44f [bitsandbytes] Simplify bnb int8 dequant (#10401)
* fix dequantization for latest bnb.

* smol fixes.

* fix type annotation

* update peft link

* updates
2025-02-04 11:17:14 +05:30
Parag Ekbote 3e35f56b00 Fix Documentation about Image-to-Image Pipeline (#10704)
Fix Doc Tutorial.
2025-02-03 09:54:00 -08:00
Ikpreet S Babra 537891e693 Fixed grammar in "write_own_pipeline" readme (#10706) 2025-02-03 09:53:30 -08:00
Vedat Baday 9f28f1abba feat(training-utils): support device and dtype params in compute_density_for_timestep_sampling (#10699)
* feat(training-utils): support device and dtype params in compute_density_for_timestep_sampling

* chore: update type hint

* refactor: use union for type hint

---------

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
2025-02-01 23:04:05 +05:30
Thanh Le 5d2d23986e Fix inconsistent random transform in instruct pix2pix (#10698)
* Update train_instruct_pix2pix.py

Fix inconsistent random transform in instruct_pix2pix

* Update train_instruct_pix2pix_sdxl.py
2025-01-31 08:29:29 -10:00
Max Podkorytov 1ae9b0595f Fix enable memory efficient attention on ROCm (#10564)
* fix enable memory efficient attention on ROCm

while calling CK implementation

* Update attention_processor.py

refactor of picking a set element
2025-01-31 17:15:49 +05:30
SahilCarterr aad69ac2f3 [FIX] check_inputs function in Auraflow Pipeline (#10678)
fix_shape_error
2025-01-29 13:11:54 -10:00
Vedat Baday ea76880bd7 fix(hunyuan-video): typo in height and width input check (#10684) 2025-01-30 04:16:05 +05:30
Teriks 33f936154d support StableDiffusionAdapterPipeline.from_single_file (#10552)
* support StableDiffusionAdapterPipeline.from_single_file

* make style

---------

Co-authored-by: Teriks <Teriks@users.noreply.github.com>
Co-authored-by: hlky <hlky@hlky.ac>
2025-01-29 07:18:47 -10:00
Sayak Paul e6037e8275 [tests] update llamatokenizer in hunyuanvideo tests (#10681)
update llamatokenizer in hunyuanvideo tests
2025-01-29 21:12:57 +05:30
Dimitri Barbot 196aef5a6f Fix pipeline dtype unexpected change when using SDXL reference community pipelines in float16 mode (#10670)
Fix pipeline dtype unexpected change when using SDXL reference community pipelines
2025-01-28 10:46:41 -03:00
Sayak Paul 7b100ce589 [Tests] conditionally check fp8_e4m3_bf16_max_memory < fp8_e4m3_fp32_max_memory (#10669)
* conditionally check if compute capability is met.

* log info.

* fix condition.

* updates

* updates

* updates

* updates
2025-01-28 12:00:14 +05:30
Aryan c4d4ac21e7 Refactor gradient checkpointing (#10611)
* update

* remove unused fn

* apply suggestions based on review

* update + cleanup 🧹

* more cleanup 🧹

* make fix-copies

* update test
2025-01-28 06:51:46 +05:30
Hanch Han f295e2eefc [fix] refer use_framewise_encoding on AutoencoderKLHunyuanVideo._encode (#10600)
* fix: refer to use_framewise_encoding on AutoencoderKLHunyuanVideo._encode

* fix: comment about tile_sample_min_num_frames

---------

Co-authored-by: Aryan <aryan@huggingface.co>
2025-01-28 06:51:27 +05:30
Aryan 658e24e86c [core] Pyramid Attention Broadcast (#9562)
* start pyramid attention broadcast

* add coauthor

Co-Authored-By: Xuanlei Zhao <43881818+oahzxl@users.noreply.github.com>

* update

* make style

* update

* make style

* add docs

* add tests

* update

* Update docs/source/en/api/pipelines/cogvideox.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/api/pipelines/cogvideox.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Pyramid Attention Broadcast rewrite + introduce hooks (#9826)

* rewrite implementation with hooks

* make style

* update

* merge pyramid-attention-rewrite-2

* make style

* remove changes from latte transformer

* revert docs changes

* better debug message

* add todos for future

* update tests

* make style

* cleanup

* fix

* improve log message; fix latte test

* refactor

* update

* update

* update

* revert changes to tests

* update docs

* update tests

* Apply suggestions from code review

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* update

* fix flux test

* reorder

* refactor

* make fix-copies

* update docs

* fixes

* more fixes

* make style

* update tests

* update code example

* make fix-copies

* refactor based on reviews

* use maybe_free_model_hooks

* CacheMixin

* make style

* update

* add current_timestep property; update docs

* make fix-copies

* update

* improve tests

* try circular import fix

* apply suggestions from review

* address review comments

* Apply suggestions from code review

* refactor hook implementation

* add test suite for hooks

* PAB Refactor (#10667)

* update

* update

* update

---------

Co-authored-by: DN6 <dhruv.nair@gmail.com>

* update

* fix remove hook behaviour

---------

Co-authored-by: Xuanlei Zhao <43881818+oahzxl@users.noreply.github.com>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Co-authored-by: DN6 <dhruv.nair@gmail.com>
2025-01-28 05:09:04 +05:30
Giuseppe Catalano fb42066489 Revert RePaint scheduler 'fix' (#10644)
Co-authored-by: Giuseppe Catalano <giuseppelorenzo.catalano@unito.it>
2025-01-27 11:16:45 -10:00
Teriks e89ab5bc26 SDXL ControlNet Union pipelines, make control_image argument immutible (#10663)
controlnet union XL, make control_image immutible

when this argument is passed a list, __call__
modifies its content, since it is pass by reference
the list passed by the caller gets its content
modified unexpectedly

make a copy at method intro so this does not happen

Co-authored-by: Teriks <Teriks@users.noreply.github.com>
2025-01-27 10:53:30 -10:00
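The aliasing problem described in this commit message can be reproduced in a few lines. The sketch below is a hypothetical stand-in, not the pipeline code: `call_mutating` and `call_copying` are illustrative names, and plain strings stand in for control images.

```python
def call_mutating(control_image):
    # Hypothetical stand-in for a __call__ that preprocesses its inputs in place.
    for i, image in enumerate(control_image):
        control_image[i] = f"preprocessed({image})"
    return control_image


def call_copying(control_image):
    # Shallow-copy list inputs on entry so the caller's list is left untouched.
    if isinstance(control_image, list):
        control_image = control_image.copy()
    for i, image in enumerate(control_image):
        control_image[i] = f"preprocessed({image})"
    return control_image


caller_images = ["depth", "canny"]
call_mutating(caller_images)
mutated = list(caller_images)  # the caller's list was changed by the call

caller_images = ["depth", "canny"]
call_copying(caller_images)
preserved = list(caller_images)  # the caller's list is unchanged
```

A shallow copy is enough here because the function only rebinds list slots; it never modifies the image objects themselves.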
victolee0 8ceec90d76 fix check_inputs func in LuminaText2ImgPipeline (#10651) 2025-01-27 09:47:01 -10:00
hlky 158c5c4d08 Add provider_options to OnnxRuntimeModel (#10661) 2025-01-27 09:46:17 -10:00
hlky 41571773d9 [training] Convert to ImageFolder script (#10664)
* [training] Convert to ImageFolder script

* make
2025-01-27 09:43:51 -10:00
hlky 18f7d1d937 ControlNet Union controlnet_conditioning_scale for multiple control inputs (#10666) 2025-01-27 08:15:25 -10:00
Marlon May f7f36c7d3d Add community pipeline for semantic guidance for FLUX (#10610)
* add community pipeline for semantic guidance for flux

* fix imports in community pipeline for semantic guidance for flux

* Update examples/community/pipeline_flux_semantic_guidance.py

Co-authored-by: hlky <hlky@hlky.ac>

* fix community pipeline for semantic guidance for flux

---------

Co-authored-by: Linoy Tsaban <57615435+linoytsaban@users.noreply.github.com>
Co-authored-by: hlky <hlky@hlky.ac>
2025-01-27 16:19:46 +02:00
Yuqian Hong 4fa24591a3 create a script to train autoencoderkl (#10605)
* create a script to train vae

* update main.py

* update train_autoencoderkl.py

* update train_autoencoderkl.py

* add a check of --pretrained_model_name_or_path and --model_config_name_or_path

* remove the comment, remove diffusers in requiremnets.txt, add validation_image ote

* update autoencoderkl.py

* quality

---------

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
2025-01-27 16:41:34 +05:30
Jacob Helwig 4f3ec5364e Add sigmoid scheduler in scheduling_ddpm.py docs (#10648)
Sigmoid scheduler in scheduling_ddpm.py docs
2025-01-26 15:37:20 -08:00
Leo Jiang 07860f9916 NPU Adaption for Sanna (#10409)
* NPU Adaption for Sanna


---------

Co-authored-by: J石页 <jiangshuo9@h-partners.com>
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
2025-01-24 09:08:52 -10:00
Wenhao Sun 87252d80c3 Add pipeline_stable_diffusion_xl_attentive_eraser (#10579)
* add pipeline_stable_diffusion_xl_attentive_eraser

* add pipeline_stable_diffusion_xl_attentive_eraser_make_style

* make style and add example output

* update Docs

Co-authored-by: Other Contributor <a457435687@126.com>

* add Oral

Co-authored-by: Other Contributor <a457435687@126.com>

* update_review

Co-authored-by: Other Contributor <a457435687@126.com>

* update_review_ms

Co-authored-by: Other Contributor <a457435687@126.com>

---------

Co-authored-by: Other Contributor <a457435687@126.com>
2025-01-24 13:52:45 +00:00
Sayak Paul 5897137397 [chore] add a script to extract loras from full fine-tuned models (#10631)
* feat: add a lora extraction script.

* updates
2025-01-24 11:50:36 +05:30
Yaniv Galron a451c0ed14 removing redundant requires_grad = False (#10628)
We already set the unet to requires grad false at line 506

Co-authored-by: Aryan <aryan@huggingface.co>
2025-01-24 03:25:33 +05:30
hlky 37c9697f5b Add IP-Adapter example to Flux docs (#10633)
* Add IP-Adapter example to Flux docs

* Apply suggestions from code review

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

---------

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
2025-01-23 22:15:33 +05:30
Raul Ciotescu 9684c52adf width and height are mixed-up (#10629)
vars mixed-up
2025-01-23 06:40:22 -10:00
Steven Liu 5483162d12 [docs] uv installation (#10622)
* uv

* feedback
2025-01-23 08:34:51 -08:00
Sayak Paul d77c53b6d2 [docs] fix image path in para attention docs (#10632)
fix image path in para attention docs
2025-01-23 08:22:42 -08:00
Sayak Paul 78bc824729 [Tests] modify the test slices for the failing flax test (#10630)
* fixes

* fixes

* fixes

* updates
2025-01-23 12:10:24 +05:30
kahmed10 04d40920a7 add onnxruntime-migraphx as part of check for onnxruntime in import_utils.py (#10624)
add onnxruntime-migraphx to import_utils.py

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
2025-01-23 07:49:51 +05:30
Dhruv Nair 8d6f6d6b66 [CI] Update HF_TOKEN in all workflows (#10613)
update
2025-01-22 20:03:41 +05:30
Aryan ca60ad8e55 Improve TorchAO error message (#10627)
improve error message
2025-01-22 19:50:02 +05:30
Aryan beacaa5528 [core] Layerwise Upcasting (#10347)
* update

* update

* make style

* remove dynamo disable

* add coauthor

Co-Authored-By: Dhruv Nair <dhruv.nair@gmail.com>

* update

* update

* update

* update mixin

* add some basic tests

* update

* update

* non_blocking

* improvements

* update

* norm.* -> norm

* apply suggestions from review

* add example

* update hook implementation to the latest changes from pyramid attention broadcast

* deinitialize should raise an error

* update doc page

* Apply suggestions from code review

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* update docs

* update

* refactor

* fix _always_upcast_modules for asym ae and vq_model

* fix lumina embedding forward to not depend on weight dtype

* refactor tests

* add simple lora inference tests

* _always_upcast_modules -> _precision_sensitive_module_patterns

* remove todo comments about review; revert changes to self.dtype in unets because .dtype on ModelMixin should be able to handle fp8 weight case

* check layer dtypes in lora test

* fix UNet1DModelTests::test_layerwise_upcasting_inference

* _precision_sensitive_module_patterns -> _skip_layerwise_casting_patterns based on feedback

* skip test in NCSNppModelTests

* skip tests for AutoencoderTinyTests

* skip tests for AutoencoderOobleckTests

* skip tests for UNet1DModelTests - unsupported pytorch operations

* layerwise_upcasting -> layerwise_casting

* skip tests for UNetRLModelTests; needs next pytorch release for currently unimplemented operation support

* add layerwise fp8 pipeline test

* use xfail

* Apply suggestions from code review

Co-authored-by: Dhruv Nair <dhruv.nair@gmail.com>

* add assertion with fp32 comparison; add tolerance to fp8-fp32 vs fp32-fp32 comparison (required for a few models' test to pass)

* add note about memory consumption on tesla CI runner for failing test

---------

Co-authored-by: Dhruv Nair <dhruv.nair@gmail.com>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
2025-01-22 19:49:37 +05:30
Lucain a647682224 Remove cache migration script (#10619) 2025-01-21 07:22:59 -10:00
YiYi Xu a1f9a71238 fix offload gpu tests etc (#10366)
* add

* style
2025-01-21 18:52:36 +05:30
Fanli Lin ec37e20972 [tests] make tests device-agnostic (part 3) (#10437)
* initial comit

* fix empty cache

* fix one more

* fix style

* update device functions

* update

* update

* Update src/diffusers/utils/testing_utils.py

Co-authored-by: hlky <hlky@hlky.ac>

* Update src/diffusers/utils/testing_utils.py

Co-authored-by: hlky <hlky@hlky.ac>

* Update src/diffusers/utils/testing_utils.py

Co-authored-by: hlky <hlky@hlky.ac>

* Update tests/pipelines/controlnet/test_controlnet.py

Co-authored-by: hlky <hlky@hlky.ac>

* Update src/diffusers/utils/testing_utils.py

Co-authored-by: hlky <hlky@hlky.ac>

* Update src/diffusers/utils/testing_utils.py

Co-authored-by: hlky <hlky@hlky.ac>

* Update tests/pipelines/controlnet/test_controlnet.py

Co-authored-by: hlky <hlky@hlky.ac>

* with gc.collect

* update

* make style

* check_torch_dependencies

* add mps empty cache

* bug fix

* Apply suggestions from code review

---------

Co-authored-by: hlky <hlky@hlky.ac>
2025-01-21 12:15:45 +00:00
Muyang Li 158a5a87fb Remove the FP32 Wrapper when evaluating (#10617)
Remove the FP32 Wrapper

Co-authored-by: Linoy Tsaban <57615435+linoytsaban@users.noreply.github.com>
2025-01-21 16:16:54 +05:30
jiqing-feng 012d08b1bc Enable dreambooth lora finetune example on other devices (#10602)
* enable dreambooth_lora on other devices

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* enable xpu

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* check cuda device before empty cache

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* fix comment

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* import free_memory

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

---------

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
2025-01-21 14:09:45 +05:30
Sayak Paul 4ace7d0483 [chore] change licensing to 2025 from 2024. (#10615)
change licensing to 2025 from 2024.
2025-01-20 16:57:27 -10:00
baymax591 75a636da48 bugfix for npu not support float64 (#10123)
* bugfix for npu not support float64

* is_mps is_npu

---------

Co-authored-by: 白超 <baichao19@huawei.com>
Co-authored-by: hlky <hlky@hlky.ac>
2025-01-20 09:35:24 -10:00
sunxunle 4842f5d8de chore: remove redundant words (#10609)
Signed-off-by: sunxunle <sunxunle@ampere.tech>
2025-01-20 08:15:26 -10:00
Sayak Paul 328e0d20a7 [training] set rest of the blocks with requires_grad False. (#10607)
set rest of the blocks with requires_grad False.
2025-01-19 19:34:53 +05:30
Shenghai Yuan 23b467c79c [core] ConsisID (#10140)
* Update __init__.py

* add consisid

* update consisid

* update consisid

* make style

* make_style

* Update src/diffusers/pipelines/consisid/pipeline_consisid.py

Co-authored-by: hlky <hlky@hlky.ac>

* Update src/diffusers/pipelines/consisid/pipeline_consisid.py

Co-authored-by: hlky <hlky@hlky.ac>

* Update src/diffusers/pipelines/consisid/pipeline_consisid.py

Co-authored-by: hlky <hlky@hlky.ac>

* Update src/diffusers/pipelines/consisid/pipeline_consisid.py

Co-authored-by: hlky <hlky@hlky.ac>

* Update src/diffusers/pipelines/consisid/pipeline_consisid.py

Co-authored-by: hlky <hlky@hlky.ac>

* Update src/diffusers/pipelines/consisid/pipeline_consisid.py

Co-authored-by: hlky <hlky@hlky.ac>

* add doc

* make style

* Rename consisid .md to consisid.md

* Update geodiff_molecule_conformation.ipynb

* Update geodiff_molecule_conformation.ipynb

* Update geodiff_molecule_conformation.ipynb

* Update demo.ipynb

* Update pipeline_consisid.py

* make fix-copies

* Update docs/source/en/using-diffusers/consisid.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update src/diffusers/pipelines/consisid/pipeline_consisid.py

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update src/diffusers/pipelines/consisid/pipeline_consisid.py

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/using-diffusers/consisid.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/using-diffusers/consisid.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* update doc & pipeline code

* fix typo

* make style

* update example

* Update docs/source/en/using-diffusers/consisid.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* update example

* update example

* Update src/diffusers/pipelines/consisid/pipeline_consisid.py

Co-authored-by: hlky <hlky@hlky.ac>

* Update src/diffusers/pipelines/consisid/pipeline_consisid.py

Co-authored-by: hlky <hlky@hlky.ac>

* update

* add test and update

* remove some changes from docs

* refactor

* fix

* undo changes to examples

* remove save/load and fuse methods

* update

* link hf-doc-img & make test extremely small

* update

* add lora

* fix test

* update

* update

* change expected_diff_max to 0.4

* fix typo

* fix link

* fix typo

* update docs

* update

* remove consisid lora tests

---------

Co-authored-by: hlky <hlky@hlky.ac>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Co-authored-by: Aryan <aryan@huggingface.co>
2025-01-19 13:10:08 +05:30
Juan Acevedo aeac0a00f8 implementing flux on TPUs with ptxla (#10515)
* implementing flux on TPUs with ptxla

* add xla flux attention class

* run make style/quality

* Update src/diffusers/models/attention_processor.py

Co-authored-by: YiYi Xu <yixu310@gmail.com>

* Update src/diffusers/models/attention_processor.py

Co-authored-by: YiYi Xu <yixu310@gmail.com>

* run style and quality

---------

Co-authored-by: Juan Acevedo <jfacevedo@google.com>
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Co-authored-by: YiYi Xu <yixu310@gmail.com>
2025-01-16 08:46:02 -10:00
Leo Jiang cecada5280 NPU adaption for RMSNorm (#10534)
* NPU adaption for RMSNorm

* NPU adaption for RMSNorm

---------

Co-authored-by: J石页 <jiangshuo9@h-partners.com>
2025-01-16 08:45:29 -10:00
C 17d99c4d22 [Docs] Add documentation about using ParaAttention to optimize FLUX and HunyuanVideo (#10544)
* add para_attn_flux.md and para_attn_hunyuan_video.md

* add enable_sequential_cpu_offload in para_attn_hunyuan_video.md

* add comment

* refactor

* fix

* fix

* Update docs/source/en/optimization/para_attn.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/optimization/para_attn.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/optimization/para_attn.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/optimization/para_attn.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/optimization/para_attn.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/optimization/para_attn.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/optimization/para_attn.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/optimization/para_attn.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/optimization/para_attn.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/optimization/para_attn.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* fix

* update links

* Update docs/source/en/optimization/para_attn.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/optimization/para_attn.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/optimization/para_attn.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* fix

* Update docs/source/en/optimization/para_attn.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/optimization/para_attn.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/optimization/para_attn.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/optimization/para_attn.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/optimization/para_attn.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/optimization/para_attn.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/optimization/para_attn.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/optimization/para_attn.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
2025-01-16 10:05:13 -08:00
hlky 08e62fe0c2 Scheduling fixes on MPS (#10549)
* use np.int32 in scheduling

* test_add_noise_device

* -np.int32, fixes
2025-01-16 07:45:03 -10:00
Daniel Regado 9e1b8a0017 [Docs] Update SD3 ip_adapter model_id to diffusers checkpoint (#10597)
Update to diffusers ip_adapter ckpt
2025-01-16 07:43:29 -10:00
hlky 0b065c099a Move buffers to device (#10523)
* Move buffers to device

* add test

* named_buffers
2025-01-16 07:42:56 -10:00
Junyu Chen b785ddb654 [DC-AE, SANA] fix SanaMultiscaleLinearAttention apply_quadratic_attention bf16 (#10595)
* autoencoder_dc tiling

* add tiling and slicing support in SANA pipelines

* create variables for padding length because the line becomes too long

* add tiling and slicing support in pag SANA pipelines

* revert changes to tile size

* make style

* add vae tiling test

* fix SanaMultiscaleLinearAttention apply_quadratic_attention bf16

---------

Co-authored-by: Aryan <aryan@huggingface.co>
2025-01-16 16:49:02 +05:30
Daniel Regado e8114bd068 IP-Adapter for StableDiffusion3Img2ImgPipeline (#10589)
Added support for IP-Adapter
2025-01-16 09:46:22 +00:00
Leo Jiang b0c8973834 [Sana 4K] Add vae tiling option to avoid OOM (#10583)
Co-authored-by: J石页 <jiangshuo9@h-partners.com>
2025-01-16 02:06:07 +05:30
Sayak Paul c944f0651f [Chore] fix vae annotation in mochi pipeline (#10585)
fix vae annotation in mochi pipeline
2025-01-15 15:19:51 +05:30
Sayak Paul bba59fb88b [Tests] add: test to check 8bit bnb quantized models work with lora loading. (#10576)
* add: test to check 8bit bnb quantized models work with lora loading.

* Update tests/quantization/bnb/test_mixed_int8.py

Co-authored-by: Dhruv Nair <dhruv.nair@gmail.com>

---------

Co-authored-by: Dhruv Nair <dhruv.nair@gmail.com>
2025-01-15 13:05:26 +05:30
Sayak Paul 2432f80ca3 [LoRA] feat: support loading loras into 4bit quantized Flux models. (#10578)
* feat: support loading loras into 4bit quantized models.

* updates

* update

* remove weight check.
2025-01-15 12:40:40 +05:30
Aryan f9e957f011 Fix offload tests for CogVideoX and CogView3 (#10547)
* update

* update
2025-01-15 12:24:46 +05:30
Daniel Regado 4dec63c18e IP-Adapter for StableDiffusion3InpaintPipeline (#10581)
* Added support for IP-Adapter

* Added joint_attention_kwargs property
2025-01-15 06:52:23 +00:00
Junsong Chen 3d70777379 [Sana-4K] (#10537)
* [Sana 4K]
add 4K support for Sana

* [Sana-4K] fix SanaPAGPipeline

* add VAE automatically tiling function;

* set clean_caption to False;

* add warnings for VAE OOM.

* style

---------

Co-authored-by: yiyixuxu <yixu310@gmail.com>
2025-01-14 11:48:56 -10:00
Teriks 6b727842d7 allow passing hf_token to load_textual_inversion (#10546)
Co-authored-by: Teriks <Teriks@users.noreply.github.com>
2025-01-14 11:48:34 -10:00
Dhruv Nair be62c85cd9 [CI] Update HF Token on Fast GPU Model Tests (#10570)
update
2025-01-14 17:00:32 +05:30
344 changed files with 14418 additions and 2654 deletions
+3 -3
@@ -265,7 +265,7 @@ jobs:
- name: Run PyTorch CUDA tests
env:
-HF_TOKEN: ${{ secrets.HF_TOKEN }}
+HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }}
# https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms
CUBLAS_WORKSPACE_CONFIG: :16:8
run: |
@@ -505,7 +505,7 @@ jobs:
# shell: arch -arch arm64 bash {0}
# env:
# HF_HOME: /System/Volumes/Data/mnt/cache
-# HF_TOKEN: ${{ secrets.HF_TOKEN }}
+# HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }}
# run: |
# ${CONDA_RUN} python -m pytest -n 1 -s -v --make-reports=tests_torch_mps \
# --report-log=tests_torch_mps.log \
@@ -561,7 +561,7 @@ jobs:
# shell: arch -arch arm64 bash {0}
# env:
# HF_HOME: /System/Volumes/Data/mnt/cache
-# HF_TOKEN: ${{ secrets.HF_TOKEN }}
+# HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }}
# run: |
# ${CONDA_RUN} python -m pytest -n 1 -s -v --make-reports=tests_torch_mps \
# --report-log=tests_torch_mps.log \
+6 -6
@@ -137,7 +137,7 @@ jobs:
- name: Run PyTorch CUDA tests
env:
-HF_TOKEN: ${{ secrets.HF_TOKEN }}
+HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }}
# https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms
CUBLAS_WORKSPACE_CONFIG: :16:8
run: |
@@ -187,7 +187,7 @@ jobs:
- name: Run Flax TPU tests
env:
-HF_TOKEN: ${{ secrets.HF_TOKEN }}
+HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }}
run: |
python -m pytest -n 0 \
-s -v -k "Flax" \
@@ -235,7 +235,7 @@ jobs:
- name: Run ONNXRuntime CUDA tests
env:
-HF_TOKEN: ${{ secrets.HF_TOKEN }}
+HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }}
run: |
python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile \
-s -v -k "Onnx" \
@@ -283,7 +283,7 @@ jobs:
python utils/print_env.py
- name: Run example tests on GPU
env:
-HF_TOKEN: ${{ secrets.HF_TOKEN }}
+HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }}
RUN_COMPILE: yes
run: |
python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile -s -v -k "compile" --make-reports=tests_torch_compile_cuda tests/
@@ -326,7 +326,7 @@ jobs:
python utils/print_env.py
- name: Run example tests on GPU
env:
-HF_TOKEN: ${{ secrets.HF_TOKEN }}
+HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }}
run: |
python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile -s -v -k "xformers" --make-reports=tests_torch_xformers_cuda tests/
- name: Failure short reports
@@ -372,7 +372,7 @@ jobs:
- name: Run example tests on GPU
env:
-HF_TOKEN: ${{ secrets.HF_TOKEN }}
+HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }}
run: |
python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH"
python -m uv pip install timm
+8 -8
@@ -81,7 +81,7 @@ jobs:
python utils/print_env.py
- name: Slow PyTorch CUDA checkpoint tests on Ubuntu
env:
-HF_TOKEN: ${{ secrets.HF_TOKEN }}
+HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }}
# https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms
CUBLAS_WORKSPACE_CONFIG: :16:8
run: |
@@ -135,7 +135,7 @@ jobs:
- name: Run PyTorch CUDA tests
env:
-HF_TOKEN: ${{ secrets.HF_TOKEN }}
+HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }}
# https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms
CUBLAS_WORKSPACE_CONFIG: :16:8
run: |
@@ -186,7 +186,7 @@ jobs:
- name: Run PyTorch CUDA tests
env:
-HF_TOKEN: ${{ secrets.HF_TOKEN }}
+HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }}
# https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms
CUBLAS_WORKSPACE_CONFIG: :16:8
run: |
@@ -241,7 +241,7 @@ jobs:
- name: Run slow Flax TPU tests
env:
-HF_TOKEN: ${{ secrets.HF_TOKEN }}
+HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }}
run: |
python -m pytest -n 0 \
-s -v -k "Flax" \
@@ -289,7 +289,7 @@ jobs:
- name: Run slow ONNXRuntime CUDA tests
env:
-HF_TOKEN: ${{ secrets.HF_TOKEN }}
+HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }}
run: |
python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile \
-s -v -k "Onnx" \
@@ -337,7 +337,7 @@ jobs:
python utils/print_env.py
- name: Run example tests on GPU
env:
-HF_TOKEN: ${{ secrets.HF_TOKEN }}
+HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }}
RUN_COMPILE: yes
run: |
python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile -s -v -k "compile" --make-reports=tests_torch_compile_cuda tests/
@@ -380,7 +380,7 @@ jobs:
python utils/print_env.py
- name: Run example tests on GPU
env:
-HF_TOKEN: ${{ secrets.HF_TOKEN }}
+HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }}
run: |
python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile -s -v -k "xformers" --make-reports=tests_torch_xformers_cuda tests/
- name: Failure short reports
@@ -426,7 +426,7 @@ jobs:
- name: Run example tests on GPU
env:
-HF_TOKEN: ${{ secrets.HF_TOKEN }}
+HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }}
run: |
python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH"
python -m uv pip install timm
+10
@@ -79,6 +79,8 @@
- sections:
- local: using-diffusers/cogvideox
title: CogVideoX
+- local: using-diffusers/consisid
+title: ConsisID
- local: using-diffusers/sdxl
title: Stable Diffusion XL
- local: using-diffusers/sdxl_turbo
@@ -179,6 +181,8 @@
title: TGATE
- local: optimization/xdit
title: xDiT
+- local: optimization/para_attn
+title: ParaAttention
- sections:
- local: using-diffusers/stable_diffusion_jax_how_to
title: JAX/Flax
@@ -268,6 +272,8 @@
title: AuraFlowTransformer2DModel
- local: api/models/cogvideox_transformer3d
title: CogVideoXTransformer3DModel
+- local: api/models/consisid_transformer3d
+title: ConsisIDTransformer3DModel
- local: api/models/cogview3plus_transformer2d
title: CogView3PlusTransformer2DModel
- local: api/models/dit_transformer2d
@@ -370,6 +376,8 @@
title: CogVideoX
- local: api/pipelines/cogview3
title: CogView3
+- local: api/pipelines/consisid
+title: ConsisID
- local: api/pipelines/consistency_models
title: Consistency Models
- local: api/pipelines/controlnet
@@ -590,6 +598,8 @@
title: Attention Processor
- local: api/activations
title: Custom activation functions
- local: api/cache
title: Caching methods
- local: api/normalization
title: Custom normalization layers
- local: api/utilities
+49
@@ -0,0 +1,49 @@
<!-- Copyright 2024 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->
# Caching methods
## Pyramid Attention Broadcast
[Pyramid Attention Broadcast](https://huggingface.co/papers/2408.12588) from Xuanlei Zhao, Xiaolong Jin, Kai Wang, Yang You.
Pyramid Attention Broadcast (PAB) is a method that speeds up inference in diffusion models by systematically skipping attention computations between successive inference steps and reusing cached attention states. The attention states change little between successive inference steps. The difference is most pronounced in the spatial attention blocks, smaller in the temporal attention blocks, and smallest in the cross attention blocks. Therefore, cross attention computations can be skipped most often, followed by the temporal and then the spatial attention computations. By combining other techniques like sequence parallelism and classifier-free guidance parallelism, PAB achieves near real-time video generation.
Enable PAB with [`~PyramidAttentionBroadcastConfig`] on any pipeline. For some benchmarks, refer to [this](https://github.com/huggingface/diffusers/pull/9562) pull request.
```python
import torch
from diffusers import CogVideoXPipeline, PyramidAttentionBroadcastConfig
pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)
pipe.to("cuda")
# Increasing the value of `spatial_attention_timestep_skip_range[0]` or decreasing the value of
# `spatial_attention_timestep_skip_range[1]` will decrease the interval in which pyramid attention
# broadcast is active, leading to slower inference speeds. However, large intervals can lead to
# poorer quality of generated videos.
config = PyramidAttentionBroadcastConfig(
spatial_attention_block_skip_range=2,
spatial_attention_timestep_skip_range=(100, 800),
current_timestep_callback=lambda: pipe.current_timestep,
)
pipe.transformer.enable_cache(config)
```
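To make the interaction between `spatial_attention_block_skip_range` and `spatial_attention_timestep_skip_range` more concrete, here is a simplified sketch of the skip decision. It is not the library's implementation: the helper `should_recompute_attention` and its `call_index` counter are hypothetical, and the sketch assumes that attention is recomputed on every `block_skip_range`-th call inside the timestep window and on every call outside it.

```python
from typing import Tuple


def should_recompute_attention(
    timestep: int,
    call_index: int,
    block_skip_range: int,
    timestep_skip_range: Tuple[int, int],
    has_cache: bool,
) -> bool:
    """Return True to recompute attention, False to reuse the cached states.

    Hypothetical helper for illustration only.
    """
    low, high = timestep_skip_range
    inside_window = low < timestep < high
    # Outside the window (or with nothing cached yet) attention is always computed.
    if not has_cache or not inside_window:
        return True
    # Inside the window, recompute only on every `block_skip_range`-th call.
    return call_index % block_skip_range == 0


# Denoising timesteps run from high noise (near 1000) down to 0.
timesteps = [950, 750, 700, 650, 600, 50]
decisions = []
has_cache = False
for call_index, timestep in enumerate(timesteps):
    decisions.append(
        should_recompute_attention(timestep, call_index, 2, (100, 800), has_cache)
    )
    has_cache = True

print(decisions)
```

With a skip range of 2, roughly every other step inside the (100, 800) window reuses the cache, which is why narrowing the window reduces the speedup.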
### CacheMixin
[[autodoc]] CacheMixin
### PyramidAttentionBroadcastConfig
[[autodoc]] PyramidAttentionBroadcastConfig
[[autodoc]] apply_pyramid_attention_broadcast
@@ -0,0 +1,30 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->
# ConsisIDTransformer3DModel
A Diffusion Transformer model for 3D data from [ConsisID](https://github.com/PKU-YuanGroup/ConsisID), introduced in [Identity-Preserving Text-to-Video Generation by Frequency Decomposition](https://arxiv.org/pdf/2411.17440) by Peking University, University of Rochester, and others.
The model can be loaded with the following code snippet.
```python
import torch
from diffusers import ConsisIDTransformer3DModel
transformer = ConsisIDTransformer3DModel.from_pretrained("BestWishYsh/ConsisID-preview", subfolder="transformer", torch_dtype=torch.bfloat16).to("cuda")
```
## ConsisIDTransformer3DModel
[[autodoc]] ConsisIDTransformer3DModel
## Transformer2DModelOutput
[[autodoc]] models.modeling_outputs.Transformer2DModelOutput
+60
@@ -0,0 +1,60 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
-->
# ConsisID
[Identity-Preserving Text-to-Video Generation by Frequency Decomposition](https://arxiv.org/abs/2411.17440) from Peking University & University of Rochester & etc, by Shenghai Yuan, Jinfa Huang, Xianyi He, Yunyang Ge, Yujun Shi, Liuhan Chen, Jiebo Luo, Li Yuan.
The abstract from the paper is:
*Identity-preserving text-to-video (IPT2V) generation aims to create high-fidelity videos with consistent human identity. It is an important task in video generation but remains an open problem for generative models. This paper pushes the technical frontier of IPT2V in two directions that have not been resolved in the literature: (1) A tuning-free pipeline without tedious case-by-case finetuning, and (2) A frequency-aware heuristic identity-preserving Diffusion Transformer (DiT)-based control scheme. To achieve these goals, we propose **ConsisID**, a tuning-free DiT-based controllable IPT2V model to keep human-**id**entity **consis**tent in the generated video. Inspired by prior findings in frequency analysis of vision/diffusion transformers, it employs identity-control signals in the frequency domain, where facial features can be decomposed into low-frequency global features (e.g., profile, proportions) and high-frequency intrinsic features (e.g., identity markers that remain unaffected by pose changes). First, from a low-frequency perspective, we introduce a global facial extractor, which encodes the reference image and facial key points into a latent space, generating features enriched with low-frequency information. These features are then integrated into the shallow layers of the network to alleviate training challenges associated with DiT. Second, from a high-frequency perspective, we design a local facial extractor to capture high-frequency details and inject them into the transformer blocks, enhancing the model's ability to preserve fine-grained features. To leverage the frequency information for identity preservation, we propose a hierarchical training strategy, transforming a vanilla pre-trained video generation model into an IPT2V model. Extensive experiments demonstrate that our frequency-aware heuristic scheme provides an optimal control solution for DiT-based models. 
Thanks to this scheme, our **ConsisID** achieves excellent results in generating high-quality, identity-preserving videos, making strides towards more effective IPT2V. The model weight of ConsID is publicly available at https://github.com/PKU-YuanGroup/ConsisID.*
<Tip>
Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers.md) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading.md#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.
</Tip>
This pipeline was contributed by [SHYuanBest](https://github.com/SHYuanBest). The original codebase can be found [here](https://github.com/PKU-YuanGroup/ConsisID). The original weights can be found under [hf.co/BestWishYsh](https://huggingface.co/BestWishYsh).
There are two official ConsisID checkpoints for identity-preserving text-to-video.
| checkpoints | recommended inference dtype |
|:---:|:---:|
| [`BestWishYsh/ConsisID-preview`](https://huggingface.co/BestWishYsh/ConsisID-preview) | torch.bfloat16 |
| [`BestWishYsh/ConsisID-1.5`](https://huggingface.co/BestWishYsh/ConsisID-1.5) | torch.bfloat16 |
### Memory optimization
ConsisID requires about 44 GB of GPU memory to decode 49 frames (6 seconds of video at 8 FPS) with output resolution 720x480 (W x H), which makes it impossible to run on consumer GPUs or the free-tier T4 Colab. The following memory optimizations can be used to reduce the memory footprint. To replicate these numbers, refer to [this](https://gist.github.com/SHYuanBest/bc4207c36f454f9e969adbb50eaf8258) script.
| Feature (overlay the previous) | Max Memory Allocated | Max Memory Reserved |
| :----------------------------- | :------------------- | :------------------ |
| - | 37 GB | 44 GB |
| enable_model_cpu_offload | 22 GB | 25 GB |
| enable_sequential_cpu_offload | 16 GB | 22 GB |
| vae.enable_slicing | 16 GB | 22 GB |
| vae.enable_tiling | 5 GB | 7 GB |
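The optimizations in the table can be switched on as sketched below. This is a minimal sketch rather than the replication script: the wrapper function `load_consisid_low_memory` is a hypothetical name, and it assumes the last three rows of the table (sequential CPU offload, VAE slicing, VAE tiling) are combined. The imports are placed inside the function so that defining it does not require a GPU or a model download.

```python
def load_consisid_low_memory(model_id: str = "BestWishYsh/ConsisID-preview"):
    """Hypothetical helper: load ConsisID with the memory savers from the table."""
    import torch
    from diffusers import ConsisIDPipeline

    pipe = ConsisIDPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)

    # Offload submodules to the CPU and move them to the GPU only when needed.
    # Do not call pipe.to("cuda") when sequential offloading is enabled.
    pipe.enable_sequential_cpu_offload()

    # Decode the latents in slices and tiles to lower peak VAE memory.
    pipe.vae.enable_slicing()
    pipe.vae.enable_tiling()
    return pipe
```

For a faster but more memory-hungry setup, `pipe.enable_model_cpu_offload()` can be used in place of `pipe.enable_sequential_cpu_offload()`.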
## ConsisIDPipeline
[[autodoc]] ConsisIDPipeline
- all
- __call__
## ConsisIDPipelineOutput
[[autodoc]] pipelines.consisid.pipeline_output.ConsisIDPipelineOutput
+47
@@ -309,6 +309,53 @@ image.save("output.png")
When unloading the Control LoRA weights, call `pipe.unload_lora_weights(reset_to_overwritten_params=True)` to reset the `pipe.transformer` completely back to its original form. The resultant pipeline can then be used with methods like [`DiffusionPipeline.from_pipe`]. More details about this argument are available in [this PR](https://github.com/huggingface/diffusers/pull/10397).
## IP-Adapter
<Tip>
Check out [IP-Adapter](../../../using-diffusers/ip_adapter) to learn more about how IP-Adapters work.
</Tip>
An IP-Adapter lets you prompt Flux with images, in addition to the text prompt. This is especially useful when describing complex concepts that are difficult to articulate through text alone and you have reference images.
```python
import torch
from diffusers import FluxPipeline
from diffusers.utils import load_image
pipe = FluxPipeline.from_pretrained(
"black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")
image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/flux_ip_adapter_input.jpg").resize((1024, 1024))
pipe.load_ip_adapter(
"XLabs-AI/flux-ip-adapter",
weight_name="ip_adapter.safetensors",
image_encoder_pretrained_model_name_or_path="openai/clip-vit-large-patch14"
)
pipe.set_ip_adapter_scale(1.0)
image = pipe(
width=1024,
height=1024,
prompt="wearing sunglasses",
negative_prompt="",
true_cfg=4.0,
generator=torch.Generator().manual_seed(4444),
ip_adapter_image=image,
).images[0]
image.save('flux_ip_adapter_output.jpg')
```
<div class="justify-center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/flux_ip_adapter_output.jpg"/>
<figcaption class="mt-2 text-sm text-center text-gray-500">IP-Adapter examples with prompt "wearing sunglasses"</figcaption>
</div>
## Running FP16 inference
Flux can generate high-quality images with FP16 (e.g. to accelerate inference on Turing/Volta GPUs) but produces different outputs compared to FP32/BF16. The issue is that some activations in the text encoders have to be clipped when running in FP16, which affects the overall image. Forcing the text encoders to run with FP32 inference thus removes this output difference. See [here](https://github.com/huggingface/diffusers/pull/9097#issuecomment-2272292516) for details.
+1 -1
@@ -115,7 +115,7 @@ export_to_video(frames, "mochi.mp4", fps=30)
## Reproducing the results from the Genmo Mochi repo
The [Genmo Mochi implementation](https://github.com/genmoai/mochi/tree/main) uses different precision values for each stage in the inference process. The text encoder and VAE use `torch.float32`, while the DiT uses `torch.bfloat16` with the [attention kernel](https://pytorch.org/docs/stable/generated/torch.nn.attention.sdpa_kernel.html#torch.nn.attention.sdpa_kernel) set to `EFFICIENT_ATTENTION`. Diffusers pipelines currently do not support setting different `dtypes` for different stages of the pipeline. In order to run inference in the same way as the the original implementation, please refer to the following example.
The [Genmo Mochi implementation](https://github.com/genmoai/mochi/tree/main) uses different precision values for each stage in the inference process. The text encoder and VAE use `torch.float32`, while the DiT uses `torch.bfloat16` with the [attention kernel](https://pytorch.org/docs/stable/generated/torch.nn.attention.sdpa_kernel.html#torch.nn.attention.sdpa_kernel) set to `EFFICIENT_ATTENTION`. Diffusers pipelines currently do not support setting different `dtypes` for different stages of the pipeline. In order to run inference in the same way as the original implementation, please refer to the following example.
<Tip>
The original Mochi implementation zeros out empty prompts. However, enabling this option and placing the entire pipeline under autocast can lead to numerical overflows with the T5 text encoder.
@@ -77,7 +77,7 @@ from diffusers import StableDiffusion3Pipeline
from transformers import SiglipVisionModel, SiglipImageProcessor
image_encoder_id = "google/siglip-so400m-patch14-384"
ip_adapter_id = "InstantX/SD3.5-Large-IP-Adapter"
ip_adapter_id = "guiyrt/InstantX-SD3.5-Large-IP-Adapter-diffusers"
feature_extractor = SiglipImageProcessor.from_pretrained(
image_encoder_id,
+4
@@ -41,3 +41,7 @@ Utility and helper functions for working with 🤗 Diffusers.
## randn_tensor
[[autodoc]] utils.torch_utils.randn_tensor
## apply_layerwise_casting
[[autodoc]] hooks.layerwise_casting.apply_layerwise_casting
+34 -6
@@ -23,32 +23,60 @@ You should install 🤗 Diffusers in a [virtual environment](https://docs.python
If you're unfamiliar with Python virtual environments, take a look at this [guide](https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/).
A virtual environment makes it easier to manage different projects and avoid compatibility issues between dependencies.
Start by creating a virtual environment in your project directory:
Create a virtual environment with Python or [uv](https://docs.astral.sh/uv/) (refer to [Installation](https://docs.astral.sh/uv/getting-started/installation/) for installation instructions), a fast Rust-based Python package and project manager.
<hfoptions id="install">
<hfoption id="uv">
```bash
python -m venv .env
uv venv my-env
source my-env/bin/activate
```
Activate the virtual environment:
</hfoption>
<hfoption id="Python">
```bash
source .env/bin/activate
python -m venv my-env
source my-env/bin/activate
```
You should also install 🤗 Transformers because 🤗 Diffusers relies on its models:
</hfoption>
</hfoptions>
You should also install 🤗 Transformers because 🤗 Diffusers relies on its models.
<frameworkcontent>
<pt>
Note - PyTorch only supports Python 3.8 - 3.11 on Windows.
PyTorch only supports Python 3.8 - 3.11 on Windows. Install Diffusers with uv.
```bash
uv pip install diffusers["torch"] transformers
```
You can also install Diffusers with pip.
```bash
pip install diffusers["torch"] transformers
```
</pt>
<jax>
Install Diffusers with uv.
```bash
uv pip install diffusers["flax"] transformers
```
You can also install Diffusers with pip.
```bash
pip install diffusers["flax"] transformers
```
</jax>
</frameworkcontent>
+37
@@ -158,6 +158,43 @@ In order to properly offload models after they're called, it is required to run
</Tip>
## FP8 layerwise weight-casting
PyTorch supports `torch.float8_e4m3fn` and `torch.float8_e5m2` as weight storage dtypes, but they can't be used for computation in many different tensor operations due to unimplemented kernel support. However, you can use these dtypes to store model weights in fp8 precision and upcast them on-the-fly when the layers are used in the forward pass. This is known as layerwise weight-casting.
Typically, inference on most models is done with `torch.float16` or `torch.bfloat16` weight/computation precision. Layerwise weight-casting cuts down the memory footprint of the model weights by approximately half.
```python
import torch
from diffusers import CogVideoXPipeline, CogVideoXTransformer3DModel
from diffusers.utils import export_to_video
model_id = "THUDM/CogVideoX-5b"
# Load the model in bfloat16 and enable layerwise casting
transformer = CogVideoXTransformer3DModel.from_pretrained(model_id, subfolder="transformer", torch_dtype=torch.bfloat16)
transformer.enable_layerwise_casting(storage_dtype=torch.float8_e4m3fn, compute_dtype=torch.bfloat16)
# Load the pipeline
pipe = CogVideoXPipeline.from_pretrained(model_id, transformer=transformer, torch_dtype=torch.bfloat16)
pipe.to("cuda")
prompt = (
"A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. "
"The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other "
"pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, "
"casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. "
"The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical "
"atmosphere of this unique musical performance."
)
video = pipe(prompt=prompt, guidance_scale=6, num_inference_steps=50).frames[0]
export_to_video(video, "output.mp4", fps=8)
```
In the above example, layerwise casting is enabled on the transformer component of the pipeline. By default, certain layers are skipped from the FP8 weight casting because it can lead to significant degradation of generation quality. The normalization and modulation related weight parameters are also skipped by default.
However, you gain more control and flexibility by directly utilizing the [`~hooks.layerwise_casting.apply_layerwise_casting`] function instead of [`~ModelMixin.enable_layerwise_casting`].
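The "approximately half" estimate above follows from the bytes needed per parameter. The short calculation below is a back-of-the-envelope sketch: the parameter count is an assumed round figure for a "5B" transformer, and the helper `weight_memory_gib` is hypothetical. The real saving is slightly smaller because the skipped layers stay in the higher precision dtype.

```python
# Bytes needed to store one parameter in each dtype.
BYTES_PER_PARAM = {
    "float32": 4,
    "bfloat16": 2,
    "float16": 2,
    "float8_e4m3fn": 1,
    "float8_e5m2": 1,
}


def weight_memory_gib(num_params: int, dtype: str) -> float:
    """Memory needed to store `num_params` weights in `dtype`, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


num_params = 5_000_000_000  # assumed round figure, not a measured count
bf16_gib = weight_memory_gib(num_params, "bfloat16")
fp8_gib = weight_memory_gib(num_params, "float8_e4m3fn")

print(f"bfloat16 storage: {bf16_gib:.2f} GiB")
print(f"float8 storage:   {fp8_gib:.2f} GiB")
print(f"ratio:            {fp8_gib / bf16_gib:.2f}")
```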
## Channels-last memory format
The channels-last memory format is an alternative way of ordering NCHW tensors in memory to preserve dimension ordering. Channels-last tensors are ordered in such a way that the channels become the densest dimension (storing images pixel-per-pixel). Since not all operators currently support the channels-last format, it may result in worse performance but you should still try and see if it works for your model.
+497
@@ -0,0 +1,497 @@
# ParaAttention
<div class="flex justify-center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/para-attn/flux-performance.png">
</div>
<div class="flex justify-center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/para-attn/hunyuan-video-performance.png">
</div>
Large image and video generation models, such as [FLUX.1-dev](https://huggingface.co/black-forest-labs/FLUX.1-dev) and [HunyuanVideo](https://huggingface.co/tencent/HunyuanVideo), can be an inference challenge for real-time applications and deployment because of their size.
[ParaAttention](https://github.com/chengzeyi/ParaAttention) is a library that implements **context parallelism** and **first block cache**, and can be combined with other techniques (torch.compile, fp8 dynamic quantization), to accelerate inference.
This guide will show you how to apply ParaAttention to FLUX.1-dev and HunyuanVideo on NVIDIA L20 GPUs.
No optimizations are applied for our baseline benchmark, except for HunyuanVideo to avoid out-of-memory errors.
Our baseline benchmark shows that FLUX.1-dev is able to generate a 1024x1024 resolution image in 28 steps in 26.36 seconds, and HunyuanVideo is able to generate 129 frames at 720p resolution in 30 steps in 3675.71 seconds.
> [!TIP]
> For even faster inference with context parallelism, try using NVIDIA A100 or H100 GPUs (if available) with NVLink support, especially when there is a large number of GPUs.
## First Block Cache
Caching the output of the transformer blocks in the model and reusing it in the next inference steps reduces the computation cost and makes inference faster.
However, it is hard to decide when to reuse the cache without degrading the quality of the generated images or videos. ParaAttention directly uses the **residual difference of the first transformer block output** to approximate the difference among model outputs. When the difference is small enough, the residual difference of previous inference steps is reused. In other words, the denoising step is skipped.
This achieves a 2x speedup on FLUX.1-dev and HunyuanVideo inference with very good quality.
<figure>
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/para-attn/ada-cache.png" alt="Cache in Diffusion Transformer" />
<figcaption>How AdaCache works. First Block Cache is a variant of it.</figcaption>
</figure>
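The reuse decision described above can be sketched in plain Python. This is an illustration under stated assumptions, not ParaAttention's code: the helper names are hypothetical, the residuals are short lists instead of tensors, and the metric is assumed to be the mean absolute difference relative to the mean magnitude of the previous residual.

```python
from typing import Sequence


def mean_abs(values: Sequence[float]) -> float:
    """Mean of the absolute values."""
    return sum(abs(v) for v in values) / len(values)


def can_reuse_cache(
    prev_residual: Sequence[float],
    curr_residual: Sequence[float],
    threshold: float,
) -> bool:
    """True when the first-block residual barely changed since the last step."""
    diff = mean_abs([c - p for c, p in zip(curr_residual, prev_residual)])
    relative_diff = diff / mean_abs(prev_residual)
    return relative_diff < threshold


prev = [1.00, -2.00, 0.50, 1.50]
similar = [1.02, -1.98, 0.52, 1.48]    # small change: reuse the cache
different = [1.50, -1.00, 0.10, 2.20]  # large change: run all blocks

print(can_reuse_cache(prev, similar, threshold=0.08))
print(can_reuse_cache(prev, different, threshold=0.08))
```

A larger threshold lets more steps reuse the cache, which trades output fidelity for speed.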
<hfoptions id="first-block-cache">
<hfoption id="FLUX-1.dev">
To apply first block cache on FLUX.1-dev, call `apply_cache_on_pipe` as shown below. 0.08 is the default residual difference value for FLUX models.
```python
import time
import torch
from diffusers import FluxPipeline
pipe = FluxPipeline.from_pretrained(
"black-forest-labs/FLUX.1-dev",
torch_dtype=torch.bfloat16,
).to("cuda")
from para_attn.first_block_cache.diffusers_adapters import apply_cache_on_pipe
apply_cache_on_pipe(pipe, residual_diff_threshold=0.08)
# Enable memory savings
# pipe.enable_model_cpu_offload()
# pipe.enable_sequential_cpu_offload()
begin = time.time()
image = pipe(
"A cat holding a sign that says hello world",
num_inference_steps=28,
).images[0]
end = time.time()
print(f"Time: {end - begin:.2f}s")
print("Saving image to flux.png")
image.save("flux.png")
```
| Optimizations | Original | FBCache rdt=0.06 | FBCache rdt=0.08 | FBCache rdt=0.10 | FBCache rdt=0.12 |
| - | - | - | - | - | - |
| Preview | ![Original](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/para-attn/flux-original.png) | ![FBCache rdt=0.06](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/para-attn/flux-fbc-0.06.png) | ![FBCache rdt=0.08](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/para-attn/flux-fbc-0.08.png) | ![FBCache rdt=0.10](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/para-attn/flux-fbc-0.10.png) | ![FBCache rdt=0.12](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/para-attn/flux-fbc-0.12.png) |
| Wall Time (s) | 26.36 | 21.83 | 17.01 | 16.00 | 13.78 |
First Block Cache reduced the inference time to 17.01 seconds, or 1.55x faster than the baseline, while maintaining nearly zero quality loss.
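The speedup for each threshold can be derived from the wall times in the table by dividing the baseline time by the cached time. The snippet below is a small worked calculation using only the numbers reported above.

```python
baseline_s = 26.36  # wall time without First Block Cache

# Wall time in seconds for each residual difference threshold (rdt).
wall_times_s = {0.06: 21.83, 0.08: 17.01, 0.10: 16.00, 0.12: 13.78}

speedups = {rdt: baseline_s / seconds for rdt, seconds in wall_times_s.items()}

for rdt, speedup in speedups.items():
    print(f"rdt={rdt:.2f}: {speedup:.2f}x faster than the baseline")
```

Higher thresholds skip more steps and run faster, so compare the preview images before picking a value above the default.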
</hfoption>
<hfoption id="HunyuanVideo">
To apply First Block Cache on HunyuanVideo, call `apply_cache_on_pipe` as shown below. 0.06 is the default residual difference value for HunyuanVideo models.
```python
import time
import torch
from diffusers import HunyuanVideoPipeline, HunyuanVideoTransformer3DModel
from diffusers.utils import export_to_video
model_id = "tencent/HunyuanVideo"
transformer = HunyuanVideoTransformer3DModel.from_pretrained(
model_id,
subfolder="transformer",
torch_dtype=torch.bfloat16,
revision="refs/pr/18",
)
pipe = HunyuanVideoPipeline.from_pretrained(
model_id,
transformer=transformer,
torch_dtype=torch.float16,
revision="refs/pr/18",
).to("cuda")
from para_attn.first_block_cache.diffusers_adapters import apply_cache_on_pipe
apply_cache_on_pipe(pipe, residual_diff_threshold=0.6)
pipe.vae.enable_tiling()
begin = time.time()
output = pipe(
prompt="A cat walks on the grass, realistic",
height=720,
width=1280,
num_frames=129,
num_inference_steps=30,
).frames[0]
end = time.time()
print(f"Time: {end - begin:.2f}s")
print("Saving video to hunyuan_video.mp4")
export_to_video(output, "hunyuan_video.mp4", fps=15)
```
<video controls>
<source src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/para-attn/hunyuan-video-original.mp4" type="video/mp4">
Your browser does not support the video tag.
</video>
<small> HunyuanVideo without FBCache </small>
<video controls>
<source src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/para-attn/hunyuan-video-fbc.mp4" type="video/mp4">
Your browser does not support the video tag.
</video>
<small> HunyuanVideo with FBCache </small>
First Block Cache reduced the inference time to 2271.06 seconds, or 1.62x faster than the baseline, while maintaining nearly zero quality loss.
</hfoption>
</hfoptions>
## fp8 quantization
fp8 with dynamic quantization further speeds up inference and reduces memory usage. Both the activations and weights must be quantized in order to use the 8-bit [NVIDIA Tensor Cores](https://www.nvidia.com/en-us/data-center/tensor-cores/).
Use `float8_weight_only` and `float8_dynamic_activation_float8_weight` to quantize the text encoder and transformer model.
The default quantization method is per tensor quantization, but if your GPU supports row-wise quantization, you can also try it for better accuracy.
Install [torchao](https://github.com/pytorch/ao/tree/main) with the command below.
```bash
pip3 install -U torch torchao
```
[torch.compile](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) with `mode="max-autotune-no-cudagraphs"` or `mode="max-autotune"` selects the best kernel for performance. Compilation can take a long time if it's the first time the model is called, but it is worth it once the model has been compiled.
This example only quantizes the transformer model, but you can also quantize the text encoder to reduce memory usage even more.
> [!TIP]
> Dynamic quantization can significantly change the distribution of the model output, so you need to change the `residual_diff_threshold` to a larger value for it to take effect.
<hfoptions id="fp8-quantization">
<hfoption id="FLUX-1.dev">
```python
import time
import torch
from diffusers import FluxPipeline
pipe = FluxPipeline.from_pretrained(
"black-forest-labs/FLUX.1-dev",
torch_dtype=torch.bfloat16,
).to("cuda")
from para_attn.first_block_cache.diffusers_adapters import apply_cache_on_pipe
apply_cache_on_pipe(
pipe,
residual_diff_threshold=0.12, # Use a larger value to make the cache take effect
)
from torchao.quantization import quantize_, float8_dynamic_activation_float8_weight, float8_weight_only
quantize_(pipe.text_encoder, float8_weight_only())
quantize_(pipe.transformer, float8_dynamic_activation_float8_weight())
pipe.transformer = torch.compile(
pipe.transformer, mode="max-autotune-no-cudagraphs",
)
# Enable memory savings
# pipe.enable_model_cpu_offload()
# pipe.enable_sequential_cpu_offload()
for i in range(2):
begin = time.time()
image = pipe(
"A cat holding a sign that says hello world",
num_inference_steps=28,
).images[0]
end = time.time()
if i == 0:
print(f"Warm up time: {end - begin:.2f}s")
else:
print(f"Time: {end - begin:.2f}s")
print("Saving image to flux.png")
image.save("flux.png")
```
fp8 dynamic quantization and torch.compile reduced the inference time to 7.56 seconds, or 3.48x faster than the baseline.
</hfoption>
<hfoption id="HunyuanVideo">
```python
import time
import torch
from diffusers import HunyuanVideoPipeline, HunyuanVideoTransformer3DModel
from diffusers.utils import export_to_video
model_id = "tencent/HunyuanVideo"
transformer = HunyuanVideoTransformer3DModel.from_pretrained(
model_id,
subfolder="transformer",
torch_dtype=torch.bfloat16,
revision="refs/pr/18",
)
pipe = HunyuanVideoPipeline.from_pretrained(
model_id,
transformer=transformer,
torch_dtype=torch.float16,
revision="refs/pr/18",
).to("cuda")
from para_attn.first_block_cache.diffusers_adapters import apply_cache_on_pipe
apply_cache_on_pipe(pipe)
from torchao.quantization import quantize_, float8_dynamic_activation_float8_weight, float8_weight_only
quantize_(pipe.text_encoder, float8_weight_only())
quantize_(pipe.transformer, float8_dynamic_activation_float8_weight())
pipe.transformer = torch.compile(
pipe.transformer, mode="max-autotune-no-cudagraphs",
)
# Enable memory savings
pipe.vae.enable_tiling()
# pipe.enable_model_cpu_offload()
# pipe.enable_sequential_cpu_offload()
for i in range(2):
begin = time.time()
output = pipe(
prompt="A cat walks on the grass, realistic",
height=720,
width=1280,
num_frames=129,
num_inference_steps=1 if i == 0 else 30,
).frames[0]
end = time.time()
if i == 0:
print(f"Warm up time: {end - begin:.2f}s")
else:
print(f"Time: {end - begin:.2f}s")
print("Saving video to hunyuan_video.mp4")
export_to_video(output, "hunyuan_video.mp4", fps=15)
```
An NVIDIA L20 GPU only has 48GB of memory and could face out-of-memory (OOM) errors after compilation if `enable_model_cpu_offload` isn't called, because HunyuanVideo has very large activation tensors when running with a high resolution and a large number of frames. For GPUs with less than 80GB of memory, you can try reducing the resolution and number of frames to avoid OOM errors.
Large video generation models are usually bottlenecked by the attention computations rather than the fully connected layers. These models don't significantly benefit from quantization and torch.compile.
</hfoption>
</hfoptions>
## Context Parallelism
Context Parallelism parallelizes inference and scales with multiple GPUs. The ParaAttention compositional design allows you to combine Context Parallelism with First Block Cache and dynamic quantization.
> [!TIP]
> Refer to the [ParaAttention](https://github.com/chengzeyi/ParaAttention/tree/main) repository for detailed instructions and examples of how to scale inference with multiple GPUs.
If the inference process needs to be persistent and serviceable, it is suggested to use [torch.multiprocessing](https://pytorch.org/docs/stable/multiprocessing.html) to write your own inference processor. This can eliminate the overhead of launching the process and loading and recompiling the model.
<hfoptions id="context-parallelism">
<hfoption id="FLUX-1.dev">
The code sample below combines First Block Cache, fp8 dynamic quantization, torch.compile, and Context Parallelism for the fastest inference speed.
```python
import time
import torch
import torch.distributed as dist
from diffusers import FluxPipeline

dist.init_process_group()

torch.cuda.set_device(dist.get_rank())

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16,
).to("cuda")

from para_attn.context_parallel import init_context_parallel_mesh
from para_attn.context_parallel.diffusers_adapters import parallelize_pipe
from para_attn.parallel_vae.diffusers_adapters import parallelize_vae

mesh = init_context_parallel_mesh(
    pipe.device.type,
    max_ring_dim_size=2,
)
parallelize_pipe(
    pipe,
    mesh=mesh,
)
parallelize_vae(pipe.vae, mesh=mesh._flatten())

from para_attn.first_block_cache.diffusers_adapters import apply_cache_on_pipe

apply_cache_on_pipe(
    pipe,
    residual_diff_threshold=0.12,  # Use a larger value to make the cache take effect
)

from torchao.quantization import quantize_, float8_dynamic_activation_float8_weight, float8_weight_only

quantize_(pipe.text_encoder, float8_weight_only())
quantize_(pipe.transformer, float8_dynamic_activation_float8_weight())
torch._inductor.config.reorder_for_compute_comm_overlap = True
pipe.transformer = torch.compile(
    pipe.transformer, mode="max-autotune-no-cudagraphs",
)

# Enable memory savings
# pipe.enable_model_cpu_offload(gpu_id=dist.get_rank())
# pipe.enable_sequential_cpu_offload(gpu_id=dist.get_rank())

for i in range(2):
    begin = time.time()
    image = pipe(
        "A cat holding a sign that says hello world",
        num_inference_steps=28,
        output_type="pil" if dist.get_rank() == 0 else "pt",
    ).images[0]
    end = time.time()
    if dist.get_rank() == 0:
        if i == 0:
            print(f"Warm up time: {end - begin:.2f}s")
        else:
            print(f"Time: {end - begin:.2f}s")

if dist.get_rank() == 0:
    print("Saving image to flux.png")
    image.save("flux.png")

dist.destroy_process_group()
```
Save to `run_flux.py` and launch it with [torchrun](https://pytorch.org/docs/stable/elastic/run.html).
```bash
# Use --nproc_per_node to specify the number of GPUs
torchrun --nproc_per_node=2 run_flux.py
```
With 2 NVIDIA L20 GPUs, inference time drops to 8.20 seconds, 3.21x faster than the baseline. On 4 L20s, inference time is 3.90 seconds, or 6.75x faster.
</hfoption>
<hfoption id="HunyuanVideo">
The code sample below combines First Block Cache and Context Parallelism for the fastest inference speed.
```python
import time
import torch
import torch.distributed as dist
from diffusers import HunyuanVideoPipeline, HunyuanVideoTransformer3DModel
from diffusers.utils import export_to_video

dist.init_process_group()

torch.cuda.set_device(dist.get_rank())

model_id = "tencent/HunyuanVideo"
transformer = HunyuanVideoTransformer3DModel.from_pretrained(
    model_id,
    subfolder="transformer",
    torch_dtype=torch.bfloat16,
    revision="refs/pr/18",
)
pipe = HunyuanVideoPipeline.from_pretrained(
    model_id,
    transformer=transformer,
    torch_dtype=torch.float16,
    revision="refs/pr/18",
).to("cuda")

from para_attn.context_parallel import init_context_parallel_mesh
from para_attn.context_parallel.diffusers_adapters import parallelize_pipe
from para_attn.parallel_vae.diffusers_adapters import parallelize_vae

mesh = init_context_parallel_mesh(
    pipe.device.type,
)
parallelize_pipe(
    pipe,
    mesh=mesh,
)
parallelize_vae(pipe.vae, mesh=mesh._flatten())

from para_attn.first_block_cache.diffusers_adapters import apply_cache_on_pipe

apply_cache_on_pipe(pipe)

# from torchao.quantization import quantize_, float8_dynamic_activation_float8_weight, float8_weight_only
#
# torch._inductor.config.reorder_for_compute_comm_overlap = True
#
# quantize_(pipe.text_encoder, float8_weight_only())
# quantize_(pipe.transformer, float8_dynamic_activation_float8_weight())
# pipe.transformer = torch.compile(
#     pipe.transformer, mode="max-autotune-no-cudagraphs",
# )

# Enable memory savings
pipe.vae.enable_tiling()
# pipe.enable_model_cpu_offload(gpu_id=dist.get_rank())
# pipe.enable_sequential_cpu_offload(gpu_id=dist.get_rank())

for i in range(2):
    begin = time.time()
    output = pipe(
        prompt="A cat walks on the grass, realistic",
        height=720,
        width=1280,
        num_frames=129,
        num_inference_steps=1 if i == 0 else 30,
        output_type="pil" if dist.get_rank() == 0 else "pt",
    ).frames[0]
    end = time.time()
    if dist.get_rank() == 0:
        if i == 0:
            print(f"Warm up time: {end - begin:.2f}s")
        else:
            print(f"Time: {end - begin:.2f}s")

if dist.get_rank() == 0:
    print("Saving video to hunyuan_video.mp4")
    export_to_video(output, "hunyuan_video.mp4", fps=15)

dist.destroy_process_group()
```
Save to `run_hunyuan_video.py` and launch it with [torchrun](https://pytorch.org/docs/stable/elastic/run.html).
```bash
# Use --nproc_per_node to specify the number of GPUs
torchrun --nproc_per_node=8 run_hunyuan_video.py
```
With 8 NVIDIA L20 GPUs, inference time drops to 649.23 seconds, 5.66x faster than the baseline.
</hfoption>
</hfoptions>
## Benchmarks
<hfoptions id="conclusion">
<hfoption id="FLUX-1.dev">
| GPU Type | Number of GPUs | Optimizations | Wall Time (s) | Speedup |
| - | - | - | - | - |
| NVIDIA L20 | 1 | Baseline | 26.36 | 1.00x |
| NVIDIA L20 | 1 | FBCache (rdt=0.08) | 17.01 | 1.55x |
| NVIDIA L20 | 1 | FP8 DQ | 13.40 | 1.96x |
| NVIDIA L20 | 1 | FBCache (rdt=0.12) + FP8 DQ | 7.56 | 3.48x |
| NVIDIA L20 | 2 | FBCache (rdt=0.12) + FP8 DQ + CP | 4.92 | 5.35x |
| NVIDIA L20 | 4 | FBCache (rdt=0.12) + FP8 DQ + CP | 3.90 | 6.75x |
</hfoption>
<hfoption id="HunyuanVideo">
| GPU Type | Number of GPUs | Optimizations | Wall Time (s) | Speedup |
| - | - | - | - | - |
| NVIDIA L20 | 1 | Baseline | 3675.71 | 1.00x |
| NVIDIA L20 | 1 | FBCache | 2271.06 | 1.62x |
| NVIDIA L20 | 2 | FBCache + CP | 1132.90 | 3.24x |
| NVIDIA L20 | 4 | FBCache + CP | 718.15 | 5.12x |
| NVIDIA L20 | 8 | FBCache + CP | 649.23 | 5.66x |
</hfoption>
</hfoptions>
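The Speedup column appears to be the baseline wall time divided by the optimized wall time. The sketch below recomputes it for the FLUX.1-dev rows using the wall times from the table. The `speedup` helper and the dictionary layout are illustrative names introduced here, and the recomputed values may differ from the table in the last digit because of rounding.

```python
def speedup(baseline_s: float, optimized_s: float) -> float:
    """Return how many times faster the optimized run is than the baseline."""
    return baseline_s / optimized_s


# Wall times in seconds, copied from the FLUX.1-dev benchmark table.
flux_baseline = 26.36
flux_runs = {
    "FBCache (rdt=0.08), 1 GPU": 17.01,
    "FP8 DQ, 1 GPU": 13.40,
    "FBCache (rdt=0.12) + FP8 DQ, 1 GPU": 7.56,
    "FBCache (rdt=0.12) + FP8 DQ + CP, 2 GPUs": 4.92,
    "FBCache (rdt=0.12) + FP8 DQ + CP, 4 GPUs": 3.90,
}

for name, wall_time in flux_runs.items():
    print(f"{name}: {speedup(flux_baseline, wall_time):.2f}x")
```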
@@ -0,0 +1,96 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# ConsisID
[ConsisID](https://github.com/PKU-YuanGroup/ConsisID) is an identity-preserving text-to-video generation model that keeps the face consistent in the generated video by frequency decomposition. The main features of ConsisID are:
- Frequency decomposition: The characteristics of the DiT architecture are analyzed from the frequency domain perspective, and based on these characteristics, a reasonable control information injection method is designed.
- Consistency training strategy: A coarse-to-fine training strategy, dynamic masking loss, and dynamic cross-face loss further enhance the model's generalization ability and identity preservation performance.
- Inference without finetuning: Previous methods required case-by-case finetuning of the input ID before inference, leading to significant time and computational costs. In contrast, ConsisID is tuning-free.
This guide walks you through using ConsisID to generate identity-preserving videos.
## Load Model Checkpoints
Model weights may be stored in separate subfolders on the Hub or locally, in which case you should use the [`~DiffusionPipeline.from_pretrained`] method.
```python
# !pip install consisid_eva_clip insightface facexlib
import torch
from diffusers import ConsisIDPipeline
from diffusers.pipelines.consisid.consisid_utils import prepare_face_models, process_face_embeddings_infer
from huggingface_hub import snapshot_download
# Download ckpts
snapshot_download(repo_id="BestWishYsh/ConsisID-preview", local_dir="BestWishYsh/ConsisID-preview")
# Load face helper model to preprocess input face image
face_helper_1, face_helper_2, face_clip_model, face_main_model, eva_transform_mean, eva_transform_std = prepare_face_models("BestWishYsh/ConsisID-preview", device="cuda", dtype=torch.bfloat16)
# Load consisid base model
pipe = ConsisIDPipeline.from_pretrained("BestWishYsh/ConsisID-preview", torch_dtype=torch.bfloat16)
pipe.to("cuda")
```
## Identity-Preserving Text-to-Video
For identity-preserving text-to-video, pass a text prompt and an image containing a clear face (preferably half-body or full-body). By default, ConsisID generates a 720x480 video for the best results.
```python
from diffusers.utils import export_to_video
prompt = "The video captures a boy walking along a city street, filmed in black and white on a classic 35mm camera. His expression is thoughtful, his brow slightly furrowed as if he's lost in contemplation. The film grain adds a textured, timeless quality to the image, evoking a sense of nostalgia. Around him, the cityscape is filled with vintage buildings, cobblestone sidewalks, and softly blurred figures passing by, their outlines faint and indistinct. Streetlights cast a gentle glow, while shadows play across the boy's path, adding depth to the scene. The lighting highlights the boy's subtle smile, hinting at a fleeting moment of curiosity. The overall cinematic atmosphere, complete with classic film still aesthetics and dramatic contrasts, gives the scene an evocative and introspective feel."
image = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/consisid/consisid_input.png?download=true"
id_cond, id_vit_hidden, image, face_kps = process_face_embeddings_infer(face_helper_1, face_clip_model, face_helper_2, eva_transform_mean, eva_transform_std, face_main_model, "cuda", torch.bfloat16, image, is_align_face=True)
video = pipe(image=image, prompt=prompt, num_inference_steps=50, guidance_scale=6.0, use_dynamic_cfg=False, id_vit_hidden=id_vit_hidden, id_cond=id_cond, kps_cond=face_kps, generator=torch.Generator("cuda").manual_seed(42))
export_to_video(video.frames[0], "output.mp4", fps=8)
```
<table>
<tr>
<th style="text-align: center;">Face Image</th>
<th style="text-align: center;">Video</th>
<th style="text-align: center;">Description</th>
</tr>
<tr>
<td><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/consisid/consisid_image_0.png?download=true" style="height: auto; width: 600px;"></td>
<td><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/consisid/consisid_output_0.gif?download=true" style="height: auto; width: 2000px;"></td>
<td>The video, in a beautifully crafted animated style, features a confident woman riding a horse through a lush forest clearing. Her expression is focused yet serene as she adjusts her wide-brimmed hat with a practiced hand. She wears a flowy bohemian dress, which moves gracefully with the rhythm of the horse, the fabric flowing fluidly in the animated motion. The dappled sunlight filters through the trees, casting soft, painterly patterns on the forest floor. Her posture is poised, showing both control and elegance as she guides the horse with ease. The animation's gentle, fluid style adds a dreamlike quality to the scene, with the womans calm demeanor and the peaceful surroundings evoking a sense of freedom and harmony.</td>
</tr>
<tr>
<td><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/consisid/consisid_image_1.png?download=true" style="height: auto; width: 600px;"></td>
<td><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/consisid/consisid_output_1.gif?download=true" style="height: auto; width: 2000px;"></td>
<td>The video, in a captivating animated style, shows a woman standing in the center of a snowy forest, her eyes narrowed in concentration as she extends her hand forward. She is dressed in a deep blue cloak, her breath visible in the cold air, which is rendered with soft, ethereal strokes. A faint smile plays on her lips as she summons a wisp of ice magic, watching with focus as the surrounding trees and ground begin to shimmer and freeze, covered in delicate ice crystals. The animations fluid motion brings the magic to life, with the frost spreading outward in intricate, sparkling patterns. The environment is painted with soft, watercolor-like hues, enhancing the magical, dreamlike atmosphere. The overall mood is serene yet powerful, with the quiet winter air amplifying the delicate beauty of the frozen scene.</td>
</tr>
<tr>
<td><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/consisid/consisid_image_2.png?download=true" style="height: auto; width: 600px;"></td>
<td><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/consisid/consisid_output_2.gif?download=true" style="height: auto; width: 2000px;"></td>
<td>The animation features a whimsical portrait of a balloon seller standing in a gentle breeze, captured with soft, hazy brushstrokes that evoke the feel of a serene spring day. His face is framed by a gentle smile, his eyes squinting slightly against the sun, while a few wisps of hair flutter in the wind. He is dressed in a light, pastel-colored shirt, and the balloons around him sway with the wind, adding a sense of playfulness to the scene. The background blurs softly, with hints of a vibrant market or park, enhancing the light-hearted, yet tender mood of the moment.</td>
</tr>
<tr>
<td><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/consisid/consisid_image_3.png?download=true" style="height: auto; width: 600px;"></td>
<td><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/consisid/consisid_output_3.gif?download=true" style="height: auto; width: 2000px;"></td>
<td>The video captures a boy walking along a city street, filmed in black and white on a classic 35mm camera. His expression is thoughtful, his brow slightly furrowed as if he's lost in contemplation. The film grain adds a textured, timeless quality to the image, evoking a sense of nostalgia. Around him, the cityscape is filled with vintage buildings, cobblestone sidewalks, and softly blurred figures passing by, their outlines faint and indistinct. Streetlights cast a gentle glow, while shadows play across the boy's path, adding depth to the scene. The lighting highlights the boy's subtle smile, hinting at a fleeting moment of curiosity. The overall cinematic atmosphere, complete with classic film still aesthetics and dramatic contrasts, gives the scene an evocative and introspective feel.</td>
</tr>
<tr>
<td><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/consisid/consisid_image_4.png?download=true" style="height: auto; width: 600px;"></td>
<td><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/consisid/consisid_output_4.gif?download=true" style="height: auto; width: 2000px;"></td>
<td>The video features a baby wearing a bright superhero cape, standing confidently with arms raised in a powerful pose. The baby has a determined look on their face, with eyes wide and lips pursed in concentration, as if ready to take on a challenge. The setting appears playful, with colorful toys scattered around and a soft rug underfoot, while sunlight streams through a nearby window, highlighting the fluttering cape and adding to the impression of heroism. The overall atmosphere is lighthearted and fun, with the baby's expressions capturing a mix of innocence and an adorable attempt at bravery, as if truly ready to save the day.</td>
</tr>
</table>
## Resources
Learn more about ConsisID with the following resources.
- A [video](https://www.youtube.com/watch?v=PhlgC-bI5SQ) demonstrating ConsisID's main features.
- The research paper, [Identity-Preserving Text-to-Video Generation by Frequency Decomposition](https://hf.co/papers/2411.17440), for more details.
+2 -2
@@ -461,12 +461,12 @@ Chain it to an upscaler pipeline to increase the image resolution:
from diffusers import StableDiffusionLatentUpscalePipeline
upscaler = StableDiffusionLatentUpscalePipeline.from_pretrained(
"stabilityai/sd-x2-latent-upscaler", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
"stabilityai/sd-x2-latent-upscaler", torch_dtype=torch.float16, use_safetensors=True
)
upscaler.enable_model_cpu_offload()
upscaler.enable_xformers_memory_efficient_attention()
image_2 = upscaler(prompt, image=image_1, output_type="latent").images[0]
image_2 = upscaler(prompt, image=image_1).images[0]
```
Finally, chain it to a super-resolution pipeline to further enhance the resolution:
@@ -106,7 +106,7 @@ Let's try it out!
## Deconstruct the Stable Diffusion pipeline
Stable Diffusion is a text-to-image *latent diffusion* model. It is called a latent diffusion model because it works with a lower-dimensional representation of the image instead of the actual pixel space, which makes it more memory efficient. The encoder compresses the image into a smaller representation, and a decoder to convert the compressed representation back into an image. For text-to-image models, you'll need a tokenizer and an encoder to generate text embeddings. From the previous example, you already know you need a UNet model and a scheduler.
Stable Diffusion is a text-to-image *latent diffusion* model. It is called a latent diffusion model because it works with a lower-dimensional representation of the image instead of the actual pixel space, which makes it more memory efficient. The encoder compresses the image into a smaller representation, and a decoder converts the compressed representation back into an image. For text-to-image models, you'll need a tokenizer and an encoder to generate text embeddings. From the previous example, you already know you need a UNet model and a scheduler.
As you can see, this is already more complex than the DDPM pipeline which only contains a UNet model. The Stable Diffusion model has three separate pretrained models.
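To make the memory saving of working in latent space concrete, the sketch below counts the values in a pixel-space image and in its latent representation. The 8x spatial downsampling factor and the 4 latent channels are assumptions of this sketch (they are typical of Stable Diffusion v1 checkpoints), not figures stated in this guide.

```python
def num_values(height: int, width: int, channels: int) -> int:
    """Number of scalar values in a height x width x channels tensor."""
    return height * width * channels


height, width = 512, 512
downsample = 8        # assumed VAE spatial downsampling factor
latent_channels = 4   # assumed number of latent channels

pixel_values = num_values(height, width, 3)
latent_values = num_values(height // downsample, width // downsample, latent_channels)
ratio = pixel_values // latent_values

print(pixel_values, latent_values, ratio)  # 786432 16384 48
```

Under these assumptions the UNet operates on 48x fewer values than it would in pixel space.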
+2
@@ -5,6 +5,8 @@
title: Quickstart
- local: stable_diffusion
title: Effective and efficient diffusion
- local: consisid
title: Identity-preserving text-to-video generation
- local: installation
title: Installation
title: Get started
+100
@@ -0,0 +1,100 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# ConsisID
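[ConsisID](https://github.com/PKU-YuanGroup/ConsisID) is an identity-preserving text-to-video generation model that keeps the face consistent in the generated video through frequency decomposition. It has the following features:
- Frequency decomposition: Identity features are decoupled into high-frequency and low-frequency parts. The characteristics of the DiT architecture are analyzed from a frequency-domain perspective, and a reasonable control information injection method is designed based on these characteristics.
- Consistency training strategy: We propose a coarse-to-fine training strategy, a dynamic mask loss, and a dynamic cross-face loss, which further improve the model's generalization ability and identity preservation.
- Inference without finetuning: Previous methods require case-by-case finetuning on the input ID before inference, which costs significant time and compute, whereas our method is tuning-free.
This guide walks you through using ConsisID to generate identity-preserving videos.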
## Load Model Checkpoints
Model weights may be stored in separate subfolders on the Hub or locally, in which case you should use the [`~DiffusionPipeline.from_pretrained`] method.
```python
# !pip install consisid_eva_clip insightface facexlib
import torch
from diffusers import ConsisIDPipeline
from diffusers.pipelines.consisid.consisid_utils import prepare_face_models, process_face_embeddings_infer
from huggingface_hub import snapshot_download
# Download ckpts
snapshot_download(repo_id="BestWishYsh/ConsisID-preview", local_dir="BestWishYsh/ConsisID-preview")
# Load face helper model to preprocess input face image
face_helper_1, face_helper_2, face_clip_model, face_main_model, eva_transform_mean, eva_transform_std = prepare_face_models("BestWishYsh/ConsisID-preview", device="cuda", dtype=torch.bfloat16)
# Load consisid base model
pipe = ConsisIDPipeline.from_pretrained("BestWishYsh/ConsisID-preview", torch_dtype=torch.bfloat16)
pipe.to("cuda")
```
## Identity-Preserving Text-to-Video
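For identity-preserving text-to-video generation, pass a text prompt and an image containing a clear face (preferably half-body or full-body). By default, ConsisID generates a 720x480 video for the best results.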
```python
from diffusers.utils import export_to_video
prompt = "The video captures a boy walking along a city street, filmed in black and white on a classic 35mm camera. His expression is thoughtful, his brow slightly furrowed as if he's lost in contemplation. The film grain adds a textured, timeless quality to the image, evoking a sense of nostalgia. Around him, the cityscape is filled with vintage buildings, cobblestone sidewalks, and softly blurred figures passing by, their outlines faint and indistinct. Streetlights cast a gentle glow, while shadows play across the boy's path, adding depth to the scene. The lighting highlights the boy's subtle smile, hinting at a fleeting moment of curiosity. The overall cinematic atmosphere, complete with classic film still aesthetics and dramatic contrasts, gives the scene an evocative and introspective feel."
image = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/consisid/consisid_input.png?download=true"
id_cond, id_vit_hidden, image, face_kps = process_face_embeddings_infer(face_helper_1, face_clip_model, face_helper_2, eva_transform_mean, eva_transform_std, face_main_model, "cuda", torch.bfloat16, image, is_align_face=True)
video = pipe(image=image, prompt=prompt, num_inference_steps=50, guidance_scale=6.0, use_dynamic_cfg=False, id_vit_hidden=id_vit_hidden, id_cond=id_cond, kps_cond=face_kps, generator=torch.Generator("cuda").manual_seed(42))
export_to_video(video.frames[0], "output.mp4", fps=8)
```
<table>
<tr>
<th style="text-align: center;">Face Image</th>
<th style="text-align: center;">Video</th>
<th style="text-align: center;">Description</th>
</tr>
<tr>
<td><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/consisid/consisid_image_0.png?download=true" style="height: auto; width: 600px;"></td>
<td><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/consisid/consisid_output_0.gif?download=true" style="height: auto; width: 2000px;"></td>
<td>The video, in a beautifully crafted animated style, features a confident woman riding a horse through a lush forest clearing. Her expression is focused yet serene as she adjusts her wide-brimmed hat with a practiced hand. She wears a flowy bohemian dress, which moves gracefully with the rhythm of the horse, the fabric flowing fluidly in the animated motion. The dappled sunlight filters through the trees, casting soft, painterly patterns on the forest floor. Her posture is poised, showing both control and elegance as she guides the horse with ease. The animation's gentle, fluid style adds a dreamlike quality to the scene, with the womans calm demeanor and the peaceful surroundings evoking a sense of freedom and harmony.</td>
</tr>
<tr>
<td><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/consisid/consisid_image_1.png?download=true" style="height: auto; width: 600px;"></td>
<td><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/consisid/consisid_output_1.gif?download=true" style="height: auto; width: 2000px;"></td>
<td>The video, in a captivating animated style, shows a woman standing in the center of a snowy forest, her eyes narrowed in concentration as she extends her hand forward. She is dressed in a deep blue cloak, her breath visible in the cold air, which is rendered with soft, ethereal strokes. A faint smile plays on her lips as she summons a wisp of ice magic, watching with focus as the surrounding trees and ground begin to shimmer and freeze, covered in delicate ice crystals. The animations fluid motion brings the magic to life, with the frost spreading outward in intricate, sparkling patterns. The environment is painted with soft, watercolor-like hues, enhancing the magical, dreamlike atmosphere. The overall mood is serene yet powerful, with the quiet winter air amplifying the delicate beauty of the frozen scene.</td>
</tr>
<tr>
<td><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/consisid/consisid_image_2.png?download=true" style="height: auto; width: 600px;"></td>
<td><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/consisid/consisid_output_2.gif?download=true" style="height: auto; width: 2000px;"></td>
<td>The animation features a whimsical portrait of a balloon seller standing in a gentle breeze, captured with soft, hazy brushstrokes that evoke the feel of a serene spring day. His face is framed by a gentle smile, his eyes squinting slightly against the sun, while a few wisps of hair flutter in the wind. He is dressed in a light, pastel-colored shirt, and the balloons around him sway with the wind, adding a sense of playfulness to the scene. The background blurs softly, with hints of a vibrant market or park, enhancing the light-hearted, yet tender mood of the moment.</td>
</tr>
<tr>
<td><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/consisid/consisid_image_3.png?download=true" style="height: auto; width: 600px;"></td>
<td><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/consisid/consisid_output_3.gif?download=true" style="height: auto; width: 2000px;"></td>
<td>The video captures a boy walking along a city street, filmed in black and white on a classic 35mm camera. His expression is thoughtful, his brow slightly furrowed as if he's lost in contemplation. The film grain adds a textured, timeless quality to the image, evoking a sense of nostalgia. Around him, the cityscape is filled with vintage buildings, cobblestone sidewalks, and softly blurred figures passing by, their outlines faint and indistinct. Streetlights cast a gentle glow, while shadows play across the boy's path, adding depth to the scene. The lighting highlights the boy's subtle smile, hinting at a fleeting moment of curiosity. The overall cinematic atmosphere, complete with classic film still aesthetics and dramatic contrasts, gives the scene an evocative and introspective feel.</td>
</tr>
<tr>
<td><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/consisid/consisid_image_4.png?download=true" style="height: auto; width: 600px;"></td>
<td><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/consisid/consisid_output_4.gif?download=true" style="height: auto; width: 2000px;"></td>
<td>The video features a baby wearing a bright superhero cape, standing confidently with arms raised in a powerful pose. The baby has a determined look on their face, with eyes wide and lips pursed in concentration, as if ready to take on a challenge. The setting appears playful, with colorful toys scattered around and a soft rug underfoot, while sunlight streams through a nearby window, highlighting the fluttering cape and adding to the impression of heroism. The overall atmosphere is lighthearted and fun, with the baby's expressions capturing a mix of innocence and an adorable attempt at bravery, as if truly ready to save the day.</td>
</tr>
</table>
## Resources
Learn more about ConsisID with the following resources:
- A [video](https://www.youtube.com/watch?v=PhlgC-bI5SQ) demonstrating ConsisID's main features.
- The research paper, [Identity-Preserving Text-to-Video Generation by Frequency Decomposition](https://hf.co/papers/2411.17440), for more details.
@@ -1,6 +1,6 @@
#!/usr/bin/env python
# coding=utf-8
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -1,6 +1,6 @@
#!/usr/bin/env python
# coding=utf-8
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -1,6 +1,6 @@
#!/usr/bin/env python
# coding=utf-8
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
+1 -1
@@ -1,5 +1,5 @@
# coding=utf-8
# Copyright 2024 The HuggingFace Inc. team.
# Copyright 2025 The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
Executable → Regular
+191 -30
@@ -24,8 +24,8 @@ Please also check out our [Community Scripts](https://github.com/huggingface/dif
| Speech to Image | Using automatic-speech-recognition to transcribe text and Stable Diffusion to generate images | [Speech to Image](#speech-to-image) |[Notebook](https://github.com/huggingface/notebooks/blob/main/diffusers/speech_to_image.ipynb) | [Mikail Duzenli](https://github.com/MikailINTech)
| Wild Card Stable Diffusion | Stable Diffusion Pipeline that supports prompts that contain wildcard terms (indicated by surrounding double underscores), with values instantiated randomly from a corresponding txt file or a dictionary of possible values | [Wildcard Stable Diffusion](#wildcard-stable-diffusion) | [Notebook](https://github.com/huggingface/notebooks/blob/main/diffusers/wildcard_stable_diffusion.ipynb) | [Shyam Sudhakaran](https://github.com/shyamsn97) |
| [Composable Stable Diffusion](https://energy-based-model.github.io/Compositional-Visual-Generation-with-Composable-Diffusion-Models/) | Stable Diffusion Pipeline that supports prompts that contain "&#124;" in prompts (as an AND condition) and weights (separated by "&#124;" as well) to positively / negatively weight prompts. | [Composable Stable Diffusion](#composable-stable-diffusion) | - | [Mark Rich](https://github.com/MarkRich) |
| Seed Resizing Stable Diffusion | Stable Diffusion Pipeline that supports resizing an image and retaining the concepts of the 512 by 512 generation. | [Seed Resizing](#seed-resizing) | - | [Mark Rich](https://github.com/MarkRich) |
| Imagic Stable Diffusion | Stable Diffusion Pipeline that enables writing a text prompt to edit an existing image | [Imagic Stable Diffusion](#imagic-stable-diffusion) | - | [Mark Rich](https://github.com/MarkRich) |
| Seed Resizing Stable Diffusion | Stable Diffusion Pipeline that supports resizing an image and retaining the concepts of the 512 by 512 generation. | [Seed Resizing](#seed-resizing) | [Notebook](https://github.com/huggingface/notebooks/blob/main/diffusers/seed_resizing.ipynb) | [Mark Rich](https://github.com/MarkRich) |
| Imagic Stable Diffusion | Stable Diffusion Pipeline that enables writing a text prompt to edit an existing image | [Imagic Stable Diffusion](#imagic-stable-diffusion) | [Notebook](https://github.com/huggingface/notebooks/blob/main/diffusers/imagic_stable_diffusion.ipynb) | [Mark Rich](https://github.com/MarkRich) |
| Multilingual Stable Diffusion | Stable Diffusion Pipeline that supports prompts in 50 different languages. | [Multilingual Stable Diffusion](#multilingual-stable-diffusion-pipeline) | [Notebook](https://github.com/huggingface/notebooks/blob/main/diffusers/multilingual_stable_diffusion.ipynb) | [Juan Carlos Piñeros](https://github.com/juancopi81) |
| GlueGen Stable Diffusion | Stable Diffusion Pipeline that supports prompts in different languages using GlueGen adapter. | [GlueGen Stable Diffusion](#gluegen-stable-diffusion-pipeline) | - | [Phạm Hồng Vinh](https://github.com/rootonchair) |
| Image to Image Inpainting Stable Diffusion | Stable Diffusion Pipeline that enables the overlaying of two images and subsequent inpainting | [Image to Image Inpainting Stable Diffusion](#image-to-image-inpainting-stable-diffusion) | - | [Alex McKinney](https://github.com/vvvm23) |
@@ -37,7 +37,7 @@ Please also check out our [Community Scripts](https://github.com/huggingface/dif
| MagicMix | Diffusion Pipeline for semantic mixing of an image and a text prompt | [MagicMix](#magic-mix) | - | [Partho Das](https://github.com/daspartho) |
| Stable UnCLIP | Diffusion Pipeline for combining prior model (generate clip image embedding from text, UnCLIPPipeline `"kakaobrain/karlo-v1-alpha"`) and decoder pipeline (decode clip image embedding to image, StableDiffusionImageVariationPipeline `"lambdalabs/sd-image-variations-diffusers"` ). | [Stable UnCLIP](#stable-unclip) | [Notebook](https://github.com/huggingface/notebooks/blob/main/diffusers/stable_unclip.ipynb) | [Ray Wang](https://wrong.wang) |
| UnCLIP Text Interpolation Pipeline | Diffusion Pipeline that allows passing two prompts and produces images while interpolating between the text-embeddings of the two prompts | [UnCLIP Text Interpolation Pipeline](#unclip-text-interpolation-pipeline) | [Notebook](https://github.com/huggingface/notebooks/blob/main/diffusers/unclip_text_interpolation.ipynb)| [Naga Sai Abhinay Devarinti](https://github.com/Abhinay1997/) |
| UnCLIP Image Interpolation Pipeline | Diffusion Pipeline that allows passing two images/image_embeddings and produces images while interpolating between their image-embeddings | [UnCLIP Image Interpolation Pipeline](#unclip-image-interpolation-pipeline) | - | [Naga Sai Abhinay Devarinti](https://github.com/Abhinay1997/) |
| UnCLIP Image Interpolation Pipeline | Diffusion Pipeline that allows passing two images/image_embeddings and produces images while interpolating between their image-embeddings | [UnCLIP Image Interpolation Pipeline](#unclip-image-interpolation-pipeline) | [Notebook](https://github.com/huggingface/notebooks/blob/main/diffusers/unclip_image_interpolation.ipynb)| [Naga Sai Abhinay Devarinti](https://github.com/Abhinay1997/) |
| DDIM Noise Comparative Analysis Pipeline | Investigating how the diffusion models learn visual concepts from each noise level (which is a contribution of [P2 weighting (CVPR 2022)](https://arxiv.org/abs/2204.00227)) | [DDIM Noise Comparative Analysis Pipeline](#ddim-noise-comparative-analysis-pipeline) | [Notebook](https://github.com/huggingface/notebooks/blob/main/diffusers/ddim_noise_comparative_analysis.ipynb)| [Aengus (Duc-Anh)](https://github.com/aengusng8) |
| CLIP Guided Img2Img Stable Diffusion Pipeline | Doing CLIP guidance for image to image generation with Stable Diffusion | [CLIP Guided Img2Img Stable Diffusion](#clip-guided-img2img-stable-diffusion) | - | [Nipun Jindal](https://github.com/nipunjindal/) |
| TensorRT Stable Diffusion Text to Image Pipeline | Accelerates the Stable Diffusion Text2Image Pipeline using TensorRT | [TensorRT Stable Diffusion Text to Image Pipeline](#tensorrt-text2image-stable-diffusion-pipeline) | - | [Asfiya Baig](https://github.com/asfiyab-nvidia) |
@@ -57,7 +57,7 @@ Please also check out our [Community Scripts](https://github.com/huggingface/dif
| Latent Consistency Pipeline | Implementation of [Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference](https://arxiv.org/abs/2310.04378) | [Latent Consistency Pipeline](#latent-consistency-pipeline) | - | [Simian Luo](https://github.com/luosiallen) |
| Latent Consistency Img2img Pipeline | Img2img pipeline for Latent Consistency Models | [Latent Consistency Img2Img Pipeline](#latent-consistency-img2img-pipeline) | - | [Logan Zoellner](https://github.com/nagolinc) |
| Latent Consistency Interpolation Pipeline | Interpolate the latent space of Latent Consistency Models with multiple prompts | [Latent Consistency Interpolation Pipeline](#latent-consistency-interpolation-pipeline) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1pK3NrLWJSiJsBynLns1K1-IDTW9zbPvl?usp=sharing) | [Aryan V S](https://github.com/a-r-r-o-w) |
| SDE Drag Pipeline | The pipeline supports drag editing of images using stochastic differential equations | [SDE Drag Pipeline](#sde-drag-pipeline) | - | [NieShen](https://github.com/NieShenRuc) [Fengqi Zhu](https://github.com/Monohydroxides) |
| SDE Drag Pipeline | The pipeline supports drag editing of images using stochastic differential equations | [SDE Drag Pipeline](#sde-drag-pipeline) | [Notebook](https://github.com/huggingface/notebooks/blob/main/diffusers/sde_drag.ipynb) | [NieShen](https://github.com/NieShenRuc) [Fengqi Zhu](https://github.com/Monohydroxides) |
| Regional Prompting Pipeline | Assign multiple prompts for different regions | [Regional Prompting Pipeline](#regional-prompting-pipeline) | - | [hako-mikan](https://github.com/hako-mikan) |
| LDM3D-sr (LDM3D upscaler) | Upscale low resolution RGB and depth inputs to high resolution | [StableDiffusionUpscaleLDM3D Pipeline](https://github.com/estelleafl/diffusers/tree/ldm3d_upscaler_community/examples/community#stablediffusionupscaleldm3d-pipeline) | - | [Estelle Aflalo](https://github.com/estelleafl) |
| AnimateDiff ControlNet Pipeline | Combines AnimateDiff with precise motion control using ControlNets | [AnimateDiff ControlNet Pipeline](#animatediff-controlnet-pipeline) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1SKboYeGjEQmQPWoFC0aLYpBlYdHXkvAu?usp=sharing) | [Aryan V S](https://github.com/a-r-r-o-w) and [Edoardo Botta](https://github.com/EdoardoBotta) |
@@ -77,6 +77,7 @@ Please also check out our [Community Scripts](https://github.com/huggingface/dif
| PIXART-α Controlnet pipeline | Implementation of the controlnet model for pixart alpha and its diffusers pipeline | [PIXART-α Controlnet pipeline](#pixart-α-controlnet-pipeline) | - | [Raul Ciotescu](https://github.com/raulc0399/) |
| HunyuanDiT Differential Diffusion Pipeline | Applies [Differential Diffusion](https://github.com/exx8/differential-diffusion) to [HunyuanDiT](https://github.com/huggingface/diffusers/pull/8240). | [HunyuanDiT with Differential Diffusion](#hunyuandit-with-differential-diffusion) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1v44a5fpzyr4Ffr4v2XBQ7BajzG874N4P?usp=sharing) | [Monjoy Choudhury](https://github.com/MnCSSJ4x) |
| [🪆Matryoshka Diffusion Models](https://huggingface.co/papers/2310.15111) | A diffusion process that denoises inputs at multiple resolutions jointly and uses a NestedUNet architecture where features and parameters for small scale inputs are nested within those of the large scales. See [original codebase](https://github.com/apple/ml-mdm). | [🪆Matryoshka Diffusion Models](#matryoshka-diffusion-models) | [![Hugging Face Space](https://img.shields.io/badge/🤗%20Hugging%20Face-Space-yellow)](https://huggingface.co/spaces/pcuenq/mdm) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/gist/tolgacangoz/1f54875fc7aeaabcf284ebde64820966/matryoshka_hf.ipynb) | [M. Tolga Cangöz](https://github.com/tolgacangoz) |
| Stable Diffusion XL Attentive Eraser Pipeline |[[AAAI2025 Oral] Attentive Eraser](https://github.com/Anonym0u3/AttentiveEraser) is a novel tuning-free method that enhances object removal capabilities in pre-trained diffusion models.|[Stable Diffusion XL Attentive Eraser Pipeline](#stable-diffusion-xl-attentive-eraser-pipeline)|-|[Wenhao Sun](https://github.com/Anonym0u3) and [Benlei Cui](https://github.com/Benny079)|
To load a custom pipeline, pass the `custom_pipeline` argument to `DiffusionPipeline.from_pretrained`, set to the name of one of the files in `diffusers/examples/community`. Feel free to send a PR with your own pipelines; we will merge them quickly.
@@ -947,10 +948,15 @@ image.save('./imagic/imagic_image_alpha_2.png')
Test seed resizing. First generate an image at 512 by 512, then generate an image with the same seed at 512 by 592 using seed resizing. Finally, generate a 512 by 592 image using the original Stable Diffusion pipeline for comparison.
```python
import os
import torch as th
import numpy as np
from diffusers import DiffusionPipeline
# Ensure the save directory exists or create it
save_dir = './seed_resize/'
os.makedirs(save_dir, exist_ok=True)
has_cuda = th.cuda.is_available()
device = th.device('cpu' if not has_cuda else 'cuda')
@@ -964,7 +970,6 @@ def dummy(images, **kwargs):
pipe.safety_checker = dummy
images = []
th.manual_seed(0)
generator = th.Generator("cuda").manual_seed(0)
@@ -983,15 +988,14 @@ res = pipe(
width=width,
generator=generator)
image = res.images[0]
image.save('./seed_resize/seed_resize_{w}_{h}_image.png'.format(w=width, h=height))
image.save(os.path.join(save_dir, 'seed_resize_{w}_{h}_image.png'.format(w=width, h=height)))
th.manual_seed(0)
generator = th.Generator("cuda").manual_seed(0)
pipe = DiffusionPipeline.from_pretrained(
"CompVis/stable-diffusion-v1-4",
custom_pipeline="/home/mark/open_source/diffusers/examples/community/"
custom_pipeline="seed_resize_stable_diffusion"
).to(device)
width = 512
@@ -1005,11 +1009,11 @@ res = pipe(
width=width,
generator=generator)
image = res.images[0]
image.save('./seed_resize/seed_resize_{w}_{h}_image.png'.format(w=width, h=height))
image.save(os.path.join(save_dir, 'seed_resize_{w}_{h}_image.png'.format(w=width, h=height)))
pipe_compare = DiffusionPipeline.from_pretrained(
"CompVis/stable-diffusion-v1-4",
custom_pipeline="/home/mark/open_source/diffusers/examples/community/"
custom_pipeline="seed_resize_stable_diffusion"
).to(device)
res = pipe_compare(
@@ -1022,7 +1026,7 @@ res = pipe_compare(
)
image = res.images[0]
image.save('./seed_resize/seed_resize_{w}_{h}_image_compare.png'.format(w=width, h=height))
image.save(os.path.join(save_dir, 'seed_resize_{w}_{h}_image_compare.png'.format(w=width, h=height)))
```
### Multilingual Stable Diffusion Pipeline
@@ -1542,6 +1546,8 @@ This Diffusion Pipeline takes two images or an image_embeddings tensor of size 2
import torch
from diffusers import DiffusionPipeline
from PIL import Image
import requests
from io import BytesIO
device = torch.device("cpu" if not torch.cuda.is_available() else "cuda")
dtype = torch.float16 if torch.cuda.is_available() else torch.bfloat16
@@ -1553,13 +1559,25 @@ pipe = DiffusionPipeline.from_pretrained(
)
pipe.to(device)
images = [Image.open('./starry_night.jpg'), Image.open('./flowers.jpg')]
# List of image URLs
image_urls = [
'https://camo.githubusercontent.com/ef13c8059b12947c0d5e8d3ea88900de6bf1cd76bbf61ace3928e824c491290e/68747470733a2f2f68756767696e67666163652e636f2f64617461736574732f4e616761536169416268696e61792f556e434c4950496d616765496e746572706f6c6174696f6e53616d706c65732f7265736f6c76652f6d61696e2f7374617272795f6e696768742e6a7067',
'https://camo.githubusercontent.com/d1947ab7c49ae3f550c28409d5e8b120df48e456559cf4557306c0848337702c/68747470733a2f2f68756767696e67666163652e636f2f64617461736574732f4e616761536169416268696e61792f556e434c4950496d616765496e746572706f6c6174696f6e53616d706c65732f7265736f6c76652f6d61696e2f666c6f776572732e6a7067'
]
# Open images from URLs
images = []
for url in image_urls:
    response = requests.get(url)
    img = Image.open(BytesIO(response.content))
    images.append(img)
# For best results keep the prompts close in length to each other. Of course, feel free to try out with differing lengths.
generator = torch.Generator(device=device).manual_seed(42)
output = pipe(image=images, steps=6, generator=generator)
for i,image in enumerate(output.images):
for i, image in enumerate(output.images):
    image.save('starry_to_flowers_%s.jpg' % i)
```
@@ -3908,33 +3926,89 @@ This pipeline provides drag-and-drop image editing using stochastic differential
See [paper](https://arxiv.org/abs/2311.01410), [paper page](https://ml-gsai.github.io/SDE-Drag-demo/), [original repo](https://github.com/ML-GSAI/SDE-Drag) for more information.
```py
import PIL
import torch
from diffusers import DDIMScheduler, DiffusionPipeline
from PIL import Image
import requests
from io import BytesIO
import numpy as np
# Load the pipeline
model_path = "stable-diffusion-v1-5/stable-diffusion-v1-5"
scheduler = DDIMScheduler.from_pretrained(model_path, subfolder="scheduler")
pipe = DiffusionPipeline.from_pretrained(model_path, scheduler=scheduler, custom_pipeline="sde_drag")
pipe.to('cuda')
# To save GPU memory, torch.float16 can be used, but it may compromise image quality.
# If not training LoRA, please avoid using torch.float16
# pipe.to(torch.float16)
# Ensure the model is moved to the GPU
device = "cuda" if torch.cuda.is_available() else "cpu"
pipe.to(device)
# Provide prompt, image, mask image, and the starting and target points for drag editing.
prompt = "prompt of the image"
image = PIL.Image.open('/path/to/image')
mask_image = PIL.Image.open('/path/to/mask_image')
source_points = [[123, 456]]
target_points = [[234, 567]]
# Function to load image from URL
def load_image_from_url(url):
    response = requests.get(url)
    return Image.open(BytesIO(response.content)).convert("RGB")
# train_lora is optional, and in most cases, using train_lora can better preserve consistency with the original image.
pipe.train_lora(prompt, image)
# Function to prepare mask
def prepare_mask(mask_image):
    # Convert to grayscale
    mask = mask_image.convert("L")
    return mask
output = pipe(prompt, image, mask_image, source_points, target_points)
output_image = PIL.Image.fromarray(output)
# Function to convert numpy array to PIL Image
def array_to_pil(array):
    # Ensure the array is in uint8 format
    if array.dtype != np.uint8:
        if array.max() <= 1.0:
            array = (array * 255).astype(np.uint8)
        else:
            array = array.astype(np.uint8)
    # Handle different array shapes
    if len(array.shape) == 3:
        if array.shape[0] == 3:  # If channels first
            array = array.transpose(1, 2, 0)
        return Image.fromarray(array)
    elif len(array.shape) == 4:  # If batch dimension
        array = array[0]
        if array.shape[0] == 3:  # If channels first
            array = array.transpose(1, 2, 0)
        return Image.fromarray(array)
    else:
        raise ValueError(f"Unexpected array shape: {array.shape}")
# Image and mask URLs
image_url = 'https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png'
mask_url = 'https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png'
# Load the images
image = load_image_from_url(image_url)
mask_image = load_image_from_url(mask_url)
# Resize images to a size that's compatible with the model's latent space
image = image.resize((512, 512))
mask_image = mask_image.resize((512, 512))
# Prepare the mask (keep as PIL Image)
mask = prepare_mask(mask_image)
# Provide the prompt and points for drag editing
prompt = "A cute dog"
source_points = [[32, 32]] # Adjusted for 512x512 image
target_points = [[64, 64]] # Adjusted for 512x512 image
# Generate the output image
output_array = pipe(
    prompt=prompt,
    image=image,
    mask_image=mask,
    source_points=source_points,
    target_points=target_points
)
# Convert output array to PIL Image and save
output_image = array_to_pil(output_array)
output_image.save("./output.png")
print("Output image saved as './output.png'")
```
### Instaflow Pipeline
@@ -4585,8 +4659,8 @@ image = pipe(
```
| ![Gradient](https://github.com/user-attachments/assets/e38ce4d5-1ae6-4df0-ab43-adc1b45716b5) | ![Input](https://github.com/user-attachments/assets/9c95679c-e9d7-4f5a-90d6-560203acd6b3) | ![Output](https://github.com/user-attachments/assets/5313ff64-a0c4-418b-8b55-a38f1a5e7532) |
| ------------------------------------------------------------------------------------------ | --------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------- |
| Gradient | Input | Output |
| -------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------ |
| Gradient | Input | Output |
A colab notebook demonstrating all results can be found [here](https://colab.research.google.com/drive/1v44a5fpzyr4Ffr4v2XBQ7BajzG874N4P?usp=sharing). Depth Maps have also been added in the same colab.
@@ -4634,6 +4708,93 @@ make_image_grid(image, rows=1, cols=len(image))
# 50+, 100+, and 250+ num_inference_steps are recommended for nesting levels 0, 1, and 2 respectively.
```
### Stable Diffusion XL Attentive Eraser Pipeline
<img src="https://raw.githubusercontent.com/Anonym0u3/Images/refs/heads/main/fenmian.png" width="600" />
**Stable Diffusion XL Attentive Eraser Pipeline** is an advanced object removal pipeline that leverages SDXL for precise content suppression and seamless region completion. This pipeline uses **self-attention redirection guidance** to modify the model's self-attention mechanism, allowing for effective removal and inpainting across various levels of mask precision, including semantic segmentation masks, bounding boxes, and hand-drawn masks. For more detailed information or questions, please refer to the [paper](https://arxiv.org/abs/2412.12974) and [official implementation](https://github.com/Anonym0u3/AttentiveEraser).
#### Key features
- **Tuning-Free**: No additional training is required, making it easy to integrate and use.
- **Flexible Mask Support**: Works with different types of masks for targeted object removal.
- **High-Quality Results**: Utilizes the inherent generative power of diffusion models for realistic content completion.
#### Usage example
To use the Stable Diffusion XL Attentive Eraser Pipeline, you can initialize it as follows:
```py
import torch
from diffusers import DDIMScheduler, DiffusionPipeline
from diffusers.utils import load_image
import torch.nn.functional as F
from torchvision.transforms.functional import to_tensor, gaussian_blur
dtype = torch.float16
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
scheduler = DDIMScheduler(beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear", clip_sample=False, set_alpha_to_one=False)
pipeline = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    custom_pipeline="pipeline_stable_diffusion_xl_attentive_eraser",
    scheduler=scheduler,
    variant="fp16",
    use_safetensors=True,
    torch_dtype=dtype,
).to(device)
def preprocess_image(image_path, device):
    image = to_tensor((load_image(image_path)))
    image = image.unsqueeze_(0).float() * 2 - 1  # [0,1] --> [-1,1]
    if image.shape[1] != 3:
        image = image.expand(-1, 3, -1, -1)
    image = F.interpolate(image, (1024, 1024))
    image = image.to(dtype).to(device)
    return image
def preprocess_mask(mask_path, device):
    mask = to_tensor((load_image(mask_path, convert_method=lambda img: img.convert('L'))))
    mask = mask.unsqueeze_(0).float()  # 0 or 1
    mask = F.interpolate(mask, (1024, 1024))
    mask = gaussian_blur(mask, kernel_size=(77, 77))
    mask[mask < 0.1] = 0
    mask[mask >= 0.1] = 1
    mask = mask.to(dtype).to(device)
    return mask
prompt = "" # Set prompt to null
seed=123
generator = torch.Generator(device=device).manual_seed(seed)
source_image_path = "https://raw.githubusercontent.com/Anonym0u3/Images/refs/heads/main/an1024.png"
mask_path = "https://raw.githubusercontent.com/Anonym0u3/Images/refs/heads/main/an1024_mask.png"
source_image = preprocess_image(source_image_path, device)
mask = preprocess_mask(mask_path, device)
image = pipeline(
    prompt=prompt,
    image=source_image,
    mask_image=mask,
    height=1024,
    width=1024,
    AAS=True,  # enable AAS
    strength=0.8,  # inpainting strength
    rm_guidance_scale=9,  # removal guidance scale
    ss_steps=9,  # similarity suppression steps
    ss_scale=0.3,  # similarity suppression scale
    AAS_start_step=0,  # AAS start step
    AAS_start_layer=34,  # AAS start layer
    AAS_end_layer=70,  # AAS end layer
    num_inference_steps=50,  # number of inference steps # AAS_end_step = int(strength*num_inference_steps)
    generator=generator,
    guidance_scale=1,
).images[0]
image.save('./removed_img.png')
print("Object removal completed")
```
| Source Image | Mask | Output |
| ---------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------- |
| ![Source Image](https://raw.githubusercontent.com/Anonym0u3/Images/refs/heads/main/an1024.png) | ![Mask](https://raw.githubusercontent.com/Anonym0u3/Images/refs/heads/main/an1024_mask.png) | ![Output](https://raw.githubusercontent.com/Anonym0u3/Images/refs/heads/main/AE_step40_layer34.png) |
# Perturbed-Attention Guidance
[Project](https://ku-cvlab.github.io/Perturbed-Attention-Guidance/) / [arXiv](https://arxiv.org/abs/2403.17377) / [GitHub](https://github.com/KU-CVLAB/Perturbed-Attention-Guidance)
+3 -2
@@ -404,10 +404,11 @@ def my_forward(
# TODO: this requires sync between CPU and GPU. So try to pass timesteps as tensors if you can
# This would be a good case for the `match` statement (Python 3.10+)
is_mps = sample.device.type == "mps"
is_npu = sample.device.type == "npu"
if isinstance(timestep, float):
dtype = torch.float32 if is_mps else torch.float64
dtype = torch.float32 if (is_mps or is_npu) else torch.float64
else:
dtype = torch.int32 if is_mps else torch.int64
dtype = torch.int32 if (is_mps or is_npu) else torch.int64
timesteps = torch.tensor([timesteps], dtype=dtype, device=sample.device)
elif len(timesteps.shape) == 0:
timesteps = timesteps[None].to(sample.device)
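The dtype choice in this hunk can be read as a small decision table: devices of type `mps` or `npu` get 32-bit timesteps, and every other device type keeps 64-bit. A minimal sketch of that table follows. `timestep_dtype_name` and `NARROW_DEVICE_TYPES` are hypothetical names, not code from the repository, and the helper returns dtype names as strings so that it runs without PyTorch installed.

```python
# Device types that get 32-bit timesteps, mirroring the `is_mps or is_npu` check.
NARROW_DEVICE_TYPES = {"mps", "npu"}


def timestep_dtype_name(device_type, timestep):
    """Return the name of the dtype the hunk would pick for a scalar Python timestep."""
    narrow = device_type in NARROW_DEVICE_TYPES
    if isinstance(timestep, float):
        return "float32" if narrow else "float64"
    return "int32" if narrow else "int64"


# Print the choice for a float and an int timestep on a few device types.
for device_type in ("cuda", "mps", "npu"):
    print(device_type, timestep_dtype_name(device_type, 0.5), timestep_dtype_name(device_type, 10))
```

With PyTorch available, `getattr(torch, name)` turns the returned name into the actual dtype object.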
+8 -76
@@ -80,7 +80,6 @@ from diffusers.utils import (
USE_PEFT_BACKEND,
BaseOutput,
deprecate,
is_torch_version,
is_torch_xla_available,
logging,
replace_example_docstring,
@@ -869,23 +868,7 @@ class CrossAttnDownBlock2D(nn.Module):
for i, (resnet, attn) in enumerate(blocks):
if torch.is_grad_enabled() and self.gradient_checkpointing:
def create_custom_forward(module, return_dict=None):
def custom_forward(*inputs):
if return_dict is not None:
return module(*inputs, return_dict=return_dict)
else:
return module(*inputs)
return custom_forward
ckpt_kwargs: Dict[str, Any] = {"use_reentrant": False} if is_torch_version(">=", "1.11.0") else {}
hidden_states = torch.utils.checkpoint.checkpoint(
create_custom_forward(resnet),
hidden_states,
temb,
**ckpt_kwargs,
)
hidden_states = self._gradient_checkpointing_func(resnet, hidden_states, temb)
hidden_states = attn(
hidden_states,
encoder_hidden_states=encoder_hidden_states,
@@ -1030,17 +1013,6 @@ class UNetMidBlock2DCrossAttn(nn.Module):
hidden_states = self.resnets[0](hidden_states, temb)
for attn, resnet in zip(self.attentions, self.resnets[1:]):
if torch.is_grad_enabled() and self.gradient_checkpointing:
def create_custom_forward(module, return_dict=None):
def custom_forward(*inputs):
if return_dict is not None:
return module(*inputs, return_dict=return_dict)
else:
return module(*inputs)
return custom_forward
ckpt_kwargs: Dict[str, Any] = {"use_reentrant": False} if is_torch_version(">=", "1.11.0") else {}
hidden_states = attn(
hidden_states,
encoder_hidden_states=encoder_hidden_states,
@@ -1049,12 +1021,7 @@ class UNetMidBlock2DCrossAttn(nn.Module):
encoder_attention_mask=encoder_attention_mask,
return_dict=False,
)[0]
hidden_states = torch.utils.checkpoint.checkpoint(
create_custom_forward(resnet),
hidden_states,
temb,
**ckpt_kwargs,
)
hidden_states = self._gradient_checkpointing_func(resnet, hidden_states, temb)
else:
hidden_states = attn(
hidden_states,
@@ -1192,23 +1159,7 @@ class CrossAttnUpBlock2D(nn.Module):
hidden_states = torch.cat([hidden_states, res_hidden_states], dim=1)
if torch.is_grad_enabled() and self.gradient_checkpointing:
def create_custom_forward(module, return_dict=None):
def custom_forward(*inputs):
if return_dict is not None:
return module(*inputs, return_dict=return_dict)
else:
return module(*inputs)
return custom_forward
ckpt_kwargs: Dict[str, Any] = {"use_reentrant": False} if is_torch_version(">=", "1.11.0") else {}
hidden_states = torch.utils.checkpoint.checkpoint(
create_custom_forward(resnet),
hidden_states,
temb,
**ckpt_kwargs,
)
hidden_states = self._gradient_checkpointing_func(resnet, hidden_states, temb)
hidden_states = attn(
hidden_states,
encoder_hidden_states=encoder_hidden_states,
@@ -1282,10 +1233,6 @@ class MatryoshkaTransformer2DModel(LegacyModelMixin, LegacyConfigMixin):
]
)
def _set_gradient_checkpointing(self, module, value=False):
if hasattr(module, "gradient_checkpointing"):
module.gradient_checkpointing = value
def forward(
self,
hidden_states: torch.Tensor,
@@ -1365,19 +1312,8 @@ class MatryoshkaTransformer2DModel(LegacyModelMixin, LegacyConfigMixin):
# Blocks
for block in self.transformer_blocks:
if torch.is_grad_enabled() and self.gradient_checkpointing:
def create_custom_forward(module, return_dict=None):
def custom_forward(*inputs):
if return_dict is not None:
return module(*inputs, return_dict=return_dict)
else:
return module(*inputs)
return custom_forward
ckpt_kwargs: Dict[str, Any] = {"use_reentrant": False} if is_torch_version(">=", "1.11.0") else {}
hidden_states = torch.utils.checkpoint.checkpoint(
create_custom_forward(block),
hidden_states = self._gradient_checkpointing_func(
block,
hidden_states,
attention_mask,
encoder_hidden_states,
@@ -1385,7 +1321,6 @@ class MatryoshkaTransformer2DModel(LegacyModelMixin, LegacyConfigMixin):
timestep,
cross_attention_kwargs,
class_labels,
**ckpt_kwargs,
)
else:
hidden_states = block(
@@ -2724,10 +2659,6 @@ class MatryoshkaUNet2DConditionModel(
for module in self.children():
fn_recursive_set_attention_slice(module, reversed_slice_size)
def _set_gradient_checkpointing(self, module, value=False):
if hasattr(module, "gradient_checkpointing"):
module.gradient_checkpointing = value
def enable_freeu(self, s1: float, s2: float, b1: float, b2: float):
r"""Enables the FreeU mechanism from https://arxiv.org/abs/2309.11497.
@@ -2806,10 +2737,11 @@ class MatryoshkaUNet2DConditionModel(
# TODO: this requires sync between CPU and GPU. So try to pass timesteps as tensors if you can
# This would be a good case for the `match` statement (Python 3.10+)
is_mps = sample.device.type == "mps"
is_npu = sample.device.type == "npu"
if isinstance(timestep, float):
dtype = torch.float32 if is_mps else torch.float64
dtype = torch.float32 if (is_mps or is_npu) else torch.float64
else:
dtype = torch.int32 if is_mps else torch.int64
dtype = torch.int32 if (is_mps or is_npu) else torch.int64
timesteps = torch.tensor([timesteps], dtype=dtype, device=sample.device)
elif len(timesteps.shape) == 0:
timesteps = timesteps[None].to(sample.device)
File diff suppressed because it is too large
File diff suppressed because it is too large
@@ -1,5 +1,5 @@
#
# Copyright 2024 The HuggingFace Inc. team.
# Copyright 2025 The HuggingFace Inc. team.
# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
@@ -1,5 +1,5 @@
#
# Copyright 2024 The HuggingFace Inc. team.
# Copyright 2025 The HuggingFace Inc. team.
# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
@@ -1,5 +1,5 @@
#
# Copyright 2024 The HuggingFace Inc. team.
# Copyright 2025 The HuggingFace Inc. team.
# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
@@ -193,7 +193,8 @@ class StableDiffusionXLControlNetReferencePipeline(StableDiffusionXLControlNetPi
def prepare_ref_latents(self, refimage, batch_size, dtype, device, generator, do_classifier_free_guidance):
refimage = refimage.to(device=device)
if self.vae.dtype == torch.float16 and self.vae.config.force_upcast:
needs_upcasting = self.vae.dtype == torch.float16 and self.vae.config.force_upcast
if needs_upcasting:
self.upcast_vae()
refimage = refimage.to(next(iter(self.vae.post_quant_conv.parameters())).dtype)
if refimage.dtype != self.vae.dtype:
@@ -223,6 +224,11 @@ class StableDiffusionXLControlNetReferencePipeline(StableDiffusionXLControlNetPi
# aligning device to prevent device errors when concating it with the latent model input
ref_image_latents = ref_image_latents.to(device=device, dtype=dtype)
# cast back to fp16 if needed
if needs_upcasting:
self.vae.to(dtype=torch.float16)
return ref_image_latents
def prepare_ref_image(
@@ -139,7 +139,8 @@ def retrieve_timesteps(
class StableDiffusionXLReferencePipeline(StableDiffusionXLPipeline):
def prepare_ref_latents(self, refimage, batch_size, dtype, device, generator, do_classifier_free_guidance):
refimage = refimage.to(device=device)
if self.vae.dtype == torch.float16 and self.vae.config.force_upcast:
needs_upcasting = self.vae.dtype == torch.float16 and self.vae.config.force_upcast
if needs_upcasting:
self.upcast_vae()
refimage = refimage.to(next(iter(self.vae.post_quant_conv.parameters())).dtype)
if refimage.dtype != self.vae.dtype:
@@ -169,6 +170,11 @@ class StableDiffusionXLReferencePipeline(StableDiffusionXLPipeline):
# aligning device to prevent device errors when concating it with the latent model input
ref_image_latents = ref_image_latents.to(device=device, dtype=dtype)
# cast back to fp16 if needed
if needs_upcasting:
self.vae.to(dtype=torch.float16)
return ref_image_latents
def prepare_ref_image(
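Both `prepare_ref_latents` hunks follow the same shape: record whether the VAE had to be upcast, do the work, then cast back to `torch.float16`. One way to keep the two halves together is a context manager that restores the dtype even if the work in between raises. The sketch below shows that alternative and is not what the diff itself does. `DummyVAE` and `temporarily_upcast` are hypothetical stand-ins, and dtypes are plain strings so the example runs without PyTorch.

```python
from contextlib import contextmanager


class DummyVAE:
    """Stand-in for a VAE; it only tracks a dtype name and a force_upcast flag."""

    def __init__(self, dtype="float16", force_upcast=True):
        self.dtype = dtype
        self.force_upcast = force_upcast

    def to(self, dtype):
        self.dtype = dtype
        return self


@contextmanager
def temporarily_upcast(vae):
    """Upcast a float16 VAE to float32 inside the block, then restore float16."""
    needs_upcasting = vae.dtype == "float16" and vae.force_upcast
    if needs_upcasting:
        vae.to("float32")
    try:
        yield needs_upcasting
    finally:
        # Cast back to fp16 if needed, mirroring the lines added in the diff.
        if needs_upcasting:
            vae.to("float16")


vae = DummyVAE()
with temporarily_upcast(vae) as upcast:
    print(upcast, vae.dtype)  # dtype while encoding would run
print(vae.dtype)  # dtype after the block exits
```

The `try`/`finally` is the only behavioral difference from the diff: there, an exception between the upcast and the cast-back would leave the VAE in float32.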
@@ -1,6 +1,6 @@
#!/usr/bin/env python
# coding=utf-8
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -1,6 +1,6 @@
#!/usr/bin/env python
# coding=utf-8
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -1,6 +1,6 @@
#!/usr/bin/env python
# coding=utf-8
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -1,6 +1,6 @@
#!/usr/bin/env python
# coding=utf-8
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
+1 -1
@@ -1,6 +1,6 @@
#!/usr/bin/env python
# coding=utf-8
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
+1 -1
@@ -1,6 +1,6 @@
#!/usr/bin/env python
# coding=utf-8
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
+1 -1
@@ -1,6 +1,6 @@
#!/usr/bin/env python
# coding=utf-8
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
+1 -1
@@ -1,6 +1,6 @@
#!/usr/bin/env python
# coding=utf-8
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
+1 -1
@@ -1,6 +1,6 @@
#!/usr/bin/env python
# coding=utf-8
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
+26
@@ -742,3 +742,29 @@ accelerate launch train_dreambooth.py \
## Stable Diffusion XL
We support fine-tuning of the UNet shipped in [Stable Diffusion XL](https://huggingface.co/papers/2307.01952) with DreamBooth and LoRA via the `train_dreambooth_lora_sdxl.py` script. Please refer to the docs [here](./README_sdxl.md).
## Dataset
We support 🤗 [Datasets](https://huggingface.co/docs/datasets/index); you can find a dataset on the [Hugging Face Hub](https://huggingface.co/datasets) or use your own.
The quickest way to get started with your custom dataset is 🤗 Datasets' [`ImageFolder`](https://huggingface.co/docs/datasets/image_dataset#imagefolder).
We need to create a file `metadata.jsonl` in the directory with our images:
```
{"file_name": "01.jpg", "prompt": "prompt 01"}
{"file_name": "02.jpg", "prompt": "prompt 02"}
```
If we have a directory of image-text pairs, e.g. `01.jpg` and `01.txt`, then `convert_to_imagefolder.py` can create `metadata.jsonl`.
```sh
python convert_to_imagefolder.py --path my_dataset/
```
We pass `--dataset_name` and `--caption_column` to the training scripts.
```
--dataset_name=my_dataset/
--caption_column=prompt
```
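Before launching a training run, it can help to sanity-check `metadata.jsonl`. The sketch below is not part of the training scripts: `check_metadata` is a hypothetical helper that uses only the standard library and assumes the caption column is named `prompt`, as in the example above. It reports lines that are not valid JSON objects, that lack the caption column, or that point to a missing image.

```python
import json
import pathlib


def check_metadata(folder, caption_column="prompt"):
    """Return a list of human-readable problems found in metadata.jsonl."""
    folder = pathlib.Path(folder)
    metadata = folder / "metadata.jsonl"
    if not metadata.exists():
        return [f"{metadata} does not exist"]

    problems = []
    with metadata.open(encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as error:
                problems.append(f"line {line_number}: invalid JSON ({error.msg})")
                continue
            if not isinstance(record, dict):
                problems.append(f"line {line_number}: expected a JSON object")
                continue
            # Each record must name an image that exists next to metadata.jsonl.
            file_name = record.get("file_name")
            if not file_name or not (folder / file_name).is_file():
                problems.append(f"line {line_number}: image {file_name!r} not found")
            if caption_column not in record:
                problems.append(f"line {line_number}: missing column {caption_column!r}")
    return problems
```

An empty list means every record names an existing image and carries a caption.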
@@ -0,0 +1,32 @@
import argparse
import json
import pathlib
parser = argparse.ArgumentParser()
parser.add_argument(
"--path",
type=str,
required=True,
help="Path to folder with image-text pairs.",
)
parser.add_argument("--caption_column", type=str, default="prompt", help="Name of caption column.")
args = parser.parse_args()
path = pathlib.Path(args.path)
if not path.exists():
raise RuntimeError(f"`--path` '{args.path}' does not exist.")
all_files = list(path.glob("*"))
captions = list(path.glob("*.txt"))
images = set(all_files) - set(captions)
images = {image.stem: image for image in images}
caption_image = {caption: images.get(caption.stem) for caption in captions if images.get(caption.stem)}
metadata = path.joinpath("metadata.jsonl")
with metadata.open("w", encoding="utf-8") as f:
for caption, image in caption_image.items():
caption_text = caption.read_text(encoding="utf-8")
json.dump({"file_name": image.name, args.caption_column: caption_text}, f)
f.write("\n")
+1 -1
@@ -1,6 +1,6 @@
#!/usr/bin/env python
# coding=utf-8
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
+4 -4
@@ -1,6 +1,6 @@
#!/usr/bin/env python
# coding=utf-8
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -1716,9 +1716,9 @@ def main(args):
pipeline = FluxPipeline.from_pretrained(
args.pretrained_model_name_or_path,
vae=vae,
text_encoder=accelerator.unwrap_model(text_encoder_one),
text_encoder_2=accelerator.unwrap_model(text_encoder_two),
transformer=accelerator.unwrap_model(transformer),
text_encoder=accelerator.unwrap_model(text_encoder_one, keep_fp32_wrapper=False),
text_encoder_2=accelerator.unwrap_model(text_encoder_two, keep_fp32_wrapper=False),
transformer=accelerator.unwrap_model(transformer, keep_fp32_wrapper=False),
revision=args.revision,
variant=args.variant,
torch_dtype=weight_dtype,
+12 -9
@@ -1,6 +1,6 @@
#!/usr/bin/env python
# coding=utf-8
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -54,7 +54,11 @@ from diffusers import (
)
from diffusers.loaders import StableDiffusionLoraLoaderMixin
from diffusers.optimization import get_scheduler
from diffusers.training_utils import _set_state_dict_into_text_encoder, cast_training_params
from diffusers.training_utils import (
_set_state_dict_into_text_encoder,
cast_training_params,
free_memory,
)
from diffusers.utils import (
check_min_version,
convert_state_dict_to_diffusers,
@@ -151,14 +155,14 @@ def log_validation(
if args.validation_images is None:
images = []
for _ in range(args.num_validation_images):
with torch.cuda.amp.autocast():
with torch.amp.autocast(accelerator.device.type):
image = pipeline(**pipeline_args, generator=generator).images[0]
images.append(image)
else:
images = []
for image in args.validation_images:
image = Image.open(image)
with torch.cuda.amp.autocast():
with torch.amp.autocast(accelerator.device.type):
image = pipeline(**pipeline_args, image=image, generator=generator).images[0]
images.append(image)
@@ -177,7 +181,7 @@ def log_validation(
)
del pipeline
torch.cuda.empty_cache()
free_memory()
return images
@@ -793,7 +797,7 @@ def main(args):
cur_class_images = len(list(class_images_dir.iterdir()))
if cur_class_images < args.num_class_images:
torch_dtype = torch.float16 if accelerator.device.type == "cuda" else torch.float32
torch_dtype = torch.float16 if accelerator.device.type in ("cuda", "xpu") else torch.float32
if args.prior_generation_precision == "fp32":
torch_dtype = torch.float32
elif args.prior_generation_precision == "fp16":
@@ -829,8 +833,7 @@ def main(args):
image.save(image_filename)
del pipeline
if torch.cuda.is_available():
torch.cuda.empty_cache()
free_memory()
# Handle the repository creation
if accelerator.is_main_process:
@@ -1085,7 +1088,7 @@ def main(args):
tokenizer = None
gc.collect()
torch.cuda.empty_cache()
free_memory()
else:
pre_computed_encoder_hidden_states = None
validation_prompt_encoder_hidden_states = None
@@ -1,6 +1,6 @@
#!/usr/bin/env python
# coding=utf-8
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -1,6 +1,6 @@
#!/usr/bin/env python
# coding=utf-8
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -63,6 +63,7 @@ from diffusers.utils import (
is_wandb_available,
)
from diffusers.utils.hub_utils import load_or_create_model_card, populate_model_card
from diffusers.utils.import_utils import is_torch_npu_available
from diffusers.utils.torch_utils import is_compiled_module
@@ -74,6 +75,9 @@ check_min_version("0.33.0.dev0")
logger = get_logger(__name__)
if is_torch_npu_available():
torch.npu.config.allow_internal_format = False
def save_model_card(
repo_id: str,
@@ -158,6 +162,9 @@ def log_validation(
f"Running validation... \n Generating {args.num_validation_images} images with prompt:"
f" {args.validation_prompt}."
)
if args.enable_vae_tiling:
pipeline.vae.enable_tiling(tile_sample_min_height=1024, tile_sample_stride_width=1024)
pipeline.text_encoder = pipeline.text_encoder.to(torch.bfloat16)
pipeline = pipeline.to(accelerator.device)
pipeline.set_progress_bar_config(disable=True)
@@ -597,6 +604,8 @@ def parse_args(input_args=None):
help="Whether to offload the VAE and the text encoder to CPU when they are not used.",
)
parser.add_argument("--local_rank", type=int, default=-1, help="For distributed training: local_rank")
parser.add_argument("--enable_vae_tiling", action="store_true", help="Enable VAE tiling in log validation")
parser.add_argument("--enable_npu_flash_attention", action="store_true", help="Enable Flash Attention for NPU")
if input_args is not None:
args = parser.parse_args(input_args)
@@ -920,8 +929,7 @@ def main(args):
image.save(image_filename)
del pipeline
if torch.cuda.is_available():
torch.cuda.empty_cache()
free_memory()
# Handle the repository creation
if accelerator.is_main_process:
@@ -984,6 +992,14 @@ def main(args):
# because Gemma2 is particularly suited for bfloat16.
text_encoder.to(dtype=torch.bfloat16)
if args.enable_npu_flash_attention:
if is_torch_npu_available():
logger.info("npu flash attention enabled.")
for block in transformer.transformer_blocks:
block.attn2.set_use_npu_flash_attention(True)
else:
raise ValueError("npu flash attention requires torch_npu extensions and is supported only on npu device.")
# Initialize a text encoding pipeline and keep it to CPU for now.
text_encoding_pipeline = SanaPipeline.from_pretrained(
args.pretrained_model_name_or_path,
@@ -1,6 +1,6 @@
#!/usr/bin/env python
# coding=utf-8
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -1,6 +1,6 @@
#!/usr/bin/env python
# coding=utf-8
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
+1 -1
@@ -1,6 +1,6 @@
#!/usr/bin/env python
# coding=utf-8
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
+3 -1
@@ -1,6 +1,6 @@
#!/usr/bin/env python
# coding=utf-8
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -812,6 +812,8 @@ def main(args):
for name, module in flux_transformer.named_modules():
if "transformer_blocks" in name:
module.requires_grad_(True)
else:
module.requires_grad_(False)
def unwrap_model(model):
model = accelerator.unwrap_model(model)
@@ -1,6 +1,6 @@
#!/usr/bin/env python
# coding=utf-8
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -1,6 +1,6 @@
#!/usr/bin/env python
# coding=utf-8
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -695,7 +695,7 @@ def main():
)
# We need to ensure that the original and the edited images undergo the same
# augmentation transforms.
images = np.concatenate([original_images, edited_images])
images = np.stack([original_images, edited_images])
images = torch.tensor(images)
images = 2 * (images / 255) - 1
return train_transforms(images)
@@ -706,7 +706,7 @@ def main():
# Since the original and edited images were concatenated before
# applying the transformations, we need to separate them and reshape
# them accordingly.
original_images, edited_images = preprocessed_images.chunk(2)
original_images, edited_images = preprocessed_images
original_images = original_images.reshape(-1, 3, args.resolution, args.resolution)
edited_images = edited_images.reshape(-1, 3, args.resolution, args.resolution)
@@ -766,7 +766,7 @@ def main():
)
# We need to ensure that the original and the edited images undergo the same
# augmentation transforms.
images = np.concatenate([original_images, edited_images])
images = np.stack([original_images, edited_images])
images = torch.tensor(images)
images = 2 * (images / 255) - 1
return train_transforms(images)
@@ -906,7 +906,7 @@ def main():
# Since the original and edited images were concatenated before
# applying the transformations, we need to separate them and reshape
# them accordingly.
original_images, edited_images = preprocessed_images.chunk(2)
original_images, edited_images = preprocessed_images
original_images = original_images.reshape(-1, 3, args.resolution, args.resolution)
edited_images = edited_images.reshape(-1, 3, args.resolution, args.resolution)
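The two ways of combining and separating the image pairs shown in these hunks can be compared on toy arrays. This is a minimal sketch with made-up shapes, not the script's actual data. It assumes both inputs have identical shapes, which `np.stack` requires, and it uses `np.split` as a NumPy stand-in for the tensor `chunk(2)` call.

```python
import numpy as np

# Toy stand-ins for the original and edited image batches (shapes are illustrative).
original = np.zeros((3, 4, 4), dtype=np.float32)
edited = np.ones((3, 4, 4), dtype=np.float32)

# Route 1: stack along a new leading axis, then unpack along that axis.
stacked = np.stack([original, edited])  # shape (2, 3, 4, 4)
orig_s, edit_s = stacked  # iterating an array yields slices along axis 0

# Route 2: concatenate along the existing first axis, then split in half.
concatenated = np.concatenate([original, edited])  # shape (6, 4, 4)
orig_c, edit_c = np.split(concatenated, 2)

print(stacked.shape, concatenated.shape)
```

With equal-shaped inputs both routes recover the same two arrays. The stacked form keeps the pair boundary as an explicit axis instead of relying on an even split.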
@@ -1,6 +1,6 @@
#!/usr/bin/env python
# coding=utf-8
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -1,5 +1,5 @@
# coding=utf-8
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -1,5 +1,5 @@
# coding=utf-8
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -1,6 +1,6 @@
#!/usr/bin/env python
# coding=utf-8
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
+2 -22
@@ -82,31 +82,11 @@ pipeline = EasyPipelineForInpainting.from_huggingface(
## Search Civitai and Huggingface
```python
from pipeline_easy import (
search_huggingface,
search_civitai,
)
# Search Lora
Lora = search_civitai(
"Keyword_to_search_Lora",
model_type="LORA",
base_model = "SD 1.5",
download=True,
)
# Load Lora into the pipeline.
pipeline.load_lora_weights(Lora)
pipeline.auto_load_lora_weights("Detail Tweaker")
# Search TextualInversion
TextualInversion = search_civitai(
"EasyNegative",
model_type="TextualInversion",
base_model = "SD 1.5",
download=True
)
# Load TextualInversion into the pipeline.
pipeline.load_textual_inversion(TextualInversion, token="EasyNegative")
pipeline.auto_load_textual_inversion("EasyNegative", token="EasyNegative")
```
### Search Civitai
+482 -110
@@ -1,5 +1,5 @@
# coding=utf-8
# Copyright 2024 suzukimain
# Copyright 2025 suzukimain
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -15,11 +15,13 @@
import os
import re
import types
from collections import OrderedDict
from dataclasses import asdict, dataclass
from typing import Union
from dataclasses import asdict, dataclass, field
from typing import Dict, List, Optional, Union
import requests
import torch
from huggingface_hub import hf_api, hf_hub_download
from huggingface_hub.file_download import http_get
from huggingface_hub.utils import validate_hf_hub_args
@@ -30,6 +32,7 @@ from diffusers.loaders.single_file_utils import (
infer_diffusers_model_type,
load_single_file_checkpoint,
)
from diffusers.pipelines.animatediff import AnimateDiffPipeline, AnimateDiffSDXLPipeline
from diffusers.pipelines.auto_pipeline import (
AutoPipelineForImage2Image,
AutoPipelineForInpainting,
@@ -39,13 +42,18 @@ from diffusers.pipelines.controlnet import (
StableDiffusionControlNetImg2ImgPipeline,
StableDiffusionControlNetInpaintPipeline,
StableDiffusionControlNetPipeline,
StableDiffusionXLControlNetImg2ImgPipeline,
StableDiffusionXLControlNetPipeline,
)
from diffusers.pipelines.flux import FluxImg2ImgPipeline, FluxPipeline
from diffusers.pipelines.pipeline_utils import DiffusionPipeline
from diffusers.pipelines.stable_diffusion import (
StableDiffusionImg2ImgPipeline,
StableDiffusionInpaintPipeline,
StableDiffusionPipeline,
StableDiffusionUpscalePipeline,
)
from diffusers.pipelines.stable_diffusion_3 import StableDiffusion3Img2ImgPipeline, StableDiffusion3Pipeline
from diffusers.pipelines.stable_diffusion_xl import (
StableDiffusionXLImg2ImgPipeline,
StableDiffusionXLInpaintPipeline,
@@ -59,46 +67,133 @@ logger = logging.get_logger(__name__)
SINGLE_FILE_CHECKPOINT_TEXT2IMAGE_PIPELINE_MAPPING = OrderedDict(
[
("xl_base", StableDiffusionXLPipeline),
("xl_refiner", StableDiffusionXLPipeline),
("xl_inpaint", None),
("playground-v2-5", StableDiffusionXLPipeline),
("upscale", None),
("animatediff_rgb", AnimateDiffPipeline),
("animatediff_scribble", AnimateDiffPipeline),
("animatediff_sdxl_beta", AnimateDiffSDXLPipeline),
("animatediff_v1", AnimateDiffPipeline),
("animatediff_v2", AnimateDiffPipeline),
("animatediff_v3", AnimateDiffPipeline),
("autoencoder-dc-f128c512", None),
("autoencoder-dc-f32c32", None),
("autoencoder-dc-f32c32-sana", None),
("autoencoder-dc-f64c128", None),
("controlnet", StableDiffusionControlNetPipeline),
("controlnet_xl", StableDiffusionXLControlNetPipeline),
("controlnet_xl_large", StableDiffusionXLControlNetPipeline),
("controlnet_xl_mid", StableDiffusionXLControlNetPipeline),
("controlnet_xl_small", StableDiffusionXLControlNetPipeline),
("flux-depth", FluxPipeline),
("flux-dev", FluxPipeline),
("flux-fill", FluxPipeline),
("flux-schnell", FluxPipeline),
("hunyuan-video", None),
("inpainting", None),
("inpainting_v2", None),
("controlnet", StableDiffusionControlNetPipeline),
("v2", StableDiffusionPipeline),
("ltx-video", None),
("ltx-video-0.9.1", None),
("mochi-1-preview", None),
("playground-v2-5", StableDiffusionXLPipeline),
("sd3", StableDiffusion3Pipeline),
("sd35_large", StableDiffusion3Pipeline),
("sd35_medium", StableDiffusion3Pipeline),
("stable_cascade_stage_b", None),
("stable_cascade_stage_b_lite", None),
("stable_cascade_stage_c", None),
("stable_cascade_stage_c_lite", None),
("upscale", StableDiffusionUpscalePipeline),
("v1", StableDiffusionPipeline),
("v2", StableDiffusionPipeline),
("xl_base", StableDiffusionXLPipeline),
("xl_inpaint", None),
("xl_refiner", StableDiffusionXLPipeline),
]
)
SINGLE_FILE_CHECKPOINT_IMAGE2IMAGE_PIPELINE_MAPPING = OrderedDict(
[
("xl_base", StableDiffusionXLImg2ImgPipeline),
("xl_refiner", StableDiffusionXLImg2ImgPipeline),
("xl_inpaint", None),
("playground-v2-5", StableDiffusionXLImg2ImgPipeline),
("upscale", None),
("animatediff_rgb", AnimateDiffPipeline),
("animatediff_scribble", AnimateDiffPipeline),
("animatediff_sdxl_beta", AnimateDiffSDXLPipeline),
("animatediff_v1", AnimateDiffPipeline),
("animatediff_v2", AnimateDiffPipeline),
("animatediff_v3", AnimateDiffPipeline),
("autoencoder-dc-f128c512", None),
("autoencoder-dc-f32c32", None),
("autoencoder-dc-f32c32-sana", None),
("autoencoder-dc-f64c128", None),
("controlnet", StableDiffusionControlNetImg2ImgPipeline),
("controlnet_xl", StableDiffusionXLControlNetImg2ImgPipeline),
("controlnet_xl_large", StableDiffusionXLControlNetImg2ImgPipeline),
("controlnet_xl_mid", StableDiffusionXLControlNetImg2ImgPipeline),
("controlnet_xl_small", StableDiffusionXLControlNetImg2ImgPipeline),
("flux-depth", FluxImg2ImgPipeline),
("flux-dev", FluxImg2ImgPipeline),
("flux-fill", FluxImg2ImgPipeline),
("flux-schnell", FluxImg2ImgPipeline),
("hunyuan-video", None),
("inpainting", None),
("inpainting_v2", None),
("controlnet", StableDiffusionControlNetImg2ImgPipeline),
("v2", StableDiffusionImg2ImgPipeline),
("ltx-video", None),
("ltx-video-0.9.1", None),
("mochi-1-preview", None),
("playground-v2-5", StableDiffusionXLImg2ImgPipeline),
("sd3", StableDiffusion3Img2ImgPipeline),
("sd35_large", StableDiffusion3Img2ImgPipeline),
("sd35_medium", StableDiffusion3Img2ImgPipeline),
("stable_cascade_stage_b", None),
("stable_cascade_stage_b_lite", None),
("stable_cascade_stage_c", None),
("stable_cascade_stage_c_lite", None),
("upscale", StableDiffusionUpscalePipeline),
("v1", StableDiffusionImg2ImgPipeline),
("v2", StableDiffusionImg2ImgPipeline),
("xl_base", StableDiffusionXLImg2ImgPipeline),
("xl_inpaint", None),
("xl_refiner", StableDiffusionXLImg2ImgPipeline),
]
)
SINGLE_FILE_CHECKPOINT_INPAINT_PIPELINE_MAPPING = OrderedDict(
[
("xl_base", None),
("xl_refiner", None),
("xl_inpaint", StableDiffusionXLInpaintPipeline),
("playground-v2-5", None),
("upscale", None),
("animatediff_rgb", None),
("animatediff_scribble", None),
("animatediff_sdxl_beta", None),
("animatediff_v1", None),
("animatediff_v2", None),
("animatediff_v3", None),
("autoencoder-dc-f128c512", None),
("autoencoder-dc-f32c32", None),
("autoencoder-dc-f32c32-sana", None),
("autoencoder-dc-f64c128", None),
("controlnet", StableDiffusionControlNetInpaintPipeline),
("controlnet_xl", None),
("controlnet_xl_large", None),
("controlnet_xl_mid", None),
("controlnet_xl_small", None),
("flux-depth", None),
("flux-dev", None),
("flux-fill", None),
("flux-schnell", None),
("hunyuan-video", None),
("inpainting", StableDiffusionInpaintPipeline),
("inpainting_v2", StableDiffusionInpaintPipeline),
("controlnet", StableDiffusionControlNetInpaintPipeline),
("v2", None),
("ltx-video", None),
("ltx-video-0.9.1", None),
("mochi-1-preview", None),
("playground-v2-5", None),
("sd3", None),
("sd35_large", None),
("sd35_medium", None),
("stable_cascade_stage_b", None),
("stable_cascade_stage_b_lite", None),
("stable_cascade_stage_c", None),
("stable_cascade_stage_c_lite", None),
("upscale", StableDiffusionUpscalePipeline),
("v1", None),
("v2", None),
("xl_base", None),
("xl_inpaint", StableDiffusionXLInpaintPipeline),
("xl_refiner", None),
]
)
@@ -116,14 +211,33 @@ CONFIG_FILE_LIST = [
"diffusion_pytorch_model.non_ema.safetensors",
]
DIFFUSERS_CONFIG_DIR = ["safety_checker", "unet", "vae", "text_encoder", "text_encoder_2"]
INPAINT_PIPELINE_KEYS = [
"xl_inpaint",
"inpainting",
"inpainting_v2",
DIFFUSERS_CONFIG_DIR = [
"safety_checker",
"unet",
"vae",
"text_encoder",
"text_encoder_2",
]
TOKENIZER_SHAPE_MAP = {
768: [
"SD 1.4",
"SD 1.5",
"SD 1.5 LCM",
"SDXL 0.9",
"SDXL 1.0",
"SDXL 1.0 LCM",
"SDXL Distilled",
"SDXL Turbo",
"SDXL Lightning",
"PixArt a",
"Playground v2",
"Pony",
],
1024: ["SD 2.0", "SD 2.0 768", "SD 2.1", "SD 2.1 768", "SD 2.1 Unclip"],
}
EXTENSION = [".safetensors", ".ckpt", ".bin"]
CACHE_HOME = os.path.expanduser("~/.cache")
@@ -162,12 +276,28 @@ class ModelStatus:
The name of the model file.
local (`bool`):
Whether the model exists locally
site_url (`str`):
The URL of the site where the model is hosted.
"""
search_word: str = ""
download_url: str = ""
file_name: str = ""
local: bool = False
site_url: str = ""
@dataclass
class ExtraStatus:
r"""
Data class for storing extra status information.
Attributes:
trained_words (`List[str]`, *optional*):
The words used to trigger the model
"""
trained_words: Union[List[str], None] = None
@dataclass
@@ -191,8 +321,9 @@ class SearchResult:
model_path: str = ""
loading_method: Union[str, None] = None
checkpoint_format: Union[str, None] = None
repo_status: RepoStatus = RepoStatus()
model_status: ModelStatus = ModelStatus()
repo_status: RepoStatus = field(default_factory=RepoStatus)
model_status: ModelStatus = field(default_factory=ModelStatus)
extra_status: ExtraStatus = field(default_factory=ExtraStatus)
@validate_hf_hub_args
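The move to `field(default_factory=...)` matters because a default written as `ModelStatus()` is evaluated once, at class definition, and since Python 3.11 an unhashable default such as a regular dataclass instance is rejected with a `ValueError`. The sketch below uses simplified stand-in classes (`Status` and `Result` are hypothetical, not the classes from this file) to show that `default_factory` gives each instance its own nested object.

```python
from dataclasses import dataclass, field


@dataclass
class Status:
    file_name: str = ""
    local: bool = False


@dataclass
class Result:
    model_path: str = ""
    # A fresh Status is built for every Result instead of sharing one object.
    status: Status = field(default_factory=Status)


first = Result()
second = Result()
first.status.local = True
print(second.status.local)  # the second instance is unaffected
```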
@@ -385,6 +516,7 @@ def file_downloader(
proxies = kwargs.pop("proxies", None)
force_download = kwargs.pop("force_download", False)
displayed_filename = kwargs.pop("displayed_filename", None)
# Default mode for file writing and initial file size
mode = "wb"
file_size = 0
@@ -396,7 +528,7 @@ def file_downloader(
if os.path.exists(save_path):
if not force_download:
# If the file exists and force_download is False, skip the download
logger.warning(f"File already exists: {save_path}, skipping download.")
logger.info(f"File already exists: {save_path}, skipping download.")
return None
elif resume:
# If resuming, set mode to append binary and get current file size
@@ -457,10 +589,18 @@ def search_huggingface(search_word: str, **kwargs) -> Union[str, SearchResult, N
gated = kwargs.pop("gated", False)
skip_error = kwargs.pop("skip_error", False)
file_list = []
hf_repo_info = {}
hf_security_info = {}
model_path = ""
repo_id, file_name = "", ""
diffusers_model_exists = False
# Get the type and loading method for the keyword
search_word_status = get_keyword_types(search_word)
if search_word_status["type"]["hf_repo"]:
hf_repo_info = hf_api.model_info(repo_id=search_word, securityStatus=True)
if download:
model_path = DiffusionPipeline.download(
search_word,
@@ -503,13 +643,6 @@ def search_huggingface(search_word: str, **kwargs) -> Union[str, SearchResult, N
)
model_dicts = [asdict(value) for value in list(hf_models)]
file_list = []
hf_repo_info = {}
hf_security_info = {}
model_path = ""
repo_id, file_name = "", ""
diffusers_model_exists = False
# Loop through models to find a suitable candidate
for repo_info in model_dicts:
repo_id = repo_info["id"]
@@ -523,7 +656,10 @@ def search_huggingface(search_word: str, **kwargs) -> Union[str, SearchResult, N
if hf_security_info["scansDone"]:
for info in repo_info["siblings"]:
file_path = info["rfilename"]
if "model_index.json" == file_path and checkpoint_format in ["diffusers", "all"]:
if "model_index.json" == file_path and checkpoint_format in [
"diffusers",
"all",
]:
diffusers_model_exists = True
break
@@ -571,6 +707,10 @@ def search_huggingface(search_word: str, **kwargs) -> Union[str, SearchResult, N
force_download=force_download,
)
# `pathlib.PosixPath` may be returned
if model_path:
model_path = str(model_path)
if file_name:
download_url = f"https://huggingface.co/{repo_id}/blob/main/{file_name}"
else:
@@ -586,10 +726,12 @@ def search_huggingface(search_word: str, **kwargs) -> Union[str, SearchResult, N
repo_status=RepoStatus(repo_id=repo_id, repo_hash=hf_repo_info.sha, version=revision),
model_status=ModelStatus(
search_word=search_word,
site_url=download_url,
download_url=download_url,
file_name=file_name,
local=download,
),
extra_status=ExtraStatus(trained_words=None),
)
else:
@@ -605,6 +747,8 @@ def search_civitai(search_word: str, **kwargs) -> Union[str, SearchResult, None]
The search query string.
model_type (`str`, *optional*, defaults to `Checkpoint`):
The type of model to search for.
sort (`str`, *optional*):
The order in which you wish to sort the results (for example, `Highest Rated`, `Most Downloaded`, `Newest`).
base_model (`str`, *optional*):
The base model to filter by.
download (`bool`, *optional*, defaults to `False`):
@@ -628,6 +772,7 @@ def search_civitai(search_word: str, **kwargs) -> Union[str, SearchResult, None]
# Extract additional parameters from kwargs
model_type = kwargs.pop("model_type", "Checkpoint")
sort = kwargs.pop("sort", None)
download = kwargs.pop("download", False)
base_model = kwargs.pop("base_model", None)
force_download = kwargs.pop("force_download", False)
@@ -642,6 +787,7 @@ def search_civitai(search_word: str, **kwargs) -> Union[str, SearchResult, None]
repo_name = ""
repo_id = ""
version_id = ""
trainedWords = ""
models_list = []
selected_repo = {}
selected_model = {}
@@ -652,12 +798,16 @@ def search_civitai(search_word: str, **kwargs) -> Union[str, SearchResult, None]
params = {
"query": search_word,
"types": model_type,
"sort": "Most Downloaded",
"limit": 20,
}
if base_model is not None:
if not isinstance(base_model, list):
base_model = [base_model]
params["baseModel"] = base_model
if sort is not None:
params["sort"] = sort
headers = {}
if token:
headers["Authorization"] = f"Bearer {token}"
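The optional-parameter handling in this hunk can be isolated into a small function. This is a sketch only: `build_search_params` is a hypothetical helper, not part of `pipeline_easy.py`, and it assumes that `sort` is simply left out of the request when the caller does not pass it.

```python
def build_search_params(search_word, model_type="Checkpoint", base_model=None, sort=None, limit=20):
    """Assemble query parameters, adding optional filters only when given."""
    params = {"query": search_word, "types": model_type, "limit": limit}
    if base_model is not None:
        # Accept a single base model name or a list of names.
        params["baseModel"] = base_model if isinstance(base_model, list) else [base_model]
    if sort is not None:
        params["sort"] = sort
    return params


params = build_search_params("EasyNegative", model_type="TextualInversion", base_model="SD 1.5")
print(params)
```

Normalizing `base_model` to a list up front also lets later code test membership with `in` regardless of how the caller passed it.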
@@ -686,25 +836,30 @@ def search_civitai(search_word: str, **kwargs) -> Union[str, SearchResult, None]
# Sort versions within the selected repo by download count
sorted_versions = sorted(
selected_repo["modelVersions"], key=lambda x: x["stats"]["downloadCount"], reverse=True
selected_repo["modelVersions"],
key=lambda x: x["stats"]["downloadCount"],
reverse=True,
)
for selected_version in sorted_versions:
version_id = selected_version["id"]
trainedWords = selected_version["trainedWords"]
models_list = []
for model_data in selected_version["files"]:
# Check if the file passes security scans and has a valid extension
file_name = model_data["name"]
if (
model_data["pickleScanResult"] == "Success"
and model_data["virusScanResult"] == "Success"
and any(file_name.endswith(ext) for ext in EXTENSION)
and os.path.basename(os.path.dirname(file_name)) not in DIFFUSERS_CONFIG_DIR
):
file_status = {
"filename": file_name,
"download_url": model_data["downloadUrl"],
}
models_list.append(file_status)
# When searching for textual inversion, results other than the values entered for the base model may come up, so check again.
if base_model is None or selected_version["baseModel"] in base_model:
for model_data in selected_version["files"]:
# Check if the file passes security scans and has a valid extension
file_name = model_data["name"]
if (
model_data["pickleScanResult"] == "Success"
and model_data["virusScanResult"] == "Success"
and any(file_name.endswith(ext) for ext in EXTENSION)
and os.path.basename(os.path.dirname(file_name)) not in DIFFUSERS_CONFIG_DIR
):
file_status = {
"filename": file_name,
"download_url": model_data["downloadUrl"],
}
models_list.append(file_status)
if models_list:
# Sort the models list by filename and find the safest model
@@ -764,19 +919,229 @@ def search_civitai(search_word: str, **kwargs) -> Union[str, SearchResult, None]
repo_status=RepoStatus(repo_id=repo_name, repo_hash=repo_id, version=version_id),
model_status=ModelStatus(
search_word=search_word,
site_url=f"https://civitai.com/models/{repo_id}?modelVersionId={version_id}",
download_url=download_url,
file_name=file_name,
local=output_info["type"]["local"],
),
extra_status=ExtraStatus(trained_words=trainedWords or None),
)
def add_methods(pipeline):
r"""
Add methods from `AutoConfig` to the pipeline.
Parameters:
pipeline (`Pipeline`):
The pipeline to which the methods will be added.
"""
for attr_name in dir(AutoConfig):
attr_value = getattr(AutoConfig, attr_name)
if callable(attr_value) and not attr_name.startswith("__"):
setattr(pipeline, attr_name, types.MethodType(attr_value, pipeline))
return pipeline
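`add_methods` relies on `types.MethodType` to bind plain functions to one existing object rather than to its class. A minimal standard-library sketch of that technique follows; `Extras`, `Pipeline`, and `describe` are hypothetical names used only for illustration.

```python
import types


class Extras:
    def describe(self):
        return f"{type(self).__name__} with extras"


class Pipeline:
    pass


def add_methods(obj, source=Extras):
    """Bind every public callable defined on `source` to the single object `obj`."""
    for name in dir(source):
        value = getattr(source, name)
        if callable(value) and not name.startswith("__"):
            # MethodType fixes `self` to `obj`, so the function behaves like a bound method.
            setattr(obj, name, types.MethodType(value, obj))
    return obj


pipe = add_methods(Pipeline())
print(pipe.describe())
```

Only the patched instance gains the method; other `Pipeline` objects are left untouched.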
class AutoConfig:
def auto_load_textual_inversion(
self,
pretrained_model_name_or_path: Union[str, List[str]],
token: Optional[Union[str, List[str]]] = None,
base_model: Optional[Union[str, List[str]]] = None,
tokenizer=None,
text_encoder=None,
**kwargs,
):
r"""
Load Textual Inversion embeddings into the text encoder of [`StableDiffusionPipeline`] (both 🤗 Diffusers and
Automatic1111 formats are supported).
Parameters:
pretrained_model_name_or_path (`str` or `os.PathLike` or `List[str or os.PathLike]` or `Dict` or `List[Dict]`):
Can be either one of the following or a list of them:
- Search keywords for pretrained model (for example `EasyNegative`).
- A string, the *model id* (for example `sd-concepts-library/low-poly-hd-logos-icons`) of a
pretrained model hosted on the Hub.
- A path to a *directory* (for example `./my_text_inversion_directory/`) containing the textual
inversion weights.
- A path to a *file* (for example `./my_text_inversions.pt`) containing textual inversion weights.
- A [torch state
dict](https://pytorch.org/tutorials/beginner/saving_loading_models.html#what-is-a-state-dict).
token (`str` or `List[str]`, *optional*):
Override the token to use for the textual inversion weights. If `pretrained_model_name_or_path` is a
list, then `token` must also be a list of equal length.
text_encoder ([`~transformers.CLIPTextModel`], *optional*):
Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)).
If not specified, function will take self.text_encoder.
tokenizer ([`~transformers.CLIPTokenizer`], *optional*):
A `CLIPTokenizer` to tokenize text. If not specified, function will take self.tokenizer.
weight_name (`str`, *optional*):
Name of a custom weight file. This should be used when:
- The saved textual inversion file is in 🤗 Diffusers format, but was saved under a specific weight
name such as `text_inv.bin`.
- The saved textual inversion file is in the Automatic1111 format.
cache_dir (`Union[str, os.PathLike]`, *optional*):
Path to a directory where a downloaded pretrained model configuration is cached if the standard cache
is not used.
force_download (`bool`, *optional*, defaults to `False`):
Whether or not to force the (re-)download of the model weights and configuration files, overriding the
cached versions if they exist.
proxies (`Dict[str, str]`, *optional*):
A dictionary of proxy servers to use by protocol or endpoint, for example, `{'http': 'foo.bar:3128',
'http://hostname': 'foo.bar:4012'}`. The proxies are used on each request.
local_files_only (`bool`, *optional*, defaults to `False`):
Whether to only load local model weights and configuration files or not. If set to `True`, the model
won't be downloaded from the Hub.
token (`str` or *bool*, *optional*):
The token to use as HTTP bearer authorization for remote files. If `True`, the token generated from
`diffusers-cli login` (stored in `~/.huggingface`) is used.
revision (`str`, *optional*, defaults to `"main"`):
The specific model version to use. It can be a branch name, a tag name, a commit id, or any identifier
allowed by Git.
subfolder (`str`, *optional*, defaults to `""`):
The subfolder location of a model file within a larger model repository on the Hub or locally.
mirror (`str`, *optional*):
Mirror source to resolve accessibility issues if you're downloading a model in China. We do not
guarantee the timeliness or safety of the source, and you should refer to the mirror site for more
information.
Examples:
```py
>>> from auto_diffusers import EasyPipelineForText2Image
>>> pipeline = EasyPipelineForText2Image.from_huggingface("stable-diffusion-v1-5")
>>> pipeline.auto_load_textual_inversion("EasyNegative", token="EasyNegative")
>>> image = pipeline(prompt).images[0]
```
"""
# 1. Set tokenizer and text encoder
tokenizer = tokenizer or getattr(self, "tokenizer", None)
text_encoder = text_encoder or getattr(self, "text_encoder", None)
# Check if tokenizer and text encoder are provided
if tokenizer is None or text_encoder is None:
raise ValueError("Tokenizer and text encoder must be provided.")
# 2. Normalize inputs
pretrained_model_name_or_paths = (
[pretrained_model_name_or_path]
if not isinstance(pretrained_model_name_or_path, list)
else pretrained_model_name_or_path
)
# 2.1 Normalize tokens
tokens = [token] if not isinstance(token, list) else token
if tokens[0] is None:
tokens = tokens * len(pretrained_model_name_or_paths)
for check_token in tokens:
# Check if token is already in tokenizer vocabulary
if check_token in tokenizer.get_vocab():
raise ValueError(
f"Token {check_token} already in tokenizer vocabulary. Please choose a different token name or remove {check_token} and embedding from the tokenizer and text encoder."
)
expected_shape = text_encoder.get_input_embeddings().weight.shape[-1] # Embedding dimension expected by the text encoder
for search_word in pretrained_model_name_or_paths:
if isinstance(search_word, str):
# Update kwargs to ensure the model is downloaded and parameters are included
_status = {
"download": True,
"include_params": True,
"skip_error": False,
"model_type": "TextualInversion",
}
# Get tags for the base model of textual inversion compatible with tokenizer.
# If the tokenizer is 768-dimensional, set tags for SD 1.x and SDXL.
# If the tokenizer is 1024-dimensional, set tags for SD 2.x.
if expected_shape in TOKENIZER_SHAPE_MAP:
# Retrieve the appropriate tags from the TOKENIZER_SHAPE_MAP based on the expected shape.
# Copy them so that extending the tags below does not mutate the shared map.
tags = list(TOKENIZER_SHAPE_MAP[expected_shape])
if base_model is not None:
if isinstance(base_model, list):
tags.extend(base_model)
else:
tags.append(base_model)
_status["base_model"] = tags
kwargs.update(_status)
# Search for the model on Civitai and get the model status
textual_inversion_path = search_civitai(search_word, **kwargs)
logger.warning(
f"textual_inversion_path: {search_word} -> {textual_inversion_path.model_status.site_url}"
)
pretrained_model_name_or_paths[
pretrained_model_name_or_paths.index(search_word)
] = textual_inversion_path.model_path
self.load_textual_inversion(
pretrained_model_name_or_paths, token=tokens, tokenizer=tokenizer, text_encoder=text_encoder, **kwargs
)
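The input normalization and vocabulary check performed at the top of `auto_load_textual_inversion` can be sketched in isolation as follows. This is a minimal illustration, not the method itself: `normalize_inputs` is a hypothetical helper, and a plain collection of strings stands in for `tokenizer.get_vocab()`.

```python
from typing import Collection, List, Optional, Tuple, Union


def normalize_inputs(
    names: Union[str, List[str]],
    token: Union[None, str, List[Optional[str]]],
    vocab: Collection[str],
) -> Tuple[List[str], List[Optional[str]]]:
    """Hypothetical helper: wrap scalars in lists and reject tokens already in the vocabulary."""
    # Wrap a single name or token in a list so later code can iterate uniformly.
    name_list = list(names) if isinstance(names, list) else [names]
    token_list = list(token) if isinstance(token, list) else [token]
    # A missing token is repeated once per model, as in the method above.
    if token_list[0] is None:
        token_list = token_list * len(name_list)
    for candidate in token_list:
        # Report the offending token itself rather than the whole input.
        if candidate is not None and candidate in vocab:
            raise ValueError(f"Token {candidate!r} is already in the tokenizer vocabulary.")
    return name_list, token_list


names, tokens = normalize_inputs(["EasyNegative", "bad-hands"], None, {"cat", "dog"})
print(names, tokens)
```

Naming the colliding token in the error message makes it easier to see which entry of a list input needs to be renamed.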
def auto_load_lora_weights(
self, pretrained_model_name_or_path_or_dict: Union[str, Dict[str, torch.Tensor]], adapter_name=None, **kwargs
):
r"""
Load LoRA weights specified in `pretrained_model_name_or_path_or_dict` into `self.unet` and
`self.text_encoder`.
All kwargs are forwarded to `self.lora_state_dict`.
See [`~loaders.StableDiffusionLoraLoaderMixin.lora_state_dict`] for more details on how the state dict is
loaded.
See [`~loaders.StableDiffusionLoraLoaderMixin.load_lora_into_unet`] for more details on how the state dict is
loaded into `self.unet`.
See [`~loaders.StableDiffusionLoraLoaderMixin.load_lora_into_text_encoder`] for more details on how the state
dict is loaded into `self.text_encoder`.
Parameters:
pretrained_model_name_or_path_or_dict (`str` or `os.PathLike` or `dict`):
See [`~loaders.StableDiffusionLoraLoaderMixin.lora_state_dict`].
adapter_name (`str`, *optional*):
Adapter name to be used for referencing the loaded adapter model. If not specified, it will use
`default_{i}` where i is the total number of adapters being loaded.
low_cpu_mem_usage (`bool`, *optional*):
Speed up model loading by only loading the pretrained LoRA weights and not initializing the random
weights.
kwargs (`dict`, *optional*):
See [`~loaders.StableDiffusionLoraLoaderMixin.lora_state_dict`].
"""
if isinstance(pretrained_model_name_or_path_or_dict, str):
# Update kwargs to ensure the model is downloaded and parameters are included
_status = {
"download": True,
"include_params": True,
"skip_error": False,
"model_type": "LORA",
}
kwargs.update(_status)
# Search for the model on Civitai and get the model status
lora_path = search_civitai(pretrained_model_name_or_path_or_dict, **kwargs)
logger.warning(f"lora_path: {lora_path.model_status.site_url}")
logger.warning(f"trained_words: {lora_path.extra_status.trained_words}")
pretrained_model_name_or_path_or_dict = lora_path.model_path
self.load_lora_weights(pretrained_model_name_or_path_or_dict, adapter_name=adapter_name, **kwargs)
class EasyPipelineForText2Image(AutoPipelineForText2Image):
r"""
[`AutoPipelineForText2Image`] is a generic pipeline class that instantiates a text-to-image pipeline class. The
[`EasyPipelineForText2Image`] is a generic pipeline class that instantiates a text-to-image pipeline class. The
specific underlying pipeline class is automatically selected from either the
[`~AutoPipelineForText2Image.from_pretrained`] or [`~AutoPipelineForText2Image.from_pipe`] methods.
[`~EasyPipelineForText2Image.from_pretrained`], [`~EasyPipelineForText2Image.from_pipe`], [`~EasyPipelineForText2Image.from_huggingface`] or [`~EasyPipelineForText2Image.from_civitai`] methods.
This class cannot be instantiated using `__init__()` (throws an error).
@@ -891,9 +1256,9 @@ class EasyPipelineForText2Image(AutoPipelineForText2Image):
Examples:
```py
>>> from diffusers import AutoPipelineForText2Image
>>> from auto_diffusers import EasyPipelineForText2Image
>>> pipeline = AutoPipelineForText2Image.from_huggingface("stable-diffusion-v1-5")
>>> pipeline = EasyPipelineForText2Image.from_huggingface("stable-diffusion-v1-5")
>>> image = pipeline(prompt).images[0]
```
"""
@@ -907,20 +1272,21 @@ class EasyPipelineForText2Image(AutoPipelineForText2Image):
kwargs.update(_status)
# Search for the model on Hugging Face and get the model status
hf_model_status = search_huggingface(pretrained_model_link_or_path, **kwargs)
logger.warning(f"checkpoint_path: {hf_model_status.model_status.download_url}")
checkpoint_path = hf_model_status.model_path
hf_checkpoint_status = search_huggingface(pretrained_model_link_or_path, **kwargs)
logger.warning(f"checkpoint_path: {hf_checkpoint_status.model_status.download_url}")
checkpoint_path = hf_checkpoint_status.model_path
# Check the format of the model checkpoint
if hf_model_status.checkpoint_format == "single_file":
if hf_checkpoint_status.loading_method == "from_single_file":
# Load the pipeline from a single file checkpoint
return load_pipeline_from_single_file(
pipeline = load_pipeline_from_single_file(
pretrained_model_or_path=checkpoint_path,
pipeline_mapping=SINGLE_FILE_CHECKPOINT_TEXT2IMAGE_PIPELINE_MAPPING,
**kwargs,
)
else:
return cls.from_pretrained(checkpoint_path, **kwargs)
pipeline = cls.from_pretrained(checkpoint_path, **kwargs)
return add_methods(pipeline)
@classmethod
def from_civitai(cls, pretrained_model_link_or_path, **kwargs):
@@ -999,9 +1365,9 @@ class EasyPipelineForText2Image(AutoPipelineForText2Image):
Examples:
```py
>>> from diffusers import AutoPipelineForText2Image
>>> from auto_diffusers import EasyPipelineForText2Image
>>> pipeline = AutoPipelineForText2Image.from_huggingface("stable-diffusion-v1-5")
>>> pipeline = EasyPipelineForText2Image.from_huggingface("stable-diffusion-v1-5")
>>> image = pipeline(prompt).images[0]
```
"""
@@ -1015,24 +1381,25 @@ class EasyPipelineForText2Image(AutoPipelineForText2Image):
kwargs.update(_status)
# Search for the model on Civitai and get the model status
model_status = search_civitai(pretrained_model_link_or_path, **kwargs)
logger.warning(f"checkpoint_path: {model_status.model_status.download_url}")
checkpoint_path = model_status.model_path
checkpoint_status = search_civitai(pretrained_model_link_or_path, **kwargs)
logger.warning(f"checkpoint_path: {checkpoint_status.model_status.site_url}")
checkpoint_path = checkpoint_status.model_path
# Load the pipeline from a single file checkpoint
return load_pipeline_from_single_file(
pipeline = load_pipeline_from_single_file(
pretrained_model_or_path=checkpoint_path,
pipeline_mapping=SINGLE_FILE_CHECKPOINT_TEXT2IMAGE_PIPELINE_MAPPING,
**kwargs,
)
return add_methods(pipeline)
class EasyPipelineForImage2Image(AutoPipelineForImage2Image):
r"""
[`AutoPipelineForImage2Image`] is a generic pipeline class that instantiates an image-to-image pipeline class. The
[`EasyPipelineForImage2Image`] is a generic pipeline class that instantiates an image-to-image pipeline class. The
specific underlying pipeline class is automatically selected from either the
[`~AutoPipelineForImage2Image.from_pretrained`] or [`~AutoPipelineForImage2Image.from_pipe`] methods.
[`~EasyPipelineForImage2Image.from_pretrained`], [`~EasyPipelineForImage2Image.from_pipe`], [`~EasyPipelineForImage2Image.from_huggingface`] or [`~EasyPipelineForImage2Image.from_civitai`] methods.
This class cannot be instantiated using `__init__()` (throws an error).
@@ -1147,10 +1514,10 @@ class EasyPipelineForImage2Image(AutoPipelineForImage2Image):
Examples:
```py
>>> from diffusers import AutoPipelineForText2Image
>>> from auto_diffusers import EasyPipelineForImage2Image
>>> pipeline = AutoPipelineForText2Image.from_huggingface("stable-diffusion-v1-5")
>>> image = pipeline(prompt).images[0]
>>> pipeline = EasyPipelineForImage2Image.from_huggingface("stable-diffusion-v1-5")
>>> image = pipeline(prompt, image).images[0]
```
"""
# Update kwargs to ensure the model is downloaded and parameters are included
@@ -1163,20 +1530,22 @@ class EasyPipelineForImage2Image(AutoPipelineForImage2Image):
kwargs.update(_parmas)
# Search for the model on Hugging Face and get the model status
model_status = search_huggingface(pretrained_model_link_or_path, **kwargs)
logger.warning(f"checkpoint_path: {model_status.model_status.download_url}")
checkpoint_path = model_status.model_path
hf_checkpoint_status = search_huggingface(pretrained_model_link_or_path, **kwargs)
logger.warning(f"checkpoint_path: {hf_checkpoint_status.model_status.download_url}")
checkpoint_path = hf_checkpoint_status.model_path
# Check the format of the model checkpoint
if model_status.checkpoint_format == "single_file":
if hf_checkpoint_status.loading_method == "from_single_file":
# Load the pipeline from a single file checkpoint
return load_pipeline_from_single_file(
pipeline = load_pipeline_from_single_file(
pretrained_model_or_path=checkpoint_path,
pipeline_mapping=SINGLE_FILE_CHECKPOINT_IMAGE2IMAGE_PIPELINE_MAPPING,
**kwargs,
)
else:
return cls.from_pretrained(checkpoint_path, **kwargs)
pipeline = cls.from_pretrained(checkpoint_path, **kwargs)
return add_methods(pipeline)
@classmethod
def from_civitai(cls, pretrained_model_link_or_path, **kwargs):
@@ -1255,10 +1624,10 @@ class EasyPipelineForImage2Image(AutoPipelineForImage2Image):
Examples:
```py
>>> from diffusers import AutoPipelineForText2Image
>>> from auto_diffusers import EasyPipelineForImage2Image
>>> pipeline = AutoPipelineForText2Image.from_huggingface("stable-diffusion-v1-5")
>>> image = pipeline(prompt).images[0]
>>> pipeline = EasyPipelineForImage2Image.from_huggingface("stable-diffusion-v1-5")
>>> image = pipeline(prompt, image).images[0]
```
"""
# Update kwargs to ensure the model is downloaded and parameters are included
@@ -1271,24 +1640,25 @@ class EasyPipelineForImage2Image(AutoPipelineForImage2Image):
kwargs.update(_status)
# Search for the model on Civitai and get the model status
model_status = search_civitai(pretrained_model_link_or_path, **kwargs)
logger.warning(f"checkpoint_path: {model_status.model_status.download_url}")
checkpoint_path = model_status.model_path
checkpoint_status = search_civitai(pretrained_model_link_or_path, **kwargs)
logger.warning(f"checkpoint_path: {checkpoint_status.model_status.site_url}")
checkpoint_path = checkpoint_status.model_path
# Load the pipeline from a single file checkpoint
return load_pipeline_from_single_file(
pipeline = load_pipeline_from_single_file(
pretrained_model_or_path=checkpoint_path,
pipeline_mapping=SINGLE_FILE_CHECKPOINT_IMAGE2IMAGE_PIPELINE_MAPPING,
**kwargs,
)
return add_methods(pipeline)
class EasyPipelineForInpainting(AutoPipelineForInpainting):
r"""
[`AutoPipelineForInpainting`] is a generic pipeline class that instantiates an inpainting pipeline class. The
[`EasyPipelineForInpainting`] is a generic pipeline class that instantiates an inpainting pipeline class. The
specific underlying pipeline class is automatically selected from either the
[`~AutoPipelineForInpainting.from_pretrained`] or [`~AutoPipelineForInpainting.from_pipe`] methods.
[`~EasyPipelineForInpainting.from_pretrained`], [`~EasyPipelineForInpainting.from_pipe`], [`~EasyPipelineForInpainting.from_huggingface`] or [`~EasyPipelineForInpainting.from_civitai`] methods.
This class cannot be instantiated using `__init__()` (throws an error).
@@ -1403,10 +1773,10 @@ class EasyPipelineForInpainting(AutoPipelineForInpainting):
Examples:
```py
>>> from diffusers import AutoPipelineForText2Image
>>> from auto_diffusers import EasyPipelineForInpainting
>>> pipeline = AutoPipelineForText2Image.from_huggingface("stable-diffusion-v1-5")
>>> image = pipeline(prompt).images[0]
>>> pipeline = EasyPipelineForInpainting.from_huggingface("stable-diffusion-2-inpainting")
>>> image = pipeline(prompt, image=init_image, mask_image=mask_image).images[0]
```
"""
# Update kwargs to ensure the model is downloaded and parameters are included
@@ -1419,20 +1789,21 @@ class EasyPipelineForInpainting(AutoPipelineForInpainting):
kwargs.update(_status)
# Search for the model on Hugging Face and get the model status
model_status = search_huggingface(pretrained_model_link_or_path, **kwargs)
logger.warning(f"checkpoint_path: {model_status.model_status.download_url}")
checkpoint_path = model_status.model_path
hf_checkpoint_status = search_huggingface(pretrained_model_link_or_path, **kwargs)
logger.warning(f"checkpoint_path: {hf_checkpoint_status.model_status.download_url}")
checkpoint_path = hf_checkpoint_status.model_path
# Check the format of the model checkpoint
if model_status.checkpoint_format == "single_file":
if hf_checkpoint_status.loading_method == "from_single_file":
# Load the pipeline from a single file checkpoint
return load_pipeline_from_single_file(
pipeline = load_pipeline_from_single_file(
pretrained_model_or_path=checkpoint_path,
pipeline_mapping=SINGLE_FILE_CHECKPOINT_INPAINT_PIPELINE_MAPPING,
**kwargs,
)
else:
return cls.from_pretrained(checkpoint_path, **kwargs)
pipeline = cls.from_pretrained(checkpoint_path, **kwargs)
return add_methods(pipeline)
@classmethod
def from_civitai(cls, pretrained_model_link_or_path, **kwargs):
@@ -1511,10 +1882,10 @@ class EasyPipelineForInpainting(AutoPipelineForInpainting):
Examples:
```py
>>> from diffusers import AutoPipelineForText2Image
>>> from auto_diffusers import EasyPipelineForInpainting
>>> pipeline = AutoPipelineForText2Image.from_huggingface("stable-diffusion-v1-5")
>>> image = pipeline(prompt).images[0]
>>> pipeline = EasyPipelineForInpainting.from_huggingface("stable-diffusion-2-inpainting")
>>> image = pipeline(prompt, image=init_image, mask_image=mask_image).images[0]
```
"""
# Update kwargs to ensure the model is downloaded and parameters are included
@@ -1527,13 +1898,14 @@ class EasyPipelineForInpainting(AutoPipelineForInpainting):
kwargs.update(_status)
# Search for the model on Civitai and get the model status
model_status = search_civitai(pretrained_model_link_or_path, **kwargs)
logger.warning(f"checkpoint_path: {model_status.model_status.download_url}")
checkpoint_path = model_status.model_path
checkpoint_status = search_civitai(pretrained_model_link_or_path, **kwargs)
logger.warning(f"checkpoint_path: {checkpoint_status.model_status.site_url}")
checkpoint_path = checkpoint_status.model_path
# Load the pipeline from a single file checkpoint
return load_pipeline_from_single_file(
pipeline = load_pipeline_from_single_file(
pretrained_model_or_path=checkpoint_path,
pipeline_mapping=SINGLE_FILE_CHECKPOINT_INPAINT_PIPELINE_MAPPING,
**kwargs,
)
return add_methods(pipeline)
@@ -0,0 +1,59 @@
# AutoencoderKL training example
## Installing the dependencies
Before running the scripts, make sure to install the library's training dependencies:
**Important**
To make sure you can successfully run the latest versions of the example scripts, we highly recommend **installing from source** and keeping the install up to date as we update the example scripts frequently and install some example-specific requirements. To do this, execute the following steps in a new virtual environment:
```bash
git clone https://github.com/huggingface/diffusers
cd diffusers
pip install .
```
Then `cd` into the example folder and run:
```bash
pip install -r requirements.txt
```
And initialize an [🤗Accelerate](https://github.com/huggingface/accelerate/) environment with:
```bash
accelerate config
```
## Training on CIFAR10
Please replace the validation image with your own image.
```bash
accelerate launch train_autoencoderkl.py \
--pretrained_model_name_or_path stabilityai/sd-vae-ft-mse \
--dataset_name=cifar10 \
--image_column=img \
--validation_image images/bird.jpg images/car.jpg images/dog.jpg images/frog.jpg \
--num_train_epochs 100 \
--gradient_accumulation_steps 2 \
--learning_rate 4.5e-6 \
--lr_scheduler cosine \
--report_to wandb
```
## Training on ImageNet
```bash
accelerate launch train_autoencoderkl.py \
--pretrained_model_name_or_path stabilityai/sd-vae-ft-mse \
--num_train_epochs 100 \
--gradient_accumulation_steps 2 \
--learning_rate 4.5e-6 \
--lr_scheduler cosine \
--report_to wandb \
--mixed_precision bf16 \
--train_data_dir /path/to/ImageNet/train \
--validation_image ./image.png \
--decoder_only
```
@@ -0,0 +1,15 @@
accelerate>=0.16.0
bitsandbytes
datasets
huggingface_hub
lpips
numpy
packaging
Pillow
taming_transformers
torch
torchvision
tqdm
transformers
wandb
xformers
File diff suppressed because it is too large
@@ -1,6 +1,6 @@
#!/usr/bin/env python
# coding=utf-8
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -1,6 +1,6 @@
#!/usr/bin/env python
# coding=utf-8
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -0,0 +1,30 @@
# Diffusion Model Alignment Using GRPO
This directory provides a LoRA implementation of Diffusion [GRPO](https://arxiv.org/abs/2402.03300) (Group Relative Policy Optimization), an RL-based alignment method that is a variant of Proximal Policy Optimization (PPO), applied here in the diffusion model setting.
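The group-relative part of GRPO can be illustrated with a short sketch. It is not taken from `train_diffusion_grpo_sdxl.py`; it only shows the general idea from the GRPO paper, where each sample's reward is normalized against the other samples generated for the same prompt. The function name, the use of the population standard deviation, and the `eps` term are assumptions made for illustration.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against its own group (illustrative sketch)."""
    mu = mean(rewards)
    # Population standard deviation of the group; eps avoids division by zero
    # when every sample in the group receives the same reward.
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Rewards for several images sampled from the same prompt (made-up numbers).
print(group_relative_advantages([0.2, 0.5, 0.9, 0.4]))
```

Samples that score above the group mean get a positive advantage and those below get a negative one, so no separate value network is needed to form a baseline.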
## SDXL training command
```bash
accelerate launch train_diffusion_grpo_sdxl.py \
--pretrained_model_name_or_path=stabilityai/stable-diffusion-xl-base-1.0 \
--pretrained_vae_model_name_or_path=madebyollin/sdxl-vae-fp16-fix \
--output_dir="diffusion-sdxl-dpo" \
--mixed_precision="fp16" \
--dataset_name=kashif/pickascore \
--train_batch_size=8 \
--gradient_accumulation_steps=2 \
--gradient_checkpointing \
--use_8bit_adam \
--rank=8 \
--learning_rate=1e-5 \
--report_to="wandb" \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--max_train_steps=2000 \
--checkpointing_steps=500 \
--run_validation --validation_steps=50 \
--seed="0" \
--push_to_hub
```
@@ -0,0 +1,8 @@
accelerate>=0.16.0
torchvision
transformers>=4.25.1
ftfy
tensorboard
Jinja2
peft
wandb
File diff suppressed because it is too large
@@ -1,6 +1,6 @@
#!/usr/bin/env python
# coding=utf-8
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -1,6 +1,6 @@
#!/usr/bin/env python
# coding=utf-8
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -1,6 +1,6 @@
#!/usr/bin/env python
# coding=utf-8
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -1,6 +1,6 @@
#!/usr/bin/env python
# coding=utf-8
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -1,6 +1,6 @@
#!/usr/bin/env python
# coding=utf-8
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -1,5 +1,5 @@
# coding=utf-8
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -1,6 +1,6 @@
#!/usr/bin/env python
# coding=utf-8
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -1,6 +1,6 @@
#!/usr/bin/env python
# coding=utf-8
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -1,6 +1,6 @@
#!/usr/bin/env python
# coding=utf-8
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -8,7 +8,6 @@ from diffusers.models import PixArtTransformer2DModel
from diffusers.models.attention import BasicTransformerBlock
from diffusers.models.modeling_outputs import Transformer2DModelOutput
from diffusers.models.modeling_utils import ModelMixin
from diffusers.utils.torch_utils import is_torch_version
class PixArtControlNetAdapterBlock(nn.Module):
@@ -151,10 +150,6 @@ class PixArtControlNetTransformerModel(ModelMixin, ConfigMixin):
self.transformer = transformer
self.controlnet = controlnet
def _set_gradient_checkpointing(self, module, value=False):
if hasattr(module, "gradient_checkpointing"):
module.gradient_checkpointing = value
def forward(
self,
hidden_states: torch.Tensor,
@@ -220,18 +215,8 @@ class PixArtControlNetTransformerModel(ModelMixin, ConfigMixin):
print("Gradient checkpointing is not supported for the controlnet transformer model, yet.")
exit(1)
def create_custom_forward(module, return_dict=None):
def custom_forward(*inputs):
if return_dict is not None:
return module(*inputs, return_dict=return_dict)
else:
return module(*inputs)
return custom_forward
ckpt_kwargs: Dict[str, Any] = {"use_reentrant": False} if is_torch_version(">=", "1.11.0") else {}
hidden_states = torch.utils.checkpoint.checkpoint(
create_custom_forward(block),
hidden_states = self._gradient_checkpointing_func(
block,
hidden_states,
attention_mask,
encoder_hidden_states,
@@ -239,7 +224,6 @@ class PixArtControlNetTransformerModel(ModelMixin, ConfigMixin):
timestep,
cross_attention_kwargs,
None,
**ckpt_kwargs,
)
else:
# the control nets are only used for the blocks 1 to self.blocks_num
@@ -1031,10 +1031,11 @@ class PixArtAlphaControlnetPipeline(DiffusionPipeline):
# TODO: this requires sync between CPU and GPU. So try to pass timesteps as tensors if you can
# This would be a good case for the `match` statement (Python 3.10+)
is_mps = latent_model_input.device.type == "mps"
is_npu = latent_model_input.device.type == "npu"
if isinstance(current_timestep, float):
dtype = torch.float32 if is_mps else torch.float64
dtype = torch.float32 if (is_mps or is_npu) else torch.float64
else:
dtype = torch.int32 if is_mps else torch.int64
dtype = torch.int32 if (is_mps or is_npu) else torch.int64
current_timestep = torch.tensor([current_timestep], dtype=dtype, device=latent_model_input.device)
elif len(current_timestep.shape) == 0:
current_timestep = current_timestep[None].to(latent_model_input.device)
@@ -258,10 +258,11 @@ class PromptDiffusionControlNetModel(ControlNetModel):
# TODO: this requires sync between CPU and GPU. So try to pass timesteps as tensors if you can
# This would be a good case for the `match` statement (Python 3.10+)
is_mps = sample.device.type == "mps"
is_npu = sample.device.type == "npu"
if isinstance(timestep, float):
dtype = torch.float32 if is_mps else torch.float64
dtype = torch.float32 if (is_mps or is_npu) else torch.float64
else:
dtype = torch.int32 if is_mps else torch.int64
dtype = torch.int32 if (is_mps or is_npu) else torch.int64
timesteps = torch.tensor([timesteps], dtype=dtype, device=sample.device)
elif len(timesteps.shape) == 0:
timesteps = timesteps[None].to(sample.device)
@@ -0,0 +1,100 @@
# Generating images using Flux and PyTorch/XLA
The `flux_inference` script shows how to generate images with Flux on TPU devices using PyTorch/XLA. It uses the Pallas flash attention kernel for faster generation.
It has been tested on [Trillium](https://cloud.google.com/blog/products/compute/introducing-trillium-6th-gen-tpus) TPU versions. No other TPU types have been tested.
## Create TPU
To create a TPU on Google Cloud, follow [this guide](https://cloud.google.com/tpu/docs/v6e).
## Setup TPU environment
SSH into the VM and install PyTorch and PyTorch/XLA:
```bash
pip install torch~=2.5.0 torch_xla[tpu]~=2.5.0 -f https://storage.googleapis.com/libtpu-releases/index.html -f https://storage.googleapis.com/libtpu-wheels/index.html
pip install torch_xla[pallas] -f https://storage.googleapis.com/jax-releases/jax_nightly_releases.html -f https://storage.googleapis.com/jax-releases/jaxlib_nightly_releases.html
```
Verify that PyTorch and PyTorch/XLA were installed correctly:
```bash
python3 -c "import torch; import torch_xla;"
```
Install the dependencies:
```bash
pip install transformers accelerate sentencepiece structlog
pushd ../../..
pip install .
popd
```
## Run the inference job
### Authenticate
Run the following command to authenticate your token in order to download Flux weights.
```bash
huggingface-cli login
```
Then run:
```bash
python flux_inference.py
```
The script loads the text encoders onto the CPU and the Flux transformer and VAE models onto the TPU. The first time the script runs, compilation takes longer because the cache is still being populated with the compiled programs. On subsequent runs, compilation is much faster, and the later inference passes are the fastest.
On a Trillium v6e-4, you should expect ~9 sec / 4 images or 2.25 sec / image (as devices run generation in parallel):
```bash
WARNING:root:libtpu.so and TPU device found. Setting PJRT_DEVICE=TPU.
Loading checkpoint shards: 100%|███████████████████████████████| 2/2 [00:00<00:00, 7.01it/s]
Loading pipeline components...: 40%|██████████▍ | 2/5 [00:00<00:00, 3.78it/s]You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Loading pipeline components...: 100%|██████████████████████████| 5/5 [00:00<00:00, 6.72it/s]
2025-01-10 00:51:25 [info ] loading flux from black-forest-labs/FLUX.1-dev
2025-01-10 00:51:25 [info ] loading flux from black-forest-labs/FLUX.1-dev
2025-01-10 00:51:26 [info ] loading flux from black-forest-labs/FLUX.1-dev
2025-01-10 00:51:26 [info ] loading flux from black-forest-labs/FLUX.1-dev
Loading pipeline components...: 100%|██████████████████████████| 3/3 [00:00<00:00, 4.29it/s]
Loading pipeline components...: 100%|██████████████████████████| 3/3 [00:00<00:00, 3.26it/s]
Loading pipeline components...: 100%|██████████████████████████| 3/3 [00:00<00:00, 3.27it/s]
Loading pipeline components...: 100%|██████████████████████████| 3/3 [00:00<00:00, 3.25it/s]
2025-01-10 00:51:34 [info ] starting compilation run...
2025-01-10 00:51:35 [info ] starting compilation run...
2025-01-10 00:51:37 [info ] starting compilation run...
2025-01-10 00:51:37 [info ] starting compilation run...
2025-01-10 00:52:52 [info ] compilation took 78.5155531649998 sec.
2025-01-10 00:52:53 [info ] starting inference run...
2025-01-10 00:52:57 [info ] compilation took 79.52986721400157 sec.
2025-01-10 00:52:57 [info ] compilation took 81.91776501700042 sec.
2025-01-10 00:52:57 [info ] compilation took 80.24951512600092 sec.
2025-01-10 00:52:57 [info ] starting inference run...
2025-01-10 00:52:57 [info ] starting inference run...
2025-01-10 00:52:58 [info ] starting inference run...
2025-01-10 00:53:22 [info ] inference time: 25.112665320000815
2025-01-10 00:53:30 [info ] inference time: 7.7019307739992655
2025-01-10 00:53:38 [info ] inference time: 7.693858365000779
2025-01-10 00:53:46 [info ] inference time: 7.690621814001133
2025-01-10 00:53:53 [info ] inference time: 7.679490454000188
2025-01-10 00:54:01 [info ] inference time: 7.68949568500102
2025-01-10 00:54:09 [info ] inference time: 7.686633744000574
2025-01-10 00:54:16 [info ] inference time: 7.696786873999372
2025-01-10 00:54:24 [info ] inference time: 7.691988694999964
2025-01-10 00:54:32 [info ] inference time: 7.700649563999832
2025-01-10 00:54:39 [info ] inference time: 7.684993574001055
2025-01-10 00:54:47 [info ] inference time: 7.68343457499941
2025-01-10 00:54:55 [info ] inference time: 7.667921153999487
2025-01-10 00:55:02 [info ] inference time: 7.683585194001353
2025-01-10 00:55:06 [info ] avg. inference over 15 iterations took 8.61202360273334 sec.
2025-01-10 00:55:07 [info ] avg. inference over 15 iterations took 8.952725123600006 sec.
2025-01-10 00:55:10 [info ] inference time: 7.673799695001435
2025-01-10 00:55:10 [info ] avg. inference over 15 iterations took 8.849190365400379 sec.
2025-01-10 00:55:10 [info ] saved metric information as /tmp/metrics_report.txt
2025-01-10 00:55:12 [info ] avg. inference over 15 iterations took 8.940161458400205 sec.
```
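The per-image figure quoted above follows from dividing the wall-clock time of one pass by the number of devices generating in parallel. A minimal sketch of that arithmetic, assuming each of the 4 processes produces one image per pass (`seconds_per_image` is a hypothetical helper, not part of `flux_inference.py`):

```python
from statistics import mean


def seconds_per_image(batch_seconds: float, num_devices: int, images_per_device: int = 1) -> float:
    """Wall-clock seconds per image when all devices generate in parallel."""
    return batch_seconds / (num_devices * images_per_device)


# Rounded figure quoted in the text: ~9 s per pass across 4 devices.
print(seconds_per_image(9.0, 4))  # 2.25

# Per-process averages copied from the log above (15 iterations each).
logged_averages = [8.61202360273334, 8.952725123600006, 8.849190365400379, 8.940161458400205]
print(round(seconds_per_image(mean(logged_averages), 4), 2))
```

In the log, the first timed iteration takes about 25 s while the rest take about 7.7 s, so the 15-iteration averages overstate the steady-state time per pass.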
@@ -0,0 +1,120 @@
from argparse import ArgumentParser
from pathlib import Path
from time import perf_counter

import structlog
import torch
import torch_xla.core.xla_model as xm
import torch_xla.debug.metrics as met
import torch_xla.debug.profiler as xp
import torch_xla.distributed.xla_multiprocessing as xmp
import torch_xla.runtime as xr

from diffusers import FluxPipeline


logger = structlog.get_logger()
metrics_filepath = "/tmp/metrics_report.txt"


def _main(index, args, text_pipe, ckpt_id):
    cache_path = Path("/tmp/data/compiler_cache_tRiLlium_eXp")
    cache_path.mkdir(parents=True, exist_ok=True)
    xr.initialize_cache(str(cache_path), readonly=False)

    profile_path = Path("/tmp/data/profiler_out_tRiLlium_eXp")
    profile_path.mkdir(parents=True, exist_ok=True)
    profiler_port = 9012
    profile_duration = args.profile_duration
    if args.profile:
        logger.info(f"starting profiler on port {profiler_port}")
        _ = xp.start_server(profiler_port)
    device0 = xm.xla_device()

    logger.info(f"loading flux from {ckpt_id}")
    flux_pipe = FluxPipeline.from_pretrained(
        ckpt_id, text_encoder=None, tokenizer=None, text_encoder_2=None, tokenizer_2=None, torch_dtype=torch.bfloat16
    ).to(device0)
    flux_pipe.transformer.enable_xla_flash_attention(partition_spec=("data", None, None, None), is_flux=True)

    prompt = "photograph of an electronics chip in the shape of a race car with trillium written on its side"
    width = args.width
    height = args.height
    guidance = args.guidance
    n_steps = 4 if args.schnell else 28

    logger.info("starting compilation run...")
    ts = perf_counter()
    with torch.no_grad():
        prompt_embeds, pooled_prompt_embeds, text_ids = text_pipe.encode_prompt(
            prompt=prompt, prompt_2=None, max_sequence_length=512
        )
    prompt_embeds = prompt_embeds.to(device0)
    pooled_prompt_embeds = pooled_prompt_embeds.to(device0)

    image = flux_pipe(
        prompt_embeds=prompt_embeds,
        pooled_prompt_embeds=pooled_prompt_embeds,
        num_inference_steps=28,
        guidance_scale=guidance,
        height=height,
        width=width,
    ).images[0]
    logger.info(f"compilation took {perf_counter() - ts} sec.")
    image.save("/tmp/compile_out.png")

    base_seed = 4096 if args.seed is None else args.seed
    seed_range = 1000
    unique_seed = base_seed + index * seed_range
    xm.set_rng_state(seed=unique_seed, device=device0)
    times = []
    logger.info("starting inference run...")
    for _ in range(args.itters):
        ts = perf_counter()
        with torch.no_grad():
            prompt_embeds, pooled_prompt_embeds, text_ids = text_pipe.encode_prompt(
                prompt=prompt, prompt_2=None, max_sequence_length=512
            )
        prompt_embeds = prompt_embeds.to(device0)
        pooled_prompt_embeds = pooled_prompt_embeds.to(device0)

        if args.profile:
            xp.trace_detached(f"localhost:{profiler_port}", str(profile_path), duration_ms=profile_duration)
        image = flux_pipe(
            prompt_embeds=prompt_embeds,
            pooled_prompt_embeds=pooled_prompt_embeds,
            num_inference_steps=n_steps,
            guidance_scale=guidance,
            height=height,
            width=width,
        ).images[0]
        inference_time = perf_counter() - ts
        if index == 0:
            logger.info(f"inference time: {inference_time}")
        times.append(inference_time)
    logger.info(f"avg. inference over {args.itters} iterations took {sum(times)/len(times)} sec.")
    image.save(f"/tmp/inference_out-{index}.png")
    if index == 0:
        metrics_report = met.metrics_report()
        with open(metrics_filepath, "w+") as fout:
            fout.write(metrics_report)
        logger.info(f"saved metric information as {metrics_filepath}")


if __name__ == "__main__":
    parser = ArgumentParser()
    parser.add_argument("--schnell", action="store_true", help="run flux schnell instead of dev")
    parser.add_argument("--width", type=int, default=1024, help="width of the image to generate")
    parser.add_argument("--height", type=int, default=1024, help="height of the image to generate")
    parser.add_argument("--guidance", type=float, default=3.5, help="guidance strength for dev")
    parser.add_argument("--seed", type=int, default=None, help="seed for inference")
    parser.add_argument("--profile", action="store_true", help="enable profiling")
    parser.add_argument("--profile-duration", type=int, default=10000, help="duration for profiling in msec.")
    parser.add_argument("--itters", type=int, default=15, help="times to run inference and get avg time in sec.")
    args = parser.parse_args()
    if args.schnell:
        ckpt_id = "black-forest-labs/FLUX.1-schnell"
    else:
        ckpt_id = "black-forest-labs/FLUX.1-dev"
    text_pipe = FluxPipeline.from_pretrained(ckpt_id, transformer=None, vae=None, torch_dtype=torch.bfloat16).to("cpu")
    xmp.spawn(_main, args=(args, text_pipe, ckpt_id))
@@ -1,6 +1,6 @@
#!/usr/bin/env python
# coding=utf-8
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -1,6 +1,6 @@
#!/usr/bin/env python
# coding=utf-8
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -1,6 +1,6 @@
#!/usr/bin/env python
# coding=utf-8
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -1,6 +1,6 @@
#!/usr/bin/env python
# coding=utf-8
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -1,6 +1,6 @@
#!/usr/bin/env python
# coding=utf-8
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -1,6 +1,6 @@
#!/usr/bin/env python
# coding=utf-8
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -1,6 +1,6 @@
#!/usr/bin/env python
# coding=utf-8
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -1,6 +1,6 @@
#!/usr/bin/env python
# coding=utf-8
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -1,6 +1,6 @@
#!/usr/bin/env python
# coding=utf-8
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -1,6 +1,6 @@
#!/usr/bin/env python
# coding=utf-8
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -1,4 +1,4 @@
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -1,4 +1,4 @@
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -1,6 +1,6 @@
#!/usr/bin/env python
# coding=utf-8
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.

Some files were not shown because too many files have changed in this diff.