Compare commits

...

58 Commits

Author SHA1 Message Date
Dhruv Nair de9528ebc7 Change checkpoint key used to identify CLIP models in single file checkpoints (#8319)
update
2024-06-04 10:05:58 -10:00
YiYi Xu 77cab27c47 prepare for 0.28.2 2024-06-04 10:04:24 -10:00
sayakpaul 0091f08a1a Release: v0.28.1 2024-06-04 14:25:09 +04:00
Sayak Paul 5f4a6b2bea [Transformer2DModel] Handle norm_type safely while remapping (#8370)
* handle norm_type of transformer2d_model safely.

* log an info when old model class is being returned.

* Apply suggestions from code review

Co-authored-by: Dhruv Nair <dhruv.nair@gmail.com>

* remove extra stuff

---------

Co-authored-by: Dhruv Nair <dhruv.nair@gmail.com>
2024-06-04 14:23:57 +04:00
XCL f0d4153a3c Tencent Hunyuan Team - Updated Doc for HunyuanDiT (#8383)
* add hunyuandit doc

* update hunyuandit doc

* update hunyuandit 2d model

* update toctree.yml for hunyuandit
2024-06-04 14:23:43 +04:00
XCL a83f85d59e Tencent Hunyuan Team: add HunyuanDiT related updates (#8240)
* Hunyuan Team: add HunyuanDiT related updates


---------

Co-authored-by: XCLiu <liuxc1996@gmail.com>
Co-authored-by: yiyixuxu <yixu310@gmail.com>
2024-06-04 14:23:31 +04:00
Anton Obukhov 902d7996ff resolve comflicts 2024-06-04 14:23:15 +04:00
Sayak Paul 137403ff31 [Core] Introduce class variants for Transformer2DModel (#7647)
* init for patches

* finish patched model.

* continuous transformer

* vectorized transformer2d.

* style.

* inits.

* fix-copies.

* introduce DiTTransformer2DModel.

* fixes

* use REMAPPING as suggested by @DN6

* better logging.

* add pixart transformer model.

* inits.

* caption_channels.

* attention masking.

* fix use_additional_conditions.

* remove print.

* debug

* flatten

* fix: assertion for sigma

* handle remapping for modeling_utils

* add tests for dit transformer2d

* quality

* placeholder for pixart tests

* pixart tests

* add _no_split_modules

* add docs.

* check

* check

* check

* check

* fix tests

* fix tests

* move Transformer output to modeling_output

* move errors better and bring back use_additional_conditions attribute.

* add unnecessary things from DiT.

* clean up pixart

* fix remapping

* fix device_map things in pixart2d.

* replace Transformer2DModel with appropriate classes in dit, pixart tests

* empty

* legacy mixin classes./

* use a remapping dict for fetching class names.

* change to specifc model types in the pipeline implementations.

* move _fetch_remapped_cls_from_config to modeling_loading_utils.py

* fix dependency problems.

* add deprecation note.
2024-06-04 14:22:00 +04:00
sayakpaul 7828d4eb00 Release: v0.28.0 2024-05-27 17:24:18 +05:30
Anton Obukhov b3d10d6d65 [Pipeline] Marigold depth and normals estimation (#7847)
* implement marigold depth and normals pipelines in diffusers core

* remove bibtex

* remove deprecations

* remove save_memory argument

* remove validate_vae

* remove config output

* remove batch_size autodetection

* remove presets logic
move default denoising_steps and processing_resolution into the model config
make default ensemble_size 1

* remove no_grad

* add fp16 to the example usage

* implement is_matplotlib_available
use is_matplotlib_available, is_scipy_available for conditional imports in the marigold depth pipeline

* move colormap, visualize_depth, and visualize_normals into export_utils.py

* make the denoising loop more lucid
fix the outputs to always be 4d tensors or lists of pil images
support a 4d input_image case
attempt to support model_cpu_offload_seq
move check_inputs into a separate function
change default batch_size to 1, remove any logic to make it bigger implicitly

* style

* rename denoising_steps into num_inference_steps

* rename input_image into image

* rename input_latent into latents

* remove decode_image
change decode_prediction to use the AutoencoderKL.decode method

* move clean_latent outside of progress_bar

* refactor marigold-reusable image processing bits into MarigoldImageProcessor class

* clean up the usage example docstring

* make ensemble functions members of the pipelines

* add early checks in check_inputs
rename E into ensemble_size in depth ensembling

* fix vae_scale_factor computation

* better compatibility with torch.compile
better variable naming

* move export_depth_to_png to export_utils

* remove encode_prediction

* improve visualize_depth and visualize_normals to accept multi-dimensional data and lists
remove visualization functions from the pipelines
move exporting depth as 16-bit PNGs functionality from the depth pipeline
update example docstrings

* do not shortcut vae.config variables

* change all asserts to raise ValueError

* rename output_prediction_type to output_type

* better variable names
clean up variable deletion code

* better variable names

* pass desc and leave kwargs into the diffusers progress_bar
implement nested progress bar for images and steps loops

* implement scale_invariant and shift_invariant flags in the ensemble_depth function
add scale_invariant and shift_invariant flags readout from the model config
further refactor ensemble_depth
support ensembling without alignment
add ensemble_depth docstring

* fix generator device placement checks

* move encode_empty_text body into the pipeline call

* minor empty text encoding simplifications

* adjust pipelines' class docstrings to explain the added construction arguments

* improve the scipy failure condition
add comments
improve docstrings
change the default use_full_z_range to True

* make input image values range check configurable in the preprocessor
refactor load_image_canonical in preprocessor to reject unknown types and return the image in the expected 4D format of tensor and on right device
support a list of everything as inputs to the pipeline, change type to PipelineImageInput
implement a check that all input list elements have the same dimensions
improve docstrings of pipeline outputs
remove check_input pipeline argument

* remove forgotten print

* add prediction_type model config

* add uncertainty visualization into export utils
fix NaN values in normals uncertainties

* change default of output_uncertainty to False
better handle the case of an attempt to export or visualize none

* fix `output_uncertainty=False`

* remove kwargs
fix check_inputs according to the new inputs of the pipeline

* rename prepare_latent into prepare_latents as in other pipelines
annotate prepare_latents in normals pipeline with "Copied from"
annotate encode_image in normals pipeline with "Copied from"

* move nested-capable `progress_bar` method into the pipelines
revert the original `progress_bar` method in pipeline_utils

* minor message improvement

* fix cpu offloading

* move colormap, visualize_depth, export_depth_to_16bit_png, visualize_normals, visualize_uncertainty to marigold_image_processing.py
update example docstrings

* fix missing comma

* change torch.FloatTensor to torch.Tensor

* fix importing of MarigoldImageProcessor

* fix vae offloading
fix batched image encoding
remove separate encode_image function and use vae.encode instead

* implement marigold's intial tests
relax generator checks in line with other pipelines
implement return_dict __call__ argument in line with other pipelines

* fix num_images computation

* remove MarigoldImageProcessor and outputs from import structure
update tests

* update docstrings

* update init

* update

* style

* fix

* fix

* up

* up

* up

* add simple test

* up

* update expected np input/output to be channel last

* move expand_tensor_or_array into the MarigoldImageProcessor

* rewrite tests to follow conventions - hardcoded slices instead of image artifacts
write more smoke tests

* add basic docs.

* add anton's contribution statement

* remove todos.

* fix assertion values for marigold depth slow tests

* fix assertion values for depth normals.

* remove print

* support AutoencoderTiny in the pipelines

* update documentation page
add Available Pipelines section
add Available Checkpoints section
add warning about num_inference_steps

* fix missing import in docstring
fix wrong value in visualize_depth docstring

* [doc] add marigold to pipelines overview

* [doc] add section "usage examples"

* fix an issue with latents check in the pipelines

* add "Frame-by-frame Video Processing with Consistency" section

* grammarly

* replace tables with images with css-styled images (blindly)

* style

* print

* fix the assertions.

* take from the github runner.

* take the slices from action artifacts

* style.

* update with the slices from the runner.

* remove unnecessary code blocks.

* Revert "[doc] add marigold to pipelines overview"

This reverts commit a505165150afd8dab23c474d1a054ea505a56a5f.

* remove invitation for new modalities

* split out marigold usage examples

* doc cleanup

---------

Co-authored-by: yiyixuxu <yixu310@gmail.com>
Co-authored-by: yiyixuxu <yixu310@gmail,com>
Co-authored-by: sayakpaul <spsayakpaul@gmail.com>
2024-05-27 17:21:49 +05:30
Dhruv Nair b82f9f5666 Add zip package to doc builder image (#8284)
update
2024-05-27 15:50:00 +05:30
Sayak Paul 6a5ba1b719 [Workflows] add a more secure way to run tests from a PR. (#7969)
* add a more secure way to run tests from a PR.

* make pytest more secure.

* address dhruv's comments.

* improve validation check.

* Update .github/workflows/run_tests_from_a_pr.yml

Co-authored-by: Dhruv Nair <dhruv.nair@gmail.com>

---------

Co-authored-by: Dhruv Nair <dhruv.nair@gmail.com>
2024-05-27 13:47:50 +05:30
Dhaivat Bhatt 4d40c9140c Add details about 1-stage implementation in I2VGen-XL docs (#8282)
* Add details about 1-stage implementation

* Add details about 1-stage implementation
2024-05-27 09:56:32 +05:30
Tolga Cangöz 0ab63ff647 Fix CPU Offloading Usage & Typos (#8230)
* Fix typos

* Fix `pipe.enable_model_cpu_offload()` usage

* Fix cpu offloading

* Update numbers
2024-05-24 11:25:29 -07:00
Tolga Cangöz db33af065b Fix a grammatical error in the raise messages (#8272)
Fix grammatical error
2024-05-24 11:15:00 -07:00
Yue Wu 1096f88e2b sampling bug fix in diffusers tutorial "basic_training.md" (#8223)
sampling bug fix in basic_training.md

In the diffusers basic training tutorial, setting the manual seed argument (generator=torch.manual_seed(config.seed)) in the pipeline call inside evaluate() function rewinds the dataloader shuffling, leading to overfitting due to the model seeing same sequence of training examples after every evaluation call. Using generator=torch.Generator(device='cpu').manual_seed(config.seed) avoids this.
2024-05-24 11:14:32 -07:00
Dhruv Nair cef4a51223 Clean up from_single_file docs (#8268)
* update

* update
2024-05-24 17:43:51 +05:30
Lucain edf5ba6a17 Respect resume_download deprecation V2 (#8267)
* Fix resume_downoad FutureWarning

* only resume download
2024-05-24 12:11:03 +02:00
Sayak Paul 9941f1f61b [Chore] run the documentation workflow in a custom container. (#8266)
run the documentation workflow in a custom container.
2024-05-24 15:10:02 +05:30
Yifan Zhou 46a9db0336 [Community Pipeline] FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation (#8239)
* code and doc

* update paper link

* remove redundant codes

* add example video

---------

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
2024-05-24 14:44:20 +05:30
Dhruv Nair 370146e4e0 Use freedesktop_os_release() in diffusers cli for Python >=3.10 (#8235)
* update

* update
2024-05-24 13:30:40 +05:30
Dhruv Nair 5cd45c24bf Create custom container for doc builder (#8263)
* update

* update
2024-05-24 12:53:48 +05:30
Dhruv Nair 67b3fe0aae Fix resize issue in SVD pipeline with VideoProcessor (#8229)
update

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
2024-05-23 11:57:34 +05:30
Dhruv Nair baab065679 Remove unnecessary single file tests for SD Cascade UNet (#7996)
update
2024-05-22 12:29:59 +05:30
BootesVoid 509741aea7 fix: Attribute error in Logger object (logger.warning) (#8183) 2024-05-22 12:29:11 +05:30
Lucain e1df77ee1e Use HF_TOKEN env var in CI (#7993) 2024-05-21 14:58:10 +05:30
Steven Liu fdb1baa05c [docs] VideoProcessor (#7965)
* fix?

* fix?

* fix
2024-05-21 08:18:21 +05:30
Vinh H. Pham 6529ee67ec Make VAE compatible to torch.compile() (#7984)
make VAE compatible to torch.compile()

Co-authored-by: YiYi Xu <yixu310@gmail.com>
2024-05-20 13:43:59 -04:00
Sai-Suraj-27 df2bc5ef28 fix: Fixed few docstrings according to the Google Style Guide (#7717)
Fixed few docstrings according to the Google Style Guide.
2024-05-20 10:26:05 -07:00
Aleksei Zhuravlev a7bf77fc28 Passing cross_attention_kwargs to StableDiffusionInstructPix2PixPipeline (#7961)
* Update pipeline_stable_diffusion_instruct_pix2pix.py

Add `cross_attention_kwargs` to `__call__` method of `StableDiffusionInstructPix2PixPipeline`, which are passed to UNet.

* Update documentation for pipeline_stable_diffusion_instruct_pix2pix.py

* Update docstring

* Update docstring

* Fix typing import
2024-05-20 13:14:34 -04:00
Junsong Chen 0f0defdb65 [docs] add doc for PixArtSigmaPipeline (#7857)
* 1. add doc for PixArtSigmaPipeline;

---------

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Co-authored-by: Guillaume LEGENDRE <glegendre01@gmail.com>
Co-authored-by: Álvaro Somoza <asomoza@users.noreply.github.com>
Co-authored-by: Bagheera <59658056+bghira@users.noreply.github.com>
Co-authored-by: bghira <bghira@users.github.com>
Co-authored-by: Hyoungwon Cho <jhw9811@korea.ac.kr>
Co-authored-by: yiyixuxu <yixu310@gmail.com>
Co-authored-by: Tolga Cangöz <46008593+standardAI@users.noreply.github.com>
Co-authored-by: Philip Pham <phillypham@google.com>
2024-05-20 12:40:57 -04:00
Nikita 19df9f3ec0 Update pipeline_controlnet_inpaint_sd_xl.py (#7983) 2024-05-20 12:24:49 -04:00
Jacob Marks d6ca120987 Fix typo in "attention" (#7977) 2024-05-20 11:54:29 -04:00
Sayak Paul fb7ae0184f [tests] fix Pixart Sigma tests (#7966)
* checking tests

* checking ii.

* remove prints.

* test_pixart_1024

* fix 1024.
2024-05-19 20:56:31 +05:30
Sayak Paul 70f8d4b488 remove unsafe workflow. (#7967) 2024-05-17 13:46:24 +05:30
Álvaro Somoza 6c60e430ee Consistent SDXL Controlnet callback tensor inputs (#7958)
* make _callback_tensor_inputs consistent between sdxl pipelines

* forgot this one

* fix failing test

* fix test_components_function

* fix controlnet inpaint tests
2024-05-16 07:15:10 -10:00
Alphin Jain 1221b28eac Fix AttributeError in train_lcm_distill_lora_sdxl_wds.py (#7923)
Fix conditional teacher model check in train_lcm_distill_lora_sdxl_wds.py

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
2024-05-16 15:49:54 +05:30
Liang Hou 746f603b20 Fix the text tokenizer name in logger warning of PixArt pipelines (#7912)
Fix CLIP to T5 in logger warning
2024-05-15 18:49:29 -10:00
Sai-Suraj-27 2afea72d29 refactor: Refactored code by Merging isinstance calls (#7710)
* Merged isinstance calls to make the code simpler.

* Corrected formatting errors using ruff.

---------

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Co-authored-by: YiYi Xu <yixu310@gmail.com>
2024-05-15 18:33:19 -10:00
Sayak Paul 0f111ab794 [Workflows] add a workflow that can be manually triggered on a PR. (#7942)
* add a workflow that can be manually triggered on a PR.

* remove sudo

* add command

* small fixes.
2024-05-15 17:18:56 +05:30
Guillaume LEGENDRE 4dd7aaa06f move to GH hosted M1 runner (#7949) 2024-05-15 13:47:36 +05:30
Isamu Isozaki d27e996ccd Adding VQGAN Training script (#5483)
* Init commit

* Removed einops

* Added default movq config for training

* Update explanation of prompts

* Fixed inheritance of discriminator and init_tracker

* Fixed incompatible api between muse and here

* Fixed output

* Setup init training

* Basic structure done

* Removed attention for quick tests

* Style fixes

* Fixed vae/vqgan styles

* Removed redefinition of wandb

* Fixed log_validation and tqdm

* Nothing commit

* Added commit loss to lookup_from_codebook

* Update src/diffusers/models/vq_model.py

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

* Adding perliminary README

* Fixed one typo

* Local changes

* Fixed main issues

* Merging

* Update src/diffusers/models/vq_model.py

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

* Testing+Fixed bugs in training script

* Some style fixes

* Added wandb to docs

* Fixed timm test

* get testing suite ready.

* remove return loss

* remove return_loss

* Remove diffs

* Remove diffs

* fix ruff format

---------

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Co-authored-by: Dhruv Nair <dhruv.nair@gmail.com>
2024-05-15 08:47:12 +05:30
Sayak Paul 72780ff5b1 [tests] decorate StableDiffusion21PipelineSingleFileSlowTests with slow. (#7941)
decorate StableDiffusion21PipelineSingleFileSlowTests with slow.
2024-05-14 14:26:21 -10:00
Jingyang Zhang 69fdb8720f [Pipeline] Adding BoxDiff to community examples (#7947)
add boxdiff to community examples
2024-05-14 11:18:29 -10:00
Nikita b2140a895b Fix added_cond_kwargs when using IP-Adapter in StableDiffusionXLControlNetInpaintPipeline (#7924)
Fix `added_cond_kwargs` when using IP-Adapter

Fix error when using IP-Adapter in pipeline and passing `ip_adapter_image_embeds` instead of `ip_adapter_image`

Co-authored-by: YiYi Xu <yixu310@gmail.com>
2024-05-14 10:32:08 -10:00
Sayak Paul e0e8c58f64 [Core] separate the loading utilities in modeling similar to pipelines. (#7943)
separate the loading utilities in modeling similar to pipelines.
2024-05-14 22:33:43 +05:30
Sayak Paul cbea5d1725 update to use hf-workflows for reporting the Docker build statuses (#7938)
update to use hf-workflows for reporting
2024-05-14 09:25:13 +05:30
Tolga Cangöz a1245c2c61 Expansion proposal of diffusers-cli env (#7403)
* Expand `diffusers-cli env`

* SafeTensors -> Safetensors

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

* Move `safetensors_version = "not installed"` to `else`

* Update `safetensors_version` checking

* Add GPU detection for Linux, Mac OS, and Windows

* Add accelerator detection to environment command

* Add is_peft_version to import_utils

* Update env.py

* Add `huggingface_hub` reference

* Add `transformers` reference

* Add reference for `huggingface_hub`

* Fix print statement in env.py for unusual OS

* Up

* Fix platform information in env.py

* up

* Fix import order in env.py

* ruff

* make style

* Fix platform system check in env.py

* Fix run method return type in env.py

* 🤗

* No need f-string

* Remove location info

* Remove accelerate config

* Refactor env.py to remove accelerate config

* feat: Add support for `bitsandbytes` library in environment command

---------

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Co-authored-by: Dhruv Nair <dhruv.nair@gmail.com>
2024-05-14 08:20:24 +05:30
bssrdf cdda94f412 fix VAE loading issue in train_dreambooth (#7632)
* fixed vae loading issue #7619

* rerun make style && make quality

* bring back model_has_vae and add change \ to / in config_file_name on windows os to make match work

* add missing import platform

* bring back import model_info

* make config_file_name OS independent

* switch to using Path.as_posix() to resolve OS dependence

* improve style

---------

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Co-authored-by: bssrdf <bssrdf@gmail.com>
Co-authored-by: Dhruv Nair <dhruv.nair@gmail.com>
2024-05-14 08:19:53 +05:30
dependabot[bot] 5b830aa356 Bump transformers from 4.36.0 to 4.38.0 in /examples/research_projects/realfill (#7635)
Bump transformers in /examples/research_projects/realfill

Bumps [transformers](https://github.com/huggingface/transformers) from 4.36.0 to 4.38.0.
- [Release notes](https://github.com/huggingface/transformers/releases)
- [Commits](https://github.com/huggingface/transformers/compare/v4.36.0...v4.38.0)

---
updated-dependencies:
- dependency-name: transformers
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-05-14 08:17:06 +05:30
Kohei 9e7bae9881 Update requirements.txt for text_to_image (#7892)
Update requirements.txt

If the datasets library is old, it will not read the metadata.jsonl and the label will default to an integer of type int.

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
2024-05-14 08:09:12 +05:30
rebel-kblee b41ce1e090 fix multicontrolnet save_pretrained logic for compatibility (#7821)
fix multicontrolnet save_pretrained logic for compatibility

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
2024-05-13 09:32:06 -10:00
Sayak Paul 95d3748453 [LoRA] Fix LoRA tests (side effects of RGB ordering) part ii (#7932)
* check

* check 2.

* update slices
2024-05-13 09:23:48 -10:00
Fabio Rigano 44aa9e566d fix AnimateDiff creation with a unet loaded with IP Adapter (#7791)
* Fix loading from_pipe

* Fix style

---------

Co-authored-by: Dhruv Nair <dhruv.nair@gmail.com>
2024-05-13 08:15:01 -10:00
Álvaro Somoza fdb05f54ef Official callbacks (#7761) 2024-05-12 17:10:29 -10:00
HelloWorldBeginner 98ba18ba55 Add Ascend NPU support for SDXL. (#7916)
Co-authored-by: mhh001 <mahonghao1@huawei.com>
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
2024-05-12 13:34:23 +02:00
Sayak Paul 5bb38586a9 [Core] fix offload behaviour when device_map is enabled. (#7919)
fix offload behaviour when device_map is enabled.
2024-05-12 13:29:43 +02:00
Sai-Suraj-27 ec9e88139a fix: Fixed a wrong link to supported python versions in contributing.md file (#7638)
* Fixed a wrong link to python versions in contributing.md file.

* Updated the link to a permalink, so that it will permanently point to the specific line.
2024-05-12 13:21:18 +02:00
209 changed files with 14256 additions and 810 deletions
+1 -1
View File
@@ -39,7 +39,7 @@ jobs:
python utils/print_env.py
- name: Diffusers Benchmarking
env:
HUGGING_FACE_HUB_TOKEN: ${{ secrets.DIFFUSERS_BOT_TOKEN }}
HF_TOKEN: ${{ secrets.DIFFUSERS_BOT_TOKEN }}
BASE_PATH: benchmark_outputs
run: |
export TOTAL_GPU_MEMORY=$(python -c "import torch; print(torch.cuda.get_device_properties(0).total_memory / (1024**3))")
+10 -22
View File
@@ -25,17 +25,17 @@ jobs:
steps:
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v1
- name: Check out code
uses: actions/checkout@v3
- name: Find Changed Dockerfiles
id: file_changes
uses: jitterbit/get-changed-files@v1
with:
format: 'space-delimited'
token: ${{ secrets.GITHUB_TOKEN }}
- name: Build Changed Docker Images
run: |
CHANGED_FILES="${{ steps.file_changes.outputs.all }}"
@@ -52,7 +52,7 @@ jobs:
build-and-push-docker-images:
runs-on: [ self-hosted, intel-cpu, 8-cpu, ci ]
if: github.event_name != 'pull_request'
permissions:
contents: read
packages: write
@@ -69,6 +69,7 @@ jobs:
- diffusers-flax-tpu
- diffusers-onnxruntime-cpu
- diffusers-onnxruntime-cuda
- diffusers-doc-builder
steps:
- name: Checkout repository
@@ -90,24 +91,11 @@ jobs:
- name: Post to a Slack channel
id: slack
uses: slackapi/slack-github-action@6c661ce58804a1a20f6dc5fbee7f0381b469e001
uses: huggingface/hf-workflows/.github/actions/post-slack@main
with:
# Slack channel id, channel name, or user id to post message.
# See also: https://api.slack.com/methods/chat.postMessage#channels
channel-id: ${{ env.CI_SLACK_CHANNEL }}
# For posting a rich message using Block Kit
payload: |
{
"text": "${{ matrix.image-name }} Docker Image build result: ${{ job.status }}\n${{ github.event.head_commit.url }}",
"blocks": [
{
"type": "section",
"text": {
"type": "mrkdwn",
"text": "${{ matrix.image-name }} Docker Image build result: ${{ job.status }}\n${{ github.event.head_commit.url }}"
}
}
]
}
env:
SLACK_BOT_TOKEN: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}
slack_channel: ${{ env.CI_SLACK_CHANNEL }}
title: "🤗 Results of the ${{ matrix.image-name }} Docker Image build"
status: ${{ job.status }}
slack_token: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}
+1 -1
View File
@@ -21,7 +21,7 @@ jobs:
package: diffusers
notebook_folder: diffusers_doc
languages: en ko zh ja pt
custom_container: diffusers/diffusers-doc-builder
secrets:
token: ${{ secrets.HUGGINGFACE_PUSH }}
hf_token: ${{ secrets.HF_DOC_BUILD_PUSH }}
@@ -20,3 +20,4 @@ jobs:
install_libgl1: true
package: diffusers
languages: en ko zh ja pt
custom_container: diffusers/diffusers-doc-builder
+7 -7
View File
@@ -81,7 +81,7 @@ jobs:
- name: Nightly PyTorch CUDA checkpoint (pipelines) tests
env:
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
HF_TOKEN: ${{ secrets.HF_TOKEN }}
# https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms
CUBLAS_WORKSPACE_CONFIG: :16:8
run: |
@@ -141,7 +141,7 @@ jobs:
- name: Run nightly PyTorch CUDA tests for non-pipeline modules
if: ${{ matrix.module != 'examples'}}
env:
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
HF_TOKEN: ${{ secrets.HF_TOKEN }}
# https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms
CUBLAS_WORKSPACE_CONFIG: :16:8
run: |
@@ -154,7 +154,7 @@ jobs:
- name: Run nightly example tests with Torch
if: ${{ matrix.module == 'examples' }}
env:
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
HF_TOKEN: ${{ secrets.HF_TOKEN }}
# https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms
CUBLAS_WORKSPACE_CONFIG: :16:8
run: |
@@ -211,7 +211,7 @@ jobs:
- name: Run nightly LoRA tests with PEFT and Torch
env:
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
HF_TOKEN: ${{ secrets.HF_TOKEN }}
# https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms
CUBLAS_WORKSPACE_CONFIG: :16:8
run: |
@@ -269,7 +269,7 @@ jobs:
- name: Run nightly Flax TPU tests
env:
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
HF_TOKEN: ${{ secrets.HF_TOKEN }}
run: |
python -m pytest -n 0 \
-s -v -k "Flax" \
@@ -324,7 +324,7 @@ jobs:
- name: Run nightly ONNXRuntime CUDA tests
env:
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
HF_TOKEN: ${{ secrets.HF_TOKEN }}
run: |
python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile \
-s -v -k "Onnx" \
@@ -390,7 +390,7 @@ jobs:
shell: arch -arch arm64 bash {0}
env:
HF_HOME: /System/Volumes/Data/mnt/cache
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
HF_TOKEN: ${{ secrets.HF_TOKEN }}
run: |
${CONDA_RUN} python -m pytest -n 1 -s -v --make-reports=tests_torch_mps \
--report-log=tests_torch_mps.log \
+1 -1
View File
@@ -156,7 +156,7 @@ jobs:
if: ${{ matrix.config.framework == 'pytorch_examples' }}
run: |
python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH"
python -m uv pip install peft
python -m uv pip install peft timm
python -m pytest -n 4 --max-worker-restart=0 --dist=loadfile \
--make-reports=tests_${{ matrix.config.report }} \
examples
+9 -8
View File
@@ -87,7 +87,7 @@ jobs:
python utils/print_env.py
- name: Slow PyTorch CUDA checkpoint tests on Ubuntu
env:
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
HF_TOKEN: ${{ secrets.HF_TOKEN }}
# https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms
CUBLAS_WORKSPACE_CONFIG: :16:8
run: |
@@ -144,7 +144,7 @@ jobs:
- name: Run slow PyTorch CUDA tests
env:
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
HF_TOKEN: ${{ secrets.HF_TOKEN }}
# https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms
CUBLAS_WORKSPACE_CONFIG: :16:8
run: |
@@ -194,7 +194,7 @@ jobs:
- name: Run slow PEFT CUDA tests
env:
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
HF_TOKEN: ${{ secrets.HF_TOKEN }}
# https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms
CUBLAS_WORKSPACE_CONFIG: :16:8
run: |
@@ -243,7 +243,7 @@ jobs:
- name: Run slow Flax TPU tests
env:
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
HF_TOKEN: ${{ secrets.HF_TOKEN }}
run: |
python -m pytest -n 0 \
-s -v -k "Flax" \
@@ -290,7 +290,7 @@ jobs:
- name: Run slow ONNXRuntime CUDA tests
env:
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
HF_TOKEN: ${{ secrets.HF_TOKEN }}
run: |
python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile \
-s -v -k "Onnx" \
@@ -337,7 +337,7 @@ jobs:
python utils/print_env.py
- name: Run example tests on GPU
env:
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
HF_TOKEN: ${{ secrets.HF_TOKEN }}
run: |
python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile -s -v -k "compile" --make-reports=tests_torch_compile_cuda tests/
- name: Failure short reports
@@ -378,7 +378,7 @@ jobs:
python utils/print_env.py
- name: Run example tests on GPU
env:
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
HF_TOKEN: ${{ secrets.HF_TOKEN }}
run: |
python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile -s -v -k "xformers" --make-reports=tests_torch_xformers_cuda tests/
- name: Failure short reports
@@ -423,9 +423,10 @@ jobs:
- name: Run example tests on GPU
env:
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
HF_TOKEN: ${{ secrets.HF_TOKEN }}
run: |
python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH"
python -m uv pip install timm
python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile -s -v --make-reports=examples_torch_cuda examples/
- name: Failure short reports
+1 -1
View File
@@ -107,7 +107,7 @@ jobs:
if: ${{ matrix.config.framework == 'pytorch_examples' }}
run: |
python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH"
python -m uv pip install peft
python -m uv pip install peft timm
python -m pytest -n 4 --max-worker-restart=0 --dist=loadfile \
--make-reports=tests_${{ matrix.config.report }} \
examples
+2 -2
View File
@@ -23,7 +23,7 @@ concurrency:
jobs:
run_fast_tests_apple_m1:
name: Fast PyTorch MPS tests on MacOS
runs-on: [ self-hosted, apple-m1 ]
runs-on: macos-13-xlarge
steps:
- name: Checkout diffusers
@@ -59,7 +59,7 @@ jobs:
shell: arch -arch arm64 bash {0}
env:
HF_HOME: /System/Volumes/Data/mnt/cache
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
HF_TOKEN: ${{ secrets.HF_TOKEN }}
run: |
${CONDA_RUN} python -m pytest -n 0 -s -v --make-reports=tests_torch_mps tests/
+73
View File
@@ -0,0 +1,73 @@
name: Check running SLOW tests from a PR (only GPU)
on:
workflow_dispatch:
inputs:
docker_image:
default: 'diffusers/diffusers-pytorch-cuda'
description: 'Name of the Docker image'
required: true
branch:
description: 'PR Branch to test on'
required: true
test:
description: 'Tests to run (e.g.: `tests/models`).'
required: true
env:
DIFFUSERS_IS_CI: yes
IS_GITHUB_CI: "1"
HF_HOME: /mnt/cache
OMP_NUM_THREADS: 8
MKL_NUM_THREADS: 8
PYTEST_TIMEOUT: 600
RUN_SLOW: yes
jobs:
run_tests:
name: "Run a test on our runner from a PR"
runs-on: [single-gpu, nvidia-gpu, t4, ci]
container:
image: ${{ github.event.inputs.docker_image }}
options: --gpus 0 --privileged --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/
steps:
- name: Validate test files input
id: validate_test_files
env:
PY_TEST: ${{ github.event.inputs.test }}
run: |
if [[ ! "$PY_TEST" =~ ^tests/ ]]; then
echo "Error: The input string must start with 'tests/'."
exit 1
fi
if [[ ! "$PY_TEST" =~ ^tests/(models|pipelines) ]]; then
echo "Error: The input string must contain either 'models' or 'pipelines' after 'tests/'."
exit 1
fi
if [[ "$PY_TEST" == *";"* ]]; then
echo "Error: The input string must not contain ';'."
exit 1
fi
echo "$PY_TEST"
- name: Checkout PR branch
uses: actions/checkout@v4
with:
ref: ${{ github.event.inputs.branch }}
repository: ${{ github.event.pull_request.head.repo.full_name }}
- name: Install pytest
run: |
python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH"
python -m uv pip install -e [quality,test]
python -m uv pip install peft
- name: Run tests
env:
PY_TEST: ${{ github.event.inputs.test }}
run: |
pytest "$PY_TEST"
+1 -1
View File
@@ -25,6 +25,6 @@ jobs:
- name: Update metadata
env:
HUGGING_FACE_HUB_TOKEN: ${{ secrets.SAYAK_HF_TOKEN }}
HF_TOKEN: ${{ secrets.SAYAK_HF_TOKEN }}
run: |
python utils/update_metadata.py --commit_sha ${{ github.sha }}
+1 -1
View File
@@ -355,7 +355,7 @@ You will need basic `git` proficiency to be able to contribute to
manual. Type `git --help` in a shell and enjoy. If you prefer books, [Pro
Git](https://git-scm.com/book/en/v2) is a very good reference.
Follow these steps to start contributing ([supported Python versions](https://github.com/huggingface/diffusers/blob/main/setup.py#L265)):
Follow these steps to start contributing ([supported Python versions](https://github.com/huggingface/diffusers/blob/42f25d601a910dceadaee6c44345896b4cfa9928/setup.py#L270)):
1. Fork the [repository](https://github.com/huggingface/diffusers) by
clicking on the 'Fork' button on the repository's page. This creates a copy of the code
+2 -2
View File
@@ -77,7 +77,7 @@ Please refer to the [How to use Stable Diffusion in Apple Silicon](https://huggi
## Quickstart
Generating outputs is super easy with 🤗 Diffusers. To generate an image from text, use the `from_pretrained` method to load any pretrained diffusion model (browse the [Hub](https://huggingface.co/models?library=diffusers&sort=downloads) for 22000+ checkpoints):
Generating outputs is super easy with 🤗 Diffusers. To generate an image from text, use the `from_pretrained` method to load any pretrained diffusion model (browse the [Hub](https://huggingface.co/models?library=diffusers&sort=downloads) for 25.000+ checkpoints):
```python
from diffusers import DiffusionPipeline
@@ -219,7 +219,7 @@ Also, say 👋 in our public Discord channel <a href="https://discord.gg/G7tWnz9
- https://github.com/deep-floyd/IF
- https://github.com/bentoml/BentoML
- https://github.com/bmaltais/kohya_ss
- +9000 other amazing GitHub repositories 💪
- +11.000 other amazing GitHub repositories 💪
Thank you for using us ❤️.
+51
View File
@@ -0,0 +1,51 @@
FROM ubuntu:20.04
LABEL maintainer="Hugging Face"
LABEL repository="diffusers"
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get -y update \
&& apt-get install -y software-properties-common \
&& add-apt-repository ppa:deadsnakes/ppa
RUN apt install -y bash \
build-essential \
git \
git-lfs \
curl \
ca-certificates \
libsndfile1-dev \
python3.10 \
python3-pip \
libgl1 \
zip \
python3.10-venv && \
rm -rf /var/lib/apt/lists
# make sure to use venv
RUN python3.10 -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
# pre-install the heavy dependencies (these can later be overridden by the deps from setup.py)
RUN python3.10 -m pip install --no-cache-dir --upgrade pip uv==0.1.11 && \
python3.10 -m uv pip install --no-cache-dir \
torch \
torchvision \
torchaudio \
invisible_watermark \
--extra-index-url https://download.pytorch.org/whl/cpu && \
python3.10 -m uv pip install --no-cache-dir \
accelerate \
datasets \
hf-doc-builder \
huggingface-hub \
Jinja2 \
librosa \
numpy \
scipy \
tensorboard \
transformers \
matplotlib \
setuptools==69.5.1
CMD ["/bin/bash"]
+14
View File
@@ -93,6 +93,8 @@
title: Trajectory Consistency Distillation-LoRA
- local: using-diffusers/svd
title: Stable Video Diffusion
- local: using-diffusers/marigold_usage
title: Marigold Computer Vision
title: Specific pipeline examples
- sections:
- local: training/overview
@@ -232,6 +234,12 @@
title: ConsistencyDecoderVAE
- local: api/models/transformer2d
title: Transformer2D
- local: api/models/pixart_transformer2d
title: PixArtTransformer2D
- local: api/models/dit_transformer2d
title: DiTTransformer2D
- local: api/models/hunyuan_transformer_2d
title: HunyuanDiT2DModel
- local: api/models/transformer_temporal
title: Transformer Temporal
- local: api/models/prior_transformer
@@ -279,6 +287,8 @@
title: DiffEdit
- local: api/pipelines/dit
title: DiT
- local: api/pipelines/hunyuandit
title: Hunyuan-DiT
- local: api/pipelines/i2vgenxl
title: I2VGen-XL
- local: api/pipelines/pix2pix
@@ -295,6 +305,8 @@
title: Latent Diffusion
- local: api/pipelines/ledits_pp
title: LEDITS++
- local: api/pipelines/marigold
title: Marigold
- local: api/pipelines/panorama
title: MultiDiffusion
- local: api/pipelines/musicldm
@@ -305,6 +317,8 @@
title: Personalized Image Animator (PIA)
- local: api/pipelines/pixart
title: PixArt-α
- local: api/pipelines/pixart_sigma
title: PixArt-Σ
- local: api/pipelines/self_attention_guidance
title: Self-Attention Guidance
- local: api/pipelines/semantic_stable_diffusion
+33 -21
View File
@@ -12,9 +12,9 @@ specific language governing permissions and limitations under the License.
# Loading Pipelines and Models via `from_single_file`
The `from_single_file` method allows you to load supported pipelines using a single checkpoint file as opposed to the folder format used by Diffusers. This is useful if you are working with many of the Stable Diffusion Web UI's (such as A1111) that extensively rely on a single file to distribute all the components of a diffusion model.
The `from_single_file` method allows you to load supported pipelines using a single checkpoint file as opposed to Diffusers' multiple folders format. This is useful if you are working with Stable Diffusion Web UI's (such as A1111) that rely on a single file format to distribute all the components of a model.
The `from_single_file` method also supports loading models in their originally distributed format. This means that supported models that have been finetuned with other services can be loaded directly into supported Diffusers model objects and pipelines.
The `from_single_file` method also supports loading models in their originally distributed format. This means that supported models that have been finetuned with other services can be loaded directly into Diffusers model objects and pipelines.
## Pipelines that currently support `from_single_file` loading
@@ -59,7 +59,7 @@ pipe = StableDiffusionXLPipeline.from_single_file(ckpt_path)
## Setting components in a Pipeline using `from_single_file`
Swap components of the pipeline by passing them directly to the `from_single_file` method. e.g If you would like use a different scheduler than the pipeline default.
Set components of a pipeline by passing them directly to the `from_single_file` method. For example, here we are swapping out the pipeline's default scheduler with the `DDIMScheduler`.
```python
from diffusers import StableDiffusionXLPipeline, DDIMScheduler
@@ -71,13 +71,15 @@ pipe = StableDiffusionXLPipeline.from_single_file(ckpt_path, scheduler=scheduler
```
Here we are passing in a ControlNet model to the `StableDiffusionControlNetPipeline`.
```python
from diffusers import StableDiffusionPipeline, ControlNetModel
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
ckpt_path = "https://huggingface.co/runwayml/stable-diffusion-v1-5/blob/main/v1-5-pruned-emaonly.safetensors"
controlnet = ControlNetModel.from_pretrained("https://huggingface.co/runwayml/stable-diffusion-v1-5/blob/main/v1-5-pruned-emaonly.safetensors")
pipe = StableDiffusionPipeline.from_single_file(ckpt_path, controlnet=controlnet)
controlnet = ControlNetModel.from_pretrained("lllyasviel/control_v11p_sd15_canny")
pipe = StableDiffusionControlNetPipeline.from_single_file(ckpt_path, controlnet=controlnet)
```
@@ -93,7 +95,7 @@ model = StableCascadeUNet.from_single_file(ckpt_path)
## Using a Diffusers model repository to configure single file loading
Under the hood, `from_single_file` will try to determine a model repository to use to configure the components of the pipeline. You can also pass in a repository id to the `config` argument of the `from_single_file` method to explicitly set the repository to use.
Under the hood, `from_single_file` will try to automatically determine a model repository to use to configure the components of a pipeline. You can also explicitly set the model repository to configure the pipeline with the `config` argument.
```python
from diffusers import StableDiffusionXLPipeline
@@ -105,9 +107,19 @@ pipe = StableDiffusionXLPipeline.from_single_file(ckpt_path, config=repo_id)
```
In the example above, since we explicitly passed `repo_id="segmind/SSD-1B"` to the `config` argument, it will use this [configuration file](https://huggingface.co/segmind/SSD-1B/blob/main/unet/config.json) from the `unet` subfolder in `"segmind/SSD-1B"` to configure the `unet` component of the pipeline; Similarly, it will use the `config.json` file from `vae` subfolder to configure the `vae` model, `config.json` file from `text_encoder` folder to configure `text_encoder` and so on.
<Tip>
Most of the time you do not need to explicitly set a `config` argument. `from_single_file` will automatically map the checkpoint to the appropriate model repository. However, this option can be useful in cases where model components in the checkpoint might have been changed from what was originally distributed, or in cases where a checkpoint file might not have the necessary metadata to correctly determine the configuration to use for the pipeline.
</Tip>
## Override configuration options when using single file loading
Override the default model or pipeline configuration options when using `from_single_file` by passing in the relevant arguments directly to the `from_single_file` method. Any argument that is supported by the model or pipeline class can be configured in this way:
Override the default model or pipeline configuration options by providing the relevant arguments directly to the `from_single_file` method. Any argument supported by the model or pipeline class can be configured in this way:
### Setting a pipeline configuration option
```python
from diffusers import StableDiffusionXLInstructPix2PixPipeline
@@ -117,6 +129,8 @@ pipe = StableDiffusionXLInstructPix2PixPipeline.from_single_file(ckpt_path, conf
```
### Setting a model configuration option
```python
from diffusers import UNet2DConditionModel
@@ -125,10 +139,6 @@ model = UNet2DConditionModel.from_single_file(ckpt_path, upcast_attention=True)
```
In the example above, since we explicitly passed `repo_id="segmind/SSD-1B"`, it will use this [configuration file](https://huggingface.co/segmind/SSD-1B/blob/main/unet/config.json) from the "unet" subfolder in `"segmind/SSD-1B"` to configure the unet component included in the checkpoint; Similarly, it will use the `config.json` file from `"vae"` subfolder to configure the vae model, `config.json` file from text_encoder folder to configure text_encoder and so on.
Note that most of the time you do not need to explicitly a `config` argument, `from_single_file` will automatically map the checkpoint to a repo id (we will discuss this in more details in next section). However, this can be useful in cases where model components might have been changed from what was originally distributed or in cases where a checkpoint file might not have the necessary metadata to correctly determine the configuration to use for the pipeline.
<Tip>
To learn more about how to load single file weights, see the [Load different Stable Diffusion formats](../../using-diffusers/other-formats) loading guide.
@@ -137,9 +147,11 @@ To learn more about how to load single file weights, see the [Load different Sta
## Working with local files
As of `diffusers>=0.28.0` the `from_single_file` method will attempt to configure a pipeline or model by first inferring the model type from the checkpoint file and then using the model type to determine the appropriate model repo configuration to use from the Hugging Face Hub. For example, any single file checkpoint based on the Stable Diffusion XL base model will use the [`stabilityai/stable-diffusion-xl-base-1.0`](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0) model repo to configure the pipeline.
As of `diffusers>=0.28.0` the `from_single_file` method will attempt to configure a pipeline or model by first inferring the model type from the keys in the checkpoint file. This inferred model type is then used to determine the appropriate model repository on the Hugging Face Hub to configure the model or pipeline.
If you are working in an environment with restricted internet access, it is recommended to download the config files and checkpoints for the model to your preferred directory and pass the local paths to the `pretrained_model_link_or_path` and `config` arguments of the `from_single_file` method.
For example, any single file checkpoint based on the Stable Diffusion XL base model will use the [`stabilityai/stable-diffusion-xl-base-1.0`](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0) model repository to configure the pipeline.
If you are working in an environment with restricted internet access, it is recommended that you download the config files and checkpoints for the model to your preferred directory and pass the local paths to the `pretrained_model_link_or_path` and `config` arguments of the `from_single_file` method.
```python
from huggingface_hub import hf_hub_download, snapshot_download
@@ -211,13 +223,14 @@ pipe = StableDiffusionXLPipeline.from_single_file(my_local_checkpoint_path, conf
```
<Tip>
Disabling symlinking means that the `huggingface_hub` caching mechanism has no way to determine whether a file has already been downloaded to the local directory. This means that the `hf_hub_download` and `snapshot_download` functions will download files to the local directory each time they are executed. If you are disabling symlinking, it is recommended that you separate the model download and loading steps to avoid downloading the same file multiple times.
As of `huggingface_hub>=0.23.0` the `local_dir_use_symlinks` argument isn't necessary for the `hf_hub_download` and `snapshot_download` functions.
</Tip>
## Using the original configuration file of a model
If you would like to configure the parameters of the model components in the pipeline using the orignal YAML configuration file, you can pass a local path or url to the original configuration file to the `original_config` argument of the `from_single_file` method.
If you would like to configure the model components in a pipeline using the orignal YAML configuration file, you can pass a local path or url to the original configuration file via the `original_config` argument.
```python
from diffusers import StableDiffusionXLPipeline
@@ -229,13 +242,12 @@ original_config = "https://raw.githubusercontent.com/Stability-AI/generative-mod
pipe = StableDiffusionXLPipeline.from_single_file(ckpt_path, original_config=original_config)
```
In the example above, the `original_config` file is only used to configure the parameters of the individual model components of the pipeline. For example it will be used to configure parameters such as the `in_channels` of the `vae` model and `unet` model. It is not used to determine the type of component objects in the pipeline.
<Tip>
When using `original_config` with local_files_only=True`, Diffusers will attempt to infer the components based on the type signatures of pipeline class, rather than attempting to fetch the pipeline config from the Hugging Face Hub. This is to prevent backwards breaking changes in existing code that might not be able to connect to the internet to fetch the necessary pipeline config files.
This is not as reliable as providing a path to a local config repo and might lead to errors when configuring the pipeline. To avoid this, please run the pipeline with `local_files_only=False` once to download the appropriate pipeline config files to the local cache.
When using `original_config` with `local_files_only=True`, Diffusers will attempt to infer the components of the pipeline based on the type signatures of pipeline class, rather than attempting to fetch the configuration files from a model repository on the Hugging Face Hub. This is to prevent backward breaking changes in existing code that might not be able to connect to the internet to fetch the necessary configuration files.
This is not as reliable as providing a path to a local model repository using the `config` argument and might lead to errors when configuring the pipeline. To avoid this, please run the pipeline with `local_files_only=False` once to download the appropriate pipeline configuration files to the local cache.
</Tip>
@@ -0,0 +1,19 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# DiTTransformer2D
A Transformer model for image-like data from [DiT](https://huggingface.co/papers/2212.09748).
## DiTTransformer2DModel
[[autodoc]] DiTTransformer2DModel
@@ -0,0 +1,20 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# HunyuanDiT2DModel
A Diffusion Transformer model for 2D data from [Hunyuan-DiT](https://github.com/Tencent/HunyuanDiT).
## HunyuanDiT2DModel
[[autodoc]] HunyuanDiT2DModel
@@ -0,0 +1,19 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# PixArtTransformer2D
A Transformer model for image-like data from [PixArt-Alpha](https://huggingface.co/papers/2310.00426) and [PixArt-Sigma](https://huggingface.co/papers/2403.04692).
## PixArtTransformer2DModel
[[autodoc]] PixArtTransformer2DModel
@@ -0,0 +1,37 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Hunyuan-DiT
![chinese elements understanding](https://github.com/gnobitab/diffusers-hunyuan/assets/1157982/39b99036-c3cb-4f16-bb1a-40ec25eda573)
[Hunyuan-DiT : A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding](https://arxiv.org/abs/2405.08748)] from Tencent Hunyuan.
The abstract from the paper is:
*We present Hunyuan-DiT, a text-to-image diffusion transformer with fine-grained understanding of both English and Chinese. To construct Hunyuan-DiT, we carefully design the transformer structure, text encoder, and positional encoding. We also build from scratch a whole data pipeline to update and evaluate data for iterative model optimization. For fine-grained language understanding, we train a Multimodal Large Language Model to refine the captions of the images. Finally, Hunyuan-DiT can perform multi-turn multimodal dialogue with users, generating and refining images according to the context. Through our holistic human evaluation protocol with more than 50 professional human evaluators, Hunyuan-DiT sets a new state-of-the-art in Chinese-to-image generation compared with other open-source models.*
You can find the original codebase at [Tencent/HunyuanDiT](https://github.com/Tencent/HunyuanDiT) and all the available checkpoints at [Tencent-Hunyuan](https://huggingface.co/Tencent-Hunyuan/HunyuanDiT).
**Highlights**: HunyuanDiT supports Chinese/English-to-image, multi-resolution generation.
HunyuanDiT has the following components:
* It uses a diffusion transformer as the backbone
* It combines two text encoders, a bilingual CLIP and a multilingual T5 encoder
## HunyuanDiTPipeline
[[autodoc]] HunyuanDiTPipeline
- all
- __call__
+1
View File
@@ -47,6 +47,7 @@ Sample output with I2VGenXL:
* Unlike SVD, it additionally accepts text prompts as inputs.
* It can generate higher resolution videos.
* When using the [`DDIMScheduler`] (which is default for this pipeline), less than 50 steps for inference leads to bad results.
* This implementation is 1-stage variant of I2VGenXL. The main figure in the [I2VGen-XL](https://arxiv.org/abs/2311.04145) paper shows a 2-stage variant, however, 1-stage variant works well. See [this discussion](https://github.com/huggingface/diffusers/discussions/7952) for more details.
## I2VGenXLPipeline
[[autodoc]] I2VGenXLPipeline
+76
View File
@@ -0,0 +1,76 @@
<!--Copyright 2024 Marigold authors and The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Marigold Pipelines for Computer Vision Tasks
![marigold](https://marigoldmonodepth.github.io/images/teaser_collage_compressed.jpg)
Marigold was proposed in [Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation](https://huggingface.co/papers/2312.02145), a CVPR 2024 Oral paper by [Bingxin Ke](http://www.kebingxin.com/), [Anton Obukhov](https://www.obukhov.ai/), [Shengyu Huang](https://shengyuh.github.io/), [Nando Metzger](https://nandometzger.github.io/), [Rodrigo Caye Daudt](https://rcdaudt.github.io/), and [Konrad Schindler](https://scholar.google.com/citations?user=FZuNgqIAAAAJ&hl=en).
The idea is to repurpose the rich generative prior of Text-to-Image Latent Diffusion Models (LDMs) for traditional computer vision tasks.
Initially, this idea was explored to fine-tune Stable Diffusion for Monocular Depth Estimation, as shown in the teaser above.
Later,
- [Tianfu Wang](https://tianfwang.github.io/) trained the first Latent Consistency Model (LCM) of Marigold, which unlocked fast single-step inference;
- [Kevin Qu](https://www.linkedin.com/in/kevin-qu-b3417621b/?locale=en_US) extended the approach to Surface Normals Estimation;
- [Anton Obukhov](https://www.obukhov.ai/) contributed the pipelines and documentation into diffusers (enabled and supported by [YiYi Xu](https://yiyixuxu.github.io/) and [Sayak Paul](https://sayak.dev/)).
The abstract from the paper is:
*Monocular depth estimation is a fundamental computer vision task. Recovering 3D depth from a single image is geometrically ill-posed and requires scene understanding, so it is not surprising that the rise of deep learning has led to a breakthrough. The impressive progress of monocular depth estimators has mirrored the growth in model capacity, from relatively modest CNNs to large Transformer architectures. Still, monocular depth estimators tend to struggle when presented with images with unfamiliar content and layout, since their knowledge of the visual world is restricted by the data seen during training, and challenged by zero-shot generalization to new domains. This motivates us to explore whether the extensive priors captured in recent generative diffusion models can enable better, more generalizable depth estimation. We introduce Marigold, a method for affine-invariant monocular depth estimation that is derived from Stable Diffusion and retains its rich prior knowledge. The estimator can be fine-tuned in a couple of days on a single GPU using only synthetic training data. It delivers state-of-the-art performance across a wide range of datasets, including over 20% performance gains in specific cases. Project page: https://marigoldmonodepth.github.io.*
## Available Pipelines
Each pipeline supports one Computer Vision task, which takes an input RGB image as input and produces a *prediction* of the modality of interest, such as a depth map of the input image.
Currently, the following tasks are implemented:
| Pipeline | Predicted Modalities | Demos |
|---------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------|:--------------------------------------------------------------------------------------------------------------------------------------------------:|
| [MarigoldDepthPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/marigold/pipeline_marigold_depth.py) | [Depth](https://en.wikipedia.org/wiki/Depth_map), [Disparity](https://en.wikipedia.org/wiki/Binocular_disparity) | [Fast Demo (LCM)](https://huggingface.co/spaces/prs-eth/marigold-lcm), [Slow Original Demo (DDIM)](https://huggingface.co/spaces/prs-eth/marigold) |
| [MarigoldNormalsPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/marigold/pipeline_marigold_normals.py) | [Surface normals](https://en.wikipedia.org/wiki/Normal_mapping) | [Fast Demo (LCM)](https://huggingface.co/spaces/prs-eth/marigold-normals-lcm) |
## Available Checkpoints
The original checkpoints can be found under the [PRS-ETH](https://huggingface.co/prs-eth/) Hugging Face organization.
<Tip>
Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. Also, to know more about reducing the memory usage of this pipeline, refer to the ["Reduce memory usage"] section [here](../../using-diffusers/svd#reduce-memory-usage).
</Tip>
<Tip warning={true}>
Marigold pipelines were designed and tested only with `DDIMScheduler` and `LCMScheduler`.
Depending on the scheduler, the number of inference steps required to get reliable predictions varies, and there is no universal value that works best across schedulers.
Because of that, the default value of `num_inference_steps` in the `__call__` method of the pipeline is set to `None` (see the API reference).
Unless set explicitly, its value will be taken from the checkpoint configuration `model_index.json`.
This is done to ensure high-quality predictions when calling the pipeline with just the `image` argument.
</Tip>
See also Marigold [usage examples](marigold_usage).
## MarigoldDepthPipeline
[[autodoc]] MarigoldDepthPipeline
- all
- __call__
## MarigoldNormalsPipeline
[[autodoc]] MarigoldNormalsPipeline
- all
- __call__
## MarigoldDepthOutput
[[autodoc]] pipelines.marigold.pipeline_marigold_depth.MarigoldDepthOutput
## MarigoldNormalsOutput
[[autodoc]] pipelines.marigold.pipeline_marigold_normals.MarigoldNormalsOutput
+1 -1
View File
@@ -31,7 +31,7 @@ Some notes about this pipeline:
<Tip>
Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers.md) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading.md#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.
</Tip>
@@ -0,0 +1,151 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# PixArt-Σ
![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/pixart/header_collage_sigma.jpg)
[PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation](https://huggingface.co/papers/2403.04692) is Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li.
The abstract from the paper is:
*In this paper, we introduce PixArt-Σ, a Diffusion Transformer model (DiT) capable of directly generating images at 4K resolution. PixArt-Σ represents a significant advancement over its predecessor, PixArt-α, offering images of markedly higher fidelity and improved alignment with text prompts. A key feature of PixArt-Σ is its training efficiency. Leveraging the foundational pre-training of PixArt-α, it evolves from the weaker baseline to a stronger model via incorporating higher quality data, a process we term “weak-to-strong training”. The advancements in PixArt-Σ are twofold: (1) High-Quality Training Data: PixArt-Σ incorporates superior-quality image data, paired with more precise and detailed image captions. (2) Efficient Token Compression: we propose a novel attention module within the DiT framework that compresses both keys and values, significantly improving efficiency and facilitating ultra-high-resolution image generation. Thanks to these improvements, PixArt-Σ achieves superior image quality and user prompt adherence capabilities with significantly smaller model size (0.6B parameters) than existing text-to-image diffusion models, such as SDXL (2.6B parameters) and SD Cascade (5.1B parameters). Moreover, PixArt-Σ’s capability to generate 4K images supports the creation of high-resolution posters and wallpapers, efficiently bolstering the production of highquality visual content in industries such as film and gaming.*
You can find the original codebase at [PixArt-alpha/PixArt-sigma](https://github.com/PixArt-alpha/PixArt-sigma) and all the available checkpoints at [PixArt-alpha](https://huggingface.co/PixArt-alpha).
Some notes about this pipeline:
* It uses a Transformer backbone (instead of a UNet) for denoising. As such it has a similar architecture as [DiT](https://hf.co/docs/transformers/model_doc/dit).
* It was trained using text conditions computed from T5. This aspect makes the pipeline better at following complex text prompts with intricate details.
* It is good at producing high-resolution images at different aspect ratios. To get the best results, the authors recommend some size brackets which can be found [here](https://github.com/PixArt-alpha/PixArt-sigma/blob/master/diffusion/data/datasets/utils.py).
* It rivals the quality of state-of-the-art text-to-image generation systems (as of this writing) such as PixArt-α, Stable Diffusion XL, Playground V2.0 and DALL-E 3, while being more efficient than them.
* It shows the ability of generating super high resolution images, such as 2048px or even 4K.
* It shows that text-to-image models can grow from a weak model to a stronger one through several improvements (VAEs, datasets, and so on.)
<Tip>
Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.
</Tip>
## Inference with under 8GB GPU VRAM
Run the [`PixArtSigmaPipeline`] with under 8GB GPU VRAM by loading the text encoder in 8-bit precision. Let's walk through a full-fledged example.
First, install the [bitsandbytes](https://github.com/TimDettmers/bitsandbytes) library:
```bash
pip install -U bitsandbytes
```
Then load the text encoder in 8-bit:
```python
from transformers import T5EncoderModel
from diffusers import PixArtSigmaPipeline
import torch
text_encoder = T5EncoderModel.from_pretrained(
"PixArt-alpha/PixArt-Sigma-XL-2-1024-MS",
subfolder="text_encoder",
load_in_8bit=True,
device_map="auto",
)
pipe = PixArtSigmaPipeline.from_pretrained(
"PixArt-alpha/PixArt-Sigma-XL-2-1024-MS",
text_encoder=text_encoder,
transformer=None,
device_map="balanced"
)
```
Now, use the `pipe` to encode a prompt:
```python
with torch.no_grad():
prompt = "cute cat"
prompt_embeds, prompt_attention_mask, negative_embeds, negative_prompt_attention_mask = pipe.encode_prompt(prompt)
```
Since text embeddings have been computed, remove the `text_encoder` and `pipe` from the memory, and free up som GPU VRAM:
```python
import gc
def flush():
gc.collect()
torch.cuda.empty_cache()
del text_encoder
del pipe
flush()
```
Then compute the latents with the prompt embeddings as inputs:
```python
pipe = PixArtSigmaPipeline.from_pretrained(
"PixArt-alpha/PixArt-Sigma-XL-2-1024-MS",
text_encoder=None,
torch_dtype=torch.float16,
).to("cuda")
latents = pipe(
negative_prompt=None,
prompt_embeds=prompt_embeds,
negative_prompt_embeds=negative_embeds,
prompt_attention_mask=prompt_attention_mask,
negative_prompt_attention_mask=negative_prompt_attention_mask,
num_images_per_prompt=1,
output_type="latent",
).images
del pipe.transformer
flush()
```
<Tip>
Notice that while initializing `pipe`, you're setting `text_encoder` to `None` so that it's not loaded.
</Tip>
Once the latents are computed, pass it off to the VAE to decode into a real image:
```python
with torch.no_grad():
image = pipe.vae.decode(latents / pipe.vae.config.scaling_factor, return_dict=False)[0]
image = pipe.image_processor.postprocess(image, output_type="pil")[0]
image.save("cat.png")
```
By deleting components you aren't using and flushing the GPU VRAM, you should be able to run [`PixArtSigmaPipeline`] with under 8GB GPU VRAM.
![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/pixart/8bits_cat.png)
If you want a report of your memory-usage, run this [script](https://gist.github.com/sayakpaul/3ae0f847001d342af27018a96f467e4e).
<Tip warning={true}>
Text embeddings computed in 8-bit can impact the quality of the generated images because of the information loss in the representation space caused by the reduced precision. It's recommended to compare the outputs with and without 8-bit.
</Tip>
While loading the `text_encoder`, you set `load_in_8bit` to `True`. You could also specify `load_in_4bit` to bring your memory requirements down even further to under 7GB.
## PixArtSigmaPipeline
[[autodoc]] PixArtSigmaPipeline
- all
- __call__
+7 -1
View File
@@ -12,4 +12,10 @@ specific language governing permissions and limitations under the License.
# Video Processor
The `VideoProcessor` provides a unified API for video pipelines to prepare inputs for VAE encoding and post-processing outputs once they're decoded. The class inherits [`VaeImageProcessor`] so it includes transformations such as resizing, normalization, and conversion between PIL Image, PyTorch, and NumPy arrays.
The [`VideoProcessor`] provides a unified API for video pipelines to prepare inputs for VAE encoding and post-processing outputs once they're decoded. The class inherits [`VaeImageProcessor`] so it includes transformations such as resizing, normalization, and conversion between PIL Image, PyTorch, and NumPy arrays.
## VideoProcessor
[[autodoc]] video_processor.VideoProcessor.preprocess_video
[[autodoc]] video_processor.VideoProcessor.postprocess_video
+1 -1
View File
@@ -12,7 +12,7 @@ specific language governing permissions and limitations under the License.
# Speed up inference
There are several ways to optimize Diffusers for inference speed, such as reducing the computational burden by lowering the data precision or using a lightweight distilled model. There are also memory-efficient attention implementations, [xFormers](xformers) and [scaled dot product attetntion](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html) in PyTorch 2.0, that reduce memory usage which also indirectly speeds up inference. Different speed optimizations can be stacked together to get the fastest inference times.
There are several ways to optimize Diffusers for inference speed, such as reducing the computational burden by lowering the data precision or using a lightweight distilled model. There are also memory-efficient attention implementations, [xFormers](xformers) and [scaled dot product attention](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html) in PyTorch 2.0, that reduce memory usage which also indirectly speeds up inference. Different speed optimizations can be stacked together to get the fastest inference times.
> [!TIP]
> Optimizing for inference speed or reduced memory usage can lead to improved performance in the other category, so you should try to optimize for both whenever you can. This guide focuses on inference speed, but you can learn more about lowering memory usage in the [Reduce memory usage](memory) guide.
+12 -12
View File
@@ -6,7 +6,7 @@ Before you begin, make sure you install T-GATE.
```bash
pip install tgate
pip install -U pytorch diffusers transformers accelerate DeepCache
pip install -U torch diffusers transformers accelerate DeepCache
```
@@ -46,12 +46,12 @@ pipe = TgatePixArtLoader(
image = pipe.tgate(
"An alpaca made of colorful building blocks, cyberpunk.",
gate_step=gate_step,
gate_step=gate_step,
num_inference_steps=inference_step,
).images[0]
```
</hfoption>
<hfoption id="Stable Diffusion XL">
<hfoption id="Stable Diffusion XL">
Accelerate `StableDiffusionXLPipeline` with T-GATE:
@@ -78,9 +78,9 @@ pipe = TgateSDXLLoader(
).to("cuda")
image = pipe.tgate(
"Astronaut in a jungle, cold color palette, muted colors, detailed, 8k.",
gate_step=gate_step,
num_inference_steps=inference_step
"Astronaut in a jungle, cold color palette, muted colors, detailed, 8k.",
gate_step=gate_step,
num_inference_steps=inference_step
).images[0]
```
</hfoption>
@@ -111,9 +111,9 @@ pipe = TgateSDXLDeepCacheLoader(
).to("cuda")
image = pipe.tgate(
"Astronaut in a jungle, cold color palette, muted colors, detailed, 8k.",
gate_step=gate_step,
num_inference_steps=inference_step
"Astronaut in a jungle, cold color palette, muted colors, detailed, 8k.",
gate_step=gate_step,
num_inference_steps=inference_step
).images[0]
```
</hfoption>
@@ -151,9 +151,9 @@ pipe = TgateSDXLLoader(
).to("cuda")
image = pipe.tgate(
"Astronaut in a jungle, cold color palette, muted colors, detailed, 8k.",
gate_step=gate_step,
num_inference_steps=inference_step
"Astronaut in a jungle, cold color palette, muted colors, detailed, 8k.",
gate_step=gate_step,
num_inference_steps=inference_step
).images[0]
```
</hfoption>
+1 -1
View File
@@ -260,7 +260,7 @@ Then, you'll need a way to evaluate the model. For evaluation, you can use the [
... # The default pipeline output type is `List[PIL.Image]`
... images = pipeline(
... batch_size=config.eval_batch_size,
... generator=torch.manual_seed(config.seed),
... generator=torch.Generator(device='cpu').manual_seed(config.seed), # Use a separate torch generator to avoid rewinding the random state of the main training loop
... ).images
... # Make a grid out of the images
+64 -3
View File
@@ -19,13 +19,74 @@ The denoising loop of a pipeline can be modified with custom defined functions u
This guide will demonstrate how callbacks work by a few features you can implement with them.
## Official callbacks
We provide a list of callbacks you can plug into an existing pipeline and modify the denoising loop. This is the current list of official callbacks:
- `SDCFGCutoffCallback`: Disables the CFG after a certain number of steps for all SD 1.5 pipelines, including text-to-image, image-to-image, inpaint, and controlnet.
- `SDXLCFGCutoffCallback`: Disables the CFG after a certain number of steps for all SDXL pipelines, including text-to-image, image-to-image, inpaint, and controlnet.
- `IPAdapterScaleCutoffCallback`: Disables the IP Adapter after a certain number of steps for all pipelines supporting IP-Adapter.
> [!TIP]
> If you want to add a new official callback, feel free to open a [feature request](https://github.com/huggingface/diffusers/issues/new/choose) or [submit a PR](https://huggingface.co/docs/diffusers/main/en/conceptual/contribution#how-to-open-a-pr).
To set up a callback, you need to specify the number of denoising steps after which the callback comes into effect. You can do so by using either one of these two arguments
- `cutoff_step_ratio`: Float number with the ratio of the steps.
- `cutoff_step_index`: Integer number with the exact number of the step.
```python
import torch
from diffusers import DPMSolverMultistepScheduler, StableDiffusionXLPipeline
from diffusers.callbacks import SDXLCFGCutoffCallback
callback = SDXLCFGCutoffCallback(cutoff_step_ratio=0.4)
# can also be used with cutoff_step_index
# callback = SDXLCFGCutoffCallback(cutoff_step_ratio=None, cutoff_step_index=10)
pipeline = StableDiffusionXLPipeline.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0",
torch_dtype=torch.float16,
variant="fp16",
).to("cuda")
pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config, use_karras_sigmas=True)
prompt = "a sports car at the road, best quality, high quality, high detail, 8k resolution"
generator = torch.Generator(device="cpu").manual_seed(2628670641)
out = pipeline(
prompt=prompt,
negative_prompt="",
guidance_scale=6.5,
num_inference_steps=25,
generator=generator,
callback_on_step_end=callback,
)
out.images[0].save("official_callback.png")
```
<div class="flex gap-4">
<div>
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/without_cfg_callback.png" alt="generated image of a sports car at the road" />
<figcaption class="mt-2 text-center text-sm text-gray-500">without SDXLCFGCutoffCallback</figcaption>
</div>
<div>
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/with_cfg_callback.png" alt="generated image of a a sports car at the road with cfg callback" />
<figcaption class="mt-2 text-center text-sm text-gray-500">with SDXLCFGCutoffCallback</figcaption>
</div>
</div>
## Dynamic classifier-free guidance
Dynamic classifier-free guidance (CFG) is a feature that allows you to disable CFG after a certain number of inference steps which can help you save compute with minimal cost to performance. The callback function for this should have the following arguments:
* `pipeline` (or the pipeline instance) provides access to important properties such as `num_timesteps` and `guidance_scale`. You can modify these properties by updating the underlying attributes. For this example, you'll disable CFG by setting `pipeline._guidance_scale=0.0`.
* `step_index` and `timestep` tell you where you are in the denoising loop. Use `step_index` to turn off CFG after reaching 40% of `num_timesteps`.
* `callback_kwargs` is a dict that contains tensor variables you can modify during the denoising loop. It only includes variables specified in the `callback_on_step_end_tensor_inputs` argument, which is passed to the pipeline's `__call__` method. Different pipelines may use different sets of variables, so please check a pipeline's `_callback_tensor_inputs` attribute for the list of variables you can modify. Some common variables include `latents` and `prompt_embeds`. For this function, change the batch size of `prompt_embeds` after setting `guidance_scale=0.0` in order for it to work properly.
- `pipeline` (or the pipeline instance) provides access to important properties such as `num_timesteps` and `guidance_scale`. You can modify these properties by updating the underlying attributes. For this example, you'll disable CFG by setting `pipeline._guidance_scale=0.0`.
- `step_index` and `timestep` tell you where you are in the denoising loop. Use `step_index` to turn off CFG after reaching 40% of `num_timesteps`.
- `callback_kwargs` is a dict that contains tensor variables you can modify during the denoising loop. It only includes variables specified in the `callback_on_step_end_tensor_inputs` argument, which is passed to the pipeline's `__call__` method. Different pipelines may use different sets of variables, so please check a pipeline's `_callback_tensor_inputs` attribute for the list of variables you can modify. Some common variables include `latents` and `prompt_embeds`. For this function, change the batch size of `prompt_embeds` after setting `guidance_scale=0.0` in order for it to work properly.
Your callback function should look something like this:
@@ -78,7 +78,7 @@ image = pipe(
prompt=prompt,
num_inference_steps=4,
guidance_scale=0,
eta=0.3,
eta=0.3,
generator=torch.Generator(device=device).manual_seed(0),
).images[0]
```
@@ -156,14 +156,14 @@ image = pipe(
prompt=prompt,
num_inference_steps=8,
guidance_scale=0,
eta=0.3,
eta=0.3,
generator=torch.Generator(device=device).manual_seed(0),
).images[0]
```
![](https://github.com/jabir-zheng/TCD/raw/main/assets/animagine_xl.png)
TCD-LoRA also supports other LoRAs trained on different styles. For example, let's load the [TheLastBen/Papercut_SDXL](https://huggingface.co/TheLastBen/Papercut_SDXL) LoRA and fuse it with the TCD-LoRA with the [`~loaders.UNet2DConditionLoadersMixin.set_adapters`] method.
TCD-LoRA also supports other LoRAs trained on different styles. For example, let's load the [TheLastBen/Papercut_SDXL](https://huggingface.co/TheLastBen/Papercut_SDXL) LoRA and fuse it with the TCD-LoRA with the [`~loaders.UNet2DConditionLoadersMixin.set_adapters`] method.
> [!TIP]
> Check out the [Merge LoRAs](merge_loras) guide to learn more about efficient merging methods.
@@ -171,7 +171,7 @@ TCD-LoRA also supports other LoRAs trained on different styles. For example, let
```python
import torch
from diffusers import StableDiffusionXLPipeline
from scheduling_tcd import TCDScheduler
from scheduling_tcd import TCDScheduler
device = "cuda"
base_model_id = "stabilityai/stable-diffusion-xl-base-1.0"
@@ -191,7 +191,7 @@ image = pipe(
prompt=prompt,
num_inference_steps=4,
guidance_scale=0,
eta=0.3,
eta=0.3,
generator=torch.Generator(device=device).manual_seed(0),
).images[0]
```
@@ -215,7 +215,7 @@ from PIL import Image
from transformers import DPTFeatureExtractor, DPTForDepthEstimation
from diffusers import ControlNetModel, StableDiffusionXLControlNetPipeline
from diffusers.utils import load_image, make_image_grid
from scheduling_tcd import TCDScheduler
from scheduling_tcd import TCDScheduler
device = "cuda"
depth_estimator = DPTForDepthEstimation.from_pretrained("Intel/dpt-hybrid-midas").to(device)
@@ -249,13 +249,13 @@ controlnet = ControlNetModel.from_pretrained(
controlnet_id,
torch_dtype=torch.float16,
variant="fp16",
).to(device)
)
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
base_model_id,
controlnet=controlnet,
torch_dtype=torch.float16,
variant="fp16",
).to(device)
)
pipe.enable_model_cpu_offload()
pipe.scheduler = TCDScheduler.from_config(pipe.scheduler.config)
@@ -271,9 +271,9 @@ depth_image = get_depth_map(image)
controlnet_conditioning_scale = 0.5 # recommended for good generalization
image = pipe(
prompt,
image=depth_image,
num_inference_steps=4,
prompt,
image=depth_image,
num_inference_steps=4,
guidance_scale=0,
eta=0.3,
controlnet_conditioning_scale=controlnet_conditioning_scale,
@@ -290,7 +290,7 @@ grid_image = make_image_grid([depth_image, image], rows=1, cols=2)
import torch
from diffusers import ControlNetModel, StableDiffusionXLControlNetPipeline
from diffusers.utils import load_image, make_image_grid
from scheduling_tcd import TCDScheduler
from scheduling_tcd import TCDScheduler
device = "cuda"
base_model_id = "stabilityai/stable-diffusion-xl-base-1.0"
@@ -301,13 +301,13 @@ controlnet = ControlNetModel.from_pretrained(
controlnet_id,
torch_dtype=torch.float16,
variant="fp16",
).to(device)
)
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
base_model_id,
controlnet=controlnet,
torch_dtype=torch.float16,
variant="fp16",
).to(device)
)
pipe.enable_model_cpu_offload()
pipe.scheduler = TCDScheduler.from_config(pipe.scheduler.config)
@@ -322,9 +322,9 @@ canny_image = load_image("https://huggingface.co/datasets/hf-internal-testing/di
controlnet_conditioning_scale = 0.5 # recommended for good generalization
image = pipe(
prompt,
image=canny_image,
num_inference_steps=4,
prompt,
image=canny_image,
num_inference_steps=4,
guidance_scale=0,
eta=0.3,
controlnet_conditioning_scale=controlnet_conditioning_scale,
@@ -336,7 +336,7 @@ grid_image = make_image_grid([canny_image, image], rows=1, cols=2)
![](https://github.com/jabir-zheng/TCD/raw/main/assets/controlnet_canny_tcd.png)
<Tip>
The inference parameters in this example might not work for all examples, so we recommend you to try different values for `num_inference_steps`, `guidance_scale`, `controlnet_conditioning_scale` and `cross_attention_kwargs` parameters and choose the best one.
The inference parameters in this example might not work for all examples, so we recommend you to try different values for `num_inference_steps`, `guidance_scale`, `controlnet_conditioning_scale` and `cross_attention_kwargs` parameters and choose the best one.
</Tip>
</hfoption>
@@ -350,7 +350,7 @@ from diffusers import StableDiffusionXLPipeline
from diffusers.utils import load_image, make_image_grid
from ip_adapter import IPAdapterXL
from scheduling_tcd import TCDScheduler
from scheduling_tcd import TCDScheduler
device = "cuda"
base_model_path = "stabilityai/stable-diffusion-xl-base-1.0"
@@ -359,8 +359,8 @@ ip_ckpt = "sdxl_models/ip-adapter_sdxl.bin"
tcd_lora_id = "h1t/TCD-SDXL-LoRA"
pipe = StableDiffusionXLPipeline.from_pretrained(
base_model_path,
torch_dtype=torch.float16,
base_model_path,
torch_dtype=torch.float16,
variant="fp16"
)
pipe.scheduler = TCDScheduler.from_config(pipe.scheduler.config)
@@ -375,13 +375,13 @@ ref_image = load_image("https://raw.githubusercontent.com/tencent-ailab/IP-Adapt
prompt = "best quality, high quality, wearing sunglasses"
image = ip_model.generate(
pil_image=ref_image,
pil_image=ref_image,
prompt=prompt,
scale=0.5,
num_samples=1,
num_inference_steps=4,
num_samples=1,
num_inference_steps=4,
guidance_scale=0,
eta=0.3,
eta=0.3,
seed=0,
)[0]
+4 -4
View File
@@ -230,7 +230,7 @@ from diffusers.utils import load_image, make_image_grid
pipeline = AutoPipelineForInpainting.from_pretrained(
"runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, variant="fp16"
).to("cuda")
)
pipeline.enable_model_cpu_offload()
# remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed
pipeline.enable_xformers_memory_efficient_attention()
@@ -255,7 +255,7 @@ from diffusers.utils import load_image, make_image_grid
pipeline = AutoPipelineForInpainting.from_pretrained(
"runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16, variant="fp16"
).to("cuda")
)
pipeline.enable_model_cpu_offload()
# remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed
pipeline.enable_xformers_memory_efficient_attention()
@@ -296,7 +296,7 @@ from diffusers.utils import load_image, make_image_grid
pipeline = AutoPipelineForInpainting.from_pretrained(
"runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, variant="fp16"
).to("cuda")
)
pipeline.enable_model_cpu_offload()
# remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed
pipeline.enable_xformers_memory_efficient_attention()
@@ -319,7 +319,7 @@ from diffusers.utils import load_image, make_image_grid
pipeline = AutoPipelineForInpainting.from_pretrained(
"runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16, variant="fp16"
).to("cuda")
)
pipeline.enable_model_cpu_offload()
# remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed
pipeline.enable_xformers_memory_efficient_attention()
@@ -0,0 +1,466 @@
<!--Copyright 2024 Marigold authors and The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Marigold Pipelines for Computer Vision Tasks
[Marigold](../api/pipelines/marigold) is a novel diffusion-based dense prediction approach, and a set of pipelines for various computer vision tasks, such as monocular depth estimation.
This guide will show you how to use Marigold to obtain fast and high-quality predictions for images and videos.
Each pipeline supports one Computer Vision task, which takes an input RGB image as input and produces a *prediction* of the modality of interest, such as a depth map of the input image.
Currently, the following tasks are implemented:
| Pipeline | Predicted Modalities | Demos |
|---------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------|:--------------------------------------------------------------------------------------------------------------------------------------------------:|
| [MarigoldDepthPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/marigold/pipeline_marigold_depth.py) | [Depth](https://en.wikipedia.org/wiki/Depth_map), [Disparity](https://en.wikipedia.org/wiki/Binocular_disparity) | [Fast Demo (LCM)](https://huggingface.co/spaces/prs-eth/marigold-lcm), [Slow Original Demo (DDIM)](https://huggingface.co/spaces/prs-eth/marigold) |
| [MarigoldNormalsPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/marigold/pipeline_marigold_normals.py) | [Surface normals](https://en.wikipedia.org/wiki/Normal_mapping) | [Fast Demo (LCM)](https://huggingface.co/spaces/prs-eth/marigold-normals-lcm) |
The original checkpoints can be found under the [PRS-ETH](https://huggingface.co/prs-eth/) Hugging Face organization.
These checkpoints are meant to work with diffusers pipelines and the [original codebase](https://github.com/prs-eth/marigold).
The original code can also be used to train new checkpoints.
| Checkpoint | Modality | Comment |
|-----------------------------------------------------------------------------------------------|----------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| [prs-eth/marigold-v1-0](https://huggingface.co/prs-eth/marigold-v1-0) | Depth | The first Marigold Depth checkpoint, which predicts *affine-invariant depth* maps. The performance of this checkpoint in benchmarks was studied in the original [paper](https://huggingface.co/papers/2312.02145). Designed to be used with the `DDIMScheduler` at inference, it requires at least 10 steps to get reliable predictions. Affine-invariant depth prediction has a range of values in each pixel between 0 (near plane) and 1 (far plane); both planes are chosen by the model as part of the inference process. See the `MarigoldImageProcessor` reference for visualization utilities. |
| [prs-eth/marigold-depth-lcm-v1-0](https://huggingface.co/prs-eth/marigold-depth-lcm-v1-0) | Depth | The fast Marigold Depth checkpoint, fine-tuned from `prs-eth/marigold-v1-0`. Designed to be used with the `LCMScheduler` at inference, it requires as little as 1 step to get reliable predictions. The prediction reliability saturates at 4 steps and declines after that. |
| [prs-eth/marigold-normals-v0-1](https://huggingface.co/prs-eth/marigold-normals-v0-1) | Normals | A preview checkpoint for the Marigold Normals pipeline. Designed to be used with the `DDIMScheduler` at inference, it requires at least 10 steps to get reliable predictions. The surface normals predictions are unit-length 3D vectors with values in the range from -1 to 1. *This checkpoint will be phased out after the release of `v1-0` version.* |
| [prs-eth/marigold-normals-lcm-v0-1](https://huggingface.co/prs-eth/marigold-normals-lcm-v0-1) | Normals | The fast Marigold Normals checkpoint, fine-tuned from `prs-eth/marigold-normals-v0-1`. Designed to be used with the `LCMScheduler` at inference, it requires as little as 1 step to get reliable predictions. The prediction reliability saturates at 4 steps and declines after that. *This checkpoint will be phased out after the release of `v1-0` version.* |
The examples below are mostly given for depth prediction, but they can be universally applied with other supported modalities.
We showcase the predictions using the same input image of Albert Einstein generated by Midjourney.
This makes it easier to compare visualizations of the predictions across various modalities and checkpoints.
<div class="flex gap-4" style="justify-content: center; width: 100%;">
<div style="flex: 1 1 50%; max-width: 50%;">
<img class="rounded-xl" src="https://marigoldmonodepth.github.io/images/einstein.jpg"/>
<figcaption class="mt-1 text-center text-sm text-gray-500">
Example input image for all Marigold pipelines
</figcaption>
</div>
</div>
### Depth Prediction Quick Start
To get the first depth prediction, load `prs-eth/marigold-depth-lcm-v1-0` checkpoint into `MarigoldDepthPipeline` pipeline, put the image through the pipeline, and save the predictions:
```python
import diffusers
import torch
pipe = diffusers.MarigoldDepthPipeline.from_pretrained(
"prs-eth/marigold-depth-lcm-v1-0", variant="fp16", torch_dtype=torch.float16
).to("cuda")
image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")
depth = pipe(image)
vis = pipe.image_processor.visualize_depth(depth.prediction)
vis[0].save("einstein_depth.png")
depth_16bit = pipe.image_processor.export_depth_to_16bit_png(depth.prediction)
depth_16bit[0].save("einstein_depth_16bit.png")
```
The visualization function for depth [`~pipelines.marigold.marigold_image_processing.MarigoldImageProcessor.visualize_depth`] applies one of [matplotlib's colormaps](https://matplotlib.org/stable/users/explain/colors/colormaps.html) (`Spectral` by default) to map the predicted pixel values from a single-channel `[0, 1]` depth range into an RGB image.
With the `Spectral` colormap, pixels with near depth are painted red, and far pixels are assigned blue color.
The 16-bit PNG file stores the single channel values mapped linearly from the `[0, 1]` range into `[0, 65535]`.
Below are the raw and the visualized predictions; as can be seen, dark areas (mustache) are easier to distinguish in the visualization:
<div class="flex gap-4">
<div style="flex: 1 1 50%; max-width: 50%;">
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/marigold/marigold_einstein_lcm_depth_16bit.png"/>
<figcaption class="mt-1 text-center text-sm text-gray-500">
Predicted depth (16-bit PNG)
</figcaption>
</div>
<div style="flex: 1 1 50%; max-width: 50%;">
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/marigold/marigold_einstein_lcm_depth.png"/>
<figcaption class="mt-1 text-center text-sm text-gray-500">
Predicted depth visualization (Spectral)
</figcaption>
</div>
</div>
### Surface Normals Prediction Quick Start
Load `prs-eth/marigold-normals-lcm-v0-1` checkpoint into `MarigoldNormalsPipeline` pipeline, put the image through the pipeline, and save the predictions:
```python
import diffusers
import torch
pipe = diffusers.MarigoldNormalsPipeline.from_pretrained(
"prs-eth/marigold-normals-lcm-v0-1", variant="fp16", torch_dtype=torch.float16
).to("cuda")
image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")
normals = pipe(image)
vis = pipe.image_processor.visualize_normals(normals.prediction)
vis[0].save("einstein_normals.png")
```
The visualization function for normals [`~pipelines.marigold.marigold_image_processing.MarigoldImageProcessor.visualize_normals`] maps the three-dimensional prediction with pixel values in the range `[-1, 1]` into an RGB image.
The visualization function supports flipping surface normals axes to make the visualization compatible with other choices of the frame of reference.
Conceptually, each pixel is painted according to the surface normal vector in the frame of reference, where `X` axis points right, `Y` axis points up, and `Z` axis points at the viewer.
Below is the visualized prediction:
<div class="flex gap-4" style="justify-content: center; width: 100%;">
<div style="flex: 1 1 50%; max-width: 50%;">
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/marigold/marigold_einstein_lcm_normals.png"/>
<figcaption class="mt-1 text-center text-sm text-gray-500">
Predicted surface normals visualization
</figcaption>
</div>
</div>
In this example, the nose tip almost certainly has a point on the surface, in which the surface normal vector points straight at the viewer, meaning that its coordinates are `[0, 0, 1]`.
This vector maps to the RGB `[128, 128, 255]`, which corresponds to the violet-blue color.
Similarly, a surface normal on the cheek in the right part of the image has a large `X` component, which increases the red hue.
Points on the shoulders pointing up with a large `Y` promote green color.
### Speeding up inference
The above quick start snippets are already optimized for speed: they load the LCM checkpoint, use the `fp16` variant of weights and computation, and perform just one denoising diffusion step.
The `pipe(image)` call completes in 280ms on RTX 3090 GPU.
Internally, the input image is encoded with the Stable Diffusion VAE encoder, then the U-Net performs one denoising step, and finally, the prediction latent is decoded with the VAE decoder into pixel space.
In this case, two out of three module calls are dedicated to converting between pixel and latent space of LDM.
Because Marigold's latent space is compatible with the base Stable Diffusion, it is possible to speed up the pipeline call by more than 3x (85ms on RTX 3090) by using a [lightweight replacement of the SD VAE](../api/models/autoencoder_tiny):
```diff
import diffusers
import torch
pipe = diffusers.MarigoldDepthPipeline.from_pretrained(
"prs-eth/marigold-depth-lcm-v1-0", variant="fp16", torch_dtype=torch.float16
).to("cuda")
+ pipe.vae = diffusers.AutoencoderTiny.from_pretrained(
+ "madebyollin/taesd", torch_dtype=torch.float16
+ ).cuda()
image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")
depth = pipe(image)
```
As suggested in [Optimizations](../optimization/torch2.0#torch.compile), adding `torch.compile` may squeeze extra performance depending on the target hardware:
```diff
import diffusers
import torch
pipe = diffusers.MarigoldDepthPipeline.from_pretrained(
"prs-eth/marigold-depth-lcm-v1-0", variant="fp16", torch_dtype=torch.float16
).to("cuda")
+ pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")
depth = pipe(image)
```
## Qualitative Comparison with Depth Anything
With the above speed optimizations, Marigold delivers predictions with more details and faster than [Depth Anything](https://huggingface.co/docs/transformers/main/en/model_doc/depth_anything) with the largest checkpoint [LiheYoung/depth-anything-large-hf](https://huggingface.co/LiheYoung/depth-anything-large-hf):
<div class="flex gap-4">
<div style="flex: 1 1 50%; max-width: 50%;">
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/marigold/marigold_einstein_lcm_depth.png"/>
<figcaption class="mt-1 text-center text-sm text-gray-500">
Marigold LCM fp16 with Tiny AutoEncoder
</figcaption>
</div>
<div style="flex: 1 1 50%; max-width: 50%;">
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/marigold/einstein_depthanything_large.png"/>
<figcaption class="mt-1 text-center text-sm text-gray-500">
Depth Anything Large
</figcaption>
</div>
</div>
## Maximizing Precision and Ensembling
Marigold pipelines have a built-in ensembling mechanism combining multiple predictions from different random latents.
This is a brute-force way of improving the precision of predictions, capitalizing on the generative nature of diffusion.
The ensembling path is activated automatically when the `ensemble_size` argument is set greater than `1`.
When aiming for maximum precision, it makes sense to adjust `num_inference_steps` simultaneously with `ensemble_size`.
The recommended values vary across checkpoints but primarily depend on the scheduler type.
The effect of ensembling is particularly well-seen with surface normals:
```python
import diffusers
model_path = "prs-eth/marigold-normals-v1-0"
model_paper_kwargs = {
diffusers.schedulers.DDIMScheduler: {
"num_inference_steps": 10,
"ensemble_size": 10,
},
diffusers.schedulers.LCMScheduler: {
"num_inference_steps": 4,
"ensemble_size": 5,
},
}
image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")
pipe = diffusers.MarigoldNormalsPipeline.from_pretrained(model_path).to("cuda")
pipe_kwargs = model_paper_kwargs[type(pipe.scheduler)]
depth = pipe(image, **pipe_kwargs)
vis = pipe.image_processor.visualize_normals(depth.prediction)
vis[0].save("einstein_normals.png")
```
<div class="flex gap-4">
<div style="flex: 1 1 50%; max-width: 50%;">
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/marigold/marigold_einstein_lcm_normals.png"/>
<figcaption class="mt-1 text-center text-sm text-gray-500">
Surface normals, no ensembling
</figcaption>
</div>
<div style="flex: 1 1 50%; max-width: 50%;">
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/marigold/marigold_einstein_normals.png"/>
<figcaption class="mt-1 text-center text-sm text-gray-500">
Surface normals, with ensembling
</figcaption>
</div>
</div>
As can be seen, all areas with fine-grained structurers, such as hair, got more conservative and on average more correct predictions.
Such a result is more suitable for precision-sensitive downstream tasks, such as 3D reconstruction.
## Quantitative Evaluation
To evaluate Marigold quantitatively in standard leaderboards and benchmarks (such as NYU, KITTI, and other datasets), follow the evaluation protocol outlined in the paper: load the full precision fp32 model and use appropriate values for `num_inference_steps` and `ensemble_size`.
Optionally seed randomness to ensure reproducibility. Maximizing `batch_size` will deliver maximum device utilization.
```python
import diffusers
import torch
device = "cuda"
seed = 2024
model_path = "prs-eth/marigold-v1-0"
model_paper_kwargs = {
diffusers.schedulers.DDIMScheduler: {
"num_inference_steps": 50,
"ensemble_size": 10,
},
diffusers.schedulers.LCMScheduler: {
"num_inference_steps": 4,
"ensemble_size": 10,
},
}
image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")
generator = torch.Generator(device=device).manual_seed(seed)
pipe = diffusers.MarigoldDepthPipeline.from_pretrained(model_path).to(device)
pipe_kwargs = model_paper_kwargs[type(pipe.scheduler)]
depth = pipe(image, generator=generator, **pipe_kwargs)
# evaluate metrics
```
## Using Predictive Uncertainty
The ensembling mechanism built into Marigold pipelines combines multiple predictions obtained from different random latents.
As a side effect, it can be used to quantify epistemic (model) uncertainty; simply specify `ensemble_size` greater than 1 and set `output_uncertainty=True`.
The resulting uncertainty will be available in the `uncertainty` field of the output.
It can be visualized as follows:
```python
import diffusers
import torch
pipe = diffusers.MarigoldDepthPipeline.from_pretrained(
"prs-eth/marigold-depth-lcm-v1-0", variant="fp16", torch_dtype=torch.float16
).to("cuda")
image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")
depth = pipe(
image,
ensemble_size=10, # any number greater than 1; higher values yield higher precision
output_uncertainty=True,
)
uncertainty = pipe.image_processor.visualize_uncertainty(depth.uncertainty)
uncertainty[0].save("einstein_depth_uncertainty.png")
```
<div class="flex gap-4">
<div style="flex: 1 1 50%; max-width: 50%;">
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/marigold/marigold_einstein_depth_uncertainty.png"/>
<figcaption class="mt-1 text-center text-sm text-gray-500">
Depth uncertainty
</figcaption>
</div>
<div style="flex: 1 1 50%; max-width: 50%;">
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/marigold/marigold_einstein_normals_uncertainty.png"/>
<figcaption class="mt-1 text-center text-sm text-gray-500">
Surface normals uncertainty
</figcaption>
</div>
</div>
The interpretation of uncertainty is easy: higher values (white) correspond to pixels, where the model struggles to make consistent predictions.
Evidently, the depth model is the least confident around edges with discontinuity, where the object depth changes drastically.
The surface normals model is the least confident in fine-grained structures, such as hair, and dark areas, such as the collar.
## Frame-by-frame Video Processing with Temporal Consistency
Due to Marigold's generative nature, each prediction is unique and defined by the random noise sampled for the latent initialization.
This becomes an obvious drawback compared to traditional end-to-end dense regression networks, as exemplified in the following videos:
<div class="flex gap-4">
<div style="flex: 1 1 50%; max-width: 50%;">
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/marigold/marigold_obama.gif"/>
<figcaption class="mt-1 text-center text-sm text-gray-500">Input video</figcaption>
</div>
<div style="flex: 1 1 50%; max-width: 50%;">
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/marigold/marigold_obama_depth_independent.gif"/>
<figcaption class="mt-1 text-center text-sm text-gray-500">Marigold Depth applied to input video frames independently</figcaption>
</div>
</div>
To address this issue, it is possible to pass `latents` argument to the pipelines, which defines the starting point of diffusion.
Empirically, we found that a convex combination of the very same starting point noise latent and the latent corresponding to the previous frame prediction give sufficiently smooth results, as implemented in the snippet below:
```python
import imageio
from PIL import Image
from tqdm import tqdm
import diffusers
import torch
device = "cuda"
path_in = "obama.mp4"
path_out = "obama_depth.gif"
pipe = diffusers.MarigoldDepthPipeline.from_pretrained(
"prs-eth/marigold-depth-lcm-v1-0", variant="fp16", torch_dtype=torch.float16
).to(device)
pipe.vae = diffusers.AutoencoderTiny.from_pretrained(
"madebyollin/taesd", torch_dtype=torch.float16
).to(device)
pipe.set_progress_bar_config(disable=True)
with imageio.get_reader(path_in) as reader:
size = reader.get_meta_data()['size']
last_frame_latent = None
latent_common = torch.randn(
(1, 4, 768 * size[1] // (8 * max(size)), 768 * size[0] // (8 * max(size)))
).to(device=device, dtype=torch.float16)
out = []
for frame_id, frame in tqdm(enumerate(reader), desc="Processing Video"):
frame = Image.fromarray(frame)
latents = latent_common
if last_frame_latent is not None:
latents = 0.9 * latents + 0.1 * last_frame_latent
depth = pipe(
frame, match_input_resolution=False, latents=latents, output_latent=True,
)
last_frame_latent = depth.latent
out.append(pipe.image_processor.visualize_depth(depth.prediction)[0])
diffusers.utils.export_to_gif(out, path_out, fps=reader.get_meta_data()['fps'])
```
Here, the diffusion process starts from the given computed latent.
The pipeline sets `output_latent=True` to access `out.latent` and computes its contribution to the next frame's latent initialization.
The result is much more stable now:
<div class="flex gap-4">
<div style="flex: 1 1 50%; max-width: 50%;">
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/marigold/marigold_obama_depth_independent.gif"/>
<figcaption class="mt-1 text-center text-sm text-gray-500">Marigold Depth applied to input video frames independently</figcaption>
</div>
<div style="flex: 1 1 50%; max-width: 50%;">
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/marigold/marigold_obama_depth_consistent.gif"/>
<figcaption class="mt-1 text-center text-sm text-gray-500">Marigold Depth with forced latents initialization</figcaption>
</div>
</div>
## Marigold for ControlNet
A very common application for depth prediction with diffusion models comes in conjunction with ControlNet.
Depth crispness plays a crucial role in obtaining high-quality results from ControlNet.
As seen in comparisons with other methods above, Marigold excels at that task.
The snippet below demonstrates how to load an image, compute depth, and pass it into ControlNet in a compatible format:
```python
import torch
import diffusers
device = "cuda"
generator = torch.Generator(device=device).manual_seed(2024)
image = diffusers.utils.load_image(
"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet_depth_source.png"
)
pipe = diffusers.MarigoldDepthPipeline.from_pretrained(
"prs-eth/marigold-depth-lcm-v1-0", torch_dtype=torch.float16, variant="fp16"
).to("cuda")
depth_image = pipe(image, generator=generator).prediction
depth_image = pipe.image_processor.visualize_depth(depth_image, color_map="binary")
depth_image[0].save("motorcycle_controlnet_depth.png")
controlnet = diffusers.ControlNetModel.from_pretrained(
"diffusers/controlnet-depth-sdxl-1.0", torch_dtype=torch.float16, variant="fp16"
).to("cuda")
pipe = diffusers.StableDiffusionXLControlNetPipeline.from_pretrained(
"SG161222/RealVisXL_V4.0", torch_dtype=torch.float16, variant="fp16", controlnet=controlnet
).to("cuda")
pipe.scheduler = diffusers.DPMSolverMultistepScheduler.from_config(pipe.scheduler.config, use_karras_sigmas=True)
controlnet_out = pipe(
prompt="high quality photo of a sports bike, city",
negative_prompt="",
guidance_scale=6.5,
num_inference_steps=25,
image=depth_image,
controlnet_conditioning_scale=0.7,
control_guidance_end=0.7,
generator=generator,
).images
controlnet_out[0].save("motorcycle_controlnet_out.png")
```
<div class="flex gap-4">
<div style="flex: 1 1 33%; max-width: 33%;">
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet_depth_source.png"/>
<figcaption class="mt-1 text-center text-sm text-gray-500">
Input image
</figcaption>
</div>
<div style="flex: 1 1 33%; max-width: 33%;">
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/marigold/motorcycle_controlnet_depth.png"/>
<figcaption class="mt-1 text-center text-sm text-gray-500">
Depth in the format compatible with ControlNet
</figcaption>
</div>
<div style="flex: 1 1 33%; max-width: 33%;">
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/marigold/motorcycle_controlnet_out.png"/>
<figcaption class="mt-1 text-center text-sm text-gray-500">
ControlNet generation, conditioned on depth and prompt: "high quality photo of a sports bike, city"
</figcaption>
</div>
</div>
Hopefully, you will find Marigold useful for solving your downstream tasks, be it a part of a more broad generative workflow, or a perception task, such as 3D reconstruction.
@@ -71,7 +71,7 @@ from diffusers.utils.import_utils import is_xformers_available
# Will error if the minimal version of diffusers is not installed. Remove at your own risks.
check_min_version("0.28.0.dev0")
check_min_version("0.28.0")
logger = get_logger(__name__)
@@ -981,7 +981,7 @@ def collate_fn(examples, with_prior_preservation=False):
class PromptDataset(Dataset):
"A simple dataset to prepare the prompts to generate class images on multiple GPUs."
"""A simple dataset to prepare the prompts to generate class images on multiple GPUs."""
def __init__(self, prompt, num_samples):
self.prompt = prompt
@@ -78,7 +78,7 @@ from diffusers.utils.torch_utils import is_compiled_module
# Will error if the minimal version of diffusers is not installed. Remove at your own risks.
check_min_version("0.28.0.dev0")
check_min_version("0.28.0")
logger = get_logger(__name__)
@@ -1136,7 +1136,7 @@ def collate_fn(examples, with_prior_preservation=False):
class PromptDataset(Dataset):
"A simple dataset to prepare the prompts to generate class images on multiple GPUs."
"""A simple dataset to prepare the prompts to generate class images on multiple GPUs."""
def __init__(self, prompt, num_samples):
self.prompt = prompt
+161 -10
View File
@@ -68,6 +68,8 @@ Please also check out our [Community Scripts](https://github.com/huggingface/dif
| InstantID Pipeline | Stable Diffusion XL Pipeline that supports InstantID | [InstantID Pipeline](#instantid-pipeline) | [![Hugging Face Space](https://img.shields.io/badge/🤗%20Hugging%20Face-Space-yellow)](https://huggingface.co/spaces/InstantX/InstantID) | [Haofan Wang](https://github.com/haofanwang) |
| UFOGen Scheduler | Scheduler for UFOGen Model (compatible with Stable Diffusion pipelines) | [UFOGen Scheduler](#ufogen-scheduler) | - | [dg845](https://github.com/dg845) |
| Stable Diffusion XL IPEX Pipeline | Accelerate Stable Diffusion XL inference pipeline with BF16/FP32 precision on Intel Xeon CPUs with [IPEX](https://github.com/intel/intel-extension-for-pytorch) | [Stable Diffusion XL on IPEX](#stable-diffusion-xl-on-ipex) | - | [Dan Li](https://github.com/ustcuna/) |
| Stable Diffusion BoxDiff Pipeline | Training-free controlled generation with bounding boxes using [BoxDiff](https://github.com/showlab/BoxDiff) | [Stable Diffusion BoxDiff Pipeline](#stable-diffusion-boxdiff) | - | [Jingyang Zhang](https://github.com/zjysteven/) |
| FRESCO V2V Pipeline | Implementation of [[CVPR 2024] FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation](https://arxiv.org/abs/2403.12962) | [FRESCO V2V Pipeline](#fresco) | - | [Yifan Zhou](https://github.com/SingleZombie) |
To load a custom pipeline you just need to pass the `custom_pipeline` argument to `DiffusionPipeline`, as one of the files in `diffusers/examples/community`. Feel free to send a PR with your own pipelines, we will merge them quickly.
@@ -238,12 +240,12 @@ pipeline_output = pipe(
# denoising_steps=10, # (optional) Number of denoising steps of each inference pass. Default: 10.
# ensemble_size=10, # (optional) Number of inference passes in the ensemble. Default: 10.
# ------------------------------------------------
# ----- recommended setting for LCM version ------
# denoising_steps=4,
# ensemble_size=5,
# -------------------------------------------------
# processing_res=768, # (optional) Maximum resolution of processing. If set to 0: will not resize at all. Defaults to 768.
# match_input_res=True, # (optional) Resize depth prediction to match input resolution.
# batch_size=0, # (optional) Inference batch size, no bigger than `num_ensemble`. If set to 0, the script will automatically decide the proper batch size. Defaults to 0.
@@ -1030,7 +1032,7 @@ image = pipe().images[0]
Make sure you have @crowsonkb's <https://github.com/crowsonkb/k-diffusion> installed:
```
```sh
pip install k-diffusion
```
@@ -1676,6 +1678,68 @@ image = pipe(prompt, image=input_image, strength=0.75,).images[0]
image.save('tensorrt_img2img_new_zealand_hills.png')
```
### Stable Diffusion BoxDiff
BoxDiff is a training-free method for controlled generation with bounding box coordinates. It shoud work with any Stable Diffusion model. Below shows an example with `stable-diffusion-2-1-base`.
```py
import torch
from PIL import Image, ImageDraw
from copy import deepcopy
from examples.community.pipeline_stable_diffusion_boxdiff import StableDiffusionBoxDiffPipeline
def draw_box_with_text(img, boxes, names):
colors = ["red", "olive", "blue", "green", "orange", "brown", "cyan", "purple"]
img_new = deepcopy(img)
draw = ImageDraw.Draw(img_new)
W, H = img.size
for bid, box in enumerate(boxes):
draw.rectangle([box[0] * W, box[1] * H, box[2] * W, box[3] * H], outline=colors[bid % len(colors)], width=4)
draw.text((box[0] * W, box[1] * H), names[bid], fill=colors[bid % len(colors)])
return img_new
pipe = StableDiffusionBoxDiffPipeline.from_pretrained(
"stabilityai/stable-diffusion-2-1-base",
torch_dtype=torch.float16,
)
pipe.to("cuda")
# example 1
prompt = "as the aurora lights up the sky, a herd of reindeer leisurely wanders on the grassy meadow, admiring the breathtaking view, a serene lake quietly reflects the magnificent display, and in the distance, a snow-capped mountain stands majestically, fantasy, 8k, highly detailed"
phrases = [
"aurora",
"reindeer",
"meadow",
"lake",
"mountain"
]
boxes = [[1,3,512,202], [75,344,421,495], [1,327,508,507], [2,217,507,341], [1,135,509,242]]
# example 2
# prompt = "A rabbit wearing sunglasses looks very proud"
# phrases = ["rabbit", "sunglasses"]
# boxes = [[67,87,366,512], [66,130,364,262]]
boxes = [[x / 512 for x in box] for box in boxes]
images = pipe(
prompt,
boxdiff_phrases=phrases,
boxdiff_boxes=boxes,
boxdiff_kwargs={
"attention_res": 16,
"normalize_eot": True
},
num_inference_steps=50,
guidance_scale=7.5,
generator=torch.manual_seed(42),
safety_checker=None
).images
draw_box_with_text(images[0], boxes, phrases).save("output.png")
```
### Stable Diffusion Reference
This pipeline uses the Reference Control. Refer to the [sd-webui-controlnet discussion: Reference-only Control](https://github.com/Mikubill/sd-webui-controlnet/discussions/1236)[sd-webui-controlnet discussion: Reference-adain Control](https://github.com/Mikubill/sd-webui-controlnet/discussions/1280).
@@ -1790,13 +1854,13 @@ To use this pipeline, you need to:
You can simply use pip to install IPEX with the latest version.
```python
```sh
python -m pip install intel_extension_for_pytorch
```
**Note:** To install a specific version, run with the following command:
```
```sh
python -m pip install intel_extension_for_pytorch==<version_name> -f https://developer.intel.com/ipex-whl-stable-cpu
```
@@ -1894,13 +1958,13 @@ To use this pipeline, you need to:
You can simply use pip to install IPEX with the latest version.
```python
```sh
python -m pip install intel_extension_for_pytorch
```
**Note:** To install a specific version, run with the following command:
```
```sh
python -m pip install intel_extension_for_pytorch==<version_name> -f https://developer.intel.com/ipex-whl-stable-cpu
```
@@ -2946,8 +3010,8 @@ This code implements a pipeline for the Stable Diffusion model, enabling the div
### Sample Code
```
from from examples.community.regional_prompting_stable_diffusion import RegionalPromptingStableDiffusionPipeline
```py
from examples.community.regional_prompting_stable_diffusion import RegionalPromptingStableDiffusionPipeline
pipe = RegionalPromptingStableDiffusionPipeline.from_single_file(model_path, vae=vae)
rp_args = {
@@ -3972,6 +4036,93 @@ onestep_image = pipe(prompt, num_inference_steps=1).images[0]
multistep_image = pipe(prompt, num_inference_steps=4).images[0]
```
### FRESCO
This is the Diffusers implementation of zero-shot video-to-video translation pipeline [FRESCO](https://github.com/williamyang1991/FRESCO) (without Ebsynth postprocessing and background smooth). To run the code, please install gmflow. Then modify the path in `gmflow_dir`. After that, you can run the pipeline with:
```py
from PIL import Image
import cv2
import torch
import numpy as np
from diffusers import ControlNetModel,DDIMScheduler, DiffusionPipeline
import sys
gmflow_dir = "/path/to/gmflow"
sys.path.insert(0, gmflow_dir)
def video_to_frame(video_path: str, interval: int):
vidcap = cv2.VideoCapture(video_path)
success = True
count = 0
res = []
while success:
count += 1
success, image = vidcap.read()
if count % interval != 1:
continue
if image is not None:
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
res.append(image)
if len(res) >= 8:
break
vidcap.release()
return res
input_video_path = 'https://github.com/williamyang1991/FRESCO/raw/main/data/car-turn.mp4'
output_video_path = 'car.gif'
# You can use any fintuned SD here
model_path = 'SG161222/Realistic_Vision_V2.0'
prompt = 'a red car turns in the winter'
a_prompt = ', RAW photo, subject, (high detailed skin:1.2), 8k uhd, dslr, soft lighting, high quality, film grain, Fujifilm XT3, '
n_prompt = '(deformed iris, deformed pupils, semi-realistic, cgi, 3d, render, sketch, cartoon, drawing, anime, mutated hands and fingers:1.4), (deformed, distorted, disfigured:1.3), poorly drawn, bad anatomy, wrong anatomy, extra limb, missing limb, floating limbs, disconnected limbs, mutation, mutated, ugly, disgusting, amputation'
input_interval = 5
frames = video_to_frame(
input_video_path, input_interval)
control_frames = []
# get canny image
for frame in frames:
image = cv2.Canny(frame, 50, 100)
np_image = np.array(image)
np_image = np_image[:, :, None]
np_image = np.concatenate([np_image, np_image, np_image], axis=2)
canny_image = Image.fromarray(np_image)
control_frames.append(canny_image)
# You can use any ControlNet here
controlnet = ControlNetModel.from_pretrained(
"lllyasviel/sd-controlnet-canny").to('cuda')
pipe = DiffusionPipeline.from_pretrained(
model_path, controlnet=controlnet, custom_pipeline='fresco_v2v').to('cuda')
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
generator = torch.manual_seed(0)
frames = [Image.fromarray(frame) for frame in frames]
output_frames = pipe(
prompt + a_prompt,
frames,
control_frames,
num_inference_steps=20,
strength=0.75,
controlnet_conditioning_scale=0.7,
generator=generator,
negative_prompt=n_prompt
).images
output_frames[0].save(output_video_path, save_all=True,
append_images=output_frames[1:], duration=100, loop=0)
```
# Perturbed-Attention Guidance
[Project](https://ku-cvlab.github.io/Perturbed-Attention-Guidance/) / [arXiv](https://arxiv.org/abs/2403.17377) / [GitHub](https://github.com/KU-CVLAB/Perturbed-Attention-Guidance)
@@ -3980,7 +4131,7 @@ This implementation is based on [Diffusers](https://huggingface.co/docs/diffuser
## Example Usage
```
```py
import os
import torch
File diff suppressed because it is too large Load Diff
@@ -565,7 +565,7 @@ class LCMSchedulerWithTimestamp(SchedulerMixin, ConfigMixin):
# Glide cosine schedule
self.betas = betas_for_alpha_bar(num_train_timesteps)
else:
raise NotImplementedError(f"{beta_schedule} does is not implemented for {self.__class__}")
raise NotImplementedError(f"{beta_schedule} is not implemented for {self.__class__}")
# Rescale for zero SNR
if rescale_betas_zero_snr:
@@ -477,7 +477,7 @@ class LCMScheduler(SchedulerMixin, ConfigMixin):
# Glide cosine schedule
self.betas = betas_for_alpha_bar(num_train_timesteps)
else:
raise NotImplementedError(f"{beta_schedule} does is not implemented for {self.__class__}")
raise NotImplementedError(f"{beta_schedule} is not implemented for {self.__class__}")
# Rescale for zero SNR
if rescale_betas_zero_snr:
@@ -43,8 +43,7 @@ from diffusers.utils import BaseOutput, check_min_version
# Will error if the minimal version of diffusers is not installed. Remove at your own risks.
check_min_version("0.25.0")
check_min_version("0.28.0")
class MarigoldDepthOutput(BaseOutput):
"""
File diff suppressed because it is too large Load Diff
@@ -460,7 +460,7 @@ class StableDiffusionUpscaleLDM3DPipeline(
)
# verify batch size of prompt and image are same if image is a list or tensor or numpy array
if isinstance(image, list) or isinstance(image, torch.Tensor) or isinstance(image, np.ndarray):
if isinstance(image, (list, np.ndarray, torch.Tensor)):
if prompt is not None and isinstance(prompt, str):
batch_size = 1
elif prompt is not None and isinstance(prompt, list):
+1 -1
View File
@@ -218,7 +218,7 @@ class UFOGenScheduler(SchedulerMixin, ConfigMixin):
betas = torch.linspace(-6, 6, num_train_timesteps)
self.betas = torch.sigmoid(betas) * (beta_end - beta_start) + beta_start
else:
raise NotImplementedError(f"{beta_schedule} does is not implemented for {self.__class__}")
raise NotImplementedError(f"{beta_schedule} is not implemented for {self.__class__}")
# Rescale for zero SNR
if rescale_betas_zero_snr:
@@ -78,7 +78,7 @@ def torch_dfs(model: torch.nn.Module):
class StableDiffusionReferencePipeline(
DiffusionPipeline, TextualInversionLoaderMixin, LoraLoaderMixin, IPAdapterMixin, FromSingleFileMixin
):
r""" "
r"""
Pipeline for Stable Diffusion Reference.
This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods
@@ -73,7 +73,7 @@ if is_wandb_available():
import wandb
# Will error if the minimal version of diffusers is not installed. Remove at your own risks.
check_min_version("0.28.0.dev0")
check_min_version("0.28.0")
logger = get_logger(__name__)
@@ -66,7 +66,7 @@ if is_wandb_available():
import wandb
# Will error if the minimal version of diffusers is not installed. Remove at your own risks.
check_min_version("0.28.0.dev0")
check_min_version("0.28.0")
logger = get_logger(__name__)
@@ -79,7 +79,7 @@ if is_wandb_available():
import wandb
# Will error if the minimal version of diffusers is not installed. Remove at your own risks.
check_min_version("0.28.0.dev0")
check_min_version("0.28.0")
logger = get_logger(__name__)
@@ -1358,7 +1358,7 @@ def main(args):
# estimates to predict the data point in the augmented PF-ODE trajectory corresponding to the next ODE
# solver timestep.
with torch.no_grad():
if torch.backends.mps.is_available() or "playground" in args.pretrained_model_name_or_path:
if torch.backends.mps.is_available() or "playground" in args.pretrained_teacher_model:
autocast_ctx = nullcontext()
else:
autocast_ctx = torch.autocast(accelerator.device.type)
@@ -72,7 +72,7 @@ if is_wandb_available():
import wandb
# Will error if the minimal version of diffusers is not installed. Remove at your own risks.
check_min_version("0.28.0.dev0")
check_min_version("0.28.0")
logger = get_logger(__name__)
@@ -78,7 +78,7 @@ if is_wandb_available():
import wandb
# Will error if the minimal version of diffusers is not installed. Remove at your own risks.
check_min_version("0.28.0.dev0")
check_min_version("0.28.0")
logger = get_logger(__name__)
+1 -1
View File
@@ -60,7 +60,7 @@ if is_wandb_available():
import wandb
# Will error if the minimal version of diffusers is not installed. Remove at your own risks.
check_min_version("0.28.0.dev0")
check_min_version("0.28.0")
logger = get_logger(__name__)
+1 -1
View File
@@ -60,7 +60,7 @@ if is_wandb_available():
import wandb
# Will error if the minimal version of diffusers is not installed. Remove at your own risks.
check_min_version("0.28.0.dev0")
check_min_version("0.28.0")
logger = logging.getLogger(__name__)
+1 -1
View File
@@ -61,7 +61,7 @@ if is_wandb_available():
import wandb
# Will error if the minimal version of diffusers is not installed. Remove at your own risks.
check_min_version("0.28.0.dev0")
check_min_version("0.28.0")
logger = get_logger(__name__)
if is_torch_npu_available():
@@ -63,7 +63,7 @@ from diffusers.utils.import_utils import is_xformers_available
# Will error if the minimal version of diffusers is not installed. Remove at your own risks.
check_min_version("0.28.0.dev0")
check_min_version("0.28.0")
logger = get_logger(__name__)
@@ -152,7 +152,7 @@ def collate_fn(examples, with_prior_preservation):
class PromptDataset(Dataset):
"A simple dataset to prepare the prompts to generate class images on multiple GPUs."
"""A simple dataset to prepare the prompts to generate class images on multiple GPUs."""
def __init__(self, prompt, num_samples):
self.prompt = prompt
+3 -3
View File
@@ -63,7 +63,7 @@ if is_wandb_available():
import wandb
# Will error if the minimal version of diffusers is not installed. Remove at your own risks.
check_min_version("0.28.0.dev0")
check_min_version("0.28.0")
logger = get_logger(__name__)
@@ -742,7 +742,7 @@ def collate_fn(examples, with_prior_preservation=False):
class PromptDataset(Dataset):
"A simple dataset to prepare the prompts to generate class images on multiple GPUs."
"""A simple dataset to prepare the prompts to generate class images on multiple GPUs."""
def __init__(self, prompt, num_samples):
self.prompt = prompt
@@ -759,7 +759,7 @@ class PromptDataset(Dataset):
def model_has_vae(args):
config_file_name = os.path.join("vae", AutoencoderKL.config_name)
config_file_name = Path("vae", AutoencoderKL.config_name).as_posix()
if os.path.isdir(args.pretrained_model_name_or_path):
config_file_name = os.path.join(args.pretrained_model_name_or_path, config_file_name)
return os.path.isfile(config_file_name)
+2 -2
View File
@@ -35,7 +35,7 @@ from diffusers.utils import check_min_version
# Will error if the minimal version of diffusers is not installed. Remove at your own risks.
check_min_version("0.28.0.dev0")
check_min_version("0.28.0")
# Cache compiled models across invocations of this script.
cc.initialize_cache(os.path.expanduser("~/.cache/jax/compilation_cache"))
@@ -301,7 +301,7 @@ class DreamBoothDataset(Dataset):
class PromptDataset(Dataset):
"A simple dataset to prepare the prompts to generate class images on multiple GPUs."
"""A simple dataset to prepare the prompts to generate class images on multiple GPUs."""
def __init__(self, prompt, num_samples):
self.prompt = prompt
+2 -2
View File
@@ -70,7 +70,7 @@ if is_wandb_available():
import wandb
# Will error if the minimal version of diffusers is not installed. Remove at your own risks.
check_min_version("0.28.0.dev0")
check_min_version("0.28.0")
logger = get_logger(__name__)
@@ -680,7 +680,7 @@ def collate_fn(examples, with_prior_preservation=False):
class PromptDataset(Dataset):
"A simple dataset to prepare the prompts to generate class images on multiple GPUs."
"""A simple dataset to prepare the prompts to generate class images on multiple GPUs."""
def __init__(self, prompt, num_samples):
self.prompt = prompt
@@ -78,7 +78,7 @@ if is_wandb_available():
import wandb
# Will error if the minimal version of diffusers is not installed. Remove at your own risks.
check_min_version("0.28.0.dev0")
check_min_version("0.28.0")
logger = get_logger(__name__)
@@ -903,7 +903,7 @@ def collate_fn(examples, with_prior_preservation=False):
class PromptDataset(Dataset):
"A simple dataset to prepare the prompts to generate class images on multiple GPUs."
"""A simple dataset to prepare the prompts to generate class images on multiple GPUs."""
def __init__(self, prompt, num_samples):
self.prompt = prompt
@@ -57,7 +57,7 @@ if is_wandb_available():
import wandb
# Will error if the minimal version of diffusers is not installed. Remove at your own risks.
check_min_version("0.28.0.dev0")
check_min_version("0.28.0")
logger = get_logger(__name__, log_level="INFO")
@@ -60,7 +60,7 @@ if is_wandb_available():
import wandb
# Will error if the minimal version of diffusers is not installed. Remove at your own risks.
check_min_version("0.28.0.dev0")
check_min_version("0.28.0")
logger = get_logger(__name__, log_level="INFO")
@@ -52,7 +52,7 @@ if is_wandb_available():
# Will error if the minimal version of diffusers is not installed. Remove at your own risks.
check_min_version("0.28.0.dev0")
check_min_version("0.28.0")
logger = get_logger(__name__, log_level="INFO")
@@ -896,7 +896,6 @@ def main():
images = []
if args.validation_prompts is not None:
logger.info("Running inference for collecting generated images...")
pipeline = pipeline.to(accelerator.device)
pipeline.torch_dtype = weight_dtype
pipeline.set_progress_bar_config(disable=True)
pipeline.enable_model_cpu_offload()
@@ -46,7 +46,7 @@ from diffusers.utils import check_min_version, is_wandb_available
# Will error if the minimal version of diffusers is not installed. Remove at your own risks.
check_min_version("0.28.0.dev0")
check_min_version("0.28.0")
logger = get_logger(__name__, log_level="INFO")
@@ -46,7 +46,7 @@ from diffusers.utils import check_min_version, is_wandb_available
# Will error if the minimal version of diffusers is not installed. Remove at your own risks.
check_min_version("0.28.0.dev0")
check_min_version("0.28.0")
logger = get_logger(__name__, log_level="INFO")
@@ -51,7 +51,7 @@ if is_wandb_available():
# Will error if the minimal version of diffusers is not installed. Remove at your own risks.
check_min_version("0.28.0.dev0")
check_min_version("0.28.0")
logger = get_logger(__name__, log_level="INFO")
@@ -327,7 +327,7 @@ class DreamBoothDataset(Dataset):
class PromptDataset(Dataset):
"A simple dataset to prepare the prompts to generate class images on multiple GPUs."
"""A simple dataset to prepare the prompts to generate class images on multiple GPUs."""
def __init__(self, prompt, num_samples):
self.prompt = prompt
@@ -385,7 +385,7 @@ class DreamBoothDataset(Dataset):
class PromptDataset(Dataset):
"A simple dataset to prepare the prompts to generate class images on multiple GPUs."
"""A simple dataset to prepare the prompts to generate class images on multiple GPUs."""
def __init__(self, prompt, num_samples):
self.prompt = prompt
@@ -384,7 +384,7 @@ class DreamBoothDataset(Dataset):
class PromptDataset(Dataset):
"A simple dataset to prepare the prompts to generate class images on multiple GPUs."
"""A simple dataset to prepare the prompts to generate class images on multiple GPUs."""
def __init__(self, prompt, num_samples):
self.prompt = prompt
@@ -1,6 +1,6 @@
diffusers==0.20.1
accelerate==0.23.0
transformers==4.36.0
transformers==4.38.0
peft==0.5.0
torch==2.0.1
torchvision>=0.16
@@ -762,7 +762,7 @@ def collate_fn(examples, with_prior_preservation=False):
class PromptDataset(Dataset):
"A simple dataset to prepare the prompts to generate class images on multiple GPUs."
"""A simple dataset to prepare the prompts to generate class images on multiple GPUs."""
def __init__(self, prompt, num_samples):
self.prompt = prompt
@@ -700,7 +700,7 @@ def collate_fn(examples, with_prior_preservation=False):
class PromptDataset(Dataset):
"A simple dataset to prepare the prompts to generate class images on multiple GPUs."
"""A simple dataset to prepare the prompts to generate class images on multiple GPUs."""
def __init__(self, prompt, num_samples):
self.prompt = prompt
@@ -922,7 +922,7 @@ def collate_fn(examples, with_prior_preservation=False):
class PromptDataset(Dataset):
"A simple dataset to prepare the prompts to generate class images on multiple GPUs."
"""A simple dataset to prepare the prompts to generate class images on multiple GPUs."""
def __init__(self, prompt, num_samples):
self.prompt = prompt
@@ -60,7 +60,7 @@ if is_wandb_available():
import wandb
# Will error if the minimal version of diffusers is not installed. Remove at your own risks.
check_min_version("0.28.0.dev0")
check_min_version("0.28.0")
logger = get_logger(__name__)
+2 -2
View File
@@ -1,8 +1,8 @@
accelerate>=0.16.0
torchvision
transformers>=4.25.1
datasets
datasets>=2.19.1
ftfy
tensorboard
Jinja2
peft==0.7.0
peft==0.7.0
@@ -57,7 +57,7 @@ if is_wandb_available():
# Will error if the minimal version of diffusers is not installed. Remove at your own risks.
check_min_version("0.28.0.dev0")
check_min_version("0.28.0")
logger = get_logger(__name__, log_level="INFO")
@@ -49,7 +49,7 @@ from diffusers.utils import check_min_version
# Will error if the minimal version of diffusers is not installed. Remove at your own risks.
check_min_version("0.28.0.dev0")
check_min_version("0.28.0")
logger = logging.getLogger(__name__)
@@ -53,7 +53,7 @@ from diffusers.utils.torch_utils import is_compiled_module
# Will error if the minimal version of diffusers is not installed. Remove at your own risks.
check_min_version("0.28.0.dev0")
check_min_version("0.28.0")
logger = get_logger(__name__, log_level="INFO")
@@ -65,7 +65,7 @@ from diffusers.utils.torch_utils import is_compiled_module
# Will error if the minimal version of diffusers is not installed. Remove at your own risks.
check_min_version("0.28.0.dev0")
check_min_version("0.28.0")
logger = get_logger(__name__)
if is_torch_npu_available():
@@ -50,15 +50,16 @@ from diffusers.optimization import get_scheduler
from diffusers.training_utils import EMAModel, compute_snr
from diffusers.utils import check_min_version, is_wandb_available
from diffusers.utils.hub_utils import load_or_create_model_card, populate_model_card
from diffusers.utils.import_utils import is_xformers_available
from diffusers.utils.import_utils import is_torch_npu_available, is_xformers_available
from diffusers.utils.torch_utils import is_compiled_module
# Will error if the minimal version of diffusers is not installed. Remove at your own risks.
check_min_version("0.28.0.dev0")
check_min_version("0.28.0")
logger = get_logger(__name__)
if is_torch_npu_available():
torch.npu.config.allow_internal_format = False
DATASET_NAME_MAPPING = {
"lambdalabs/naruto-blip-captions": ("image", "text"),
@@ -460,6 +461,9 @@ def parse_args(input_args=None):
),
)
parser.add_argument("--local_rank", type=int, default=-1, help="For distributed training: local_rank")
parser.add_argument(
"--enable_npu_flash_attention", action="store_true", help="Whether or not to use npu flash attention."
)
parser.add_argument(
"--enable_xformers_memory_efficient_attention", action="store_true", help="Whether or not to use xformers."
)
@@ -716,7 +720,12 @@ def main(args):
args.pretrained_model_name_or_path, subfolder="unet", revision=args.revision, variant=args.variant
)
ema_unet = EMAModel(ema_unet.parameters(), model_cls=UNet2DConditionModel, model_config=ema_unet.config)
if args.enable_npu_flash_attention:
if is_torch_npu_available():
logger.info("npu flash attention enabled.")
unet.enable_npu_flash_attention()
else:
raise ValueError("npu flash attention requires torch_npu extensions and is supported only on npu devices.")
if args.enable_xformers_memory_efficient_attention:
if is_xformers_available():
import xformers
@@ -81,7 +81,7 @@ else:
# Will error if the minimal version of diffusers is not installed. Remove at your own risks.
check_min_version("0.28.0.dev0")
check_min_version("0.28.0")
logger = get_logger(__name__)
@@ -56,7 +56,7 @@ else:
# ------------------------------------------------------------------------------
# Will error if the minimal version of diffusers is not installed. Remove at your own risks.
check_min_version("0.28.0.dev0")
check_min_version("0.28.0")
logger = logging.getLogger(__name__)
@@ -76,7 +76,7 @@ else:
# Will error if the minimal version of diffusers is not installed. Remove at your own risks.
check_min_version("0.28.0.dev0")
check_min_version("0.28.0")
logger = get_logger(__name__)
@@ -29,7 +29,7 @@ from diffusers.utils.import_utils import is_xformers_available
# Will error if the minimal version of diffusers is not installed. Remove at your own risks.
check_min_version("0.28.0.dev0")
check_min_version("0.28.0")
logger = get_logger(__name__, log_level="INFO")
+127
View File
@@ -0,0 +1,127 @@
## Training an VQGAN VAE
VQVAEs were first introduced in [Neural Discrete Representation Learning](https://arxiv.org/abs/1711.00937) and was combined with a GAN in the paper [Taming Transformers for High-Resolution Image Synthesis](https://arxiv.org/abs/2012.09841). The basic idea of a VQVAE is it's a type of a variational auto encoder with tokens as the latent space similar to tokens for LLMs. This script was adapted from a [pr to huggingface's open-muse project](https://github.com/huggingface/open-muse/pull/52) with general code following [lucidrian's implementation of the vqgan training script](https://github.com/lucidrains/muse-maskgit-pytorch/blob/main/muse_maskgit_pytorch/trainers.py) but both of these implementation follow from the [taming transformer repo](https://github.com/CompVis/taming-transformers?tab=readme-ov-file).
Creating a training image set is [described in a different document](https://huggingface.co/docs/datasets/image_process#image-datasets).
### Installing the dependencies
Before running the scripts, make sure to install the library's training dependencies:
**Important**
To make sure you can successfully run the latest versions of the example scripts, we highly recommend **installing from source** and keeping the install up to date as we update the example scripts frequently and install some example-specific requirements. To do this, execute the following steps in a new virtual environment:
```bash
git clone https://github.com/huggingface/diffusers
cd diffusers
pip install .
```
Then cd in the example folder and run
```bash
pip install -r requirements.txt
```
And initialize an [🤗Accelerate](https://github.com/huggingface/accelerate/) environment with:
```bash
accelerate config
```
### Training on CIFAR10
The command to train a VQGAN model on cifar10 dataset:
```bash
accelerate launch train_vqgan.py \
--dataset_name=cifar10 \
--image_column=img \
--validation_images images/bird.jpg images/car.jpg images/dog.jpg images/frog.jpg \
--resolution=128 \
--train_batch_size=2 \
--gradient_accumulation_steps=8 \
--report_to=wandb
```
An example training run is [here](https://wandb.ai/sayakpaul/vqgan-training/runs/0m5kzdfp) by @sayakpaul and a lower scale one [here](https://wandb.ai/dsbuddy27/vqgan-training/runs/eqd6xi4n?nw=nwuserisamu). The validation images can be obtained from [here](https://huggingface.co/datasets/diffusers/docs-images/tree/main/vqgan_validation_images).
The simplest way to improve the quality of a VQGAN model is to maximize the amount of information present in the bottleneck. The easiest way to do this is increasing the image resolution. However, other ways include, but not limited to, lowering compression by downsampling fewer times or increasing the vocaburary size which at most can be around 16384. How to do this is shown below.
# Modifying the architecture
To modify the architecture of the vqgan model you can save the config taken from [here](https://huggingface.co/kandinsky-community/kandinsky-2-2-decoder/blob/main/movq/config.json) and then provide that to the script with the option --model_config_name_or_path. This config is below
```
{
"_class_name": "VQModel",
"_diffusers_version": "0.17.0.dev0",
"act_fn": "silu",
"block_out_channels": [
128,
256,
256,
512
],
"down_block_types": [
"DownEncoderBlock2D",
"DownEncoderBlock2D",
"DownEncoderBlock2D",
"AttnDownEncoderBlock2D"
],
"in_channels": 3,
"latent_channels": 4,
"layers_per_block": 2,
"norm_num_groups": 32,
"norm_type": "spatial",
"num_vq_embeddings": 16384,
"out_channels": 3,
"sample_size": 32,
"scaling_factor": 0.18215,
"up_block_types": [
"AttnUpDecoderBlock2D",
"UpDecoderBlock2D",
"UpDecoderBlock2D",
"UpDecoderBlock2D"
],
"vq_embed_dim": 4
}
```
To lower the amount of layers in a VQGan, you can remove layers by modifying the block_out_channels, down_block_types, and up_block_types like below
```
{
"_class_name": "VQModel",
"_diffusers_version": "0.17.0.dev0",
"act_fn": "silu",
"block_out_channels": [
128,
256,
256,
],
"down_block_types": [
"DownEncoderBlock2D",
"DownEncoderBlock2D",
"DownEncoderBlock2D",
],
"in_channels": 3,
"latent_channels": 4,
"layers_per_block": 2,
"norm_num_groups": 32,
"norm_type": "spatial",
"num_vq_embeddings": 16384,
"out_channels": 3,
"sample_size": 32,
"scaling_factor": 0.18215,
"up_block_types": [
"UpDecoderBlock2D",
"UpDecoderBlock2D",
"UpDecoderBlock2D"
],
"vq_embed_dim": 4
}
```
For increasing the size of the vocaburaries you can increase num_vq_embeddings. However, [some research](https://magvit.cs.cmu.edu/v2/) shows that the representation of VQGANs start degrading after 2^14~16384 vq embeddings so it's not recommended to go past that.
## Extra training tips/ideas
During logging take care to make sure data_time is low. data_time is the amount spent loading the data and where the GPU is not active. So essentially, it's the time wasted. The easiest way to lower data time is to increase the --dataloader_num_workers to a higher number like 4. Due to a bug in Pytorch, this only works on linux based systems. For more details check [here](https://github.com/huggingface/diffusers/issues/7646)
Secondly, training should seem to be done when both the discriminator and the generator loss converges.
Thirdly, another low hanging fruit is just using ema using the --use_ema parameter. This tends to make the output images smoother. This has a con where you have to lower your batch size by 1 but it may be worth it.
Another more experimental low hanging fruit is changing from the vgg19 to different models for the lpips loss using the --timm_model_backend. If you do this, I recommend also changing the timm_model_layers parameter to the layer in your model which you think is best for representation. However, becareful with the feature map norms since this can easily overdominate the loss.
+48
View File
@@ -0,0 +1,48 @@
"""
Ported from Paella
"""
import torch
from torch import nn
from diffusers.configuration_utils import ConfigMixin, register_to_config
from diffusers.models.modeling_utils import ModelMixin
# Discriminator model ported from Paella https://github.com/dome272/Paella/blob/main/src_distributed/vqgan.py
class Discriminator(ModelMixin, ConfigMixin):
@register_to_config
def __init__(self, in_channels=3, cond_channels=0, hidden_channels=512, depth=6):
super().__init__()
d = max(depth - 3, 3)
layers = [
nn.utils.spectral_norm(
nn.Conv2d(in_channels, hidden_channels // (2**d), kernel_size=3, stride=2, padding=1)
),
nn.LeakyReLU(0.2),
]
for i in range(depth - 1):
c_in = hidden_channels // (2 ** max((d - i), 0))
c_out = hidden_channels // (2 ** max((d - 1 - i), 0))
layers.append(nn.utils.spectral_norm(nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1)))
layers.append(nn.InstanceNorm2d(c_out))
layers.append(nn.LeakyReLU(0.2))
self.encoder = nn.Sequential(*layers)
self.shuffle = nn.Conv2d(
(hidden_channels + cond_channels) if cond_channels > 0 else hidden_channels, 1, kernel_size=1
)
self.logits = nn.Sigmoid()
def forward(self, x, cond=None):
x = self.encoder(x)
if cond is not None:
cond = cond.view(
cond.size(0),
cond.size(1),
1,
1,
).expand(-1, -1, x.size(-2), x.size(-1))
x = torch.cat([x, cond], dim=1)
x = self.shuffle(x)
x = self.logits(x)
return x
+8
View File
@@ -0,0 +1,8 @@
accelerate>=0.16.0
torchvision
transformers>=4.25.1
datasets
timm
numpy
tqdm
tensorboard
+395
View File
@@ -0,0 +1,395 @@
#!/usr/bin/env python
# coding=utf-8
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import json
import logging
import os
import shutil
import sys
import tempfile
import torch
from diffusers import VQModel
from diffusers.utils.testing_utils import require_timm
sys.path.append("..")
from test_examples_utils import ExamplesTestsAccelerate, run_command # noqa: E402
logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger()
stream_handler = logging.StreamHandler(sys.stdout)
logger.addHandler(stream_handler)
@require_timm
class TextToImage(ExamplesTestsAccelerate):
@property
def test_vqmodel_config(self):
return {
"_class_name": "VQModel",
"_diffusers_version": "0.17.0.dev0",
"act_fn": "silu",
"block_out_channels": [
32,
],
"down_block_types": [
"DownEncoderBlock2D",
],
"in_channels": 3,
"latent_channels": 4,
"layers_per_block": 2,
"norm_num_groups": 32,
"norm_type": "spatial",
"num_vq_embeddings": 32,
"out_channels": 3,
"sample_size": 32,
"scaling_factor": 0.18215,
"up_block_types": [
"UpDecoderBlock2D",
],
"vq_embed_dim": 4,
}
@property
def test_discriminator_config(self):
return {
"_class_name": "Discriminator",
"_diffusers_version": "0.27.0.dev0",
"in_channels": 3,
"cond_channels": 0,
"hidden_channels": 8,
"depth": 4,
}
def get_vq_and_discriminator_configs(self, tmpdir):
vqmodel_config_path = os.path.join(tmpdir, "vqmodel.json")
discriminator_config_path = os.path.join(tmpdir, "discriminator.json")
with open(vqmodel_config_path, "w") as fp:
json.dump(self.test_vqmodel_config, fp)
with open(discriminator_config_path, "w") as fp:
json.dump(self.test_discriminator_config, fp)
return vqmodel_config_path, discriminator_config_path
def test_vqmodel(self):
with tempfile.TemporaryDirectory() as tmpdir:
vqmodel_config_path, discriminator_config_path = self.get_vq_and_discriminator_configs(tmpdir)
test_args = f"""
examples/vqgan/train_vqgan.py
--dataset_name hf-internal-testing/dummy_image_text_data
--resolution 32
--image_column image
--train_batch_size 1
--gradient_accumulation_steps 1
--max_train_steps 2
--learning_rate 5.0e-04
--scale_lr
--lr_scheduler constant
--lr_warmup_steps 0
--model_config_name_or_path {vqmodel_config_path}
--discriminator_config_name_or_path {discriminator_config_path}
--output_dir {tmpdir}
""".split()
run_command(self._launch_args + test_args)
# save_pretrained smoke test
self.assertTrue(
os.path.isfile(os.path.join(tmpdir, "discriminator", "diffusion_pytorch_model.safetensors"))
)
self.assertTrue(os.path.isfile(os.path.join(tmpdir, "vqmodel", "diffusion_pytorch_model.safetensors")))
def test_vqmodel_checkpointing(self):
with tempfile.TemporaryDirectory() as tmpdir:
vqmodel_config_path, discriminator_config_path = self.get_vq_and_discriminator_configs(tmpdir)
# Run training script with checkpointing
# max_train_steps == 4, checkpointing_steps == 2
# Should create checkpoints at steps 2, 4
initial_run_args = f"""
examples/vqgan/train_vqgan.py
--dataset_name hf-internal-testing/dummy_image_text_data
--resolution 32
--image_column image
--train_batch_size 1
--gradient_accumulation_steps 1
--max_train_steps 4
--learning_rate 5.0e-04
--scale_lr
--lr_scheduler constant
--lr_warmup_steps 0
--model_config_name_or_path {vqmodel_config_path}
--discriminator_config_name_or_path {discriminator_config_path}
--checkpointing_steps=2
--output_dir {tmpdir}
--seed=0
""".split()
run_command(self._launch_args + initial_run_args)
# check checkpoint directories exist
self.assertEqual(
{x for x in os.listdir(tmpdir) if "checkpoint" in x},
{"checkpoint-2", "checkpoint-4"},
)
# check can run an intermediate checkpoint
model = VQModel.from_pretrained(tmpdir, subfolder="checkpoint-2/vqmodel")
image = torch.randn(1, model.config.in_channels, model.config.sample_size, model.config.sample_size)
_ = model(image)
# Remove checkpoint 2 so that we can check only later checkpoints exist after resuming
shutil.rmtree(os.path.join(tmpdir, "checkpoint-2"))
self.assertEqual(
{x for x in os.listdir(tmpdir) if "checkpoint" in x},
{"checkpoint-4"},
)
# Run training script for 2 total steps resuming from checkpoint 4
resume_run_args = f"""
examples/vqgan/train_vqgan.py
--dataset_name hf-internal-testing/dummy_image_text_data
--resolution 32
--image_column image
--train_batch_size 1
--gradient_accumulation_steps 1
--max_train_steps 6
--learning_rate 5.0e-04
--scale_lr
--lr_scheduler constant
--lr_warmup_steps 0
--model_config_name_or_path {vqmodel_config_path}
--discriminator_config_name_or_path {discriminator_config_path}
--checkpointing_steps=1
--resume_from_checkpoint={os.path.join(tmpdir, 'checkpoint-4')}
--output_dir {tmpdir}
--seed=0
""".split()
run_command(self._launch_args + resume_run_args)
# check can run new fully trained pipeline
model = VQModel.from_pretrained(tmpdir, subfolder="vqmodel")
image = torch.randn(1, model.config.in_channels, model.config.sample_size, model.config.sample_size)
_ = model(image)
# no checkpoint-2 -> check old checkpoints do not exist
# check new checkpoints exist
# In the current script, checkpointing_steps 1 is equivalent to checkpointing_steps 2 as after the generator gets trained for one step,
# the discriminator gets trained and loss and saving happens after that. Thus we do not expect to get a checkpoint-5
self.assertEqual(
{x for x in os.listdir(tmpdir) if "checkpoint" in x},
{"checkpoint-4", "checkpoint-6"},
)
def test_vqmodel_checkpointing_use_ema(self):
with tempfile.TemporaryDirectory() as tmpdir:
vqmodel_config_path, discriminator_config_path = self.get_vq_and_discriminator_configs(tmpdir)
# Run training script with checkpointing
# max_train_steps == 4, checkpointing_steps == 2
# Should create checkpoints at steps 2, 4
initial_run_args = f"""
examples/vqgan/train_vqgan.py
--dataset_name hf-internal-testing/dummy_image_text_data
--resolution 32
--image_column image
--train_batch_size 1
--gradient_accumulation_steps 1
--max_train_steps 4
--learning_rate 5.0e-04
--scale_lr
--lr_scheduler constant
--lr_warmup_steps 0
--model_config_name_or_path {vqmodel_config_path}
--discriminator_config_name_or_path {discriminator_config_path}
--checkpointing_steps=2
--output_dir {tmpdir}
--use_ema
--seed=0
""".split()
run_command(self._launch_args + initial_run_args)
model = VQModel.from_pretrained(tmpdir, subfolder="vqmodel")
image = torch.randn(1, model.config.in_channels, model.config.sample_size, model.config.sample_size)
_ = model(image)
# check checkpoint directories exist
self.assertEqual(
{x for x in os.listdir(tmpdir) if "checkpoint" in x},
{"checkpoint-2", "checkpoint-4"},
)
# check can run an intermediate checkpoint
model = VQModel.from_pretrained(tmpdir, subfolder="checkpoint-2/vqmodel")
image = torch.randn(1, model.config.in_channels, model.config.sample_size, model.config.sample_size)
_ = model(image)
# Remove checkpoint 2 so that we can check only later checkpoints exist after resuming
shutil.rmtree(os.path.join(tmpdir, "checkpoint-2"))
# Run training script for 2 total steps resuming from checkpoint 4
resume_run_args = f"""
examples/vqgan/train_vqgan.py
--dataset_name hf-internal-testing/dummy_image_text_data
--resolution 32
--image_column image
--train_batch_size 1
--gradient_accumulation_steps 1
--max_train_steps 6
--learning_rate 5.0e-04
--scale_lr
--lr_scheduler constant
--lr_warmup_steps 0
--model_config_name_or_path {vqmodel_config_path}
--discriminator_config_name_or_path {discriminator_config_path}
--checkpointing_steps=1
--resume_from_checkpoint={os.path.join(tmpdir, 'checkpoint-4')}
--output_dir {tmpdir}
--use_ema
--seed=0
""".split()
run_command(self._launch_args + resume_run_args)
# check can run new fully trained pipeline
model = VQModel.from_pretrained(tmpdir, subfolder="vqmodel")
image = torch.randn(1, model.config.in_channels, model.config.sample_size, model.config.sample_size)
_ = model(image)
# no checkpoint-2 -> check old checkpoints do not exist
# check new checkpoints exist
self.assertEqual(
{x for x in os.listdir(tmpdir) if "checkpoint" in x},
{"checkpoint-4", "checkpoint-6"},
)
def test_vqmodel_checkpointing_checkpoints_total_limit(self):
with tempfile.TemporaryDirectory() as tmpdir:
vqmodel_config_path, discriminator_config_path = self.get_vq_and_discriminator_configs(tmpdir)
# Run training script with checkpointing
# max_train_steps == 6, checkpointing_steps == 2, checkpoints_total_limit == 2
# Should create checkpoints at steps 2, 4, 6
# with checkpoint at step 2 deleted
initial_run_args = f"""
examples/vqgan/train_vqgan.py
--dataset_name hf-internal-testing/dummy_image_text_data
--resolution 32
--image_column image
--train_batch_size 1
--gradient_accumulation_steps 1
--max_train_steps 6
--learning_rate 5.0e-04
--scale_lr
--lr_scheduler constant
--lr_warmup_steps 0
--model_config_name_or_path {vqmodel_config_path}
--discriminator_config_name_or_path {discriminator_config_path}
--output_dir {tmpdir}
--checkpointing_steps=2
--checkpoints_total_limit=2
--seed=0
""".split()
run_command(self._launch_args + initial_run_args)
model = VQModel.from_pretrained(tmpdir, subfolder="vqmodel")
image = torch.randn(1, model.config.in_channels, model.config.sample_size, model.config.sample_size)
_ = model(image)
# check checkpoint directories exist
# checkpoint-2 should have been deleted
self.assertEqual({x for x in os.listdir(tmpdir) if "checkpoint" in x}, {"checkpoint-4", "checkpoint-6"})
def test_vqmodel_checkpointing_checkpoints_total_limit_removes_multiple_checkpoints(self):
with tempfile.TemporaryDirectory() as tmpdir:
vqmodel_config_path, discriminator_config_path = self.get_vq_and_discriminator_configs(tmpdir)
# Run training script with checkpointing
# max_train_steps == 4, checkpointing_steps == 2
# Should create checkpoints at steps 2, 4
initial_run_args = f"""
examples/vqgan/train_vqgan.py
--dataset_name hf-internal-testing/dummy_image_text_data
--resolution 32
--image_column image
--train_batch_size 1
--gradient_accumulation_steps 1
--max_train_steps 4
--learning_rate 5.0e-04
--scale_lr
--lr_scheduler constant
--lr_warmup_steps 0
--model_config_name_or_path {vqmodel_config_path}
--discriminator_config_name_or_path {discriminator_config_path}
--checkpointing_steps=2
--output_dir {tmpdir}
--seed=0
""".split()
run_command(self._launch_args + initial_run_args)
model = VQModel.from_pretrained(tmpdir, subfolder="vqmodel")
image = torch.randn(1, model.config.in_channels, model.config.sample_size, model.config.sample_size)
_ = model(image)
# check checkpoint directories exist
self.assertEqual(
{x for x in os.listdir(tmpdir) if "checkpoint" in x},
{"checkpoint-2", "checkpoint-4"},
)
# resume and we should try to checkpoint at 6, where we'll have to remove
# checkpoint-2 and checkpoint-4 instead of just a single previous checkpoint
resume_run_args = f"""
examples/vqgan/train_vqgan.py
--dataset_name hf-internal-testing/dummy_image_text_data
--resolution 32
--image_column image
--train_batch_size 1
--gradient_accumulation_steps 1
--max_train_steps 8
--learning_rate 5.0e-04
--scale_lr
--lr_scheduler constant
--lr_warmup_steps 0
--model_config_name_or_path {vqmodel_config_path}
--discriminator_config_name_or_path {discriminator_config_path}
--output_dir {tmpdir}
--checkpointing_steps=2
--resume_from_checkpoint={os.path.join(tmpdir, 'checkpoint-4')}
--checkpoints_total_limit=2
--seed=0
""".split()
run_command(self._launch_args + resume_run_args)
model = VQModel.from_pretrained(tmpdir, subfolder="vqmodel")
image = torch.randn(1, model.config.in_channels, model.config.sample_size, model.config.sample_size)
_ = model(image)
# check checkpoint directories exist
self.assertEqual(
{x for x in os.listdir(tmpdir) if "checkpoint" in x},
{"checkpoint-6", "checkpoint-8"},
)
File diff suppressed because it is too large Load Diff
@@ -50,7 +50,7 @@ if is_wandb_available():
# Will error if the minimal version of diffusers is not installed. Remove at your own risks.
check_min_version("0.28.0.dev0")
check_min_version("0.28.0")
logger = get_logger(__name__, log_level="INFO")
@@ -51,7 +51,7 @@ if is_wandb_available():
# Will error if the minimal version of diffusers is not installed. Remove at your own risks.
check_min_version("0.28.0.dev0")
check_min_version("0.28.0")
logger = get_logger(__name__, log_level="INFO")
+1 -1
View File
@@ -254,7 +254,7 @@ version_range_max = max(sys.version_info[1], 10) + 1
setup(
name="diffusers",
version="0.28.0.dev0", # expected format is one of x.y.z.dev0, or x.y.z.rc1 or x.y.z (no to dashes, yes to dots)
version="0.28.2", # expected format is one of x.y.z.dev0, or x.y.z.rc1 or x.y.z (no to dashes, yes to dots)
description="State-of-the-art diffusion in PyTorch and JAX.",
long_description=open("README.md", "r", encoding="utf-8").read(),
long_description_content_type="text/markdown",
+13 -1
View File
@@ -1,4 +1,4 @@
__version__ = "0.28.0.dev0"
__version__ = "0.28.2"
from typing import TYPE_CHECKING
@@ -82,11 +82,14 @@ else:
"ConsistencyDecoderVAE",
"ControlNetModel",
"ControlNetXSAdapter",
"DiTTransformer2DModel",
"HunyuanDiT2DModel",
"I2VGenXLUNet",
"Kandinsky3UNet",
"ModelMixin",
"MotionAdapter",
"MultiAdapter",
"PixArtTransformer2DModel",
"PriorTransformer",
"StableCascadeUNet",
"T2IAdapter",
@@ -227,6 +230,7 @@ else:
"BlipDiffusionPipeline",
"CLIPImageProjection",
"CycleDiffusionPipeline",
"HunyuanDiTPipeline",
"I2VGenXLPipeline",
"IFImg2ImgPipeline",
"IFImg2ImgSuperResolutionPipeline",
@@ -259,6 +263,8 @@ else:
"LDMTextToImagePipeline",
"LEditsPPPipelineStableDiffusion",
"LEditsPPPipelineStableDiffusionXL",
"MarigoldDepthPipeline",
"MarigoldNormalsPipeline",
"MusicLDMPipeline",
"PaintByExamplePipeline",
"PIAPipeline",
@@ -482,11 +488,14 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
ConsistencyDecoderVAE,
ControlNetModel,
ControlNetXSAdapter,
DiTTransformer2DModel,
HunyuanDiT2DModel,
I2VGenXLUNet,
Kandinsky3UNet,
ModelMixin,
MotionAdapter,
MultiAdapter,
PixArtTransformer2DModel,
PriorTransformer,
T2IAdapter,
T5FilmDecoder,
@@ -605,6 +614,7 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
AudioLDMPipeline,
CLIPImageProjection,
CycleDiffusionPipeline,
HunyuanDiTPipeline,
I2VGenXLPipeline,
IFImg2ImgPipeline,
IFImg2ImgSuperResolutionPipeline,
@@ -637,6 +647,8 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
LDMTextToImagePipeline,
LEditsPPPipelineStableDiffusion,
LEditsPPPipelineStableDiffusionXL,
MarigoldDepthPipeline,
MarigoldNormalsPipeline,
MusicLDMPipeline,
PaintByExamplePipeline,
PIAPipeline,
+156
View File
@@ -0,0 +1,156 @@
from typing import Any, Dict, List
from .configuration_utils import ConfigMixin, register_to_config
from .utils import CONFIG_NAME
class PipelineCallback(ConfigMixin):
"""
Base class for all the official callbacks used in a pipeline. This class provides a structure for implementing
custom callbacks and ensures that all callbacks have a consistent interface.
Please implement the following:
`tensor_inputs`: This should return a list of tensor inputs specific to your callback. You will only be able to
include
variables listed in the `._callback_tensor_inputs` attribute of your pipeline class.
`callback_fn`: This method defines the core functionality of your callback.
"""
config_name = CONFIG_NAME
@register_to_config
def __init__(self, cutoff_step_ratio=1.0, cutoff_step_index=None):
super().__init__()
if (cutoff_step_ratio is None and cutoff_step_index is None) or (
cutoff_step_ratio is not None and cutoff_step_index is not None
):
raise ValueError("Either cutoff_step_ratio or cutoff_step_index should be provided, not both or none.")
if cutoff_step_ratio is not None and (
not isinstance(cutoff_step_ratio, float) or not (0.0 <= cutoff_step_ratio <= 1.0)
):
raise ValueError("cutoff_step_ratio must be a float between 0.0 and 1.0.")
@property
def tensor_inputs(self) -> List[str]:
raise NotImplementedError(f"You need to set the attribute `tensor_inputs` for {self.__class__}")
def callback_fn(self, pipeline, step_index, timesteps, callback_kwargs) -> Dict[str, Any]:
raise NotImplementedError(f"You need to implement the method `callback_fn` for {self.__class__}")
def __call__(self, pipeline, step_index, timestep, callback_kwargs) -> Dict[str, Any]:
return self.callback_fn(pipeline, step_index, timestep, callback_kwargs)
class MultiPipelineCallbacks:
"""
This class is designed to handle multiple pipeline callbacks. It accepts a list of PipelineCallback objects and
provides a unified interface for calling all of them.
"""
def __init__(self, callbacks: List[PipelineCallback]):
self.callbacks = callbacks
@property
def tensor_inputs(self) -> List[str]:
return [input for callback in self.callbacks for input in callback.tensor_inputs]
def __call__(self, pipeline, step_index, timestep, callback_kwargs) -> Dict[str, Any]:
"""
Calls all the callbacks in order with the given arguments and returns the final callback_kwargs.
"""
for callback in self.callbacks:
callback_kwargs = callback(pipeline, step_index, timestep, callback_kwargs)
return callback_kwargs
class SDCFGCutoffCallback(PipelineCallback):
"""
Callback function for Stable Diffusion Pipelines. After certain number of steps (set by `cutoff_step_ratio` or
`cutoff_step_index`), this callback will disable the CFG.
Note: This callback mutates the pipeline by changing the `_guidance_scale` attribute to 0.0 after the cutoff step.
"""
tensor_inputs = ["prompt_embeds"]
def callback_fn(self, pipeline, step_index, timestep, callback_kwargs) -> Dict[str, Any]:
cutoff_step_ratio = self.config.cutoff_step_ratio
cutoff_step_index = self.config.cutoff_step_index
# Use cutoff_step_index if it's not None, otherwise use cutoff_step_ratio
cutoff_step = (
cutoff_step_index if cutoff_step_index is not None else int(pipeline.num_timesteps * cutoff_step_ratio)
)
if step_index == cutoff_step:
prompt_embeds = callback_kwargs[self.tensor_inputs[0]]
prompt_embeds = prompt_embeds[-1:] # "-1" denotes the embeddings for conditional text tokens.
pipeline._guidance_scale = 0.0
callback_kwargs[self.tensor_inputs[0]] = prompt_embeds
return callback_kwargs
class SDXLCFGCutoffCallback(PipelineCallback):
"""
Callback function for Stable Diffusion XL Pipelines. After certain number of steps (set by `cutoff_step_ratio` or
`cutoff_step_index`), this callback will disable the CFG.
Note: This callback mutates the pipeline by changing the `_guidance_scale` attribute to 0.0 after the cutoff step.
"""
tensor_inputs = ["prompt_embeds", "add_text_embeds", "add_time_ids"]
def callback_fn(self, pipeline, step_index, timestep, callback_kwargs) -> Dict[str, Any]:
cutoff_step_ratio = self.config.cutoff_step_ratio
cutoff_step_index = self.config.cutoff_step_index
# Use cutoff_step_index if it's not None, otherwise use cutoff_step_ratio
cutoff_step = (
cutoff_step_index if cutoff_step_index is not None else int(pipeline.num_timesteps * cutoff_step_ratio)
)
if step_index == cutoff_step:
prompt_embeds = callback_kwargs[self.tensor_inputs[0]]
prompt_embeds = prompt_embeds[-1:] # "-1" denotes the embeddings for conditional text tokens.
add_text_embeds = callback_kwargs[self.tensor_inputs[1]]
add_text_embeds = add_text_embeds[-1:] # "-1" denotes the embeddings for conditional pooled text tokens
add_time_ids = callback_kwargs[self.tensor_inputs[2]]
add_time_ids = add_time_ids[-1:] # "-1" denotes the embeddings for conditional added time vector
pipeline._guidance_scale = 0.0
callback_kwargs[self.tensor_inputs[0]] = prompt_embeds
callback_kwargs[self.tensor_inputs[1]] = add_text_embeds
callback_kwargs[self.tensor_inputs[2]] = add_time_ids
return callback_kwargs
class IPAdapterScaleCutoffCallback(PipelineCallback):
"""
Callback function for any pipeline that inherits `IPAdapterMixin`. After certain number of steps (set by
`cutoff_step_ratio` or `cutoff_step_index`), this callback will set the IP Adapter scale to `0.0`.
Note: This callback mutates the IP Adapter attention processors by setting the scale to 0.0 after the cutoff step.
"""
tensor_inputs = []
def callback_fn(self, pipeline, step_index, timestep, callback_kwargs) -> Dict[str, Any]:
cutoff_step_ratio = self.config.cutoff_step_ratio
cutoff_step_index = self.config.cutoff_step_index
# Use cutoff_step_index if it's not None, otherwise use cutoff_step_ratio
cutoff_step = (
cutoff_step_index if cutoff_step_index is not None else int(pipeline.num_timesteps * cutoff_step_ratio)
)
if step_index == cutoff_step:
pipeline.set_ip_adapter_scale(0.0)
return callback_kwargs
+110 -6
View File
@@ -13,12 +13,25 @@
# limitations under the License.
import platform
import subprocess
from argparse import ArgumentParser
import huggingface_hub
from .. import __version__ as version
from ..utils import is_accelerate_available, is_torch_available, is_transformers_available, is_xformers_available
from ..utils import (
is_accelerate_available,
is_bitsandbytes_available,
is_flax_available,
is_google_colab,
is_notebook,
is_peft_available,
is_safetensors_available,
is_torch_available,
is_transformers_available,
is_xformers_available,
)
from ..utils.testing_utils import get_python_version
from . import BaseDiffusersCLICommand
@@ -28,13 +41,19 @@ def info_command_factory(_):
class EnvironmentCommand(BaseDiffusersCLICommand):
@staticmethod
def register_subcommand(parser: ArgumentParser):
def register_subcommand(parser: ArgumentParser) -> None:
download_parser = parser.add_parser("env")
download_parser.set_defaults(func=info_command_factory)
def run(self):
def run(self) -> dict:
hub_version = huggingface_hub.__version__
safetensors_version = "not installed"
if is_safetensors_available():
import safetensors
safetensors_version = safetensors.__version__
pt_version = "not installed"
pt_cuda_available = "NA"
if is_torch_available():
@@ -43,6 +62,20 @@ class EnvironmentCommand(BaseDiffusersCLICommand):
pt_version = torch.__version__
pt_cuda_available = torch.cuda.is_available()
flax_version = "not installed"
jax_version = "not installed"
jaxlib_version = "not installed"
jax_backend = "NA"
if is_flax_available():
import flax
import jax
import jaxlib
flax_version = flax.__version__
jax_version = jax.__version__
jaxlib_version = jaxlib.__version__
jax_backend = jax.lib.xla_bridge.get_backend().platform
transformers_version = "not installed"
if is_transformers_available():
import transformers
@@ -55,21 +88,92 @@ class EnvironmentCommand(BaseDiffusersCLICommand):
accelerate_version = accelerate.__version__
peft_version = "not installed"
if is_peft_available():
import peft
peft_version = peft.__version__
bitsandbytes_version = "not installed"
if is_bitsandbytes_available():
import bitsandbytes
bitsandbytes_version = bitsandbytes.__version__
xformers_version = "not installed"
if is_xformers_available():
import xformers
xformers_version = xformers.__version__
if get_python_version() >= (3, 10):
platform_info = f"{platform.freedesktop_os_release().get('PRETTY_NAME', None)} - {platform.platform()}"
else:
platform_info = platform.platform()
is_notebook_str = "Yes" if is_notebook() else "No"
is_google_colab_str = "Yes" if is_google_colab() else "No"
accelerator = "NA"
if platform.system() in {"Linux", "Windows"}:
try:
sp = subprocess.Popen(
["nvidia-smi", "--query-gpu=gpu_name,memory.total", "--format=csv,noheader"],
stdout=subprocess.PIPE,
stderr=subprocess.PIPE,
)
out_str, _ = sp.communicate()
out_str = out_str.decode("utf-8")
if len(out_str) > 0:
accelerator = out_str.strip() + " VRAM"
except FileNotFoundError:
pass
elif platform.system() == "Darwin": # Mac OS
try:
sp = subprocess.Popen(
["system_profiler", "SPDisplaysDataType"],
stdout=subprocess.PIPE,
stderr=subprocess.PIPE,
)
out_str, _ = sp.communicate()
out_str = out_str.decode("utf-8")
start = out_str.find("Chipset Model:")
if start != -1:
start += len("Chipset Model:")
end = out_str.find("\n", start)
accelerator = out_str[start:end].strip()
start = out_str.find("VRAM (Total):")
if start != -1:
start += len("VRAM (Total):")
end = out_str.find("\n", start)
accelerator += " VRAM: " + out_str[start:end].strip()
except FileNotFoundError:
pass
else:
print("It seems you are running an unusual OS. Could you fill in the accelerator manually?")
info = {
"`diffusers` version": version,
"Platform": platform.platform(),
"🤗 Diffusers version": version,
"Platform": platform_info,
"Running on a notebook?": is_notebook_str,
"Running on Google Colab?": is_google_colab_str,
"Python version": platform.python_version(),
"PyTorch version (GPU?)": f"{pt_version} ({pt_cuda_available})",
"Flax version (CPU?/GPU?/TPU?)": f"{flax_version} ({jax_backend})",
"Jax version": jax_version,
"JaxLib version": jaxlib_version,
"Huggingface_hub version": hub_version,
"Transformers version": transformers_version,
"Accelerate version": accelerate_version,
"PEFT version": peft_version,
"Bitsandbytes version": bitsandbytes_version,
"Safetensors version": safetensors_version,
"xFormers version": xformers_version,
"Accelerator": accelerator,
"Using GPU in script?": "<fill in>",
"Using distributed or parallel set-up in script?": "<fill in>",
}
@@ -80,5 +184,5 @@ class EnvironmentCommand(BaseDiffusersCLICommand):
return info
@staticmethod
def format_dict(d):
def format_dict(d: dict) -> str:
return "\n".join([f"- {prop}: {val}" for prop, val in d.items()]) + "\n"
+17
View File
@@ -706,3 +706,20 @@ def flax_register_to_config(cls):
cls.__init__ = init
return cls
class LegacyConfigMixin(ConfigMixin):
r"""
A subclass of `ConfigMixin` to resolve class mapping from legacy classes (like `Transformer2DModel`) to more
pipeline-specific classes (like `DiTTransformer2DModel`).
"""
@classmethod
def from_config(cls, config: Union[FrozenDict, Dict[str, Any]] = None, return_unused_kwargs=False, **kwargs):
# To prevent depedency import problem.
from .models.model_loading_utils import _fetch_remapped_cls_from_config
# resolve remapping
remapped_class = _fetch_remapped_cls_from_config(config, cls)
return remapped_class.from_config(config, return_unused_kwargs, **kwargs)
+1 -1
View File
@@ -363,7 +363,7 @@ class LoraLoaderMixin:
is_model_cpu_offload = False
is_sequential_cpu_offload = False
if _pipeline is not None:
if _pipeline is not None and _pipeline.hf_device_map is None:
for _, component in _pipeline.components.items():
if isinstance(component, nn.Module) and hasattr(component, "_hf_hook"):
if not is_model_cpu_offload:
+1 -1
View File
@@ -340,7 +340,7 @@ class FromSingleFileMixin:
deprecate("original_config_file", "1.0.0", deprecation_message)
original_config = original_config_file
resume_download = kwargs.pop("resume_download", False)
resume_download = kwargs.pop("resume_download", None)
force_download = kwargs.pop("force_download", False)
proxies = kwargs.pop("proxies", None)
token = kwargs.pop("token", None)
+1 -1
View File
@@ -166,7 +166,7 @@ class FromOriginalModelMixin:
"`from_single_file` cannot accept both `config` and `original_config` arguments. Please provide only one of these arguments"
)
resume_download = kwargs.pop("resume_download", False)
resume_download = kwargs.pop("resume_download", None)
force_download = kwargs.pop("force_download", False)
proxies = kwargs.pop("proxies", None)
token = kwargs.pop("token", None)
+4 -4
View File
@@ -63,7 +63,7 @@ CHECKPOINT_KEY_NAMES = {
"controlnet": "control_model.time_embed.0.weight",
"playground-v2-5": "edm_mean",
"inpainting": "model.diffusion_model.input_blocks.0.0.weight",
"clip": "cond_stage_model.transformer.text_model.embeddings.position_ids",
"clip": "cond_stage_model.transformer.text_model.embeddings.position_embedding.weight",
"clip_sdxl": "conditioner.embedders.0.transformer.text_model.embeddings.position_embedding.weight",
"open_clip": "cond_stage_model.model.token_embedding.weight",
"open_clip_sdxl": "conditioner.embedders.1.model.positional_embedding",
@@ -826,8 +826,8 @@ def convert_ldm_unet_checkpoint(checkpoint, config, extract_ema=False, **kwargs)
# at least a 100 parameters have to start with `model_ema` in order for the checkpoint to be EMA
if sum(k.startswith("model_ema") for k in keys) > 100 and extract_ema:
logger.warninging("Checkpoint has both EMA and non-EMA weights.")
logger.warninging(
logger.warning("Checkpoint has both EMA and non-EMA weights.")
logger.warning(
"In this conversion only the EMA weights are extracted. If you want to instead extract the non-EMA"
" weights (useful to continue fine-tuning), please make sure to remove the `--extract_ema` flag."
)
@@ -837,7 +837,7 @@ def convert_ldm_unet_checkpoint(checkpoint, config, extract_ema=False, **kwargs)
unet_state_dict[key.replace(unet_key, "")] = checkpoint.get(flat_ema_key)
else:
if sum(k.startswith("model_ema") for k in keys) > 100:
logger.warninging(
logger.warning(
"In this conversion only the non-EMA weights are extracted. If you want to instead extract the EMA"
" weights (usually better for inference), please make sure to add the `--extract_ema` flag."
)
+14 -13
View File
@@ -419,19 +419,20 @@ class TextualInversionLoaderMixin:
# 7.1 Offload all hooks in case the pipeline was cpu offloaded before make sure, we offload and onload again
is_model_cpu_offload = False
is_sequential_cpu_offload = False
for _, component in self.components.items():
if isinstance(component, nn.Module):
if hasattr(component, "_hf_hook"):
is_model_cpu_offload = isinstance(getattr(component, "_hf_hook"), CpuOffload)
is_sequential_cpu_offload = (
isinstance(getattr(component, "_hf_hook"), AlignDevicesHook)
or hasattr(component._hf_hook, "hooks")
and isinstance(component._hf_hook.hooks[0], AlignDevicesHook)
)
logger.info(
"Accelerate hooks detected. Since you have called `load_textual_inversion()`, the previous hooks will be first removed. Then the textual inversion parameters will be loaded and the hooks will be applied again."
)
remove_hook_from_module(component, recurse=is_sequential_cpu_offload)
if self.hf_device_map is None:
for _, component in self.components.items():
if isinstance(component, nn.Module):
if hasattr(component, "_hf_hook"):
is_model_cpu_offload = isinstance(getattr(component, "_hf_hook"), CpuOffload)
is_sequential_cpu_offload = (
isinstance(getattr(component, "_hf_hook"), AlignDevicesHook)
or hasattr(component._hf_hook, "hooks")
and isinstance(component._hf_hook.hooks[0], AlignDevicesHook)
)
logger.info(
"Accelerate hooks detected. Since you have called `load_textual_inversion()`, the previous hooks will be first removed. Then the textual inversion parameters will be loaded and the hooks will be applied again."
)
remove_hook_from_module(component, recurse=is_sequential_cpu_offload)
# 7.2 save expected device and dtype
device = text_encoder.device
+6
View File
@@ -36,6 +36,9 @@ if is_torch_available():
_import_structure["dual_transformer_2d"] = ["DualTransformer2DModel"]
_import_structure["embeddings"] = ["ImageProjection"]
_import_structure["modeling_utils"] = ["ModelMixin"]
_import_structure["transformers.dit_transformer_2d"] = ["DiTTransformer2DModel"]
_import_structure["transformers.hunyuan_transformer_2d"] = ["HunyuanDiT2DModel"]
_import_structure["transformers.pixart_transformer_2d"] = ["PixArtTransformer2DModel"]
_import_structure["transformers.prior_transformer"] = ["PriorTransformer"]
_import_structure["transformers.t5_film_transformer"] = ["T5FilmDecoder"]
_import_structure["transformers.transformer_2d"] = ["Transformer2DModel"]
@@ -73,7 +76,10 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
from .embeddings import ImageProjection
from .modeling_utils import ModelMixin
from .transformers import (
DiTTransformer2DModel,
DualTransformer2DModel,
HunyuanDiT2DModel,
PixArtTransformer2DModel,
PriorTransformer,
T5FilmDecoder,
Transformer2DModel,
+12
View File
@@ -50,6 +50,18 @@ def get_activation(act_fn: str) -> nn.Module:
raise ValueError(f"Unsupported activation function: {act_fn}")
class FP32SiLU(nn.Module):
r"""
SiLU activation function with input upcasted to torch.float32.
"""
def __init__(self):
super().__init__()
def forward(self, inputs: torch.Tensor) -> torch.Tensor:
return F.silu(inputs.float(), inplace=False).to(inputs.dtype)
class GELU(nn.Module):
r"""
GELU activation function with tanh approximation support with `approximate="tanh"`.
+108
View File
@@ -103,6 +103,7 @@ class Attention(nn.Module):
upcast_softmax: bool = False,
cross_attention_norm: Optional[str] = None,
cross_attention_norm_num_groups: int = 32,
qk_norm: Optional[str] = None,
added_kv_proj_dim: Optional[int] = None,
norm_num_groups: Optional[int] = None,
spatial_norm_dim: Optional[int] = None,
@@ -161,6 +162,15 @@ class Attention(nn.Module):
else:
self.spatial_norm = None
if qk_norm is None:
self.norm_q = None
self.norm_k = None
elif qk_norm == "layer_norm":
self.norm_q = nn.LayerNorm(dim_head, eps=eps)
self.norm_k = nn.LayerNorm(dim_head, eps=eps)
else:
raise ValueError(f"unknown qk_norm: {qk_norm}. Should be None or 'layer_norm'")
if cross_attention_norm is None:
self.norm_cross = None
elif cross_attention_norm == "layer_norm":
@@ -1426,6 +1436,104 @@ class AttnProcessor2_0:
return hidden_states
class HunyuanAttnProcessor2_0:
r"""
Processor for implementing scaled dot-product attention (enabled by default if you're using PyTorch 2.0). This is
used in the HunyuanDiT model. It applies a s normalization layer and rotary embedding on query and key vector.
"""
def __init__(self):
if not hasattr(F, "scaled_dot_product_attention"):
raise ImportError("AttnProcessor2_0 requires PyTorch 2.0, to use it, please upgrade PyTorch to 2.0.")
def __call__(
self,
attn: Attention,
hidden_states: torch.Tensor,
encoder_hidden_states: Optional[torch.Tensor] = None,
attention_mask: Optional[torch.Tensor] = None,
temb: Optional[torch.Tensor] = None,
image_rotary_emb: Optional[torch.Tensor] = None,
) -> torch.Tensor:
from .embeddings import apply_rotary_emb
residual = hidden_states
if attn.spatial_norm is not None:
hidden_states = attn.spatial_norm(hidden_states, temb)
input_ndim = hidden_states.ndim
if input_ndim == 4:
batch_size, channel, height, width = hidden_states.shape
hidden_states = hidden_states.view(batch_size, channel, height * width).transpose(1, 2)
batch_size, sequence_length, _ = (
hidden_states.shape if encoder_hidden_states is None else encoder_hidden_states.shape
)
if attention_mask is not None:
attention_mask = attn.prepare_attention_mask(attention_mask, sequence_length, batch_size)
# scaled_dot_product_attention expects attention_mask shape to be
# (batch, heads, source_length, target_length)
attention_mask = attention_mask.view(batch_size, attn.heads, -1, attention_mask.shape[-1])
if attn.group_norm is not None:
hidden_states = attn.group_norm(hidden_states.transpose(1, 2)).transpose(1, 2)
query = attn.to_q(hidden_states)
if encoder_hidden_states is None:
encoder_hidden_states = hidden_states
elif attn.norm_cross:
encoder_hidden_states = attn.norm_encoder_hidden_states(encoder_hidden_states)
key = attn.to_k(encoder_hidden_states)
value = attn.to_v(encoder_hidden_states)
inner_dim = key.shape[-1]
head_dim = inner_dim // attn.heads
query = query.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
key = key.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
value = value.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
if attn.norm_q is not None:
query = attn.norm_q(query)
if attn.norm_k is not None:
key = attn.norm_k(key)
# Apply RoPE if needed
if image_rotary_emb is not None:
query = apply_rotary_emb(query, image_rotary_emb)
if not attn.is_cross_attention:
key = apply_rotary_emb(key, image_rotary_emb)
# the output of sdp = (batch, num_heads, seq_len, head_dim)
# TODO: add support for attn.scale when we move to Torch 2.1
hidden_states = F.scaled_dot_product_attention(
query, key, value, attn_mask=attention_mask, dropout_p=0.0, is_causal=False
)
hidden_states = hidden_states.transpose(1, 2).reshape(batch_size, -1, attn.heads * head_dim)
hidden_states = hidden_states.to(query.dtype)
# linear proj
hidden_states = attn.to_out[0](hidden_states)
# dropout
hidden_states = attn.to_out[1](hidden_states)
if input_ndim == 4:
hidden_states = hidden_states.transpose(-1, -2).reshape(batch_size, channel, height, width)
if attn.residual_connection:
hidden_states = hidden_states + residual
hidden_states = hidden_states / attn.rescale_output_factor
return hidden_states
class FusedAttnProcessor2_0:
r"""
Processor for implementing scaled dot-product attention (enabled by default if you're using PyTorch 2.0). It uses

Some files were not shown because too many files have changed in this diff Show More