Stable Diffusion is an image-generating machine learning model.
Stable Diffusion is a deep learning text-to-image model released in 2022. It is primarily used to generate detailed images from textual descriptions, but it can also be applied to other tasks such as inpainting, outpainting, and image-to-image translation guided by a text prompt. It was developed by researchers from the CompVis group at Ludwig Maximilian University of Munich and Runway, with computational resources provided by Stability AI and training data from non-profit organizations.
Stable Diffusion is a latent diffusion model, a kind of deep generative artificial neural network. Its code and model weights have been released publicly, and it can run on most consumer hardware equipped with a modest GPU with at least 8 GB of VRAM. This marked a departure from previous proprietary text-to-image models such as DALL-E and Midjourney, which were accessible only via cloud services.
Development
The development of Stable Diffusion was funded and shaped by the startup company Stability AI. The technical license for the model was issued by the CompVis group at Ludwig Maximilian University of Munich. The development was led by Patrick Esser of Runway and Robin Rombach of CompVis, who previously invented the latent diffusion model architecture used in Stable Diffusion. Stability AI also credited EleutherAI and LAION (a German non-profit organization that collected the dataset on which Stable Diffusion was trained) as supporters of the project.
In October 2022, Stability AI raised $101 million in a round led by Lightspeed Venture Partners and Coatue Management.
Technology
Architecture
Stable Diffusion uses a kind of diffusion model (DM) called a latent diffusion model (LDM), developed by the CompVis group at LMU Munich. Introduced in 2015, diffusion models are trained to progressively remove successive applications of Gaussian noise from training images, a process that can be thought of as a sequence of denoising autoencoders. Stable Diffusion consists of three parts: a variational autoencoder (VAE), a U-Net, and an optional text encoder. The VAE encoder compresses an image from pixel space into a lower-dimensional latent space, capturing the more fundamental semantic meaning of the image. Gaussian noise is iteratively applied to the compressed latent representation during forward diffusion. The U-Net block, built on a ResNet backbone, denoises the output of forward diffusion in the reverse direction to recover a latent representation. Finally, the VAE decoder generates the final image by converting that representation back into pixel space.
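A minimal sketch of this three-part pipeline is shown below, using the Hugging Face diffusers library (a community implementation, not the original CompVis code); the checkpoint id and the random stand-in image tensor are illustrative assumptions.

```python
# Sketch of the latent diffusion pipeline pieces via the diffusers library.
# The checkpoint id and dummy "image" tensor are assumptions for illustration.
import torch
from diffusers import AutoencoderKL, UNet2DConditionModel, DDPMScheduler

model_id = "CompVis/stable-diffusion-v1-4"
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae")
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")
scheduler = DDPMScheduler(num_train_timesteps=1000)

image = torch.randn(1, 3, 512, 512)  # stand-in for a training image in pixel space
with torch.no_grad():
    # VAE encoder: 512x512x3 pixel space -> 64x64x4 latent space
    latents = vae.encode(image).latent_dist.sample() * vae.config.scaling_factor

# Forward diffusion: superimpose Gaussian noise on the latent at timestep t.
noise = torch.randn_like(latents)
t = torch.tensor([500])
noisy_latents = scheduler.add_noise(latents, noise, t)

# Reverse direction (denoising): the U-Net predicts the noise to remove, given
# text-conditioning embeddings (omitted here); the VAE decoder finally maps a
# clean latent back to pixel space:
#   noise_pred = unet(noisy_latents, t, encoder_hidden_states=text_embeddings).sample
#   pixels = vae.decode(latents / vae.config.scaling_factor).sample
```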
The denoising step can be flexibly conditioned on a string of text, an image, or another modality. The encoded conditioning data is exposed to the denoising U-Net via a cross-attention mechanism. For conditioning on text, a fixed, pre-trained CLIP ViT-L/14 text encoder is used to transform text prompts into an embedding space. Researchers point to the improved computational efficiency of training and generation as an advantage of LDMs.
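A rough sketch of that text-conditioning path, assuming the publicly released CLIP ViT-L/14 checkpoint and the transformers library; the prompt string is an arbitrary example.

```python
# Turning a text prompt into the embeddings that the U-Net's cross-attention
# layers consume; the CLIP checkpoint id is the standard public release.
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

prompt = "a photograph of an astronaut riding a horse"
tokens = tokenizer(
    prompt,
    padding="max_length",
    max_length=tokenizer.model_max_length,  # 77 tokens for this model
    truncation=True,
    return_tensors="pt",
)
with torch.no_grad():
    text_embeddings = text_encoder(tokens.input_ids).last_hidden_state  # (1, 77, 768)

# These embeddings are passed to the U-Net as `encoder_hidden_states`, where
# cross-attention attends to them at every denoising step.
```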
With 860 million parameters in the U-Net and 123 million in the text encoder, Stable Diffusion is considered relatively lightweight by 2022 standards and, unlike many other diffusion models, can run on consumer GPUs.
Training data
Stable Diffusion was trained on pairs of images and image captions taken from LAION-5B, a publicly available dataset derived from Common Crawl data collected from the Internet. The 5 billion pairs of images and text were classified based on language and filtered into separate datasets based on resolution, predicted likelihood of a watermark, and predicted "aesthetic" score (e.g., subjective visual quality). The dataset was created by LAION, a German non-profit organization that receives funding from Stability AI. The Stable Diffusion model was trained on three subsets of LAION-5B: laion2B-en, laion-high-resolution, and laion-aesthetics v2 5+. Third-party analysis of the training data showed that of the smaller subset of 12 million images from the original larger dataset, approximately 47% of the image sample came from 100 different domains, with Pinterest accounting for 8.5% of the subset, followed by sites such as WordPress, Blogspot, Flickr, DeviantArt, and Wikimedia Commons.
Training Procedures
The model was initially trained on the laion2B-en and laion-high-resolution subsets, with the last few rounds of training performed on LAION-Aesthetics v2 5+, a subset of 600 million captioned images which the LAION-Aesthetics Predictor V2 predicted that humans would, on average, rate at least 5 out of 10 when asked how much they liked them. The LAION-Aesthetics v2 5+ subset also excluded low-resolution images and images that LAION-5B-WatermarkDetection identified as carrying a watermark with greater than 80% probability. The final rounds of training additionally dropped the text conditioning for 10% of samples in order to improve classifier-free guidance.
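A minimal sketch of that conditioning dropout in plain PyTorch; the function and tensor names are illustrative rather than taken from the actual training code.

```python
# Classifier-free guidance training trick: with some probability (10% in the
# reported setup) a sample's text embedding is swapped for the "empty prompt"
# embedding, so one model learns both conditional and unconditional denoising.
# Shapes follow the CLIP embeddings above: (batch, 77, 768).
import torch

def maybe_drop_conditioning(text_embeddings, uncond_embeddings, drop_prob=0.1):
    batch = text_embeddings.shape[0]
    drop = torch.rand(batch, device=text_embeddings.device) < drop_prob
    drop = drop.view(batch, 1, 1)  # broadcast over token and channel dimensions
    return torch.where(drop, uncond_embeddings, text_embeddings)
```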
The model was trained using 256 Nvidia A100 GPUs on Amazon Web Services for 150k GPU-hours at a cost of $600k.
Limitations
Stable Diffusion has problems with degradation and inaccuracy in certain scenarios. The initial releases of the model were trained on a dataset of 512×512 resolution images, which means the quality of generated images noticeably degrades when user specifications deviate from the "expected" 512×512 resolution; the version 2.0 update later added the ability to generate images at 768×768 resolution. Another challenge is generating human limbs, owing to the poor quality of limb data in the LAION database. The model is insufficiently trained on human limbs and faces due to the lack of representative features in the database, and prompting the model to generate images of this kind can confound it.
Accessibility for individual developers can also be a problem. Adapting the model to new use cases not covered by the dataset, such as generating anime characters ("waifu diffusion"), requires new data and further training. Fine-tuned adaptations of Stable Diffusion created through such additional retraining have been used for a variety of use cases, from medical imaging to algorithmically generated music. However, this fine-tuning process is sensitive to the quality of the new data: low-resolution images, or images whose resolution differs from the original training data, can not only fail at the new task but also degrade the overall performance of the model. Even when the model is additionally trained on high-quality images, it remains difficult for individuals to run such models on consumer hardware. For example, training a waifu-diffusion model requires a minimum of 30 GB of VRAM, which exceeds the typical resources of consumer GPUs such as Nvidia's GeForce 30 series, whose cards have only around 12 GB.
The creators of Stable Diffusion acknowledge the potential for algorithmic bias, as the model was trained primarily on images with English-language descriptions. As a result, generated images reinforce social biases and reflect a Western perspective, because, as the creators note, the model lacks data from other communities and cultures. The model produces more accurate results for prompts written in English than for prompts in other languages, with Western or white cultures often being the default representation.
Fine-tuning for end users
To address the limitations of the model's initial training, end users may opt for additional training to fine-tune generation outputs to match more specific use cases. There are three methods by which user-accessible fine-tuning can be applied to a Stable Diffusion model checkpoint:
- An "embedding" can be trained based on a collection of user-supplied images and allows the model to generate visually similar images whenever the name of the embedding is used in the generation hint. The embeddings are based on the concept of "textual inversion," developed by researchers at Tel Aviv University in 2022 with support from Nvidia, where vector representations for certain tokens used by the model's text coder are associated with new pseudowords. The embeddings can be used to reduce errors in the original model or to mimic visual styles.
- A "hypernet" is a small pre-trained neural network that is applied to different points of a larger neural network, and refers to a technique created by NovelAI developer Kurumuz in 2021, originally intended for text transformation models. Hypernets steer results in a particular direction, allowing stable diffusion-based models to mimic the artistic style of specific artists, even if the artist is not recognized by the original model; they process the image by finding key regions such as hair and eyes, and then place these regions into a secondary latent space.
- DreamBooth is a deep learning model developed by researchers at Google Research and Boston University in 2022 which, after training on a set of images depicting a specific subject, allows the model to be fine-tuned to generate precise, personalized outputs featuring that subject.
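As an example of the first of these methods, the sketch below loads a trained textual-inversion embedding into a Stable Diffusion pipeline via the Hugging Face diffusers library; the embedding file name, trigger token, and checkpoint id are hypothetical placeholders.

```python
# Applying a textual-inversion embedding: the new pseudo-word becomes usable
# inside prompts once the embedding is loaded. The file name and token below
# stand in for a user-trained embedding.
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4").to("cuda")
pipe.load_textual_inversion("./my_style_embedding.bin", token="<my-style>")

image = pipe("a castle on a cliff, in the style of <my-style>").images[0]
image.save("castle.png")
```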
Capabilities
The Stable Diffusion model supports the ability to generate new images from scratch using a text prompt describing elements to be included in or omitted from the output. Existing images can be re-drawn by the model to incorporate new elements described by a text prompt (a process known as "guided image synthesis") through its diffusion-denoising mechanism. In addition, the model allows prompts to be used to partially alter existing images via inpainting and outpainting, when used with an appropriate user interface that supports such features, of which numerous open-source implementations exist.
Stable Diffusion is recommended to be run with 10 GB or more of VRAM; however, users with less VRAM may opt to load the weights in float16 precision instead of the default float32, trading off model precision for lower VRAM usage.
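A brief sketch of that trade-off, using the diffusers library as one common front end; the checkpoint id is an example.

```python
# Loading the weights in half precision roughly halves memory use at a small
# cost in numerical precision; omitting torch_dtype gives the float32 default.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16,  # default is torch.float32
).to("cuda")
```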
Text-to-image generation
The text-to-image sampling script within Stable Diffusion, known as "txt2img", takes a text prompt in addition to assorted parameters covering the sampling type, output image dimensions, and seed value. The script outputs an image file based on the model's interpretation of the prompt. Generated images are tagged with an invisible digital watermark to allow users to identify an image as generated by Stable Diffusion, although this watermark loses its effectiveness if the image is resized or rotated.
Each txt2img generation uses a specific seed value that affects the output image. Users may choose a random seed to explore different generated outputs, or reuse the same seed to obtain the same image output as one generated previously. Users can also adjust the number of inference steps for the sampler; a higher value takes longer, but a smaller value may result in visual defects. Another configurable option, the classifier-free guidance scale, lets the user adjust how closely the output image adheres to the prompt. More experimental or creative use cases may opt for a lower scale value, while use cases aiming for more specific outputs may use a higher value.
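The sketch below shows the equivalent options in the diffusers text-to-image pipeline; the checkpoint id, prompt, and parameter values are illustrative.

```python
# Seed, sampler step count, and classifier-free guidance scale as exposed by
# the diffusers txt2img-style pipeline.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

generator = torch.Generator("cuda").manual_seed(42)  # fixed seed -> reproducible output
image = pipe(
    "a watercolor painting of a lighthouse at dawn",
    num_inference_steps=30,  # more steps: slower, usually fewer visual defects
    guidance_scale=7.5,      # higher values follow the prompt more literally
    generator=generator,
).images[0]
image.save("lighthouse.png")
```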
Additional txt2img features are provided by front-end implementations of Stable Diffusion, which allow users to modify the weight given to specific parts of the text prompt. Emphasis markers allow users to add or reduce emphasis on keywords by enclosing them in brackets. An alternative way to adjust the weight of parts of a prompt is "negative prompts". Negative prompts are a feature included in some front-end implementations, including Stability AI's own DreamStudio cloud service, and allow the user to specify prompts the model should avoid during image generation. These may be undesirable image features that would otherwise appear in the output due to the positive prompt, or due to how the model was originally trained, mangled human hands being a common example.
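In the diffusers interface, the negative prompt is simply an extra argument; a short, self-contained sketch follows (checkpoint id and prompts are examples).

```python
# Negative prompts steer generation away from unwanted features.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    "portrait photo of a medieval knight",
    negative_prompt="mangled hands, extra fingers, blurry",  # features to avoid
).images[0]
image.save("knight.png")
```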
Image modification
Stable Diffusion also includes a sampling script, "img2img", which takes a text prompt, a path to an existing image, and a strength value between 0.0 and 1.0. The script outputs a new image based on the original image that also features elements described in the text prompt. The strength value denotes the amount of noise added to the output image: a higher value produces more variation within the image, but may produce an image that is not semantically consistent with the prompt.
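The corresponding diffusers img2img pipeline is sketched below; the input file name, prompt, and checkpoint id are placeholder assumptions.

```python
# img2img: an existing image plus a prompt and a strength value in [0, 1].
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("sketch.png").convert("RGB").resize((512, 512))
image = pipe(
    prompt="a detailed oil painting of a mountain village",
    image=init_image,
    strength=0.6,  # amount of noise added: higher = more deviation from the input
).images[0]
image.save("village.png")
```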
The ability of img2img to add noise to the original image makes it potentially useful for data anonymization and data augmentation, in which the visual features of image data are changed and anonymized. The same process can also be useful for image upscaling, in which the resolution of an image is increased and more detail may be added. Additionally, Stable Diffusion has been experimented with as a tool for image compression; compared to JPEG and WebP, the current methods used for image compression in Stable Diffusion face limitations in preserving small text and faces.
Additional use cases for image modification via img2img are offered by numerous front-end implementations of the Stable Diffusion model. Inpainting involves selectively modifying a portion of an existing image, delineated by a user-provided layer mask, and filling the masked space with newly generated content based on the provided prompt. A dedicated model specifically fine-tuned for inpainting use cases was created by Stability AI alongside the release of Stable Diffusion 2.0. Conversely, outpainting extends an image beyond its original dimensions, filling the previously empty space with content generated from the provided prompt.
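A hedged sketch of inpainting with the dedicated Stable Diffusion 2.0 inpainting checkpoint via diffusers; the image and mask file names are placeholders (white areas of the mask are regenerated).

```python
# Inpainting: regenerate only the masked region of an existing image.
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("room.png").convert("RGB").resize((512, 512))
mask = Image.open("mask.png").convert("L").resize((512, 512))  # white = repaint

image = pipe(
    prompt="a leather armchair by the window",
    image=init_image,
    mask_image=mask,
).images[0]
image.save("room_inpainted.png")
```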
The November 24, 2022 release of Stable Diffusion 2.0 introduced a depth-guided model named "depth2img", which infers the depth of the provided input image and generates a new output image based on both the text prompt and the depth information, allowing the coherence and depth of the original input image to be preserved in the generated output.
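The depth-guided model is likewise exposed through a diffusers pipeline, sketched here assuming the public Stable Diffusion 2.0 depth checkpoint; file names and the prompt are examples.

```python
# depth2img: depth estimated from the input image conditions the generation,
# preserving the input's spatial layout.
import torch
from PIL import Image
from diffusers import StableDiffusionDepth2ImgPipeline

pipe = StableDiffusionDepth2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-depth", torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("street.png").convert("RGB")
image = pipe(prompt="the same street in winter, covered in snow", image=init_image).images[0]
image.save("street_winter.png")
```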
ControlNet
ControlNet is a neural network architecture designed to control diffusion models by incorporating additional conditions. It duplicates the weights of neural network blocks into a "locked" copy and a "trainable" copy. The "trainable" copy learns the desired condition, while the "locked" copy preserves the original model. This approach ensures that training on small datasets of image pairs does not compromise the integrity of production-ready diffusion models. The "zero convolution" is a 1×1 convolution with both weight and bias initialized to zero. Before training, all zero convolutions produce zero output, preventing any distortion caused by ControlNet. No layer is trained from scratch; the process is still fine-tuning, which keeps the original model intact. This method allows training on small-scale or even personal devices.
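A minimal PyTorch sketch of the zero-convolution idea (not the full ControlNet implementation): at initialization the trainable branch contributes nothing, so the locked model's behavior is unchanged.

```python
# A "zero convolution" is a 1x1 convolution whose weight and bias start at
# zero, so the ControlNet branch initially adds exactly nothing to the
# locked U-Net features.
import torch
import torch.nn as nn

def zero_conv(channels: int) -> nn.Conv2d:
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

x = torch.randn(1, 320, 64, 64)            # a feature map from a U-Net block
assert torch.all(zero_conv(320)(x) == 0)   # before training, output is all zeros
```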
Uses and controversies
Stable Diffusion claims no rights over generated images and freely grants users the rights of usage to any images generated by the model, provided the image content is not illegal or harmful to individuals. This freedom has generated controversy over the ethics of ownership, as Stable Diffusion and other generative models are trained on copyrighted images without the owners' consent.
Because visual styles and compositions are not subject to copyright, it is often argued that Stable Diffusion users who generate images of artworks should not be considered to be infringing upon the copyright of visually similar works. However, individuals depicted in generated images may be protected by personality rights if their likeness is used, and intellectual property such as recognizable brand logos remains protected by copyright. Nevertheless, visual artists have expressed concern that widespread use of image synthesis software such as Stable Diffusion may eventually lead to human artists, along with photographers, models, filmmakers, and actors, gradually losing commercial viability against AI-based competitors.
Compared to other commercial products based on generative AI, Stable Diffusion is notably more permissive in the types of content users may generate, such as violent or sexually explicit imagery. Addressing concerns that the model may be misused, Stability AI CEO Emad Mostaque argues that "[it] is the responsibility of people to make sure they use the technology ethically, morally, and legally", and that putting the capabilities of Stable Diffusion into the hands of the public would result in a net benefit from the technology, despite the potential negative consequences. Mostaque further argues that the intention behind the open availability of Stable Diffusion is to end the control and dominance of corporations over such technologies, which had previously developed only closed AI systems for image synthesis. This is reflected in the fact that any restrictions Stability AI places on the content users can generate can easily be circumvented due to the availability of the source code.
Lawsuits
In January 2023, three artists, Sarah Andersen, Kelly McKernan, and Carla Ortiz, filed a copyright infringement lawsuit against Stability AI, Midjourney, and DeviantArt, claiming that these companies violated the rights of millions of artists by training artificial intelligence tools on five billion images taken from the Internet without the original artists' consent. In the same month, Stability AI was sued by Getty Images for using its images in training data.
License
Unlike models such as DALL-E, Stable Diffusion makes its source code available along with the model itself (its pre-trained weights). The model is covered by the Creative ML OpenRAIL-M license, a form of Responsible AI License (RAIL). The license prohibits certain use cases, including crime, defamation, harassment, doxxing, "exploitation of ... minors", providing medical advice, automatically creating legal obligations, producing legal evidence, and "discriminating against or harming individuals or groups based on ... social behavior or ... personal or personality characteristics ... [or] legally protected characteristics or categories". The user owns the rights to the output images they generate and is free to use them for commercial purposes.