Stability AI delivers new clarity and power in AI audio generation with Stable Audio 2.0

Updated July 27, 2024

Stability AI continues to advance its vision of generative artificial intelligence with today's introduction of the Stable Audio 2.0 audio model.

Stability AI may be best known for its Stable Diffusion models that turn text into images, but that's just one of many models the company is working on. Stable Audio was first released in September 2023, introducing users to the ability to generate short audio clips with a simple text prompt. With Stable Audio 2.0, users can now create high-quality audio clips of up to 3 minutes, double the 90 seconds in the initial release of Stable Audio.

In addition to supporting text-to-audio generation, Stable Audio 2.0 also supports audio-to-audio generation, where users upload a sample they want to use as a prompt. Stability AI is making Stable Audio available for limited free use on the Stable Audio website, and the API will soon follow so developers can build their own services.

The new Stable Audio 2.0 release is the first major model update from Stability AI since former CEO and founder Emad Mostaque abruptly left his post at the end of March. According to the company, it's still business as usual, and the Stable Audio 2.0 update is proof of that.


Lessons learned from Stable Audio 1.0 show up in version 2.0

Stable Audio 2.0 builds on what Stability AI learned from developing and releasing the original model in 2023.

Zach Evans, head of audio research at Stability AI, told VentureBeat that the focus for the Stable Audio 1.0 release was to create a revolutionary generative text-to-audio model with exceptional audio fidelity and meaningful output duration.

"Since the first version was released, we've dedicated ourselves to improving its musicality, extending the duration of output, and honing its ability to respond accurately to detailed cues," says Evans. "These improvements are aimed at optimizing the technology for practical real-world applications."

Stable Audio 2.0 can create complete music tracks with a coherent musical structure. Using latent diffusion technology, the model can generate songs up to 3 minutes long with a distinct intro, development and outro. This is a step forward from the previous version of Stable Audio, which could only create short loops or fragments, not full songs.

Looking at the machine learning (ML) science behind Stable Audio 2.0, the model still relies on a latent diffusion model (LDM). Evans explained that since the Stable Audio 1.1 beta release in December, Stable Audio has been transformer-based, making it a so-called "diffusion transformer" model.
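For readers who want a concrete picture, here is a minimal, self-contained PyTorch sketch of the general idea behind a diffusion transformer that denoises compressed audio latents rather than raw waveform samples. It is not Stability AI's architecture; every dimension, layer count, and name below is an illustrative assumption.

```python
# Conceptual sketch only (not Stability AI's code): a transformer predicts the
# noise added to compressed audio latents at a given diffusion timestep.
import torch
import torch.nn as nn

class TinyDiffusionTransformer(nn.Module):
    def __init__(self, latent_dim=64, model_dim=256, n_layers=4, n_heads=4):
        super().__init__()
        self.in_proj = nn.Linear(latent_dim, model_dim)
        self.time_embed = nn.Sequential(nn.Linear(1, model_dim), nn.SiLU(),
                                        nn.Linear(model_dim, model_dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=model_dim, nhead=n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        self.out_proj = nn.Linear(model_dim, latent_dim)

    def forward(self, noisy_latents, t):
        # noisy_latents: (batch, latent frames, latent_dim); t: (batch,) timestep in [0, 1]
        h = self.in_proj(noisy_latents) + self.time_embed(t[:, None]).unsqueeze(1)
        h = self.blocks(h)
        return self.out_proj(h)  # predicted noise, same shape as the latents

# One denoising call on a latent sequence standing in for a long audio clip.
model = TinyDiffusionTransformer()
latents = torch.randn(1, 1024, 64)   # (batch, latent frames, channels) -- assumed shape
t = torch.rand(1)                    # diffusion timestep
predicted_noise = model(latents, t)
print(predicted_noise.shape)         # torch.Size([1, 1024, 64])
```

Because attention cost grows with sequence length, operating on compressed latents rather than raw samples is what makes a transformer backbone practical for minutes-long audio.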

"We also increased the degree of data compression applied to the audio data during training, which allowed us to increase the model output time to three minutes or more while maintaining acceptable output times," says Evans.

Converting audio samples using text prompts

In addition to generating audio from text prompts, Stable Audio 2.0 also supports audio-to-audio generation.

Users can upload audio samples and use natural language instructions to transform sounds into new variations. This opens up possibilities for creative workflows such as iterative refinement and editing of audio using textual instructions.
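Since the Stable Audio API has not yet been published, the snippet below is only a hypothetical sketch of what an upload-plus-prompt, audio-to-audio request could look like; the endpoint URL, field names, and response handling are placeholders, not documented Stability AI parameters.

```python
# Hypothetical sketch of an audio-to-audio request; URL, fields and credentials
# below are placeholders, not a real or documented API.
import requests

API_URL = "https://api.example.com/v1/audio-to-audio"   # placeholder endpoint
API_KEY = "YOUR_API_KEY"                                 # placeholder credential

with open("drum_loop.wav", "rb") as sample:
    response = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        files={"audio": sample},
        data={
            "prompt": "transform into a warm ambient pad with slow swells",
            "duration_seconds": 90,
        },
        timeout=120,
    )

response.raise_for_status()
with open("ambient_pad.wav", "wb") as out:
    out.write(response.content)   # save the generated variation
```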

Stable Audio 2.0 also greatly expands the range of sound effects and textures that can be generated using artificial intelligence. Users can ask the system to generate immersive environments, ambient textures, crowds, cityscapes, and more. The model also allows you to change the style and tone of generated or uploaded audio samples.

A recurring issue in the field of generative AI is the proper use of inputs for model training.

Stability AI prioritizes intellectual property protection in its new audio model. To address copyright concerns, Stable Audio 2.0 was trained exclusively on licensed data from AudioSparx, with opt-out requests honored. Audio uploads are monitored with content recognition to prevent the processing of copyrighted material.

Copyright protection matters both for Stability AI's ability to commercialize Stable Audio and for organizations to use the technology safely. Stable Audio is currently monetized through a subscription to the Stable Audio web app, and will soon be available through the Stable Audio API.

However, Stable Audio is not an open model, at least not yet.

"The weights for Stable Audio 2.0 will not be available for download; however, we are working on open audio models that will be released later this year," Evans said.
