Stability.ai has launched a new AI model called Stable Audio that can generate variable-length audio conditioned on text. This means Stable Audio can be used to create realistic, high-quality music, sound effects, and other kinds of audio simply from a text description.
Stable Audio is a diffusion model: an AI model that learns to generate data by gradually denoising a random noise input. Diffusion models have proven very effective at generating realistic images, video, and audio.
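The core reverse-diffusion loop can be sketched in a few lines. This is a toy illustration of the general idea (start from noise, repeatedly subtract a predicted noise estimate), not Stable Audio's actual architecture; `predict_noise` stands in for a trained denoising network.

```python
import numpy as np

rng = np.random.default_rng(0)

def predict_noise(x, t):
    # Placeholder for a trained denoising network. Here we simply
    # treat a small fraction of the current sample as the "noise",
    # so each step shrinks x toward zero.
    return 0.1 * x

def sample(shape, steps=50):
    x = rng.standard_normal(shape)      # begin with pure random noise
    for t in reversed(range(steps)):
        x = x - predict_noise(x, t)     # one denoising step
    return x

audio = sample((2, 1024))  # e.g. 2 stereo channels x 1024 samples
```

A real model would condition `predict_noise` on the timestep `t` and on text embeddings, and use a learned noise schedule rather than a fixed scale.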
One of the main challenges in generating audio with diffusion models is that they are usually trained to produce a fixed-size output. For example, an audio diffusion model trained on 30-second clips can only generate audio in 30-second chunks. This becomes a problem when training on, and generating, audio of widely varying lengths, as is the case with full songs.
Stable Audio addresses this issue by conditioning the diffusion model not only on text metadata but also on the audio file's duration and the start time of the training window. This lets users control both the content and the length of the generated audio.
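One plausible way to implement such timing conditioning is to embed the start time and total duration as vectors and concatenate them with the text embedding. The sketch below is a hypothetical illustration under that assumption; the function name `embed_timing` and the vector sizes are made up, not Stable Audio's real API.

```python
import numpy as np

def embed_timing(seconds_start, seconds_total, dim=8):
    # Encode each scalar with a simple sinusoidal-style embedding
    # at geometrically spaced frequencies.
    freqs = 2.0 ** np.arange(dim // 2)

    def enc(x):
        return np.concatenate([np.sin(x * freqs), np.cos(x * freqs)])

    # One embedding for the window's start time, one for the
    # total file duration.
    return np.concatenate([enc(seconds_start), enc(seconds_total)])

text_embedding = np.zeros(16)       # stand-in for a text-encoder output
timing = embed_timing(30.0, 95.0)   # window starts at 30 s of a 95 s file
conditioning = np.concatenate([text_embedding, timing])
```

At inference time, the user-requested length is fed in the same way, which is how a fixed-window model can be steered toward outputs of the desired duration.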
To train Stable Audio, the researchers used a dataset of over 800,000 audio files containing music, sound effects, and single-instrument stems, as well as corresponding text metadata. This dataset adds up to over 19,500 hours of audio.
The Stable Audio model is able to generate 95 seconds of stereo audio at a 44.1 kHz sample rate in less than one second on an NVIDIA A100 GPU. This makes it much faster than previous audio diffusion models.
Stable Audio can be used to create new and innovative music, sound effects, and other audio content, and could also help improve the quality of existing audio.
Sources for this piece include an announcement article from Stability.ai.