Stability AI debuts Stable Video Diffusion models in research preview
Stability AI has launched Stable Video Diffusion, a new AI model that marks its entry into the video generation space. The release, available for research purposes only, includes two models, SVD and SVD-XT, that produce short clips from a single image. Stability AI says both produce high-quality outputs that match or surpass the performance of other AI video generators. The company plans to open-source the models as part of its research preview and will use user feedback to refine them for commercial application.
Understanding Stable Video Diffusion
According to a blog post from the company, SVD and SVD-XT are latent diffusion models that take a still image as a conditioning frame and generate 576 × 1024 video from it. Both models produce content at rates between three and 30 frames per second, but the output is short, lasting up to just four seconds. SVD has been trained to produce 14 frames from a still, while SVD-XT goes up to 25, Stability AI noted.
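Those frame counts explain why clips top out at roughly four seconds. The short sketch below does the arithmetic, assuming a mid-range playback rate of 6 fps; that rate is an illustrative value picked from the reported three-to-30 fps span, not a figure from Stability AI.

```python
# Rough clip durations implied by the reported frame counts.
# duration = frame count / playback fps; 6 fps is an assumed mid-range rate.

FRAME_COUNTS = {"SVD": 14, "SVD-XT": 25}
PLAYBACK_FPS = 6  # illustrative value within the reported 3-30 fps range

for name, frames in FRAME_COUNTS.items():
    print(f"{name}: {frames} frames at {PLAYBACK_FPS} fps ≈ {frames / PLAYBACK_FPS:.1f} s")

# SVD:    14 frames at 6 fps ≈ 2.3 s
# SVD-XT: 25 frames at 6 fps ≈ 4.2 s
```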
The company developed Stable Video Diffusion by training a base model on a large, systematically curated video dataset of 600 million samples. The model was then fine-tuned on a smaller, high-quality dataset of up to a million clips for tasks like text-to-video and image-to-video prediction.
The company says the models were trained and fine-tuned on publicly available research datasets, though the exact sources of the data remain unclear. The base model can also be fine-tuned into a diffusion model capable of multi-view synthesis, generating multiple consistent views of an object from a single still image. According to Stability AI's whitepaper, this could lead to applications in advertising, education, and entertainment.
High-quality output but limitations remain
SVD outputs were rated high quality by human voters, surpassing leading closed text-to-video models from Runway and Pika Labs. However, the company acknowledges that both models are still in their early stages: they can lack photorealism, may generate videos with no motion or only slow camera pans, and do not render faces and people as well as expected.
Eventually, the company plans to use this research preview to refine both models, address their present gaps and introduce new features, like support for text prompts or text rendering in videos, for commercial applications. It emphasized that the current release is mainly aimed at inviting open investigation of the models, which could surface more issues (like biases) and help with safe deployment later.
“We are planning a variety of models that build on and extend this base, similar to the ecosystem that has built around stable diffusion,” the company wrote. It has also started calling on users to sign up for an upcoming web experience that will let them generate videos from text.
That said, it remains unclear when exactly the experience will be available.
How to use the models?
To get started with the new open-source Stable Video Diffusion models, users can find the code on the company’s GitHub repository and the weights required to run the model locally on its Hugging Face page. The company notes that usage will be allowed only after acceptance of its terms, which detail both allowed and excluded applications.
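For readers who want to try the image-to-video models locally, the snippet below is a minimal sketch using Hugging Face's diffusers library, which provides a StableVideoDiffusionPipeline for these weights; the article itself only mentions the GitHub repository and Hugging Face page, so the library choice, model ID, input image path, and seed here are illustrative assumptions, and the gated weights can only be downloaded after accepting Stability AI's terms.

```python
# Minimal sketch: image-to-video with Stable Video Diffusion via diffusers.
# Assumes the gated weights have been accepted on Hugging Face and a CUDA GPU
# with enough memory is available; file names and the seed are illustrative.
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",  # SVD-XT (25-frame) checkpoint
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe.to("cuda")

# Any still image serves as the conditioning frame; this path is a placeholder.
image = load_image("conditioning_frame.png")
image = image.resize((1024, 576))  # 1024 x 576 (width x height), i.e. 576 x 1024 output

generator = torch.manual_seed(42)
frames = pipe(image, decode_chunk_size=8, generator=generator).frames[0]

export_to_video(frames, "generated.mp4", fps=7)
```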
For now, in addition to research and probing of the models, permitted use cases include generating artwork for design and other artistic processes, as well as applications in educational or creative tools.
Generating factual or “true representations of people or events” remains out of scope, Stability AI said.