A picture may tell a thousand words, but a video tells a thousand pictures. OpenAI recently announced Sora, its premier text-to-video GenAI model, which turns prompts into detailed 60-second videos. The results are astonishing.
Building on the company's work with the DALL·E series of image generators, Sora takes GenAI to the next level. As with DALL·E, users input a written prompt; the output is a video clip of up to 60 seconds. The clips come in a variety of styles and are of extraordinary quality, with only minor errors. Sora can also do more than generate video from text: it can bring still images to life as video and extend, merge, and edit existing videos.
OpenAI has made Sora available to red teamers and a select group of early-access creators to test critical areas. The company appears to be positioning the product around creative workflows for visual artists, designers, and filmmakers, with the bold vision of eventually producing full feature-length films.
Granted, text-to-video is a nascent but not unpopulated market. Other GenAI companies, such as Runway, have been developing text-to-video models for some time, as have players like Google and Midjourney. It's hard to say which will come out on top, but OpenAI's enormous user base from ChatGPT Plus puts it firmly in the race.
Applications for Sora
Given that Sora can both generate new clips and edit existing ones, its applications at the outset are broad: B-roll, stock footage, video ads, and full-fledged video editing. In the longer term, it might be used for content marketing, feature-length ads, and even full-length YouTube videos and movies.
In the short term, Sora can generate short-form content for marketing teams and entrepreneurs. While Sora is in its infancy, the pipeline for creating short clips for Instagram Reels, TikTok, and YouTube Shorts can already be largely automated through generative AI-powered editing platforms. And while Sora itself is simple (you type a prompt and it outputs a video), text-to-video will continue to be woven into existing editing tools like Adobe Premiere Pro and Final Cut, augmenting editors with generative AI in a variety of ways.
Limitations of Text-to-Video
Sora is not without its limitations, and the videos aren't flawless. In particular, Sora struggles with cause and effect: if a character takes a bite out of an apple, for example, the apple may not show a bite mark afterward. Complex scenes with many animals or people can contain visual errors, objects sometimes overlap in strange ways, and events that unfold in stages or over time can be rendered incorrectly.
Moreover, Sora presents challenges for those who use it. The safety, privacy, copyright, misinformation, and discretion concerns that accompany other forms of generative AI, including ChatGPT, are even more pronounced with Sora. A schoolchild using Sora for a class project could accidentally generate high-quality videos containing harmful material. It has become even easier to create deepfake videos of someone doing something they never did. And companies building Sora and text-to-video into their creative workflows run the risk of hallucinations and poor-quality outputs.
OpenAI's safeguards, like those of any generative AI company, are limited. The company's philosophy appears to favor releasing Sora relatively early and gathering feedback for improvement over perfecting it before release.
Without a doubt, regulators need to step in now. These tools are only improving, and the implications of their misuse loom large.
Conclusion
Sora is a strong step forward for OpenAI. In OpenAI’s own words, “Sora serves as a foundation for models that can understand and simulate the real world, a capability we believe will be an important milestone for achieving AGI.”
On a personal level, I'm shocked at how rapidly text-to-video has reached this level of performance. I remember watching AI-generated videos only half a year ago and wondering when they would become usable; that moment was far closer than I thought. What's more, text-to-video today is the worst it will ever be, and the pace of improvement is only increasing. AI-generated full-length feature films are likely feasible within a matter of years.
This poses many big questions: What skills will we need for the future? What jobs will be replaced and which will be created? How can businesses evaluate the technology in the context of their operations? And what does this mean for how we consume information as a society?