Photo: Camilo Concha/Shutterstock.com
OpenAI is pushing text-to-video generation further with the launch of Sora, a model designed to comprehend and simulate the physical world in motion and to generate videos up to one minute long.
In a blog post, OpenAI showcased videos generated from specific text prompts. According to the company, the broader goal is to train models that help people solve problems requiring real-world interaction.
The model is built on a transformer architecture similar to GPT models. Sora represents videos and images as collections of small units of data called patches, which unifies the data representation and improves scalability across visual data of varying durations, resolutions and aspect ratios.
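To make the idea of patches concrete, here is a minimal sketch of how a clip could be cut into fixed-size blocks spanning both space and time. It is an illustration only: OpenAI has not published Sora's code, so the helper names (`make_video`, `to_patches`), the patch sizes and the use of raw grayscale pixel values are all assumptions made for this example.

```python
from typing import List

# video[t][y][x] holds one grayscale value per pixel (an assumption for this toy example)
Video = List[List[List[float]]]


def make_video(frames: int, height: int, width: int) -> Video:
    """Build a toy video whose pixel value encodes its (t, y, x) position."""
    return [
        [[float(t * height * width + y * width + x) for x in range(width)]
         for y in range(height)]
        for t in range(frames)
    ]


def to_patches(video: Video, pt: int, ph: int, pw: int) -> List[List[float]]:
    """Cut a video into non-overlapping pt x ph x pw blocks and flatten each one.

    Hypothetical helper, not OpenAI's implementation.
    """
    frames, height, width = len(video), len(video[0]), len(video[0][0])
    if frames % pt or height % ph or width % pw:
        raise ValueError("video dimensions must be divisible by the patch size")
    patches = []
    for t0 in range(0, frames, pt):
        for y0 in range(0, height, ph):
            for x0 in range(0, width, pw):
                # Flatten one block of pixels into a single token-like vector
                patch = [
                    video[t][y][x]
                    for t in range(t0, t0 + pt)
                    for y in range(y0, y0 + ph)
                    for x in range(x0, x0 + pw)
                ]
                patches.append(patch)
    return patches


# Two clips with different durations and aspect ratios
short_clip = to_patches(make_video(4, 8, 8), pt=2, ph=4, pw=4)
long_wide_clip = to_patches(make_video(8, 8, 16), pt=2, ph=4, pw=4)
print(len(short_clip), len(short_clip[0]))
print(len(long_wide_clip), len(long_wide_clip[0]))
```

Both clips come out as lists of patches of the same length; only the number of patches changes. That is the sense in which patches give one representation for videos of different durations, resolutions and aspect ratios.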
Sora builds upon past research in DALL·E and GPT models, employing the recaptioning technique introduced with DALL·E 3. This technique involves generating highly descriptive captions for the visual training data, which helps the model follow users’ text instructions more faithfully in the generated videos.
The videos are clear and follow their prompts closely. For example, below is a sample video for the prompt: An adorable happy otter confidently stands on a surfboard wearing a yellow lifejacket, riding along turquoise tropical waters near lush tropical islands, 3D digital render art style.
The otter definitely looks happy, and the background suggests that the location is indeed a tropical island. This 17-second video is one of many that OpenAI showcased in the blog.
This video suggests that Sora can generate complex scenes with characters, specific motion types, and accurate details of the subject and background.
The background detail in this clip can be a bit hard to make out, so let's turn to another example.
Here’s a prompt for the following video: Beautiful, snowy Tokyo city is bustling. The camera moves through the bustling city street, following several people enjoying the beautiful snowy weather and shopping at nearby stalls. Gorgeous sakura petals are flying through the wind along with snowflakes.
The video combines medieval Tokyo architecture with modern buildings in the background. A drone-like camera follows a couple holding hands amidst bustling shops in what seems like Tokyo. The result suggests that Sora can render expansive cityscapes from nothing more than a prompt. Viewers with a keen eye may find some details confusing or blurry, but for most of us, the prompt produced a perfectly acceptable video.
The example also suggests that Sora goes beyond merely understanding user prompts; it appears to grasp how the described elements exist in the physical world.
Currently, the model is available to red teamers, who are assessing critical areas for harms or risks. OpenAI has also granted access to visual artists, designers, and filmmakers, whose feedback is meant to help the model evolve to meet the needs of creative professionals.
“The model has a deep understanding of language, enabling it to accurately interpret prompts and generate compelling characters that express vibrant emotions. Sora can also create multiple shots within a single generated video that accurately persist characters and visual style,” said OpenAI.
However, the model is not without its limitations. It may struggle to accurately simulate the physics of complex scenes and may not understand specific instances of cause and effect.
OpenAI is developing tools to detect harmful content, including a detection classifier to identify videos generated by Sora. The company plans to include C2PA metadata in the future, with the aim of enhancing accountability.
The company also plans to reuse safety methods built for previous models such as DALL·E 3, including text classifiers and image classifiers that filter out content violating its usage policies before it reaches users.
As with all new generative AI products, questions remain about the intellectual property rights of the videos and the future of 3D artists. AI has long been suspected of taking away jobs, and Sora will add fuel to this debate, if nothing else.
AI-generated images of children are already proliferating across the internet and are quite difficult to tackle. With text-to-video technology now offering this level of detail, authorities will surely have to put in extra hours if the technology falls into the wrong hands.
In May last year, top industry leaders and politicians, including Elon Musk and Demis Hassabis, released a one-sentence statement claiming AI can pose a risk to human society. What’s ironic is that Musk released Grok, an AI chatbot assistant, even though he believes that AI poses a substantial risk. The AI train is here to stay, and everyone is trying to board it before they are left behind.
With Sora, OpenAI is looking to venture into the field of artificial general intelligence (AGI). Meta has already announced plans to develop an open-source AGI and is even buying hundreds of thousands of GPUs for this purpose.