Artificial intelligence (AI) researchers are showing off the technology’s latest leap forward by demonstrating models that create realistic, coherent videos from a text prompt, raising questions about whether AI’s rapid advancement will soon threaten our ability to tell what is real from what is not.
In the last week, Meta (formerly known as Facebook) and Google have each showcased “text-to-video” AI systems that can create new, unique videos with high-quality graphics based on anything from a few words to a long, intricate sentence.
Late last month, Meta first showed off its Make-A-Video system, which, on top of its text-to-video ability, can also animate still images. Just a week later, Google unveiled two projects of its own, Imagen Video and Phenaki.
Meta’s model can produce videos with photorealistic graphics of subjects carrying out actions and interacting with objects — like a realistic video of a young couple walking in the rain or a surreal teddy bear painting a portrait.
Google’s competitor Imagen Video is similar. Phenaki, on the other hand, doesn’t have quite the same visual quality, but it can turn long prompts into videos several minutes long with a dream-like feel. One example:
Lots of traffic in futuristic city. An alien spaceship arrives to the futuristic city. The camera gets inside the alien spaceship. The camera moves forward until showing an astronaut in the blue room. The astronaut is typing in the keyboard. The camera moves away from the astronaut. The astronaut leaves the keyboard and walks to the left. The astronaut leaves the keyboard and walks away. The camera moves beyond the astronaut and looks at the screen. The screen behind the astronaut displays fish swimming in the sea. Crash zoom into the blue fish. We follow the blue fish as it swims in the dark ocean. The camera points up to the sky through the water. The ocean and the coastline of a futuristic city. Crash zoom towards a futuristic skyscraper. The camera zooms into one of the many windows. We are in an office room with empty desks. A lion runs on top of the office desks. The camera zooms into the lion’s face, inside the office. Zoom out to the lion wearing a dark suit in an office room. The lion wearing looks at the camera and smiles. The camera zooms out slowly to the skyscraper exterior. Timelapse of sunset in the modern city.
Both have been built using diffusion models, a type of model trained by taking its training data, progressively corrupting it with noise, and learning to reverse the process to rebuild the original (these diffusion models are being used in the new generation of text-to-image AI models, too). Researchers gave the models datasets of millions of videos paired with captions, which they use to recognise and reproduce patterns.
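For readers curious about what “breaks apart and builds back together” means in practice, the toy sketch below illustrates the core training step of a diffusion model: noise is added to a sample, and a small network learns to predict that noise so the corruption can later be reversed. It is a minimal illustration only; the model, the noise schedule and every number here are placeholders, not anything from Meta’s or Google’s systems, which are far larger and also condition on the text prompt.

```python
# Toy denoising-diffusion training step (illustrative placeholder code only;
# real text-to-video systems use huge video networks conditioned on text).
import torch
import torch.nn as nn

T = 1000                                 # number of noising steps
betas = torch.linspace(1e-4, 0.02, T)    # noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

class TinyDenoiser(nn.Module):
    """Stand-in for the large denoising network used in practice."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, 256), nn.ReLU(), nn.Linear(256, dim)
        )

    def forward(self, x, t):
        # Tell the network how noisy the input is by appending the timestep.
        t_feat = t.float().unsqueeze(-1) / T
        return self.net(torch.cat([x, t_feat], dim=-1))

model = TinyDenoiser()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def training_step(clean_batch):
    """'Break apart' the data by adding noise, then learn to rebuild it."""
    t = torch.randint(0, T, (clean_batch.shape[0],))
    noise = torch.randn_like(clean_batch)
    a = alphas_cumprod[t].unsqueeze(-1)
    noisy = a.sqrt() * clean_batch + (1 - a).sqrt() * noise  # forward (noising) process
    pred_noise = model(noisy, t)          # network guesses the added noise
    loss = nn.functional.mse_loss(pred_noise, noise)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Example: one training step on a fake batch of 8 samples with 64 features each.
print(training_step(torch.randn(8, 64)))
```

At generation time the trained network is run in the other direction: it starts from pure noise and removes a little of it at each step until a new sample emerges, with the text prompt steering the result.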
Neither company has yet released these models to the public, but it’s only a matter of time before they become accessible. Much like the text-to-image models before them, text-to-video models are a powerful new tool that opens up the world for users, who previously would have needed to go through the time-consuming and technically demanding process of manually animating something to get a similar effect. But they also present a threat to our understanding of reality.
No one would mistake the current text-to-video outputs for real footage, yet advances in this technology may soon challenge that. What happens when artificially generated video becomes indistinguishable from real video? Such a premise may once have seemed like the plot of a far-fetched dystopian novel, but it now seems not that far away.
Will this technology bring about an age of video manipulation? Let us know by writing to letters@crikey.com.au. Please include your full name to be considered for publication. We reserve the right to edit for length and clarity.
‘…technology’s latest leap forward…’
It may be forward but it’s hardly desirable. It will be easier to create propaganda & artificial news once this craft is refined. Why do we need it?
Some of us older, paler fellers can’t find real girlfriends who meet our impossibly high standards?
Too bad AI can’t do something useful like expose tax evasion or root out fake news.
Information as content is making the whole information age rather hollow and a far cry from the utopian (or even dystopian) visions of old science fiction.
Nearly time to stop watching videos.
When, not if, it is used for porn, will there be any pushback (sic!)?
People have been prosecuted & convicted of having sex dolls which are too lifelike, and Japanese commuters read astonishingly violent misogynist manga openly on suburban trains.
What about those who prefer to only interact with Alexa/Siri – will they need to be vaxxed if they never leave their hermetically sealed existence?
Nothing new, as per John Donne or the 60s Simon & Garfunkel song or EM Forster’s “The Machine Stops”.
QAnon and its ilk will lap this up. The fake news ingesters will fall further down the rabbit hole.
As someone has already stated – the porn industry, closely followed by gamers, will be early adopters. What a brave new world awaits us.
PS: for those with a bit of technical expertise, deepfake photos & videos are all too easy already.
QAnon and its ilk – imagine Cam’s reports from CPAC when this has taken hold.