Astra is Google's answer to the new ChatGPT

May 10, 2024

ChatGPT isn't even two years old yet, but the idea of communicating with artificial intelligence by typing into a box already seems dated.

At Google's I/O developer conference today, Demis Hassabis, who is leading the company's effort to regain leadership in artificial intelligence, unveiled a "next-generation AI assistant" called Project Astra. A demo video showed it running as an app on a smartphone and on a prototype pair of smart glasses. The new concept delivers on the promise Hassabis made about Gemini's potential when the model was first unveiled last December.

Responding to voice commands, Astra made sense of objects and scenes viewed through the devices' cameras and conversed about them in natural language. It identified a computer speaker and answered questions about its components, recognized a London neighborhood from the view out an office window, read and analyzed code on a computer screen, composed a limerick about pencils, and recalled where a person had left a pair of glasses.

This vision of the future of AI is strikingly similar to the one OpenAI demonstrated on Monday. OpenAI unveiled a new interface for ChatGPT that can converse quickly by voice and talk about what it sees through a smartphone camera or on a computer screen. That version of ChatGPT, built on a new AI model called GPT-4o, also uses a more humanlike voice and an emotionally expressive tone, simulating emotions such as surprise and even flirtatiousness.

Google's Project Astra is built on an upgraded version of Gemini Ultra, an AI model designed to compete with the one that has powered ChatGPT since March 2023. Gemini, like OpenAI's GPT-4o, is "multimodal," meaning it was trained on audio, images, and video as well as text, and can natively receive, process, and generate data in all of those formats. The move by Google and OpenAI to this technology marks a new era in generative AI; the breakthroughs that gave the world ChatGPT and its competitors have so far come from AI models that work exclusively with text and must be combined with other systems to add image or audio capabilities.

In an interview ahead of today's event, Hassabis said he believes text-based chatbots will prove to be only a "transitional stage" on the way to far more sophisticated, and hopefully more useful, AI assistants. "That's always been Gemini's vision," Hassabis added. "That's why we made it multimodal."

The new versions of Gemini and ChatGPT that can see, hear, and speak make for impressive demos, but what place they will find in workplaces or personal lives remains to be seen.

Pulkit Agrawal, an assistant professor at MIT who works on AI and robotics, says the recent demonstrations by Google and OpenAI are impressive and show how quickly multimodal AI models are evolving. OpenAI launched GPT-4V, a system capable of analyzing images, in September 2023. He was impressed by Gemini's ability to make sense of live video, for example by correctly interpreting changes made to a diagram on a whiteboard in real time. OpenAI's new version of ChatGPT appears capable of the same thing.

Agrawal says the assistants demonstrated by Google and OpenAI can provide companies with new data for training as users interact with models in the real world. "But they have to be useful," he adds. "The big question is what people will use them for - it's not very clear yet."

Google says Project Astra will be made available through a new interface called Gemini Live later this year. Hassabis said the company is still testing several smart glasses prototypes and has not yet decided whether to put them into production.

Astra's capabilities could give Google a chance to reboot a version of its ill-fated Glass smart glasses, although efforts to build hardware suited to generative AI have so far fallen flat. Despite the impressive demonstrations from OpenAI and Google, multimodal models cannot fully understand the physical world and the objects in it, which limits what they will be able to do.

"The ability to build a mental model of the physical world around you is absolutely essential to building more human intelligence," says Brenden Lake, an assistant professor at New York University who uses AI to study human intelligence.

Lake notes that today's best AI models are still language-centric, because the bulk of their training comes from text drawn from books and the internet. That is fundamentally different from how humans learn language, acquiring it through interaction with the physical world. "Compared to child development, it's backwards," he says of the way multimodal models are created.

Hassabis believes that giving AI models a better understanding of the physical world will be key to further progress in AI and to making systems like Project Astra more reliable. Other areas of AI research, including Google DeepMind's work on game-playing AI programs, could help, he said. Hassabis and others hope such work could prove revolutionary for robotics, an area Google is also investing in.

"Multimodal universal agent-assistants are on the road to general-purpose artificial intelligence," Hassabis says, referring to a hopeful future but largely undefined moment when machines will be able to do anything and everything the human mind can do. "It's not AGI or anything like that, but it's the beginning of something."
