While OpenAI’s team has been busy with their 12 days of Christmas releases, Google just dropped the hammer. o1 Pro Mode is fascinating, and a little concerning, but the biggest news there is the $200/month pricing. Sora is cool, but we’ve seen it before, and other video models are plentiful.
And while we’re still waiting for the big one in GPT-4.5, the DeepMind team just casually dropped their biggest release ever on a Thursday. Here’s what’s inside.
Multimodal Live API in Google AI Studio
In the video, you can see me demonstrate randomly picked tasks using Gemini 2.0. In a single session, by watching everything on my screen, and listening to my voice, Gemini helps me with the following:
Fixing a code mistake
Checking my slides for factual errors
Helping me understand scientific data
Suggesting how to improve a legal clause
For me, this is something totally new. It’s the most impressive demo moment I’ve seen since ChatGPT came out.
It’s not just that the task performance is solid; it’s the low latency and fluidity of it all. Let me remind you, this is a single model that is streaming in audio, video, and text in parallel, and then outputting voice with close to zero latency. It’s an engineering feat all around.
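If you want to poke at this yourself, the Live API is exposed through the google-genai Python SDK. Below is a minimal sketch of a text-in, text-out session; the model name, config shape, and method names follow the SDK’s initial Live API documentation as I understand it and may have shifted in newer releases, so treat it as a starting point rather than a reference.

```python
# Minimal sketch of a Multimodal Live API session via the google-genai SDK.
# Assumes: pip install google-genai, plus an API key from Google AI Studio.
# Method names (live.connect, session.send, session.receive) follow the
# SDK's early Live API docs and may differ in later versions.
import asyncio
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")

MODEL = "gemini-2.0-flash-exp"              # experimental 2.0 Flash model
CONFIG = {"response_modalities": ["TEXT"]}  # could also request ["AUDIO"]

async def main():
    async with client.aio.live.connect(model=MODEL, config=CONFIG) as session:
        # A real session would also stream microphone audio and screen
        # frames; here we just send a single text turn.
        await session.send(input="Can you sanity-check this legal clause?",
                           end_of_turn=True)
        async for response in session.receive():
            if response.text:
                print(response.text, end="")

asyncio.run(main())
```

The interesting part is that the same bidirectional session accepts audio and video frames alongside text, which is what makes the screen-watching, voice-answering demo possible.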
Is it perfect? No. There are some obvious kinks in the user experience. It seems fairly obvious in retrospect that they wanted to front-run OpenAI’s updates to Advanced Voice Mode. Literally the next day after the Gemini release, OpenAI added a screensharing mode to the mobile version of ChatGPT. No camera, no desktop, though, and not yet available for most users. Google 1, OpenAI 0 for this round.
Built for the Agentic era
From the very title of the release blog, Google is making it clear that Gemini 2.0 is built for the next paradigm in agents. This isn’t your brother’s chatbot.
Gemini 2.0 Flash’s native user interface action-capabilities, along with other improvements like multimodal reasoning, long context understanding, complex instruction following and planning, compositional function-calling, native tool use and improved latency, all work in concert to enable a new class of agentic experiences.
NOTE: In the naming conventions of LLMs, “Flash” here means it’s the smaller, faster version. Similar to perhaps “mini” in OpenAI lingo, or “Sonnet” for Anthropic. Meaning there’s at least a “Pro” version coming, if not “Ultra”.
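To make “native tool use” concrete, here’s a rough sketch of function calling through the google-genai Python SDK: you pass plain Python functions as tools and the SDK runs the call-and-respond loop for you. The get_battery_price helper and its numbers are made up for illustration, and the exact config shape may vary across SDK versions.

```python
# Rough sketch of Gemini function calling with the google-genai SDK.
# get_battery_price is an invented example tool; the SDK's automatic
# function calling invokes it when the model decides it needs the data.
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

def get_battery_price(chemistry: str) -> dict:
    """Return an illustrative (made-up) price per kWh for a battery chemistry."""
    prices = {"LFP": 90, "NMC": 115, "sodium-ion": 80}
    return {"chemistry": chemistry, "usd_per_kwh": prices.get(chemistry)}

response = client.models.generate_content(
    model="gemini-2.0-flash-exp",
    contents="Roughly how much cheaper is LFP than NMC per kWh?",
    config=types.GenerateContentConfig(tools=[get_battery_price]),
)
print(response.text)
```

Compositional function calling takes this a step further, letting the model chain several tool calls in one pass and feed one result into the next.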
Google is targeting several different use-cases for AI agents. Here’s a two-minute demo covering a lot of capabilities they are working on.
Here’s the breakdown:
The first and most obvious is a universal assistant (Project Astra), which seems similar in concept to what you can do with the Google AI Studio demo I showed. A key element here is going to be hardware.
Separately, Google is also working on a web-based agent that can help navigate and interact with websites (Project Mariner). You might expect this one to be big commercially as an API into various helper apps and SaaS tools.
Demis Hassabis, the co-founder and CEO of DeepMind, started as a game developer in his teens. A lot of DeepMind’s early work used games as a playground for AI. Last week, they revealed Genie, a model that generates playable game environments. Now they are collaborating with leading game studios on AI agents for games. While cool, they note that part of the motivation here is to then extend these virtual agents into physical agents: robots.
Finally, with “Jules”, they are going head-to-head with software developer agents like Devin. New lows for naming, as always.
The breadth here is astonishing, and we’re not even done yet! You really get a sense of how big DeepMind’s team is (2,600 employees), even compared to the ever-bloating OpenAI (1,700 employees).
Gemini 1.5 Pro with Deep Research
While the video and screensharing are the game-changer, I’ve already used this other feature even more. It doesn’t even run on Gemini 2.0; it uses an older model. But I find what Deep Research does incredibly useful.
Google has gotten unbelievable flak for fumbling search and AI integration. Compared to the poster boy of LLM-based search, Perplexity, Google’s efforts have been pretty humble. That changes today.
I just clicked on one of the suggested prompts, and a few minutes later, this is what I got.
The idea here isn’t actually to directly take on Perplexity for now. Instead, think of this as NotebookLM for the internet. This isn’t for fact search, but for research questions. Now, instead of uploading PDFs like you do with NotebookLM, Deep Research will scour the internet for you.
In my example, it spent a few minutes reading 70 web pages related to new battery technologies. Once finished, it generates a report complete with tables and sources. You can then ask follow-up questions to your heart’s content.
Again, this isn’t meant to be used on the fly to check the score of last night’s game, or the name of a celebrity in a movie. Instead, you ask complex questions about complex topics, and go grab a coffee. By the time you’re back, Gemini will have performed the equivalent of an intern’s afternoon of work, in some cases much more.
The quality and depth of the output are the real show here.
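For the curious, the underlying pattern is easy to sketch, even though Google’s actual pipeline is certainly far more sophisticated: search, read each source, summarize it against the question, then synthesize a cited report. Everything below is a toy illustration, not how Deep Research is implemented; search_web and fetch_page are hypothetical stand-ins you would wire up to a real search API and scraper.

```python
# Toy illustration of the search -> read -> synthesize pattern behind
# "deep research" style tools. NOT Google's implementation.
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")
MODEL = "gemini-1.5-pro"

def search_web(query: str) -> list[str]:
    """Hypothetical helper: return candidate URLs from a search API."""
    raise NotImplementedError("plug in a real search API here")

def fetch_page(url: str) -> str:
    """Hypothetical helper: return the readable text of a page."""
    raise NotImplementedError("plug in a real scraper here")

def deep_research(question: str, max_pages: int = 20) -> str:
    notes = []
    for url in search_web(question)[:max_pages]:
        # Summarize each source only with respect to the research question.
        summary = client.models.generate_content(
            model=MODEL,
            contents=f"Summarize what's relevant to: {question}\n\n{fetch_page(url)}",
        ).text
        notes.append(f"Source: {url}\n{summary}")
    # Compile the per-source notes into one report with a sources section.
    return client.models.generate_content(
        model=MODEL,
        contents="Write a research report with tables where useful and a "
                 "sources section, based on these notes:\n\n" + "\n\n".join(notes),
    ).text
```

The skeleton is trivial; the value of the real product is in the planning, the breadth of sources it covers, and the quality of the final write-up.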
Takeaways from the Gemini 2.0 release
From my perspective, this wipes ChatGPT off the map, at least momentarily. I’m sure OpenAI will come back with similar features, and GPT-4.5 may well have a similar agentic architecture and even outperform Gemini on some tasks. But we haven’t seen OpenAI ship this much this quickly. It’s a major update on the competitive landscape.
This release also updates my views on AI in several important ways.
Google’s big week
As if that wasn’t enough, they also announced the world’s most powerful quantum chip (Willow) and Android XR, which is like Android for mixed reality devices. XR in particular will be a key enabler for Project Astra, like in the original demo video (link) that showed a person walking around with a pair of glasses while interacting with Gemini. I can’t believe these are sidenotes in this essay.
Google is finally flexing its muscle
Did you really doubt Demis Hassabis and Google? They literally invented the Transformer and wrote many of the major papers on reinforcement learning. They can undercut OpenAI on API pricing and subscriptions, because they make billions in profits while OpenAI makes billions in losses. Same goes for Anthropic. Microsoft and AWS are still behind on AI talent, and xAI is very new. Let’s see how many days, weeks, or months Google can stay at the top.
There is no wall
The scaling hypothesis is alive and well. Despite recent hiccups with pre-training, the AI labs have found enough alternative paths with synthetic data and post-training to keep the steamroller on full speed. Expect GPT-4.5 before Christmas, followed by Grok 3, Llama 4, and whatever Anthropic is cooking in Q1 of 2025.
We should update our priors about AGI timelines
This feels a lot closer to AGI than anything I’ve used before. I’m feeling the AGI today.
What are your major takeaways or wow moments from Gemini 2.0?