Google has played another card with Gemini 1.5 Pro, a model that builds on the achievements of its predecessor, Gemini 1.0. 

With Google Bard dead and buried, the Gemini family seems to be multiplying faster than the AI community can keep track of. 

Now comes Gemini 1.5 Pro, which is more efficient than Google’s previous flagship model, Gemini Ultra.

It offers a new Mixture-of-Experts (MoE) architecture and outperforms Gemini Pro (now called Gemini 1.0 Pro) in 87% of benchmarks. 

Ultra is available through Google’s new paid AI subscription, Google One AI Premium, usurping Gemini Pro despite Google having upgraded that model only a couple of weeks ago.

A headline feature of Gemini 1.5 Pro is its ability to understand long contexts. It comes with a standard 128,000 token context window, expandable up to 1 million tokens. This enables the processing and interpretation of colossal data sets, including entire books. 

However, Google emphasizes that Gemini 1.5 Pro is a ‘mid-size’ multimodal model designed to be scalable and versatile. 

Is Gemini 1.5 a GPT-4 killer, then? Not in brute-force performance, but it should outflank it for specific tasks with very large quantities of information, as Google was keen to demonstrate. 

Gemini’s applications and capabilities

Like its predecessors, Gemini 1.5 Pro’s capabilities extend across multiple modalities, from text to video and audio.

Its extended context window enables the model to process and reason about vast amounts of information, such as lengthy documents, extensive codebases, or hours of video content. 

In a Google demo, Gemini 1.5 Pro seemingly understood and identified details in the 402-page transcript of Apollo 11’s mission to the moon. 

Another challenge involved locating specific scenes in Buster Keaton’s “Sherlock Jr.” using descriptions and sketches. Despite the complexity of the task and sheer volume of data, Gemini 1.5 Pro successfully completed the set tasks, albeit with a processing time ranging from 20 seconds to a minute.

In another task, Gemini 1.5 Pro was challenged with translating English into Kalamang, a low-resource language of western New Guinea, and vice versa.

This was especially daunting because Kalamang is not represented in the model’s training data.

Google provided the model with instructional materials in its input context, including approximately 500 pages of reference grammar, a bilingual wordlist (dictionary) with about 2,000 entries, and a set of around 400 parallel sentences.

These materials comprised around 250k tokens, fitting within the model’s extended context window.
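As a rough sanity check on those numbers, a common heuristic for English text is roughly four characters per token. The sketch below uses that heuristic with illustrative page, entry, and sentence sizes (assumptions, not figures from Google’s paper) and lands in the same ballpark as the reported ~250k tokens:

```python
# Rough token-budget sketch using the ~4-characters-per-token heuristic.
# All per-item sizes below are illustrative assumptions, not Google's data.

CHARS_PER_TOKEN = 4

def est_tokens(chars: int) -> int:
    """Very rough token estimate from a character count."""
    return chars // CHARS_PER_TOKEN

grammar_tokens = est_tokens(500 * 1800)    # ~500 pages, ~1,800 chars/page
wordlist_tokens = est_tokens(2000 * 60)    # ~2,000 entries, ~60 chars/entry
sentences_tokens = est_tokens(400 * 120)   # ~400 sentence pairs, ~120 chars/pair

total = grammar_tokens + wordlist_tokens + sentences_tokens
print(f"Estimated total: ~{total:,} tokens")
```

By this estimate the materials would overflow the standard 128,000-token window but fit comfortably within the expanded 1 million-token context.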

With just the instructional materials provided, Gemini 1.5 Pro successfully translated sentences between English and Kalamang. This experiment showcased the model’s capability to absorb and apply new linguistic rules and vocabulary from the context, effectively learning a new language on the fly.

The quality of translations produced by Gemini 1.5 Pro was assessed by human experts who compared the model’s performance with that of a human language learner given the same set of materials.

Gemini 1.5 Pro’s context window extends to coding, too, with another demo gauging the model’s performance in analyzing and solving problems over 100,000 lines of code.

Insights from Gemini 1.5 Pro’s research paper

Google released an accompanying research paper on Gemini 1.5, titled “Gemini 1.5: Unlocking Multimodal Understanding Across Millions of Tokens of Context.” 

It’s clear that Google intends to push Gemini 1.5 Pro’s extended context window, which currently dominates the competition.

Gemini 1.5 Pro achieves near-perfect recall on long-context retrieval tasks across different modalities, sets new standards in long-document QA, long-video QA, and long-context ASR, and matches or even surpasses the performance of its predecessor, Gemini 1.0 Ultra, across a broad set of benchmarks.

The paper details Gemini 1.5 Pro’s performance in various core capabilities, comparing it to the Gemini 1.0 models:

- Win-rate improvements: Gemini 1.5 Pro shows an 87.1% win rate against Gemini 1.0 Pro and a 54.8% win rate against Gemini 1.0 Ultra across multiple benchmarks, demonstrating its improvements.
- Specific area performance: In text-related tasks, the model achieves a 100% win rate against Gemini 1.0 Pro and a 77% win rate against Gemini 1.0 Ultra. In vision-related tasks, the win rates are 77% and 46% against Gemini 1.0 Pro and Ultra, respectively. Audio tasks show a 60% win rate against Gemini 1.0 Pro and a 20% win rate against Gemini 1.0 Ultra.

Overall, Gemini 1.5 Pro is a good GPT-3.5-level model with a longer context window than competitors. 

Is that enough to lure people away from ChatGPT? The truth is, unless you’ve got entire books to analyze, the benefits may be slim to non-existent. 

How to use Gemini 1.5

Gemini 1.5 is currently available in a limited preview for developers and enterprise customers.

Questions about long-term pricing and accessibility have so far gone unanswered. Google has hinted at pricing tiers that will vary based on the context window size, from the standard 128,000 tokens to the full 1 million. 

The exact cost remains under wraps, stirring speculation about the potential investment required to leverage this advanced context window.

Some have highlighted that by the time Gemini 1.5 Pro goes live for the masses, the competition will have moved on. 

Google is differentiating itself on a product that only a tiny number of people can access and experiment with. That seems a little risky and could prove a miscalculation.

The Gemini family: accessible or esoteric?

In the space of two to three months, Google raised and killed Bard, swapped it for Gemini Pro, and released Ultra, Nano, and now Gemini 1.5 Pro. 

This has involved renaming Gemini Pro (which was just Gemini?) to Gemini 1.0 Pro.

To further confuse matters, Gemini Advanced, which is Google’s paid version of Gemini running on Google One AI Premium, uses Gemini 1.0 Ultra rather than the newer Gemini 1.5 Pro.

DeepMind’s landing page for the Gemini family is quite frankly convoluted and crowded. 

OpenAI, in many ways, pulled a slick marketing trick by keeping their models under the ‘ChatGPT’ umbrella from the start and keeping access more or less limited to just the free GPT-3.5 and paid GPT-4 for non-API users. 

Gemini is Google going nuclear on generative AI, but they might get bogged down in their increasingly ambiguous product offerings.