Generate text from multimodal prompts using the Gemini API


When calling the Gemini API from your app using a Vertex AI for Firebase SDK, you can prompt the Gemini model to generate text based on a multimodal input. Multimodal prompts can include multiple modalities (or types of input), like text along with images, PDFs, video, and audio.

For testing and iterating on multimodal prompts, we recommend using Vertex AI Studio.

Before you begin

If you haven't already, work through the getting started guide for the Vertex AI for Firebase SDKs. Make sure that you've done all of the following:

  • Set up a new or existing Firebase project, including using the Blaze pricing plan and enabling the required APIs.

  • Connect your app to Firebase, including registering your app and adding your Firebase config to your app.

  • Add the SDK and initialize the Vertex AI service and the generative model in your app.

Once you've completed these steps, you're ready to call the Gemini API.

Generate text from text and a single image

Make sure that you've completed the Before you begin section of this guide before trying this sample.

You can call the Gemini API with multimodal prompts that include both text and a single file (like an image, as shown in this example). For these calls, you need to use a model that supports multimodal prompts (like Gemini 1.5 Pro).

Supported files include images, PDFs, video, audio, and more. Make sure to review the requirements and recommendations for input files.

Choose whether to stream the response (generateContentStream) or wait until the entire result is generated (generateContent).

Streaming

You can achieve faster interactions by using streaming to handle partial results instead of waiting for the model to generate the entire result.

This example shows how to use generateContentStream() to stream generated text from a multimodal prompt request that includes text and a single image:
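A minimal TypeScript sketch of this call, assuming `model` is the object returned by `getGenerativeModel()` in the Firebase web SDK and that the image is already base64-encoded. The local interfaces and the `streamTextFromImage` helper are illustrative names, not part of the SDK; they only mirror the request and response shapes the sketch relies on.

```typescript
// Minimal structural types for the parts of the SDK this sketch touches.
// These are assumptions for illustration, not the SDK's own type definitions.
interface InlineDataPart {
  inlineData: { data: string; mimeType: string };
}
interface StreamingModel {
  generateContentStream(
    request: Array<string | InlineDataPart>
  ): Promise<{ stream: AsyncIterable<{ text(): string }> }>;
}

// Streams a response for a text + single-image prompt. Each partial chunk is
// passed to `onChunk` as it arrives; the accumulated text is returned at the end.
async function streamTextFromImage(
  model: StreamingModel,
  prompt: string,
  imageBase64: string, // base64-encoded image bytes
  mimeType: string, // for example "image/jpeg"
  onChunk: (text: string) => void
): Promise<string> {
  const imagePart: InlineDataPart = {
    inlineData: { data: imageBase64, mimeType },
  };
  const result = await model.generateContentStream([prompt, imagePart]);

  let fullText = "";
  for await (const chunk of result.stream) {
    const chunkText = chunk.text();
    onChunk(chunkText);
    fullText += chunkText;
  }
  return fullText;
}
```

In an app, `onChunk` would typically append each chunk to the UI so that text appears while the model is still generating.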

Without streaming

Alternatively, you can wait for the entire result instead of streaming; the result is only returned after the model completes the entire generation process.

This example shows how to use generateContent() to generate text from a multimodal prompt request that includes text and a single image:
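A minimal TypeScript sketch of the non-streaming call, assuming `model` is the object returned by `getGenerativeModel()` in the Firebase web SDK. The `ContentModel` interface and the `describeImage` helper are hypothetical names introduced here to keep the sketch self-contained.

```typescript
// Minimal structural types for illustration; not the SDK's own definitions.
interface InlineDataPart {
  inlineData: { data: string; mimeType: string };
}
interface ContentModel {
  generateContent(
    request: Array<string | InlineDataPart>
  ): Promise<{ response: { text(): string } }>;
}

// Sends a text + single-image prompt and resolves with the complete
// generated text once the model has finished.
async function describeImage(
  model: ContentModel,
  prompt: string,
  imageBase64: string, // base64-encoded image bytes
  mimeType: string // for example "image/png"
): Promise<string> {
  const imagePart: InlineDataPart = {
    inlineData: { data: imageBase64, mimeType },
  };
  const result = await model.generateContent([prompt, imagePart]);
  return result.response.text();
}
```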

Learn how to choose a Gemini model and optionally a location appropriate for your use case and app.

Generate text from text and multiple images

Make sure that you've completed the Before you begin section of this guide before trying this sample.

You can call the Gemini API with multimodal prompts that include both text and multiple files (like images, as shown in this example). For these calls, you need to use a model that supports multimodal prompts (like Gemini 1.5 Pro).

Supported files include images, PDFs, video, audio, and more. Make sure to review the requirements and recommendations for input files.

Choose whether to stream the response (generateContentStream) or wait until the entire result is generated (generateContent).

Streaming

You can achieve faster interactions by using streaming to handle partial results instead of waiting for the model to generate the entire result.

This example shows how to use generateContentStream() to stream generated text from a multimodal prompt request that includes text and multiple images:
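A minimal TypeScript sketch with several images in one request, assuming `model` comes from `getGenerativeModel()` in the Firebase web SDK. The `EncodedFile` type and the `streamTextFromImages` helper are hypothetical; the point is that each image becomes its own part after the text prompt.

```typescript
// Minimal structural types for illustration; not the SDK's own definitions.
interface InlineDataPart {
  inlineData: { data: string; mimeType: string };
}
interface StreamingModel {
  generateContentStream(
    request: Array<string | InlineDataPart>
  ): Promise<{ stream: AsyncIterable<{ text(): string }> }>;
}
// Hypothetical holder for an already base64-encoded file.
interface EncodedFile {
  base64Data: string;
  mimeType: string;
}

// Streams a response for a prompt made of text followed by any number of images.
async function streamTextFromImages(
  model: StreamingModel,
  prompt: string,
  images: EncodedFile[],
  onChunk: (text: string) => void
): Promise<void> {
  // One inline-data part per image, in the order given.
  const imageParts = images.map(
    (image): InlineDataPart => ({
      inlineData: { data: image.base64Data, mimeType: image.mimeType },
    })
  );
  const result = await model.generateContentStream([prompt, ...imageParts]);
  for await (const chunk of result.stream) {
    onChunk(chunk.text());
  }
}
```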

Without streaming

Alternatively, you can wait for the entire result instead of streaming; the result is only returned after the model completes the entire generation process.

This example shows how to use generateContent() to generate text from a multimodal prompt request that includes text and multiple images:
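A minimal TypeScript sketch of the non-streaming call with multiple images, assuming `model` comes from `getGenerativeModel()` in the Firebase web SDK. `EncodedImage` and `compareImages` are hypothetical names used only for this illustration.

```typescript
// Minimal structural types for illustration; not the SDK's own definitions.
interface InlineDataPart {
  inlineData: { data: string; mimeType: string };
}
interface ContentModel {
  generateContent(
    request: Array<string | InlineDataPart>
  ): Promise<{ response: { text(): string } }>;
}
// Hypothetical holder for an already base64-encoded image.
interface EncodedImage {
  base64Data: string;
  mimeType: string;
}

// Sends text plus several images in a single request and resolves with the
// complete generated text.
async function compareImages(
  model: ContentModel,
  prompt: string,
  images: EncodedImage[]
): Promise<string> {
  const imageParts = images.map(
    (image): InlineDataPart => ({
      inlineData: { data: image.base64Data, mimeType: image.mimeType },
    })
  );
  const result = await model.generateContent([prompt, ...imageParts]);
  return result.response.text();
}
```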

Learn how to choose a Gemini model and optionally a location appropriate for your use case and app.

Generate text from text and a video

Make sure that you've completed the Before you begin section of this guide before trying this sample.

You can call the Gemini API with multimodal prompts that include both text and a single video (as shown in this example). For these calls, you need to use a model that supports multimodal prompts (like Gemini 1.5 Pro).

Make sure to review the requirements and recommendations for input files.

Choose whether to stream the response (generateContentStream) or wait until the entire result is generated (generateContent).

Streaming

You can achieve faster interactions by using streaming to handle partial results instead of waiting for the model to generate the entire result.

This example shows how to use generateContentStream() to stream generated text from a multimodal prompt request that includes text and a single video:
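A minimal TypeScript sketch for a video prompt, assuming `model` comes from `getGenerativeModel()` in the Firebase web SDK and that the video is small enough to send inline as base64 data. The `streamTextFromVideo` helper and the local interfaces are hypothetical; the request has the same shape as for an image, with a video MIME type.

```typescript
// Minimal structural types for illustration; not the SDK's own definitions.
interface InlineDataPart {
  inlineData: { data: string; mimeType: string };
}
interface StreamingModel {
  generateContentStream(
    request: Array<string | InlineDataPart>
  ): Promise<{ stream: AsyncIterable<{ text(): string }> }>;
}

// Streams a response for a text + single-video prompt and returns the list
// of chunks in arrival order.
async function streamTextFromVideo(
  model: StreamingModel,
  prompt: string,
  videoBase64: string, // base64-encoded video bytes
  mimeType: string // for example "video/mp4"
): Promise<string[]> {
  const videoPart: InlineDataPart = {
    inlineData: { data: videoBase64, mimeType },
  };
  const result = await model.generateContentStream([prompt, videoPart]);

  const chunks: string[] = [];
  for await (const chunk of result.stream) {
    chunks.push(chunk.text());
  }
  return chunks;
}
```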

Without streaming

Alternatively, you can wait for the entire result instead of streaming; the result is only returned after the model completes the entire generation process.

This example shows how to use generateContent() to generate text from a multimodal prompt request that includes text and a single video:
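A minimal TypeScript sketch of the non-streaming call with a video, assuming `model` comes from `getGenerativeModel()` in the Firebase web SDK and that the video is already base64-encoded. `summarizeVideo` and the local interfaces are hypothetical names introduced for this illustration.

```typescript
// Minimal structural types for illustration; not the SDK's own definitions.
interface InlineDataPart {
  inlineData: { data: string; mimeType: string };
}
interface ContentModel {
  generateContent(
    request: Array<string | InlineDataPart>
  ): Promise<{ response: { text(): string } }>;
}

// Sends a text + single-video prompt and resolves with the complete
// generated text once the model has finished.
async function summarizeVideo(
  model: ContentModel,
  prompt: string,
  videoBase64: string, // base64-encoded video bytes
  mimeType: string // for example "video/mp4"
): Promise<string> {
  const videoPart: InlineDataPart = {
    inlineData: { data: videoBase64, mimeType },
  };
  const result = await model.generateContent([prompt, videoPart]);
  return result.response.text();
}
```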

Learn how to choose a Gemini model and optionally a location appropriate for your use case and app.

Requirements and recommendations for input files

To learn about supported file types, how to specify MIME type, and how to make sure that your files and multimodal requests meet the requirements and follow best practices, see Supported input files and requirements for the Vertex AI Gemini API.

What else can you do?

  • Learn how to count tokens before sending long prompts to the model.
  • Set up Cloud Storage for Firebase so that you can include large files in your multimodal requests using Cloud Storage URLs. Files can include images, PDFs, video, and audio.
  • Start thinking about preparing for production, including setting up Firebase App Check to protect the Gemini API from abuse by unauthorized clients.

Try out other capabilities of the Gemini API

Learn how to control content generation

You can also experiment with prompts and model configurations using Vertex AI Studio.

Learn more about the Gemini models

Learn about the models available for various use cases and their quotas and pricing.

