Firebase is back at Cloud Next on April 9 - 11. Register now.

Generate text from multimodal prompts using the Gemini API

When calling the Gemini API from your app using a Vertex AI in Firebase SDK, you can prompt the Gemini model to generate text based on a multimodal input. Multimodal prompts can include multiple modalities (or types of input), like text along with images, PDFs, plain-text files, video, and audio.

In each multimodal request, you must always provide the following:

The file's mimeType. Learn about each input file's supported MIME types.
The file. You can either provide the file as inline data (as shown on this page) or using its URL or URI.

For testing and iterating on multimodal prompts, we recommend using Vertex AI Studio.

Other options for working with the Gemini API

Optionally experiment with an alternative "Google AI" version of the Gemini API
Get free-of-charge access (within limits and where available) using Google AI Studio and Google AI client SDKs. These SDKs should be used for prototyping only in mobile and web apps.

After you're familiar with how a Gemini API works, migrate to our Vertex AI in Firebase SDKs (this documentation), which have many additional features important for mobile and web apps, like protecting the API from abuse using Firebase App Check and support for large media files in requests.

Optionally call the Vertex AI Gemini API server-side (like with Python, Node.js, or Go)
Use the server-side Vertex AI SDKs, Firebase Genkit, or Firebase Extensions for the Gemini API.

Before you begin

If you haven't already, complete the getting started guide for the Vertex AI in Firebase SDKs. Make sure that you've done all of the following:

Set up a new or existing Firebase project, including using the Blaze pricing plan and enabling the required APIs.
Connect your app to Firebase, including registering your app and adding your Firebase config to your app.
Add the SDK and initialize the Vertex AI service and the generative model in your app.

After you've connected your app to Firebase, added the SDK, and initialized the Vertex AI service and the generative model, you're ready to call the Gemini API.

Generate text from text and a single image Generate text from text and multiple images Generate text from text and a video

Sample media files

If you don't already have media files, then you can use the following publicly available files. Since these files are stored in buckets that aren't in your Firebase project, you need to use the https://storage.googleapis.com/BUCKET_NAME/PATH/TO/FILE format for the URL.

Image: https://storage.googleapis.com/cloud-samples-data/generative-ai/image/scones.jpg with a MIME type of image/jpeg. View or download this image.
PDF: https://storage.googleapis.com/cloud-samples-data/generative-ai/pdf/2403.05530.pdf with a MIME type of application/pdf. View or download this PDF.
Video: https://storage.googleapis.com/cloud-samples-data/video/animals.mp4 with a MIME type of video/mp4. View or download this video.
Audio: https://storage.googleapis.com/cloud-samples-data/generative-ai/audio/pixel.mp3 with a MIME type of audio/mp3. Listen to or download this audio.

Generate text from text and a single image

Make sure that you've completed the Before you begin section of this guide before trying this sample.

You can call the Gemini API with multimodal prompts that include both text and a single file (like an image, as shown in this example). For these calls, you need to use a model that supports media in prompts (like Gemini 2.0 Flash).

Make sure to review the requirements and recommendations for input files.

Choose whether you want to stream the response (generateContentStream) or wait for the response until the entire result is generated (generateContent).

Streaming

You can achieve faster interactions by not waiting for the entire result from the model generation, and instead use streaming to handle partial results.

This example shows how to use generateContentStream() to stream generated text from a multimodal prompt request that includes text and a single image:

Kotlin

^{For Kotlin, the methods in this SDK are suspend functions and need to be called
from a Coroutine scope.}

// Initialize the Vertex AI service and the generative model
// Specify a model that supports your use case
val generativeModel = Firebase.vertexAI.generativeModel("gemini-2.0-flash")

// Loads an image from the app/res/drawable/ directory
val bitmap: Bitmap = BitmapFactory.decodeResource(resources, R.drawable.sparky)

// Provide a prompt that includes the image specified above and text
val prompt = content {
  image(bitmap)
  text("What developer tool is this mascot from?")
}

// To stream generated text output, call generateContentStream with the prompt
var fullResponse = ""
generativeModel.generateContentStream(prompt).collect { chunk ->
  print(chunk.text)
  fullResponse += chunk.text
}

Note: The example above takes advantage of a simplified way to handle platform-native image types (Bitmap) in multi-modal prompts. These image types (regardless of their original format) are converted client-side to JPEG at 80% quality before being sent to the server. This means that when you provide images inline like in the example above, you don't need to specify the MIME type.

For more control over image formats and conversions, you may provide the images as an InlineDataPart and supply the specific MIME type. For example: content { inlineData(/* PNG as byte array */, "image/png") }.

Java

^{For Java, the streaming methods in this SDK return a
Publisher type from the Reactive Streams library.}

// Initialize the Vertex AI service and the generative model
// Specify a model that supports your use case
GenerativeModel gm = FirebaseVertexAI.getInstance()
        .generativeModel("gemini-2.0-flash");
GenerativeModelFutures model = GenerativeModelFutures.from(gm);

Bitmap bitmap = BitmapFactory.decodeResource(getResources(), R.drawable.sparky);

// Provide a prompt that includes the image specified above and text
Content prompt = new Content.Builder()
        .addImage(bitmap)
        .addText("What developer tool is this mascot from?")
        .build();

// To stream generated text output, call generateContentStream with the prompt
Publisher<GenerateContentResponse> streamingResponse = model.generateContentStream(prompt);

final String[] fullResponse = {""};

streamingResponse.subscribe(new Subscriber<GenerateContentResponse>() {
    @Override
    public void onNext(GenerateContentResponse generateContentResponse) {
        String chunk = generateContentResponse.getText();
        fullResponse[0] += chunk;
    }

    @Override
    public void onComplete() {
        System.out.println(fullResponse[0]);
    }

    @Override
    public void onError(Throwable t) {
        t.printStackTrace();
    }

    @Override
    public void onSubscribe(Subscription s) {
    }
});

Without streaming

Alternatively, you can wait for the entire result instead of streaming; the result is only returned after the model completes the entire generation process.

This example shows how to use generateContent() to generate text from a multimodal prompt request that includes text and a single image:

Kotlin

^{For Kotlin, the methods in this SDK are suspend functions and need to be called
from a Coroutine scope.}

// Initialize the Vertex AI service and the generative model
// Specify a model that supports your use case
val generativeModel = Firebase.vertexAI.generativeModel("gemini-2.0-flash")

// Loads an image from the app/res/drawable/ directory
val bitmap: Bitmap = BitmapFactory.decodeResource(resources, R.drawable.sparky)

// Provide a prompt that includes the image specified above and text
val prompt = content {
  image(bitmap)
  text("What developer tool is this mascot from?")
}

// To generate text output, call generateContent with the prompt
val response = generativeModel.generateContent(prompt)
print(response.text)

Java

^{For Java, the methods in this SDK return a
ListenableFuture.}

// Initialize the Vertex AI service and the generative model
// Specify a model that supports your use case
GenerativeModel gm = FirebaseVertexAI.getInstance()
        .generativeModel("gemini-2.0-flash");
GenerativeModelFutures model = GenerativeModelFutures.from(gm);

Bitmap bitmap = BitmapFactory.decodeResource(getResources(), R.drawable.sparky);

// Provide a prompt that includes the image specified above and text
Content content = new Content.Builder()
        .addImage(bitmap)
        .addText("What developer tool is this mascot from?")
        .build();

// To generate text output, call generateContent with the prompt
ListenableFuture<GenerateContentResponse> response = model.generateContent(content);
Futures.addCallback(response, new FutureCallback<GenerateContentResponse>() {
    @Override
    public void onSuccess(GenerateContentResponse result) {
        String resultText = result.getText();
        System.out.println(resultText);
    }

    @Override
    public void onFailure(Throwable t) {
        t.printStackTrace();
    }
}, executor);

Learn how to choose a Gemini model and optionally a location appropriate for your use case and app.

Generate text from text and multiple images

Make sure that you've completed the Before you begin section of this guide before trying this sample.

You can call the Gemini API with multimodal prompts that include both text and multiple files (like images, as shown in this example). For these calls, you need to use a model that supports media in prompts (like Gemini 2.0 Flash).

Make sure to review the requirements and recommendations for input files.

Choose whether you want to stream the response (generateContentStream) or wait for the response until the entire result is generated (generateContent).

Streaming

You can achieve faster interactions by not waiting for the entire result from the model generation, and instead use streaming to handle partial results.

This example shows how to use generateContentStream() to stream generated text from a multimodal prompt request that includes text and multiple images:

Kotlin

^{For Kotlin, the methods in this SDK are suspend functions and need to be called
from a Coroutine scope.}

// Initialize the Vertex AI service and the generative model
// Specify a model that supports your use case
val generativeModel = Firebase.vertexAI.generativeModel("gemini-2.0-flash")

// Loads an image from the app/res/drawable/ directory
val bitmap1: Bitmap = BitmapFactory.decodeResource(resources, R.drawable.sparky)
val bitmap2: Bitmap = BitmapFactory.decodeResource(resources, R.drawable.sparky_eats_pizza)

// Provide a prompt that includes the images specified above and text
val prompt = content {
    image(bitmap1)
    image(bitmap2)
    text("What's different between these pictures?")
}

// To stream generated text output, call generateContentStream with the prompt
var fullResponse = ""
generativeModel.generateContentStream(prompt).collect { chunk ->
  print(chunk.text)
  fullResponse += chunk.text
}

Java

^{For Java, the streaming methods in this SDK return a
Publisher type from the Reactive Streams library.}

// Initialize the Vertex AI service and the generative model
// Specify a model that supports your use case
GenerativeModel gm = FirebaseVertexAI.getInstance()
        .generativeModel("gemini-2.0-flash");
GenerativeModelFutures model = GenerativeModelFutures.from(gm);

Bitmap bitmap1 = BitmapFactory.decodeResource(getResources(), R.drawable.sparky);
Bitmap bitmap2 = BitmapFactory.decodeResource(getResources(), R.drawable.sparky_eats_pizza);

// Provide a prompt that includes the images specified above and text
Content prompt = new Content.Builder()
    .addImage(bitmap1)
    .addImage(bitmap2)
    .addText("What's different between these pictures?")
    .build();

// To stream generated text output, call generateContentStream with the prompt
Publisher<GenerateContentResponse> streamingResponse = model.generateContentStream(prompt);

final String[] fullResponse = {""};

streamingResponse.subscribe(new Subscriber<GenerateContentResponse>() {
    @Override
    public void onNext(GenerateContentResponse generateContentResponse) {
        String chunk = generateContentResponse.getText();
        fullResponse[0] += chunk;
    }

    @Override
    public void onComplete() {
        System.out.println(fullResponse[0]);
    }

    @Override
    public void onError(Throwable t) {
        t.printStackTrace();
    }

    @Override
    public void onSubscribe(Subscription s) {
    }
});

Without streaming

Alternatively, you can alternatively wait for the entire result instead of streaming; the result is only returned after the model completes the entire generation process.

This example shows how to use generateContent() to generate text from a multimodal prompt request that includes text and multiple images:

Kotlin

^{For Kotlin, the methods in this SDK are suspend functions and need to be called
from a Coroutine scope.}

// Initialize the Vertex AI service and the generative model
// Specify a model that supports your use case
val generativeModel = Firebase.vertexAI.generativeModel("gemini-2.0-flash")

// Loads an image from the app/res/drawable/ directory
val bitmap1: Bitmap = BitmapFactory.decodeResource(resources, R.drawable.sparky)
val bitmap2: Bitmap = BitmapFactory.decodeResource(resources, R.drawable.sparky_eats_pizza)

// Provide a prompt that includes the images specified above and text
val prompt = content {
  image(bitmap1)
  image(bitmap2)
  text("What is different between these pictures?")
}

// To generate text output, call generateContent with the prompt
val response = generativeModel.generateContent(prompt)
print(response.text)

Java

^{For Java, the methods in this SDK return a
ListenableFuture.}

// Initialize the Vertex AI service and the generative model
// Specify a model that supports your use case
GenerativeModel gm = FirebaseVertexAI.getInstance()
        .generativeModel("gemini-2.0-flash");
GenerativeModelFutures model = GenerativeModelFutures.from(gm);

Bitmap bitmap1 = BitmapFactory.decodeResource(getResources(), R.drawable.sparky);
Bitmap bitmap2 = BitmapFactory.decodeResource(getResources(), R.drawable.sparky_eats_pizza);

// Provide a prompt that includes the images specified above and text
Content prompt = new Content.Builder()
    .addImage(bitmap1)
    .addImage(bitmap2)
    .addText("What's different between these pictures?")
    .build();

// To generate text output, call generateContent with the prompt
ListenableFuture<GenerateContentResponse> response = model.generateContent(prompt);
Futures.addCallback(response, new FutureCallback<GenerateContentResponse>() {
    @Override
    public void onSuccess(GenerateContentResponse result) {
        String resultText = result.getText();
        System.out.println(resultText);
    }

    @Override
    public void onFailure(Throwable t) {
        t.printStackTrace();
    }
}, executor);

Learn how to choose a Gemini model and optionally a location appropriate for your use case and app.

Generate text from text and a video

Make sure that you've completed the Before you begin section of this guide before trying this sample.

You can call the Gemini API with multimodal prompts that include both text and video file(s) (as shown in this example). For these calls, you need to use a model that supports media in prompts (like Gemini 2.0 Flash).

Make sure to review the requirements and recommendations for input files.

Choose whether you want to stream the response (generateContentStream) or wait for the response until the entire result is generated (generateContent).

Streaming

You can achieve faster interactions by not waiting for the entire result from the model generation, and instead use streaming to handle partial results.

This example shows how to use generateContentStream() to stream generated text from a multimodal prompt request that includes text and a single video:

Kotlin

^{For Kotlin, the methods in this SDK are suspend functions and need to be called
from a Coroutine scope.}

// Initialize the Vertex AI service and the generative model
// Specify a model that supports your use case
val generativeModel = Firebase.vertexAI.generativeModel("gemini-2.0-flash")

val contentResolver = applicationContext.contentResolver
contentResolver.openInputStream(videoUri).use { stream ->
  stream?.let {
    val bytes = stream.readBytes()

    // Provide a prompt that includes the video specified above and text
    val prompt = content {
        inlineData(bytes, "video/mp4")
        text("What is in the video?")
    }

    // To stream generated text output, call generateContentStream with the prompt
    var fullResponse = ""
    generativeModel.generateContentStream(prompt).collect { chunk ->
        Log.d(TAG, chunk.text ?: "")
        fullResponse += chunk.text
    }
  }
}

Java

^{For Java, the streaming methods in this SDK return a
Publisher type from the Reactive Streams library.}

// Initialize the Vertex AI service and the generative model
// Specify a model that supports your use case
GenerativeModel gm = FirebaseVertexAI.getInstance()
        .generativeModel("gemini-2.0-flash");
GenerativeModelFutures model = GenerativeModelFutures.from(gm);

ContentResolver resolver = getApplicationContext().getContentResolver();
try (InputStream stream = resolver.openInputStream(videoUri)) {
    File videoFile = new File(new URI(videoUri.toString()));
    int videoSize = (int) videoFile.length();
    byte[] videoBytes = new byte[videoSize];
    if (stream != null) {
        stream.read(videoBytes, 0, videoBytes.length);
        stream.close();

        // Provide a prompt that includes the video specified above and text
        Content prompt = new Content.Builder()
                .addInlineData(videoBytes, "video/mp4")
                .addText("What is in the video?")
                .build();

        // To stream generated text output, call generateContentStream with the prompt
        Publisher<GenerateContentResponse> streamingResponse =
                model.generateContentStream(prompt);

        final String[] fullResponse = {""};

        streamingResponse.subscribe(new Subscriber<GenerateContentResponse>() {
            @Override
            public void onNext(GenerateContentResponse generateContentResponse) {
                String chunk = generateContentResponse.getText();
                fullResponse[0] += chunk;
            }

            @Override
            public void onComplete() {
                System.out.println(fullResponse[0]);
            }

            @Override
            public void onError(Throwable t) {
                t.printStackTrace();
            }

            @Override
            public void onSubscribe(Subscription s) {
            }
         });
    }
} catch (IOException e) {
    e.printStackTrace();
} catch (URISyntaxException e) {
    e.printStackTrace();
}

Without streaming

Alternatively, you can wait for the entire result instead of streaming; the result is only returned after the model completes the entire generation process.

This example shows how to use generateContent() to generate text from a multimodal prompt request that includes text and a single video:

Kotlin

^{For Kotlin, the methods in this SDK are suspend functions and need to be called
from a Coroutine scope.}

// Initialize the Vertex AI service and the generative model
// Specify a model that supports your use case
val generativeModel = Firebase.vertexAI.generativeModel("gemini-2.0-flash")

val contentResolver = applicationContext.contentResolver
contentResolver.openInputStream(videoUri).use { stream ->
  stream?.let {
    val bytes = stream.readBytes()

    // Provide a prompt that includes the video specified above and text
    val prompt = content {
        inlineData(bytes, "video/mp4")
        text("What is in the video?")
    }

    // To generate text output, call generateContent with the prompt
    val response = generativeModel.generateContent(prompt)
    Log.d(TAG, response.text ?: "")
  }
}

Java

^{For Java, the methods in this SDK return a
ListenableFuture.}

// Initialize the Vertex AI service and the generative model
// Specify a model that supports your use case
GenerativeModel gm = FirebaseVertexAI.getInstance()
        .generativeModel("gemini-2.0-flash");
GenerativeModelFutures model = GenerativeModelFutures.from(gm);

ContentResolver resolver = getApplicationContext().getContentResolver();
try (InputStream stream = resolver.openInputStream(videoUri)) {
    File videoFile = new File(new URI(videoUri.toString()));
    int videoSize = (int) videoFile.length();
    byte[] videoBytes = new byte[videoSize];
    if (stream != null) {
        stream.read(videoBytes, 0, videoBytes.length);
        stream.close();

        // Provide a prompt that includes the video specified above and text
        Content prompt = new Content.Builder()
                .addInlineData(videoBytes, "video/mp4")
                .addText("What is in the video?")
                .build();

        // To generate text output, call generateContent with the prompt
        ListenableFuture<GenerateContentResponse> response = model.generateContent(prompt);
        Futures.addCallback(response, new FutureCallback<GenerateContentResponse>() {
            @Override
            public void onSuccess(GenerateContentResponse result) {
                String resultText = result.getText();
                System.out.println(resultText);
            }

            @Override
            public void onFailure(Throwable t) {
                t.printStackTrace();
            }
        }, executor);
    }
} catch (IOException e) {
    e.printStackTrace();
} catch (URISyntaxException e) {
    e.printStackTrace();
}

Learn how to choose a Gemini model and optionally a location appropriate for your use case and app.

Requirements and recommendations for input files

See Supported input files and requirements for the Vertex AI Gemini API to learn about the following:

Different options for providing a file in a request
Supported file types
Supported MIME types and how to specify them
Requirements and best practices for files and multimodal requests

What else can you do?

Learn how to count tokens before sending long prompts to the model.
Set up Cloud Storage for Firebase so that you can include large files in your multimodal requests and have a more managed solution for providing files in prompts. Files can include images, PDFs, video, and audio.
Start thinking about preparing for production, including setting up Firebase App Check to protect the Gemini API from abuse by unauthorized clients.

Try out other capabilities of the Gemini API

Build multi-turn conversations (chat).
Generate text from text-only prompts.
Generate structured output (like JSON) from both text and multimodal prompts.
Use function calling to connect generative models to external systems and information.

Learn how to control content generation

Understand prompt design, including best practices, strategies, and example prompts.
Configure model parameters like temperature and maximum output tokens.
Use safety settings to adjust the likelihood of getting responses that may be considered harmful.

You can also experiment with prompts and model configurations using Vertex AI Studio.

Learn more about the Gemini models

Learn about the models available for various use cases and their quotas and pricing.

Give feedback about your experience with Vertex AI in Firebase