Generate text from multimodal prompts using the Gemini API


When calling the Gemini API from your app using a Vertex AI in Firebase SDK, you can prompt the Gemini model to generate text based on a multimodal input. Multimodal prompts can include multiple modalities (or types of input), like text along with images, PDFs, plain-text files, video, and audio.

In each multimodal request, you must always provide the following:

  • The file's mimeType. Learn about each input file's supported MIME types.

  • The file. You can either provide the file as inline data (as shown on this page) or using its URL or URI.
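
Because every request needs a MIME type alongside the file, it can help to derive the type from the file name in one place. The following is a minimal sketch, not part of the SDK: the class name `MimeTypes`, the method `forFileName`, and the extension table are all hypothetical, and the table is only an illustrative subset that should be checked against the supported MIME types list.

```java
import java.util.Locale;
import java.util.Map;

public class MimeTypes {
    // Illustrative subset only (an assumption); confirm each entry against
    // the list of supported MIME types before relying on it.
    private static final Map<String, String> BY_EXTENSION = Map.of(
            "png", "image/png",
            "jpg", "image/jpeg",
            "jpeg", "image/jpeg",
            "pdf", "application/pdf",
            "txt", "text/plain",
            "mp4", "video/mp4",
            "wav", "audio/wav");

    /** Returns the MIME type for a file name, or null if the extension is not in the table. */
    public static String forFileName(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return null; // no extension to look up
        }
        String extension = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return BY_EXTENSION.get(extension);
    }

    public static void main(String[] args) {
        System.out.println(forFileName("clip.MP4")); // prints "video/mp4"
        System.out.println(forFileName("notes"));    // prints "null"
    }
}
```

A null result is a signal to stop and check the file type rather than to guess a MIME type.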

For testing and iterating on multimodal prompts, we recommend using Vertex AI Studio.

Before you begin

If you haven't already, complete the getting started guide for the Vertex AI in Firebase SDKs. Make sure that you've done all of the following:

  1. Set up a new or existing Firebase project, including using the Blaze pricing plan and enabling the required APIs.

  2. Connect your app to Firebase, including registering your app and adding your Firebase config to your app.

  3. Add the SDK and initialize the Vertex AI service and the generative model in your app.

After you've connected your app to Firebase, added the SDK, and initialized the Vertex AI service and the generative model, you're ready to call the Gemini API.


Sample media files

If you don't already have media files, then you can use the following publicly available files:

Generate text from text and a single image

Make sure that you've completed the Before you begin section of this guide before trying this sample.

You can call the Gemini API with multimodal prompts that include both text and a single file (like an image, as shown in this example). For these calls, you need to use a model that supports media in prompts (like Gemini 1.5 Flash).

Make sure to review the requirements and recommendations for input files.

Choose whether you want to stream the response (generateContentStream) or wait for the response until the entire result is generated (generateContent).

Streaming

You can achieve faster interactions by using streaming to handle partial results instead of waiting for the entire result from the model generation.

This example shows how to use generateContentStream() to stream generated text from a multimodal prompt request that includes text and a single image:

Kotlin

For Kotlin, the methods in this SDK are suspend functions and need to be called from a Coroutine scope.
// Initialize the Vertex AI service and the generative model
// Specify a model that supports your use case
// Gemini 1.5 models are versatile and can be used with all API capabilities
val generativeModel = Firebase.vertexAI.generativeModel("gemini-1.5-flash")

// Loads an image from the app/res/drawable/ directory
val bitmap: Bitmap = BitmapFactory.decodeResource(resources, R.drawable.sparky)

// Provide a prompt that includes the image specified above and text
val prompt = content {
  image(bitmap)
  text("What developer tool is this mascot from?")
}

// To stream generated text output, call generateContentStream with the prompt
var fullResponse = ""
generativeModel.generateContentStream(prompt).collect { chunk ->
  // chunk.text is nullable, so fall back to an empty string
  val text = chunk.text ?: ""
  print(text)
  fullResponse += text
}

Java

For Java, the streaming methods in this SDK return a Publisher type from the Reactive Streams library.
// Initialize the Vertex AI service and the generative model
// Specify a model that supports your use case
// Gemini 1.5 models are versatile and can be used with all API capabilities
GenerativeModel gm = FirebaseVertexAI.getInstance()
        .generativeModel("gemini-1.5-flash");
GenerativeModelFutures model = GenerativeModelFutures.from(gm);

Bitmap bitmap = BitmapFactory.decodeResource(getResources(), R.drawable.sparky);

// Provide a prompt that includes the image specified above and text
Content prompt = new Content.Builder()
        .addImage(bitmap)
        .addText("What developer tool is this mascot from?")
        .build();

// To stream generated text output, call generateContentStream with the prompt
Publisher<GenerateContentResponse> streamingResponse = model.generateContentStream(prompt);

final String[] fullResponse = {""};

streamingResponse.subscribe(new Subscriber<GenerateContentResponse>() {
    @Override
    public void onNext(GenerateContentResponse generateContentResponse) {
        // getText() can return null when a chunk has no text
        String chunk = generateContentResponse.getText();
        if (chunk != null) {
            fullResponse[0] += chunk;
        }
    }

    @Override
    public void onComplete() {
        System.out.println(fullResponse[0]);
    }

    @Override
    public void onError(Throwable t) {
        t.printStackTrace();
    }

    @Override
    public void onSubscribe(Subscription s) {
        // Request items; a Reactive Streams publisher sends nothing until demand is signaled
        s.request(Long.MAX_VALUE);
    }
});
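
As a rough sketch of the Reactive Streams subscriber pattern used above, the following example shows a subscriber that signals demand and accumulates text chunks. It is SDK-free: it uses the JDK's `java.util.concurrent.Flow` interfaces as a stand-in for the `org.reactivestreams` types that the SDK returns, it streams plain strings instead of `GenerateContentResponse` objects, and the class name `ChunkCollector` is hypothetical.

```java
import java.util.concurrent.Flow;

public class ChunkCollector implements Flow.Subscriber<String> {
    private final StringBuilder fullResponse = new StringBuilder();
    private boolean completed = false;

    @Override
    public void onSubscribe(Flow.Subscription subscription) {
        // Signal demand; a publisher delivers nothing until items are requested
        subscription.request(Long.MAX_VALUE);
    }

    @Override
    public void onNext(String chunk) {
        // Append each partial result as it arrives
        fullResponse.append(chunk);
    }

    @Override
    public void onError(Throwable t) {
        t.printStackTrace();
    }

    @Override
    public void onComplete() {
        completed = true;
    }

    public String text() {
        return fullResponse.toString();
    }

    public boolean isCompleted() {
        return completed;
    }

    public static void main(String[] args) {
        ChunkCollector collector = new ChunkCollector();
        // Drive the subscriber by hand in place of a real publisher
        collector.onSubscribe(new Flow.Subscription() {
            @Override
            public void request(long n) { }

            @Override
            public void cancel() { }
        });
        collector.onNext("Hello, ");
        collector.onNext("world");
        collector.onComplete();
        System.out.println(collector.text()); // prints "Hello, world"
    }
}
```

Using a `StringBuilder` avoids the single-element array workaround that an anonymous inner class needs to mutate a local variable.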

Without streaming

Alternatively, you can wait for the entire result instead of streaming; the result is only returned after the model completes the entire generation process.

This example shows how to use generateContent() to generate text from a multimodal prompt request that includes text and a single image:

Kotlin

For Kotlin, the methods in this SDK are suspend functions and need to be called from a Coroutine scope.
// Initialize the Vertex AI service and the generative model
// Specify a model that supports your use case
// Gemini 1.5 models are versatile and can be used with all API capabilities
val generativeModel = Firebase.vertexAI.generativeModel("gemini-1.5-flash")

// Loads an image from the app/res/drawable/ directory
val bitmap: Bitmap = BitmapFactory.decodeResource(resources, R.drawable.sparky)

// Provide a prompt that includes the image specified above and text
val prompt = content {
  image(bitmap)
  text("What developer tool is this mascot from?")
}

// To generate text output, call generateContent with the prompt
val response = generativeModel.generateContent(prompt)
print(response.text)

Java

For Java, the methods in this SDK return a ListenableFuture.
// Initialize the Vertex AI service and the generative model
// Specify a model that supports your use case
// Gemini 1.5 models are versatile and can be used with all API capabilities
GenerativeModel gm = FirebaseVertexAI.getInstance()
        .generativeModel("gemini-1.5-flash");
GenerativeModelFutures model = GenerativeModelFutures.from(gm);

Bitmap bitmap = BitmapFactory.decodeResource(getResources(), R.drawable.sparky);

// Provide a prompt that includes the image specified above and text
Content content = new Content.Builder()
        .addImage(bitmap)
        .addText("What developer tool is this mascot from?")
        .build();

// To generate text output, call generateContent with the prompt
// executor is a java.util.concurrent.Executor that you supply; the callback runs on it
ListenableFuture<GenerateContentResponse> response = model.generateContent(content);
Futures.addCallback(response, new FutureCallback<GenerateContentResponse>() {
    @Override
    public void onSuccess(GenerateContentResponse result) {
        String resultText = result.getText();
        System.out.println(resultText);
    }

    @Override
    public void onFailure(Throwable t) {
        t.printStackTrace();
    }
}, executor);

Learn how to choose a Gemini model and optionally a location appropriate for your use case and app.

Generate text from text and multiple images

Make sure that you've completed the Before you begin section of this guide before trying this sample.

You can call the Gemini API with multimodal prompts that include both text and multiple files (like images, as shown in this example). For these calls, you need to use a model that supports media in prompts (like Gemini 1.5 Flash).

Make sure to review the requirements and recommendations for input files.

Choose whether you want to stream the response (generateContentStream) or wait for the response until the entire result is generated (generateContent).

Streaming

You can achieve faster interactions by using streaming to handle partial results instead of waiting for the entire result from the model generation.

This example shows how to use generateContentStream() to stream generated text from a multimodal prompt request that includes text and multiple images:

Kotlin

For Kotlin, the methods in this SDK are suspend functions and need to be called from a Coroutine scope.
// Initialize the Vertex AI service and the generative model
// Specify a model that supports your use case
// Gemini 1.5 models are versatile and can be used with all API capabilities
val generativeModel = Firebase.vertexAI.generativeModel("gemini-1.5-flash")

// Loads two images from the app/res/drawable/ directory
val bitmap1: Bitmap = BitmapFactory.decodeResource(resources, R.drawable.sparky)
val bitmap2: Bitmap = BitmapFactory.decodeResource(resources, R.drawable.sparky_eats_pizza)

// Provide a prompt that includes the images specified above and text
val prompt = content {
    image(bitmap1)
    image(bitmap2)
    text("What's different between these pictures?")
}

// To stream generated text output, call generateContentStream with the prompt
var fullResponse = ""
generativeModel.generateContentStream(prompt).collect { chunk ->
  // chunk.text is nullable, so fall back to an empty string
  val text = chunk.text ?: ""
  print(text)
  fullResponse += text
}

Java

For Java, the streaming methods in this SDK return a Publisher type from the Reactive Streams library.
// Initialize the Vertex AI service and the generative model
// Specify a model that supports your use case
// Gemini 1.5 models are versatile and can be used with all API capabilities
GenerativeModel gm = FirebaseVertexAI.getInstance()
        .generativeModel("gemini-1.5-flash");
GenerativeModelFutures model = GenerativeModelFutures.from(gm);

Bitmap bitmap1 = BitmapFactory.decodeResource(getResources(), R.drawable.sparky);
Bitmap bitmap2 = BitmapFactory.decodeResource(getResources(), R.drawable.sparky_eats_pizza);

// Provide a prompt that includes the images specified above and text
Content prompt = new Content.Builder()
    .addImage(bitmap1)
    .addImage(bitmap2)
    .addText("What's different between these pictures?")
    .build();

// To stream generated text output, call generateContentStream with the prompt
Publisher<GenerateContentResponse> streamingResponse = model.generateContentStream(prompt);

final String[] fullResponse = {""};

streamingResponse.subscribe(new Subscriber<GenerateContentResponse>() {
    @Override
    public void onNext(GenerateContentResponse generateContentResponse) {
        // getText() can return null when a chunk has no text
        String chunk = generateContentResponse.getText();
        if (chunk != null) {
            fullResponse[0] += chunk;
        }
    }

    @Override
    public void onComplete() {
        System.out.println(fullResponse[0]);
    }

    @Override
    public void onError(Throwable t) {
        t.printStackTrace();
    }

    @Override
    public void onSubscribe(Subscription s) {
        // Request items; a Reactive Streams publisher sends nothing until demand is signaled
        s.request(Long.MAX_VALUE);
    }
});

Without streaming

Alternatively, you can wait for the entire result instead of streaming; the result is only returned after the model completes the entire generation process.

This example shows how to use generateContent() to generate text from a multimodal prompt request that includes text and multiple images:

Kotlin

For Kotlin, the methods in this SDK are suspend functions and need to be called from a Coroutine scope.
// Initialize the Vertex AI service and the generative model
// Specify a model that supports your use case
// Gemini 1.5 models are versatile and can be used with all API capabilities
val generativeModel = Firebase.vertexAI.generativeModel("gemini-1.5-flash")

// Loads two images from the app/res/drawable/ directory
val bitmap1: Bitmap = BitmapFactory.decodeResource(resources, R.drawable.sparky)
val bitmap2: Bitmap = BitmapFactory.decodeResource(resources, R.drawable.sparky_eats_pizza)

// Provide a prompt that includes the images specified above and text
val prompt = content {
  image(bitmap1)
  image(bitmap2)
  text("What is different between these pictures?")
}

// To generate text output, call generateContent with the prompt
val response = generativeModel.generateContent(prompt)
print(response.text)

Java

For Java, the methods in this SDK return a ListenableFuture.
// Initialize the Vertex AI service and the generative model
// Specify a model that supports your use case
// Gemini 1.5 models are versatile and can be used with all API capabilities
GenerativeModel gm = FirebaseVertexAI.getInstance()
        .generativeModel("gemini-1.5-flash");
GenerativeModelFutures model = GenerativeModelFutures.from(gm);

Bitmap bitmap1 = BitmapFactory.decodeResource(getResources(), R.drawable.sparky);
Bitmap bitmap2 = BitmapFactory.decodeResource(getResources(), R.drawable.sparky_eats_pizza);

// Provide a prompt that includes the images specified above and text
Content prompt = new Content.Builder()
    .addImage(bitmap1)
    .addImage(bitmap2)
    .addText("What's different between these pictures?")
    .build();

// To generate text output, call generateContent with the prompt
// executor is a java.util.concurrent.Executor that you supply; the callback runs on it
ListenableFuture<GenerateContentResponse> response = model.generateContent(prompt);
Futures.addCallback(response, new FutureCallback<GenerateContentResponse>() {
    @Override
    public void onSuccess(GenerateContentResponse result) {
        String resultText = result.getText();
        System.out.println(resultText);
    }

    @Override
    public void onFailure(Throwable t) {
        t.printStackTrace();
    }
}, executor);

Learn how to choose a Gemini model and optionally a location appropriate for your use case and app.

Generate text from text and a video

Make sure that you've completed the Before you begin section of this guide before trying this sample.

You can call the Gemini API with multimodal prompts that include both text and video file(s) (as shown in this example). For these calls, you need to use a model that supports media in prompts (like Gemini 1.5 Flash).

Make sure to review the requirements and recommendations for input files.

Choose whether you want to stream the response (generateContentStream) or wait for the response until the entire result is generated (generateContent).

Streaming

You can achieve faster interactions by using streaming to handle partial results instead of waiting for the entire result from the model generation.

This example shows how to use generateContentStream() to stream generated text from a multimodal prompt request that includes text and a single video:

Kotlin

For Kotlin, the methods in this SDK are suspend functions and need to be called from a Coroutine scope.
// Initialize the Vertex AI service and the generative model
// Specify a model that supports your use case
// Gemini 1.5 models are versatile and can be used with all API capabilities
val generativeModel = Firebase.vertexAI.generativeModel("gemini-1.5-flash")

val contentResolver = applicationContext.contentResolver
contentResolver.openInputStream(videoUri).use { stream ->
  stream?.let {
    val bytes = stream.readBytes()

    // Provide a prompt that includes the video specified above and text
    val prompt = content {
        inlineData(bytes, "video/mp4")
        text("What is in the video?")
    }

    // To stream generated text output, call generateContentStream with the prompt
    var fullResponse = ""
    generativeModel.generateContentStream(prompt).collect { chunk ->
        // chunk.text is nullable, so fall back to an empty string
        val text = chunk.text ?: ""
        Log.d(TAG, text)
        fullResponse += text
    }
  }
}

Java

For Java, the streaming methods in this SDK return a Publisher type from the Reactive Streams library.
// Initialize the Vertex AI service and the generative model
// Specify a model that supports your use case
// Gemini 1.5 models are versatile and can be used with all API capabilities
GenerativeModel gm = FirebaseVertexAI.getInstance()
        .generativeModel("gemini-1.5-flash");
GenerativeModelFutures model = GenerativeModelFutures.from(gm);

ContentResolver resolver = getApplicationContext().getContentResolver();
try (InputStream stream = resolver.openInputStream(videoUri)) {
    if (stream != null) {
        // Read the entire stream; a single read() call may return only part of the file
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] readBuffer = new byte[8192];
        int bytesRead;
        while ((bytesRead = stream.read(readBuffer)) != -1) {
            buffer.write(readBuffer, 0, bytesRead);
        }
        byte[] videoBytes = buffer.toByteArray();

        // Provide a prompt that includes the video specified above and text
        Content prompt = new Content.Builder()
                .addInlineData(videoBytes, "video/mp4")
                .addText("What is in the video?")
                .build();

        // To stream generated text output, call generateContentStream with the prompt
        Publisher<GenerateContentResponse> streamingResponse =
                model.generateContentStream(prompt);

        final String[] fullResponse = {""};

        streamingResponse.subscribe(new Subscriber<GenerateContentResponse>() {
            @Override
            public void onNext(GenerateContentResponse generateContentResponse) {
                // getText() can return null when a chunk has no text
                String chunk = generateContentResponse.getText();
                if (chunk != null) {
                    fullResponse[0] += chunk;
                }
            }

            @Override
            public void onComplete() {
                System.out.println(fullResponse[0]);
            }

            @Override
            public void onError(Throwable t) {
                t.printStackTrace();
            }

            @Override
            public void onSubscribe(Subscription s) {
                // Request items; a Reactive Streams publisher sends nothing until demand is signaled
                s.request(Long.MAX_VALUE);
            }
        });
    }
} catch (IOException e) {
    e.printStackTrace();
}

Without streaming

Alternatively, you can wait for the entire result instead of streaming; the result is only returned after the model completes the entire generation process.

This example shows how to use generateContent() to generate text from a multimodal prompt request that includes text and a single video:

Kotlin

For Kotlin, the methods in this SDK are suspend functions and need to be called from a Coroutine scope.
// Initialize the Vertex AI service and the generative model
// Specify a model that supports your use case
// Gemini 1.5 models are versatile and can be used with all API capabilities
val generativeModel = Firebase.vertexAI.generativeModel("gemini-1.5-flash")

val contentResolver = applicationContext.contentResolver
contentResolver.openInputStream(videoUri).use { stream ->
  stream?.let {
    val bytes = stream.readBytes()

    // Provide a prompt that includes the video specified above and text
    val prompt = content {
        inlineData(bytes, "video/mp4")
        text("What is in the video?")
    }

    // To generate text output, call generateContent with the prompt
    val response = generativeModel.generateContent(prompt)
    Log.d(TAG, response.text ?: "")
  }
}

Java

For Java, the methods in this SDK return a ListenableFuture.
// Initialize the Vertex AI service and the generative model
// Specify a model that supports your use case
// Gemini 1.5 models are versatile and can be used with all API capabilities
GenerativeModel gm = FirebaseVertexAI.getInstance()
        .generativeModel("gemini-1.5-flash");
GenerativeModelFutures model = GenerativeModelFutures.from(gm);

ContentResolver resolver = getApplicationContext().getContentResolver();
try (InputStream stream = resolver.openInputStream(videoUri)) {
    if (stream != null) {
        // Read the entire stream; a single read() call may return only part of the file
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] readBuffer = new byte[8192];
        int bytesRead;
        while ((bytesRead = stream.read(readBuffer)) != -1) {
            buffer.write(readBuffer, 0, bytesRead);
        }
        byte[] videoBytes = buffer.toByteArray();

        // Provide a prompt that includes the video specified above and text
        Content prompt = new Content.Builder()
                .addInlineData(videoBytes, "video/mp4")
                .addText("What is in the video?")
                .build();

        // To generate text output, call generateContent with the prompt
        // executor is a java.util.concurrent.Executor that you supply; the callback runs on it
        ListenableFuture<GenerateContentResponse> response = model.generateContent(prompt);
        Futures.addCallback(response, new FutureCallback<GenerateContentResponse>() {
            @Override
            public void onSuccess(GenerateContentResponse result) {
                String resultText = result.getText();
                System.out.println(resultText);
            }

            @Override
            public void onFailure(Throwable t) {
                t.printStackTrace();
            }
        }, executor);
    }
} catch (IOException e) {
    e.printStackTrace();
}

Learn how to choose a Gemini model and optionally a location appropriate for your use case and app.

Requirements and recommendations for input files

See Supported input files and requirements for the Vertex AI Gemini API to learn about the following:

  • Different options for providing a file in a request
  • Supported file types
  • Supported MIME types and how to specify them
  • Requirements and best practices for files and multimodal requests
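
As one illustration of checking a file before sending it as inline data, the sketch below reads a stream fully into memory and fails fast once it exceeds a size cap. The class name `InlineDataReader`, the method `readFully`, and the `maxBytes` parameter are hypothetical; the cap used in `main` is a placeholder, so take the actual limit from the requirements page.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class InlineDataReader {
    /**
     * Reads the whole stream into a byte array.
     * Throws IOException as soon as more than maxBytes have been read.
     */
    public static byte[] readFully(InputStream in, long maxBytes) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        long total = 0;
        int n;
        // read() may return fewer bytes than requested, so loop until end of stream
        while ((n = in.read(buffer)) != -1) {
            total += n;
            if (total > maxBytes) {
                throw new IOException("Input exceeds " + maxBytes + " bytes");
            }
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] sample = new byte[20_000];
        // Placeholder cap of 1,000,000 bytes (an assumption, not a documented limit)
        byte[] copy = readFully(new ByteArrayInputStream(sample), 1_000_000);
        System.out.println(copy.length); // prints "20000"
    }
}
```

In an Android app, the stream passed in would typically be the one returned by `ContentResolver.openInputStream`, and the resulting bytes would go to `addInlineData` together with the file's MIME type.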

What else can you do?

  • Learn how to count tokens before sending long prompts to the model.
  • Set up Cloud Storage for Firebase so that you can include large files in your multimodal requests and have a more managed solution for providing files in prompts. Files can include images, PDFs, video, and audio.
  • Start thinking about preparing for production, including setting up Firebase App Check to protect the Gemini API from abuse by unauthorized clients.

Try out other capabilities of the Gemini API

Learn how to control content generation

You can also experiment with prompts and model configurations using Vertex AI Studio.

Learn more about the Gemini models

Learn about the models available for various use cases and their quotas and pricing.

