When calling the Gemini API from your app using a Vertex AI in Firebase SDK, you can prompt the Gemini model to generate text based on a multimodal input. Multimodal prompts can include multiple modalities (or types of input), like text along with images, PDFs, plain-text files, video, and audio.
In each multimodal request, you must always provide the following:
The file's
mimeType
. Learn about each input file's supported MIME types.The file. You can either provide the file as inline data (as shown on this page) or using its URL or URI.
For testing and iterating on multimodal prompts, we recommend using Vertex AI Studio.
Optionally experiment with an alternative "Google AI" version of the Gemini API
Get free-of-charge access (within limits and where available) using Google AI Studio and Google AI client SDKs. These SDKs should be used for prototyping only in mobile and web apps.After you're familiar with how a Gemini API works, migrate to our Vertex AI in Firebase SDKs (this documentation), which have many additional features important for mobile and web apps, like protecting the API from abuse using Firebase App Check and support for large media files in requests.
Optionally call the Vertex AI Gemini API server-side (like with Python, Node.js, or Go)
Use the server-side Vertex AI SDKs, Firebase Genkit, or Firebase Extensions for the Gemini API.
Before you begin
If you haven't already, complete the getting started guide for the Vertex AI in Firebase SDKs. Make sure that you've done all of the following:
Set up a new or existing Firebase project, including using the Blaze pricing plan and enabling the required APIs.
Connect your app to Firebase, including registering your app and adding your Firebase config to your app.
Add the SDK and initialize the Vertex AI service and the generative model in your app.
After you've connected your app to Firebase, added the SDK, and initialized the Vertex AI service and the generative model, you're ready to call the Gemini API.
Generate text from text and a single image Generate text from text and multiple images Generate text from text and a video
Sample media files
If you don't already have media files, then you can use the following publicly available files:
Image:
gs://cloud-samples-data/generative-ai/image/scones.jpg
with a MIME type ofimage/jpeg
.
View or download this image.PDF:
gs://cloud-samples-data/generative-ai/pdf/2403.05530.pdf
with a MIME type ofapplication/pdf
.
View or download this PDF.Video:
gs://cloud-samples-data/video/animals.mp4
with a MIME type ofvideo/mp4
.
View or download this video.Audio:
gs://cloud-samples-data/generative-ai/audio/pixel.mp3
with a MIME type ofaudio/mp3
.
Listen to or download this audio.
Generate text from text and a single image
Make sure that you've completed the Before you begin section of this guide before trying this sample.
You can call the Gemini API with multimodal prompts that include both text and a single file (like an image, as shown in this example). For these calls, you need to use a model that supports media in prompts (like Gemini 1.5 Flash).
Make sure to review the requirements and recommendations for input files.
Choose whether you want to stream the response (generateContentStream
) or wait
for the response until the entire result is generated (generateContent
).
Streaming
You can achieve faster interactions by not waiting for the entire result from the model generation, and instead use streaming to handle partial results.
This example shows how to use
generateContentStream()
to stream generated text from a multimodal prompt request that includes text
and a single image:
Kotlin
For Kotlin, the methods in this SDK are suspend functions and need to be called from a Coroutine scope.// Initialize the Vertex AI service and the generative model
// Specify a model that supports your use case
// Gemini 1.5 models are versatile and can be used with all API capabilities
val generativeModel = Firebase.vertexAI.generativeModel("gemini-1.5-flash")
// Loads an image from the app/res/drawable/ directory
val bitmap: Bitmap = BitmapFactory.decodeResource(resources, R.drawable.sparky)
// Provide a prompt that includes the image specified above and text
val prompt = content {
image(bitmap)
text("What developer tool is this mascot from?")
}
// To stream generated text output, call generateContentStream with the prompt
var fullResponse = ""
generativeModel.generateContentStream(prompt).collect { chunk ->
print(chunk.text)
fullResponse += chunk.text
}
Java
For Java, the streaming methods in this SDK return aPublisher
type from the Reactive Streams library.
// Initialize the Vertex AI service and the generative model
// Specify a model that supports your use case
// Gemini 1.5 models are versatile and can be used with all API capabilities
GenerativeModel gm = FirebaseVertexAI.getInstance()
.generativeModel("gemini-1.5-flash");
GenerativeModelFutures model = GenerativeModelFutures.from(gm);
Bitmap bitmap = BitmapFactory.decodeResource(getResources(), R.drawable.sparky);
// Provide a prompt that includes the image specified above and text
Content prompt = new Content.Builder()
.addImage(bitmap)
.addText("What developer tool is this mascot from?")
.build();
// To stream generated text output, call generateContentStream with the prompt
Publisher<GenerateContentResponse> streamingResponse = model.generateContentStream(prompt);
final String[] fullResponse = {""};
streamingResponse.subscribe(new Subscriber<GenerateContentResponse>() {
@Override
public void onNext(GenerateContentResponse generateContentResponse) {
String chunk = generateContentResponse.getText();
fullResponse[0] += chunk;
}
@Override
public void onComplete() {
System.out.println(fullResponse[0]);
}
@Override
public void onError(Throwable t) {
t.printStackTrace();
}
@Override
public void onSubscribe(Subscription s) {
}
});
Without streaming
Alternatively, you can wait for the entire result instead of streaming; the result is only returned after the model completes the entire generation process.
This example shows how to use
generateContent()
to generate text from a multimodal prompt request that includes text and a
single image:
Kotlin
For Kotlin, the methods in this SDK are suspend functions and need to be called from a Coroutine scope.// Initialize the Vertex AI service and the generative model
// Specify a model that supports your use case
// Gemini 1.5 models are versatile and can be used with all API capabilities
val generativeModel = Firebase.vertexAI.generativeModel("gemini-1.5-flash")
// Loads an image from the app/res/drawable/ directory
val bitmap: Bitmap = BitmapFactory.decodeResource(resources, R.drawable.sparky)
// Provide a prompt that includes the image specified above and text
val prompt = content {
image(bitmap)
text("What developer tool is this mascot from?")
}
// To generate text output, call generateContent with the prompt
val response = generativeModel.generateContent(prompt)
print(response.text)
Java
For Java, the methods in this SDK return aListenableFuture
.
// Initialize the Vertex AI service and the generative model
// Specify a model that supports your use case
// Gemini 1.5 models are versatile and can be used with all API capabilities
GenerativeModel gm = FirebaseVertexAI.getInstance()
.generativeModel("gemini-1.5-flash");
GenerativeModelFutures model = GenerativeModelFutures.from(gm);
Bitmap bitmap = BitmapFactory.decodeResource(getResources(), R.drawable.sparky);
// Provide a prompt that includes the image specified above and text
Content content = new Content.Builder()
.addImage(bitmap)
.addText("What developer tool is this mascot from?")
.build();
// To generate text output, call generateContent with the prompt
ListenableFuture<GenerateContentResponse> response = model.generateContent(content);
Futures.addCallback(response, new FutureCallback<GenerateContentResponse>() {
@Override
public void onSuccess(GenerateContentResponse result) {
String resultText = result.getText();
System.out.println(resultText);
}
@Override
public void onFailure(Throwable t) {
t.printStackTrace();
}
}, executor);
Learn how to choose a Gemini model and optionally a location appropriate for your use case and app.
Generate text from text and multiple images
Make sure that you've completed the Before you begin section of this guide before trying this sample.
You can call the Gemini API with multimodal prompts that include both text and multiple files (like images, as shown in this example). For these calls, you need to use a model that supports media in prompts (like Gemini 1.5 Flash).
Make sure to review the requirements and recommendations for input files.
Choose whether you want to stream the response (generateContentStream
) or wait
for the response until the entire result is generated (generateContent
).
Streaming
You can achieve faster interactions by not waiting for the entire result from the model generation, and instead use streaming to handle partial results.
This example shows how to use
generateContentStream()
to stream generated text from a multimodal prompt request that includes text
and multiple images:
Kotlin
For Kotlin, the methods in this SDK are suspend functions and need to be called from a Coroutine scope.// Initialize the Vertex AI service and the generative model
// Specify a model that supports your use case
// Gemini 1.5 models are versatile and can be used with all API capabilities
val generativeModel = Firebase.vertexAI.generativeModel("gemini-1.5-flash")
// Loads an image from the app/res/drawable/ directory
val bitmap1: Bitmap = BitmapFactory.decodeResource(resources, R.drawable.sparky)
val bitmap2: Bitmap = BitmapFactory.decodeResource(resources, R.drawable.sparky_eats_pizza)
// Provide a prompt that includes the images specified above and text
val prompt = content {
image(bitmap1)
image(bitmap2)
text("What's different between these pictures?")
}
// To stream generated text output, call generateContentStream with the prompt
var fullResponse = ""
generativeModel.generateContentStream(prompt).collect { chunk ->
print(chunk.text)
fullResponse += chunk.text
}
Java
For Java, the streaming methods in this SDK return aPublisher
type from the Reactive Streams library.
// Initialize the Vertex AI service and the generative model
// Specify a model that supports your use case
// Gemini 1.5 models are versatile and can be used with all API capabilities
GenerativeModel gm = FirebaseVertexAI.getInstance()
.generativeModel("gemini-1.5-flash");
GenerativeModelFutures model = GenerativeModelFutures.from(gm);
Bitmap bitmap1 = BitmapFactory.decodeResource(getResources(), R.drawable.sparky);
Bitmap bitmap2 = BitmapFactory.decodeResource(getResources(), R.drawable.sparky_eats_pizza);
// Provide a prompt that includes the images specified above and text
Content prompt = new Content.Builder()
.addImage(bitmap1)
.addImage(bitmap2)
.addText("What's different between these pictures?")
.build();
// To stream generated text output, call generateContentStream with the prompt
Publisher<GenerateContentResponse> streamingResponse = model.generateContentStream(prompt);
final String[] fullResponse = {""};
streamingResponse.subscribe(new Subscriber<GenerateContentResponse>() {
@Override
public void onNext(GenerateContentResponse generateContentResponse) {
String chunk = generateContentResponse.getText();
fullResponse[0] += chunk;
}
@Override
public void onComplete() {
System.out.println(fullResponse[0]);
}
@Override
public void onError(Throwable t) {
t.printStackTrace();
}
@Override
public void onSubscribe(Subscription s) {
}
});
Without streaming
Alternatively, you can alternatively wait for the entire result instead of streaming; the result is only returned after the model completes the entire generation process.
This example shows how to use
generateContent()
to generate text from a multimodal prompt request that includes text and
multiple images:
Kotlin
For Kotlin, the methods in this SDK are suspend functions and need to be called from a Coroutine scope.// Initialize the Vertex AI service and the generative model
// Specify a model that supports your use case
// Gemini 1.5 models are versatile and can be used with all API capabilities
val generativeModel = Firebase.vertexAI.generativeModel("gemini-1.5-flash")
// Loads an image from the app/res/drawable/ directory
val bitmap1: Bitmap = BitmapFactory.decodeResource(resources, R.drawable.sparky)
val bitmap2: Bitmap = BitmapFactory.decodeResource(resources, R.drawable.sparky_eats_pizza)
// Provide a prompt that includes the images specified above and text
val prompt = content {
image(bitmap1)
image(bitmap2)
text("What is different between these pictures?")
}
// To generate text output, call generateContent with the prompt
val response = generativeModel.generateContent(prompt)
print(response.text)
Java
For Java, the methods in this SDK return aListenableFuture
.
// Initialize the Vertex AI service and the generative model
// Specify a model that supports your use case
// Gemini 1.5 models are versatile and can be used with all API capabilities
GenerativeModel gm = FirebaseVertexAI.getInstance()
.generativeModel("gemini-1.5-flash");
GenerativeModelFutures model = GenerativeModelFutures.from(gm);
Bitmap bitmap1 = BitmapFactory.decodeResource(getResources(), R.drawable.sparky);
Bitmap bitmap2 = BitmapFactory.decodeResource(getResources(), R.drawable.sparky_eats_pizza);
// Provide a prompt that includes the images specified above and text
Content prompt = new Content.Builder()
.addImage(bitmap1)
.addImage(bitmap2)
.addText("What's different between these pictures?")
.build();
// To generate text output, call generateContent with the prompt
ListenableFuture<GenerateContentResponse> response = model.generateContent(prompt);
Futures.addCallback(response, new FutureCallback<GenerateContentResponse>() {
@Override
public void onSuccess(GenerateContentResponse result) {
String resultText = result.getText();
System.out.println(resultText);
}
@Override
public void onFailure(Throwable t) {
t.printStackTrace();
}
}, executor);
Learn how to choose a Gemini model and optionally a location appropriate for your use case and app.
Generate text from text and a video
Make sure that you've completed the Before you begin section of this guide before trying this sample.
You can call the Gemini API with multimodal prompts that include both text and video file(s) (as shown in this example). For these calls, you need to use a model that supports media in prompts (like Gemini 1.5 Flash).
Make sure to review the requirements and recommendations for input files.
Choose whether you want to stream the response (generateContentStream
) or wait
for the response until the entire result is generated (generateContent
).
Streaming
You can achieve faster interactions by not waiting for the entire result from the model generation, and instead use streaming to handle partial results.
This example shows how to use
generateContentStream()
to stream generated text from a multimodal prompt request that includes text
and a single video:
Kotlin
For Kotlin, the methods in this SDK are suspend functions and need to be called from a Coroutine scope.// Initialize the Vertex AI service and the generative model
// Specify a model that supports your use case
// Gemini 1.5 models are versatile and can be used with all API capabilities
val generativeModel = Firebase.vertexAI.generativeModel("gemini-1.5-flash")
val contentResolver = applicationContext.contentResolver
contentResolver.openInputStream(videoUri).use { stream ->
stream?.let {
val bytes = stream.readBytes()
// Provide a prompt that includes the video specified above and text
val prompt = content {
inlineData(bytes, "video/mp4")
text("What is in the video?")
}
// To stream generated text output, call generateContentStream with the prompt
var fullResponse = ""
generativeModel.generateContentStream(prompt).collect { chunk ->
Log.d(TAG, chunk.text ?: "")
fullResponse += chunk.text
}
}
}
Java
For Java, the streaming methods in this SDK return aPublisher
type from the Reactive Streams library.
// Initialize the Vertex AI service and the generative model
// Specify a model that supports your use case
// Gemini 1.5 models are versatile and can be used with all API capabilities
GenerativeModel gm = FirebaseVertexAI.getInstance()
.generativeModel("gemini-1.5-flash");
GenerativeModelFutures model = GenerativeModelFutures.from(gm);
ContentResolver resolver = getApplicationContext().getContentResolver();
try (InputStream stream = resolver.openInputStream(videoUri)) {
File videoFile = new File(new URI(videoUri.toString()));
int videoSize = (int) videoFile.length();
byte[] videoBytes = new byte[videoSize];
if (stream != null) {
stream.read(videoBytes, 0, videoBytes.length);
stream.close();
// Provide a prompt that includes the video specified above and text
Content prompt = new Content.Builder()
.addInlineData(videoBytes, "video/mp4")
.addText("What is in the video?")
.build();
// To stream generated text output, call generateContentStream with the prompt
Publisher<GenerateContentResponse> streamingResponse =
model.generateContentStream(prompt);
final String[] fullResponse = {""};
streamingResponse.subscribe(new Subscriber<GenerateContentResponse>() {
@Override
public void onNext(GenerateContentResponse generateContentResponse) {
String chunk = generateContentResponse.getText();
fullResponse[0] += chunk;
}
@Override
public void onComplete() {
System.out.println(fullResponse[0]);
}
@Override
public void onError(Throwable t) {
t.printStackTrace();
}
@Override
public void onSubscribe(Subscription s) {
}
});
}
} catch (IOException e) {
e.printStackTrace();
} catch (URISyntaxException e) {
e.printStackTrace();
}
Without streaming
Alternatively, you can wait for the entire result instead of streaming; the result is only returned after the model completes the entire generation process.
This example shows how to use
generateContent()
to generate text from a multimodal prompt request that includes text and a
single video:
Kotlin
For Kotlin, the methods in this SDK are suspend functions and need to be called from a Coroutine scope.// Initialize the Vertex AI service and the generative model
// Specify a model that supports your use case
// Gemini 1.5 models are versatile and can be used with all API capabilities
val generativeModel = Firebase.vertexAI.generativeModel("gemini-1.5-flash")
val contentResolver = applicationContext.contentResolver
contentResolver.openInputStream(videoUri).use { stream ->
stream?.let {
val bytes = stream.readBytes()
// Provide a prompt that includes the video specified above and text
val prompt = content {
inlineData(bytes, "video/mp4")
text("What is in the video?")
}
// To generate text output, call generateContent with the prompt
val response = generativeModel.generateContent(prompt)
Log.d(TAG, response.text ?: "")
}
}
Java
For Java, the methods in this SDK return aListenableFuture
.
// Initialize the Vertex AI service and the generative model
// Specify a model that supports your use case
// Gemini 1.5 models are versatile and can be used with all API capabilities
GenerativeModel gm = FirebaseVertexAI.getInstance()
.generativeModel("gemini-1.5-flash");
GenerativeModelFutures model = GenerativeModelFutures.from(gm);
ContentResolver resolver = getApplicationContext().getContentResolver();
try (InputStream stream = resolver.openInputStream(videoUri)) {
File videoFile = new File(new URI(videoUri.toString()));
int videoSize = (int) videoFile.length();
byte[] videoBytes = new byte[videoSize];
if (stream != null) {
stream.read(videoBytes, 0, videoBytes.length);
stream.close();
// Provide a prompt that includes the video specified above and text
Content prompt = new Content.Builder()
.addInlineData(videoBytes, "video/mp4")
.addText("What is in the video?")
.build();
// To generate text output, call generateContent with the prompt
ListenableFuture<GenerateContentResponse> response = model.generateContent(prompt);
Futures.addCallback(response, new FutureCallback<GenerateContentResponse>() {
@Override
public void onSuccess(GenerateContentResponse result) {
String resultText = result.getText();
System.out.println(resultText);
}
@Override
public void onFailure(Throwable t) {
t.printStackTrace();
}
}, executor);
}
} catch (IOException e) {
e.printStackTrace();
} catch (URISyntaxException e) {
e.printStackTrace();
}
Learn how to choose a Gemini model and optionally a location appropriate for your use case and app.
Requirements and recommendations for input files
See Supported input files and requirements for the Vertex AI Gemini API to learn about the following:
- Different options for providing a file in a request
- Supported file types
- Supported MIME types and how to specify them
- Requirements and best practices for files and multimodal requests
What else can you do?
- Learn how to count tokens before sending long prompts to the model.
- Set up Cloud Storage for Firebase so that you can include large files in your multimodal requests and have a more managed solution for providing files in prompts. Files can include images, PDFs, video, and audio.
- Start thinking about preparing for production, including setting up Firebase App Check to protect the Gemini API from abuse by unauthorized clients.
Try out other capabilities of the Gemini API
- Build multi-turn conversations (chat).
- Generate text from text-only prompts.
- Generate structured output (like JSON) from both text and multimodal prompts.
- Use function calling to connect generative models to external systems and information.
Learn how to control content generation
- Understand prompt design, including best practices, strategies, and example prompts.
- Configure model parameters like temperature and maximum output tokens.
- Use safety settings to adjust the likelihood of getting responses that may be considered harmful.
Learn more about the Gemini models
Learn about the models available for various use cases and their quotas and pricing.Give feedback about your experience with Vertex AI in Firebase