When calling the Vertex AI Gemini API from your app using a Vertex AI in Firebase SDK, you can prompt the Gemini model to generate text based on a multimodal input. Multimodal prompts can include multiple modalities (or types of input), like text along with images, PDFs, video, and audio.
For the non-text parts of the input (like media files), you need to use supported file types, specify a supported MIME type, and make sure that your files and multimodal requests meet the requirements and follow best practices.
This page describes the supported MIME types, best practices, and limitations for images, video, audio, and documents (like PDFs).
Requirements specific to the Vertex AI in Firebase SDKs
For Vertex AI in Firebase SDKs, the maximum total request size is 20 MB. You get an HTTP 413 error if a request is too large.
If a file's size will make the total request size exceed 20 MB, then use a Cloud Storage for Firebase URL to include the file in your multimodal request.
If a file is small, you can often pass it directly as inline data. Note, though, that a file provided as inline data is encoded to base64 in transit, which increases the size of the request. For examples showing how to include files as inline data, see Generate text from multimodal prompts using the Gemini API.
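As a rough illustration of the base64 overhead described above, the following sketch estimates whether a file would still fit in a request once encoded. The helper names are hypothetical, and reading "20 MB" as 20 * 1024 * 1024 bytes is an assumption, since this page does not say whether the limit uses decimal or binary megabytes.

```python
# Assumption: "20 MB" means 20 * 1024 * 1024 bytes. The page does not specify
# whether the limit is counted in decimal or binary megabytes.
MAX_REQUEST_BYTES = 20 * 1024 * 1024


def base64_size(raw_bytes: int) -> int:
    """Length in bytes of the padded base64 encoding of `raw_bytes` bytes."""
    # Base64 turns every group of 3 input bytes (rounded up) into 4 output bytes.
    return 4 * ((raw_bytes + 2) // 3)


def fits_inline(file_bytes: int, other_request_bytes: int = 0) -> bool:
    """True if the encoded file plus the rest of the request stays within the limit."""
    return base64_size(file_bytes) + other_request_bytes <= MAX_REQUEST_BYTES
```

Under these assumptions, a 15 MiB file already encodes to the full 20 MiB, leaving no room for the text part of the prompt, so a Cloud Storage for Firebase URL is likely the safer choice well before a file reaches that size.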
Images: Requirements, best practices, and limitations
Images: Requirements
In this section, learn about the supported MIME types and limits per request for images.
Supported MIME types
Gemini multimodal models support the following image MIME types:
Image MIME type | Gemini 1.5 Flash | Gemini 1.5 Pro | Gemini 1.0 Pro Vision |
---|---|---|---|
PNG - image/png | | | |
JPEG - image/jpeg | | | |
Limits per request
There isn't a specific limit to the number of pixels in an image. However, larger images are scaled down and padded to fit a maximum resolution of 3072 x 3072 while preserving their original aspect ratio.
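To see what scaling to the 3072 x 3072 maximum means for a given image, here is a minimal sketch that computes the scaled dimensions while preserving the aspect ratio. The function name is hypothetical, and rounding to the nearest pixel is an assumption; the service may round differently, and padding is not modeled.

```python
MAX_SIDE = 3072  # maximum width and height stated above


def scaled_size(width: int, height: int) -> tuple[int, int]:
    """Dimensions after scaling an image down (never up) to fit MAX_SIDE x MAX_SIDE."""
    scale = min(1.0, MAX_SIDE / width, MAX_SIDE / height)
    # Assumption: round to the nearest pixel and keep each side at least 1 pixel.
    return max(1, round(width * scale)), max(1, round(height * scale))
```

For example, a 6144 x 3072 image would be scaled by 0.5, while a 1000 x 500 image is already within the maximum and keeps its size.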
Here's the maximum number of image files allowed in a prompt request:
- Gemini 1.0 Pro Vision: 16 images
- Gemini 1.5 Flash and Gemini 1.5 Pro: 3000 images
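The limits above can be checked on the client before sending a request. The following is only a sketch: the function name is hypothetical, and the dictionary keys are display names taken from this page, not official model identifiers.

```python
# Per-request image limits from the list above. The keys are illustrative
# labels taken from this page, not official model identifiers.
MAX_IMAGES = {
    "Gemini 1.0 Pro Vision": 16,
    "Gemini 1.5 Flash": 3000,
    "Gemini 1.5 Pro": 3000,
}
SUPPORTED_IMAGE_MIME_TYPES = {"image/png", "image/jpeg"}


def check_images(model: str, mime_types: list[str]) -> list[str]:
    """Return a list of problems found; an empty list means no problem was detected."""
    problems = []
    limit = MAX_IMAGES[model]
    if len(mime_types) > limit:
        problems.append(f"{len(mime_types)} images exceeds the limit of {limit} for {model}")
    for mime in mime_types:
        if mime not in SUPPORTED_IMAGE_MIME_TYPES:
            problems.append(f"unsupported image MIME type: {mime}")
    return problems
```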
Images: Tokenization
Here's how tokens are calculated for images:
- Gemini 1.0 Pro Vision: Each image accounts for 258 tokens.
- Gemini 1.5 Flash and Gemini 1.5 Pro:
  - If both dimensions of an image are less than or equal to 384 pixels, then 258 tokens are used.
  - If one dimension of an image is greater than 384 pixels, then the image is cropped into tiles. Each tile size defaults to the smallest dimension (width or height) divided by 1.5. If necessary, each tile is adjusted so that it's not smaller than 256 pixels and not greater than 768 pixels. Each tile is then resized to 768x768 and uses 258 tokens.
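The tiling rules above for Gemini 1.5 Flash and Gemini 1.5 Pro can be sketched as a rough estimator. The function name is hypothetical. This page does not say how many tiles are cut from an image, so the sketch assumes a grid that covers the image, with the count rounded up in each direction; treat the result as an estimate rather than a billing figure.

```python
import math

TOKENS_PER_TILE = 258


def image_tokens_gemini_1_5(width: int, height: int) -> int:
    """Estimate image tokens for Gemini 1.5 Flash and Gemini 1.5 Pro."""
    if width <= 384 and height <= 384:
        return TOKENS_PER_TILE
    # Tile size: smallest dimension / 1.5, clamped to the range [256, 768].
    tile = max(256.0, min(768.0, min(width, height) / 1.5))
    # Assumption: tiles form a grid covering the whole image, so the count is
    # rounded up per dimension. The page does not state the tile count.
    tiles = math.ceil(width / tile) * math.ceil(height / tile)
    return tiles * TOKENS_PER_TILE
```

Under that assumption, a 960 x 540 image gets a tile size of 360 and a 3 x 2 grid, for 6 * 258 = 1548 tokens.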
Images: Best practices
When using images, use the following best practices and information for the best results:
- If you want to detect text in an image, use prompts with a single image to produce better results than prompts with multiple images.
- If your prompt contains a single image, place the image before the text prompt in your request.
- If your prompt contains multiple images, and you want to refer to them later in your prompt or have the model refer to them in the model response, it can help to give each image an index before the image. Use a, b, c or image 1, image 2, image 3 for your index. The following is an example of using indexed images in a prompt:
  image 1 [first image]
  image 2 [second image]
  image 3 [third image]
  Write a blogpost about my day using image 1 and image 2. Then, give me ideas for tomorrow based on image 3.
- Use images with higher resolution; they yield better results.
- Include a few examples in the prompt.
- Rotate images to their proper orientation before adding them to the prompt.
- Avoid blurry images.
Images: Limitations
While Gemini multimodal models are powerful in many multimodal use cases, it's important to understand the limitations of the models:
- Content moderation: The models refuse to provide answers on images that violate our safety policies.
- Spatial reasoning: The models aren't precise at locating text or objects in images. They might return only approximate counts of objects.
- Medical uses: The models aren't suitable for interpreting medical images (for example, x-rays and CT scans) or providing medical advice.
- People recognition: The models aren't meant to be used to identify people who aren't celebrities in images.
- Accuracy: The models might hallucinate or make mistakes when interpreting low-quality, rotated, or extremely low-resolution images. The models might also hallucinate when interpreting handwritten text in image documents.
Video: Requirements, best practices, and limitations
Video: Requirements
In this section, learn about the supported MIME types and limits per request for video.
Supported MIME types
Gemini multimodal models support the following video MIME types:
Video MIME type | Gemini 1.5 Flash | Gemini 1.5 Pro | Gemini 1.0 Pro Vision |
---|---|---|---|
FLV - video/x-flv | | | |
MOV - video/mov | | | |
MPEG - video/mpeg | | | |
MPEGPS - video/mpegps | | | |
MPG - video/mpg | | | |
MP4 - video/mp4 | | | |
WEBM - video/webm | | | |
WMV - video/wmv | | | |
3GPP - video/3gpp | | | |
Limits per request
Here's the maximum number of video files allowed in a prompt request:
- Gemini 1.0 Pro Vision: 1 video file
- Gemini 1.5 Flash and Gemini 1.5 Pro: 10 video files
Video: Tokenization
Here's how tokens are calculated for video:
- All Gemini multimodal models: Videos are sampled at 1 frame per second (fps). Each video frame accounts for 258 tokens.
- Gemini 1.5 Flash and Gemini 1.5 Pro: The audio track is encoded with video frames. The audio track is also broken down into 1-second chunks that each account for 32 tokens. The video frame and audio tokens are interleaved together with their timestamps. The timestamps are represented as 7 tokens.
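The per-second figures above can be combined into a rough token estimate for a video. This is a sketch with a hypothetical function name and two assumptions that this page does not confirm: partial seconds are rounded up, and one 7-token timestamp accompanies each sampled second.

```python
import math

FRAME_TOKENS = 258            # per sampled frame (1 frame per second)
AUDIO_TOKENS_PER_SECOND = 32  # Gemini 1.5 Flash and Gemini 1.5 Pro only
TIMESTAMP_TOKENS = 7          # assumption: one timestamp per sampled second


def estimate_video_tokens(duration_seconds: float, gemini_1_5: bool = True) -> int:
    """Rough token estimate for a video of the given length."""
    seconds = math.ceil(duration_seconds)  # assumption: partial seconds round up
    per_second = FRAME_TOKENS
    if gemini_1_5:
        per_second += AUDIO_TOKENS_PER_SECOND + TIMESTAMP_TOKENS
    return seconds * per_second
```

With these assumptions, each second of video costs 258 + 32 + 7 = 297 tokens on the Gemini 1.5 models, so a 60-second clip comes to about 17,820 tokens.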
Video: Best practices
When using video, use the following best practices and information for the best results:
- If your prompt contains a single video, place the video before the text prompt.
- If you need timestamp localization in a video with audio, ask the model to generate timestamps in the MM:SS format, where the first two digits represent minutes and the last two digits represent seconds. Use the same format for questions that ask about a timestamp.
Note the following if you're using Gemini 1.0 Pro Vision:
- Use no more than one video per prompt.
- The model only processes the information in the first two minutes of the video.
- The model processes videos as non-contiguous image frames from the video. Audio isn't included. If you notice the model missing some content from the video, try making the video shorter so that the model captures a greater portion of the video content.
- The model does not process any audio information or timestamp metadata. Because of this, the model might not perform well in use cases that require audio input, such as captioning audio, or time-related information, such as speed or rhythm.
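For the MM:SS timestamp format recommended above, the following sketch shows one way to build timestamps for a prompt and to convert timestamps in a model response back to seconds. Both helper names are hypothetical.

```python
def to_mmss(total_seconds: int) -> str:
    """Format a non-negative number of seconds as MM:SS."""
    if total_seconds < 0:
        raise ValueError("total_seconds must be non-negative")
    minutes, seconds = divmod(total_seconds, 60)
    return f"{minutes:02d}:{seconds:02d}"


def from_mmss(timestamp: str) -> int:
    """Convert an MM:SS timestamp to a number of seconds."""
    minutes, seconds = timestamp.strip().split(":")
    return int(minutes) * 60 + int(seconds)
```

Note that a video longer than 99:59 no longer fits in two minute digits; this sketch simply lets the minutes field grow in that case.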
Video: Limitations
While Gemini multimodal models are powerful in many multimodal use cases, it's important to understand the limitations of the models:
- Content moderation: The models refuse to provide answers on videos that violate our safety policies.
- Non-speech sound recognition: The models that support audio might make mistakes recognizing sound that's not speech.
- High-speed motion: The models might make mistakes understanding high-speed motion in video due to the fixed 1 frame per second (fps) sampling rate.
- Transcription punctuation: (if using Gemini 1.5 Flash) The models might return transcriptions that don't include punctuation.
Audio: Requirements and limitations
Audio: Requirements
In this section, learn about the supported MIME types and limits per request for audio.
Supported MIME types
Gemini multimodal models support the following audio MIME types:
Audio MIME type | Gemini 1.5 Flash | Gemini 1.5 Pro |
---|---|---|
AAC - audio/aac | | |
FLAC - audio/flac | | |
MP3 - audio/mp3 | | |
MPA - audio/m4a | | |
MPEG - audio/mpeg | | |
MPGA - audio/mpga | | |
MP4 - audio/mp4 | | |
OPUS - audio/opus | | |
PCM - audio/pcm | | |
WAV - audio/wav | | |
WEBM - audio/webm | | |
Limits per request
You can include a maximum of
Audio: Limitations
While Gemini multimodal models are powerful in many multimodal use cases, it's important to understand the limitations of the models:
- Non-speech sound recognition: The models that support audio might make mistakes recognizing sound that's not speech.
- Audio-only timestamps: The models that support audio can't accurately generate timestamps for requests with audio files. This includes segmentation and temporal localization timestamps. Timestamps can be generated accurately for input that includes a video that contains audio.
- Transcription punctuation: (if using Gemini 1.5 Flash) The models might return transcriptions that don't include punctuation.
Documents (like PDFs): Requirements, best practices, and limitations
Documents: Requirements
In this section, learn about the supported MIME types and limits per request for documents (like PDFs).
Supported MIME types
Gemini multimodal models support the following document MIME types:
Document MIME type | Gemini 1.5 Flash | Gemini 1.5 Pro | Gemini 1.0 Pro Vision |
---|---|---|---|
PDF - application/pdf | | | |
Limits per request
PDFs are treated as images, so a single page of a PDF is treated as one image. The number of pages allowed in a prompt is limited to the number of images the model can support:
- Gemini 1.0 Pro Vision: 16 pages
- Gemini 1.5 Pro and Gemini 1.5 Flash: 1000 pages
Documents: Tokenization
PDFs are treated as images, so each page of a PDF is tokenized in the same way as an image.
Also, the cost for PDFs follows Gemini image pricing. For example, if you include a two-page PDF in a Gemini API call, you incur an input fee for processing two images.
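Because each PDF page is tokenized like an image, and every image uses at least 258 tokens, a lower bound on the token count follows directly from the page count. The sketch below uses a hypothetical function name, and its dictionary keys are display names from this page rather than official model identifiers. It returns a minimum only: on Gemini 1.5 Flash and Gemini 1.5 Pro, a page larger than 384 pixels in either dimension is tiled and uses more tokens.

```python
# Page limits from the "Limits per request" list. The keys are illustrative
# labels taken from this page, not official model identifiers.
MAX_PDF_PAGES = {
    "Gemini 1.0 Pro Vision": 16,
    "Gemini 1.5 Flash": 1000,
    "Gemini 1.5 Pro": 1000,
}
MIN_TOKENS_PER_PAGE = 258  # each page is one image; an image uses at least 258 tokens


def min_pdf_tokens(model: str, page_count: int) -> int:
    """Lower bound on input tokens for a PDF with `page_count` pages."""
    limit = MAX_PDF_PAGES[model]
    if page_count > limit:
        raise ValueError(f"{page_count} pages exceeds the limit of {limit} for {model}")
    return page_count * MIN_TOKENS_PER_PAGE
```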
Documents: Best practices
When using PDFs, use the following best practices and information for the best results:
- If your prompt contains a single PDF, place the PDF before the text prompt in your request.
- If you have a long document, consider splitting it into multiple PDFs to process it.
- Use PDFs created with text rendered as text instead of using text in scanned images. This format ensures text is machine-readable so that it's easier for the model to edit, search, and manipulate compared to scanned image PDFs. This practice provides optimal results when working with text-heavy documents like contracts.
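For the suggestion above to split a long document into multiple PDFs, a first step is deciding which pages go into each part. This sketch only computes the page ranges; the function name is hypothetical, and the actual splitting would need a PDF library of your choice.

```python
def split_page_ranges(total_pages: int, pages_per_part: int) -> list[tuple[int, int]]:
    """Return 1-based, inclusive (first_page, last_page) ranges covering the document."""
    if total_pages < 0 or pages_per_part < 1:
        raise ValueError("total_pages must be >= 0 and pages_per_part must be >= 1")
    return [
        (start, min(start + pages_per_part - 1, total_pages))
        for start in range(1, total_pages + 1, pages_per_part)
    ]
```

For example, a 2500-page document split into parts of at most 1000 pages yields the ranges 1 to 1000, 1001 to 2000, and 2001 to 2500.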
Documents: Limitations
While Gemini multimodal models are powerful in many multimodal use cases, it's important to understand the limitations of the models:
- Spatial reasoning: The models aren't precise at locating text or objects in PDFs. They might return only approximate counts of objects.
- Accuracy: The models might hallucinate when interpreting handwritten text in PDF documents.