Recognize Text in Images with ML Kit on iOS

You can use ML Kit to recognize text in images, using either an on-device model or a cloud model. See the overview to learn about the benefits of each approach.

See the ML Kit quickstart sample on GitHub for an example of this API in use, or try the codelab.

Before you begin

  1. If you have not already added Firebase to your app, do so by following the steps in the getting started guide. (A minimal configuration sketch appears after this list.)
  2. Include the ML Kit libraries in your Podfile:
    pod 'Firebase/Core'
    pod 'Firebase/MLVision'

    # If using the on-device API:
    pod 'Firebase/MLVisionTextModel'

    After you install or update your project's Pods, be sure to open your Xcode project using its .xcworkspace.
  3. In your app, import Firebase:

    Swift

    import Firebase

    Objective-C

    @import Firebase;
  4. If you want to use the cloud-based model, and you have not upgraded your project to a Blaze plan, do so in the Firebase console. Only Blaze-level projects can use the Cloud Vision APIs.
  5. If you want to use the cloud-based model, also enable the Cloud Vision API:
    1. Open the Cloud Vision API in the Cloud Console API library.
    2. Ensure that your Firebase project is selected in the menu at the top of the page.
    3. If the API is not already enabled, click Enable.
    If you want to use only the on-device model, you can skip this step.
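
If you followed the getting started guide, your app delegate already configures Firebase at launch; the detectors below depend on that call having run. For reference, a minimal configuration looks roughly like this (sketch only; your project may already contain the equivalent):

Swift

import UIKit
import Firebase

@UIApplicationMain
class AppDelegate: UIResponder, UIApplicationDelegate {
  var window: UIWindow?

  func application(_ application: UIApplication,
                   didFinishLaunchingWithOptions launchOptions: [UIApplication.LaunchOptionsKey: Any]?) -> Bool {
    // Configure Firebase before creating any ML Kit detectors.
    FirebaseApp.configure()
    return true
  }
}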

Now you are ready to start recognizing text in an image, using either the on-device model or the Cloud-based model.


On-device text recognition

To use the on-device text recognition model, run the text detector as described below.

1. Run the text detector

To recognize text in an image, pass the image as a UIImage or a CMSampleBufferRef to the VisionTextDetector's detect(in:) method:

  1. Get an instance of VisionTextDetector:

    Swift

    lazy var vision = Vision.vision()
    let textDetector = vision.textDetector()
    

    Objective-C

    FIRVision *vision = [FIRVision vision];
    FIRVisionTextDetector *textDetector = [vision textDetector];
    
  2. Create a VisionImage object using a UIImage or a CMSampleBufferRef.

    To use a UIImage:

    1. If necessary, rotate the image so that its imageOrientation property is .up (a normalization sketch follows these steps).
    2. Create a VisionImage object using the correctly-rotated UIImage. Do not specify any rotation metadata—the default value, .topLeft, must be used.

      Swift

      let image = VisionImage(image: uiImage)

      Objective-C

      FIRVisionImage *image = [[FIRVisionImage alloc] initWithImage:uiImage];
      

    To use a CMSampleBufferRef:

    1. Create a VisionImageMetadata object that specifies the orientation of the image data contained in the CMSampleBufferRef buffer (an orientation helper sketch follows these steps).

      For example, if the image data must be rotated clockwise by 90 degrees to be upright:

      Swift

      let metadata = VisionImageMetadata()
      metadata.orientation = .rightTop  // Row 0 is on the right and column 0 is on the top
      

      Objective-C

      // Row 0 is on the right and column 0 is on the top
      FIRVisionImageMetadata *metadata = [[FIRVisionImageMetadata alloc] init];
      metadata.orientation = FIRVisionDetectorImageOrientationRightTop;
      
    2. Create a VisionImage object using the CMSampleBufferRef object and the rotation metadata:

      Swift

      let image = VisionImage(buffer: bufferRef)
      image.metadata = metadata
      

      Objective-C

      FIRVisionImage *image = [[FIRVisionImage alloc] initWithBuffer:buffer];
      image.metadata = metadata;
      
  3. Then, pass the image to the detect(in:) method:

    Swift

    textDetector.detect(in: image) { features, error in
      guard error == nil, let features = features, !features.isEmpty else {
        // ...
        return
      }
    
      // ...
    }
    

    Objective-C

    [textDetector detectInImage:image
                     completion:^(NSArray<FIRVisionText *> *features,
                                  NSError *error) {
      if (error != nil) {
        return;
      } else if (features != nil) {
        // Recognized text
      }
    }];
    
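You can normalize a UIImage's orientation before creating the VisionImage (step 2 above) by redrawing it. The following is only a sketch of one common approach; adapt it to your app's image pipeline:

Swift

import UIKit

// Redraws the image so that its imageOrientation is .up, which is what the
// detector expects when no rotation metadata is supplied.
func normalizedImage(_ image: UIImage) -> UIImage {
  guard image.imageOrientation != .up else { return image }
  UIGraphicsBeginImageContextWithOptions(image.size, false, image.scale)
  defer { UIGraphicsEndImageContext() }
  image.draw(in: CGRect(origin: .zero, size: image.size))
  return UIGraphicsGetImageFromCurrentImageContext() ?? image
}

// Usage:
// let image = VisionImage(image: normalizedImage(uiImage))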

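When the image comes from the device camera as a CMSampleBufferRef, the VisionImageMetadata orientation depends on how the device is held and which camera is in use. One possible helper is sketched below; the exact mapping depends on your capture configuration, so verify it against your own setup:

Swift

import AVFoundation
import UIKit
import Firebase

// Maps the current device orientation and camera position to the orientation
// value ML Kit expects for the sample buffer. This mapping assumes a typical
// AVCaptureSession setup; adjust it if your buffers are already rotated.
func visionImageOrientation(
    deviceOrientation: UIDeviceOrientation,
    cameraPosition: AVCaptureDevice.Position) -> VisionDetectorImageOrientation {
  switch deviceOrientation {
  case .portrait:
    return cameraPosition == .front ? .leftTop : .rightTop
  case .landscapeLeft:
    return cameraPosition == .front ? .bottomLeft : .topLeft
  case .portraitUpsideDown:
    return cameraPosition == .front ? .rightBottom : .leftBottom
  case .landscapeRight:
    return cameraPosition == .front ? .topRight : .bottomRight
  default:
    // Fallback for .faceUp, .faceDown, and .unknown.
    return .leftTop
  }
}

// Usage:
// let metadata = VisionImageMetadata()
// metadata.orientation = visionImageOrientation(
//     deviceOrientation: UIDevice.current.orientation,
//     cameraPosition: .back)
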
2. Extract text from blocks of recognized text

If the text recognition operation succeeds, it will return an array of VisionText objects. Each VisionText object represents a rectangular block of text, a line of text, or a single word-like element of text.

For each VisionText, you can get the bounding coordinates of the block and the text contained in the block:

Swift

for feature in features {
  let value = feature.text
  let corners = feature.cornerPoints
}

Objective-C

for (id <FIRVisionText> feature in features) {
  NSString *value = feature.text;
  NSArray<NSValue *> *corners = feature.cornerPoints;
}

In addition, if the VisionText is a VisionTextBlock, you can get the lines of text that make up the block, and if it's a VisionTextLine, you can get the elements that make up each line of text:

Swift

// Blocks contain lines of text
if let block = feature as? VisionTextBlock {
  for line in block.lines {
    // ...
    for element in line.elements {
      // ...
    }
  }
}

// Lines contain text elements
else if let line = feature as? VisionTextLine {
  for element in line.elements {
    // ...
  }
}

// Text elements are typically words
else if let element = feature as? VisionTextElement {
  // ...
}

Objective-C

// Blocks contain lines of text
if ([feature isKindOfClass:[FIRVisionTextBlock class]]) {
  FIRVisionTextBlock *block = (FIRVisionTextBlock *)feature;
  for (FIRVisionTextLine *line in block.lines) {
    // ...
    for (FIRVisionTextElement *element in line.elements) {
      // ...
    }
  }
}

// Lines contain text elements
else if ([feature isKindOfClass:[FIRVisionTextLine class]]) {
  FIRVisionTextLine *line = (FIRVisionTextLine *)feature;
  for (FIRVisionTextElement *element in line.elements) {
    // ...
  }
}

// Text elements are typically words
else if ([feature isKindOfClass:[FIRVisionTextElement class]]) {
  // ...
}
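
For instance, to pull everything the detector found into a single string, walk each block's lines. The following is a minimal sketch that assumes `features` is the array delivered to the detect(in:) completion handler above:

Swift

// Collect the text of every line in every block, falling back to the
// feature's own text for lines or elements returned at the top level.
var recognizedLines = [String]()
for feature in features {
  if let block = feature as? VisionTextBlock {
    recognizedLines.append(contentsOf: block.lines.map { $0.text })
  } else {
    recognizedLines.append(feature.text)
  }
}
let fullText = recognizedLines.joined(separator: "\n")

// Each feature's cornerPoints (NSValue-wrapped CGPoints) can be used to draw
// a highlight over the source image, for example with a CAShapeLayer.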

Cloud text recognition

To use the Cloud-based text recognition model, configure and run the text detector as described below.

1. Configure the text detector

By default, the Cloud detector uses the stable version of the model. If you want to use the latest version of the model instead, create a VisionCloudDetectorOptions object as in the following example:

Swift

let options = VisionCloudDetectorOptions()
options.modelType = .latest
// options.maxResults has no effect with this API

Objective-C

FIRVisionCloudDetectorOptions *options =
    [[FIRVisionCloudDetectorOptions alloc] init];
options.modelType = FIRVisionCloudModelTypeLatest;

In the next step, pass the VisionCloudDetectorOptions object when you create the Cloud detector object.

2. Run the text detector

To recognize text in an image, create a VisionCloudTextDetector object, or, if the image is a document, a VisionCloudDocumentTextDetector object (a document-detector sketch follows these steps). Then, pass the image as a UIImage or a CMSampleBufferRef to the detect(in:) method:

  1. Get an instance of VisionCloudTextDetector or VisionCloudDocumentTextDetector:

    Swift

    lazy var vision = Vision.vision()
    let cloudDetector = vision.cloudTextDetector(options: options)
    // Or, to use the default settings:
    // let cloudDetector = vision.cloudTextDetector()
    

    Objective-C

    FIRVision *vision = [FIRVision vision];
    FIRVisionCloudTextDetector *textDetector =
        [vision cloudTextDetectorWithOptions:options];
    // Or, to use the default settings:
    // FIRVisionCloudTextDetector *textDetector = [vision cloudTextDetector];
    
  2. Create a VisionImage object using a UIImage or a CMSampleBufferRef.

    To use a UIImage:

    1. If necessary, rotate the image so that its imageOrientation property is .up.
    2. Create a VisionImage object using the correctly-rotated UIImage. Do not specify any rotation metadata—the default value, .topLeft, must be used.

      Swift

      let image = VisionImage(image: uiImage)

      Objective-C

      FIRVisionImage *image = [[FIRVisionImage alloc] initWithImage:uiImage];
      

    To use a CMSampleBufferRef:

    1. Create a VisionImageMetadata object that specifies the orientation of the image data contained in the CMSampleBufferRef buffer.

      For example, if the image data must be rotated clockwise by 90 degrees to be upright:

      Swift

      let metadata = VisionImageMetadata()
      metadata.orientation = .rightTop  // Row 0 is on the right and column 0 is on the top
      

      Objective-C

      // Row 0 is on the right and column 0 is on the top
      FIRVisionImageMetadata *metadata = [[FIRVisionImageMetadata alloc] init];
      metadata.orientation = FIRVisionDetectorImageOrientationRightTop;
      
    2. Create a VisionImage object using the CMSampleBufferRef object and the rotation metadata:

      Swift

      let image = VisionImage(buffer: bufferRef)
      image.metadata = metadata
      

      Objective-C

      FIRVisionImage *image = [[FIRVisionImage alloc] initWithBuffer:buffer];
      image.metadata = metadata;
      
  3. Then, pass the image to the detect(in:) method:

    Swift

    cloudDetector.detect(in: image) { text, error in
      guard error == nil, let text = text else {
        // ...
        return
      }
    
      // Recognized and extracted text
      // ...
    }

    Objective-C

    [textDetector detectInImage:image
                     completion:^(FIRVisionCloudText *cloudText,
                                  NSError *error) {
      if (error != nil) {
        return;
      } else if (cloudText != nil) {
        // Recognized text
      }
    }];
    
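If the image contains dense text, such as a scanned page or other document, the cloud document text detector can give better results than the regular cloud text detector. The following is a minimal sketch, reusing vision, options, and image from the steps above and assuming the document detector follows the same detect(in:) pattern and also delivers a VisionCloudText to its completion handler:

Swift

// Sketch: document text recognition with the Cloud-based document detector.
let documentDetector = vision.cloudDocumentTextDetector()
// Or, with options: vision.cloudDocumentTextDetector(options: options)

documentDetector.detect(in: image) { cloudText, error in
  guard error == nil, let cloudText = cloudText else {
    // Handle the error.
    return
  }
  // Recognized document text.
  let recognizedText = cloudText.text
  // ...
}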

3. Extract text from blocks of recognized text

If the text recognition operation succeeds, a VisionCloudText object will be passed to the completion handler. This object contains the text that was recognized in the image.

For example:

Swift

let recognizedText = cloudText.text

Objective-C

NSString *recognizedText = cloudText.text;

You can also get information about the structure of the text. The text is organized into pages, blocks, paragraphs, words, and symbols. For each unit of organization, you can get information such as its dimensions and the languages it contains.

For example:

Swift

for page in cloudText.pages {
  let width = page.width
  let height = page.height
  let langs = page.textProperty?.detectedLanguages
  if let blocks = page.blocks {
    for block in blocks {
      let blockFrame = block.frame
    }
  }
}

Objective-C

for (FIRVisionCloudPage *page in cloudText.pages) {
  int width = [page.width intValue];
  int height = [page.height intValue];
  NSArray<FIRVisionCloudDetectedLanguage *> *langs = page.textProperty.detectedLanguages;
  for (FIRVisionCloudBlock *block in page.blocks) {
    CGRect frame = block.frame;
    // etc.
  }
}
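
If you need finer granularity than blocks, the same pattern continues down the hierarchy: blocks contain paragraphs, paragraphs contain words, and words contain symbols. The sketch below is a rough illustration that assumes those child collections are exposed as optional arrays, mirroring page.blocks above; confirm the exact property names against the SDK headers for your version:

Swift

// Rough sketch: rebuild the text of a single block from its paragraphs,
// words, and symbols. Property names are assumed; see the note above.
func text(of block: VisionCloudBlock) -> String {
  var words = [String]()
  for paragraph in block.paragraphs ?? [] {
    for word in paragraph.words ?? [] {
      var letters = ""
      for symbol in word.symbols ?? [] {
        letters += symbol.text ?? ""
      }
      words.append(letters)
    }
  }
  return words.joined(separator: " ")
}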
