Extend the Oracle Digital Assistant with OCI AI Vision!

[Header image: a woman wearing virtual reality goggles. Photo by Michelangelo Buonarroti on Pexels.com]

The Oracle Cloud Infrastructure (OCI) AI Vision service is a fantastic complement to the Oracle Digital Assistant (ODA).

OCI AI Vision is a serverless, multi-tenant service, accessible using the Console, REST APIs, SDK, or CLI.

You can upload images to detect and classify objects in them. Vision’s features are thematically split between Document AI for document-centric images, and Image Analysis for object and scene-based images. Pretrained models and custom models are supported.

This service offers two main modes.

Document AI allows us to detect and recognize text in a document, identify the type of document (passport, invoice, payslip, etc.), extract its contents, and more.

Image Analysis AI detects objects and their locations in an image, such as people, faces, cars, trees, or dogs.

Use Cases

There are many good use cases we could implement by extending the assistant with the Vision service’s AI capabilities. Probably the most obvious is the extraction of data from an invoice or receipt, but there are more:

1. Extraction of data from an invoice or receipt

2. Upload ID/Passport for authentication or for data extraction

3. Upload a picture of meter readings for automatic value recognition

4. Create a new lead by uploading a picture of a business card

Do you have other potential use cases? I would be happy to hear about them – leave a comment below 🙂

Let’s explore a couple of use cases where we can use both Document AI and Image Analysis AI.

Example use case #1: Document AI

Imagine an assistant that allows users to update their company profile, including updating documents like an ID or driver’s license. The assistant can ask the user to upload a document, validate the type of the uploaded document, and extract data from it.

In order to do this we need the Document AI service. Check here for the service overview and here for the API details.

Vision can detect and recognize text in a document. Language classification identifies the language of a document, then Optical Character Recognition (OCR) draws bounding boxes around the printed or hand-written text it locates in an image, and digitizes the text.

How to implement this? First and foremost, you should acquaint yourself with the Vision API and decide which of the available approaches to use (REST or one of the SDKs).

To cover both approaches, in this implementation we choose REST to identify the document type and the JavaScript SDK to extract information from the document.

This assistant helps the user update their profile by asking them to upload a driver’s license.

The dialog flow calls a Vision REST API to identify the document type. It will only process a driver’s license.

Then it uses a custom component where the Vision SDK is called to extract data from the uploaded document.

The AI Vision API

Let’s try to understand the Vision API a bit better.

As mentioned before, there are two main modes, Document AI and Image AI. In both, we need to send the image/file for processing.

Where the file is stored is defined in the request, and the allowed values are:

  • INLINE: The data is included directly in the request payload.
  • OBJECT_STORAGE: The document is in OCI Object Storage.
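
For example, a request that references a file in Object Storage would describe the document roughly like this (a sketch based on the Vision API’s Object Storage source; the namespace, bucket, and object names are placeholders):

"document": {
  "source": "OBJECT_STORAGE",
  "namespaceName": "my-namespace",
  "bucketName": "my-bucket",
  "objectName": "uploads/license.jpg"
}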

In ODA, when the user uploads an image/file, it is stored internally in the attachment server. To see how we can access the uploaded document from a custom component, check this post.

You can copy that file to Object Storage and use that reference when calling the Vision service, or you can encode the file as Base64 and pass it INLINE – the approach followed in this example.
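
As a rough illustration, downloading the attachment and Base64-encoding it inside a custom component could look like the sketch below (it assumes the attachment URL and authorization token are obtained as described in the post linked above, and uses node-fetch v2):

const fetch = require("node-fetch");

// Download the uploaded attachment from the ODA attachment server
// and return its contents as a Base64-encoded string.
async function attachmentToBase64(attachmentUrl, authToken) {
  const response = await fetch(attachmentUrl, {
    headers: { Authorization: `Bearer ${authToken}` }
  });
  if (!response.ok) {
    throw new Error(`Attachment download failed: ${response.status}`);
  }
  const buffer = await response.buffer(); // node-fetch v2 API
  return buffer.toString("base64");
}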

The implementation

From a dialog flow point of view, we want to:

  • Read the uploaded document (it is stored in the ODA internal attachment server)
  • Convert the file to Base64 with a custom component
  • Use the REST service to identify the document type, as it does not require any complex post-processing (check this post on how to configure the Vision REST API in ODA)
  • Switch based on the identified document type

Below is the request payload of REST_Vision (the state name from the above dialog flow), which takes the Base64 value (stored in a variable called b64) and the OCID of the compartment to use.

{
  "features": [
    {
      "featureType": "DOCUMENT_CLASSIFICATION"
    }
  ],
  "document": {
    "source": "INLINE",
    "data": "${b64.value}"
  },
  "compartmentId": "ocid1.compartment.oc1..aaaaaaaavrseiwy3uhon3lkj5ysalqxhezfr3pkkwdgunwkqdumd2pxplpja"
}
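
The classification result is what the dialog flow switches on. For illustration only, the relevant part of the response looks roughly like this (field names as I understand the AnalyzeDocument response; the confidence values are invented):

{
  "detectedDocumentTypes": [
    { "documentType": "DRIVER_LICENSE", "confidence": 0.97 },
    { "documentType": "PASSPORT", "confidence": 0.02 }
  ]
}

The dialog flow can then switch on the highest-confidence documentType.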

Now that we have validated the document type, we are ready to read its contents. This time we will use a custom component that makes use of the Vision JavaScript SDK, which makes more sense as we need a bit more post-processing: we have to match the driver’s license labels and values, which will be returned to the dialog flow as variables.

// Request details for the Vision SDK: classify the document,
// run OCR on it, and hint that we expect a driver's license.
const analyzeDocumentDetails = {
  features: [
    {
      featureType: "DOCUMENT_CLASSIFICATION",
      maxResults: 491
    },
    {
      featureType: "TEXT_DETECTION"
    }
  ],
  document: {
    source: "INLINE",
    data: base64
  },
  language: "ENG",
  documentType: "DRIVER_LICENSE"
};

Note that again, the base64 value is passed to the Vision request.
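
For completeness, invoking the SDK from the custom component could look roughly like this (a sketch assuming the oci-aivision and oci-common packages and config-file authentication; adapt the authentication to your environment):

const common = require("oci-common");
const aivision = require("oci-aivision");

// Create a Vision client with config-file authentication and send
// the analyzeDocument request built above.
async function callVision(analyzeDocumentDetails) {
  const provider = new common.ConfigFileAuthenticationDetailsProvider();
  const client = new aivision.AIServiceVisionClient({
    authenticationDetailsProvider: provider
  });
  const response = await client.analyzeDocument({ analyzeDocumentDetails });
  return response.analyzeDocumentResult;
}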

If the request is successful, we show the formatted values to the user (state sucessMessage).

All the variables are set in the custom component.

Document Type: ${switch.value}
State: ${state.value}
ID: ${id.value}
Date of Birth: ${dob.value}
Document Expiration Date: ${exp.value}
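
For reference, the label/value matching in the custom component could be sketched as below (the response shape and the labels to search for are assumptions for illustration; a production component would need more robust matching):

// Inside the custom component, after callVision(...) returned `result`.
// Find the OCR line that contains a given label and return the remainder.
function extractField(lines, label) {
  const line = lines.find((l) => l.text.toUpperCase().includes(label));
  return line ? line.text.replace(new RegExp(label, "i"), "").trim() : null;
}

// pages[0].lines is assumed to hold the line-level OCR output.
const lines = result.pages[0].lines;
context.setVariable("dob", extractField(lines, "DOB"));
context.setVariable("exp", extractField(lines, "EXP"));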

Then we have a resolve composite bag state to handle the Yes/No question back to the user. As this is a demo, we do not act on the answer; we simply end the dialog.

Example use case #2: Image AI

Let’s imagine a use case where the user can update their profile, including uploading a new picture. We can have the Vision Image Analysis AI inspect the picture to ensure it conforms to our requirements.

For a profile picture we want a single, clear, centered face; if there are other objects or multiple faces, we should identify the situation and tell the user what needs to be corrected.

The technical details are very similar to the previous approach, the main difference being the use of the JavaScript SDK instead of the REST API. This gives us more power in post-processing the service response, which this use case needs. We want to verify that there is a face in the picture, and we need to crop the image so that it fits profile picture standards (in terms of dimensions). We also need to make decisions when there are multiple faces in the picture, when the person is wearing sunglasses, when the face is not visible, and so on. There is more logic to cover, and this is best done in the context of an ODA custom component, where we have an entire set of packages and modules to help us along the way.
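
As a minimal sketch of the core check (reusing a Vision client like the one from the previous example; the FACE_DETECTION feature type follows the same pattern as the features used above, and the response field names are assumptions for illustration):

// Inside the custom component's async invoke method.
// Ask Image Analysis for faces and branch on how many were found.
const analyzeImageDetails = {
  features: [{ featureType: "FACE_DETECTION" }],
  image: { source: "INLINE", data: base64 }
};

const response = await client.analyzeImage({ analyzeImageDetails });
const faces = response.analyzeImageResult.detectedFaces || [];

if (faces.length === 0) {
  context.reply("I could not find a clear face in the picture. Please try another one.");
} else if (faces.length > 1) {
  context.reply("I found several faces. Please pick one or upload a new picture.");
}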

Always consider beforehand the level of logic you need to handle the API response; that will indicate which approach to use (the REST connector in the dialog flow vs. a custom component).

Here is a video where the chatbot handles several scenarios:

  • The user does not provide a clear face in the picture.
  • There is a person, but no clear face.
  • Too many faces.
  • Multiple faces, but the user can pick one.
  • One face is not exactly human.
  • One cartoon face.

This is a real-life example of my dog Rocky!

Conclusion

Granted, the above use case might not be the most obvious one, but it goes to show how powerful combining AI Vision with ODA can be. The ability to read text and to identify words, objects, and their positions really goes a long way toward adding a layer of documents and images to a conversation.

Note

It seems there was some renaming in the meantime: the Vision service was split in two, and the Document AI part is now called the Document Understanding service.

Note 2

This post was simultaneously published on the Oracle Digital Assistant blog as well.