OCI AI Speech Service - Learn how to transcribe voice

Another OCI AI Service is fresh out of the oven. This time it is OCI Speech, which allows us “to easily convert audio files containing human speech into highly accurate text transcription”.

Pretty cool, right? Unless you are a professional stenographer, I guess!

What can the service do?

The documentation says it all:

Accurate transcriptions—Produces an accurate and easy to use JSON file written directly to the Object Storage bucket you choose. You can take advantage of the transcription and integrate it directly with applications, and use it for subtitles or content search and analysis.
Time stamped JSON—The transcription provides a timestamp for each token (word). You can use the timestamp to search and find the text you are looking for within the audio file then quickly jump to that location.
Multilingual—Produces accurate transcriptions in English, Spanish, and Portuguese.
Asynchronous API—Straightforward asynchronous APIs with transcription task batching. The APIs allow you to cancel jobs that are not yet processed saving you time and money.
Text normalizations—Provides text normalizations for numbers, addresses, currencies, and so on. With text normalizations, you get a higher-quality transcription from artificial intelligence that is easier to read and understand.
Profanity filtering—Allows you to remove, mask, or tag words that are offensive from the transcription.
Confidence score per word and transcription—Produces word and transcription confidence scores on the generated JSON file. You can use the confidence scores to quickly identify words that require your attention.
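
To make the timestamp and confidence features a bit more concrete, here is a minimal Python sketch. It assumes tokens shaped like the ones in the sample transcript at the end of this post (string values such as "3.936s" and "0.69"). The helper names find_word and low_confidence, and the 0.5 threshold, are my own choices for illustration and not part of the service.

```python
def find_word(tokens, word):
    """Return the start time (in seconds) of every occurrence of `word`."""
    return [
        float(t["startTime"].rstrip("s"))  # "3.936s" -> 3.936
        for t in tokens
        if t["token"].lower() == word.lower()
    ]


def low_confidence(tokens, threshold=0.5):
    """Return the words whose confidence score is below `threshold`."""
    return [t["token"] for t in tokens if float(t["confidence"]) < threshold]


# Three tokens copied from the sample transcript shown later in this post.
tokens = [
    {"token": "blog", "startTime": "1.440s", "endTime": "1.920s", "confidence": "0.5693", "type": "WORD"},
    {"token": "the", "startTime": "2.928s", "endTime": "3.120s", "confidence": "0.4557", "type": "WORD"},
    {"token": "speech", "startTime": "3.936s", "endTime": "4.752s", "confidence": "0.69", "type": "WORD"},
]

print(find_word(tokens, "speech"))  # [3.936]
print(low_confidence(tokens))       # ['the']
```

With the start time in hand, an audio player could jump straight to the spot where the word is spoken, and the low-confidence list is a quick way to decide which words to review by ear.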

Big Picture

Figure: the speech engine process, from audio through the front end and back end to the results.

We need to upload an audio file (with some specific recording parameters) into OCI Object Storage. From there we can use the console or any of the other available methods (SDK/API/CLI) to trigger a transcription job.

Record Audio

I will use Audacity.

These are the technical specs we need to adhere to: You can use single-channel, 16-bit PCM WAV audio files with a 16-kHz sample rate.

It is important to set the rate to 16 kHz, and the recording needs to be a single channel (mono).

Then we can export it as WAV Signed 16-bit PCM.
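
Before uploading, it can be handy to double-check that the exported file really matches those specs. The following is a small sketch using only Python's standard wave module. The function name check_wav is my own, and the three checks simply mirror the requirements quoted above (one channel, 16-bit samples, 16 kHz).

```python
import wave


def check_wav(path):
    """Return a list of problems; an empty list means the file matches the specs."""
    problems = []
    # wave.open raises wave.Error if the file is not a WAV file it can parse.
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        sample_width = wav.getsampwidth()  # bytes per sample: 2 bytes = 16 bits
        rate = wav.getframerate()

    if channels != 1:
        problems.append(f"expected 1 channel (mono), found {channels}")
    if sample_width != 2:
        problems.append(f"expected 16-bit samples, found {sample_width * 8}-bit")
    if rate != 16000:
        problems.append(f"expected a 16000 Hz sample rate, found {rate} Hz")
    return problems
```

Calling check_wav on the exported file (for example a hypothetical recording.wav) should return an empty list. A stereo 44.1 kHz export would come back with two entries describing what to fix in Audacity.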

The next step is to upload that to OCI Object Storage.

The new Speech Service is under the AI Services menu.

Create a transcription job

Choose the previously uploaded file.

And that is it. Once the job is submitted and finished, you can see the transcription result.

We can also download the transcript, which will be a JSON file with all the technical parameters.

{
  "status": "SUCCESS",
  "timeCreated": "2022-03-23 10:07:48.066",
  "modelDetails": {"domain": "GENERIC", "languageCode": "en-US"},
  "audioFormatDetails": {"format": "WAV", "numberOfChannels": 1, "encoding": "PCM", "sampleRateInHz": 16000},
  "transcriptions": [
    {
      "transcription": "this is a blog post about the new speech service",
      "confidence": "1",
      "tokens": [
        {"token": "this", "startTime": "0.768s", "endTime": "1.152s", "confidence": "0.5311", "type": "WORD"},
        {"token": "is", "startTime": "1.152s", "endTime": "1.344s", "confidence": "0.4851", "type": "WORD"},
        {"token": "a", "startTime": "1.344s", "endTime": "1.440s", "confidence": "0.4575", "type": "WORD"},
        {"token": "blog", "startTime": "1.440s", "endTime": "1.920s", "confidence": "0.5693", "type": "WORD"},
        {"token": "post", "startTime": "1.920s", "endTime": "2.448s", "confidence": "0.5647", "type": "WORD"},
        {"token": "about", "startTime": "2.448s", "endTime": "2.880s", "confidence": "0.7551", "type": "WORD"},
        {"token": "the", "startTime": "2.928s", "endTime": "3.120s", "confidence": "0.4557", "type": "WORD"},
        {"token": "new", "startTime": "3.120s", "endTime": "3.840s", "confidence": "0.6036", "type": "WORD"},
        {"token": "speech", "startTime": "3.936s", "endTime": "4.752s", "confidence": "0.69", "type": "WORD"},
        {"token": "service", "startTime": "4.800s", "endTime": "5.808s", "confidence": "0.7144", "type": "WORD"}
      ]
    }
  ]
}
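
Since the documentation mentions subtitles as a use case, here is a rough sketch of how the downloaded JSON could be turned into SubRip (SRT) style cues. It assumes the structure shown above (a transcriptions list whose tokens carry startTime and endTime strings ending in "s"). The function names and the fixed words_per_cue grouping are my own simplification; a real subtitle tool would probably split on pauses or punctuation instead.

```python
def to_seconds(stamp):
    """Convert a timestamp string such as "1.152s" to a float number of seconds."""
    return float(stamp.rstrip("s"))


def srt_time(seconds):
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(result, words_per_cue=5):
    """Group tokens into cues of `words_per_cue` words and return SRT text."""
    cues = []
    for transcription in result["transcriptions"]:
        tokens = transcription["tokens"]
        for i in range(0, len(tokens), words_per_cue):
            chunk = tokens[i:i + words_per_cue]
            start = srt_time(to_seconds(chunk[0]["startTime"]))
            end = srt_time(to_seconds(chunk[-1]["endTime"]))
            text = " ".join(t["token"] for t in chunk)
            cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)


# A trimmed copy of the transcript above, with only the fields this sketch reads.
sample = {
    "transcriptions": [
        {
            "tokens": [
                {"token": "this", "startTime": "0.768s", "endTime": "1.152s"},
                {"token": "is", "startTime": "1.152s", "endTime": "1.344s"},
            ]
        }
    ]
}
print(to_srt(sample))
```

For the real file, the sample dictionary would be replaced by json.load on the downloaded transcript, and the returned text could be written to a .srt file next to the audio.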