UCONN Stamford Google Cloud Development Platform: Chapter 16: Cloud Speech

Chapter 16. Cloud Speech

An overview of speech recognition
How the Cloud Speech API works
How Cloud Speech pricing is calculated
An example of generating automated captions from audio content

By speech recognition we mean taking an audio stream and turning it into text.

Language is a particularly tricky human construct.

What we hear is influenced by what we see (the McGurk effect).

There is a difference between hearing and listening.

Hearing - taking sounds and turning them into words.

Listening - taking sounds and combining them with your context and understanding.

Treat the results from a given audio file as helpful suggestions that are usually right but not guaranteed.

You wouldn’t want to rely on a machine-learning algorithm alone for court transcripts.

It may, however, help stenographers improve their efficiency by using the output as a baseline.

16.1. Simple speech recognition

The Cloud Speech API produces textual content as output but requires a more complex input: an audio stream.

Send the audio file (for example, a .wav file) to the Cloud Speech API for processing.

Tell the Cloud Speech API the format of the audio.

The API also needs to know the sample rate of the file, which tells the audio processor how much clock time each data point covers.

Finally, the API needs to know the language spoken in the audio.
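
The three pieces of information above (format, sample rate, language) can be sketched as a request body. This is a minimal sketch, not the codelab's code: the field names encoding, sampleRateHertz, and languageCode are taken from the Cloud Speech v1 RecognitionConfig, while the helper names buildRecognizeRequest and secondsPerSample are hypothetical and the four audio bytes are placeholder data.

```javascript
// Hypothetical helper: builds a request body from raw audio bytes.
// Field names follow the Cloud Speech v1 RecognitionConfig.
function buildRecognizeRequest(audioBytes, options) {
  const { encoding, sampleRateHertz, languageCode } = options;
  if (!encoding || !sampleRateHertz || !languageCode) {
    throw new Error("encoding, sampleRateHertz and languageCode are all required");
  }
  return {
    config: { encoding, sampleRateHertz, languageCode },
    // Inline audio is sent as a base64-encoded string.
    audio: { content: Buffer.from(audioBytes).toString("base64") },
  };
}

// Clock time (in seconds) covered by a single data point.
function secondsPerSample(sampleRateHertz) {
  return 1 / sampleRateHertz;
}

// Placeholder bytes stand in for the contents of a real .wav file.
const request = buildRecognizeRequest(new Uint8Array([0, 1, 2, 3]), {
  encoding: "LINEAR16",
  sampleRateHertz: 16000,
  languageCode: "en-US",
});
console.log(request.config);
console.log(secondsPerSample(16000)); // 62.5 microseconds per sample at 16 kHz
```

Keeping the request builder separate from the network call makes it easy to check the configuration before any audio is sent.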

Cloud Speech

Humana

Codelab Speech

First, enable the API in the Cloud Console.

Once the API is enabled, install the client library.

You should see some interesting output:

This audio file says: "how old is the Brooklyn Bridge"

Notice how long the recognition took.

The Cloud Speech API needs to “listen” to the entire audio file.

The time the recognition process takes is directly correlated with the length of the audio.

Extraordinarily long audio files (for example, more than a few seconds) shouldn’t be processed like this.
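
Because recognition time grows with audio length, it can help to estimate the duration before picking an approach. The sketch below assumes uncompressed 16-bit PCM audio; the function names are hypothetical, and the cutoff is passed in by the caller rather than fixed, since the right limit depends on the API's current quotas and on how long you are willing to wait.

```javascript
// Duration of raw PCM audio: bytes / (samples per second * bytes per sample * channels).
function pcmDurationSeconds(byteLength, sampleRateHertz, channels = 1, bytesPerSample = 2) {
  return byteLength / (sampleRateHertz * channels * bytesPerSample);
}

// maxSyncSeconds is an assumption supplied by the caller, not an official limit.
function chooseRecognitionMode(durationSeconds, maxSyncSeconds) {
  return durationSeconds <= maxSyncSeconds ? "synchronous" : "streaming";
}

// 320,000 bytes of 16 kHz mono 16-bit audio is 10 seconds long.
const duration = pcmDurationSeconds(320000, 16000);
console.log(duration, chooseRecognitionMode(duration, 60));
```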

There is no indication of confidence in this result.

Use the verbose flag to include it.

The output then looks something like the following:

This audio file says: "how old is the Brooklyn Bridge" (with 98% confidence)
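
A line like the one above can be produced from the recognition response. This sketch assumes a response shaped like the Cloud Speech v1 recognize response (a results array whose entries hold alternatives with transcript and confidence fields); describeResult is a hypothetical helper, and the sample response is hand-written rather than real API output.

```javascript
// Hypothetical formatter for the top alternative of the first result.
function describeResult(response) {
  const results = response.results || [];
  if (results.length === 0 || results[0].alternatives.length === 0) {
    return "No speech was recognized.";
  }
  const best = results[0].alternatives[0];
  // Confidence is a number between 0 and 1; show it as a whole percentage.
  const percent = Math.round(best.confidence * 100);
  return `This audio file says: "${best.transcript}" (with ${percent}% confidence)`;
}

// Hand-written sample response for illustration only.
const sampleResponse = {
  results: [
    { alternatives: [{ transcript: "how old is the Brooklyn Bridge", confidence: 0.98 }] },
  ],
};
console.log(describeResult(sampleResponse));
```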

16.2. Continuous speech recognition

Sometimes you can’t take an entire audio file and send it as one chunk to the API for recognition.

The most common case is a large audio file, which is too big to treat as one big blob, so you have to break it up into smaller chunks.

The same applies when you’re trying to recognize streams that are live, because these streams keep going until you decide to turn them off.

To handle this, the Speech API allows asynchronous recognition: it accepts chunks of data, recognizes them along the way, and returns a final result after the audio stream is completed.
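
Breaking audio into smaller chunks can be sketched with a generator. This only illustrates the chunking step: chunkAudio is a hypothetical helper, the chunk size is arbitrary, and actually sending each chunk to the streaming API is left out.

```javascript
// Yields consecutive slices of at most chunkSize bytes without copying the data.
function* chunkAudio(bytes, chunkSize) {
  if (!Number.isInteger(chunkSize) || chunkSize <= 0) {
    throw new RangeError("chunkSize must be a positive integer");
  }
  for (let offset = 0; offset < bytes.length; offset += chunkSize) {
    // subarray clamps the end index, so the last chunk may be shorter.
    yield bytes.subarray(offset, offset + chunkSize);
  }
}

// Ten placeholder bytes split into chunks of four.
const chunkSizes = [];
for (const chunk of chunkAudio(new Uint8Array(10), 4)) {
  chunkSizes.push(chunk.length);
}
console.log(chunkSizes);
```

In a real pipeline each yielded chunk would be written to the streaming recognition request as it becomes available, which is what lets live audio be recognized before it ends.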

You should see the exact same result as before, shown in the next listing.

This audio file says: "how old is the Brooklyn Bridge" (with 98% confidence).

16.3. Hinting with custom words and phrases

New words are invented all the time.

Sometimes the Cloud Speech API might not be “in the know” about all the cool new words or slang phrases, and may end up guessing wrong.

We also invent new, interesting names for companies (for example, Google was a misspelling of “Googol”).

To help with this, you can pass along suggestions of valid phrases, which are added to the API’s ranking system for each request.

The Speech API does indeed use the alternate spelling provided:

This audio file says: "how old is the brooklynne bridge" (with 90% confidence)
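
Passing suggested phrases with a request can be sketched as follows. The speechContexts field with its phrases list comes from the Cloud Speech v1 RecognitionConfig; withPhraseHints is a hypothetical helper, and the base configuration values are example assumptions.

```javascript
// Returns a new config with phrase hints attached, leaving the original untouched.
function withPhraseHints(config, phrases) {
  return { ...config, speechContexts: [{ phrases: [...phrases] }] };
}

// Example base configuration (assumed values).
const baseConfig = { encoding: "LINEAR16", sampleRateHertz: 16000, languageCode: "en-US" };
const hintedConfig = withPhraseHints(baseConfig, ["brooklynne"]);
console.log(JSON.stringify(hintedConfig));
```

Hints only nudge the ranking for that request, so the confidence for a hinted spelling can be lower than for the spelling the API would have chosen on its own.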

16.4. Understanding pricing

16.5. Example: generating automated captions from audio content

The goal is to suggest tags based on what’s being said in a video.

Use the Cloud Natural Language API to recognize any entities being discussed.

To do this, pull out the audio portion of the video, figure out what’s being said, and come back with suggested tags.

1. First, the user records and uploads a video.

2. Separate the audio track from the video track.

3. Send the audio content to the Cloud Speech API for recognition.

4. The Speech API should return a transcript as a response.

5. You then send all of the text (caption and video transcript) to the Cloud Natural Language API.

6. The Cloud NL API will recognize entities and detect sentiment from the text.

7. Finally, you send the suggested tags back to the user.

Start by writing a function that takes a video buffer as input and returns a JavaScript promise for the transcript of the video.

The function grabs the audio and recognizes it as text.
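
A function of this shape might look like the sketch below. It is not the original listing: extractAudio and recognize are hypothetical functions injected by the caller (standing in for the audio-extraction step and the Cloud Speech call), and the response is assumed to follow the v1 shape of results with alternatives.

```javascript
// Returns a promise for the transcript of a video buffer.
// extractAudio(videoBuffer) -> audio bytes (or a promise for them)
// recognize(audio) -> promise for a Speech-style response
function getTranscript(videoBuffer, { extractAudio, recognize }) {
  return Promise.resolve(videoBuffer)
    .then(extractAudio)
    .then(recognize)
    .then((response) =>
      // Join the top alternative of every result into one transcript.
      (response.results || [])
        .map((result) => result.alternatives[0].transcript)
        .join(" ")
    );
}

// Stubs used only to exercise the sketch; they do not call any real service.
const stubDeps = {
  extractAudio: (video) => video,
  recognize: () =>
    Promise.resolve({
      results: [{ alternatives: [{ transcript: "how old is the Brooklyn Bridge" }] }],
    }),
};
getTranscript(new Uint8Array(4), stubDeps).then((text) => console.log(text));
```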

Next, write a function that takes any given content and returns a JavaScript promise for the sentiment and entities of that content.
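
One possible shape for that function is sketched here. It assumes an injected client object exposing analyzeSentiment and analyzeEntities methods that resolve to an array whose first element is the result, as the Cloud Natural Language Node.js client does; the stub client and its values are made up for illustration.

```javascript
// Returns a promise for { sentiment, entities } describing the given text.
function getSentimentAndEntities(content, nlClient) {
  const document = { content, type: "PLAIN_TEXT" };
  return Promise.all([
    nlClient.analyzeSentiment({ document }),
    nlClient.analyzeEntities({ document }),
  ]).then(([[sentimentResult], [entityResult]]) => ({
    sentiment: sentimentResult.documentSentiment,
    entities: entityResult.entities,
  }));
}

// Stub client with made-up values; a real client would call the API.
const stubNlClient = {
  analyzeSentiment: () => Promise.resolve([{ documentSentiment: { score: 0.5, magnitude: 0.5 } }]),
  analyzeEntities: () => Promise.resolve([{ entities: [{ name: "Brooklyn Bridge" }] }]),
};
getSentimentAndEntities("how old is the Brooklyn Bridge", stubNlClient).then((analysis) =>
  console.log(analysis.sentiment.score, analysis.entities.map((e) => e.name))
);
```

Running both analyses with Promise.all lets the two requests proceed in parallel rather than one after the other.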

You have all the tools you need to put your code together.

Build the final handler function that accepts a video with its properties (the video buffer and caption) and returns the suggested tags.
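
Putting the pieces together might look like the following sketch. The two helper functions are injected so the example stands alone; the property names videoBuffer and caption, the helper tagsFromEntities, and the choice to lowercase and de-duplicate entity names are all assumptions made for illustration.

```javascript
// Turns recognized entities into a de-duplicated list of lowercase tags.
function tagsFromEntities(entities) {
  const seen = new Set();
  const tags = [];
  for (const entity of entities) {
    const tag = entity.name.trim().toLowerCase();
    if (tag && !seen.has(tag)) {
      seen.add(tag);
      tags.push(tag);
    }
  }
  return tags;
}

// Handler: video -> promise for suggested tags.
// getTranscript(videoBuffer) and getSentimentAndEntities(text) are supplied by the caller.
function getSuggestedTags(video, { getTranscript, getSentimentAndEntities }) {
  return getTranscript(video.videoBuffer)
    .then((transcript) => getSentimentAndEntities(`${video.caption} ${transcript}`))
    .then(({ entities }) => tagsFromEntities(entities));
}

// Stubbed dependencies so the sketch runs without calling any service.
const stubHandlerDeps = {
  getTranscript: () => Promise.resolve("how old is the Brooklyn Bridge"),
  getSentimentAndEntities: () =>
    Promise.resolve({
      sentiment: { score: 0.2 },
      entities: [{ name: "Brooklyn Bridge" }, { name: "brooklyn bridge" }, { name: "New York" }],
    }),
};
getSuggestedTags({ videoBuffer: new Uint8Array(4), caption: "A walk in New York" }, stubHandlerDeps)
  .then((tags) => console.log(tags));
```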

Summary

Speech recognition takes a stream of audio and converts it into text, which can be deceptively complicated due to things like the McGurk effect.

Cloud Speech is a hosted API that can perform speech recognition on audio files or streams.

Assignment #12 due 5/9/25
