Chapter 16. Cloud Speech
An overview of speech recognition
How the Cloud Speech API works
How Cloud Speech pricing is calculated
An example of generating automated captions from audio content
By speech recognition, we mean taking an audio stream and turning it into text.
Language is a particularly tricky human construct.
What we hear is based on what we see.
There is a difference between hearing and listening.
Hearing: taking sounds and turning them into words.
Listening: taking sounds and combining them with your context and understanding.
Treat the results from a given audio file as helpful suggestions that are usually right but not guaranteed.
You wouldn’t want to use a machine-learning algorithm for court transcripts.
It may, however, help stenographers improve their efficiency by using the output as a baseline.
16.1. Simple speech recognition
The Cloud Speech API produces textual content as output but requires a more complex input: an audio stream.
Send the audio file (for example, a .wav file) to the Cloud Speech API for processing.
Tell the Cloud Speech API the format of the audio.
The API also needs to know the sample rate of the file.
The sample rate tells the audio processor the clock time covered by each data point.
Finally, the API needs to know the language spoken in the audio.
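The three pieces of information above (format, sample rate, language) could be gathered into a single request object. This is a minimal sketch, assuming the request shape used by the current `@google-cloud/speech` Node.js client (`config.encoding`, `config.sampleRateHertz`, `config.languageCode`, and base64 `audio.content`). The helper names and the default values are illustrative and not taken from the chapter.

```javascript
// Hypothetical helper: bundles the audio bytes with the metadata the API needs.
// The defaults are assumptions for a 16 kHz, English, 16-bit PCM recording.
function buildRecognizeRequest(audioBuffer, options = {}) {
  const {
    encoding = 'LINEAR16',    // format of the audio (uncompressed 16-bit PCM)
    sampleRateHertz = 16000,  // number of data points per second of audio
    languageCode = 'en-US',   // language spoken in the audio
  } = options;

  return {
    config: { encoding, sampleRateHertz, languageCode },
    // Inline audio is sent as a base64-encoded string.
    audio: { content: audioBuffer.toString('base64') },
  };
}

// Clock time (in seconds) covered by each data point at a given sample rate.
function secondsPerSample(sampleRateHertz) {
  return 1 / sampleRateHertz;
}

// Example with four placeholder bytes standing in for real audio.
const exampleRequest = buildRecognizeRequest(Buffer.from([0, 1, 2, 3]));
console.log(exampleRequest.config, secondsPerSample(16000));
```

With the real client, an object like this would typically be passed to the client's `recognize` method; check the version of the library you installed, since its call signature has changed between releases.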
Enable the API in the Cloud Console.
Once the API is enabled, install the client library.
You should see some interesting output:
This audio file says: "how old is the Brooklyn Bridge"
Notice how long the recognition took.
The Cloud Speech API needs to “listen” to the entire audio file.
As a result, the recognition time is directly correlated to the length of the audio.
Extraordinarily long audio files (for example, more than a few seconds) shouldn’t be processed like this.
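To decide whether a recording is short enough to send in one request, you can estimate its length from its size. This is a rough sketch that assumes raw 16-bit PCM (LINEAR16) audio with no file header; both function names and the `maxSeconds` cutoff are illustrative choices, not API constants.

```javascript
// Duration of raw LINEAR16 audio computed from its size.
// Each sample takes 2 bytes per channel.
function pcmDurationSeconds(byteLength, sampleRateHertz, channels = 1) {
  const bytesPerSecond = sampleRateHertz * 2 * channels;
  return byteLength / bytesPerSecond;
}

// maxSeconds is an assumed cutoff picked by the caller.
function isShortEnoughForOneRequest(byteLength, sampleRateHertz, maxSeconds = 10) {
  return pcmDurationSeconds(byteLength, sampleRateHertz) <= maxSeconds;
}

// 32,000 bytes of 16 kHz mono audio is one second of sound.
console.log(pcmDurationSeconds(32000, 16000));
```

A real .wav file carries a header before the samples, so this slightly overestimates the duration of such files.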
There’s no concept of confidence in this result.
To see the confidence, use the verbose flag.
The output looks something like the following:
This audio file says: "how old is the Brooklyn Bridge" (with 98% confidence)
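A line like the one above could be produced from a recognition response as follows. This sketch assumes a response shaped like the one the current Node.js client returns, where each result carries a list of alternatives with a `transcript` and a `confidence` between 0 and 1. The function name and the sample response are made up for illustration.

```javascript
// Formats the top alternative of each result into one readable line.
// Assumed shape: { results: [{ alternatives: [{ transcript, confidence }] }] }
function describeResponse(response) {
  const results = response.results || [];
  if (results.length === 0) {
    return 'No speech was recognized in this audio file.';
  }
  const transcript = results
    .map((result) => result.alternatives[0].transcript)
    .join(' ');
  // Report the confidence of the first result as a whole percentage.
  const percent = Math.round(results[0].alternatives[0].confidence * 100);
  return `This audio file says: "${transcript}" (with ${percent}% confidence)`;
}

// Hand-written sample response, not real API output.
const sampleResponse = {
  results: [
    { alternatives: [{ transcript: 'how old is the Brooklyn Bridge', confidence: 0.98 }] },
  ],
};
console.log(describeResponse(sampleResponse));
```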
16.2. Continuous speech recognition
Sometimes you can’t take an entire audio file and send it as one chunk to the API for recognition.
The most common case is a large audio file, which is too big to treat as one big blob.
Instead, you need to break it up into smaller chunks.
The same is true when you’re trying to recognize streams that are live,
because these streams keep going until you decide to turn them off.
To handle this, the Speech API allows asynchronous recognition.
It accepts chunks of data, recognizes them along the way, and returns a final result after the audio stream is completed.
You should see the exact same result as before, shown in the next listing.
This audio file says: "how old is the Brooklyn Bridge" (with 98% confidence).
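The idea of breaking audio into smaller chunks and feeding them to a stream can be sketched with plain Node.js buffers. Both helpers here are hypothetical; `writable` stands for any object with `write` and `end` methods, such as a recognition stream obtained from the client library.

```javascript
// Splits a Buffer into consecutive chunks of at most chunkSize bytes.
function* chunkBuffer(buffer, chunkSize) {
  if (!Number.isInteger(chunkSize) || chunkSize <= 0) {
    throw new RangeError('chunkSize must be a positive integer');
  }
  for (let offset = 0; offset < buffer.length; offset += chunkSize) {
    // subarray returns a view on the same memory, so no bytes are copied.
    yield buffer.subarray(offset, offset + chunkSize);
  }
}

// Writes every chunk to the stream in order, then signals the end of the audio.
// Backpressure is ignored to keep the sketch short.
function sendInChunks(buffer, chunkSize, writable) {
  for (const chunk of chunkBuffer(buffer, chunkSize)) {
    writable.write(chunk);
  }
  writable.end();
}

// Ten placeholder bytes split into chunks of four.
console.log([...chunkBuffer(Buffer.alloc(10), 4)].map((chunk) => chunk.length));
```

For a file on disk, piping a read stream into the recognition stream achieves the same effect without loading the whole file into memory.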
16.3. Hinting with custom words and phrases
Recognize that new words will be invented all the time.
Sometimes the Cloud Speech API might not be “in the know” about all the cool new words or slang phrases, and may end up guessing wrong.
We invent new, interesting names for companies (for example, Google was a misspelling of “Googol”).
You’re able to pass along some suggestions of valid phrases that can be added to the API’s ranking system for each request.
With such a suggestion, the Speech API does indeed use the alternate spelling provided:
This audio file says: "how old is the brooklynne bridge" (with 90% confidence)
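Passing a suggested phrase could look like this. The sketch assumes the `speechContexts` field of the current API's recognition config, which holds lists of `phrases`; the helper name is hypothetical.

```javascript
// Returns a copy of a recognition config with phrase hints attached.
// The original config object is left unchanged.
function withPhraseHints(config, phrases) {
  return { ...config, speechContexts: [{ phrases: [...phrases] }] };
}

const baseConfig = { encoding: 'LINEAR16', sampleRateHertz: 16000, languageCode: 'en-US' };
const hintedConfig = withPhraseHints(baseConfig, ['brooklynne']);
console.log(hintedConfig.speechContexts);
```

Hints only nudge the ranking of candidate transcripts, which fits the lower confidence shown in the output above.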
16.4. Understanding pricing
You want to suggest tags based on what’s being said in a video.
You can use the Cloud Natural Language API to recognize any entities being discussed.
To do this, pull out the audio portion of the video, figure out what’s being said, and come back with suggested tags.
1. First, the user records and uploads a video.
2. Separate the audio track from the video track.
3. Send the audio content to the Cloud Speech API for recognition.
4. The Speech API should return a transcript as a response.
5. You then send all of the text (caption and video transcript) to the Cloud Natural Language API.
6. The Cloud NL API will recognize entities and detect sentiment from the text.
7. Finally, you send the suggested tags back to the user.
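The steps above can be sketched as one asynchronous function. The three dependencies (`extractAudio`, `transcribe`, `analyze`) are hypothetical and passed in by the caller, so the flow can be tried without calling any real service. The rule of one lowercase tag per unique entity is an assumption made for this sketch.

```javascript
// Hypothetical dependencies, injected by the caller:
//   extractAudio(videoBuffer) -> Promise<Buffer>
//   transcribe(audioBuffer)   -> Promise<string>
//   analyze(text)             -> Promise<{ entities: string[], sentiment: number }>
async function suggestTags(video, { extractAudio, transcribe, analyze }) {
  const audio = await extractAudio(video.buffer);       // step 2
  const transcript = await transcribe(audio);           // steps 3 and 4
  const text = `${video.caption} ${transcript}`;        // step 5
  const { entities, sentiment } = await analyze(text);  // step 6
  // Assumed tagging rule: one lowercase tag per unique entity name.
  const tags = [...new Set(entities.map((name) => name.toLowerCase()))];
  return { tags, sentiment };                           // step 7
}
```

Injecting the dependencies keeps the ordering of the steps separate from the API calls, which also makes the flow easy to test with stand-ins.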
Start by writing a function that takes a video buffer as input and returns a JavaScript promise for the transcript of the video.
This function grabs the audio and recognizes it as text.
Next, write a function that takes any given content and returns a JavaScript promise for the sentiment and entities of that content.
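A function of that kind might look like the following. It assumes `client` behaves like `LanguageServiceClient` from `@google-cloud/language`, whose `analyzeSentiment` and `analyzeEntities` methods each resolve to an array with the result as its first element. The client is passed in as a parameter, and the function name is hypothetical.

```javascript
// Returns a promise for the sentiment and entity names of some text.
function getSentimentAndEntities(client, content) {
  const document = { content, type: 'PLAIN_TEXT' };
  // Both analyses are independent, so they run in parallel.
  return Promise.all([
    client.analyzeSentiment({ document }),
    client.analyzeEntities({ document }),
  ]).then(([[sentimentResult], [entityResult]]) => ({
    sentiment: sentimentResult.documentSentiment,
    entities: entityResult.entities.map((entity) => entity.name),
  }));
}
```

If either call fails, the returned promise rejects, so the caller decides how to handle a partial analysis.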
You have all the tools you need to put your code together.
Build the final handler function that accepts a video with properties.
Summary
Speech recognition takes a stream of audio and converts it into text, which can be deceptively complicated due to things like the McGurk effect.
Cloud Speech is a hosted API that can perform speech recognition on audio files or streams.