Speech recognition technology is ready to be consumed by the masses

Ian Firth, VP Products, Speechmatics

What’s ready today?

Speech recognition is in great shape: accuracy levels are good and improving all the time. Those accuracy gains are no longer confined to the easy scenarios; the technology now performs well in noisier, harder conversational use cases, which makes it practical for real-world applications. It can also be deployed in scalable ways that meet business needs, either as on-premises models or in the public cloud.

Consuming the technology is getting easier, too. Speech recognition can support multi-accent and dialect models, which avoids the challenge of managing separate deployments for the diverse world we live and operate in. Nor is speech technology just for English: it also supports native speakers of a growing range of languages. As its capabilities keep expanding, businesses can operate globally with the same scale and support they would have in the English-speaking world.

Current challenges the industry still faces

There is always room for improvement in any industry. In many cases, speech recognition support for non-English languages is not as good as it is for English, especially once accents and dialects are taken into account. Supporting multiple languages also brings the challenge of working out which language is being spoken, so the ability to detect and identify the language itself is a growing need. Language identification and code switching (speakers changing language mid-conversation) are becoming increasingly important to the adoption of speech technology, but remain a challenge for most speech technology providers. Personalisation to specific users and use cases is also still very much a challenge, but the foundations have been laid with features such as custom dictionaries, and this is expected to improve in the short term.

It’s not just words that convey meaning in conversation. Sentiment, the identity of the speaker, hesitation and non-speech sounds all provide additional context and meaning. There is still work to do before this wider meaning of speech can be determined.

What’s the potential for speech technology over the next few years?

Ultimately, what we want is to truly understand the spoken word, not just transcribe what is said, and that is the journey the technology is now very much on. Understanding supports continuous intelligence within businesses, and delivering that understanding in real time allows actions to be taken in line rather than out of band. Understanding also means using all the available context. That means looking beyond the words to sounds and sentiment, and also drawing on any images, video and text-based communication that can add deeper meaning. As speech technology continues to develop, we expect to see a broader range of usable outputs from speech analysis, such as call steering, detailed sentiment analysis and extended voice control capabilities.

All of this progress requires ever more data to be processed. The biggest constraint is having enough labelled data to support the learning required. We are undertaking research to make this less human-intensive and to enable much faster, continuous learning. These developments will unlock the power of understanding that will form the next big step in speech recognition technology.

