Voice recognition: How to understand and use it

Artificial intelligence, and voice recognition in particular, is increasingly part of our daily lives. It is used widely, whether on our smartphones, tablets or connected speakers. The user’s voice is now at the centre of the online world offered by digital stakeholders. However, with the multitude of personal assistants, connected speakers and voice-activated objects, navigating the voice recognition market can quickly become complicated. So, what do we really know about voice recognition? What is its function in everyday life and why is it so prevalent today? Netatmo reviews the current situation.

A brief history of voice recognition

In 1961, IBM launched one of the world’s first voice recognition tools: the IBM Shoebox. The technology giant Apple did not launch Siri, which is now known worldwide, until October 2011. Several other voice assistants followed: Google came first with Google Now in July 2012, then in 2014 Microsoft unveiled Cortana and Amazon revealed Alexa together with its connected speaker, the Amazon Echo.

How does voice recognition work?

Define voice recognition

Voice recognition can be defined as a technology that allows a device to understand and analyse a human voice and then transcribe each of the dictated words into usable text. Specifically, the voice is captured by the device’s microphone as sound frequencies and then transcribed into written text. Voice recognition can be seen as an alternative to keyboard or handwritten entry and is often praised for being faster and saving time in everyday tasks. It is also referred to by the broader term automatic speech recognition, or ASR. Automatic speech recognition comprises two technologies: voice dictation and voice control. But what is the difference between voice dictation and voice control?

Voice dictation: a simple principle, where you speak a text aloud to the device, which then transcribes it via a processor.

Voice control: a term used when it comes to giving spoken commands to the device.

The distinction between the two terms is subtle. To sum up, voice control means giving real instructions to the machine, whereas voice dictation simply conveys a certain amount of information by voice, without it being a command.
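The distinction can be sketched in a few lines of Python. This is only an illustration, assuming the speech has already been transcribed to text: the command phrases and the `route_utterance` function are hypothetical and do not come from any particular assistant.

```python
# Hypothetical command phrases, chosen for illustration only.
COMMANDS = {
    "turn on the lights": "lights_on",
    "turn off the lights": "lights_off",
    "play music": "play_music",
}


def route_utterance(transcript, document):
    """Treat a transcript as voice control if it matches a known command,
    otherwise treat it as voice dictation and append it to the document."""
    normalized = transcript.lower().strip(" .!?")
    if normalized in COMMANDS:
        # Voice control: the utterance is an instruction to execute.
        return ("control", COMMANDS[normalized])
    # Voice dictation: the utterance is content to keep as text.
    document.append(transcript)
    return ("dictation", transcript)


notes = []
print(route_utterance("Turn on the lights.", notes))
print(route_utterance("Dear Anna, thank you for your message.", notes))
print(notes)
```

Here the first utterance is executed as a command and leaves the document untouched, while the second is simply stored as dictated text. A real assistant would rely on language understanding rather than exact phrase matching.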

To further improve voice recognition software, the National Institute of Standards and Technology developed the Speaker Recognition Evaluation in 1996. Many researchers use this tool to evaluate the progress made by voice recognition through the years.

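It is also common to hear about the word error rate (WER), a metric used to evaluate the performance of voice recognition software: the number of words substituted, deleted or inserted by mistake, divided by the number of words actually spoken.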
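As a rough illustration, the word error rate can be computed by comparing what was actually said (the reference) with what the software transcribed (the hypothesis). The sketch below uses the standard word-level edit distance; the function name `word_error_rate` and the example sentences are chosen for illustration, and real evaluations usually normalise punctuation and spelling more carefully first.

```python
def word_error_rate(reference, hypothesis):
    """Return (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] = edit distance between the ref words seen so far and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1] / len(ref)


# Two of the five spoken words are transcribed wrongly.
print(word_error_rate("turn on the kitchen lights", "turn on the chicken light"))  # 0.4
```

A lower value is better, and 0 means a perfect transcription. Because insertions count as errors, the rate can exceed 1 when the software outputs many extra words.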

The different components of the voice recognition system

The wake-up word or hot word: this is the entry key, the first interaction between human and machine in the voice recognition process. It is the word that triggers the device’s voice recognition. The best-known wake-up words are “Ok Google” and “Hey Siri”. Wake-up words are often short and concise, as the user must be able to pronounce them easily and quickly. Easy pronunciation is all the more important because each language may have several accents and a multitude of voice tones.

Speech To Text: this is a system that breaks down the words that the user says. It splits the spoken words into small segments (called samples) and associates them with phonemes. Put more simply, it converts speech into written text. The process is paired with algorithms that allow the machine to recognise what the user said. Speech to text can be improved through artificial intelligence techniques such as machine learning or deep learning, which consist of training the machine on the correct answers using artificial neural networks.

NLP (Natural Language Processing): this technology is also referred to as automatic natural language processing. It uses computer tools to process human language. It is subdivided into two processes: Natural Language Understanding and Natural Language Generation. The NLP process comes after speech to text, since it is through this tool that the text is interpreted by the machine.

Text To Speech: this technology, also known as speech synthesis, transforms computer text into spoken audio. For example, it allows a computerised voice to read a web page to a person with visual impairments. After processing the text, the software establishes the rhythm and intonation to apply to it. It is performed at the end of the voice recognition process because this is the tool that creates the synthesised voice responding to the user’s request.
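The way these components chain together can be sketched as follows. This is a deliberately simplified illustration, not how any commercial assistant is built: it works on text only, so speech to text and text to speech are assumed to happen before and after it, and real wake-up word detection runs on the audio itself. The function names, the two intents and the reply sentences are all hypothetical.

```python
import re

WAKE_WORDS = ("ok google", "hey google")  # wake-up words quoted in the text


def strip_wake_word(transcript):
    """Return the request that follows a wake-up word, or None if absent."""
    text = re.sub(r"[^\w\s]", "", transcript.lower()).strip()
    for wake_word in WAKE_WORDS:
        if text.startswith(wake_word):
            return text[len(wake_word):].strip()
    return None


def understand(request):
    """Toy natural language understanding: map a request to an intent."""
    match = re.search(r"set (?:the )?temperature to (\d+)", request)
    if match:
        return {"intent": "set_temperature", "degrees": int(match.group(1))}
    if "weather" in request:
        return {"intent": "get_weather"}
    return {"intent": "unknown"}


def generate(parsed):
    """Toy natural language generation: build the reply text."""
    if parsed["intent"] == "set_temperature":
        return f"Setting the temperature to {parsed['degrees']} degrees."
    if parsed["intent"] == "get_weather":
        return "Here is today's weather forecast."
    return "Sorry, I did not understand that."


def handle(transcript):
    """Wake word, then understanding, then generation.

    Returns None when no wake-up word is heard, so the request is ignored.
    """
    request = strip_wake_word(transcript)
    if request is None:
        return None
    return generate(understand(request))


print(handle("Ok Google, set the temperature to 21"))
print(handle("Set the temperature to 21"))
```

The first call produces a reply that a text to speech engine could then read aloud. The second returns nothing, because without the wake-up word the device is not meant to react.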

Voice assistants on the market

In recent years, many of the major stakeholders in the “digital revolution” have been introducing their own personal assistants to the speech recognition market. Although they use broadly the same voice and text transcription techniques, each assistant has its own distinctive features according to its manufacturer’s objectives. Integrating voice recognition into an ecosystem is important for brands because user data ensures greater accuracy for the voice assistant. It should also be noted that to operate all the smart objects in the home, it is necessary to use an audio system (connected speakers) frequently sold by the various brands.

Google Assistant

Launched in 2016, Google Assistant has become one of the world’s leading voice assistants. Before its appearance, however, the brand had already positioned itself on the voice recognition market with its Google Now assistant. More specifically, Google Assistant started out as an extension of Google Now and is now used in its own right. The assistant could be used with the former Google Allo application to answer messages directly for the user. Today, Google Assistant allows users to give voice commands and perform a variety of tasks, ranging from real-time translation and music control to recommendations on the best route to take. The wake-up words for this assistant are “Hey Google” or “Ok Google”. Google’s voice assistant is designed to work with all connected products in the Nest range (Nest Hub, Chromecast etc.). Additionally, the American brand has been able to extend the scope of its voice assistant thanks to its compatibility with many other brands, such as Netatmo. The brand’s connected speaker is the Google Nest. There is also a version with a screen, the Nest Hub.

Netatmo products compatible with Google Assistant:

Siri

The personal assistant Siri was launched in 2011 by Apple. Like other assistants, it processes voice commands or searches made by the user. Siri’s unique feature is that it is compatible only with the Apple ecosystem (iPhone, iPad…). Apple’s connected speaker is the HomePod (also available as a Mini).

Netatmo products compatible with the Apple HomeKit application (the Apple application into which Siri-enabled smart products can be integrated):

Alexa

Amazon launched Alexa in November 2014 and its Amazon Echo speaker at the same time. Unlike Siri or Google Assistant, its wake-up word is simply its name, “Alexa”, which seems slightly more intuitive than an “OK” or “Hey”. The strength of this voice assistant lies in its excellent knowledge of Amazon customers’ shopping habits. When the user makes a voice command for a purchase on Amazon, Alexa can provide a relevant item listing. It is also one of the assistants that is most compatible with products from other brands.

Netatmo products compatible with Alexa:

Bixby Voice

Less well known than Siri or Alexa, Bixby Voice is the personal assistant created by Samsung. Bixby Voice offers many of the same features as the other assistants but is only available on Samsung phones and tablets. To activate it, you have to say “Hi Bixby”. Samsung says that Bixby Voice understands voice commands in a subtle way, i.e. it can differentiate between remarkably similar requests. Samsung’s connected speaker is called the Galaxy Home (there is a Mini version).

Cortana

In 2014 Microsoft announced the release of its Cortana software. It is described by Microsoft as a “productivity assistant” that works with Windows. The Cortana software facilitates better task management on Microsoft (calendar, meetings, reminders…): all you have to do is press the microphone button or say the wake-up word “Hey Cortana” to launch it. The assistant’s distinctive feature is that it is linked to the Microsoft system and its office tools, and can therefore be used on computers with Windows 10, as well as on Windows Phone 8.1 (and later).

Dragon NaturallySpeaking

Like the Windows assistant, Dragon NaturallySpeaking is software that allows you to use your computer by voice and is used for transcription. It works within other programs such as Word or Excel, but also in web browsers. Users mainly use it for dictation. Dragon NaturallySpeaking can transcribe audio recordings, process spoken text and correct it. The Dragon software is also known for its accuracy, as it is said to make fewer mistakes on average than a user typing on a keyboard.

Conclusion

Voice recognition is now in a state of constant expansion. Each brand offers its own personal assistant that operates within its ecosystem (Siri, Bixby) or extends to products made by other brands (Alexa, Google Assistant). Associated applications such as Apple HomeKit or Google Home offer users the choice to fully connect their home through voice recognition (and more extensively through artificial intelligence). Ultimately, the various voice assistants have similar applications (voice command, text dictation, etc.) and it is up to the user to choose the digital ecosystem they are most comfortable with.