The Future of Touchless Technologies: Voice or Gesture

If you think that communication with computers is only possible by using a keyboard, mouse, or touchscreen, you are mistaken. Touchless technologies are no longer just around the corner; they already penetrate our daily life: smart cars, online customer support, virtual assistants, smart homes, augmented and virtual reality games. But what stands behind them?

Humans communicate in multiple ways. We interact through voice, gesture, or direct touch, and we switch between them naturally and seamlessly. Human-computer interaction, however, is much more complicated. In the past we controlled computers by direct touch only, that is, by pressing keys on a keyboard or swiping a screen. Today, touchless systems allow us to communicate with a computer by voice or gesture.

Today, voice recognition and gesture recognition are the hype in the computing world and both are evolving at a rapid pace. These technologies are based on Natural Language Processing (NLP) and Computer Vision (CV), two areas of artificial intelligence that make machines capable of interacting as humans do, be it with verbal or non-verbal language.

But what are Computer Vision and NLP exactly?

Computer Vision, or Machine Vision, is a field of computer science that enables a computer to understand images or visual data the way human vision does and to extract useful information in order to produce the appropriate output. Among its main applications are video surveillance, machine object detection/identification/avoidance, medical image analysis, augmented reality (AR)/virtual reality (VR) development, localization and mapping, converting paperwork into digital data, human emotion analysis, ad insertions into images and videos, face recognition, real estate development optimization, etc.

Whereas Computer Vision gives a machine the ability “to see”, Natural Language Processing gives it the ability “to speak”. NLP is a field of artificial intelligence that enables automatic processing of human languages. In essence, NLP turns natural language into a form a machine can process, so that people can interact with computers. The main goal is to bridge the gap between human communication and computer understanding. This technology can be found in a wide variety of applications, such as information extraction, summarization, spelling and grammar checking, machine-assisted translation, information retrieval, document clustering, question answering, text segmentation, natural language interfaces to databases, e-mail understanding, optical character recognition (OCR), sentiment analysis, etc.

Where are CV and NLP being used today?

Computer Vision

Today, computer vision is driven by a great number of industries and a variety of applications. The rapid development of technologies, improved software algorithms, advanced cameras, the increasing adoption of Industry 4.0, and declining prices are all major factors that contribute to the growth of this market. The global revenue from computer vision hardware, software, and services is predicted to rise from $1.1 billion in 2016 to $26.2 billion by 2025, according to a report from industry analyst Tractica. Besides, in recent years Computer Vision platforms have seen the largest number of acquisitions among all AI technologies. Venture Scanner’s report estimated the value of these acquisitions at around $16 billion, which is 72% of all AI acquisition activity. For instance, Intel acquired Movidius in 2016 for $400 million. In 2016, there were 106 companies specializing in Computer Vision, among them 33 companies providing gesture control.

Because of a wide range of application possibilities, vision-based technologies are used in quite a number of different market segments, such as automotive, consumer and mobile, robotics and machine vision, healthcare, sports and entertainment, security and surveillance, and retail markets.

The major application market is automotive. It covers technologies used in cars and other vehicles, such as Advanced Driver Assistance Systems (ADAS), autonomous vehicles, parking assist, obstacle detection, etc. Self-parking systems developed by Mercedes, Google’s self-driving cars, and BMW’s iDrive Controller with touchscreen control and gesture recognition are just a few examples. This market is expected to grow in the coming years due to the rising potential of vision-based technology for autonomous and semi-autonomous vehicles.

The second largest market is the consumer and mobile market. Its applications run on consumer and mobile devices with embedded digital cameras, such as smartphones and tablets. The key applications are gesture recognition, smartphone apps, VR and AR, and OCR. Microsoft’s Kinect and the PlayStation Eye, which are extremely popular today in the gaming industry, are good examples. Virtual reality is today’s “killer app”, and according to Tractica, it may lead to a brand-new market segment within the consumer and mobile market.

The medical sector, despite its relatively low current volume, seems to be the most promising in terms of general human development. CV provides solutions for cancer detection, medical imaging and diagnosis, picture archiving and communications, surgical imaging, etc. A good example is Microsoft’s InnerEye initiative, which is focused on image diagnostics and has made progress in delineating cancerous tumors.

The main Computer Vision platforms are Amazon Rekognition API, Google Cloud Vision API, Microsoft Computer Vision API, IBM Watson Visual Recognition API, and Oracle (Apiary) CloudSight API.

Natural Language Processing

Compared with the CV market, the global NLP market is evolving less rapidly. However, according to Tractica’s forecast, it is expected to grow from $136 million in 2016 to $5.4 billion by 2025.
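
To put the two Tractica forecasts quoted in this article side by side, one option is to convert each into an implied compound annual growth rate (CAGR). The sketch below is illustrative only: the function name `cagr` is ours, and it assumes the 2016 and 2025 figures are comparable endpoints separated by nine annual growth periods, which the reports themselves may define differently.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Implied compound annual growth rate between two endpoint values."""
    return (end_value / start_value) ** (1 / years) - 1


# Figures quoted in the article, in billions of US dollars.
YEARS = 2025 - 2016  # assumption: nine annual growth periods

cv_cagr = cagr(1.1, 26.2, YEARS)     # computer vision: $1.1B -> $26.2B
nlp_cagr = cagr(0.136, 5.4, YEARS)   # NLP: $136M -> $5.4B

print(f"CV implied CAGR:  {cv_cagr:.1%}")
print(f"NLP implied CAGR: {nlp_cagr:.1%}")
```

Under these assumptions, the CV forecast works out to roughly 42% per year and the NLP forecast to roughly 50% per year. In absolute terms the CV forecast adds about $25 billion against about $5 billion for NLP, so which market looks faster depends on whether growth is measured in dollars or in percentages.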

While today NLP is used mostly as a user interface technology heavily driven by the consumer market, its enormous business potential lies in the processing of unstructured data such as text documents, audio and video files. This technology is maturing at a rapid pace while the global market is ready and waiting.

According to the available statistics, in 2016 there were about 170 companies specializing in NLP technologies, including speech recognition and virtual personal assistants.

The main application markets for NLP are automotive, consumer and mobile, entertainment, healthcare, banking, financial services and insurance (BFSI), and education and research, with the consumer and BFSI markets leading the field.

The driving factor for the predicted growth is the accelerating adoption of chatbots and virtual digital assistants.

A chatbot is a computer program that holds a conversation through voice commands and/or text chats. According to Forbes magazine, the global chatbot market was estimated at over $190 million in 2016 and is expected to grow. This technology is already quite popular. A large number of well-known brands have already acquired it and introduced it into their products, such as smart cars, smart homes, social networks, and so on. Large enterprises are also catching the wave and actively using chatbots for business needs. A prominent example is customer care call assistance, which can provide a customer with personalized virtual assistance.
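
To make the idea of a text chatbot concrete, here is a deliberately simple sketch of keyword-based intent matching. It is not how the commercial platforms mentioned in this article work; the intents, keywords, and replies are hypothetical and chosen only for illustration, and a production system would rely on a trained language-understanding model instead.

```python
import re

# Hypothetical intents and keywords, for illustration only.
INTENTS = {
    "order_status": {"order", "delivery", "shipping", "track"},
    "billing": {"invoice", "refund", "payment", "charge"},
    "greeting": {"hello", "hi", "hey"},
}

# Hypothetical canned replies, one per intent plus a fallback.
REPLIES = {
    "order_status": "I can help with your order. What is the order number?",
    "billing": "I can help with billing. Which invoice is this about?",
    "greeting": "Hello! How can I help you today?",
    "fallback": "Sorry, I did not understand that. Could you rephrase?",
}


def detect_intent(message: str) -> str:
    """Return the intent whose keywords overlap the message the most."""
    words = set(re.findall(r"[a-z']+", message.lower()))
    best_intent, best_score = "fallback", 0
    for intent, keywords in INTENTS.items():
        score = len(words & keywords)
        if score > best_score:
            best_intent, best_score = intent, score
    return best_intent


def reply(message: str) -> str:
    """Map a user message to a canned reply via its detected intent."""
    return REPLIES[detect_intent(message)]


print(reply("Where is my order?"))
```

Even this toy version shows why real chatbots need NLP: any phrasing that avoids the listed keywords falls through to the fallback reply.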

As for the BFSI industry, along with personal virtual assistance, NLP has found its use in market intelligence. NLP technologies are able to extract business-critical information from unstructured data and provide a company with insight into the state of the market, employment changes, and other relevant information.

It is worth mentioning that NLP has also proved useful in the healthcare market. The key applications here are nursing assistants, automated care, management workflows, administrative workflows, and telemedical networks.

The biggest NLP Platforms are Wit.ai, Api.ai, Luis AI, Amazon Lex, and IBM Watson.

Voice vs. Gesture

Gesture recognition is a perceptual user interface based on CV technology that allows the computer to interpret human motions as commands. Likewise, voice recognition is a perceptual user interface, based on NLP, which enables a machine or program to recognize spoken language and, as a result, understand and carry out voice commands.

Both interfaces are used as an alternative to touch control, allowing users to communicate with a computer without physically touching it, thus making the mouse and keyboard superfluous.

The major challenge facing voice recognition is the complexity of human language, which is ambiguous, has an abundant lexicon, and offers multiple ways of expressing the same thing. In addition, its dynamic nature requires regular updating. At present, the voice recognition interface works well only for simple tasks; difficulties presented by slang, regional accents, sarcasm and irony, mumbling, ambient noise, etc. are still to be overcome. Despite all this, the voice recognition error rate is constantly improving: according to Google, in 2017 it was 4.9%, compared to 8.5% in 2016.
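
The two error rates quoted above can be read either as an absolute drop in percentage points or as a relative reduction. The short calculation below makes the difference explicit. It is a sketch that assumes the two figures were measured in a comparable way, which the article does not specify; the variable and function names are ours.

```python
def relative_reduction(old_rate: float, new_rate: float) -> float:
    """Fraction by which new_rate is lower than old_rate."""
    return (old_rate - new_rate) / old_rate


# Error rates quoted in the article (attributed to Google).
error_rate_2016 = 0.085
error_rate_2017 = 0.049

absolute_drop = error_rate_2016 - error_rate_2017
relative_drop = relative_reduction(error_rate_2016, error_rate_2017)

print(f"Absolute drop: {absolute_drop * 100:.1f} percentage points")
print(f"Relative drop: {relative_drop:.1%}")
```

On these figures the error rate fell by 3.6 percentage points, which corresponds to a relative reduction of a little over 42% in one year.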

By contrast, gesture recognition does not have difficulties with identification. It can differentiate between people, so an unauthorized person cannot use the system. This feature is essential for smart homes, protecting them from intruders. Gesture recognition’s main weakness is its dependence on lighting conditions, because gesture control is based on computer vision, which relies heavily on cameras. These cameras are used to interpret gestures in 2D and 3D, so the extracted information can vary depending on the source of light. As a result, the system cannot work in a dark environment.

Speaking of cameras, this technology requires 2-3 camera sensors that detect thousands of points to interpret the gesture correctly. Such cameras can be expensive, but prices are predicted to decrease as time passes.

Unlike gesture recognition, speech recognition does not depend on lighting conditions, requires only compact hardware, and is not expensive. On the other hand, gesture control beats voice control because of its natural, spontaneous character. This aspect plays a key role in the implementation of speech recognition technologies into a number of products, making the process faster, easier, and safer.

So, what does the future look like? Voice or gesture?

There is no absolute answer to this question. Voice and gesture control are mostly applied in different fields, in response to the specific tasks that call for them. However, there are some markets where these technologies compete with one another. One such market is the automotive market, which has also already adopted VR technology.

The major concern for the automotive market is safety. Today, 20-40% of car accidents are caused by driver distraction. Mostly the driver’s attention is affected by simple things such as switching radio stations, turning on the air conditioning system, or setting directions in a navigation program. At first glance, voice control seems to be a good alternative to haptic control. However, most car manufacturers, in particular BMW, Volkswagen, Subaru, Hyundai, and Seat, lean toward gesture control.

The main reason for preferring gesture recognition technology is that gestures are made unconsciously, without distracting the driver. By contrast, the use of voice can affect visual attention: after giving a voice command, the driver needs a physical reference point to check whether the command was understood by the system. Besides, a car is too noisy an environment for voice recognition. Conversations with fellow passengers, traffic noise, and other sounds can lead to system errors. Another factor is that the output of the voice system can become annoying with time, as it interrupts other audio such as music. Finally, voice recognition is more difficult to implement in practice, as it requires development for different languages and regular updating of the system. Therefore, the gesture control system appears to be more suitable for, and more widely used in, the automotive market.

Another sphere where these technologies compete is the consumer market. Both voice and gesture recognition have found an application in the smart home. A smart home is a place that incorporates advanced automation systems to provide remote control of electronic devices, heating, and lighting. Several voice-controlled devices, in particular the Amazon Echo and Google Home, are already on the mass market and quite popular.

Voice UIs are definitely leading this market today, but as Tractica forecasts, gesture control will gain more traction in the upcoming years. Moreover, no single smart home UI platform provider is expected to dominate.

It is no surprise that in the mobile market voice control dominates the other touchless user interfaces today, and it is expected to remain so, mostly due to the global tendency of screens getting much smaller (e.g. smartwatches). Besides, a lot of commands are simply easier and cheaper to give by voice. The key players here are Google, Apple, and Amazon.

In the healthcare market, both technologies have a large potential, but most of their applications are still at the initial stage of implementation. As of today, there is no AI interface that provides a fully open API for different applications at a low price and with large-scale adoption. Still, voice recognition technology is more in demand due to the variety of its application areas.

According to a report from Accenture, the top three AI applications in the healthcare market are robot-assisted surgery ($40 billion value), virtual nursing assistants ($20 billion value), and administrative workflow assistance ($18 billion value), the latter two of which rely heavily on NLP. A good example is the virtual assistant Sensely, which raised $8 million for its virtual nurse app. On the other hand, gesture control technology suits the operating room better. In addition, gesture control can also be used in medical devices with remote control or navigation.

In summary, both voice and gesture recognition are developing technologies, and their application depends on the nature of the specific tasks that require one or the other. In areas where both technologies can be used interchangeably, the choice between voice and gesture recognition is often made on the basis of the level of technological development. In this respect, gesture recognition is currently more developed, but voice recognition is catching up at a fast pace.