The conversational revolution

The rise of voice interfaces

The use of computers and mobile devices has become so ingrained in our reflexes that few give thought to the fact that modern graphical interfaces – regardless of how well they’re implemented – are in effect straitjackets. They starkly narrow the full spectrum of human expression and interaction into a single focused action, such as a mouse click, in order to execute a single command and perform a function on the device and application involved.

Be it a keyboard, a mouse, a trackball or a touchscreen, all interactions with these input aids are artificial in nature and functionally highly limited. Luckily, human beings possess input and output ‘devices’ which are much more powerful: their ears and voices. These allow for a nearly limitless range of expression and comprehension – and raise the question: why don’t we use them to interface with our digital ‘counterparts’? After all, we’ve been speaking for 1.75 million years!

Even though the first voice recognition algorithms were implemented nearly 70 years ago, using a voice interface was until very recently at best a stultifying experience – and at worst, a wholly irritating one. Currently, however, voice interfaces are enjoying an exponential rate of acceptance, with the number of applications and voice-enabled devices reliably doubling every year since 2018. For instance, the number of voice-capable devices in the world was roughly two billion towards the end of 2019; by the end of 2020, it was four-and-a-half billion. This pattern of adoption not only rivals but exceeds that of the last digital revolution, the rise of mobile devices.

The reasons behind this astonishing rate of adoption are twofold: vast improvements in Natural Language Processing technologies and the development of Conversational Design as a body of knowledge have combined to make voice interfaces pleasant and reliable to use. Consumers are re-learning their modes of interaction and – increasingly often – using existing voice interfaces at home and on the go. This learning curve is far gentler than that of learning to operate a computer – humans need no instruction on how to use their voices!

Beyond powerful machine learning models for speech recognition, technologies such as sentiment analysis, semantic networks, ontologies and self-training conversational bots all conspire to allow applications to better understand human speech. This is no trivial task. Human speech patterns exhibit nearly infinite variations in syntax, inflection, vocabulary, tonality and semantics. Moreover, once a natural language processing application has derived the base meaning from a spoken or written statement, the job is only half done. Further challenges to the successful interpretation of speech arise when overlying factors such as irony, sarcasm, meta-context and even cultural influences come into play. These can easily skew or even completely reverse the meaning of what has been said. Consider the statement: “What a fascinating dress! I don’t think I’ve ever seen anything quite like it!” Taken in an ironic context, these exclamations might be anything but flattering.

Speech design languages, such as SSML (Speech Synthesis Markup Language), enable designers to create spoken outputs which effectively imitate the finest nuances of human speech. Emotions, interjections, accents, changes in pitch, volume and speed – even breathing noises – can all be incorporated. Gone are the days when listeners were confronted with voice interfaces which were easily identifiable as robotic. The output from the newest generation of voice assistants can be all but indistinguishable from human speech. For the first time, this is stoking discussions about the moral dilemmas posed by robots which arise not from theoretical hypotheses, but rather from real-life situations.
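To make the idea of speech markup more concrete, the following is a minimal sketch of how a short SSML document might be assembled in Python. The element names (speak, prosody, break, emphasis) come from the W3C SSML specification; the function name build_ssml, the sample wording and the chosen attribute values are assumptions made purely for illustration. Support for individual attributes varies between speech platforms, and effects such as breathing noises are typically vendor-specific extensions that are not shown here.

```python
import xml.etree.ElementTree as ET


def build_ssml(greeting: str, detail: str) -> str:
    """Return an SSML document: a lively greeting, a pause, then a softer detail.

    Hypothetical helper for illustration; not part of any speech platform's API.
    """
    speak = ET.Element("speak")

    # Slightly higher pitch and faster rate for a friendly opening.
    opening = ET.SubElement(speak, "prosody", {"pitch": "+10%", "rate": "110%"})
    opening.text = greeting

    # A short pause, roughly where a human speaker would take a breath.
    ET.SubElement(speak, "break", {"time": "400ms"})

    # Slower, softer delivery for the detail, ending on one emphasized word.
    closing = ET.SubElement(speak, "prosody", {"rate": "slow", "volume": "soft"})
    closing.text = detail + " "
    emphasis = ET.SubElement(closing, "emphasis", {"level": "strong"})
    emphasis.text = "Enjoy!"

    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    print(build_ssml("Hello!", "Your pizza is on its way."))
```

The resulting string would be handed to a text-to-speech service that accepts SSML. Building the markup with an XML library rather than by string concatenation keeps the document well-formed even when the spoken text contains characters such as & or <.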

Powering the advent of the ‘conversational revolution’ is a new generation of user experience architects: conversational designers. Many initial attempts to implement voice applications were hampered by a lack of understanding of how productive human conversations work. The assumption that information can be successfully disseminated regardless of channel or medium is naturally erroneous, but it’s especially easy to overlook that humans hear and process information inherently differently than they read. The result was voice applications which presented their users with long-winded spoken instructions, hierarchical menus and wholly forgettable long lists of options. Competent conversational architects know how to carry a conversation forward, how to collapse complicated navigational menus, and – most importantly – how to manage expectations for the user while providing enough personalization and contextual dialogue to maintain the natural flow of conversation. For instance, a voice application that performs a relatively simple function, such as taking an order for a pizza, can easily be made user-friendly by registering whether a customer has preferences or allergies. Similarly, if the user logs in weekly, any instructions which were helpful for the first visit quickly become tedious and can be skipped in the future. Consider that placing an order in an online shop is often a seven- or eight-step process, whereas in a restaurant, it is usually one step only!
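The pizza example can be sketched in a few lines of Python. This is only an illustration of the design principle, under the assumption that the application remembers a visit count, a usual order and any allergies per user; the names UserProfile and opening_prompt and the wording of the prompts are hypothetical and do not refer to any particular voice platform.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class UserProfile:
    """Hypothetical per-user memory a voice application might keep."""
    visits: int = 0
    usual_order: Optional[str] = None
    allergies: List[str] = field(default_factory=list)


def opening_prompt(profile: UserProfile) -> str:
    """Choose the opening line based on what is already known about the user."""
    if profile.visits == 0:
        # First visit: a short orientation is helpful.
        return ("Welcome! Just tell me what you would like, "
                "for example 'a large margherita'. What can I get you?")
    if profile.usual_order:
        # Known preference: collapse the whole menu into one yes/no question.
        prompt = f"Welcome back! Your usual {profile.usual_order}?"
        if profile.allergies:
            prompt += f" I'll keep it free of {' and '.join(profile.allergies)}."
        return prompt
    # Returning user without a stored order: skip the instructions.
    return "Welcome back! What can I get you?"


if __name__ == "__main__":
    regular = UserProfile(visits=5, usual_order="large margherita", allergies=["nuts"])
    print(opening_prompt(UserProfile()))
    print(opening_prompt(regular))
```

A first-time caller hears the brief orientation, while a regular customer can confirm the whole order with a single "yes", which is much closer to the one-step restaurant experience than to a multi-step online checkout.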

In conclusion, voice interfaces will continue to rapidly gain acceptance as a medium for interacting with our digital environment, but one should keep in mind that not all applications can be easily operated with voice commands alone. Beyond a certain level of interdependency and interactional density, such as when configuring and purchasing a new automobile, purely conversational interactions would quickly become too confusing.

Reply specializes in the design and implementation of solutions based on new communication channels and digital media, and offers a network of companies working across sectors and technologies including AI, big data, cloud computing, digital media and the Internet of Things, among many others.