Should AI agents’ voice interactions be more like our own? What effects should we anticipate?
An article at Wired.com considers the pros and cons of making the voice interactions of AI assistants more humanlike.
The assumption that more human-like speech from AIs is naturally better may prove as incorrect as the belief that the desktop metaphor was the best way to make humans more proficient in using computers. When designing the interfaces between humans and machines, should we minimize the demands placed on users to learn more about the system they’re interacting with? That seems to have been Alan Kay’s assumption when he designed the first desktop interface back in 1970.
Problems arise when the interaction metaphor diverges too far from the reality of how the underlying system is organized and works. In a personal example, someone dear to me grew up helping her mother, an office manager for several businesses. Dear one was thoroughly familiar with physical desktops, paper documents and forms, file folders, and filing cabinets. As I explained how to create, save, and retrieve information on a 1990 Mac, she quickly overcame her initial fear. “Oh, it’s just like in the real world!” (Chalk one up for Alan Kay? Not so fast.) I knew better than to tell her the truth at that point. Dear one’s Mac honeymoon crashed a few days later when, to her horror and confusion, she discovered a file cabinet inside a folder. A few years later, there was another metaphor collapse when she clicked on a string of underlined text in a document and was forcibly and instantly transported to a strange destination.
Having come to terms with computers through the command-line interface, I found the desktop metaphor annoying and unnecessary. Hyperlinking, however, is another matter altogether: an innovation that multiplied the value I found in computing.
On the other end of the complexity spectrum would be machine-level code. There would be no general computing today if we all had to speak to computers in their own fundamental language of ones and zeros. That hasn’t stopped some hard-core computer geeks from advocating extreme positions on appropriate interaction modes, as reflected in this quote from a 1984 edition of InfoWorld:
“There isn’t any software! Only different internal states of hardware. It’s all hardware! It’s a shame programmers don’t grok that better.”
Interaction designers operate on the metaphor end of the spectrum by necessity. The human brain organizes concepts by semantic association. But sometimes a different metaphor makes all the difference. And sometimes, to be truly proficient when interacting with automation systems, we have to invest the effort to understand less simplistic metaphors.
The article referenced at the beginning of this post mentions that humans are manually coding “speech synthesis markup tags” to make the synthesized voices of AI systems sound more natural. (Note that this creates an appearance that the AI understands the user’s intent and emotional state, though this more natural intelligence is illusory.) Intuitively, this sounds appropriate. The downside, as the article points out, is that colloquial AI speech limits human-machine interactions to the sort of vagueness inherent in informal speech. It also trains humans to be less articulate. The result may be interactions that fail to clearly communicate what either party actually means.
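For readers curious what hand-coded speech markup might look like, here is a minimal sketch in Python. It assumes the tags in question resemble the W3C's Speech Synthesis Markup Language (SSML); the article does not say which markup is used, and the function name `build_ssml` and the sample sentences are my own inventions for illustration. A strict SSML document would also declare a version and namespace on the `speak` element, which I have left out for brevity.

```python
import xml.etree.ElementTree as ET


def build_ssml(greeting: str, answer: str) -> str:
    """Hypothetical helper: wrap two phrases in hand-chosen SSML tags."""
    # <speak> is the root element of an SSML document.
    speak = ET.Element("speak")
    speak.text = greeting

    # A short pause, like the one a person takes before answering.
    ET.SubElement(speak, "break", {"time": "300ms"})

    # Slow the answer and raise the pitch slightly so it sounds less flat.
    prosody = ET.SubElement(speak, "prosody", {"rate": "slow", "pitch": "+5%"})
    prosody.text = answer

    # Serialize to a plain string that could be handed to a speech engine.
    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    print(build_ssml("Hmm, let me check.", "Your flight leaves at noon."))
```

Notice that every pause and change of pitch here was chosen by a person ahead of time. Nothing in the markup reflects any understanding of the listener, which is the illusion described above.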
I suspect a colloquial mode could be more effective in certain kinds of interactions: when attempting to deceive a human into thinking she’s speaking with another human; virtual talk therapy; when translating from one language to another in situations where idioms, inflections, pauses, tonality, and other linguistic nuances affect meaning and emotion; etc.
In conclusion, operating systems, applications, and AIs are not humans. To improve our effectiveness in using more complex automation systems, we will have to meet them farther along the complexity continuum: still far from machine code, but at points of complexity that require much more of us as users.