Sound is a growing medium for our literary habits. Podcasts and audiobooks have accompanied us on car journeys, on long walks while pushing prams, and in the dying embers of each day as we will our little ones to succumb to sleep. From fiction to biographies to more technical tomes, we have relied on distinctly human voices to guide us through our next ‘read’.
We are therefore keeping a keen eye, and keener ears, on the latest developments in AI audio. A lot is happening. Microsoft’s VALL-E made waves recently by deploying a text-to-speech language model to replicate literally anyone’s voice. Based solely on a three-second recording, VALL-E attempts to preserve the speaker’s timbre and emotional tone. See (actually, hear) for yourself - the days of monotonous-sounding voice assistants have surely passed.
Audiobooks are a natural target for these technologies. Both Apple and Google are promising to drive down costs with automated narrations. For small publishers and little-known authors, for whom audiobooks have been prohibitively expensive, new opportunities abound to have their works ‘narrated by digital voice based on a human narrator’.
What, if anything, is lost when we substitute a human voice with a simulacrum? Is the era of human voice actors over? Kawther examines these questions through the lens of fiction, before Junaid returns next time to explore the role of AI narration in non-fiction.
The Great Gatsby retold
ElevenLabs claims to be ‘the most realistic and versatile AI speech software ever’, with an arsenal of ‘compelling, rich and lifelike voices’ ready to be adopted by creators and publishers. This realism is based on an AI model which has a ‘zoomed-out perspective’ on words: sentences are not uttered individually but generated with preceding and succeeding text in mind. Here it is narrating the opening to The Great Gatsby.
On first listen, one can’t help but be impressed. Gone is any immediate sense of robotic monotony, one detached utterance at a time. There is instead a naturalness to ElevenLabs, particularly at the sentence or small paragraph level. The AI not only sounds real but also makes respectable attempts at dramatisation, adopting different voices and intonations, and even expressing basic emotions such as surprise and hostility that are embedded in textual signifiers.
So far so good. But when you tune in properly to the words, and attempt to comprehend the meaning of The Great Gatsby as a whole, the inadequacy of a purely ‘realistic’ voice becomes apparent. The AI, its tone unwavering, struggles to render comprehensible more complex sentences within the text. Subtle lacings of humour are nearly all missed. When the narrator Nick Carraway tells us that ‘Conduct may be founded on the hard rock and wet marshes but after a certain point I don’t care what it’s founded on’, the ‘I don’t care’ is uttered smoothly, in the same uniform, consistent tone as the rest of the sentence. The lack of expressiveness is a missed opportunity to bring the text to life, to get this world - its characters, locations and moral underpinnings - to start to ‘stick’ in our minds.
It is precisely this quality of stickiness in good human narration - those moments of subtle characterisation and interpretation that reel us back into the performance, even as we zone out briefly - that is missing from AI renditions. As realistic as the voice is, over pages of text I found that it failed to ‘hook’ me, or give me enough of an anchor, without hyperconscious attentiveness on my end. At times, I found myself having to revisit the text, rereading the words to understand what I was missing on first listen - not the user experience ElevenLabs’ creators have in mind.
Beyond the telling of stories
The struggles of AI to narrate fiction convincingly (even more evident in side-by-side comparisons, such as this one by audiobook narrator Travis Baldree) speak to the very nature of voice acting, an artistic and creative skill in its own right. Reading aloud is, for humans, an act of interpretation. Unlike AI speech software, it draws not only on statistical regularities between words on the page but on the lived experience of the narrator: their past interactions, their current context, and their unique interpretation of the fictional world, which is then reflected in their vocal performance. Cognitive neuroscience has shown that human communication transcends the mere exchange of words, and takes into account mutual beliefs and understandings, as well as past events.
In Eleanor Oliphant Is Completely Fine - supremely narrated by Cathleen McCarron - we sense the narrator has some empathy for what it’s like to be a woman repeatedly accosted by an over-friendly acquaintance (the awkward IT technician, Raymond). She nods at our shared understanding of social media clichés in the way that she reads ‘#blessed’. McCarron’s voice is not just disembodied sound overlaid onto words, but part of a subjective interpretation of the author’s words on the page. It is partly for this reason that we can be drawn to certain narrators and sometimes irrationally repelled by others. Who is reading and how they are reading is so much more than a matter of tweaking the gender, tone, pitch and speed settings found on the control panel of emerging AI audio tools.
ElevenLabs tells us that its AI voices are the ‘ultimate tools for storytelling’, but as Professor of Cognitive Robotics Murray Shanahan argues, this is the kind of loose anthropomorphism that we are increasingly vulnerable to, as we start to see ‘the systems in which [Large Language Models] are embedded as more human-like than they actually are’. There is more to ‘storytelling’ than the literal reading of the words that constitute a story. As anthropologists have repeatedly argued, it is a fundamental part of what makes us human; we use stories to make sense of our world and to share that understanding with others. An AI voice lacks what Shanahan terms ‘communicative intent’; it is not making sense of the world - or crafting a shared understanding with an audience in mind - in the same way that a human narrator is.
Do these distinctions around human subjectivity versus statistical regularity - the words on the page versus real-world context - actually matter? When it comes to fiction audiobooks, the answer is a resounding yes. Narrating fiction (that people actually enjoy listening to) requires skill and depth when it comes to characterisation: an ability to convey subtleties of humour and emotion stemming from empathy, and an overarching perspective that ripples throughout the entire book. The current crop of AI speech tools, natural-sounding as they are, is not yet up to the task. AI-generated voices will only continue to improve in leaps and bounds, but even as those strides are made, there is no substitute for authenticity - that is, for the knowledge that a real person has chosen to render their own unique, subjective interpretation of a fictional world, to bring it to life creatively, to offer us something of themselves in the process. To suggest this authenticity can be automated feels oxymoronic.
Perhaps non-fiction is another story, where these critiques no longer apply because the listener’s aim is to ingest information as efficiently as possible - to be informed rather than merely entertained. Or perhaps there is more to non-fiction than many advocates of AI speech realise. We know at least one non-fiction author who thinks so: Junaid will explore the role of AI narration for his own book - and the wider implications for the genre - next time.