How do you want non-fiction served to your ears?
Authenticity may be the only - and most important - feature lacking in AI audio
If AI is still grappling with the subtleties of fictional narration, then surely there is low-hanging fruit to be picked in the non-fiction genre. Actor and famed narrator Edoardo Ballerini is ‘not a fan of AI voices’ but asserts that ‘there is a reasonable argument that it can serve a purpose, with backlist titles and nonfiction that nobody was going to put into audio anyway’. This is a matter close to my heart as I consider audiobook options for Mathematical Intelligence. How does a machine’s rendition of my words differ from a human’s?
To answer that question, I took on the role of human narrator for the opening excerpt of the Imagination chapter of the book. I then fed the same text to Speechify, a ‘leading text-to-speech reader’. Speechify enables listeners to ‘power through’ content. It is a boon for visually impaired people and others who struggle with text content: web articles and PDFs are finally made accessible through the medium of sound.
But how does Speechify fare with non-fiction audiobooks? Does it match up to my own narration of Mathematical Intelligence? Let your ears be the judge - here are the two samples:
Can you tell which is which? Of course you can: Speechify’s narration is betrayed by its monotonous tone and pacing, its lack of enthusiasm, and its apparent refusal to make any attempt at humour. Even though Speechify licenses celebrity voices, familiar and entertaining voices are not enough to rescue the listening experience - even Snoop Dogg tires after a while.
In its defence, Speechify’s AI-generated audio works reasonably well for reading aloud short-form content. I embedded Speechify as a browser extension and found some joy in its rendering of social media posts and sardonic YouTube and Reddit comments. But for longer-form writing, its flaws are readily apparent.
The false trichotomy of reading
Or perhaps they are not flaws at all - could the relative bluntness of Speechify be a design feature rather than a bug? When you create a Speechify account, it will ask you for your preferred reading speed, even before you specify your reading interests. The latter choice is itself restricted to ‘work’, ‘leisure’ and ‘school’, as if these categories are neatly defined and mutually exclusive. The choice harks back to Mortimer Adler’s blunt tripartite schema in How to Read a Book, which suggests that reading serves either to entertain, inform or understand - but never more than one at a time. Yet some of my greatest lessons have been derived from the imagined worlds of fiction authors, and some of my most entertaining reads have been shaped by the real-world stories of non-fiction writers. My chosen chapter excerpt, rooted in an anecdote about Monopoly-induced arguments, is not just there to entertain, or to inform - most writers realise that these aims are mutually reinforcing.
Speechify appears to subscribe to the notion, made popular by productivity gurus, that the purpose of reading (and listening to) non-fiction is to ingest information - to be informed rather than entertained.
In this framing, more content, consumed at ever-faster speeds, is the reader or listener’s primary objective. This view is perhaps befitting of an AI company that relies on large language models. There is a common belief, espoused by what philosopher Raphaël Millière terms the scaling maximalists, that intelligence is purely a function of information processing, and that all of our human thinking qualities will emerge in these models once they are large enough, and have had enough data pumped into them. If this is true, then for our own human intelligence to flourish we just need to absorb as much content as our neurons can bear.
I do not identify as a scaling maximalist, glamorous though it sounds. As a reader, I like to take things slow: I need time to digest ideas, to highlight salient passages, to make notes, to thumb back through previous sections. I am always striving to connect with the author’s communicative intent. I also need time, post hoc, to ruminate on those ideas before my conclusions are formed, or are overwhelmed by a flood of new content. When listening to audiobooks, I have to go at the narrator’s pace - it is the only hope I have of staying in sync with their thinking. By the same token, as an author I want my reader or listener to absorb my ideas in real time, to have the opportunity to grapple with them, disagree, and to connect them with their wider thoughts.
At the mercy of randomness
It is easy to see why Speechify fails my litmus test for narration; it is not optimising for human-like speech. But what about tools like Elevenlabs that place more emphasis on sounding like the real thing? A curious feature of Elevenlabs is the voice stability toggle, which users can set to anywhere between 0 and 100 percent. The lower the value, the more varied and ‘random’ the tone and pacing. Here it is at 0 percent:
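For the programmatically curious, the stability toggle corresponds to a `voice_settings` field in Elevenlabs’ public text-to-speech API. The sketch below only constructs the request payload - the voice ID is a placeholder, and field names reflect the v1 API at the time of writing, so treat it as illustrative rather than definitive:

```python
import json

# Sketch of an Elevenlabs text-to-speech request with stability at its minimum.
# "YOUR_VOICE_ID" is a placeholder; note the API expresses stability on a
# 0.0-1.0 scale, while the web UI shows it as 0-100 percent.
API_URL = "https://api.elevenlabs.io/v1/text-to-speech/YOUR_VOICE_ID"

payload = {
    "text": "Monopoly has a lot to answer for.",
    "voice_settings": {
        "stability": 0.0,         # 0 percent: maximum variation in tone and pacing
        "similarity_boost": 0.75, # how closely the output tracks the source voice
    },
}

# The request body that would be POSTed (with an xi-api-key header) to API_URL.
body = json.dumps(payload)
```

Dialling `stability` towards 1.0 flattens the delivery into something closer to Speechify’s register; towards 0.0, you get the livelier, less predictable reading heard in the clip above.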
Now we’re talking - this clip may not sound fully human-like, but it is edging closer. Enthusiasm, intonation, personality…it’s all there. AI audio is climbing firmly out of the uncanny valley, and it will only resemble human voices more closely as these models evolve.
Even then, I may have reasons to hold back from deploying an AI narrator for my own writing. While the stability feature adds more dramatisation to each passage, there are large swathes of the book - particularly the ones focused on exposition and explanation - where this style is not appropriate.
The Elevenlabs clip is a loose interpretation of my own voice. That’s all it ever could be. Even if it used a clone of my voice, Elevenlabs could not possibly know which segments to inject more enthusiasm into, or when to dial the sarcastic tone up or down (much of that excerpt was written with a cynical but tongue-in-cheek tone that the Elevenlabs sample has failed to register). It is at the mercy of randomness. Just as with fiction, context-less narration fails to capture the author’s true meaning.
Keeping it real
The specific danger with non-fiction is that the listener may take the AI narration as the author’s own - a risk that does not exist with human narrators, where it is understood that the audio reflects another person’s interpretation of the writer’s words.
Authenticity may be our prized possession as creators, and as consumers of this content. This is made dramatically apparent by emerging cases of AI-enabled speech abuse. Scammers are conning their victims out of money by mimicking the voices of the victims’ relatives. And in an all too predictable turn of events, users of 4chan are exploiting the very same technology from Elevenlabs to generate synthetic hate speech, voiced by celebrities. Elevenlabs have taken mitigation steps to limit such abuses, but the genie is out of the bottle with large language models, and the information ecosystem, already contaminated with misinformation, is about to be flooded with fake content - content that becomes all the more persuasive in spoken form.
Given the ease with which humans are duped by AI imitation, it is essential that we hold fiercely to our notions of objective reality. If my writing is to ever be subject to AI narration - based on my own voice, a celebrity’s or whatever else - I want the listener to be aware that what they are hearing is an approximation of the real thing. This ought to be a standard for all large language models: clear labels that remind us consumers, gullible as we are, that we are engaging with artificial representations of human voice.
A presumed value proposition of these tools is that AI narration is better than none at all - that is the choice many authors are now faced with. I can say, categorically, that with the current state of the art I would prefer no narration at all to a misrepresentation of my work. And if AI speech providers insist on blurring the lines between what is real and what isn’t, then as creators we can - even when we deploy these tools - play our part in preserving some sense of our authentic selves.