In a significant step toward expanding access to classic literature, Project Gutenberg has partnered with the Massachusetts Institute of Technology (MIT) and Microsoft to create an extensive collection of audiobooks using artificial intelligence (AI). The project offers thousands of free audiobooks on major platforms such as Spotify, Apple Podcasts, and Google Podcasts.
The project draws on recent advances in neural speech synthesis that produce human-sounding narration, bringing thousands of beloved books to life in an accessible audio format. It can even read a book in the user’s own voice from just 5 seconds of sample audio.
This initiative, led by Mark Hamilton (MIT) and Brendan Walsh (Microsoft), along with supervising professor William T. Freeman (MIT), seeks to widen access to literature for people with visual impairments, language learners, children, and those who simply prefer to listen to their books.
Harnessing AI to Scale Audiobook Production
Whether you’re learning to read, looking for inclusive reading technology, or about to embark on a long journey, audiobooks can be an invaluable resource. However, creating audiobooks is not as simple as pressing “play.” Recording professional human readers can be time-consuming and expensive, requiring hundreds of hours of reading time per book.
With the rate of book publishing steadily increasing, creators are looking for faster solutions. Automated audiobook production offers a promising alternative, but historically it has been plagued by robotic, unnatural narration. In addition, it is difficult for an algorithm to work out which parts of a book should be read aloud. Humans know to skip page numbers, tables of contents, and footnotes, but an algorithm has to be explicitly designed to avoid these pitfalls.
Project Gutenberg, the oldest online library with more than 60,000 works, is aware of these challenges. Project Gutenberg CEO Greg Newby says, “We had tried to create audiobooks in the past, but the quality simply wasn’t very good, so we abandoned the effort. With this new technology, our partners were able to create much better quality audiobooks much faster than before.”
The project uses recent advances in neural speech synthesis to create realistic voices that sound similar to native human speakers. The approach relies on a deep network trained to mimic the quality and tone of native speakers. It can speak multiple languages and can even identify emotional passages and adjust its reading style to match.
Evaluating Books for Structure
With a high-quality speech synthesis model in hand, the team set out to narrate as many of Project Gutenberg’s 60,000+ books as possible. Mark Hamilton, one of the project leaders, says this was the hardest part: “It’s hard to find even two books in Project Gutenberg that have exactly the same structure. While the books display nicely for online readers, they contain all kinds of text that you wouldn’t want to hear in your audiobook. It became more of an art than a science to find what users would want to hear in a given book.”
To address this, the team searched the collection for large groups of books with a similar appearance and file format. This made it possible to build specialized parsers tailored to the peculiarities of each group of books. In the end, the team identified more than 5,000 books that could be parsed with reasonable accuracy.
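The kind of filtering described above can be illustrated with a minimal Python sketch. It is not the team’s parser: the function name `narratable_lines` and the three regular expressions are hypothetical, and they assume one particular plain-text layout in which page numbers and footnotes appear on their own bracketed lines and the table of contents sits under a “CONTENTS” heading.

```python
import re

# Hypothetical patterns for one assumed plain-text layout (illustration only).
PAGE_NUMBER = re.compile(r"^\[?(?:page\s+)?\d+\]?$", re.IGNORECASE)
FOOTNOTE = re.compile(r"^\[(?:footnote\s+)?\d+[:\]]", re.IGNORECASE)
TOC_HEADING = re.compile(r"^(?:table of )?contents\.?$", re.IGNORECASE)


def narratable_lines(text):
    """Return the lines of `text` that a narrator would plausibly read aloud."""
    kept = []
    in_toc = False          # True while we are inside the table of contents
    toc_seen_entry = False  # True once at least one contents entry was skipped
    for line in text.splitlines():
        stripped = line.strip()
        if TOC_HEADING.match(stripped):
            in_toc, toc_seen_entry = True, False
            continue
        if in_toc:
            if stripped:
                toc_seen_entry = True   # a contents entry: skip it
            elif toc_seen_entry:
                in_toc = False          # blank line after entries ends the block
            continue
        # Outside the contents block, drop blanks, page numbers and footnotes.
        if not stripped or PAGE_NUMBER.match(stripped) or FOOTNOTE.match(stripped):
            continue
        kept.append(stripped)
    return kept


sample = """CONTENTS

Chapter I    1
Chapter II   9

CHAPTER I

The rain had stopped by morning.
[Page 12]
She walked down to the harbour.
[Footnote 1: A note for readers.]
"""

for line in narratable_lines(sample):
    print(line)
```

A rule set like this only fits books that share the assumed layout, which is one way to see why grouping books by format before writing a parser matters.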
Speaking Millions of Sentences
The next challenge the team faced was how to efficiently speak the millions of sentences extracted from the 5,000 books. Done one sentence at a time, this would take too long even for a computer. To make sure these algorithms could scale, the team used the SynapseML distributed computing library to orchestrate millions of model inference calls on hundreds of machines. This allowed the researchers to use modern text-to-speech services such as VALL-E and Microsoft AI to create more than 35,000 hours of audiobooks in just over two hours, at no cost to the nonprofit Project Gutenberg.
Audiobook lovers can listen to the entire collection for free on most major podcast platforms, including Spotify, Google Podcasts, and Apple Podcasts, as well as on the Internet Archive.
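The idea of fanning out many independent speech-synthesis calls can be sketched on a single machine with Python’s standard library. This is only an analogy for the distributed setup described above: it does not use SynapseML, and `synthesize` is a hypothetical placeholder that returns fake bytes instead of calling a real text-to-speech service.

```python
from concurrent.futures import ThreadPoolExecutor


def synthesize(sentence: str) -> bytes:
    """Hypothetical stand-in for a text-to-speech request.

    A real implementation would call a speech service and return audio;
    here we simply encode the text so the sketch runs anywhere.
    """
    return sentence.encode("utf-8")


def narrate(sentences, max_workers=8):
    """Synthesize every sentence concurrently, keeping the original order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, so the audio clips
        # can be concatenated directly into a chapter afterwards.
        return list(pool.map(synthesize, sentences))


clips = narrate(["The rain had stopped.", "She walked to the harbour."])
print(len(clips), "clips synthesized")
```

Because each sentence is independent of the others, the same pattern extends from threads on one machine to workers spread across a cluster, which is the role a distributed library plays in the project.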
Creating Audiobooks in Your Own Voice
After contributing the 5,000 audiobooks to the public domain, the team demonstrated an application that could create an entire audiobook in someone’s voice, using just 5 seconds of sample audio. This demonstration, called “Automated Large-Scale Audiobook Creation” and presented at the Interspeech 2023 conference, illustrated how the latest advances in generative speech synthesis could be quickly used to create customized audiobooks for anyone with a microphone. The team hopes to explore whether this technology can help create more inclusive audiobooks that foster a more personal connection between listeners and their favorite works.
Bringing Classic Literature to a Global Audience
Thanks to the collaboration, Project Gutenberg has expanded its audiobook collection by nearly 5,000 titles, which are now available on popular platforms such as Spotify and Apple Podcasts. Newby considers this a milestone in Project Gutenberg’s journey, expressing optimism that “our library is more accessible than ever.”