The ultimate AI voice can read you a story with emotion


The Oxford Centre for Innovation (OCFI) on New Road, Oxford, has welcomed DeepZen, a company which has created an ultra-realistic voice using artificial intelligence that can convert text to audio, regardless of length.

Traditional speech systems generate every word separately and then stitch the words together to form a sentence, which is why they sound robotic. DeepZen’s technology instead synthesizes the human voice as a whole, replicating emotions and intonations. Check out the company’s website, where you can hear ‘William’ reading The Reluctant Cannibals and Kafka’s Metamorphosis narrated by ‘Lauren’ – you simply won’t believe these are AI-produced voices. Amazing.
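The word-by-word approach described above can be illustrated with a toy sketch (mock sample data and a made-up clip table, not DeepZen’s actual pipeline): each word is looked up independently and the clips are simply joined, so there is no sentence-level intonation — the limitation neural systems address by generating the whole utterance in one pass.

```python
# Toy illustration of concatenative text-to-speech.
# WORD_CLIPS is a hypothetical lookup table of per-word "recordings",
# mocked here as short lists of sample values.
WORD_CLIPS = {
    "hello": [0.1, 0.2, 0.1],
    "world": [0.3, 0.2, 0.3],
}

def concatenative_tts(text):
    """Look up each word's clip and join them end to end.

    Because every word is rendered in isolation, prosody (emotion,
    emphasis, intonation) across the sentence is lost - the 'robotic'
    quality the article describes.
    """
    samples = []
    for word in text.lower().split():
        samples.extend(WORD_CLIPS[word])
    return samples

print(concatenative_tts("hello world"))
# -> [0.1, 0.2, 0.1, 0.3, 0.2, 0.3]
```

A neural system, by contrast, conditions every output sample on the whole sentence, which is what lets it shape intonation across word boundaries.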

Not only is the technology incredible, but it means that an audiobook can be produced in days rather than weeks. DeepZen CTO and founder Kerem said: “A typical audiobook will cost around $5,000 to produce; we are aiming to reduce that significantly. A ten-hour audiobook can be produced by us in just a few hours; the rest of the time is spent on editing to check for continuity, context and emotion. This will get quicker with time, as our algorithm improves.”

Currently the DeepZen team has just five voices – which can have different accents and speak in different languages – but soon it will be able to simulate well-known voices too: from a short recording, the system can imitate the right tone, pauses, tempo and expression.

DeepZen’s main focus is on audiobooks – currently an $8bn market worldwide and set to grow by 25 per cent per year – of which DeepZen plans to take a large chunk in two ways.

Kerem said: “We are currently creating audiobooks for publishers and simply charging for production, but we are also co-publishing with the independent UK publishers Legend Press and Endeavour Media, and are ‘in conversation’ with the big six publishing houses. At the moment, two million books are published annually but only 3 per cent are converted into audiobooks, so there’s a big gap in the market that DeepZen wants to fill.”

The company is also working with an agency to do short voiceovers for advertisers, gaming companies and animation, and developing text-to-speech tools which help online training and education by adding voice features to literacy apps, e-learning platforms and digital learning tools.

DeepZen is also looking at producing audio content for exhibitions in museums and galleries.

DeepZen was started by Kerem and colleague Taylan Kamis (CEO) in 2017, but a team including Omer Gunes (NLP lead) and Spyridoula Papandreou (TTS language engineer) quickly formed to develop the product. Their first office was close to Paddington Station in West London, but they also wanted a base in Oxford, as that’s where the expertise is. The business is now 14-strong, with language specialists, editors and software developers based in London and at OCFI.

So what is DeepZen’s greatest challenge for 2020? “Scaling up,” says Kerem. “We’ve done a first round of seed funding and are now doing our second round. With this new round we aim to expand our R&D capabilities, bring in more editorial staff and work on new languages.”