The artificial intelligence model, named Voice Engine, needs just a single 15-second audio sample to generate speech mimicking that of the original speaker, OpenAI announced in a blog post Friday. The technology was first developed in late 2022 and has been used to power the preset voices available in the text-to-speech API as well as in its ChatGPT Voice and Read Aloud features.
The technology has been tested with OpenAI's corporate partners with striking results. For example, the company shared tearjerking audio of a young girl speaking again thanks to doctors Fatima Mirza, Rohaid Ali and Konstantina Svokos at the Norman Prince Neurosciences Institute.
The girl lost her ability to speak normally because of a vascular brain tumor. While she can still form words and sentences, her voice no longer sounds the way it once did. The doctors used a clip of audio she had recorded for a school project to restore her original voice, so her speech no longer sounds impaired.
"We are taking a cautious and informed approach to a broader release due to the potential for synthetic voice misuse," the company said. "We hope to start a dialogue on the responsible deployment of synthetic voices, and how society can adapt to these new capabilities."
OpenAI, which has not released the model as a standalone product or broader tool, said it started privately testing its abilities with a "small group of trusted partners" and has been "impressed by the applications" of it. However, the company said it continues to have conversations about whether and how to deploy the technology at scale.
Among its practical applications, OpenAI said that Voice Engine could be used to provide reading assistance to non-readers and children. The company has partnered with Age of Learning, an education technology company that has been using the technology to generate scripted educational content.
OpenAI shared a 15-second sample of original audio recorded by the company in which a male narrator defines "force" in the context of physics. The model was then applied to other themes, allowing the AI to generate audio relating to biology, chemistry, reading and math.
HeyGen, another adopter of the technology, is an AI visual storytelling platform that works with other companies to create human-like avatars for product marketing and sales demonstrations. It uses Voice Engine to translate the audio in its videos.
"When used for translation, Voice Engine preserves the native accent of the original speaker: for example generating English with an audio sample from a French speaker would produce speech with a French accent," OpenAI said.
The company shared audio of an American-sounding woman speaking in English as the source clip, which was then translated into Spanish, Mandarin, German, French and Japanese -- all in the voice of the original woman.
The tool has also been used to support people who are non-verbal through Livox, a Brazilian company whose AI-powered alternative communication app allows non-verbal users to speak with voices generated by Voice Engine.
"So, for example, a non-verbal person can have a unique voice that is not robotic and sounds exactly the same in several languages," Livox said on social media. "We hope Livox users will be able to have access to these voices soon!"
The news comes after OpenAI unveiled its video-generating model, Sora, which can create realistic video from a text prompt. Critics have grown increasingly concerned about the ramifications of such artificial intelligence models, including their ability to create deepfake audio and video.