OpenAI has launched a new suite of advanced audio models in its API, aimed at empowering developers to build more capable, customizable, and intelligent voice agents. These models represent a significant step forward in the accuracy, reliability, and expressiveness of both speech-to-text and text-to-speech technologies. The release is motivated by the need for more intuitive human-agent interactions beyond text, leveraging natural spoken language.
How OpenAI’s Audio Models Are Changing the Game
The new speech-to-text models, gpt-4o-transcribe and gpt-4o-mini-transcribe, set a new state of the art in transcription accuracy. They outperform the existing Whisper models, particularly under challenging conditions such as accents, noisy environments, and varying speech speeds. These improvements are attributed to targeted innovations in reinforcement learning and extensive pretraining on diverse, high-quality audio datasets. The result is more reliable transcription, well suited to use cases such as customer call centers and meeting-note transcription.
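As a concrete sketch of how such a model might be called, the snippet below builds a multipart/form-data request body for OpenAI's transcription endpoint (`POST /v1/audio/transcriptions`). The field names (`model`, `file`) follow the public API documentation, but treat the details as an assumption and check the current API reference before relying on them:

```python
import os
import uuid

def build_transcription_request(file_path: str, model: str = "gpt-4o-transcribe"):
    """Return (content_type, body) for a multipart upload to the
    transcription endpoint. Only standard-library code is used."""
    boundary = uuid.uuid4().hex
    with open(file_path, "rb") as f:
        audio_bytes = f.read()
    crlf = b"\r\n"
    body = b"".join([
        # first part: which model to use
        f"--{boundary}".encode(), crlf,
        b'Content-Disposition: form-data; name="model"', crlf, crlf,
        model.encode(), crlf,
        # second part: the audio file itself
        f"--{boundary}".encode(), crlf,
        f'Content-Disposition: form-data; name="file"; filename="{os.path.basename(file_path)}"'.encode(), crlf,
        b"Content-Type: audio/mpeg", crlf, crlf,
        audio_bytes, crlf,
        # closing boundary
        f"--{boundary}--".encode(), crlf,
    ])
    return f"multipart/form-data; boundary={boundary}", body
```

The resulting body would be sent with an `Authorization: Bearer <API key>` header; swapping the model string to `gpt-4o-mini-transcribe` trades a little accuracy for lower cost and latency.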
For the first time, developers can instruct the new text-to-speech model, gpt-4o-mini-tts, on how to speak, not just what to say. This steerability enables a new level of customization, such as a more empathetic customer-service voice or expressive narration for storytelling. The model is limited to artificial, preset voices, which OpenAI monitors.
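The what/how split maps directly onto the request: a minimal sketch of a steerable text-to-speech call, where `input` carries the words and `instructions` carries the delivery. The field names follow the public API documentation, and the voice name `coral` is one of OpenAI's presets used here as an example; verify both against the current reference:

```python
import json
import os
import urllib.request

def build_tts_payload(text: str, style: str, voice: str = "coral") -> dict:
    """Assemble the JSON payload for POST /v1/audio/speech."""
    return {
        "model": "gpt-4o-mini-tts",
        "voice": voice,         # must be one of the preset voices
        "input": text,          # WHAT to say
        "instructions": style,  # HOW to say it: tone, pacing, persona
    }

def synthesize(payload: dict, api_key: str) -> bytes:
    """Send the payload and return the raw audio bytes."""
    req = urllib.request.Request(
        "https://api.openai.com/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()

if __name__ == "__main__":
    payload = build_tts_payload(
        "Your order has shipped and should arrive on Friday.",
        "Speak warmly and patiently, like an empathetic support agent.",
    )
    api_key = os.environ.get("OPENAI_API_KEY")
    if api_key:  # only hit the network when a key is configured
        with open("reply.mp3", "wb") as out:
            out.write(synthesize(payload, api_key))
```

Changing only the `instructions` string (for example, from an empathetic support tone to a dramatic storytelling one) re-voices the same text without touching the input.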
The models are built on the GPT-4o and GPT-4o-mini architectures and have been extensively pretrained on audio-centric datasets. OpenAI has also refined its distillation techniques, transferring knowledge from larger audio models to smaller, more efficient ones. A reinforcement-learning-heavy paradigm has been integrated into the speech-to-text models, markedly improving precision and reducing hallucinations.
These new OpenAI audio models are available to all developers through the speech-to-text and text-to-speech APIs. OpenAI is also releasing an integration with the Agents SDK to simplify the development process for conversational experiences.
Looking ahead, OpenAI plans to continue improving the intelligence and accuracy of its audio models, exploring ways to allow developers to bring their own custom voices while adhering to safety standards. They are also committed to engaging in conversations with policymakers, researchers, developers, and creatives about the opportunities and challenges presented by synthetic voices. Finally, OpenAI intends to invest in other modalities, including video, to enable developers to build multimodal agentic experiences.
Key Takeaways
- New Audio Models: OpenAI launched new state-of-the-art speech-to-text (gpt-4o-transcribe, gpt-4o-mini-transcribe) and text-to-speech (gpt-4o-mini-tts) models.
- Improved Accuracy: Speech-to-text models have improved word error rate and better language recognition, especially in challenging audio conditions.
- Text-to-Speech Steerability: Developers can now instruct the text-to-speech model on how to speak, enabling more customized and expressive voices.
- Technical Innovations: The models leverage pretraining on audio-centric datasets, advanced distillation methodologies, and a reinforcement learning paradigm.
- API Availability: The models are available to all developers through the speech-to-text and text-to-speech APIs, with an integration with the Agents SDK simplifying development.
- Future Plans: OpenAI plans to continue improving the intelligence and accuracy of its audio models, explore custom voices, and invest in other modalities like video.
- Ethical Considerations: OpenAI continues to engage in conversations around the challenges and opportunities presented by synthetic voices, maintaining a focus on safety standards.
- Target Use Cases: Customer service agents, note-taking, and content creation.
Links
Announcement: Introducing our next generation audio models
Live demo: OpenAI.fm