Provide voice-based interfaces for your conversational AI applications.
Speech AI lets people converse with devices, machines, and computers to simplify and augment their lives. A subset of conversational AI, it includes automatic speech recognition (ASR), which converts voice into text, and text-to-speech (TTS), which generates human-like speech from written words. Together, these technologies power applications like virtual assistants, real-time transcription, and voice search driven by large language models (LLMs) and retrieval-augmented generation (RAG).
Deliver exceptional customer experiences with the best-in-class accuracy that speech AI model customization makes possible.
Broaden your customer base by offering voice-based applications in the languages your customers speak.
Serve more customers with low-latency, high-throughput applications that can instantly scale on any infrastructure: on premises, cloud, edge, or embedded.
Give your customer service a boost by delivering fast and meaningful engagements with your brand's unique voice.
Learn how to build and deploy real-time speech AI pipelines for your conversational AI application.
Modern speech AI systems use deep neural network (DNN) models trained on massive datasets. Over time, the size of speech AI models has grown so much that training such models can take weeks of intensive compute time, even when using deep learning frameworks, such as PyTorch, TensorFlow, and MXNet, on high-performance GPUs.
NVIDIA speech and translation AI offers pretrained, production-quality models in the NVIDIA NGC™ catalog, trained on several public and proprietary datasets for hundreds of thousands of hours on NVIDIA DGX™ systems.
Figure 1: Highly accurate multilingual pretrained models.
Figure 2: End-to-end NVIDIA NeMo workflow.
Many enterprises have to customize speech and translation AI models to achieve the desired multilingual accuracy for their specific conversational applications. However, customizing speech AI models from scratch usually requires large training datasets and AI expertise.
To speed up development and deeply customize speech models, you can use NVIDIA NeMo™ to build, customize, and deploy speech pipelines for automatic speech recognition (ASR) and text-to-speech (TTS), as well as natural language processing (NLP) pipelines. With NeMo, you can customize, extend, and compose existing prebuilt speech AI modules to create new models. Models optimized with NeMo can easily be exported and deployed with NVIDIA® Riva as a speech service, on premises or in the cloud.
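For example, a pretrained NeMo ASR model from NGC can be loaded and run in a few lines of Python. This is a minimal sketch; the model name and audio file path are illustrative placeholders, so substitute the checkpoint and data that fit your use case.

```python
# Sketch: transcribe audio with a pretrained NeMo ASR model
# (assumes: pip install "nemo_toolkit[asr]").
import nemo.collections.asr as nemo_asr

# Download a pretrained English ASR checkpoint from NGC.
# "stt_en_conformer_ctc_large" is an example model name; swap in the model you need.
asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name="stt_en_conformer_ctc_large")

# "sample.wav" is a placeholder path to a 16 kHz mono WAV file.
transcriptions = asr_model.transcribe(["sample.wav"])
print(transcriptions[0])
```

From here, the same model can be fine-tuned on domain-specific audio with NeMo and exported for serving with Riva.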
For speech AI skills, companies have traditionally had to choose between accuracy and real-time performance. Users can’t ask a question and then wait several seconds for a response, and companies don’t want their conversational AI applications to misinterpret speech or produce gibberish.
With NVIDIA Riva, companies can achieve world-class accuracy and run their speech and translation AI pipelines in real time, within a few milliseconds. Riva offers state-of-the-art (SOTA) pretrained models on NGC that can be fine-tuned with NVIDIA NeMo for world-class accuracy, along with skills optimized for real-time performance.
Figure 3: NVIDIA Riva speech AI skills capabilities.
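As an illustration of how an application might call a running Riva server, the sketch below uses the Riva Python client to send a short audio clip for offline recognition. The server address, audio path, and language code are assumptions; adapt them to your deployment.

```python
# Sketch: offline speech recognition against a running Riva server
# (assumes: pip install nvidia-riva-client).
import riva.client

# "localhost:50051" assumes a Riva server running locally; point this at your deployment.
auth = riva.client.Auth(uri="localhost:50051")
asr_service = riva.client.ASRService(auth)

config = riva.client.RecognitionConfig(
    language_code="en-US",  # use the language your Riva models were deployed with
    max_alternatives=1,
    enable_automatic_punctuation=True,
)

# "sample.wav" is a placeholder path to a mono 16 kHz WAV file.
with open("sample.wav", "rb") as audio_file:
    audio_bytes = audio_file.read()

response = asr_service.offline_recognize(audio_bytes, config)
for result in response.results:
    print(result.alternatives[0].transcript)
```

For latency-sensitive applications, the same client also exposes streaming recognition and speech synthesis services.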
Accelerate development time with packaged AI workflows, which include NVIDIA AI frameworks and pretrained models, as well as resources such as Helm charts, Jupyter Notebooks, and documentation to help you jump-start building AI solutions.
While large-scale deployments require a purchase of NVIDIA Riva, NVIDIA also offers a variety of containers, models, and customization tools free of charge.
Sign up to receive the latest speech AI news from NVIDIA.