Provide voice-based interfaces for your conversational AI applications.
Speech AI lets people converse with devices, machines, and computers to simplify and augment their lives. A subset of conversational AI, it includes automatic speech recognition (ASR), which converts voice into text, and text-to-speech (TTS), which generates human-like speech from written words. Together, these technologies power applications like virtual assistants, real-time transcription, and voice search driven by large language models (LLMs) and retrieval-augmented generation (RAG).
Deliver exceptional customer experiences with the best-in-class accuracy that speech AI model customization makes possible.
Broaden your customer base by offering voice-based applications in the languages your customers speak.
Serve more customers with low-latency, high-throughput applications that can instantly scale on any infrastructure: on premises, cloud, edge, or embedded.
Give your customer service a boost by delivering fast and meaningful engagements with your brand's unique voice.
Learn how to build and deploy real-time speech AI pipelines for your conversational AI application.
Modern speech AI systems use deep neural network (DNN) models trained on massive datasets. Over time, the size of speech AI models has grown so much that training such models can take weeks of intensive compute time, even when using deep learning frameworks, such as PyTorch, TensorFlow, and MXNet, on high-performance GPUs.
NVIDIA speech and translation AI offers pretrained, production-quality models in the NVIDIA NGC™ catalog, trained on several public and proprietary datasets for hundreds of thousands of hours on NVIDIA DGX™ systems.
Figure 1: Highly accurate multilingual pretrained models.
Figure 2: End-to-end NVIDIA NeMo workflow.
Many enterprises have to customize speech and translation AI models to achieve the desired multilingual accuracy for their specific conversational applications. However, customizing speech AI models from scratch usually requires large training datasets and AI expertise.
To speed up development and deeply customize speech models, you can use NVIDIA NeMo™ to build, customize, and deploy speech pipelines for automatic speech recognition (ASR) and text-to-speech (TTS), as well as natural language processing (NLP) pipelines. With NeMo, you can customize, extend, and compose existing prebuilt speech AI modules to create new models. Models optimized with NeMo can easily be exported and deployed with NVIDIA® Riva as a speech service, on premises or in the cloud.
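For example, a pretrained NeMo ASR model from NGC can be loaded and run in a few lines of Python. This is a minimal sketch; the model name and audio file path are illustrative placeholders, so substitute the checkpoint and data that fit your use case.

```python
# Sketch: transcribe audio with a pretrained NeMo ASR model
# (assumes: pip install "nemo_toolkit[asr]").
import nemo.collections.asr as nemo_asr

# Download a pretrained English ASR checkpoint from NGC.
# "stt_en_conformer_ctc_large" is an example model name; swap in the model you need.
asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name="stt_en_conformer_ctc_large")

# "sample.wav" is a placeholder path to a 16 kHz mono WAV file.
transcriptions = asr_model.transcribe(["sample.wav"])
print(transcriptions[0])
```

From here, the same model can be fine-tuned on domain-specific audio with NeMo and exported for serving with Riva.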
For speech AI skills, companies have traditionally had to choose between accuracy and real-time performance. Users can’t ask a question and then wait several seconds for a response, and companies don’t want their conversational AI applications to misinterpret speech or produce gibberish.
With NVIDIA Riva, companies can achieve world-class accuracy and run their speech and translation AI pipelines in real time, within a few milliseconds. Riva offers state-of-the-art (SOTA) pretrained models on NGC that can be fine-tuned with NVIDIA NeMo for world-class accuracy, along with skills optimized for real-time performance.
Figure 3: NVIDIA Riva speech AI skills capabilities.
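As an illustration of how an application might call a running Riva server, the sketch below uses the Riva Python client to send a short audio clip for offline recognition. The server address, audio path, and language code are assumptions; adapt them to your deployment.

```python
# Sketch: offline speech recognition against a running Riva server
# (assumes: pip install nvidia-riva-client).
import riva.client

# "localhost:50051" assumes a Riva server running locally; point this at your deployment.
auth = riva.client.Auth(uri="localhost:50051")
asr_service = riva.client.ASRService(auth)

config = riva.client.RecognitionConfig(
    language_code="en-US",  # use the language your Riva models were deployed with
    max_alternatives=1,
    enable_automatic_punctuation=True,
)

# "sample.wav" is a placeholder path to a mono 16 kHz WAV file.
with open("sample.wav", "rb") as audio_file:
    audio_bytes = audio_file.read()

response = asr_service.offline_recognize(audio_bytes, config)
for result in response.results:
    print(result.alternatives[0].transcript)
```

For latency-sensitive applications, the same client also exposes streaming recognition and speech synthesis services.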
Accelerate development time with packaged AI workflows, which include NVIDIA AI frameworks and pretrained models, as well as resources such as Helm charts, Jupyter Notebooks, and documentation to help you jump-start building AI solutions.
While large-scale deployments require a purchase of NVIDIA Riva, NVIDIA also offers a variety of containers, models, and customization tools free of charge.
Sign up to receive the latest speech AI news from NVIDIA.