NVIDIA TensorRT

NVIDIA® TensorRT™ is an ecosystem of APIs for high-performance deep learning inference. TensorRT includes an inference runtime and model optimizations that deliver low latency and high throughput for production applications. The TensorRT ecosystem includes TensorRT, TensorRT-LLM, TensorRT Model Optimizer, and TensorRT Cloud.


NVIDIA TensorRT Benefits

Speed Up Inference by 36X

NVIDIA TensorRT-based applications perform up to 36X faster than CPU-only platforms during inference. TensorRT optimizes neural network models trained on all major frameworks, calibrates them for lower precision with high accuracy, and deploys them to hyperscale data centers, workstations, laptops, and edge devices.

Optimize Inference Performance

TensorRT, built on the CUDA® parallel programming model, optimizes inference using techniques such as quantization, layer and tensor fusion, and kernel tuning on all types of NVIDIA GPUs, from edge devices to PCs to data centers.
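
As an illustration of what layer fusion means, the sketch below folds a batch-normalization step into the linear layer that precedes it, so a single layer produces the same output as the original pair. This is a minimal pure-Python sketch of the general idea, not TensorRT's implementation; the function names (`linear`, `batch_norm`, `fuse_linear_bn`) and all numeric values are hypothetical.

```python
import math

def linear(x, weights, bias):
    """Compute y = W x + b for one input vector."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def batch_norm(y, gamma, beta, mean, var, eps=1e-5):
    """Apply inference-time batch normalization per output channel."""
    return [g * (v - m) / math.sqrt(s + eps) + bt
            for v, g, bt, m, s in zip(y, gamma, beta, mean, var)]

def fuse_linear_bn(weights, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold batch-norm parameters into the linear layer's weights and bias."""
    fused_w, fused_b = [], []
    for row, b, g, bt, m, s in zip(weights, bias, gamma, beta, mean, var):
        k = g / math.sqrt(s + eps)          # per-channel rescaling factor
        fused_w.append([k * w for w in row])
        fused_b.append(k * (b - m) + bt)
    return fused_w, fused_b

# Hypothetical example values (2 output channels, 3 inputs).
x = [1.0, -2.0, 0.5]
W = [[0.2, -0.1, 0.4], [0.7, 0.3, -0.5]]
b = [0.1, -0.2]
gamma, beta = [1.5, 0.8], [0.05, -0.1]
mean, var = [0.3, -0.4], [0.9, 1.2]

unfused = batch_norm(linear(x, W, b), gamma, beta, mean, var)
fused = linear(x, *fuse_linear_bn(W, b, gamma, beta, mean, var))
print(unfused, fused)
```

Both paths give the same result up to floating-point rounding, but the fused version runs one layer instead of two, which is the kind of saving fusion aims for.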

Accelerate Every Workload

TensorRT provides post-training quantization and quantization-aware training techniques for optimizing models to FP8, INT8, and INT4 precision for deep learning inference. Reduced-precision inference significantly reduces latency, which is required for many real-time services as well as autonomous and embedded applications.
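
To make the idea of post-training quantization concrete, here is a minimal pure-Python sketch of symmetric INT8 quantization with max-value calibration. It assumes a single per-tensor scale and is only an illustration of the concept; it does not use or represent the TensorRT API, and the helper names (`calibrate_scale`, `quantize`, `dequantize`) and weight values are hypothetical.

```python
def calibrate_scale(values, num_bits=8):
    """Pick a scale so the largest magnitude maps to the largest integer."""
    qmax = 2 ** (num_bits - 1) - 1          # 127 for INT8
    max_abs = max(abs(v) for v in values)
    return max_abs / qmax if max_abs > 0 else 1.0

def quantize(values, scale, num_bits=8):
    """Round each value to the nearest integer step and clamp to the range."""
    qmax = 2 ** (num_bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]

def dequantize(quantized, scale):
    """Map integers back to approximate floating-point values."""
    return [q * scale for q in quantized]

# Hypothetical weights standing in for one tensor of a trained model.
weights = [0.42, -1.0, 0.13, 0.0, 0.77]
scale = calibrate_scale(weights)
q = quantize(weights, scale)
restored = dequantize(q, scale)

for w, r in zip(weights, restored):
    print(f"{w:+.4f} -> {r:+.4f} (error {abs(w - r):.5f})")
```

With this scheme each restored value differs from the original by at most half a quantization step, which is why a well-chosen calibration range matters for accuracy.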

Deploy, Run, and Scale With Triton

TensorRT-optimized models can be deployed, run, and scaled with NVIDIA Triton™ inference-serving software, which includes TensorRT as a backend. The advantages of using Triton include high throughput with dynamic batching, concurrent model execution, model ensembling, and streaming audio and video inputs.
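
The following is a simplified model of dynamic batching, included only to show why it raises throughput: requests that arrive close together are grouped and run as one batch. It is a sketch under stated assumptions and not Triton's actual scheduler; the function `form_batches` and its parameters (`max_batch_size`, `max_queue_delay_us`) are hypothetical names chosen for this example.

```python
def form_batches(arrival_times_us, max_batch_size, max_queue_delay_us):
    """Group request arrival times (in microseconds) into batches.

    A batch is closed when it is full, or when the next request arrives
    more than max_queue_delay_us after the first request in the batch.
    """
    batches = []
    current = []
    for t in sorted(arrival_times_us):
        batch_full = len(current) >= max_batch_size
        waited_too_long = bool(current) and t - current[0] > max_queue_delay_us
        if batch_full or waited_too_long:
            batches.append(current)
            current = []
        current.append(t)
    if current:
        batches.append(current)
    return batches

# Hypothetical arrival pattern: two bursts of requests.
arrivals = [0, 100, 200, 1000, 1050, 1100, 1150]
batches = form_batches(arrivals, max_batch_size=3, max_queue_delay_us=500)
print(batches)
```

Here seven requests are served with three model executions instead of seven, at the cost of a bounded wait for the earliest request in each batch.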


Explore the Features and Tools of NVIDIA TensorRT

Large Language Model Inference

NVIDIA TensorRT-LLM is an open-source library that accelerates and optimizes inference performance of recent large language models (LLMs) on the NVIDIA AI platform. With its simplified Python API, developers can experiment with new LLMs, achieving high performance and quick customization.

Developers accelerate LLM performance on NVIDIA GPUs in the data center or on workstation GPUs—including NVIDIA RTX™ systems on native Windows—with the same seamless workflow.

Optimized Inference Engines

NVIDIA TensorRT Cloud is a developer service for compiling and creating optimized inference engines for ONNX models. Developers can bring their own model and choose the target RTX GPU. TensorRT Cloud then builds the optimized inference engine, which can be downloaded and integrated into an application. TensorRT Cloud also provides prebuilt, optimized engines for popular LLMs on RTX GPUs.

TensorRT Cloud is available in early access on NVIDIA GeForce RTX™ GPUs to select partners. Apply to be notified when it's publicly available.

Optimize Neural Networks

NVIDIA TensorRT Model Optimizer is a unified library of state-of-the-art model optimization techniques, including quantization, sparsity, and distillation. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM and TensorRT to efficiently optimize inference on NVIDIA GPUs.

Major Framework Integrations

TensorRT integrates directly into PyTorch, Hugging Face, and TensorFlow to achieve 6X faster inference with a single line of code. TensorRT provides an ONNX parser to import ONNX models from popular frameworks into TensorRT. MATLAB is integrated with TensorRT through GPU Coder to automatically generate high-performance inference engines for NVIDIA Jetson™, NVIDIA DRIVE®, and data center platforms.


World-Leading Inference Performance

TensorRT was behind NVIDIA’s wins across all performance tests in the industry-standard MLPerf Inference benchmark. TensorRT-LLM accelerates the latest large language models for generative AI, delivering up to 8X more performance, 5.3X better TCO, and nearly 6X lower energy consumption.

8X Increase in GPT-J 6B Inference Performance

TensorRT-LLM on H100 delivers an 8X increase in GPT-J 6B inference performance.

4X Higher Llama 2 Inference Performance

TensorRT-LLM on H100 delivers 4X higher Llama 2 inference performance.

Total Cost of Ownership

TensorRT-LLM lowers total cost of ownership for GPT-J 6B and Llama 2 70B (lower is better).

Energy Use

TensorRT-LLM lowers energy use for GPT-J 6B and Llama 2 70B (lower is better).

Accelerate Every Inference Platform

TensorRT can optimize AI deep learning models for applications across the edge, laptops and desktops, and data centers. It powers key NVIDIA solutions, such as NVIDIA TAO, NVIDIA DRIVE, NVIDIA Clara™, and NVIDIA JetPack™.

TensorRT is also integrated with application-specific SDKs, such as NVIDIA NIM, NVIDIA DeepStream, NVIDIA Riva, NVIDIA Merlin™, NVIDIA Maxine™, NVIDIA Morpheus, and NVIDIA Broadcast Engine. TensorRT provides developers a unified path to deploy intelligent video analytics, speech AI, recommender systems, video conferencing, AI-based cybersecurity, and streaming apps in production.

From creator apps to games and productivity tools, TensorRT is embraced by millions of NVIDIA RTX, GeForce®, and Quadro® GPU users. Whether integrated directly or via ONNX Runtime, TensorRT-optimized engines are weightless and compressed, letting developers incorporate AI-rich features without bloating app sizes.

Read Success Stories

Amazon

Discover how Amazon improved customer satisfaction by accelerating its inference by 5X.

American Express

American Express improves fraud detection by analyzing tens of millions of daily transactions 50X faster. Find out how.

Zoox

Explore how Zoox, a robotaxi startup, accelerated their perception stack by 19X using TensorRT for real-time inference on autonomous vehicles.


Widely Adopted Across Industries

NVIDIA TensorRT is widely adopted by top companies across industries

TensorRT Resources

Read the Introductory TensorRT Blog

Learn how to apply TensorRT optimizations and deploy a PyTorch model to GPUs.

Watch On-Demand TensorRT Sessions From GTC

Learn more about TensorRT and its features from a curated list of webinars at GTC.

Get the Introductory Developer Guide

See how to get started with TensorRT in this step-by-step developer and API reference guide.

Use the right inference tools to develop AI for any application on any platform.
