📖A curated list of Awesome LLM/VLM Inference Papers with codes: WINT8/4, Flash-Attention, Paged-Attention, Parallelism, etc. 🎉🎉
A nearly-live implementation of OpenAI's Whisper.
An optimized speech-to-text pipeline for the Whisper model, supporting multiple inference engines.
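Both of these Whisper projects expose transcription from Python. As a point of reference, here is a minimal sketch using the upstream openai-whisper package (not these repos' own APIs); the model size and audio path are placeholders:

```python
import whisper

# Load a Whisper checkpoint; "base" is a placeholder,
# larger checkpoints trade speed for accuracy.
model = whisper.load_model("base")

# Transcribe a local audio file; the path is a placeholder.
result = model.transcribe("audio.mp3")
print(result["text"])
```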
🏋️ A unified multi-backend utility for benchmarking Transformers, Timm, PEFT, Diffusers, and Sentence-Transformers, with full support for Optimum's hardware optimizations & quantization schemes.
🔥🔥🔥 A collection of some awesome public CUDA, cuBLAS, cuDNN, CUTLASS, TensorRT, TensorRT-LLM, Triton, TVM, MLIR and High Performance Computing (HPC) projects.
OpenAI-compatible API for the TensorRT-LLM Triton backend
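Because the server speaks the OpenAI API, the stock openai Python client can talk to it by overriding the base URL. A minimal sketch, assuming the server listens on localhost:8000 and the deployed model is named "ensemble" (both placeholders):

```python
from openai import OpenAI

# Point the standard OpenAI client at the local TensorRT-LLM server.
# Base URL, API key, and model name are placeholder assumptions.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="ensemble",
    messages=[{"role": "user", "content": "Summarize paged attention in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```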
Deep Learning Deployment Framework: supports tf/torch/trt/trtllm/vllm and other NN frameworks, with dynamic batching and streaming modes. It is dual-language compatible with Python and C++, offering scalability, extensibility, and high performance, and helps users quickly deploy models and serve them through HTTP/RPC interfaces.
High-performance OpenAI LLM service: a pure C++ OpenAI-compatible LLM service built with GRPS + TensorRT-LLM + Tokenizers.cpp, supporting chat and function calls, AI agents, distributed multi-GPU inference, multimodal capabilities, and a Gradio chat interface.
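Since the service follows the OpenAI chat API, function calling can be exercised with the standard tools parameter. A hedged sketch against an OpenAI-compatible endpoint; the URL, model name, and tool schema are illustrative assumptions, not this project's documented defaults:

```python
from openai import OpenAI

# Placeholder endpoint and model name; adjust to the deployed service.
client = OpenAI(base_url="http://localhost:9997/v1", api_key="none")

# A hypothetical tool definition in the standard OpenAI function-calling schema.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="qwen2-instruct",  # placeholder model name
    messages=[{"role": "user", "content": "What's the weather in Beijing?"}],
    tools=tools,
)
# If the model decides to call the tool, the call appears here.
print(response.choices[0].message.tool_calls)
```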
This repository contains AI Bootcamp material consisting of a workflow for LLMs.
Chat With RTX Python API
Cortex.Tensorrt-LLM is a C++ inference library that can be loaded by any server at runtime. It includes NVIDIA's TensorRT-LLM as a submodule for GPU-accelerated inference on NVIDIA GPUs.
TensorRT-LLM server with Structured Outputs (JSON) built with Rust
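For structured outputs, OpenAI-style servers commonly accept a response_format constraint on the completion request. A minimal sketch, assuming this server honors the json_schema variant of response_format; the endpoint, model name, and schema are placeholders:

```python
from openai import OpenAI

# Placeholder endpoint and model; adjust for the deployed server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

# A toy schema constraining the reply to a JSON object with two fields.
schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "year": {"type": "integer"},
    },
    "required": ["title", "year"],
}

response = client.chat.completions.create(
    model="llama-3-8b-instruct",  # placeholder
    messages=[{"role": "user", "content": "Name one LLM inference paper and its year."}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "paper", "schema": schema},
    },
)
# Parsable JSON conforming to the schema, if the server enforces it.
print(response.choices[0].message.content)
```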
Add-in for the new Outlook that adds new LLM features (composition, summarizing, Q&A). It uses a local LLM via NVIDIA TensorRT-LLM.
Accelerating large-model inference frameworks, making LLMs fly.
Getting started with TensorRT-LLM using BLOOM as a case study
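As a companion to such a walkthrough, recent TensorRT-LLM releases ship a high-level Python LLM API. A minimal sketch, assuming a recent tensorrt_llm version (exact parameter names have shifted across releases); the model path is a placeholder matching the tutorial's BLOOM case study:

```python
from tensorrt_llm import LLM, SamplingParams

# Build or load an engine from a Hugging Face model; placeholder checkpoint.
llm = LLM(model="bigscience/bloom-560m")

# Sampling knobs; field names assume a recent tensorrt_llm release.
sampling = SamplingParams(temperature=0.8, max_tokens=64)

outputs = llm.generate(["Hello, my name is"], sampling)
for out in outputs:
    # Each request output carries the generated continuation(s).
    print(out.outputs[0].text)
```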
Whisper in TensorRT-LLM
LLM tutorial materials, including but not limited to NVIDIA NeMo, TensorRT-LLM, Triton Inference Server, and NeMo Guardrails.