FlashInfer: Kernel Library for LLM Serving
A GPU cluster manager that configures and orchestrates inference engines like vLLM and SGLang for high-performance AI model deployment.
A distributed LLM inference program based on llama.cpp that lets multiple computers on a local network cooperate on large-language-model inference, with a cross-platform desktop UI built with Electron.
Code for paper "JMDC: A Joint Model and Data Compression System for Deep Neural Networks Collaborative Computing in Edge-Cloud Networks"
Analyze and generate unstructured data using LLMs, from quick experiments to billion token jobs.
Mixed-vendor GPU inference cluster manager with speculative decoding
Source code of the paper "Private Collaborative Edge Inference via Over-the-Air Computation".
Accelerate reproducible inference experiments for large language models with LLM-D! This lab automates the setup of a complete evaluation environment on OpenShift/OKD: GPU worker pools, core operators, observability, traffic control, and ready-to-run example workloads.
Super Ollama Load Balancer - Performance-aware routing for distributed Ollama deployments with Ray, Dask, and adaptive metrics
The Internet is the computer. Distributed LLM inference across browsers via WebGPU.
Web UI for orchestrating distributed llama.cpp RPC GPU clusters with auto node discovery, telemetry, and one-click deployment.
Official implementation of the ACM MM paper "Identity-Aware Attribute Recognition via Real-Time Distributed Inference in Mobile Edge Clouds". A distributed inference model for pedestrian attribute recognition with re-ID in an MEC-enabled camera monitoring system, jointly training pedestrian attribute recognition and re-ID.
Turn any Kubernetes cluster into a private LLM endpoint. One Helm command deploys distributed inference across commodity hardware: Raspberry Pis, old servers, mixed architectures. OpenAI-compatible API powered by llama.cpp RPC.
Tirami — distributed LLM inference where compute is currency. 1 TRM = 10^9 FLOPs. 21B supply cap, yield halving, staking, collusion resistance. 100% Rust, OpenAI-compatible, no token, no ICO. "tira mi su" = pull me up.
Referral service for your LLM
Encrypted Decentralized Inference and Learning (E.D.I.L.)
Mycellm iOS / iPadOS app for iPhone and iPad
A comprehensive framework for multi-node, multi-GPU scalable LLM inference on HPC systems using vLLM and Ollama. Includes distributed deployment templates, benchmarking workflows, and chatbot/RAG pipelines for high-throughput, production-grade AI services
Distributed LLM inference across multiple machines. A central server routes OpenAI-compatible requests to llama.cpp client nodes, with automatic model distribution and mutual TLS security.
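Several of the projects above expose an OpenAI-compatible API, so a client talks to the whole cluster the same way it would talk to the OpenAI service: POST a JSON chat-completion payload to `/v1/chat/completions`. A minimal sketch using only the standard library — the base URL and model name here are placeholders, not values from any specific project above:

```python
import json
from urllib.request import Request, urlopen

# Hypothetical gateway address -- substitute your own deployment's endpoint.
BASE_URL = "http://localhost:8080/v1/chat/completions"

def build_chat_request(model: str, prompt: str) -> Request:
    """Build a standard OpenAI-compatible chat-completion request."""
    payload = {
        "model": model,  # model name as registered on the serving cluster
        "messages": [{"role": "user", "content": prompt}],
    }
    return Request(
        BASE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request("llama-3-8b", "Hello from a distributed cluster!")
# urlopen(req) would dispatch this to whichever node the router selects;
# the response body follows the OpenAI chat-completion schema.
```

Because the wire format is the same across these servers, the only deployment-specific pieces are the endpoint URL, the model name, and (where mutual TLS is used, as in the entry above) the client certificate configuration.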