Inferact is central to the AI agent ecosystem because agents are computationally expensive. Unlike a single chat interaction, an agentic loop might require dozens of LLM calls to plan, execute tools, and reflect on the results. For agents to be economically viable at scale, inference costs must drop sharply and throughput must rise. Inferact's commercialization of the vLLM inference engine provides the high-concurrency infrastructure needed to sustain hundreds of parallel agent loops without exhausting GPU memory.
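To see where that cost comes from, here is a minimal sketch of such a loop in Python, assuming a vLLM server exposing an OpenAI-compatible endpoint on localhost:8000; the model name, the `run_tool` helper, and the `FINAL:` stop convention are illustrative stand-ins for a real framework's tool protocol:

```python
# Toy plan/act/reflect loop against an OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def run_tool(request: str) -> str:
    """Hypothetical tool executor; a real agent dispatches on the request."""
    return f"observation for {request!r}"

messages = [
    {"role": "system", "content": "Work step by step. Reply with a tool "
     "request to act, or start your reply with FINAL: when finished."},
    {"role": "user", "content": "Summarize today's open issues."},
]
for _ in range(12):  # every iteration is another full LLM call
    reply = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",  # whatever vLLM is serving
        messages=messages,
    ).choices[0].message.content
    messages.append({"role": "assistant", "content": reply})
    if reply.strip().startswith("FINAL:"):  # the agent decided it is done
        break
    # Reflect: feed the tool's observation back for the next planning step.
    messages.append({"role": "user", "content": run_tool(reply)})
```

A single user request can thus fan out into a dozen or more inference calls, which is why serving throughput, not training cost, becomes the binding constraint.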
Furthermore, vLLM's OpenAI-compatible API makes it a drop-in replacement for the hosted endpoints used by popular agent frameworks like LangChain, CrewAI, and AutoGPT. By letting developers run models locally or on private clouds with performance comparable to top-tier proprietary providers, Inferact enables a level of agent privacy and cost control that is otherwise difficult to achieve.
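As an example of the drop-in swap, the sketch below points LangChain's standard OpenAI chat model at a local vLLM server; the model name and port are assumptions, and the only real change from a hosted deployment is the `base_url`:

```python
# Retargeting existing LangChain code at a local vLLM server, assuming
# `vllm serve <model>` is running on its default port (8000).
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="meta-llama/Llama-3.1-8B-Instruct",  # whatever model vLLM is serving
    base_url="http://localhost:8000/v1",       # the only change from OpenAI
    api_key="EMPTY",                           # vLLM ignores the key by default
)
print(llm.invoke("In one sentence, what does PagedAttention do?").content)
```

The same `base_url` override works with the plain `openai` client, which is why existing agent codebases can typically migrate with a configuration change rather than a rewrite.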
Inferact is the commercial entity behind vLLM, an open-source inference engine that has become the standard for high-throughput language model serving. Founded by the creators of the vLLM project, Inferact recently emerged from stealth with a $150 million seed round co-led by Andreessen Horowitz and Lightspeed Venture Partners. The funding, which valued the young company at $800 million, reflects a broader market rotation. As the initial craze for training foundation models matures, the industry's economic center of gravity is moving toward inference—the act of running these models in production environments.
At the heart of the company's technical advantage is PagedAttention. The technique, developed by researchers at UC Berkeley's Sky Computing Lab, solves a fundamental inefficiency in LLM serving: fragmentation of Key-Value (KV) cache memory. In traditional setups, memory for these caches is pre-allocated in large, contiguous blocks, much of which goes unused. PagedAttention borrows concepts from operating-system virtual memory, partitioning the cache into small, fixed-size pages that can live anywhere in GPU memory. This lets the system use almost all available GPU memory, yielding throughput that is often several times higher than that of standard implementations.
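To make the bookkeeping concrete, here is a toy sketch of the idea in Python; the block size, pool size, and helper names are illustrative, and vLLM's real block manager additionally handles scheduling, prefix sharing, and eviction:

```python
# Toy PagedAttention-style paging: a per-sequence block table maps logical
# token positions to physical KV-cache blocks, so free memory never has to
# be contiguous and waste is bounded by less than one block per sequence.
BLOCK_SIZE = 16                             # tokens per KV block (assumed)
free_blocks = list(range(1024))             # physical block pool (assumed size)
block_tables: dict[int, list[int]] = {}     # seq_id -> physical block ids
seq_lens: dict[int, int] = {}               # seq_id -> tokens stored so far

def append_token(seq_id: int) -> None:
    """Grow a sequence by one token, grabbing a new physical block only
    when the current one is full."""
    seq_lens[seq_id] = seq_lens.get(seq_id, 0) + 1
    table = block_tables.setdefault(seq_id, [])
    if seq_lens[seq_id] > len(table) * BLOCK_SIZE:
        table.append(free_blocks.pop())

def lookup(seq_id: int, pos: int) -> tuple[int, int]:
    """Map a logical token position to (physical block, in-block offset)."""
    return block_tables[seq_id][pos // BLOCK_SIZE], pos % BLOCK_SIZE

for _ in range(40):         # generate 40 tokens for sequence 0
    append_token(0)
print(block_tables[0])      # three blocks cover 40 tokens (16 + 16 + 8)
print(lookup(0, 39))        # -> (third block's physical id, offset 7)
```

Because blocks are allocated on demand and returned to the pool when a sequence finishes, many sequences of unpredictable lengths can share one GPU without the over-provisioning that contiguous allocation forces.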
Inferact operates in a crowded field, yet it benefits from the massive adoption of the vLLM project. Its primary competitors include NVIDIA’s TensorRT-LLM, which is highly optimized but proprietary to NVIDIA hardware, and SGLang, another academic-born project that focuses on structured generation. Hugging Face’s Text Generation Inference (TGI) is the other major incumbent, though vLLM is generally regarded as having superior throughput for multi-tenant or high-concurrency workloads.
Inferact's business model follows a well-worn path in the infrastructure world: maintain a dominant open-source project while offering a managed, high-performance commercial version for enterprises that cannot afford the engineering overhead of self-hosting. This strategy mirrors companies like Databricks or Confluent. For Inferact, the value proposition is simple: reduce the number of GPUs required to serve a given number of users, thereby lowering the single largest line item in an AI company's budget.
The company's roots in the UC Berkeley ecosystem are critical to its identity. The founders emerged from the same environment that produced Apache Spark and Ray, and they share its philosophy of building the "operating system" layer for new computing paradigms. Based in the Bay Area, Inferact is led by the original architects of the PagedAttention paper. Their deep integration with the research community keeps the software at the cutting edge of hardware optimization, supporting everything from NVIDIA H100s to emerging silicon from AMD and specialized AI accelerators. As organizations move beyond simple API wrappers and begin deploying their own fine-tuned models on private infrastructure, Inferact is positioned to be the default runtime for the next generation of AI applications.