Inferact is central to the AI agent ecosystem because agents are computationally expensive. Unlike a single chat interaction, an agentic loop might require dozens of LLM calls to plan, execute tools, and reflect on the results. For agents to be economically viable at scale, inference costs must drop sharply and throughput must rise. Inferact's commercialization of the vLLM inference engine provides the high-concurrency infrastructure needed to sustain hundreds of parallel agent loops without exhausting GPU memory.
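To see where that cost comes from, here is a minimal sketch of such a loop in Python, assuming a vLLM server exposing an OpenAI-compatible endpoint on localhost:8000; the model name, the `run_tool` helper, and the `FINAL:` stop convention are illustrative stand-ins for a real framework's tool protocol:

```python
# Toy plan/act/reflect loop against an OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def run_tool(request: str) -> str:
    """Hypothetical tool executor; a real agent dispatches on the request."""
    return f"observation for {request!r}"

messages = [
    {"role": "system", "content": "Work step by step. Reply with a tool "
     "request to act, or start your reply with FINAL: when finished."},
    {"role": "user", "content": "Summarize today's open issues."},
]
for _ in range(12):  # every iteration is another full LLM call
    reply = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",  # whatever vLLM is serving
        messages=messages,
    ).choices[0].message.content
    messages.append({"role": "assistant", "content": reply})
    if reply.strip().startswith("FINAL:"):  # the agent decided it is done
        break
    # Reflect: feed the tool's observation back for the next planning step.
    messages.append({"role": "user", "content": run_tool(reply)})
```

A single user request can thus fan out into a dozen or more inference calls, which is why serving throughput, not training cost, becomes the binding constraint.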
Furthermore, vLLM's OpenAI-compatible API makes it a drop-in replacement for the hosted endpoints used by popular agent frameworks like LangChain, CrewAI, and AutoGPT. By letting developers run models locally or on private clouds with performance comparable to top-tier proprietary providers, Inferact enables a level of agent privacy and cost control that is otherwise difficult to achieve.
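As an example of the drop-in swap, the sketch below points LangChain's standard OpenAI chat model at a local vLLM server; the model name and port are assumptions, and the only real change from a hosted deployment is the `base_url`:

```python
# Retargeting existing LangChain code at a local vLLM server, assuming
# `vllm serve <model>` is running on its default port (8000).
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="meta-llama/Llama-3.1-8B-Instruct",  # whatever model vLLM is serving
    base_url="http://localhost:8000/v1",       # the only change from OpenAI
    api_key="EMPTY",                           # vLLM ignores the key by default
)
print(llm.invoke("In one sentence, what does PagedAttention do?").content)
```

The same `base_url` override works with the plain `openai` client, which is why existing agent codebases can typically migrate with a configuration change rather than a rewrite.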
Inferact is the commercial entity behind vLLM, an open-source inference engine that has become the standard for high-throughput language model serving. Founded by the creators of the vLLM project, Inferact recently emerged from stealth with a $150 million seed round co-led by Andreessen Horowitz and Lightspeed Venture Partners. The funding, which valued the young company at $800 million, reflects a broader market rotation. As the initial craze for training foundation models matures, the industry's economic center of gravity is moving toward inference—the act of running these models in production environments.
At the heart of the company's technical advantage is PagedAttention. The technique, developed by researchers at UC Berkeley's Sky Computing Lab, solves a fundamental inefficiency in LLM serving: fragmentation of Key-Value (KV) cache memory. In traditional setups, memory for these caches is pre-allocated in large, contiguous blocks, much of which goes unused. PagedAttention borrows concepts from operating-system virtual memory, partitioning the cache into small, fixed-size pages that can live anywhere in GPU memory. This lets the system use almost all available GPU memory, yielding throughput that is often several times higher than that of standard implementations.
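To make the bookkeeping concrete, here is a toy sketch of the idea in Python; the block size, pool size, and helper names are illustrative, and vLLM's real block manager additionally handles scheduling, prefix sharing, and eviction:

```python
# Toy PagedAttention-style paging: a per-sequence block table maps logical
# token positions to physical KV-cache blocks, so free memory never has to
# be contiguous and waste is bounded by less than one block per sequence.
BLOCK_SIZE = 16                             # tokens per KV block (assumed)
free_blocks = list(range(1024))             # physical block pool (assumed size)
block_tables: dict[int, list[int]] = {}     # seq_id -> physical block ids
seq_lens: dict[int, int] = {}               # seq_id -> tokens stored so far

def append_token(seq_id: int) -> None:
    """Grow a sequence by one token, grabbing a new physical block only
    when the current one is full."""
    seq_lens[seq_id] = seq_lens.get(seq_id, 0) + 1
    table = block_tables.setdefault(seq_id, [])
    if seq_lens[seq_id] > len(table) * BLOCK_SIZE:
        table.append(free_blocks.pop())

def lookup(seq_id: int, pos: int) -> tuple[int, int]:
    """Map a logical token position to (physical block, in-block offset)."""
    return block_tables[seq_id][pos // BLOCK_SIZE], pos % BLOCK_SIZE

for _ in range(40):         # generate 40 tokens for sequence 0
    append_token(0)
print(block_tables[0])      # three blocks cover 40 tokens (16 + 16 + 8)
print(lookup(0, 39))        # -> (third block's physical id, offset 7)
```

Because blocks are allocated on demand and returned to the pool when a sequence finishes, many sequences of unpredictable lengths can share one GPU without the over-provisioning that contiguous allocation forces.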
Inferact operates in a crowded field, yet it benefits from the massive adoption of the vLLM project. Its primary competitors include NVIDIA’s TensorRT-LLM, which is highly optimized but proprietary to NVIDIA hardware, and SGLang, another academic-born project that focuses on structured generation. Hugging Face’s Text Generation Inference (TGI) is the other major incumbent, though vLLM is generally regarded as having superior throughput for multi-tenant or high-concurrency workloads.
Inferact's business model follows a well-worn path in the infrastructure world: maintain a dominant open-source project while offering a managed, high-performance commercial version for enterprises that cannot afford the engineering overhead of self-hosting. This strategy mirrors companies like Databricks or Confluent. For Inferact, the value proposition is simple: reduce the number of GPUs required to serve a given number of users, thereby lowering the single largest line item in an AI company's budget.
The company's roots in the UC Berkeley ecosystem are critical to its identity. The founders emerged from the same environment that produced Apache Spark and Ray, and they share its philosophy of building the "operating system" layer for new computing paradigms. Based in the Bay Area, Inferact is led by the original architects of the PagedAttention paper. Their deep integration with the research community keeps the software at the cutting edge of hardware optimization, supporting everything from NVIDIA H100s to emerging silicon from AMD and specialized AI accelerators. As organizations move beyond simple API wrappers and begin deploying their own fine-tuned models on private infrastructure, Inferact is positioned to be the default runtime for the next generation of AI applications.