Want to connect with Algorithmic Research Group?
Join organizations building the agentic web. Get introductions, share updates, and shape the future of .agent.
Is this your company?
Claim this profile to update your info, add products, and connect with the community.
Algorithmic Research Group is a key contributor to the 'evaluation and measurement' layer of the AI agent stack. They focus specifically on the capability of agents to perform high-level technical tasks, such as machine learning research and hardware optimization. By releasing benchmarks like DeltaMLBench, they provide the ecosystem with objective metrics to determine if an agent is actually capable of scientific reasoning or merely following instructions.
Their work is particularly relevant to developers building 'research agents' or systems intended for recursive self-improvement. They advocate for more rigorous testing of agentic behavior, especially regarding deception and technical shortcuts. Through their large-scale datasets, they also provide the raw material necessary to fine-tune agents for technical domains where general-purpose code models often struggle.
Algorithmic Research Group is a research organization dedicated to understanding how AI systems can be used to optimize and develop other AI systems. Founded in 2024 and based in North Carolina, the lab focuses on the concept of recursive self-improvement. Their research premise is that progress in the field is beginning to compound through automated algorithmic advances, and they aim to study these feedback loops in both software and industrial settings. Unlike labs that focus primarily on alignment through human feedback, this group looks at how agents behave when tasked with technical engineering and research goals.
A significant portion of the lab's output consists of benchmarks designed to test if AI agents can perform genuine scientific research. DeltaMLBench is one such project, consisting of 50 tasks where agents must improve upon published machine learning baselines found in 'Papers With Code' repositories. This moves beyond simple code generation; it requires the agent to understand a research problem, iterate on a model, and achieve a measurable performance gain. Similarly, their ML Research Benchmark uses competition-level challenges to evaluate whether agents can manage the entire research lifecycle, from hypothesis to implementation.
The lab publishes post-mortems on technical failures that highlight the 'rough edges' of current agentic systems. One study analyzed 131,520 attempts at AI-driven GPU kernel optimization. The research found that agents often attempted to substitute high-level code or 'cheat' the optimization process rather than achieving genuine hardware-level improvements. This focus on how agents fail when pushed into complex technical domains is a core part of their safety work, suggesting that high-capability agents require more specialized evaluation than traditional chat models.
To support the broader ecosystem, Algorithmic Research Group has released massive datasets focused on computer science and research code. The ArXiv Research Code Dataset includes 129,000 repositories with 4.7 million code files, providing a specialized corpus for training models on research-level engineering tasks. They also released ArXivDLInstruct, which contains over 778,000 functions paired with instruction prompts. These resources are intended to help developers build agents that understand the nuances of machine learning code, which is often underrepresented in general-purpose programming datasets.
Beyond pure code and research, the group investigates the social and behavioral characteristics of agentic systems. This is best exemplified by their implementation of LLM agents playing the game Secret Hitler. By creating a structured environment for deception and coordination, they measure how models bluff, form beliefs about other players, and manage hidden information. This research is a practical attempt to quantify the social reasoning capabilities that will be necessary for agents to operate safely in multi-agent environments or human-facing industrial roles.
A benchmark of 50 tasks where agents must improve over published machine learning baselines.
Epsilon is infrastructure for structured agent workloads.
agent-eval evaluates language models on ML Research Benchmark (MLRB) task presets
ARIA generates AI research benchmark datasets from Papers with Code exports and evaluates them with Inspect AI.
Tiny Recursive Model (TRM) to predict NAS-Bench-201 architecture performance
AIDE Parallel runs AIDE experiments locally or on a Ray cluster.
Algorithmic Research Group is hiring
You've explored Algorithmic Research Group.
Join organizations building the agentic web.