Announcing Rootly AI Labs: Accelerating Reliability Engineering Through Community-Driven Innovation

Reliability engineering is evolving quickly—and AI is the catalyst. That’s why we’re excited to unveil Rootly AI Labs, a community-focused program dedicated to reshaping reliability through open collaboration, innovative prototypes, and cutting-edge research.

Written by

Sylvain Kalache

Launch Day at GitHub HQ

We kicked things off with an exclusive event at GitHub’s San Francisco headquarters. Guests explored interactive demos and heard insights from leaders at Google, Anthropic, Andreessen Horowitz, Browserbase, Sentry, GitHub, Postman, and Rootly.

A key theme was recent progress in Model Context Protocol (MCP) servers and Agent-to-Agent (A2A) protocols, technologies that help developers and SRE teams boost productivity and chase the elusive “six nines” of uptime (99.9999% availability, or roughly 32 seconds of downtime per year).

Expert panel (moderated by Sylvain Kalache, Head of Rootly AI Labs)

Yoko Li – Partner, AI & Infrastructure, Andreessen Horowitz
Miku Jha – Director, AI/ML & Generative AI, Google Cloud
Pete Koomen – General Partner, Y Combinator

What is Rootly AI Labs?

Rootly AI Labs is a collaborative hub where reliability engineers and researchers build AI-based open-source tools and prototypes and publish research papers. Everything the lab produces is freely accessible, with the goal of rapidly improving operational excellence across the industry.

Recent Projects Include:

IncidentDiagram (GitHub): An LLM-powered CLI tool that visually maps incident retrospectives and related codebases, helping SRE teams quickly understand and communicate issues.
EventOrOutage (GitHub): Uses an LLM to distinguish genuine outages from external events (such as holidays or sports events) that affect traffic. Companies like LinkedIn and Netflix have experienced similar scenarios.
Model Benchmarks: Comprehensive assessments measuring AI model performance on reliability-centric tasks such as error triage and bug fixing. Key findings reveal the strengths and limitations of newer models, including DeepSeek’s performance on error logs and the weaknesses of Meta's Llama 4 on coding tasks.
Rootly MCP Server (GitHub): Brings incident management directly into LLM-powered IDEs and assistants such as Cursor, Windsurf, and Claude, streamlining the response workflow.
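
To make the EventOrOutage idea concrete, here is a minimal sketch of the underlying question: is a traffic drop explained by something happening in the outside world? This is not the project's code. The real tool uses an LLM, while this sketch swaps in a fixed calendar lookup so it runs without any external service. Every name here (KnownEvent, KNOWN_EVENTS, classify_traffic_drop, the 20% threshold) is hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class KnownEvent:
    """A real-world event that can plausibly explain a traffic dip."""
    name: str
    day: date


# Hypothetical calendar standing in for the LLM's world knowledge.
KNOWN_EVENTS = [
    KnownEvent("New Year's Day", date(2025, 1, 1)),
    KnownEvent("Super Bowl LIX", date(2025, 2, 9)),
]


def classify_traffic_drop(day: date, drop_pct: float,
                          threshold_pct: float = 20.0) -> str:
    """Label a traffic drop as normal, event-driven, or a possible outage."""
    # Small dips are treated as ordinary noise (threshold is an assumption).
    if drop_pct < threshold_pct:
        return "normal"
    # A large dip on the day of a known event is probably not an outage.
    for event in KNOWN_EVENTS:
        if event.day == day:
            return f"external_event: {event.name}"
    # A large dip with no matching event deserves a closer look.
    return "possible_outage"


if __name__ == "__main__":
    print(classify_traffic_drop(date(2025, 2, 9), 35.0))
    print(classify_traffic_drop(date(2025, 3, 4), 35.0))
```

A fixed calendar only covers events someone thought to list in advance, which is presumably the gap an LLM is meant to fill.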

All projects are open-source (Apache 2.0) and free to use.

Join the Community

Backed by Anthropic and supported by top engineers from LinkedIn, Venmo, and Twilio, and from esteemed research universities such as Carnegie Mellon, Georgia Tech, and McGill, Rootly AI Labs is looking for new contributors and partners.

Visit our GitHub page to explore every project, watch creator interviews, and submit pull requests. Your feedback and ideas are always welcome.

Together, we’re redefining what’s possible in reliability engineering. Join us and be part of the future.