rtrvr.ai's Breakthrough Performance on the Halluminate Web Bench: Redefining AI Agent Capabilities
We're thrilled to announce that rtrvr.ai has achieved the #1 position on the Halluminate Web Bench, the industry's most comprehensive benchmark for AI web agents.
The Rise of AI Web Agents and the Need for Robust Benchmarks
AI web agents are transforming digital interaction by automating complex online tasks. However, the web's dynamic nature poses significant challenges for evaluating their performance reliably. This has created a critical demand for robust, standardized benchmarks like Halluminate's Web Bench, ensuring verifiable capabilities and accelerating trust in this emerging technology.
Understanding Halluminate's Web Bench
Halluminate's Web Bench offers a rigorous, comprehensive standard for evaluating AI browser agents by distinguishing between "READ" and "WRITE" tasks. Explore its methodology and current results at halluminate.ai/blog/benchmark.
📊 Key Performance Metrics
Overall Leadership
- 81.39% overall success rate - highest among all tested agents
- Surpasses OpenAI Operator + Human (76.5%) and Anthropic Sonnet 3.7 CUA (66.0%)
- First agent to break the 80% threshold
Task Performance Breakdown
Read Tasks: 88.24% success rate
- Best-in-class data extraction and information retrieval
- Beats the human-supervised benchmark (79.0%) by over 9 percentage points
- 7.6 percentage points higher than Anthropic CUA (80.6%)
Write Tasks: 65.63% success rate
- Leading performance in complex interactive tasks
- 41% higher in relative terms than the runner-up (Skyvern 2.0 at 46.6%)
- Approaching human-supervised benchmark (70.7%)
⚡ Speed Revolution
0.9 minutes average task completion, the fastest among all benchmarked agents:
| Agent | Avg Time | Speed vs rtrvr.ai |
|---|---|---|
| rtrvr.ai | 0.9 min | — |
| Browser Use Cloud | 6.35 min | 7x slower |
| OpenAI Operator | 10.1 min | 11x slower |
| Anthropic Sonnet 3.7 CUA | 11.81 min | 13x slower |
| Skyvern 2.0 | 12.49 min | 14x slower |
| Skyvern 2.0 on Browserbase | 20.84 min | 23x slower |
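The "Speed vs rtrvr.ai" column is simply each agent's average time divided by rtrvr.ai's 0.9-minute baseline, rounded to the nearest whole multiple. A quick sketch to reproduce it:

```python
# Reproduce the "Speed vs rtrvr.ai" column from the averages in the table:
# each agent's time divided by rtrvr.ai's 0.9-minute baseline, rounded.
avg_minutes = {
    "rtrvr.ai": 0.9,
    "Browser Use Cloud": 6.35,
    "OpenAI Operator": 10.1,
    "Anthropic Sonnet 3.7 CUA": 11.81,
    "Skyvern 2.0": 12.49,
    "Skyvern 2.0 on Browserbase": 20.84,
}

baseline = avg_minutes["rtrvr.ai"]
slowdown = {agent: round(t / baseline) for agent, t in avg_minutes.items()}
```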
💰 Cost Efficiency
- $0.12 average cost per task
- Total evaluation cost: ~$40 for 4,000 credits (323 tasks)
- Comparison: Halluminate reported testing costs of ~$3,000 per agent with human annotators
- 25x more cost-effective than cloud-based alternatives
- Powered by Gemini Flash for optimal price/performance
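The per-task figure follows directly from the totals above:

```python
# Back out the average cost per task from the reported totals.
total_cost_usd = 40   # approximate total evaluation cost
tasks = 323           # tasks in the benchmark run
cost_per_task = total_cost_usd / tasks  # consistent with the ~$0.12 figure
```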
🏆 Complete Leaderboard
Overall Performance
| Rank | Agent | Success Rate |
|---|---|---|
| 🥇 1 | rtrvr.ai | 81.39% |
| — | OpenAI Operator + Human | 76.5% |
| 2 | Anthropic Sonnet 3.7 CUA | 66.0% |
| 3 | Skyvern 2.0 | 64.4% |
| 4 | Skyvern 2.0 on Browserbase | 60.7% |
| 5 | OpenAI Operator | 59.8% |
| 6 | Browser Use Cloud | 43.9% |
| 7 | Convergence AI | 39.9% |
Read Tasks Performance
| Rank | Agent | Success Rate |
|---|---|---|
| 🥇 1 | rtrvr.ai | 88.24% |
| 2 | Anthropic Sonnet 3.7 CUA | 80.6% |
| 3 | Skyvern 2.0 on Browserbase | 75.6% |
| 4 | OpenAI Operator | 75.0% |
| 5 | Skyvern 2.0 | 74.2% |
| 6 | Browser Use Cloud | 63.2% |
| 7 | Convergence AI | 51.8% |
Reference: Operator with Human Supervisor achieves 79.0%
Write Tasks Performance
| Rank | Agent | Success Rate |
|---|---|---|
| 🥇 1 | rtrvr.ai | 65.63% |
| 2 | Skyvern 2.0 | 46.6% |
| 3 | Anthropic Sonnet 3.7 CUA | 39.4% |
| 4 | Skyvern 2.0 on Browserbase | 33.6% |
| 5 | OpenAI Operator | 32.3% |
| 6 | Convergence AI | 13.1% |
| 7 | Browser Use Cloud | 11.4% |
Reference: Operator with Human Supervisor achieves 70.7%
🔑 Why rtrvr.ai Dominates
Local-First Architecture
rtrvr.ai distinguishes itself through a fundamental architectural difference: its commitment to local operation. Unlike many leading agents that rely on remote cloud browsers, rtrvr.ai operates directly within the user's own browser as a Chrome Extension, reviewed through Google's Chrome Web Store process and running in the browser's sandboxed execution environment.
Key Benefits:
- No bot detection issues - runs from user's own browser and local IP
- Reuses authenticated sessions - works with your logged-in accounts
- No CAPTCHA blocking - bypasses challenges that plague cloud agents
- Maintains user privacy - no credential sharing with third parties
- Works with paywalled content - access subscriptions seamlessly
DOM-Based Intelligence
Rather than relying solely on visual cues or screenshots, rtrvr.ai leverages the underlying HTML structure of webpages, providing a deeper and more robust understanding of content and elements.
Key Advantages:
- Direct HTML structure interaction vs screenshot parsing
- Handles pop-ups and overlays that block vision-based agents
- Enables parallel multi-tab workflows
- Works natively in any language (no OCR errors)
- Highly accurate data scraping
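The core principle, locating a target element by its structure and attributes rather than by what is visually on top of it, can be illustrated with a minimal stdlib sketch. rtrvr.ai's actual implementation is not public; this only demonstrates the idea that a DOM view reaches an element even when an overlay would visually cover it:

```python
from html.parser import HTMLParser

# A page where a newsletter overlay visually sits on top of the checkout
# button. A screenshot-based agent sees only the overlay; the DOM still
# exposes the button's structure and attributes directly.
PAGE = """
<div class="overlay">Subscribe to our newsletter!</div>
<form>
  <button id="checkout">Checkout</button>
</form>
"""

class ButtonFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # Collect every <button> by its id, regardless of visual occlusion.
        if tag == "button":
            self.found.append(dict(attrs).get("id"))

finder = ButtonFinder()
finder.feed(PAGE)
```

A vision-based agent would first have to notice and dismiss the overlay; the DOM traversal reaches the button in one step.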
Key Observation on Vision-Based Agents
During our evaluation, rtrvr.ai performed well even when common web elements like pop-ups and overlays appeared: it either closed them or simply performed its action despite them. This contrasts sharply with many CUA (Computer Use Agent) and other vision-based agents, which often struggle with such elements. For a vision agent, a pop-up can completely obscure the underlying webpage, so the agent must first identify and close it before it can even "see" and interact with the intended content.
Collapsing Exponential Failure Rates
A particularly compelling benefit of the DOM-based approach is its ability to mitigate the "exponential failure rate" problem inherent in multi-step web automation: when individual steps fail independently, the probability that an entire sequential workflow succeeds shrinks multiplicatively with each additional step. By splitting independent subtasks across multiple tabs, rtrvr.ai isolates failures, so a bad step costs one subtask rather than the whole run, making sophisticated multi-step tasks significantly more robust and reliable.
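The compounding effect can be made concrete. Assuming n independent steps with per-step success probability p, a sequential chain succeeds only with probability p**n, while independent subtasks fanned out across tabs each succeed or fail on their own:

```python
# Sequential chain: every step must succeed, so success decays as p**n.
p, n = 0.95, 10
sequential_success = p ** n  # ~0.60: roughly 40% of runs lose everything

# Parallel tabs (assuming the n subtasks are independent): each tab
# succeeds or fails on its own, so the expected fraction of completed
# subtasks stays at p, and one bad tab does not abort the other nine.
expected_completed_fraction = p
```

The 0.95 and 10 here are illustrative numbers, not measured per-step rates; the point is the shape of the decay, not the specific values.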
AI Function Calling
rtrvr.ai empowers users through its "AI Function Calling" capability, allowing them to define and supply their own custom code or functions that the AI agent can autonomously invoke. This feature provides immense flexibility and extensibility, enabling users to tailor the agent's capabilities to virtually any external tool, API, or custom workflow.
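In rough outline, user-supplied function calling amounts to a registry of named, described callables that the model can ask the agent to invoke. The sketch below is hypothetical: the names (`register_tool`, `dispatch`, `lookup_order`) and the call shape are illustrative assumptions, not rtrvr.ai's actual API:

```python
# Hypothetical sketch of user-supplied function calling. All names here
# (register_tool, dispatch, lookup_order) are illustrative, not rtrvr.ai's API.
TOOLS = {}

def register_tool(name, description, fn):
    """Expose a user-defined function to the agent under a declared name."""
    TOOLS[name] = {"description": description, "fn": fn}

def dispatch(call):
    """Execute a tool call the model emits, e.g. {'name': ..., 'args': {...}}."""
    return TOOLS[call["name"]]["fn"](**call["args"])

# A custom tool the agent could invoke autonomously during a workflow:
register_tool(
    "lookup_order",
    "Fetch an order's status from an internal API by order id.",
    lambda order_id: {"order_id": order_id, "status": "shipped"},
)

result = dispatch({"name": "lookup_order", "args": {"order_id": "A-1001"}})
```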
🔍 Failure Mode Analysis
A critical component of our evaluation is the detailed breakdown of failure modes:
Agent vs. Infrastructure Errors
| Error Type | Percentage | Description |
|---|---|---|
| Agent Errors | 96.61% | Internal AI logic and execution issues - can be directly addressed through AI improvements |
| Infrastructure Errors | 3.39% | External blocking and access issues - remarkably low due to local operation |
This extremely low percentage of infrastructure errors is a direct testament to rtrvr.ai's local, browser-extension design. Unlike cloud-based agents that frequently encounter obstacles such as bot detection, CAPTCHAs, and login authentication issues, rtrvr.ai's operation within the user's own browser effectively bypasses these common external barriers.
Why this matters: Having nearly all failures attributable to agent errors (rather than infrastructure) means development can focus entirely on enhancing core AI intelligence, reasoning, and robustness, rather than managing external factors like proxy rotations or CAPTCHA-solving services.
🧪 Our Evaluation Methodology
Evaluation Setup
- Security First: Credit cards were locked before evaluation to prevent unintended transactions
- Pre-registered Accounts: Tasks assumed the agent was already logged into necessary accounts
- Streamlined Task Management: rtrvr.ai's capability to ingest tasks and URLs directly from spreadsheet formats made benchmark setup remarkably easy
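For illustration, a run can be described as rows of task descriptions paired with starting URLs; the two-column layout and column names below are an assumption for the example, not rtrvr.ai's required format:

```python
import csv
import io

# Hypothetical task sheet: column names "task" and "url" are illustrative.
SHEET = """task,url
Find the cheapest flight to Tokyo,https://example.com/flights
Extract the top headline,https://example.com/news
"""

tasks = list(csv.DictReader(io.StringIO(SHEET)))
```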
Key Learnings & Observations
Iterative Improvement ("Hill Climb"): We will continue "hill climbing" on the identified failure cases and expect dramatically better performance on future runs.
Agent's Tool Use (Googling): Although task goals were confined to navigating specific websites, the agent occasionally resorted to Googling, which we counted as valid given rtrvr.ai's robust URL navigation capabilities.
Networking and Posting Limits: Certain websites exhibited aggressive limits, occasionally flagging IP addresses, pointing to requirements for distributed testing setups or rotating IPs.
Agent Interaction Quirks Identified
- Aggressive Scrolling: Sometimes exhibited aggressive scrolling behavior
- No Hover Action: Current limitation preventing interaction with hover-dependent UI elements
- Dropdown Bugs: Challenges with multi-step dropdown interactions
- Crawl Functionality Limits: Multi-tab processing was purposefully limited during benchmarking for consistency
📝 Notes on Web Bench Design
While Halluminate's Web Bench is a significant step forward, our evaluation highlighted several considerations:
- Language Limitations: The current benchmark lacks tasks involving foreign language sites
- Real-World Relevance: There's a disconnect between the "top human-visited websites" the benchmark samples and the sites where users would most want AI agents
- Task Design: Future benchmarks could be more complex and open-ended, explicitly encouraging agents to utilize their full suite of tools
- Infrastructure Management: Running on personal machines resulted in IP flagging due to high request volumes
📺 See It In Action
Watch our complete benchmark evaluation playlist to validate the results yourself.
🚀 What This Means
rtrvr.ai's performance represents a fundamental breakthrough in AI web automation:
- Enterprise-ready reliability with over 80% success rate
- Production-ready speed completing tasks in under a minute
- Accessible pricing making automation available to everyone
- Beats human-supervised agents on read tasks (88.24% vs 79.0%)
The combination of superior accuracy, blazing speed, and cost efficiency makes rtrvr.ai the clear choice for businesses and developers seeking reliable web automation.
Get Started
Ready to experience the industry's leading AI web agent?
Install rtrvr.ai Chrome Extension →
View Complete Benchmark Data →
Works Cited
- Web Bench: The Current State of Browser Agents - Halluminate
- Web Bench - A new way to compare AI Browser Agents - Skyvern
Benchmark evaluation conducted June 2025 on Halluminate Web Bench v1.0 using 323 real-world tasks across read and write categories.