rtrvr.ai's Breakthrough Performance on the Halluminate Web Bench: Redefining AI Agent Capabilities
We're thrilled to announce that rtrvr.ai has achieved the #1 position on the Halluminate Web Bench, the industry's most comprehensive benchmark for AI web agents.
The Rise of AI Web Agents and the Need for Robust Benchmarks
AI web agents are transforming digital interaction by automating complex online tasks. However, the web's dynamic nature poses significant challenges for evaluating their performance reliably. This has created a critical demand for robust, standardized benchmarks like Halluminate's Web Bench, ensuring verifiable capabilities and accelerating trust in this emerging technology.
Understanding Halluminate's Web Bench
Halluminate's Web Bench offers a rigorous, comprehensive standard for evaluating AI browser agents by distinguishing between "READ" and "WRITE" tasks. Explore its methodology and current results at halluminate.ai/blog/benchmark.
📊 Key Performance Metrics
Overall Leadership
- 81.39% overall success rate - highest among all tested agents
- Surpasses OpenAI Operator + Human (76.5%) and Anthropic Sonnet 3.7 CUA (66.0%)
- First agent to break the 80% threshold
Task Performance Breakdown
Read Tasks: 88.24% success rate
- Best-in-class data extraction and information retrieval
- Beats the human-supervised benchmark (79.0%) by over 9 percentage points
- 7.6 percentage points higher than Anthropic CUA (80.6%)
Write Tasks: 65.63% success rate
- Leading performance in complex interactive tasks
- 41% higher in relative terms than the runner-up (Skyvern 2.0 at 46.6%)
- Approaching human-supervised benchmark (70.7%)
⚡ Speed Revolution
0.9 minutes average task completion, the fastest among all benchmarked agents:
| Agent | Avg Time | Speed vs rtrvr.ai |
|---|---|---|
| rtrvr.ai | 0.9 min | — |
| Browser Use Cloud | 6.35 min | 7x slower |
| OpenAI Operator | 10.1 min | 11x slower |
| Anthropic Sonnet 3.7 CUA | 11.81 min | 13x slower |
| Skyvern 2.0 | 12.49 min | 14x slower |
| Skyvern 2.0 on Browserbase | 20.84 min | 23x slower |
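The "Speed vs rtrvr.ai" column is simply each agent's average time divided by rtrvr.ai's 0.9-minute baseline, rounded to the nearest whole multiple. A quick sketch to reproduce it:

```python
# Reproduce the "Speed vs rtrvr.ai" column from the averages in the table:
# each agent's time divided by rtrvr.ai's 0.9-minute baseline, rounded.
avg_minutes = {
    "rtrvr.ai": 0.9,
    "Browser Use Cloud": 6.35,
    "OpenAI Operator": 10.1,
    "Anthropic Sonnet 3.7 CUA": 11.81,
    "Skyvern 2.0": 12.49,
    "Skyvern 2.0 on Browserbase": 20.84,
}

baseline = avg_minutes["rtrvr.ai"]
slowdown = {agent: round(t / baseline) for agent, t in avg_minutes.items()}
```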
💰 Cost Efficiency
- $0.12 average cost per task
- Total evaluation cost: ~$40 for 4,000 credits (323 tasks)
- Comparison: Halluminate reported testing costs of ~$3,000 per agent with human annotators
- 25x more cost-effective than cloud-based alternatives
- Powered by Gemini Flash for optimal price/performance
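The per-task figure follows directly from the totals above:

```python
# Back out the average cost per task from the reported totals.
total_cost_usd = 40   # approximate total evaluation cost
tasks = 323           # tasks in the benchmark run
cost_per_task = total_cost_usd / tasks  # consistent with the ~$0.12 figure
```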
🏆 Complete Leaderboard
Overall Performance
| Rank | Agent | Success Rate |
|---|---|---|
| 🥇 1 | rtrvr.ai | 81.39% |
| — | OpenAI Operator + Human | 76.5% |
| 2 | Anthropic Sonnet 3.7 CUA | 66.0% |
| 3 | Skyvern 2.0 | 64.4% |
| 4 | Skyvern 2.0 on Browserbase | 60.7% |
| 5 | OpenAI Operator | 59.8% |
| 6 | Browser Use Cloud | 43.9% |
| 7 | Convergence AI | 39.9% |
Read Tasks Performance
| Rank | Agent | Success Rate |
|---|---|---|
| 🥇 1 | rtrvr.ai | 88.24% |
| 2 | Anthropic Sonnet 3.7 CUA | 80.6% |
| 3 | Skyvern 2.0 on Browserbase | 75.6% |
| 4 | OpenAI Operator | 75.0% |
| 5 | Skyvern 2.0 | 74.2% |
| 6 | Browser Use Cloud | 63.2% |
| 7 | Convergence AI | 51.8% |
Reference: Operator with Human Supervisor achieves 79.0%
Write Tasks Performance
| Rank | Agent | Success Rate |
|---|---|---|
| 🥇 1 | rtrvr.ai | 65.63% |
| 2 | Skyvern 2.0 | 46.6% |
| 3 | Anthropic Sonnet 3.7 CUA | 39.4% |
| 4 | Skyvern 2.0 on Browserbase | 33.6% |
| 5 | OpenAI Operator | 32.3% |
| 6 | Convergence AI | 13.1% |
| 7 | Browser Use Cloud | 11.4% |
Reference: Operator with Human Supervisor achieves 70.7%
🔑 Why rtrvr.ai Dominates
Local-First Architecture
rtrvr.ai distinguishes itself through a fundamental architectural difference: its commitment to local operation. Unlike many leading agents that rely on remote cloud browsers, rtrvr.ai operates directly within the user's own browser as a Chrome Extension, reviewed through Google's Chrome Web Store process and running in the browser's sandboxed execution environment.
Key Benefits:
- No bot detection issues - runs from user's own browser and local IP
- Reuses authenticated sessions - works with your logged-in accounts
- No CAPTCHA blocking - bypasses challenges that plague cloud agents
- Maintains user privacy - no credential sharing with third parties
- Works with paywalled content - access subscriptions seamlessly
DOM-Based Intelligence
Rather than relying solely on visual cues or screenshots, rtrvr.ai leverages the underlying HTML structure of webpages, providing a deeper and more robust understanding of content and elements.
Key Advantages:
- Direct HTML structure interaction vs screenshot parsing
- Handles pop-ups and overlays that block vision-based agents
- Enables parallel multi-tab workflows
- Works natively in any language (no OCR errors)
- Highly accurate data scraping
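The core principle, locating a target element by its structure and attributes rather than by what is visually on top of it, can be illustrated with a minimal stdlib sketch. rtrvr.ai's actual implementation is not public; this only demonstrates the idea that a DOM view reaches an element even when an overlay would visually cover it:

```python
from html.parser import HTMLParser

# A page where a newsletter overlay visually sits on top of the checkout
# button. A screenshot-based agent sees only the overlay; the DOM still
# exposes the button's structure and attributes directly.
PAGE = """
<div class="overlay">Subscribe to our newsletter!</div>
<form>
  <button id="checkout">Checkout</button>
</form>
"""

class ButtonFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # Collect every <button> by its id, regardless of visual occlusion.
        if tag == "button":
            self.found.append(dict(attrs).get("id"))

finder = ButtonFinder()
finder.feed(PAGE)
```

A vision-based agent would first have to notice and dismiss the overlay; the DOM traversal reaches the button in one step.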
Key Observation on Vision-Based Agents
During our evaluation, rtrvr.ai performed well even when common web elements like pop-ups and overlays appeared: it either closed them or simply performed its action despite them. This contrasts sharply with many CUA (Computer Use Agent) and other vision-based agents, which often struggle with such elements. For a vision agent, a pop-up can completely obscure the underlying webpage, so the agent must first identify and close it before it can even "see" and interact with the intended content.
Collapsing Exponential Failure Rates
A particularly compelling benefit of the DOM-based approach is its ability to mitigate the "exponential failure rate" problem inherent in multi-step web automation: when individual steps fail independently, the probability that an entire sequential workflow succeeds shrinks multiplicatively with each additional step. By splitting independent subtasks across multiple tabs, rtrvr.ai isolates failures, so a bad step costs one subtask rather than the whole run, making sophisticated multi-step tasks significantly more robust and reliable.
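The compounding effect can be made concrete. Assuming n independent steps with per-step success probability p, a sequential chain succeeds only with probability p**n, while independent subtasks fanned out across tabs each succeed or fail on their own:

```python
# Sequential chain: every step must succeed, so success decays as p**n.
p, n = 0.95, 10
sequential_success = p ** n  # ~0.60: roughly 40% of runs lose everything

# Parallel tabs (assuming the n subtasks are independent): each tab
# succeeds or fails on its own, so the expected fraction of completed
# subtasks stays at p, and one bad tab does not abort the other nine.
expected_completed_fraction = p
```

The 0.95 and 10 here are illustrative numbers, not measured per-step rates; the point is the shape of the decay, not the specific values.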
AI Function Calling
rtrvr.ai empowers users through its "AI Function Calling" capability, allowing them to define and supply their own custom code or functions that the AI agent can autonomously invoke. This feature provides immense flexibility and extensibility, enabling users to tailor the agent's capabilities to virtually any external tool, API, or custom workflow.
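In rough outline, user-supplied function calling amounts to a registry of named, described callables that the model can ask the agent to invoke. The sketch below is hypothetical: the names (`register_tool`, `dispatch`, `lookup_order`) and the call shape are illustrative assumptions, not rtrvr.ai's actual API:

```python
# Hypothetical sketch of user-supplied function calling. All names here
# (register_tool, dispatch, lookup_order) are illustrative, not rtrvr.ai's API.
TOOLS = {}

def register_tool(name, description, fn):
    """Expose a user-defined function to the agent under a declared name."""
    TOOLS[name] = {"description": description, "fn": fn}

def dispatch(call):
    """Execute a tool call the model emits, e.g. {'name': ..., 'args': {...}}."""
    return TOOLS[call["name"]]["fn"](**call["args"])

# A custom tool the agent could invoke autonomously during a workflow:
register_tool(
    "lookup_order",
    "Fetch an order's status from an internal API by order id.",
    lambda order_id: {"order_id": order_id, "status": "shipped"},
)

result = dispatch({"name": "lookup_order", "args": {"order_id": "A-1001"}})
```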
🔍 Failure Mode Analysis
A critical component of our evaluation is the detailed breakdown of failure modes:
Agent vs. Infrastructure Errors
| Error Type | Percentage | Description |
|---|---|---|
| Agent Errors | 96.61% | Internal AI logic and execution issues - can be directly addressed through AI improvements |
| Infrastructure Errors | 3.39% | External blocking and access issues - remarkably low due to local operation |
This extremely low percentage of infrastructure errors is a direct testament to rtrvr.ai's local, browser-extension design. Unlike cloud-based agents that frequently encounter obstacles such as bot detection, CAPTCHAs, and login authentication issues, rtrvr.ai's operation within the user's own browser effectively bypasses these common external barriers.
Why this matters: Having nearly all failures attributable to agent errors (rather than infrastructure) means development can focus entirely on enhancing core AI intelligence, reasoning, and robustness, rather than managing external factors like proxy rotations or CAPTCHA-solving services.
🧪 Our Evaluation Methodology
Evaluation Setup
- Security First: Credit cards were locked before evaluation to prevent unintended transactions
- Pre-registered Accounts: Tasks assumed the agent was already logged into necessary accounts
- Streamlined Task Management: rtrvr.ai's capability to ingest tasks and URLs directly from spreadsheet formats made benchmark setup remarkably easy
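For illustration, a run can be described as rows of task descriptions paired with starting URLs; the two-column layout and column names below are an assumption for the example, not rtrvr.ai's required format:

```python
import csv
import io

# Hypothetical task sheet: column names "task" and "url" are illustrative.
SHEET = """task,url
Find the cheapest flight to Tokyo,https://example.com/flights
Extract the top headline,https://example.com/news
"""

tasks = list(csv.DictReader(io.StringIO(SHEET)))
```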
Key Learnings & Observations
Iterative Improvement ("Hill Climb"): We will continue "hill climbing" on the identified failure cases and expect dramatically better performance on future runs.
Agent's Tool Use (Googling): Although task goals were confined to navigating specific websites, the agent occasionally resorted to Googling, which we counted as valid given rtrvr.ai's robust URL navigation capabilities.
Networking and Posting Limits: Certain websites exhibited aggressive limits, occasionally flagging IP addresses, pointing to requirements for distributed testing setups or rotating IPs.
Agent Interaction Quirks Identified
- Aggressive Scrolling: Sometimes exhibited aggressive scrolling behavior
- No Hover Action: Current limitation preventing interaction with hover-dependent UI elements
- Dropdown Bugs: Challenges with multi-step dropdown interactions
- Crawl Functionality Limits: Multi-tab processing was purposefully limited during benchmarking for consistency
📝 Notes on Web Bench Design
While Halluminate's Web Bench is a significant step forward, our evaluation highlighted several considerations:
- Language Limitations: The current benchmark lacks tasks involving foreign language sites
- Real-World Relevance: There's a disconnect between the "top human-visited websites" the benchmark samples and the sites where users would most want AI agents
- Task Design: Future benchmarks could be more complex and open-ended, explicitly encouraging agents to utilize their full suite of tools
- Infrastructure Management: Running on personal machines resulted in IP flagging due to high request volumes
📺 See It In Action
Watch our complete benchmark evaluation playlist to validate the results yourself.
🚀 What This Means
rtrvr.ai's performance represents a fundamental breakthrough in AI web automation:
- Enterprise-ready reliability with over 80% success rate
- Production-ready speed completing tasks in under a minute
- Accessible pricing making automation available to everyone
- Beats human-supervised agents on read tasks (88.24% vs 79.0%)
The combination of superior accuracy, blazing speed, and cost efficiency makes rtrvr.ai the clear choice for businesses and developers seeking reliable web automation.
Get Started
Ready to experience the industry's leading AI web agent?
Install rtrvr.ai Chrome Extension →
View Complete Benchmark Data →
Works Cited
- Web Bench: The Current State of Browser Agents - Halluminate
- Web Bench - A new way to compare AI Browser Agents - Skyvern
Benchmark evaluation conducted June 2025 on Halluminate Web Bench v1.0 using 323 real-world tasks across read and write categories.