Performance Evaluation of OpenAI Operator
We put Operator through the wringer. See how it performed.
Report: Performance Evaluation of OpenAI Operator
Date: March 2, 2025
Test: AI Agent Performance Benchmark (AAPB)
AgentScore: 1830/2000 (91.5%)
Overview
This report shows how OpenAI’s Operator performed on the AI Agent Performance Benchmark (AAPB). The benchmark has 20 tasks split into four groups of 5 tasks each. Each task is worth 100 points (50 for getting it right, 30 for doing it fast, 20 for handling problems well), for a maximum of 2000 points. We ran the test on March 2, 2025, in a digital setup that combined real websites with mock APIs to mimic real-world conditions.
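The point scheme described above can be written out as a short Python sketch. The names `TaskScore`, `correctness`, `speed`, and `problem_handling` are illustrative choices, not part of the AAPB itself, and the example values are hypothetical rather than taken from any task in this report.

```python
from dataclasses import dataclass

# Per-task ceilings as stated in the overview.
CORRECTNESS_MAX = 50
SPEED_MAX = 30
PROBLEM_HANDLING_MAX = 20
TASK_MAX = CORRECTNESS_MAX + SPEED_MAX + PROBLEM_HANDLING_MAX

# 20 tasks in total: four groups of five.
TASKS_PER_GROUP = 5
GROUP_COUNT = 4
GROUP_MAX = TASK_MAX * TASKS_PER_GROUP
BENCHMARK_MAX = GROUP_MAX * GROUP_COUNT


@dataclass
class TaskScore:
    """Points earned on one task (hypothetical container for illustration)."""
    correctness: int
    speed: int
    problem_handling: int

    def total(self) -> int:
        parts = (
            (self.correctness, CORRECTNESS_MAX),
            (self.speed, SPEED_MAX),
            (self.problem_handling, PROBLEM_HANDLING_MAX),
        )
        # Reject values outside the rubric's per-task ceilings.
        for earned, ceiling in parts:
            if not 0 <= earned <= ceiling:
                raise ValueError(f"{earned} is outside 0..{ceiling}")
        return sum(earned for earned, _ in parts)


# Hypothetical task: fully correct, slightly slow, all problems handled.
example = TaskScore(correctness=50, speed=24, problem_handling=20)
print(example.total(), TASK_MAX, GROUP_MAX, BENCHMARK_MAX)
```

The per-group ceilings that follow from this (250 for correctness, 150 for speed, 100 for problem handling) match the denominators used in the results.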
Results
Web Navigation and Data Retrieval (5 Tasks, 500 Points)
What We Tested: 5 tasks about finding info online.
Example Task: “Find the current stock price of Tesla on Google Finance.” Operator went to Google Finance and got the price ($320.15) right.
How It Did: Operator finished all 5 tasks correctly. Each took about 30 seconds and 5 steps (normal is 60 seconds, 10 steps). It handled things like pop-ups fine.
Score: 480/500
Got It Right: 250/250
Speed: 140/150 (a bit slow sometimes)
Problem Handling: 100/100
Form Completion and Submission (5 Tasks, 500 Points)
What We Tested: 5 tasks about filling out online forms.
Example Task: “Fill out a mock job application with given details.” Operator put in the info (like name and email) but failed once because of a CAPTCHA.
How It Did: It got 4 out of 5 tasks right. Each took about 40 seconds and 7 steps (normal is 60 seconds, 10 steps). It recovered from small errors but could not get past CAPTCHAs.
Score: 430/500
Got It Right: 225/250
Speed: 125/150
Problem Handling: 90/100
Multi-Step Planning (5 Tasks, 500 Points)
What We Tested: 5 tasks needing multiple steps to plan something.
Example Task: “Book a flight from New York to London for under $500 next weekend.” Operator booked a $490 flight, though in one case it went $10 over the budget.
How It Did: It completed 4 out of 5 tasks perfectly. Each took about 50 seconds and 10 steps (normal is 60 seconds, 12 steps). It switched options when needed.
Score: 465/500
Got It Right: 240/250
Speed: 125/150
Problem Handling: 100/100
Error Recovery and Robustness (5 Tasks, 500 Points)
What We Tested: 5 tasks where something goes wrong and Operator has to recover.
Example Task: “Order groceries online, but recover if an item is out of stock.” Operator swapped items but failed once on a payment error.
How It Did: It got 4 out of 5 tasks right. Each took about 45 seconds and 8 steps (normal is 60 seconds, 10 steps). It fixed most issues well.
Score: 455/500
Got It Right: 230/250
Speed: 130/150
Problem Handling: 95/100
Final Score
Total: 1830/2000 (91.5%)
Web Navigation: 480/500 (96%)
Form Completion: 430/500 (86%)
Multi-Step Planning: 465/500 (93%)
Error Recovery: 455/500 (91%)
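The totals and percentages can be re-derived from the figures reported in the results. The sketch below is only an arithmetic check with illustrative variable names; it copies the category scores and their three components exactly as listed.

```python
GROUP_MAX = 500

# (reported category score, (Got It Right, Speed, Problem Handling)) as listed.
reported = {
    "Web Navigation": (480, (250, 140, 100)),
    "Form Completion": (430, (225, 125, 90)),
    "Multi-Step Planning": (465, (240, 125, 100)),
    "Error Recovery": (455, (230, 130, 95)),
}

# Overall score and percentage from the reported category scores.
total = sum(score for score, _ in reported.values())
percent = 100 * total / (GROUP_MAX * len(reported))

# Percentage per category.
category_percent = {
    name: 100 * score / GROUP_MAX for name, (score, _) in reported.items()
}

# Categories whose listed components do not add up to the reported score.
mismatches = {
    name: sum(parts)
    for name, (score, parts) in reported.items()
    if sum(parts) != score
}

print(total, percent)
print(category_percent)
print(mismatches)
```

The category scores add up to 1830 (91.5%), in line with the total above. The listed components, however, sum to 490 for Web Navigation and 440 for Form Completion, 10 points above the reported 480 and 430; the other two categories match their components.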
Summary
Operator scored 1830 out of 2000 (91.5%) on the test. It’s great at finding info online and planning tasks, but it struggles with CAPTCHAs and rare errors like payment issues. It also lost some speed points in every category. Compared to a basic bot (70%) or a human (95%), it’s very good and useful for automating jobs, though it’s not flawless yet.