Performance Evaluation of OpenAI Operator

We put Operator through the wringer! See how it performs

Rush

Mar 05, 2025

Report: Performance Evaluation of OpenAI Operator

Date: March 2, 2025

Test: AI Agent Performance Benchmark (AAPB)

AgentScore: 1830/2000 (91.5%)

Overview

This report shows how OpenAI’s Operator performed on the AI Agent Performance Benchmark (AAPB). The benchmark has 20 tasks split into four groups of 5. Each task is worth 100 points (50 for getting it right, 30 for doing it fast, 20 for handling problems well), for a maximum of 2000 points. We ran the test on March 2, 2025, in an environment that combined real websites with mock APIs to mimic real-world conditions.
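The point arithmetic above can be checked with a short sketch. The names here (`TASK_WEIGHTS`, `max_score`, `group_maxima`) are illustrative and not part of AAPB itself; only the numbers come from the report.

```python
# Per-task point weights as stated in the overview.
# The dictionary and function names are hypothetical, chosen for illustration.
TASK_WEIGHTS = {"correctness": 50, "speed": 30, "problem_handling": 20}

GROUPS = 4           # task groups in the benchmark
TASKS_PER_GROUP = 5  # tasks in each group


def max_score(groups: int = GROUPS, tasks_per_group: int = TASKS_PER_GROUP) -> int:
    """Return the maximum possible score for a benchmark of this shape."""
    points_per_task = sum(TASK_WEIGHTS.values())
    return groups * tasks_per_group * points_per_task


def group_maxima(tasks_per_group: int = TASKS_PER_GROUP) -> dict[str, int]:
    """Return the per-group ceiling for each scoring component."""
    return {name: weight * tasks_per_group for name, weight in TASK_WEIGHTS.items()}


if __name__ == "__main__":
    print(max_score())     # 2000
    print(group_maxima())  # {'correctness': 250, 'speed': 150, 'problem_handling': 100}
```

The per-group ceilings (250, 150, 100) match the denominators used in the Results section.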

Results

Web Navigation and Data Retrieval (5 Tasks, 500 Points)
- What We Tested: 5 tasks about finding info online.
- Example Task: “Find the current stock price of Tesla on Google Finance.” Operator went to Google Finance and got the price ($320.15) right.
- How It Did: Operator finished all 5 tasks correctly. Each took about 30 seconds and 5 steps (normal is 60 seconds, 10 steps). It handled things like pop-ups fine.
- Score: 480/500
  - Got It Right: 250/250
  - Speed: 140/150 (a bit slow sometimes)
  - Problem Handling: 100/100

Form Completion and Submission (5 Tasks, 500 Points)
- What We Tested: 5 tasks about filling out online forms.
- Example Task: “Fill out a mock job application with given details.” Operator put in the info (like name and email) but failed once because of a CAPTCHA.
- How It Did: It got 4 out of 5 tasks right. Each took about 40 seconds and 7 steps (normal is 60 seconds, 10 steps). It recovered from small errors but could not get past CAPTCHAs.
- Score: 430/500
  - Got It Right: 225/250
  - Speed: 125/150
  - Problem Handling: 90/100

Multi-Step Planning (5 Tasks, 500 Points)
- What We Tested: 5 tasks needing multiple steps to plan something.
- Example Task: “Book a flight from New York to London for under $500 next weekend.” Operator booked a $490 flight but went $10 over budget once.
- How It Did: It got 4 out of 5 tasks perfect. Each took about 50 seconds and 10 steps (normal is 60 seconds, 12 steps). It switched options when needed.
- Score: 465/500
  - Got It Right: 240/250
  - Speed: 125/150
  - Problem Handling: 100/100

Error Recovery and Robustness (5 Tasks, 500 Points)
- What We Tested: 5 tasks where something goes wrong and Operator has to recover.
- Example Task: “Order groceries online, but recover if an item is out of stock.” Operator swapped items but failed once on a payment error.
- How It Did: It got 4 out of 5 tasks right. Each took about 45 seconds and 8 steps (normal is 60 seconds, 10 steps). It fixed most issues well.
- Score: 455/500
  - Got It Right: 230/250
  - Speed: 130/150
  - Problem Handling: 95/100
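
As a quick consistency check, the sub-scores listed above can be added up and compared with each group's reported score. This is a minimal sketch: the `REPORTED` table and `subscore_gaps` function are illustrative names, and only the numbers come from the report. With the figures as listed, the first two groups' sub-scores sum to 490 and 440, which is 10 points above the reported 480 and 430. The report doesn't say which figures are off.

```python
# Sub-scores exactly as listed in the Results section:
# (Got It Right, Speed, Problem Handling, reported group score)
REPORTED = {
    "Web Navigation": (250, 140, 100, 480),
    "Form Completion": (225, 125, 90, 430),
    "Multi-Step Planning": (240, 125, 100, 465),
    "Error Recovery": (230, 130, 95, 455),
}


def subscore_gaps(reported=REPORTED):
    """Map each group to (sum of its sub-scores) minus (its reported score)."""
    return {name: sum(parts[:3]) - parts[3] for name, parts in reported.items()}


if __name__ == "__main__":
    for group, gap in subscore_gaps().items():
        status = "consistent" if gap == 0 else f"sub-scores exceed reported score by {gap}"
        print(f"{group}: {status}")
```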

Final Score

Total: 1830/2000 (91.5%)
- Web Navigation: 480/500 (96%)
- Form Completion: 430/500 (86%)
- Multi-Step Planning: 465/500 (93%)
- Error Recovery: 455/500 (91%)
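
The total and percentages above follow directly from the four group scores. A minimal sketch of that arithmetic, with illustrative names (`GROUP_SCORES`, `percent`) that are not part of the benchmark:

```python
# Reported group scores, each out of 500 points.
GROUP_SCORES = {
    "Web Navigation": 480,
    "Form Completion": 430,
    "Multi-Step Planning": 465,
    "Error Recovery": 455,
}
GROUP_MAX = 500


def percent(score: int, maximum: int) -> float:
    """Return score as a percentage of maximum."""
    return 100 * score / maximum


total = sum(GROUP_SCORES.values())
total_max = GROUP_MAX * len(GROUP_SCORES)

if __name__ == "__main__":
    print(f"Total: {total}/{total_max} ({percent(total, total_max)}%)")
    for group, score in GROUP_SCORES.items():
        print(f"- {group}: {score}/{GROUP_MAX} ({percent(score, GROUP_MAX):.0f}%)")
```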

Summary

Operator scored 1830 out of 2000 (91.5%) on the test. It’s great at finding info online and planning tasks, but it struggles with CAPTCHAs and rare errors like payment issues, and it lost some speed points in every group. That puts it well ahead of a basic bot (70%) and close to a human (95%). It’s very good and useful for automating jobs, though it’s not flawless yet.
