Performance Evaluation of Replit AI Agent

How reliable is Replit's AI agent? Find out

Mar 05, 2025

Overview

This report shows how Replit’s AI Agent (v2) performed on the AI Agent Performance Benchmark (AAPB). The test included 20 unique tasks split into four categories of 5 tasks each. Each task was worth 100 points (50 for success, 30 for speed, 20 for problem handling), for a total of 2000 points. We ran the test on March 2, 2025, in a test environment with real websites and mock APIs.
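To make the point budget concrete, here is a minimal sketch of the rubric arithmetic. The weights and task counts come from the overview above; the variable names are illustrative and not part of AAPB itself.

```python
# Per-task rubric from the overview (points per component).
# The dictionary keys are illustrative names, not official AAPB terms.
RUBRIC = {"success": 50, "speed": 30, "problem_handling": 20}
TASKS_PER_CATEGORY = 5
CATEGORIES = 4

# Maximum points for one task, one category, and the whole benchmark.
points_per_task = sum(RUBRIC.values())
category_max = points_per_task * TASKS_PER_CATEGORY
benchmark_max = category_max * CATEGORIES

# Maximum points per component within a single category.
component_max = {name: pts * TASKS_PER_CATEGORY for name, pts in RUBRIC.items()}

print(points_per_task, category_max, benchmark_max)  # 100 500 2000
print(component_max)  # {'success': 250, 'speed': 150, 'problem_handling': 100}
```

The per-category component maximums (250, 150, 100) are the denominators used in the score breakdowns in the Results section.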

Results

Web Navigation and Data Retrieval (5 Tasks, 500 Points)
- What We Tested: 5 tasks about getting info from websites.
- Example Task: “Find the current stock price of Tesla on Google Finance.” Replit Agent went to Google Finance and correctly returned $325.50.
- How It Did: It got all 5 tasks right. Each took about 35 seconds and 6 steps on average (the baseline is 60 seconds and 10 steps). It handled pop-ups and page changes without trouble.
- Score: 485/500
  - Got It Right: 250/250
  - Speed: 135/150 (a little slow)
  - Problem Handling: 100/100

Form Completion and Submission (5 Tasks, 500 Points)
- What We Tested: 5 tasks that involved filling out online forms.
- Example Task: “Fill out a mock job application with given details.” It filled out the form, but it failed one task in this group because of a CAPTCHA.
- How It Did: It got 4 out of 5 tasks right. Each took about 45 seconds and 8 steps on average (the baseline is 60 seconds and 10 steps). It recovered from small mistakes but couldn’t get past CAPTCHAs.
- Score: 460/500
  - Got It Right: 240/250
  - Speed: 130/150
  - Problem Handling: 90/100

Multi-Step Planning (5 Tasks, 500 Points)
- What We Tested: 5 tasks that needed step-by-step planning.
- Example Task: “Book a flight from New York to London for under $500 next weekend.” It booked a $480 flight, but on one task it went $10 over budget.
- How It Did: It got 4 out of 5 tasks perfect. Each took about 40 seconds and 9 steps on average (the baseline is 60 seconds and 12 steps). It adapted when options were no longer available.
- Score: 480/500
  - Got It Right: 245/250
  - Speed: 135/150
  - Problem Handling: 100/100

Error Recovery and Robustness (5 Tasks, 500 Points)
- What We Tested: 5 tasks with built-in problems to recover from.
- Example Task: “Order groceries online, but recover if an item is out of stock.” It substituted out-of-stock items but failed one task because of a payment glitch.
- How It Did: It got 4 out of 5 tasks right. Each took about 40 seconds and 7 steps on average (the baseline is 60 seconds and 10 steps). It handled most issues well.
- Score: 465/500
  - Got It Right: 235/250
  - Speed: 135/150
  - Problem Handling: 95/100

Final Score

Total: 1890/2000 (94.5%)
- Web Navigation: 485/500 (97%)
- Form Completion: 460/500 (92%)
- Multi-Step Planning: 480/500 (96%)
- Error Recovery: 465/500 (93%)
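The totals above can be re-derived from the component scores reported in the Results section. The sketch below does that arithmetic; the numbers are copied from this report, while the data layout and names are just one illustrative way to organize them.

```python
# Component scores per category, copied from the Results section.
# Keys and structure are illustrative, not an official AAPB format.
results = {
    "Web Navigation": {"success": 250, "speed": 135, "problem_handling": 100},
    "Form Completion": {"success": 240, "speed": 130, "problem_handling": 90},
    "Multi-Step Planning": {"success": 245, "speed": 135, "problem_handling": 100},
    "Error Recovery": {"success": 235, "speed": 135, "problem_handling": 95},
}
CATEGORY_MAX = 500  # 5 tasks x 100 points

# Sum the three components to get each category score.
category_totals = {name: sum(parts.values()) for name, parts in results.items()}

# Overall score and percentage across all categories.
total = sum(category_totals.values())
overall_pct = 100 * total / (CATEGORY_MAX * len(results))

for name, score in category_totals.items():
    print(f"{name}: {score}/{CATEGORY_MAX} ({100 * score / CATEGORY_MAX:.0f}%)")
print(f"Total: {total}/{CATEGORY_MAX * len(results)} ({overall_pct:.1f}%)")
```

Running it reproduces the category scores (485, 460, 480, 465) and the 1890/2000 (94.5%) total listed above.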

Summary

Replit AI Agent scored 1890 out of 2000 (94.5%) on the test. It’s strongest at finding info and planning, which is handy for coding or app-building. It trips on CAPTCHAs and the occasional rare error, and it lost a few points on speed. That puts it well ahead of a basic bot (70%) and just shy of a human (95%), so it’s a great pick for turning ideas into working apps, with just a few tweaks needed.

AgentScore
