Overview

Use BrowserEnv with the Prime CLI to evaluate browser agents on structured tasks. Each evaluation run spins up Browserbase sessions, feeds observations to your model, and collects reward signals — giving you reproducible benchmarks for browser-capable models.
[Screenshot: Prime Intellect evaluation run showing browser agent rollouts with reward signals]

Prerequisites

Browserbase Account

API key and project ID from your Browserbase dashboard

Prime CLI

Install via uv add prime

verifiers

Install with browser extras: uv add 'verifiers[browser]' (quote the extras so your shell does not treat the brackets as a glob pattern)

Install and Configure

Set Browserbase Credentials

Export your Browserbase credentials so BrowserEnv can create sessions:
export BROWSERBASE_API_KEY=your_browserbase_api_key
export BROWSERBASE_PROJECT_ID=your_browserbase_project_id
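BrowserEnv cannot create sessions if either variable is unset, so it can help to sanity-check them before launching a run. A minimal sketch (the helper below is illustrative, not part of the Prime CLI):

```python
import os

REQUIRED = ("BROWSERBASE_API_KEY", "BROWSERBASE_PROJECT_ID")

def missing_credentials(env=os.environ):
    """Return the names of required Browserbase variables that are unset or empty."""
    return [name for name in REQUIRED if not env.get(name)]

# Example with an incomplete environment:
print(missing_credentials({"BROWSERBASE_API_KEY": "bb_live_..."}))
# -> ['BROWSERBASE_PROJECT_ID']
```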

Install the Prime CLI

uv add prime
prime login

Install verifiers with Browser Support

uv add 'verifiers[browser]'

Choose a BrowserEnv Mode

BrowserEnv supports two observation/action modes. The mode is selected when you run an evaluation, either through the environment's default or via -a args.

DOM Mode

The agent receives structured DOM content and issues natural-language instructions via Stagehand tools (navigate, observe, act, extract). This is the default and works well for most browser tasks.
prime eval run browser-dom-example -m openai/gpt-4.1 -k PRIME_API_KEY

CUA Mode

The agent receives screenshots and uses coordinate-based tool calls (click, type_text, scroll, screenshot). Use this for vision models trained on screenshot-grounded interaction.
prime eval run browser-cua-example -m anthropic/claude-opus-4.5 -k PRIME_API_KEY
By default, CUA mode deploys a sandbox server that handles the connection to Browserbase's custom CDP driver, Understudy, which works around performance limitations of Playwright. You can also run against a local server with -a '{"use_sandbox": false}'; see Operational Notes below.

Run an Evaluation

Install a Hub Environment

Install a published Browserbase environment from the Prime hub:
prime env install browser-dom-example

Run with Default Settings

prime eval run browser-dom-example -m openai/gpt-4.1 -k PRIME_API_KEY
[Screenshot: CLI output from a browser-dom-example evaluation run showing tool calls, reward, and metrics]

Override Evaluation Parameters

Control the number of examples, rollouts, and environment-specific args:
prime eval run browser-dom-example \
  -m openai/gpt-4.1 \
  -k PRIME_API_KEY \
  -n 10 \
  -r 2
| Flag | Short | Description |
| --- | --- | --- |
| --model | -m | Model to evaluate (e.g. openai/gpt-4.1, anthropic/claude-opus-4.5) |
| --api-key-var | -k | Environment variable name for the model API key |
| --num-examples | -n | Number of task examples to evaluate |
| --rollouts-per-example | -r | Rollouts per example |
| --env-args | -a | JSON args passed to the environment's load_environment() |
| --max-concurrent | -c | Max concurrent requests |
| --save-results | -s | Save results to disk |

Pass Environment Args

Use -a to pass JSON arguments to the environment. These are forwarded to the load_environment() function:
# DOM mode with custom Stagehand model and max turns
prime eval run browser-dom-example \
  -m openai/gpt-4.1 \
  -k PRIME_API_KEY \
  -a '{"max_turns": 20, "stagehand_model": "openai/gpt-4.1"}'

# CUA mode with proxies and stealth
prime eval run browser-cua-example \
  -m anthropic/claude-opus-4.5 \
  -k PRIME_API_KEY \
  -a '{"proxies": true, "advanced_stealth": true}'
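Because the -a payload is plain JSON forwarded to load_environment() as keyword arguments, scripted runs can build it programmatically instead of hand-quoting it. A sketch (the helper name is illustrative; the flags mirror the examples above):

```python
import json
import shlex

def eval_command(env_id: str, model: str, key_var: str, **env_args) -> str:
    """Compose a `prime eval run` invocation with a JSON -a payload."""
    cmd = ["prime", "eval", "run", env_id, "-m", model, "-k", key_var]
    if env_args:
        # json.dumps produces the exact JSON string the -a flag expects;
        # shlex.join shell-quotes it safely.
        cmd += ["-a", json.dumps(env_args)]
    return shlex.join(cmd)

print(eval_command("browser-cua-example", "anthropic/claude-opus-4.5",
                   "PRIME_API_KEY", proxies=True, advanced_stealth=True))
# -> prime eval run browser-cua-example -m anthropic/claude-opus-4.5 -k PRIME_API_KEY -a '{"proxies": true, "advanced_stealth": true}'
```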

Run a Published Benchmark

Browserbase publishes browser benchmarks on the Prime hub:
# Mind2Web benchmark
prime eval run browserbase/mind2web \
  -m anthropic/claude-opus-4.5 \
  -r 1 -n 10 \
  -a '{"max_turns": 50, "proxies": true, "advanced_stealth": true}'

# WebVoyager benchmark
prime eval run browserbase/webvoyager \
  -m anthropic/claude-opus-4.5 \
  -r 1 -n 4 \
  -a '{"max_turns": 5, "proxies": true, "advanced_stealth": true}'

Run from a Local Environment

If your environment lives in a local directory:
prime eval run ./my_browser_env -m openai/gpt-4.1 -k PRIME_API_KEY

Operational Notes

By default, CUA mode deploys a sandbox server using a pre-built Docker image (deepdream19/cua-server:latest) that exposes Browserbase's CDP framework, Understudy. This is the recommended setup. For local development, you can run the CUA server yourself and disable the sandbox:
prime eval run browser-cua-example \
  -m openai/gpt-4.1 \
  -k PRIME_API_KEY \
  -a '{"use_sandbox": false, "server_url": "http://localhost:3000"}'
Enable Proxies and Stealth Mode via environment args:
prime eval run browser-dom-example \
  -m openai/gpt-4.1 \
  -k PRIME_API_KEY \
  -a '{"proxies": true, "advanced_stealth": true}'
These are passed through to Browserbase session creation.
DOM mode requires:
  • BROWSERBASE_API_KEY — Browserbase API key
  • BROWSERBASE_PROJECT_ID — Browserbase project ID
  • MODEL_API_KEY — API key for Stagehand’s underlying model
CUA mode requires:
  • BROWSERBASE_API_KEY — Browserbase API key
  • BROWSERBASE_PROJECT_ID — Browserbase project ID
  • PRIME_API_KEY — Required when using sandbox mode (default). Set via prime login or as an env var.
CUA mode optional:
  • OPENAI_API_KEY — Forwarded into the sandbox container if set
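The per-mode requirements above can be collapsed into a small preflight check. A sketch (the Prime CLI performs its own validation; this helper is illustrative only):

```python
import os

# Required environment variables per BrowserEnv mode, per the lists above.
REQUIRED_VARS = {
    "dom": ["BROWSERBASE_API_KEY", "BROWSERBASE_PROJECT_ID", "MODEL_API_KEY"],
    # PRIME_API_KEY is needed for the default sandbox mode.
    "cua": ["BROWSERBASE_API_KEY", "BROWSERBASE_PROJECT_ID", "PRIME_API_KEY"],
}

def preflight(mode: str, env=os.environ) -> list[str]:
    """Return the required variables for `mode` that are unset or empty."""
    return [v for v in REQUIRED_VARS[mode] if not env.get(v)]

missing = preflight("cua", {"BROWSERBASE_API_KEY": "x",
                            "BROWSERBASE_PROJECT_ID": "y"})
print(missing)  # -> ['PRIME_API_KEY']
```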

Further Resources

  • Prime Intellect Evaluating Docs: full documentation on Prime's evaluation workflow
  • Prime verifiers Environments: source code and docs for verifiers environments
  • Browserbase Getting Started: core Browserbase documentation
  • RL Training Guide: wire BrowserEnv into Prime RL training workflows