Browser Agent
The Browser Agent is an AI-powered automation system that observes screenshots, makes decisions, and takes actions in your browser.
Overview
The Browser Agent uses vision-capable LLMs (like GPT-4o, Gemini, or Claude) to understand what's on your screen and interact with web pages intelligently.
How It Works
┌─────────────────┐
│    User Goal    │  "Find flights to Paris"
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Take Screenshot │  Capture current page state
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   AI Analysis   │  Vision LLM understands the page
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Decide Action  │  Click, type, navigate, scroll
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Execute via CDP │  Chrome DevTools Protocol
└────────┬────────┘
         │
         ▼
 Loop until goal complete
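The loop in the diagram can be sketched in Python as follows. This is a minimal illustration, not PraisonAI's actual implementation: the `Action` class, `to_cdp_commands`, `run_agent`, and the `max_steps` limit are all hypothetical names introduced here. The only parts taken from real specifications are the Chrome DevTools Protocol method names (`Input.dispatchMouseEvent`, `Input.insertText`, `Page.navigate`). The screenshot, model, and transport functions are passed in as callables so the sketch runs without a browser.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Action:
    """One decision from the model. All field names are hypothetical."""
    kind: str  # "click", "type", "navigate", "scroll", or "done"
    x: int = 0
    y: int = 0
    text: str = ""
    url: str = ""
    delta_y: int = 0


def to_cdp_commands(action: Action) -> List[Dict]:
    """Translate an action into CDP method calls (message ids omitted)."""
    if action.kind == "click":
        point = {"x": action.x, "y": action.y, "button": "left", "clickCount": 1}
        return [
            {"method": "Input.dispatchMouseEvent",
             "params": {"type": "mousePressed", **point}},
            {"method": "Input.dispatchMouseEvent",
             "params": {"type": "mouseReleased", **point}},
        ]
    if action.kind == "type":
        return [{"method": "Input.insertText", "params": {"text": action.text}}]
    if action.kind == "navigate":
        return [{"method": "Page.navigate", "params": {"url": action.url}}]
    if action.kind == "scroll":
        return [{"method": "Input.dispatchMouseEvent",
                 "params": {"type": "mouseWheel", "x": action.x, "y": action.y,
                            "deltaX": 0, "deltaY": action.delta_y}}]
    raise ValueError(f"unsupported action kind: {action.kind!r}")


def run_agent(
    goal: str,
    take_screenshot: Callable[[], bytes],
    decide: Callable[[str, bytes], Action],
    send: Callable[[Dict], None],
    max_steps: int = 20,
) -> int:
    """Repeat screenshot -> decide -> execute; return the number of actions executed."""
    for executed in range(max_steps):
        screenshot = take_screenshot()      # capture current page state
        action = decide(goal, screenshot)   # a vision LLM call in a real agent
        if action.kind == "done":           # the model reports the goal is complete
            return executed
        for command in to_cdp_commands(action):
            send(command)                   # forward each command over CDP
    raise RuntimeError(f"goal not reached within {max_steps} steps")
```

Capping the loop with `max_steps` is a design choice of this sketch: it keeps a confused model from clicking indefinitely when the goal is never reached.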
Usage
Via CLI (Recommended)
# Install praisonai
pip install praisonai
# Set your API key
export OPENAI_API_KEY="your-key"
# Launch browser automation
praisonai browser launch "Go to google.com and search for AI"
praisonai browser launch "Find flights to Paris" --model gpt-4o
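If you want to start the same command from a Python script, one possible approach is to build the argument list and hand it to `subprocess`. The helper name `build_launch_command` is hypothetical; it uses only the `launch` subcommand and the `--model` flag shown above, and assumes `praisonai` is on your `PATH`.

```python
from typing import List, Optional


def build_launch_command(goal: str, model: Optional[str] = None) -> List[str]:
    """Build the argv for `praisonai browser launch` (hypothetical helper)."""
    command = ["praisonai", "browser", "launch", goal]
    if model is not None:
        # --model is the only option taken from the CLI examples above.
        command += ["--model", model]
    return command


# To run it for real (requires praisonai installed and an API key set):
#   import subprocess
#   subprocess.run(build_launch_command("Find flights to Paris", "gpt-4o"), check=True)
```

Passing the goal as a single list element avoids shell quoting problems when the goal contains spaces or quotes.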
Via Side Panel
- Start the bridge server:
  praisonai browser start
- Click the extension icon to open Side Panel
- Enter a task goal
- Click "Start Agent"
Supported LLMs
| Provider | Model | Vision Support |
|---|---|---|
| OpenAI | gpt-4o | ✅ |
| OpenAI | gpt-4o-mini | ✅ |
| Google | gemini-1.5-pro | ✅ |
| Google | gemini-1.5-flash | ✅ |
| Anthropic | claude-3-opus | ✅ |
| Anthropic | claude-3-sonnet | ✅ |
Note: Gemini Nano (Chrome's built-in AI) is text-only and cannot process screenshots, so it's disabled for Agent mode.
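A script that accepts a model name could check it against the table before starting the agent. The sketch below is only an illustration: `VISION_MODELS`, `TEXT_ONLY_MODELS`, and `require_vision_model` are hypothetical names, the `"gemini-nano"` identifier is assumed for the sake of the example, and the list simply mirrors the table above rather than anything the tool itself exposes.

```python
# Models from the table above, mapped to their provider.
VISION_MODELS = {
    "gpt-4o": "OpenAI",
    "gpt-4o-mini": "OpenAI",
    "gemini-1.5-pro": "Google",
    "gemini-1.5-flash": "Google",
    "claude-3-opus": "Anthropic",
    "claude-3-sonnet": "Anthropic",
}

# Assumed identifier for Chrome's built-in Gemini Nano, which is text-only.
TEXT_ONLY_MODELS = {"gemini-nano"}


def require_vision_model(model: str) -> str:
    """Return the normalized model name, or raise if it cannot process screenshots."""
    name = model.strip().lower()
    if name in VISION_MODELS:
        return name
    if name in TEXT_ONLY_MODELS:
        raise ValueError(f"{name} is text-only and cannot process screenshots")
    raise ValueError(f"{name} is not in the list of supported vision models")
```

For example, `require_vision_model("GPT-4o")` returns `"gpt-4o"`, while `require_vision_model("gemini-nano")` raises a `ValueError`.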
Best Practices
- Be Specific: Clear goals lead to better results
- Use Vision Models: Only vision-capable LLMs work with the agent
- Start Simple: Begin with straightforward tasks before complex workflows