🌐 Browser Agent

AI-powered browser automation with Vision and LangGraph.

An intelligent agent that can see, understand, and interact with web pages like a human. Built on LangGraph for robust state management and Ollama for flexible local or cloud-based vision models.

✨ Features

| Feature | Description |
| --- | --- |
| 🧠 LangGraph Orchestration | Robust state-machine logic for reliable task execution |
| 🔍 Vision-First Interaction | Analyzes real-time screenshots to understand page state |
| 🖱️ Coordinate Precision | Uses (x,y) coordinates for interactions, avoiding fragile selectors |
| 💾 Dual Memory System | Short-term context (session) + Long-term persisted memory |
| ⌨️ Full Keyboard/Mouse | Types, scrolls, and presses keys (Enter, Tab, etc.) |
| 🔄 Stateful Loops | Automatically captures screen after every action for verification |
| 💬 Professional CLI | Clean, interactive terminal interface with progress feedback |

🚀 Quick Start

Prerequisites

  • Python 3.12+
  • uv package manager
  • Ollama installed and running

Installation

# Clone the repo
git clone https://github.com/RaheesAhmed/browseragent.git
cd browseragent

# Install dependencies
uv sync

# Install Playwright browsers
uv run playwright install

# Pull recommended vision model
ollama pull qwen2.5vl  # Or llama3.2-vision

Run

uv run python main.py

💡 Usage Examples

"Go to google.com and search for 'LangGraph documentation'"
"Login to github.com and check my recent notifications"
"Visit Wikipedia and tell me about the history of Artificial Intelligence"

🧠 Architecture

The agent uses a StateGraph to manage the execution loop, ensuring it always "sees" the browser state before making its next decision.

graph TD
    START -->|Initialize| Capture[Capture Screen]
    Capture -->|Vision Input| Agent[LLM Agent]
    Agent -->|Tool Calls| Tools[Execution Node]
    Tools -->|Result| Capture
    Agent -->|Finish| END
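
The loop in the diagram can be sketched without any dependencies. The snippet below is not the LangGraph API and not the code in src/agent.py; it is a plain-Python illustration of the same control flow, and `AgentState`, `run_loop`, and the three callables are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class AgentState:
    """Hypothetical state carried around the loop (not the repo's actual schema)."""
    task: str
    screenshot: Optional[bytes] = None
    history: list = field(default_factory=list)
    done: bool = False


def run_loop(
    state: AgentState,
    capture: Callable[[], bytes],
    decide: Callable[[AgentState], Optional[dict]],
    execute: Callable[[dict], str],
    max_steps: int = 20,
) -> AgentState:
    """Capture -> Agent -> Tools -> Capture, until the agent finishes."""
    for _ in range(max_steps):
        state.screenshot = capture()      # "Capture Screen" node
        action = decide(state)            # "LLM Agent" node sees the fresh screenshot
        if action is None:                # "Finish" edge to END
            state.done = True
            break
        result = execute(action)          # "Execution Node" runs the tool call
        state.history.append((action, result))
    return state
```

Because `capture` runs at the top of every iteration, the decision step never acts on a stale screenshot. The `max_steps` guard is an added assumption that keeps a confused model from looping forever; if it is exhausted, `done` stays `False`.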

⚙️ Configuration

Model Selection

Edit src/config.py to change the model used by Ollama. You can switch between local models (e.g., qwen2.5vl) and cloud-based models.

MODEL = "minimax-m2.5:cloud"  # Edit this to change the model!
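
If editing the file for every switch is inconvenient, one possible variation is to let an environment variable override the default. This is only a sketch: the variable name `BROWSER_AGENT_MODEL` and the helper `resolve_model` are hypothetical and are not part of the repository.

```python
import os

# Default taken from the line above; the environment variable name is hypothetical.
DEFAULT_MODEL = "minimax-m2.5:cloud"


def resolve_model(env=None):
    """Return the model tag, preferring BROWSER_AGENT_MODEL when it is set and non-empty."""
    env = os.environ if env is None else env
    return env.get("BROWSER_AGENT_MODEL", "").strip() or DEFAULT_MODEL


MODEL = resolve_model()
```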

Advanced Settings

  • Browser State: Managed in src/browser_manager.py (Default: 1280x800 viewport).
  • Tools: Extendable list in src/tools.py.

🎮 CLI Controls

| Command | Action |
| --- | --- |
| exit / quit | Close the agent |
| Ctrl+C | Force terminate session |
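
A minimal sketch of how an input loop might honor these controls is shown below. It is an illustration under assumptions, not the code in main.py: `run_cli`, `EXIT_COMMANDS`, and the `handle_task` callback are hypothetical names.

```python
from typing import Callable, Iterable

EXIT_COMMANDS = {"exit", "quit"}


def run_cli(lines: Iterable[str], handle_task: Callable[[str], None]) -> str:
    """Feed each input line to the agent until an exit command or Ctrl+C.

    Returns "exit", "interrupt", or "eof" to say why the loop ended.
    """
    try:
        for raw in lines:
            text = raw.strip()
            if not text:
                continue                  # ignore empty input
            if text.lower() in EXIT_COMMANDS:
                return "exit"             # clean shutdown requested by the user
            handle_task(text)
    except KeyboardInterrupt:
        return "interrupt"                # Ctrl+C ends the session
    return "eof"
```

Taking the input as an iterable keeps the loop easy to test; in an interactive session `lines` could be a generator that yields `input("> ")` repeatedly.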

📁 Project Structure

browseragent/
├── main.py              # CLI entry point (Run agent)
├── src/
│   ├── agent.py         # LangGraph state machine & reasoning
│   ├── browser_manager.py # Playwright low-level control
│   ├── tools.py         # External tools (Navigation, Memory)
│   └── config.py        # Global settings (Model selection)
├── memory.json          # Persistent long-term memory store
└── .env                 # Environment variables
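
The layout above lists memory.json as the long-term memory store. A JSON-backed key/value store could look roughly like the sketch below; the `MemoryStore` class and its method names are hypothetical, and the actual file format used by src/tools.py may differ.

```python
import json
from pathlib import Path


class MemoryStore:
    """Hypothetical key/value memory persisted to a JSON file."""

    def __init__(self, path="memory.json"):
        self.path = Path(path)

    def _load(self) -> dict:
        # A missing or empty file is treated as an empty memory.
        if not self.path.exists():
            return {}
        return json.loads(self.path.read_text(encoding="utf-8") or "{}")

    def save(self, key: str, value) -> None:
        data = self._load()
        data[key] = value
        self.path.write_text(json.dumps(data, indent=2), encoding="utf-8")

    def get(self, key: str, default=None):
        return self._load().get(key, default)
```

Reading the file on every call is slower than caching, but it means a value saved in one session is visible to the next without any extra synchronization.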

🛠️ Browser Actions

| Action | Description |
| --- | --- |
| navigate | Go to a specific URL |
| click_at_location | Click at (x, y) coordinates |
| type_text | Type text and optionally press Enter |
| press_key | Single key press (Tab, Escape, etc.) |
| scroll_page | Scroll up or down |
| save/get_memory | Interact with persistent memory |
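
As a rough illustration of how the browser actions could map onto Playwright, the sketch below uses calls from Playwright's sync `Page` API (`page.goto`, `page.mouse.click`, `page.keyboard.type`, `page.keyboard.press`, `page.mouse.wheel`). The function signatures, default values, and return strings are assumptions for this example; the real implementations in src/tools.py may differ.

```python
def navigate(page, url: str) -> str:
    """Go to a specific URL."""
    page.goto(url)
    return f"navigated to {url}"


def click_at_location(page, x: int, y: int) -> str:
    """Click at viewport coordinates instead of a CSS selector."""
    page.mouse.click(x, y)
    return f"clicked at ({x}, {y})"


def type_text(page, text: str, press_enter: bool = False) -> str:
    """Type into the focused element, optionally submitting with Enter."""
    page.keyboard.type(text)
    if press_enter:
        page.keyboard.press("Enter")
    return f"typed {len(text)} characters"


def press_key(page, key: str) -> str:
    """Press a single key such as Tab or Escape."""
    page.keyboard.press(key)
    return f"pressed {key}"


def scroll_page(page, direction: str = "down", amount: int = 600) -> str:
    """Scroll vertically; the default amount is an arbitrary assumption."""
    if direction not in ("up", "down"):
        raise ValueError("direction must be 'up' or 'down'")
    page.mouse.wheel(0, amount if direction == "down" else -amount)
    return f"scrolled {direction} by {amount}px"
```

Each function takes the page as its first argument, so the same code can be exercised against a stub object in tests and against a real Playwright page at runtime.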

📝 License

MIT


Built with ❤️ using LangGraph, Playwright, and Ollama
