AI-powered browser automation with Vision and LangGraph.
An intelligent agent that can see, understand, and interact with web pages like a human. Built on LangGraph for robust state management and Ollama for flexible local or cloud-based vision models.
| Feature | Description |
|---|---|
| 🧠 LangGraph Orchestration | Robust state-machine logic for reliable task execution |
| 🔍 Vision-First Interaction | Analyzes real-time screenshots to understand page state |
| 🖱️ Coordinate Precision | Uses (x,y) coordinates for interactions, avoiding fragile selectors |
| 💾 Dual Memory System | Short-term context (session) + Long-term persisted memory |
| ⌨️ Full Keyboard/Mouse | Types, scrolls, and presses keys (Enter, Tab, etc.) |
| 🔄 Stateful Loops | Automatically captures screen after every action for verification |
| 💬 Professional CLI | Clean, interactive terminal interface with progress feedback |
# Clone the repo
git clone https://github.com/RaheesAhmed/browseragent.git
cd browseragent
# Install dependencies
uv sync
# Install Playwright browsers
uv run playwright install
# Pull recommended vision model
ollama pull qwen2.5-vl # Or llama3.2-vision

Run the agent:

uv run python main.py

Example tasks:

❯ "Go to google.com and search for 'LangGraph documentation'"
❯ "Login to github.com and check my recent notifications"
❯ "Visit Wikipedia and tell me about the history of Artificial Intelligence"

The agent uses a StateGraph to manage the execution loop, ensuring it always "sees" the browser state before making its next decision.
graph TD
START -->|Initialize| Capture[Capture Screen]
Capture -->|Vision Input| Agent[LLM Agent]
Agent -->|Tool Calls| Tools[Execution Node]
Tools -->|Result| Capture
Agent -->|Finish| END
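The control flow in the graph above can be sketched in plain Python. This is an illustration of the capture, decide, execute cycle only, not the project's actual LangGraph code: `AgentState`, `run_loop`, the stub functions, the `direction` argument, and the `max_steps` safety limit are all hypothetical names introduced here.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AgentState:
    """Hypothetical state carried around the loop."""
    task: str
    screenshot: Optional[str] = None          # stand-in for real image data
    history: list = field(default_factory=list)
    done: bool = False


def run_loop(state, capture, decide, execute, max_steps=20):
    """Repeat Capture -> Agent -> Tools until the agent returns no tool call."""
    for _ in range(max_steps):                # assumed guard against endless loops
        state.screenshot = capture()          # "Capture Screen" node
        tool_call = decide(state)             # "LLM Agent" node
        if tool_call is None:                 # Agent -->|Finish| END
            state.done = True
            break
        result = execute(tool_call)           # "Execution Node"
        state.history.append((tool_call, result))
    return state


# Stubs standing in for the browser and the vision model.
def fake_capture():
    return "screenshot-bytes"


def fake_decide(state):
    # Pretend the model finishes after two tool calls.
    if len(state.history) >= 2:
        return None
    return {"name": "scroll_page", "args": {"direction": "down"}}


def fake_execute(call):
    return f"ran {call['name']}"


final = run_loop(AgentState(task="demo"), fake_capture, fake_decide, fake_execute)
```

Because the screenshot is refreshed at the top of every iteration, the decision step never acts on a stale view of the page, which is the property the graph is meant to guarantee.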
Edit src/config.py to change the model Ollama uses. You can switch between local models (e.g., qwen2.5-vl) and cloud-based models.
MODEL = "minimax-m2.5:cloud" # Edit this to change the model!

- Browser State: Managed in `src/browser_manager.py` (Default: 1280x800 viewport).
- Tools: Extendable list in `src/tools.py`.
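Since interactions are coordinate-based and the default viewport is 1280x800, a click point proposed by the model may occasionally fall outside the page. One way to guard against that, shown here as a minimal sketch, is to clamp the point before clicking. `clamp_to_viewport` and the two constants are hypothetical and are not taken from the project's source.

```python
# Assumed to mirror the default viewport mentioned above.
VIEWPORT_WIDTH = 1280
VIEWPORT_HEIGHT = 800


def clamp_to_viewport(x, y, width=VIEWPORT_WIDTH, height=VIEWPORT_HEIGHT):
    """Clamp a proposed click point to valid pixel coordinates (hypothetical helper)."""
    cx = min(max(int(x), 0), width - 1)
    cy = min(max(int(y), 0), height - 1)
    return cx, cy


inside = clamp_to_viewport(640, 400)      # already valid, returned unchanged
outside = clamp_to_viewport(1500, -20)    # pulled back to the nearest edge
```

Clamping silently moves the click to the nearest edge. Rejecting the point and asking the model for a new one is an equally reasonable design, depending on how the agent should recover.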
| Command | Action |
|---|---|
| `exit` / `quit` | Close the agent |
| `Ctrl+C` | Force terminate session |
browseragent/
├── main.py # CLI entry point (Run agent)
├── src/
│ ├── agent.py # LangGraph state machine & reasoning
│ ├── browser_manager.py # Playwright low-level control
│ ├── tools.py # External tools (Navigation, Memory)
│ └── config.py # Global settings (Model selection)
├── memory.json # Persistent long-term memory store
└── .env # Environment variables
| Action | Description |
|---|---|
| `navigate` | Go to a specific URL |
| `click_at_location` | Click at (x, y) coordinates |
| `type_text` | Type text and optionally press Enter |
| `press_key` | Single key press (Tab, Escape, etc.) |
| `scroll_page` | Scroll up or down |
| `save/get_memory` | Interact with persistent memory |
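To illustrate what the persistent memory actions might do, here is a minimal sketch of a JSON-backed key/value store. It assumes `memory.json` holds a flat JSON object. The real file layout and the signatures in `src/tools.py` may differ, so `load_memory`, `save_memory`, and `get_memory` below are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def load_memory(path):
    """Return the stored dict, or an empty dict if the file does not exist yet."""
    p = Path(path)
    if not p.exists():
        return {}
    return json.loads(p.read_text(encoding="utf-8"))


def save_memory(path, key, value):
    """Insert or overwrite one key and write the whole store back to disk."""
    data = load_memory(path)
    data[key] = value
    Path(path).write_text(json.dumps(data, indent=2), encoding="utf-8")


def get_memory(path, key, default=None):
    """Look up one key, falling back to `default` when it is absent."""
    return load_memory(path).get(key, default)


# Demo against a temporary file so the real memory.json is left untouched.
with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp) / "memory.json"
    save_memory(store, "preferred_search_engine", "google.com")
    recalled = get_memory(store, "preferred_search_engine")
    missing = get_memory(store, "unknown_key", default="(none)")
```

Rewriting the whole file on every save keeps the sketch simple. A store that grows large or is written concurrently would need something more careful.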
MIT
Built with ❤️ using LangGraph, Playwright, and Ollama
