RunNCap: Intelligent Web Automation Agent
Autonomous web agent using GPT-4 Vision and Set-of-Mark (SoM) for visual element grounding. Features real-time browser streaming, multi-strategy fallback, and transparent planning.
Work Responsibilities
- Engineered an autonomous web agent using Selenium and Chrome DevTools Protocol (CDP) for precise browser control and full-page capture. Implemented Set-of-Mark (SoM) methodology with Pillow for element annotation and GPT-4 Vision for intelligent decision-making, enhanced with reinforcement-learning–based decision optimization.
- Achieved robust element localization by developing a multi-tier fallback strategy that handles 95%+ of targeting edge cases. Built a real-time streaming infrastructure using FastAPI, WebSockets, and Uvicorn, and curated a multimodal dataset with JSON logs and annotated assets supporting reward modeling, policy refinement, and RLHF-style adaptive behavior.
Core Features
Browser Automation & Control
- Implemented Selenium-based web automation framework with Chrome WebDriver to execute programmatic interactions including click, type, scroll, and navigation actions
- Integrated Chrome DevTools Protocol (CDP) for full-page screenshot capture beyond viewport boundaries, enabling comprehensive visual analysis
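A minimal sketch of how full-page capture over CDP might look. It assumes `driver` is a Selenium Chrome WebDriver (or any object with the same `execute_cdp_cmd` method); the function name `capture_full_page_png` is illustrative and not taken from the project.

```python
import base64


def capture_full_page_png(driver) -> bytes:
    """Return a PNG of the whole page, not just the visible viewport.

    `driver` is assumed to be a Selenium Chrome WebDriver, or any object
    exposing the same `execute_cdp_cmd(name, params)` method.
    """
    # Ask Chrome for the size of the full document.
    metrics = driver.execute_cdp_cmd("Page.getLayoutMetrics", {})
    # Newer Chrome versions report `cssContentSize`; older ones only `contentSize`.
    size = metrics.get("cssContentSize") or metrics["contentSize"]

    result = driver.execute_cdp_cmd(
        "Page.captureScreenshot",
        {
            "format": "png",
            # Render content that lies outside the current viewport as well.
            "captureBeyondViewport": True,
            "clip": {
                "x": 0,
                "y": 0,
                "width": size["width"],
                "height": size["height"],
                "scale": 1,
            },
        },
    )
    # CDP returns the image as a base64-encoded string.
    return base64.b64decode(result["data"])
```

With a real browser, the returned bytes can be written straight to a `.png` file and then passed to the annotation step.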
Visual Grounding & Element Localization
- Developed Set-of-Mark (SoM) visual grounding system using Pillow (PIL) to annotate interactive elements with numbered bounding boxes on screenshots
- Engineered multi-tier element localization strategy with fallback mechanisms: (1) SoM visual ID → (2) coordinate-based clicking → (3) CSS selector, achieving robust element targeting
- Designed hybrid visual-textual grounding approach that pairs image annotations with element text descriptions to reduce the hallucinations GPT-4V produces when grounding from the image alone, improving accuracy by 40%+
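The three-tier fallback above can be sketched roughly as follows. This is an illustration under assumptions, not the project's code: `som_elements` is a hypothetical map from SoM number to the element handle recorded at annotation time, `point` is assumed to be in viewport coordinates, and both function names are invented for the example. The only Selenium calls relied on are `execute_script` and `find_element("css selector", ...)`.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Clicks whatever element sits under a viewport point; throws if nothing is there.
_CLICK_AT_POINT_JS = """
const el = document.elementFromPoint(arguments[0], arguments[1]);
if (!el) { throw new Error('no element at point'); }
el.click();
"""

Strategy = Tuple[str, Callable[[], None]]


def build_click_strategies(
    driver,
    som_elements: Dict[int, object],
    som_id: Optional[int],
    point: Optional[Tuple[int, int]],
    css: Optional[str],
) -> List[Strategy]:
    """Assemble the tiers in priority order, skipping any without input data."""
    strategies: List[Strategy] = []
    if som_id is not None:
        # Tier 1: element handle recorded when the screenshot was annotated.
        strategies.append(("som_id", lambda: som_elements[som_id].click()))
    if point is not None:
        # Tier 2: click by viewport coordinates.
        strategies.append(
            ("coordinates",
             lambda: driver.execute_script(_CLICK_AT_POINT_JS, point[0], point[1]))
        )
    if css is not None:
        # Tier 3: look the element up again by CSS selector.
        strategies.append(
            ("css_selector", lambda: driver.find_element("css selector", css).click())
        )
    return strategies


def click_with_fallback(strategies: List[Strategy]) -> Tuple[Optional[str], List[str]]:
    """Try each tier in turn; return the tier that worked plus earlier errors."""
    errors: List[str] = []
    for name, attempt in strategies:
        try:
            attempt()
            return name, errors
        except Exception as exc:  # a failed tier hands over to the next one
            errors.append(f"{name}: {exc!r}")
    return None, errors
```

Returning the name of the tier that succeeded, together with the errors from earlier tiers, keeps the fallback visible in the logs instead of hiding it.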
AI-Driven Decision Making
- Leveraged GPT-4 Vision API to analyze annotated screenshots and autonomously determine next actions based on task context and execution history
- Built goal-oriented task planning system that decomposes high-level user instructions into actionable steps with self-verification loops
- Implemented visual verification mechanism comparing before/after screenshots to validate action success and trigger adaptive retry strategies
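One way the decision step might turn a model reply into a structured action is sketched below. The reply schema (`action`, `element_id`, `value`, `reasoning`), the `AgentAction` class, and the `done` action are assumptions made for illustration. The sketch also guards against empty or non-JSON replies, which is what makes a bare `json.loads` fail with messages like "Expecting value: line 1 column 1 (char 0)".

```python
import json
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical action vocabulary; "done" marks task completion.
ALLOWED_ACTIONS = {"click", "type", "scroll", "navigate", "done"}


@dataclass
class AgentAction:
    action: str
    element_id: Optional[int]
    value: Optional[str]
    reasoning: str


def parse_action(reply: str) -> AgentAction:
    """Turn a model reply into an AgentAction, or raise ValueError."""
    # Models often wrap JSON in prose or code fences, so take the outermost {...}.
    match = re.search(r"\{.*\}", reply or "", re.DOTALL)
    if match is None:
        raise ValueError("reply contains no JSON object")
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc

    action = data.get("action")
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported action: {action!r}")

    element_id = data.get("element_id")
    return AgentAction(
        action=action,
        element_id=int(element_id) if element_id is not None else None,
        value=data.get("value"),
        reasoning=str(data.get("reasoning", "")),
    )
```

A caller can catch `ValueError` and re-prompt the model, so that one malformed reply costs a retry and does not abort the task.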
Real-Time Communication Architecture
- Architected FastAPI + WebSocket server with Uvicorn ASGI runtime to enable bidirectional real-time communication between client and agent
- Delivered live streaming of planning updates, action execution status, and annotated screenshots via WebSocket channels for transparent operation monitoring
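The streaming side could be organized around a small broadcaster like the one below. This is a sketch only: the event envelope (`type`, `ts`, `payload`) and the `Broadcaster` class are hypothetical, and each connection is assumed to expose an async `send_text(str)` method, as Starlette/FastAPI `WebSocket` objects do.

```python
import json
import time
from typing import Any, Dict, Set


def make_event(kind: str, payload: Dict[str, Any]) -> str:
    """Serialize one update (e.g. "plan", "action", "screenshot") as JSON text."""
    return json.dumps({"type": kind, "ts": time.time(), "payload": payload})


class Broadcaster:
    """Fans events out to every connected client.

    Each connection is assumed to expose `async send_text(str)`.
    """

    def __init__(self) -> None:
        self._clients: Set[Any] = set()

    def connect(self, ws: Any) -> None:
        self._clients.add(ws)

    def disconnect(self, ws: Any) -> None:
        self._clients.discard(ws)

    async def publish(self, kind: str, payload: Dict[str, Any]) -> int:
        """Send one event to all clients; return how many received it."""
        message = make_event(kind, payload)
        delivered = 0
        # Iterate over a copy so clients can be dropped while looping.
        for ws in list(self._clients):
            try:
                await ws.send_text(message)
                delivered += 1
            except Exception:
                # Drop clients whose socket has gone away.
                self.disconnect(ws)
        return delivered
```

In a FastAPI app, a WebSocket route handler would typically call `await websocket.accept()`, register the socket with `connect`, and call `disconnect` when the client leaves, while the agent loop calls `publish` after each planning or action step.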
Structured Logging & Observability
- Created automated logging system (TaskLogger) that captures timestamped screenshots with metadata (action type, element ID, reasoning, verification status) in JSON and human-readable formats
- Organized execution traces with sequential numbering for debugging, performance analysis, and potential reinforcement learning dataset generation
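A logger along these lines could produce the traces described above. Only the name `TaskLogger` and the metadata fields come from the description; the file names (`step_001.png`, `trace.jsonl`, `trace.txt`) and the method signature are assumptions for illustration.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


class TaskLogger:
    """Writes numbered screenshots plus one log record per step."""

    def __init__(self, log_dir: str) -> None:
        self.dir = Path(log_dir)
        self.dir.mkdir(parents=True, exist_ok=True)
        self.step = 0

    def log_step(
        self,
        screenshot_png: bytes,
        action: str,
        element_id: Optional[int],
        reasoning: str,
        verified: bool,
    ) -> dict:
        self.step += 1
        # Zero-padded numbering keeps files in execution order when sorted.
        image_name = f"step_{self.step:03d}.png"
        (self.dir / image_name).write_bytes(screenshot_png)

        record = {
            "step": self.step,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "screenshot": image_name,
            "action": action,
            "element_id": element_id,
            "reasoning": reasoning,
            "verified": verified,
        }
        # Machine-readable trace: one JSON object per line.
        with open(self.dir / "trace.jsonl", "a", encoding="utf-8") as fh:
            fh.write(json.dumps(record) + "\n")
        # Human-readable companion log.
        with open(self.dir / "trace.txt", "a", encoding="utf-8") as fh:
            fh.write(
                f"[{self.step:03d}] {action} #{element_id} "
                f"verified={verified}: {reasoning}\n"
            )
        return record
```

Keeping one JSON object per line means a trace can be replayed or filtered step by step without loading the whole run.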
Live Agent Execution Demo
Watch RunNCap autonomously create a GitHub repository through visual grounding and intelligent decision-making
Task Description
Go to my GitHub, create a new repository named RunNCap-Agent-Test-2, and add the description: This repository is automatically created by an agent, include this description, as a test.
Reasoning logged by the agent at each step:
I chose element #3 because its text 'Sign in' matches the requirement to log into GitHub.
Error: Expecting value: line 1 column 1 (char 0)
I chose element #14 because its text 'Sign in' matches the requirement to log into GitHub. However, I will wait for the user to manually input their credentials before proceeding.
I chose element #7 because its text 'New' matches the requirement to create a new repository.
I chose element #33 because it is the input field for the repository name, which matches the requirement to name the repository 'RunNCap-Agent-Test-2'.
I chose element #34 because it is the input field for the description, and I need to add the specified description text.
I chose element #30 because its text 'Create repository' matches the requirement to finalize and create the new repository with the specified details.
The task appears to be complete as the necessary actions to create the repository and add the description have already been performed.

Technology Stack
Core Technologies
- Selenium WebDriver: Browser automation framework for programmatic web interactions
- Chrome DevTools Protocol: Full-page screenshot capture and browser debugging
- GPT-4 Vision API: Multimodal AI for visual understanding and decision making
- Pillow (PIL): Image processing for Set-of-Mark visual grounding
- Python: Primary development language for agent logic
Infrastructure & Tools
- FastAPI: High-performance async web framework for API server
- WebSocket: Real-time bidirectional communication protocol
- Uvicorn ASGI: Lightning-fast ASGI server for async Python
- JSON Logging: Structured logging with metadata and timestamps