Independent Project

RunNCap: Intelligent Web Automation Agent

Autonomous web agent using GPT-4 Vision and Set-of-Mark (SoM) for visual element grounding. Features real-time browser streaming, multi-strategy fallback, and transparent planning.

Technologies

Selenium, GPT-4 Vision, FastAPI, WebSocket, Python, Reinforcement Learning

Work Responsibilities

Multimodal Web Agent Engineer

Engineered an autonomous web agent using Selenium and Chrome DevTools Protocol (CDP) for precise browser control and full-page capture. Implemented Set-of-Mark (SoM) methodology with Pillow for element annotation and GPT-4 Vision for intelligent decision-making, enhanced with reinforcement-learning–based decision optimization.
Achieved robust element localization by developing a multi-tier fallback strategy that handles 95%+ of targeting edge cases. Built a real-time streaming infrastructure using FastAPI, WebSockets, and Uvicorn, and curated a multimodal dataset with JSON logs and annotated assets supporting reward modeling, policy refinement, and RLHF-style adaptive behavior.
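The Set-of-Mark annotation step can be sketched with Pillow as follows. This is a minimal illustration, not the project's actual code: the function name `annotate_marks`, the `(left, top, right, bottom, text)` box format, and the text legend returned alongside the image are all assumptions made for the example.

```python
from PIL import Image, ImageDraw


def annotate_marks(screenshot, boxes, color=(255, 0, 0)):
    """Draw a numbered bounding box per element; return (image, legend).

    `boxes` is a list of (left, top, right, bottom, text) tuples in pixel
    coordinates. This input format is an assumption for the sketch.
    """
    # convert() returns a new image, so the original screenshot is untouched.
    annotated = screenshot.convert("RGB")
    draw = ImageDraw.Draw(annotated)
    legend = []
    for mark_id, (left, top, right, bottom, text) in enumerate(boxes, start=1):
        draw.rectangle((left, top, right, bottom), outline=color, width=2)
        # Pillow's default bitmap font keeps the sketch dependency-free.
        draw.text((left + 3, top + 3), str(mark_id), fill=color)
        legend.append(f"#{mark_id}: {text}")
    return annotated, "\n".join(legend)


if __name__ == "__main__":
    page = Image.new("RGB", (200, 100), "white")
    marked, legend = annotate_marks(page, [(10, 10, 80, 40, "Sign in")])
    print(legend)
```

The legend pairs each mark ID with the element's text, so the model can be given both the annotated image and a textual description of each numbered element.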

Core Features

Browser Automation & Control

• Implemented Selenium-based web automation framework with Chrome WebDriver to execute programmatic interactions including click, type, scroll, and navigation actions
• Integrated Chrome DevTools Protocol (CDP) for full-page screenshot capture beyond viewport boundaries, enabling comprehensive visual analysis
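One way the full-page capture could work is sketched below. It assumes a Selenium Chromium driver, which exposes `execute_cdp_cmd`, and uses the CDP methods `Page.getLayoutMetrics` and `Page.captureScreenshot`. The helper name `capture_full_page` is hypothetical, and the project's real implementation may differ.

```python
import base64


def capture_full_page(driver):
    """Return PNG bytes covering the whole page, not just the viewport.

    `driver` is assumed to be a Selenium Chromium WebDriver (or any object
    with a compatible `execute_cdp_cmd(cmd, params)` method).
    """
    metrics = driver.execute_cdp_cmd("Page.getLayoutMetrics", {})
    # Newer Chrome builds report cssContentSize; older ones only contentSize.
    size = metrics.get("cssContentSize") or metrics["contentSize"]
    result = driver.execute_cdp_cmd(
        "Page.captureScreenshot",
        {
            "format": "png",
            "captureBeyondViewport": True,
            "clip": {
                "x": 0,
                "y": 0,
                "width": size["width"],
                "height": size["height"],
                "scale": 1,
            },
        },
    )
    # CDP returns the image as a base64-encoded string.
    return base64.b64decode(result["data"])
```

With a live browser this would be called as `png = capture_full_page(webdriver.Chrome())` after navigating to the target page.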

Visual Grounding & Element Localization

• Developed Set-of-Mark (SoM) visual grounding system using Pillow (PIL) to annotate interactive elements with numbered bounding boxes on screenshots
• Engineered multi-tier element localization strategy with fallback mechanisms: (1) SoM visual ID → (2) coordinate-based clicking → (3) CSS selector, achieving robust element targeting
• Designed hybrid visual-textual grounding approach that pairs image annotations with element text descriptions, mitigating the hallucinations GPT-4V produces when it relies on the image alone and improving accuracy by 40%+
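The three-tier fallback listed above can be pictured as an ordered chain of locator strategies, where each tier runs only if the previous one fails. The sketch below is framework-agnostic. The `Target` fields, the strategy names, and `click_with_fallback` are illustrative assumptions, and real strategies would wrap Selenium calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple


@dataclass
class Target:
    """What the agent knows about the element it wants to act on."""
    mark_id: Optional[int] = None            # SoM number from the screenshot
    point: Optional[Tuple[int, int]] = None  # fallback pixel coordinates
    css: Optional[str] = None                # last-resort CSS selector


Strategy = Callable[[Target], bool]


def click_with_fallback(
    target: Target, strategies: Sequence[Tuple[str, Strategy]]
) -> Tuple[str, List[str]]:
    """Try each (name, strategy) in order.

    Returns the name of the first strategy that succeeds plus the errors
    collected from earlier tiers; raises RuntimeError if every tier fails.
    """
    errors: List[str] = []
    for name, strategy in strategies:
        try:
            if strategy(target):
                return name, errors
            errors.append(f"{name}: returned False")
        except Exception as exc:  # a failing tier must not stop the chain
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all strategies failed: " + "; ".join(errors))
```

Returning the collected errors along with the winning tier keeps a record of which tiers failed and why, which is useful when judging how often each fallback is actually needed.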

AI-Driven Decision Making

• Leveraged GPT-4 Vision API to analyze annotated screenshots and autonomously determine next actions based on task context and execution history
• Built goal-oriented task planning system that decomposes high-level user instructions into actionable steps with self-verification loops
• Implemented visual verification mechanism comparing before/after screenshots to validate action success and trigger adaptive retry strategies
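A simple form of the before/after verification could compare two screenshots pixel by pixel and retry when nothing changed. The sketch below uses Pillow's `ImageChops.difference`. The thresholds, the retry count, and the names `changed_fraction` and `act_and_verify` are assumptions for illustration, not the project's actual logic.

```python
from PIL import Image, ImageChops


def changed_fraction(before, after, threshold=16):
    """Fraction of pixels whose grayscale value differs by more than `threshold`."""
    if before.size != after.size:
        return 1.0  # a resized page is treated as fully changed
    diff = ImageChops.difference(before.convert("L"), after.convert("L"))
    histogram = diff.histogram()  # 256 bins for a mode "L" image
    changed = sum(histogram[threshold + 1:])
    return changed / (before.size[0] * before.size[1])


def act_and_verify(act, capture, min_change=0.001, max_attempts=3):
    """Run `act`, confirm the screen changed, and retry if it did not.

    `act` performs the browser action and `capture` returns a PIL image.
    Returns the attempt number that succeeded, or None if none did.
    """
    for attempt in range(1, max_attempts + 1):
        before = capture()
        act()
        after = capture()
        if changed_fraction(before, after) >= min_change:
            return attempt
    return None
```

A pixel diff only shows that the page changed, not that it changed in the intended way, so a check like this would likely sit underneath a model-based verification rather than replace it.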

Real-Time Communication Architecture

• Architected FastAPI + WebSocket server with Uvicorn ASGI runtime to enable bidirectional real-time communication between client and agent
• Delivered live streaming of planning updates, action execution status, and annotated screenshots via WebSocket channels for transparent operation monitoring
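The streaming side can be sketched as a small hub that fans each event out to every connected client. The example is deliberately framework-agnostic: it only assumes each connection has an async `send_text` method, which FastAPI's `WebSocket` provides. The `EventHub` class and the `{"type", "payload"}` message shape are hypothetical.

```python
import asyncio
import json


class EventHub:
    """Fan out agent events (plans, action status, screenshots) to clients."""

    def __init__(self):
        self._clients = set()

    def add(self, ws):
        self._clients.add(ws)

    def remove(self, ws):
        self._clients.discard(ws)

    async def publish(self, kind, payload):
        """Send one JSON event to all clients; return how many received it."""
        message = json.dumps({"type": kind, "payload": payload})
        clients = list(self._clients)
        results = await asyncio.gather(
            *(ws.send_text(message) for ws in clients),
            return_exceptions=True,
        )
        delivered = 0
        for ws, result in zip(clients, results):
            if isinstance(result, Exception):
                self._clients.discard(ws)  # drop connections that failed
            else:
                delivered += 1
        return delivered
```

In a FastAPI app, a WebSocket route handler would typically call `await websocket.accept()`, register the socket with `hub.add(websocket)`, and call `hub.remove(websocket)` on disconnect. Screenshots would need to be encoded (for example as base64) before going into the JSON payload.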

Structured Logging & Observability

• Created automated logging system (TaskLogger) that captures timestamped screenshots with metadata (action type, element ID, reasoning, verification status) in JSON and human-readable formats
• Organized execution traces with sequential numbering for debugging, performance analysis, and potential reinforcement learning dataset generation
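A logger along these lines might look like the sketch below. Only the class name `TaskLogger` and the metadata fields come from the description above. The JSON Lines file `trace.jsonl`, the `NNN_action` naming scheme, and the method signature are assumptions, and the human-readable output is omitted for brevity.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


class TaskLogger:
    """Write one JSON record (and optional screenshot) per agent step."""

    def __init__(self, log_dir):
        self.log_dir = Path(log_dir)
        self.log_dir.mkdir(parents=True, exist_ok=True)
        self.step = 0

    def log_step(self, action, element_id, reasoning, verified, screenshot_png=None):
        self.step += 1
        stem = f"{self.step:03d}_{action}"  # sequential numbering keeps traces ordered
        record = {
            "step": self.step,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "action": action,
            "element_id": element_id,
            "reasoning": reasoning,
            "verified": verified,
        }
        if screenshot_png is not None:
            (self.log_dir / f"{stem}.png").write_bytes(screenshot_png)
            record["screenshot"] = f"{stem}.png"
        # Append-only JSON Lines: one record per line, easy to replay later.
        with (self.log_dir / "trace.jsonl").open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(record) + "\n")
        return record
```

Because each line is a self-contained record that points to its screenshot, the same trace can be read back step by step for debugging or collected into a dataset later.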

Live Agent Execution Demo

Watch RunNCap autonomously create a GitHub repository through visual grounding and intelligent decision-making

Task Description

Go to my GitHub, create a new repository named RunNCap-Agent-Test-2, and add the description: This repository is automatically created by an agent, include this description, as a test.

Steps: 8

Duration: 94.9s

click: Element #3 (✓ Verified)

I chose element #3 because its text 'Sign in' matches the requirement to log into GitHub.

error

Error: Expecting value: line 1 column 1 (char 0)

wait: Element #14 (✓ Verified)

I chose element #14 because its text 'Sign in' matches the requirement to log into GitHub. However, I will wait for the user to manually input their credentials before proceeding.

click: Element #7 (✓ Verified)

I chose element #7 because its text 'New' matches the requirement to create a new repository.

type: Element #33 (✓ Verified)

I chose element #33 because it is the input field for the repository name, which matches the requirement to name the repository 'RunNCap-Agent-Test-2'.

type: Element #34 (✓ Verified)

I chose element #34 because it is the input field for the description, and I need to add the specified description text.

click: Element #30 (✓ Verified)

I chose element #30 because its text 'Create repository' matches the requirement to finalize and create the new repository with the specified details.

done

The task appears to be complete as the necessary actions to create the repository and add the description have already been performed.

Technology Stack

Core Technologies

Selenium WebDriver

Browser automation framework for programmatic web interactions

Chrome DevTools Protocol

Full-page screenshot capture and browser debugging

GPT-4 Vision API

Multimodal AI for visual understanding and decision making

Pillow (PIL)

Image processing for Set-of-Mark visual grounding

Python

Primary development language for agent logic

Infrastructure & Tools

FastAPI

High-performance async web framework for API server

WebSocket

Real-time bidirectional communication protocol

Uvicorn ASGI

Lightning-fast ASGI server for async Python

JSON Logging

Structured logging with metadata and timestamps