Independent Project

RunNCap: Intelligent Web Automation Agent

Autonomous web agent using GPT-4 Vision and Set-of-Mark (SoM) for visual element grounding. Features real-time browser streaming, multi-strategy fallback, and transparent planning.

Technologies

Selenium, GPT-4 Vision, FastAPI, WebSocket, Python, Reinforcement Learning

Work Responsibilities

Multimodal Web Agent Engineer
  • Engineered an autonomous web agent using Selenium and Chrome DevTools Protocol (CDP) for precise browser control and full-page capture. Implemented Set-of-Mark (SoM) methodology with Pillow for element annotation and GPT-4 Vision for intelligent decision-making, enhanced with reinforcement-learning–based decision optimization.
  • Achieved robust element localization by developing a multi-tier fallback strategy that handles 95%+ of targeting edge cases. Built a real-time streaming infrastructure using FastAPI, WebSockets, and Uvicorn, and curated a multimodal dataset with JSON logs and annotated assets supporting reward modeling, policy refinement, and RLHF-style adaptive behavior.

Core Features

Browser Automation & Control

  • Implemented Selenium-based web automation framework with Chrome WebDriver to execute programmatic interactions including click, type, scroll, and navigation actions
  • Integrated Chrome DevTools Protocol (CDP) for full-page screenshot capture beyond viewport boundaries, enabling comprehensive visual analysis
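A minimal sketch of how a full-page capture can be requested through CDP. The function name `capture_full_page` is hypothetical, and `driver` is assumed to be a Selenium Chrome WebDriver, of which only `execute_cdp_cmd` is used. The CDP methods `Page.getLayoutMetrics` and `Page.captureScreenshot` are real, but the project's actual capture code may differ.

```python
import base64


def capture_full_page(driver) -> bytes:
    """Return PNG bytes covering the whole page, not just the viewport.

    `driver` is assumed to be a Selenium Chrome WebDriver; only its
    `execute_cdp_cmd(name, params)` method is used here.
    """
    # Ask Chrome for the full content size of the page.
    metrics = driver.execute_cdp_cmd("Page.getLayoutMetrics", {})
    # Newer Chrome builds report `cssContentSize`; older ones only `contentSize`.
    size = metrics.get("cssContentSize") or metrics["contentSize"]

    result = driver.execute_cdp_cmd(
        "Page.captureScreenshot",
        {
            "format": "png",
            # Allow the capture to extend past the visible viewport.
            "captureBeyondViewport": True,
            "clip": {
                "x": 0,
                "y": 0,
                "width": size["width"],
                "height": size["height"],
                "scale": 1,
            },
        },
    )
    # CDP returns the image as a base64 string under "data".
    return base64.b64decode(result["data"])
```

The returned bytes can be written to disk or opened with Pillow for annotation.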

Visual Grounding & Element Localization

  • Developed Set-of-Mark (SoM) visual grounding system using Pillow (PIL) to annotate interactive elements with numbered bounding boxes on screenshots
  • Engineered multi-tier element localization strategy with fallback mechanisms: (1) SoM visual ID → (2) coordinate-based clicking → (3) CSS selector, achieving robust element targeting
  • Designed a hybrid visual-textual grounding approach that pairs image annotations with element text descriptions, mitigating the hallucinations GPT-4V produces when it relies on the image alone and improving accuracy by 40%+
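The three-tier fallback described above (SoM visual ID, then coordinates, then CSS selector) can be sketched as an ordered list of attempts. Everything named here (`click_with_fallback`, the three stub functions) is hypothetical and stands in for the project's real Selenium calls. The sketch shows only the control flow.

```python
from typing import Callable, Optional, Sequence, Tuple

# A strategy is a label plus a zero-argument callable returning True on success.
Strategy = Tuple[str, Callable[[], bool]]


def click_with_fallback(strategies: Sequence[Strategy]) -> Optional[str]:
    """Try each strategy in order; return the name of the first that succeeds."""
    for name, attempt in strategies:
        try:
            if attempt():
                return name
        except Exception:
            # A tier that raises (stale element, bad selector, ...) counts as a miss.
            continue
    return None


# Hypothetical stand-ins for the three tiers.
def by_som_id() -> bool:
    raise LookupError("mark #7 no longer maps to a live element")


def by_coordinates() -> bool:
    return False  # e.g. the click landed but hit nothing useful


def by_css_selector() -> bool:
    return True


used = click_with_fallback([
    ("som_id", by_som_id),
    ("coordinates", by_coordinates),
    ("css_selector", by_css_selector),
])
print(used)  # css_selector
```

Returning the name of the tier that succeeded makes it easy to record in the execution log which strategy was actually used.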

AI-Driven Decision Making

  • Leveraged GPT-4 Vision API to analyze annotated screenshots and autonomously determine next actions based on task context and execution history
  • Built goal-oriented task planning system that decomposes high-level user instructions into actionable steps with self-verification loops
  • Implemented visual verification mechanism comparing before/after screenshots to validate action success and trigger adaptive retry strategies
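One way to picture the before/after verification and retry loop is shown below. This is a sketch under simplifying assumptions: `act_and_verify` is a hypothetical name, and "the screenshot changed" is judged by hashing the raw image bytes. That test is much cruder than a pixel-level or model-based comparison and would also fire on unrelated changes such as animations.

```python
import hashlib
from typing import Callable


def act_and_verify(
    act: Callable[[], None],
    capture: Callable[[], bytes],
    max_attempts: int = 3,
) -> int:
    """Run `act` until the screenshot changes.

    Returns the attempt number that produced a change, or 0 if none did.
    """
    for attempt in range(1, max_attempts + 1):
        before = hashlib.sha256(capture()).hexdigest()
        act()
        after = hashlib.sha256(capture()).hexdigest()
        if before != after:
            return attempt
    return 0


# Hypothetical page that only reacts to the second click.
state = {"clicks": 0}


def click() -> None:
    state["clicks"] += 1


def screenshot() -> bytes:
    return b"after" if state["clicks"] >= 2 else b"before"


attempts_needed = act_and_verify(click, screenshot)
print(attempts_needed)  # 2
```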

Real-Time Communication Architecture

  • Architected FastAPI + WebSocket server with Uvicorn ASGI runtime to enable bidirectional real-time communication between client and agent
  • Delivered live streaming of planning updates, action execution status, and annotated screenshots via WebSocket channels for transparent operation monitoring
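The streaming side can be sketched as a small hub that fans each agent event out to every connected client. `EventHub` and the `{"type", "payload"}` message envelope are assumptions made for illustration. The only thing required of a client object is an async `send_text(str)` method, which FastAPI's `WebSocket` provides.

```python
import asyncio
import json
from typing import Any, Dict, Set


class EventHub:
    """Fan out agent events to every connected client."""

    def __init__(self) -> None:
        # Each client needs an async `send_text(str)` method.
        self.clients: Set[Any] = set()

    async def broadcast(self, kind: str, payload: Dict[str, Any]) -> int:
        """Send one event to all clients; return how many received it."""
        message = json.dumps({"type": kind, "payload": payload})
        delivered = 0
        for ws in list(self.clients):
            try:
                await ws.send_text(message)
                delivered += 1
            except Exception:
                # Drop clients whose connection has gone away.
                self.clients.discard(ws)
        return delivered
```

In a FastAPI app, a WebSocket endpoint would add each accepted connection to `hub.clients`, and the agent loop would call `await hub.broadcast("plan", {...})` or `await hub.broadcast("action", {...})` as it works. A screenshot would have to be base64-encoded before it goes into the payload.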

Structured Logging & Observability

  • Created automated logging system (TaskLogger) that captures timestamped screenshots with metadata (action type, element ID, reasoning, verification status) in JSON and human-readable formats
  • Organized execution traces with sequential numbering for debugging, performance analysis, and potential reinforcement learning dataset generation
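A rough sketch of what a sequentially numbered step log could look like. Only the class name `TaskLogger` and the metadata fields (action type, element ID, reasoning, verification status) come from the description above. The method name, file naming scheme, and record layout are assumptions.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


class TaskLogger:
    """Write one numbered JSON record per agent step."""

    def __init__(self, log_dir: str) -> None:
        self.log_dir = Path(log_dir)
        self.log_dir.mkdir(parents=True, exist_ok=True)
        self.step = 0

    def log_step(self, action: str, element_id: Optional[int],
                 reasoning: str, verified: bool) -> Path:
        self.step += 1
        record = {
            "step": self.step,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "action": action,
            "element_id": element_id,
            "reasoning": reasoning,
            "verified": verified,
            # Assumed convention: the screenshot is saved under the same number.
            "screenshot": f"step_{self.step:03d}.png",
        }
        path = self.log_dir / f"step_{self.step:03d}.json"
        path.write_text(json.dumps(record, indent=2), encoding="utf-8")
        return path


# Usage: log two steps into a throwaway directory.
log = TaskLogger(tempfile.mkdtemp())
first = log.log_step("click", 3, "text 'Sign in' matches the goal", True)
second = log.log_step("done", None, "task appears complete", False)
print(first.name, second.name)  # step_001.json step_002.json
```

Zero-padded step numbers keep the files in execution order when a directory is listed alphabetically.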

Live Agent Execution Demo

Watch RunNCap autonomously create a GitHub repository through visual grounding and intelligent decision-making

Task Description

Go to my GitHub, create a new repository named RunNCap-Agent-Test-2, and add the description: This repository is automatically created by an agent, include this description, as a test.

Steps: 8
Duration: 94.9s

Step 1: click, Element #3, ✓ Verified
I chose element #3 because its text 'Sign in' matches the requirement to log into GitHub.

Step 2: error
Error: Expecting value: line 1 column 1 (char 0)

Step 3: wait, Element #14, ✓ Verified
I chose element #14 because its text 'Sign in' matches the requirement to log into GitHub. However, I will wait for the user to manually input their credentials before proceeding.

Step 4: click, Element #7, ✓ Verified
I chose element #7 because its text 'New' matches the requirement to create a new repository.

Step 5: type, Element #33, ✓ Verified
I chose element #33 because it is the input field for the repository name, which matches the requirement to name the repository 'RunNCap-Agent-Test-2'.

Step 6: type, Element #34, ✓ Verified
I chose element #34 because it is the input field for the description, and I need to add the specified description text.

Step 7: click, Element #30, ✓ Verified
I chose element #30 because its text 'Create repository' matches the requirement to finalize and create the new repository with the specified details.

Step 8: done
The task appears to be complete as the necessary actions to create the repository and add the description have already been performed.
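The message recorded in step 2 is the one Python's `json.loads` raises when its input does not begin with a JSON value, for example an empty string or a reply that opens with prose. The trace does not show what the model actually returned, so the following is only a sketch of one way to guard against that failure. The function name `parse_action` and the `action` / `element` field names are hypothetical.

```python
import json
from typing import Any, Dict, Optional

# Reproduce the step 2 message with an empty reply.
try:
    json.loads("")
except json.JSONDecodeError as exc:
    print(exc)  # Expecting value: line 1 column 1 (char 0)


def parse_action(reply: str) -> Optional[Dict[str, Any]]:
    """Extract the first JSON object from a model reply, or None if there is none."""
    start = reply.find("{")
    if start == -1:
        return None
    try:
        # raw_decode parses one JSON value and ignores any trailing text.
        obj, _ = json.JSONDecoder().raw_decode(reply[start:])
    except json.JSONDecodeError:
        return None
    return obj if isinstance(obj, dict) else None


# Replies wrapped in prose or a code fence still yield the object.
wrapped = parse_action('Sure! {"action": "click", "element": 7} Done.')
fenced = parse_action('```json\n{"action": "wait"}\n```')
empty = parse_action("")
```

Returning `None` instead of raising lets the agent loop decide whether to re-prompt the model or record the step as an error.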

Technology Stack

Core Technologies

Selenium WebDriver

Browser automation framework for programmatic web interactions

Chrome DevTools Protocol

Full-page screenshot capture and browser debugging

GPT-4 Vision API

Multimodal AI for visual understanding and decision making

Pillow (PIL)

Image processing for Set-of-Mark visual grounding

Python

Primary development language for agent logic

Infrastructure & Tools

FastAPI

High-performance async web framework for API server

WebSocket

Real-time bidirectional communication protocol

Uvicorn ASGI

Lightning-fast ASGI server for async Python

JSON Logging

Structured logging with metadata and timestamps