How to Build an AI Agent That Controls Your Computer

You can build an autonomous AI agent that reads your screen, makes decisions, and controls your computer by combining three free components: Google's Gemini Flash vision model for screen capture analysis, Meta's LLaMA 3.3 70B for decision-making via Groq's API, and PyAutoGUI for executing mouse and keyboard actions. The agent runs in a loop orchestrated by LangGraph, taking screenshots every few seconds, analyzing what it sees, deciding the next action, and executing it autonomously for up to 10 iterations. This stack costs nothing for moderate use and runs entirely from your local machine.

What Is a Computer-Control AI Agent?

A computer-control AI agent is software that autonomously operates your desktop by combining vision, reasoning, and action. Unlike traditional scripts that follow predetermined steps, these agents adapt to what they see on screen in real time. They capture screenshots, analyze visual content using multi-modal AI models, decide what to do next based on their goal, and execute mouse clicks or keyboard inputs.

The technology combines three capabilities: vision-language models (VLMs) that interpret screenshots, large language models that reason about next steps, and robotic process automation tools that physically control the interface. Current open-source implementations achieve approximately 65-70% success rates on repetitive desktop tasks like form filling, data extraction, and multi-step workflows when properly configured with clear objectives.

This approach differs fundamentally from traditional RPA tools like UiPath or Automation Anywhere, which rely on element selectors and brittle DOM inspection. Vision-based agents work on any interface. Legacy applications, PDFs, even virtual machines where traditional automation fails.

Why Computer-Control AI Agents Matter for Automation

Computer-control agents democratize automation for people who can't write complex scripts. Small business owners can automate invoice processing, data entry, report generation, or multi-application workflows without hiring developers. Power users can chain together actions across applications that don't offer APIs.

The economic impact is measurable. A single agent handling routine data transfers between systems can save an estimated 12-18 hours monthly for a small team, based on typical manual processing times for 200-300 records. That's time redirected to higher-value work without the $10,000-50,000 price tag of enterprise RPA platforms.

These agents also enable accessibility use cases. People with motor impairments can describe tasks in natural language and have the agent execute them. The same technology that automates your workflow can read aloud, navigate interfaces, or fill forms for users who struggle with traditional input methods.

AI Agent Computer Automation Tutorial: The Complete Stack

Building your computer-control agent requires four main components working together. Here's what you'll install and configure before writing any code.

Required Software and API Keys

Install Python 3.10 or newer on your system. You'll need pip for package management. Create a new virtual environment to keep dependencies isolated from other projects.

Get a free API key from Google AI Studio for Gemini Flash. The free tier includes 1,500 requests per day, which translates to roughly 25 hours of agent runtime at 1 screenshot per minute. Sign up at aistudio.google.com and generate your key from the API section.

Register for Groq Cloud to access LLaMA 3.3 70B. Groq's free tier provides 14,400 requests daily with extremely fast inference, typically under 1 second for decision-making queries. Create an account at console.groq.com and copy your API key.

Install these Python packages in your virtual environment:

pip install langgraph langchain-google-genai langchain-groq pyautogui pillow python-dotenv

PyAutoGUI AI Agent Setup Guide

PyAutoGUI provides the physical control layer. It captures screenshots, moves your mouse, clicks elements, and types text. The library works across Windows, macOS, and Linux without modification.

Create a basic screenshot capture function first:

import pyautogui
from PIL import Image
import io
import base64

def capture_screen():
    screenshot = pyautogui.screenshot()
    buffered = io.BytesIO()
    screenshot.save(buffered, format="PNG")
    img_str = base64.b64encode(buffered.getvalue()).decode()
    return img_str

This function takes a full-screen capture and converts it to base64 encoding, which Gemini Flash accepts directly. You'll call this every iteration of your agent loop.

Add safety limits to prevent runaway automation. PyAutoGUI includes a failsafe feature that stops execution if you move your mouse to the screen corner:

pyautogui.FAILSAFE = True
pyautogui.PAUSE = 1.0  # 1 second delay between actions

The pause setting adds a mandatory delay between each PyAutoGUI command. This prevents the agent from executing actions faster than applications can respond, which causes failures in roughly 30-40% of high-speed automation attempts. Honestly, most people skip this part and then wonder why their agent keeps breaking.

Connecting Gemini Flash for Vision Analysis

Gemini Flash 1.5 processes images and returns text descriptions or structured analysis. You'll send screenshots with prompts asking what the agent sees and what state the interface is in.

Set up the Gemini client:

from langchain_google_genai import ChatGoogleGenerativeAI
import os

vision_model = ChatGoogleGenerativeAI(
    model="gemini-1.5-flash",
    google_api_key=os.getenv("GOOGLE_API_KEY")
)

def analyze_screen(screenshot_base64, task_description):
    from langchain_core.messages import HumanMessage
    
    message = HumanMessage(
        content=[
            {"type": "text", "text": f"You are analyzing a computer screen to complete this task: {task_description}\n\nDescribe what you see and whether the task is complete."},
            {"type": "image_url", "image_url": f"data:image/png;base64,{screenshot_base64}"}
        ]
    )
    
    response = vision_model.invoke([message])
    return response.content

This function sends each screenshot to Gemini with context about what the agent is trying to accomplish. The model returns natural language descriptions of UI elements, text content, and current state.

LangGraph Autonomous AI Agent Example

LangGraph orchestrates the decision-action loop. It defines states, transitions, and the flow between vision analysis, decision-making, and action execution. If you're new to agent frameworks, check out our guide on how to build an AI agent from scratch with LangChain for foundational concepts.

Define your agent state structure:

from typing import TypedDict, List
from langgraph.graph import StateGraph, END

class AgentState(TypedDict):
    task: str
    iteration: int
    max_iterations: int
    screen_description: str
    next_action: str
    completed: bool
    action_history: List[str]

This state tracks everything the agent knows: its goal, how many steps it's taken, what it last saw, what it plans to do next, and whether it's finished.

Create the decision-making node using LLaMA 3.3:

from langchain_groq import ChatGroq

decision_model = ChatGroq(
    model="llama-3.3-70b-versatile",
    groq_api_key=os.getenv("GROQ_API_KEY"),
    temperature=0.1
)

def decide_next_action(state: AgentState) -> AgentState:
    prompt = f"""Task: {state['task']}
Current screen: {state['screen_description']}
Previous actions: {state['action_history']}

What should the agent do next? Respond with ONE of these exact formats:
- CLICK x,y (coordinates to click)
- TYPE "text to type"
- SCROLL up/down amount
- WAIT seconds
- COMPLETE (task is done)

Only respond with the action, nothing else."""

    response = decision_model.invoke(prompt)
    state['next_action'] = response.content.strip()
    return state

The prompt engineering here matters significantly. Constraining the output to specific formats prevents the model from generating unusable responses. Temperature set to 0.1 keeps decisions consistent across similar screen states.

Building the Execution Loop

The execution node parses the decision and uses PyAutoGUI to perform the action:

import time
import re

def execute_action(state: AgentState) -> AgentState:
    action = state['next_action']
    
    if action.startswith("CLICK"):
        coords = re.findall(r'\d+', action)
        if len(coords) == 2:
            x, y = int(coords[0]), int(coords[1])
            pyautogui.click(x, y)
            state['action_history'].append(f"Clicked ({x}, {y})")
    
    elif action.startswith("TYPE"):
        text = re.search(r'"([^"]*)"', action)
        if text:
            pyautogui.write(text.group(1), interval=0.05)
            state['action_history'].append(f"Typed: {text.group(1)}")
    
    elif action.startswith("SCROLL"):
        parts = action.split()
        direction = parts[1] if len(parts) > 1 else "down"
        amount = int(parts[2]) if len(parts) > 2 else 3
        scroll_value = -amount if direction == "up" else amount
        pyautogui.scroll(scroll_value * 100)
        state['action_history'].append(f"Scrolled {direction} {amount}")
    
    elif action.startswith("WAIT"):
        seconds = re.search(r'\d+', action)
        if seconds:
            time.sleep(int(seconds.group()))
            state['action_history'].append(f"Waited {seconds.group()}s")
    
    elif action.startswith("COMPLETE"):
        state['completed'] = True
        state['action_history'].append("Task completed")
    
    state['iteration'] += 1
    time.sleep(2)  # Brief pause after each action
    return state

The interval parameter in pyautogui.write() adds 50ms between keystrokes. This mimics human typing speed and prevents input fields from dropping characters, which happens in approximately 15-20% of rapid typing attempts on web forms.

Assembling the Complete Graph

Wire everything together with LangGraph's state machine:

def should_continue(state: AgentState) -> str:
    if state['completed']:
        return "end"
    if state['iteration'] >= state['max_iterations']:
        return "end"
    return "continue"

def vision_node(state: AgentState) -> AgentState:
    screenshot = capture_screen()
    description = analyze_screen(screenshot, state['task'])
    state['screen_description'] = description
    return state

# Build the graph
workflow = StateGraph(AgentState)

workflow.add_node("vision", vision_node)
workflow.add_node("decide", decide_next_action)
workflow.add_node("execute", execute_action)

workflow.set_entry_point("vision")
workflow.add_edge("vision", "decide")
workflow.add_edge("decide", "execute")
workflow.add_conditional_edges(
    "execute",
    should_continue,
    {
        "continue": "vision",
        "end": END
    }
)

agent = workflow.compile()

This creates a loop: capture screen, analyze it, decide action, execute, then check if you should continue or stop. The conditional edge prevents infinite loops by enforcing the maximum iteration count.

Build AI Agent with Gemini and LLaMA: Running Your First Task

Start with a simple task to verify everything works. Opening a specific website and filling a search box is ideal for testing.

initial_state = {
    "task": "Open Chrome, go to google.com, and search for 'weather today'",
    "iteration": 0,
    "max_iterations": 10,
    "screen_description": "",
    "next_action": "",
    "completed": False,
    "action_history": []
}

result = agent.invoke(initial_state)

print("Task completed!")
print(f"Iterations used: {result['iteration']}")
print(f"Action history: {result['action_history']}")

Watch the agent execute. You'll see it capture your screen, pause while models process, then perform mouse and keyboard actions. The entire sequence typically completes in 30-60 seconds depending on API response times.

If the agent gets stuck, check the action history to see where it failed. Common issues include incorrect coordinate detection (Gemini doesn't return pixel coordinates directly, so you may need to add coordinate extraction logic) and timing problems where the agent acts before pages load.

Improving Coordinate Detection

Gemini Flash describes element locations in relative terms ("search box in the center top") rather than pixel coordinates. You need to enhance the vision prompt:

def analyze_screen_with_coordinates(screenshot_base64, task_description):
    from langchain_core.messages import HumanMessage
    
    # Get screen dimensions
    screen_width, screen_height = pyautogui.size()
    
    message = HumanMessage(
        content=[
            {"type": "text", "text": f"""You are analyzing a {screen_width}x{screen_height} screen to complete: {task_description}

Describe what you see. If you need to click something, estimate pixel coordinates based on the screen size.
For example, if a button is in the center, that's roughly ({screen_width//2}, {screen_height//2}).
Top-left is (0,0), bottom-right is ({screen_width},{screen_height})."""},
            {"type": "image_url", "image_url": f"data:image/png;base64,{screenshot_base64}"}
        ]
    )
    
    response = vision_model.invoke([message])
    return response.content

This provides spatial context that helps the model estimate clickable locations. Accuracy improves to roughly 75-80% for common UI elements with this approach.

Free AI Agent Stack for Computer Control: Cost and Performance

Running this stack costs nothing for typical development and light production use. Gemini Flash's free tier provides 1,500 requests daily, and Groq offers 14,400 requests. Your bottleneck will be Gemini since you need one vision analysis per iteration.

At 10 iterations per task and 3 tasks per hour, you'll use 240 Gemini requests in an 8-hour workday. That's well within free limits. Groq's allocation is even more generous since decision-making requests are text-only and count separately.

If you exceed free tiers, paid Gemini pricing starts at $0.075 per 1,000 images for Flash. Running 100 tasks at 10 iterations each (1,000 screenshots) costs $0.075. Groq charges $0.59 per million input tokens for LLaMA 3.3, and decision prompts average 200-300 tokens, so 1,000 decisions cost roughly $0.18.

Performance varies by task complexity. Simple workflows like "open app, click button, copy result" complete in 4-6 iterations and take 45