Module 7: Browser Automation & Web Scraping

Learning Objectives

By the end of this module, you will be able to:

- Understand OpenClaw's Headless Chromium architecture
- Configure and use the Puppeteer integration for browser automation
- Write secure and efficient web scraping Agents
- Build a complete "Price Monitoring Agent"
- Handle data extraction from dynamically rendered pages (SPAs)
- Understand the legal and ethical considerations of web scraping

Core Concepts

Browser Automation Architecture

OpenClaw uses a built-in Headless Chromium engine with a Puppeteer API, enabling the Agent to operate a browser just like a human:

Agent
  │
  ├─→ Puppeteer API
  │     ├─→ Headless Chromium Instance
  │     │     ├─→ Page navigation
  │     │     ├─→ DOM manipulation
  │     │     ├─→ Screenshots
  │     │     └─→ PDF generation
  │     └─→ Browser Context (isolated browsing environment)
  │
  └─→ Results returned to user or downstream Skill

Headless vs Headed Mode

Mode	Description	Use Cases
Headless	No GUI, runs in the background	Server environments, scheduled scraping, CI/CD
Headed	GUI displayed	Development debugging, flows requiring human intervention
Headless (New)	Chrome 113+ new Headless mode	When more complete browser behavior simulation is needed

Why Puppeteer?

OpenClaw chose Puppeteer over Playwright for these reasons:

- Tighter native integration with Chromium
- Historical technical choice by the OpenClaw core team
- Most browser Skills in the ecosystem are built on Puppeteer
- Smaller install footprint (only Chromium is required, not multiple browser engines)

Implementation Guide

Step 1: Installation & Configuration

Ensure OpenClaw's browser module is enabled:

# Check if Chromium is installed
openclaw browser check

# If not installed, manually install Chromium
openclaw browser install

# Confirm version
openclaw browser version

Configure browser parameters in settings.json:

{
  "browser": {
    "enabled": true,
    "engine": "chromium",
    "headless": true,
    "launch_options": {
      "args": [
        "--no-sandbox",
        "--disable-setuid-sandbox",
        "--disable-dev-shm-usage",
        "--disable-gpu"
      ],
      "timeout": 30000
    },
    "default_viewport": {
      "width": 1920,
      "height": 1080
    },
    "max_concurrent_pages": 5,
    "page_timeout_ms": 30000
  }
}
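
To make the configuration more concrete, here is a minimal sketch of how the `browser` block above might be translated into an options object for Puppeteer's `launch()`. The helper name `toLaunchOptions` and the mapping itself are assumptions for illustration; OpenClaw's actual loader is not shown in this module. The target keys (`headless`, `args`, `timeout`, `defaultViewport`) are standard Puppeteer launch options.

```javascript
// Hypothetical helper (not part of OpenClaw): converts the settings.json
// "browser" block into an object shaped like puppeteer.launch() options.
function toLaunchOptions(settings) {
  const browser = (settings && settings.browser) || {};
  if (browser.enabled === false) {
    throw new Error('Browser module is disabled in settings.json');
  }

  const launch = browser.launch_options || {};
  const viewport = browser.default_viewport;

  return {
    // Default to headless unless the config explicitly says otherwise
    headless: browser.headless !== false,
    // Copy the array so later changes do not mutate the loaded config
    args: Array.isArray(launch.args) ? [...launch.args] : [],
    timeout: typeof launch.timeout === 'number' ? launch.timeout : 30000,
    // null tells Puppeteer not to apply its default viewport
    defaultViewport: viewport
      ? { width: viewport.width, height: viewport.height }
      : null
  };
}

const launchOptions = toLaunchOptions({
  browser: {
    enabled: true,
    headless: true,
    launch_options: { args: ['--disable-gpu'], timeout: 30000 },
    default_viewport: { width: 1920, height: 1080 }
  }
});
console.log(launchOptions);
```

Keeping this translation in one small function makes it easy to see which settings reach Chromium and which (such as `max_concurrent_pages`) have to be enforced by the caller.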

Security Warning

The --no-sandbox argument disables Chromium's sandbox protection. In production environments, it is strongly recommended to use a Podman container for outer-layer isolation rather than disabling the sandbox. See Module 9: Security.
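
One way to act on this warning is to strip the sandbox-disabling flags by default and keep them only when a deployment explicitly opts in. The sketch below assumes a hypothetical `allowNoSandbox` switch that you would set yourself; it is not an OpenClaw setting.

```javascript
// Flags that switch off Chromium's own sandbox.
const SANDBOX_DISABLING_FLAGS = new Set([
  '--no-sandbox',
  '--disable-setuid-sandbox'
]);

// Hypothetical helper: keeps sandbox-disabling flags only when the caller
// explicitly opts in (for example, inside an isolated container).
function sanitizeArgs(args, { allowNoSandbox = false } = {}) {
  if (allowNoSandbox) return [...args];
  return args.filter(flag => !SANDBOX_DISABLING_FLAGS.has(flag));
}

const configuredArgs = [
  '--no-sandbox',
  '--disable-setuid-sandbox',
  '--disable-dev-shm-usage',
  '--disable-gpu'
];

console.log(sanitizeArgs(configuredArgs));
// [ '--disable-dev-shm-usage', '--disable-gpu' ]
console.log(sanitizeArgs(configuredArgs, { allowNoSandbox: true }).length);
// 4
```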

Step 2: Basic Browser Operations

Control the browser through the Agent's natural language commands:

User:  Open https://example.com and take a screenshot
Agent: [Launching Headless Chromium]
       [Navigating to example.com]
       [Screenshot complete, saved to /tmp/screenshot-2026-03-20.png]

For programmatic Skills:

// skills/browser-tools/screenshot.js
module.exports = {
  name: "webpage-screenshot",
  description: "Capture a webpage screenshot",

  async execute(context) {
    const { params, browser } = context;
    const page = await browser.newPage();

    try {
      // Set User-Agent to avoid bot detection
      await page.setUserAgent(
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ' +
        'AppleWebKit/537.36 (KHTML, like Gecko) ' +
        'Chrome/121.0.0.0 Safari/537.36'
      );

      await page.goto(params.url, {
        waitUntil: 'networkidle2',
        timeout: 30000
      });

      // Wait for a specific element to load (if specified)
      if (params.wait_for_selector) {
        await page.waitForSelector(params.wait_for_selector, {
          timeout: 10000
        });
      }

      const screenshotPath = `/tmp/screenshot-${Date.now()}.png`;
      await page.screenshot({
        path: screenshotPath,
        fullPage: params.full_page || false
      });

      return {
        success: true,
        path: screenshotPath,
        title: await page.title()
      };
    } finally {
      await page.close();
    }
  }
};
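
Because `params.url` comes from the user, a Skill like the one above may want to validate it before navigating. The following is a minimal sketch, assuming you only want to allow public `http`/`https` targets; `assertSafeUrl` is a hypothetical helper, and it checks the hostname text only (it does not resolve DNS, so it is not a complete defense against requests to internal services).

```javascript
// Private, loopback, and link-local IPv4 prefixes.
const PRIVATE_IPV4_PREFIXES = [
  /^0\./, /^10\./, /^127\./, /^169\.254\./, /^192\.168\./,
  /^172\.(1[6-9]|2\d|3[01])\./
];
const IPV4_LITERAL = /^\d{1,3}(\.\d{1,3}){3}$/;

// Hypothetical helper: returns the normalized URL or throws.
function assertSafeUrl(rawUrl) {
  let url;
  try {
    url = new URL(rawUrl);
  } catch {
    throw new Error(`Invalid URL: ${rawUrl}`);
  }

  // Reject file:, chrome:, javascript:, data:, and similar schemes
  if (url.protocol !== 'http:' && url.protocol !== 'https:') {
    throw new Error(`Unsupported protocol: ${url.protocol}`);
  }

  const host = url.hostname.toLowerCase();
  const isLocalName = host === 'localhost' || host.endsWith('.localhost');
  const isLoopbackV6 = host === '[::1]';
  const isPrivateV4 = IPV4_LITERAL.test(host) &&
    PRIVATE_IPV4_PREFIXES.some(prefix => prefix.test(host));

  if (isLocalName || isLoopbackV6 || isPrivateV4) {
    throw new Error(`Refusing to open internal address: ${host}`);
  }
  return url.href;
}

console.log(assertSafeUrl('https://example.com')); // https://example.com/
```

In the screenshot Skill, this could be called as `await page.goto(assertSafeUrl(params.url), ...)` so that invalid input fails before a page is loaded.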

Step 3: Build a Price Monitoring Agent

This is a complete real-world project -- monitoring product prices on e-commerce platforms and automatically notifying when a price drops below a threshold.

1. Create the Skill structure:

mkdir -p skills/price-monitor
touch skills/price-monitor/index.js
touch skills/price-monitor/parsers.js
touch skills/price-monitor/storage.js

2. Main logic:

// skills/price-monitor/index.js
const { parsePrice } = require('./parsers');
const { loadProducts, saveHistory, getLastPrice } = require('./storage');

module.exports = {
  name: "price-monitor",
  description: "Monitor product price changes and notify",

  async execute(context) {
    const { browser, channel, params } = context;
    const products = params.products || loadProducts();
    const results = [];

    for (const product of products) {
      const page = await browser.newPage();

      try {
        await page.goto(product.url, {
          waitUntil: 'networkidle2',
          timeout: 20000
        });

        // Use different selectors based on the e-commerce platform
        const priceText = await page.$eval(
          product.price_selector,
          el => el.textContent
        );

        const currentPrice = parsePrice(priceText);
        const lastPrice = getLastPrice(product.id);

        // Save price history
        saveHistory(product.id, {
          price: currentPrice,
          timestamp: new Date().toISOString()
        });

        const result = {
          name: product.name,
          url: product.url,
          currentPrice,
          lastPrice,
          change: lastPrice
            ? ((currentPrice - lastPrice) / lastPrice * 100).toFixed(2)
            : null
        };

        results.push(result);

        // Price below threshold -- send notification
        if (currentPrice <= product.alert_below) {
          await channel.send(
            `**Price Alert!**\n` +
            `Product: ${product.name}\n` +
            `Current price: $${currentPrice.toLocaleString()}\n` +
            `Target price: $${product.alert_below.toLocaleString()}\n` +
            `Link: ${product.url}`
          );
        }

        // Major price drop (more than 10%)
        if (lastPrice && currentPrice < lastPrice * 0.9) {
          await channel.send(
            `**Major Price Drop!**\n` +
            `Product: ${product.name}\n` +
            `Was: $${lastPrice.toLocaleString()} → ` +
            `Now: $${currentPrice.toLocaleString()}\n` +
            `Drop: ${Math.abs(Number(result.change)).toFixed(2)}%`
          );
        }
      } catch (error) {
        results.push({
          name: product.name,
          error: error.message
        });
      } finally {
        await page.close();
      }
    }

    return { results };
  }
};
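
The percentage and threshold arithmetic in the Skill above can be isolated into a small pure function, which makes it easier to test without a browser. This is a sketch that mirrors the logic shown; the name `describeChange` and the `dropThreshold` parameter are introduced here for illustration.

```javascript
// Mirrors the change calculation from index.js.
// changePercent is a string (from toFixed), negative when the price fell.
function describeChange(lastPrice, currentPrice, dropThreshold = 0.1) {
  if (!lastPrice) {
    // No usable history yet (first run, or a stored price of 0)
    return { changePercent: null, isMajorDrop: false };
  }
  const changePercent =
    ((currentPrice - lastPrice) / lastPrice * 100).toFixed(2);
  const isMajorDrop = currentPrice < lastPrice * (1 - dropThreshold);
  return { changePercent, isMajorDrop };
}

console.log(describeChange(1299, 1099));
// { changePercent: '-15.40', isMajorDrop: true }
console.log(describeChange(100, 95));
// { changePercent: '-5.00', isMajorDrop: false }
console.log(describeChange(null, 95));
// { changePercent: null, isMajorDrop: false }
```

Note that the Skill reads `getLastPrice()` before calling `saveHistory()`; reversing that order would compare the current price with itself.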

3. Price parser:

// skills/price-monitor/parsers.js
function parsePrice(priceText) {
  // Remove currency symbols, commas, whitespace
  const cleaned = priceText
    .replace(/[$\s,]/g, '')
    .trim();

  const price = parseFloat(cleaned);

  if (isNaN(price)) {
    throw new Error(`Unable to parse price: "${priceText}"`);
  }

  return price;
}

module.exports = { parsePrice };
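
The main logic also requires `./storage`, which this module does not show. A minimal sketch of what it could look like follows, assuming JSON files on disk; the file names, the directory, and the `PRICE_MONITOR_DATA_DIR` variable are all hypothetical choices, and a real deployment might use a database instead.

```javascript
// skills/price-monitor/storage.js (hypothetical sketch)
const fs = require('fs');
const os = require('os');
const path = require('path');

// Resolved on every call so the location can be changed at runtime.
function dataDir() {
  return process.env.PRICE_MONITOR_DATA_DIR ||
    path.join(os.tmpdir(), 'price-monitor');
}

function readJson(file, fallback) {
  try {
    return JSON.parse(fs.readFileSync(path.join(dataDir(), file), 'utf8'));
  } catch (err) {
    if (err.code === 'ENOENT') return fallback; // file not created yet
    throw err;
  }
}

function writeJson(file, value) {
  fs.mkdirSync(dataDir(), { recursive: true });
  fs.writeFileSync(path.join(dataDir(), file), JSON.stringify(value, null, 2));
}

// Products to monitor when none are passed in via params
function loadProducts() {
  return readJson('products.json', []);
}

// Most recently saved price for a product, or null if there is no history
function getLastPrice(productId) {
  const entries = readJson('history.json', {})[productId] || [];
  return entries.length ? entries[entries.length - 1].price : null;
}

// Append one { price, timestamp } entry to the product's history
function saveHistory(productId, entry) {
  const history = readJson('history.json', {});
  history[productId] = history[productId] || [];
  history[productId].push(entry);
  writeJson('history.json', history);
}

module.exports = { loadProducts, saveHistory, getLastPrice };
```

The functions are synchronous to match how `index.js` calls them (without `await`). Rewriting the whole history file on each save is simple but only suitable for small data sets.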

4. Combine with Cron Job for automatic execution:

{
  "cron": {
    "jobs": [
      {
        "name": "price-monitor",
        "schedule": "0 */4 * * *",
        "action": "run_skill",
        "skill": "price-monitor",
        "params": {
          "products": [
            {
              "id": "macbook-air-m4",
              "name": "MacBook Air M4 15-inch",
              "url": "https://shop.example.com/macbook-air-m4",
              "price_selector": ".product-price .current",
              "alert_below": 1199
            },
            {
              "id": "airpods-pro-3",
              "name": "AirPods Pro 3",
              "url": "https://shop.example.com/airpods-pro-3",
              "price_selector": ".price-value",
              "alert_below": 229
            }
          ]
        }
      }
    ]
  }
}

Combine with Module 6 Scheduling

The best practice for price monitoring is to combine it with the scheduling mechanism from Module 6: Cron Jobs, checking automatically every few hours. Avoid checking too frequently (every minute), as the e-commerce platform may block your IP.
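
To see what the schedule `0 */4 * * *` from the configuration above expands to, the sketch below evaluates a single cron field. It is a simplified illustration (supporting `*`, `*/n`, `a`, `a-b`, `a-b/n`, and comma lists), not a full cron parser and not the one OpenClaw uses.

```javascript
// Expands one cron field into the sorted list of values it matches.
function expandCronField(field, min, max) {
  const values = new Set();

  for (const part of field.split(',')) {
    const [range, stepText] = part.split('/');
    const step = stepText === undefined ? 1 : Number(stepText);
    if (range === '' || !Number.isInteger(step) || step < 1) {
      throw new Error(`Invalid cron field part: "${part}"`);
    }

    let start;
    let end;
    if (range === '*') {
      start = min;
      end = max;
    } else if (range.includes('-')) {
      const bounds = range.split('-').map(Number);
      start = bounds[0];
      end = bounds[1];
    } else {
      start = Number(range);
      // "a/n" runs from a to the maximum; plain "a" is a single value
      end = stepText === undefined ? start : max;
    }

    const valid = Number.isInteger(start) && Number.isInteger(end) &&
      start >= min && end <= max && start <= end;
    if (!valid) throw new Error(`Invalid cron field part: "${part}"`);

    for (let v = start; v <= end; v += step) values.add(v);
  }
  return [...values].sort((a, b) => a - b);
}

const minutes = expandCronField('0', 0, 59);
const hours = expandCronField('*/4', 0, 23);
console.log(minutes); // [ 0 ]
console.log(hours);   // [ 0, 4, 8, 12, 16, 20 ]
console.log(`${minutes.length * hours.length} runs per day`); // 6 runs per day
```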

Step 4: Handling Dynamically Rendered Pages

Many modern websites use SPAs (Single Page Applications) that require waiting for JavaScript to finish rendering:

// Wait for AJAX requests to complete
await page.goto(url, { waitUntil: 'networkidle0' });

// Wait for a specific element to appear
await page.waitForSelector('.dynamic-content', {
  visible: true,
  timeout: 15000
});

// Simulate scrolling to trigger lazy loading
await page.evaluate(async () => {
  await new Promise(resolve => {
    let totalHeight = 0;
    const distance = 300;
    const timer = setInterval(() => {
      window.scrollBy(0, distance);
      totalHeight += distance;
      if (totalHeight >= document.body.scrollHeight) {
        clearInterval(timer);
        resolve();
      }
    }, 100);
  });
});

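// Wait for scroll-triggered content to load.
// page.waitForTimeout() was deprecated and later removed in newer Puppeteer
// releases, so a plain timer is used here.
await new Promise(resolve => setTimeout(resolve, 2000));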
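
A fixed two-second wait is either too long or too short depending on the site. As an alternative, the sketch below polls a value until it stops changing. `waitUntilStable` is a hypothetical helper; with Puppeteer, the `getValue` callback could be something like `() => page.$$eval('.item', els => els.length)`, where `.item` is a placeholder selector.

```javascript
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

// Resolves with the value once it has been unchanged for `stableChecks`
// consecutive polls; rejects if that does not happen within `timeoutMs`.
async function waitUntilStable(getValue, options = {}) {
  const { intervalMs = 250, stableChecks = 3, timeoutMs = 10000 } = options;
  const deadline = Date.now() + timeoutMs;

  let last = await getValue();
  let unchanged = 0;

  while (unchanged < stableChecks) {
    if (Date.now() > deadline) {
      throw new Error(`Value did not stabilize within ${timeoutMs} ms`);
    }
    await sleep(intervalMs);
    const current = await getValue();
    unchanged = current === last ? unchanged + 1 : 0;
    last = current;
  }
  return last;
}

// Demo with a fake item counter that grows and then settles at 3
const fakeCounts = [1, 2, 3, 3, 3, 3];
let pollIndex = 0;
const fakeGetCount = () =>
  fakeCounts[Math.min(pollIndex++, fakeCounts.length - 1)];

waitUntilStable(fakeGetCount, { intervalMs: 10, stableChecks: 2 })
  .then(count => console.log(`Settled at ${count} items`));
```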

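Step 5: Handling Sites That Require Login

// loadCookies() and saveCookies() are persistence helpers that you provide
// (for example, reading and writing one JSON file per site).
async function loginAndScrape(browser, credentials, targetUrl) {
  const page = await browser.newPage();

  // Load previously saved cookies (if available)
  const cookies = await loadCookies(credentials.site);
  if (cookies) {
    await page.setCookie(...cookies);
  }

  await page.goto(targetUrl);

  // Check if login is needed
  const needsLogin = await page.$('.login-form');
  if (needsLogin) {
    await page.type('#username', credentials.username);
    await page.type('#password', credentials.password);

    // Start waiting for the navigation before clicking, so that a fast
    // redirect is not missed
    await Promise.all([
      page.waitForNavigation(),
      page.click('#login-button')
    ]);

    // Save cookies for next time
    const newCookies = await page.cookies();
    await saveCookies(credentials.site, newCookies);
  }

  // The caller is responsible for closing the returned page
  return page;
}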
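
The `loadCookies` and `saveCookies` helpers used above are not defined in this module. A minimal file-based sketch is shown below; the directory, the `SCRAPER_COOKIE_DIR` variable, and the file naming are assumptions. The functions are synchronous, which still works with the `await` calls in `loginAndScrape`.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// One JSON file per site; the site name is sanitized for use as a file name.
function cookieFile(site) {
  const dir = process.env.SCRAPER_COOKIE_DIR ||
    path.join(os.tmpdir(), 'scraper-cookies');
  const safeName = String(site).replace(/[^a-z0-9.-]/gi, '_');
  return path.join(dir, `${safeName}.json`);
}

// Returns the saved, unexpired cookies, or null if there are none.
function loadCookies(site) {
  let cookies;
  try {
    cookies = JSON.parse(fs.readFileSync(cookieFile(site), 'utf8'));
  } catch (err) {
    if (err.code === 'ENOENT') return null; // nothing saved yet
    throw err;
  }
  const nowSeconds = Date.now() / 1000;
  // Puppeteer reports `expires` in seconds; session cookies use -1
  const fresh = cookies.filter(
    c => !(c.expires > 0) || c.expires > nowSeconds
  );
  return fresh.length ? fresh : null;
}

// Cookies grant access to the account, so restrict the file to the owner.
function saveCookies(site, cookies) {
  const file = cookieFile(site);
  fs.mkdirSync(path.dirname(file), { recursive: true });
  fs.writeFileSync(file, JSON.stringify(cookies), { mode: 0o600 });
}
```

Saved session cookies are as sensitive as the password itself, so the cookie directory should not live inside the Skill's source tree or be committed to version control.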

Credential Security

Never hardcode usernames and passwords in your Skill code. Use OpenClaw's environment variables or Secret Manager:

openclaw config set SHOP_USERNAME "your_username" --secret
openclaw config set SHOP_PASSWORD "your_password" --secret

Common Errors

Error Message	Cause	Solution
`Navigation timeout exceeded`	Page load timed out	Increase `timeout`, or use `waitUntil: 'domcontentloaded'`
`net::ERR_ABORTED`	Page redirect was blocked	Check if cookie consent or CAPTCHA handling is needed
`Protocol error: Target closed`	Page was closed during an operation	Ensure you're not operating on the same page object from multiple places
`Execution context was destroyed`	SPA route change invalidated the context	Re-acquire element references after route changes
`Browser is not connected`	Chromium process unexpectedly terminated	Check memory usage, add `--disable-dev-shm-usage`
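
These errors call for different reactions, so a Skill may want to classify them before deciding what to do. The sketch below maps the messages from the table to a suggested action and computes an exponential backoff delay. The action names and the policy are assumptions for illustration, not OpenClaw behavior.

```javascript
// Suggested reaction per error message pattern (illustrative policy).
const ERROR_ACTIONS = [
  { pattern: /Navigation timeout/i, action: 'retry' },
  { pattern: /Execution context was destroyed/i, action: 'retry' },
  { pattern: /Target closed/i, action: 'new-page' },
  { pattern: /Browser is not connected/i, action: 'restart-browser' },
  // Often caused by consent dialogs or CAPTCHAs, so retrying rarely helps
  { pattern: /net::ERR_ABORTED/i, action: 'give-up' }
];

function classifyError(error) {
  const message = String((error && error.message) || error);
  const match = ERROR_ACTIONS.find(entry => entry.pattern.test(message));
  return match ? match.action : 'give-up';
}

// Delay before retry number `attempt` (0-based): base, 2x, 4x, ... capped.
function backoffDelayMs(attempt, baseMs = 1000, maxMs = 30000) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

console.log(classifyError(new Error('Navigation timeout exceeded'))); // retry
console.log(classifyError(new Error('Browser is not connected')));
// restart-browser
console.log([0, 1, 2, 5].map(n => backoffDelayMs(n)));
// [ 1000, 2000, 4000, 30000 ]
```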

Memory Leaks

Headless Chromium is very memory-intensive. Each tab uses approximately 50-150MB of RAM. Make sure to:

- Close pages immediately with page.close() when done
- Set a max_concurrent_pages limit
- Set memory limits in Docker/Podman
- Periodically restart the Browser instance (recommended every 100 operations)

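// Best practice for preventing memory leaks: recycle the browser periodically.
// Assumes `puppeteer` and `launchOptions` are defined elsewhere in the Skill.
// The browser is held at module level so the replacement instance is actually
// reused; reassigning a function parameter would not be visible to callers.
let browser = null;
let operationCount = 0;

async function getPage() {
  if (!browser || operationCount >= 100) {
    // Closing the browser also closes any pages that are still open,
    // so only recycle when no other work is in flight.
    if (browser) await browser.close();
    browser = await puppeteer.launch(launchOptions);
    operationCount = 0;
  }
  operationCount++;
  return browser.newPage();
}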
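
When choosing a container memory limit, the per-tab figure quoted above gives a rough range for the tabs alone. The sketch below does that arithmetic; it deliberately excludes the browser's own base usage, which this module does not quantify. `estimateTabMemoryMb` is a hypothetical helper.

```javascript
// Rough RAM range for open tabs only, using the 50-150 MB per-tab estimate.
function estimateTabMemoryMb(maxConcurrentPages, options = {}) {
  const { lowPerTabMb = 50, highPerTabMb = 150 } = options;
  if (!Number.isInteger(maxConcurrentPages) || maxConcurrentPages < 1) {
    throw new RangeError('maxConcurrentPages must be a positive integer');
  }
  return {
    lowMb: maxConcurrentPages * lowPerTabMb,
    highMb: maxConcurrentPages * highPerTabMb
  };
}

// With max_concurrent_pages set to 5, as in the sample settings.json
console.log(estimateTabMemoryMb(5)); // { lowMb: 250, highMb: 750 }
```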

Troubleshooting

Chromium Won't Start

# Check dependencies (Linux)
ldd $(which chromium) | grep "not found"

# Common missing packages in Docker environments
apt-get install -y \
  libnss3 libatk1.0-0 libatk-bridge2.0-0 \
  libcups2 libdrm2 libxkbcommon0 libxcomposite1 \
  libxdamage1 libxrandr2 libgbm1 libpango-1.0-0 \
  libasound2

# On macOS you may need
xattr -cr /path/to/chromium

Being Detected as a Bot

// Use puppeteer-extra-plugin-stealth
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');
puppeteer.use(StealthPlugin());

// Or manually configure
await page.evaluateOnNewDocument(() => {
  Object.defineProperty(navigator, 'webdriver', {
    get: () => false,
  });
});
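
Before working around bot detection, it can be worth checking whether the site permits automated access to the path at all, in line with this module's objective on ethical scraping. The sketch below is a simplified robots.txt check, assuming plain prefix rules; it ignores wildcards (`*`, `$`) and `Crawl-delay`, and `isPathAllowed` is a hypothetical helper rather than a complete implementation of the standard.

```javascript
// Returns true if `path` may be fetched according to `robotsTxt`.
// Longest matching rule wins; Allow wins a tie.
function isPathAllowed(robotsTxt, path, userAgent = '*') {
  const groups = [];
  let current = null;
  let lastWasAgent = false;

  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, '').trim(); // strip comments
    const sep = line.indexOf(':');
    if (sep === -1) continue;
    const key = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();

    if (key === 'user-agent') {
      // Consecutive User-agent lines share one group of rules
      if (!lastWasAgent) {
        current = { agents: [], rules: [] };
        groups.push(current);
      }
      current.agents.push(value.toLowerCase());
      lastWasAgent = true;
      continue;
    }
    lastWasAgent = false;
    if ((key === 'allow' || key === 'disallow') && current && value) {
      current.rules.push({ allow: key === 'allow', prefix: value });
    }
  }

  const ua = userAgent.toLowerCase();
  const group =
    groups.find(g => g.agents.some(a => a !== '*' && ua.includes(a))) ||
    groups.find(g => g.agents.includes('*'));
  if (!group) return true;

  let best = null;
  for (const rule of group.rules) {
    if (!path.startsWith(rule.prefix)) continue;
    const longer = !best || rule.prefix.length > best.prefix.length;
    const tieAllow = best && rule.prefix.length === best.prefix.length && rule.allow;
    if (longer || tieAllow) best = rule;
  }
  return best ? best.allow : true;
}

const sampleRobots = [
  'User-agent: *',
  'Disallow: /private/',
  'Allow: /private/public/'
].join('\n');

console.log(isPathAllowed(sampleRobots, '/products/1'));       // true
console.log(isPathAllowed(sampleRobots, '/private/orders'));   // false
console.log(isPathAllowed(sampleRobots, '/private/public/a')); // true
```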

Exercises

Exercise 1: Basic Screenshot

Build a Skill that accepts a list of URLs, captures a full-page screenshot for each, and compiles them into a PDF report.

Exercise 2: News Aggregation

Write a daily news scraping Agent that extracts today's headlines from 3 tech news websites and compiles them into a Markdown summary.

Exercise 3: Full Price Monitor

Extend this module's price monitoring example by adding:

- Price history chart generation
- Weekly price trend reports
- A universal parser supporting multiple e-commerce platforms

Quiz

1. What does waitUntil: 'networkidle2' mean?
- A) Absolutely no network requests
- B) No more than 2 network connections for at least 500 ms
- C) Wait for 2 seconds
- D) Retry up to 2 times

Answer: B) networkidle2 treats navigation as finished once there have been no more than 2 in-flight network connections for at least 500 milliseconds. It is suitable for most page loading scenarios.

2. Why does Headless Chromium sometimes need the --no-sandbox argument?
- A) Performance improvement
- B) In Docker containers, or when running as root, Chromium's Linux sandbox (which relies on user namespaces) may be unavailable
- C) Enables more features
- D) Reduces memory usage

Answer: B) In containerized environments, Linux's sandbox mechanism may conflict with the container's isolation layer. However, disabling it reduces security, so using a Podman container for outer-layer isolation is recommended.

3. What is the optimal check frequency for a price monitoring Agent?
- A) Every minute
- B) Every 2-4 hours
- C) Once daily
- D) Real-time monitoring

Answer: B) Every 2-4 hours strikes the best balance. Checking too frequently risks IP blocking; checking too infrequently may miss flash sales. Adjust based on product characteristics.

4. When dealing with dynamic SPA content, which method should you use?
- A) page.content() to directly get the HTML
- B) page.waitForSelector() to wait for the target element to appear
- C) Refresh the page
- D) Disable JavaScript

Answer: B) SPA content is dynamically rendered by JavaScript, so you must wait for the target element to actually appear in the DOM before extracting it.

Next Steps

- Module 6: Cron Jobs / Heartbeat -- Schedule your scrapers as automated tasks
- Module 8: Multi-Agent Architecture -- Have multiple Agents divide and conquer different websites
- Module 9: Security -- Learn security best practices for browser automation
