Introduction

Web scraping has moved well beyond quick-and-dirty scripts. In 2026, modern browsers render JavaScript-heavy SPAs, trigger lazy-loaded content, and gate data behind authentication flows — all of which break the simple HTTP-request scrapers of five years ago. Playwright, Microsoft's end-to-end browser automation library, handles all of that natively. Paired with Node.js, it gives you a production-ready scraping setup that can navigate real browser environments, intercept network requests, and extract structured data at scale.

This tutorial walks you through building a practical web scraper from scratch — one that can handle dynamic content, respect rate limits, and output clean JSON data ready for your dashboards, CRMs, or reporting tools. Whether you're a developer at an Australian SaaS company pulling competitor pricing, a marketer in Singapore monitoring review sites, or a small business owner in Canada automating lead research, this guide gives you a working foundation.

What You'll Need

  • Node.js 20+ installed (LTS recommended)
  • Basic familiarity with JavaScript or TypeScript
  • A terminal and a code editor (VS Code recommended)
  • npm or pnpm as your package manager
  • A target website to scrape (use one you have permission to scrape, or a public sandbox like books.toscrape.com)

A note on ethics and legality: Always check a site's robots.txt and Terms of Service before scraping. Many sites permit scraping for personal or research use but prohibit commercial use. When in doubt, use official APIs or contact the site owner.
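To turn that robots.txt check into something scriptable, here's a minimal sketch of a rule check. It only looks at `Disallow:` prefixes in the `User-agent: *` group (no wildcards, no `Allow:` precedence, no per-bot groups), so treat it as a first pass, not a compliance guarantee:

```javascript
// Minimal robots.txt check: returns false when a Disallow rule in the
// `User-agent: *` group matches the path as a prefix. A simplified
// sketch — real robots.txt semantics are richer than this.
function isPathAllowed(robotsTxt, path) {
  let inStarGroup = false;
  for (const rawLine of robotsTxt.split('\n')) {
    const line = rawLine.split('#')[0].trim();
    if (!line) continue;
    const colon = line.indexOf(':');
    if (colon === -1) continue;
    const field = line.slice(0, colon).trim().toLowerCase();
    const value = line.slice(colon + 1).trim();
    if (field === 'user-agent') {
      inStarGroup = value === '*';
    } else if (inStarGroup && field === 'disallow' && value) {
      if (path.startsWith(value)) return false;
    }
  }
  return true;
}

const robots = 'User-agent: *\nDisallow: /admin/\nDisallow: /private';
console.log(isPathAllowed(robots, '/admin/users')); // false
console.log(isPathAllowed(robots, '/catalogue'));   // true
```

Fetch the live file from `https://<target>/robots.txt` before a run and bail out if your paths are disallowed.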

Step 1: Set Up Your Project

Create a fresh Node.js project and install Playwright.

mkdir playwright-scraper
cd playwright-scraper
npm init -y
npm install playwright
npx playwright install chromium

The last command downloads Chromium — the only browser you need for most scraping tasks. Firefox and WebKit are available too if you need to test cross-browser behaviour.

Next, create your main script file:

touch scraper.js

Pro tip: If you prefer TypeScript (recommended for larger projects), install tsx for zero-config TypeScript execution: npm install -D tsx typescript. Then rename your file to scraper.ts and run it with npx tsx scraper.ts.

Step 2: Launch a Browser and Open a Page

Open scraper.js and add your first Playwright code. This launches a headless Chromium instance and navigates to your target URL.

const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch({ headless: true });
  const context = await browser.newContext({
    userAgent: 'Mozilla/5.0 (compatible; MyResearchBot/1.0)'
  });
  const page = await context.newPage();

  await page.goto('https://books.toscrape.com', {
    waitUntil: 'domcontentloaded'
  });

  console.log('Page title:', await page.title());

  await browser.close();
})();

Run it with node scraper.js. You should see the page title printed in your terminal.

Common pitfall: Avoid using waitUntil: 'networkidle' on pages with long-polling or persistent websocket connections — it will time out. Prefer domcontentloaded or load, then explicitly wait for specific elements instead.

Step 3: Inspect and Select the Elements You Need

Before writing selectors, open the target page in a real browser and use DevTools (F12) to inspect the HTML structure. Playwright supports CSS selectors, XPath, and its own text-based locators.

For books.toscrape.com, each book is inside an article.product_pod element. The title lives in an h3 > a tag and the price is in p.price_color.

Using Playwright's Locator API

Playwright's modern Locator API is preferred over older ElementHandle methods — it's more resilient to timing issues and retries automatically.

const books = await page.locator('article.product_pod').all();

const results = [];

for (const book of books) {
  const title = await book.locator('h3 > a').getAttribute('title');
  const price = await book.locator('p.price_color').innerText();
  results.push({ title, price });
}

console.log(results);

Run the script again. You should see an array of book objects with titles and prices.

Pro tip: Prefer page.locator() over the older page.$$() for new code. The Locator API is lazy by default and retries until elements are ready — critical when scraping dynamically rendered content.

Step 4: Handle Pagination

Most real-world scraping tasks involve multiple pages. Here's how to loop through paginated results automatically.

let hasNextPage = true;
const allResults = [];

while (hasNextPage) {
  const books = await page.locator('article.product_pod').all();

  for (const book of books) {
    const title = await book.locator('h3 > a').getAttribute('title');
    const price = await book.locator('p.price_color').innerText();
    allResults.push({ title, price });
  }

  const nextButton = page.locator('li.next > a');
  const nextExists = await nextButton.count() > 0;

  if (nextExists) {
    await nextButton.click();
    await page.waitForLoadState('domcontentloaded');
    // Polite delay to avoid hammering the server
    await page.waitForTimeout(1000 + Math.random() * 500);
  } else {
    hasNextPage = false;
  }
}

console.log(`Scraped ${allResults.length} books total.`);

Common pitfall: Never pause for the same fixed interval between every request. A perfectly regular cadence is one of the easiest signals for anti-bot systems to fingerprint as automated traffic. Adding random jitter to each delay (as shown above) keeps your scraper polite while making its traffic pattern look less mechanical.
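If you reuse this polite-delay pattern across several scrapers, the jitter is worth factoring into a tiny helper (the bounds here are just example values):

```javascript
// Returns a random delay between minMs and maxMs. Jittered pauses look
// less mechanical to anti-bot systems than a constant interval.
function jitteredDelay(minMs, maxMs) {
  return minMs + Math.random() * (maxMs - minMs);
}

// Usage inside the pagination loop above:
// await page.waitForTimeout(jitteredDelay(1000, 3000));
```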

Step 5: Save the Data to JSON

Raw console output doesn't get you far. Write your results to a JSON file for downstream processing.

const fs = require('fs');

// After your scraping loop:
fs.writeFileSync(
  'books.json',
  JSON.stringify(allResults, null, 2),
  'utf-8'
);

console.log('Data saved to books.json');

From here, you can import books.json into a spreadsheet, push it to a database, or feed it into a dashboard. The team at Lenka Studio regularly uses this pattern to automate competitive research and market data collection for clients building data-driven products.

Step 6: Handle Authentication and Cookies

If your target site requires login, Playwright makes this straightforward. Rather than scraping the login form every run (which is slow and suspicious), save your session cookies after the first login and reuse them.

// First run: log in and save session
await page.goto('https://example.com/login');
await page.fill('#email', process.env.SCRAPER_EMAIL);
await page.fill('#password', process.env.SCRAPER_PASSWORD);
await page.click('button[type="submit"]');
await page.waitForURL('**/dashboard');

// Save session to disk
await context.storageState({ path: 'session.json' });
await browser.close();

// Subsequent runs: launch a fresh browser and reuse the saved session
const browser = await chromium.launch({ headless: true });
const context = await browser.newContext({
  storageState: 'session.json'
});

Store credentials in environment variables (use a .env file with the dotenv package) — never hardcode them in your script.

Pro tip: Add session.json to your .gitignore immediately. Committing auth tokens to a repository is a serious security risk.

Step 7: Run Your Scraper on a Schedule

A scraper you run manually once isn't much of an automation. There are several good options for scheduling in 2026:

Option A: GitHub Actions (Free, Zero Infrastructure)

Create .github/workflows/scrape.yml:

name: Daily Scraper
on:
  schedule:
    - cron: '0 9 * * *'  # 9am UTC daily
  workflow_dispatch:

jobs:
  scrape:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '20'
      - run: npm ci
      - run: npx playwright install --with-deps chromium
      - run: node scraper.js
      - uses: actions/upload-artifact@v4
        with:
          name: scraped-data
          path: books.json

This runs your scraper daily and stores the output as a downloadable artifact — perfect for SMBs that don't want to manage server infrastructure.

Option B: Cloudflare Workers + Durable Objects

For higher-frequency scraping or when you need the data available via API, move your scraper logic to Cloudflare Workers. Note that a Worker can't run a full local Chromium itself, so you'd pair it with Cloudflare's Browser Rendering service (or a remote browser endpoint) for the Playwright side. This pairs well with the edge function patterns covered in Lenka Studio's existing development guides.

Common Pitfalls to Avoid

  • Not handling network errors: Wrap your page navigation in try/catch blocks and implement retry logic for transient failures.
  • Ignoring viewport size: Some sites serve different content based on screen width. Set a realistic viewport when creating the context: const context = await browser.newContext({ viewport: { width: 1280, height: 800 } });
  • Scraping too fast: Always add delays between requests. A rate of one request every 1–3 seconds is generally safe and respectful.
  • Brittle CSS selectors: Sites update their markup. Prefer selectors tied to semantic attributes (data-testid, aria-label, visible text) over deeply nested class chains that break on any redesign.
  • Storing raw HTML instead of structured data: Parse and normalise your data at scrape time. Raw HTML bloats storage and makes downstream processing painful.
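The first bullet above is worth making concrete. Here's a small retry wrapper sketch with exponential backoff — the attempt count and base delay are example values, not recommendations from any particular library:

```javascript
// Retry an async operation with exponential backoff, re-throwing the
// last error once all attempts are exhausted.
async function withRetry(fn, { attempts = 3, baseDelayMs = 1000 } = {}) {
  let lastError;
  for (let attempt = 0; attempt < attempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt === attempts - 1) break;
      // Back off: 1s, 2s, 4s, ... (with the default baseDelayMs)
      await new Promise((resolve) =>
        setTimeout(resolve, baseDelayMs * 2 ** attempt)
      );
    }
  }
  throw lastError;
}

// Usage with a Playwright navigation (page comes from the earlier steps):
// await withRetry(() => page.goto('https://books.toscrape.com'));
```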

Next Steps

You now have a working Playwright scraper that can handle dynamic pages, paginate through results, manage authentication, save structured data, and run on a schedule — all without a single paid service.

From here, consider these natural extensions:

  • Add a database: Push your JSON output directly to Supabase or PlanetScale for queryable, persistent storage.
  • Build a monitoring layer: Use Playwright's screenshot and video capture to detect when a target site changes its layout so you can update your selectors proactively.
  • Combine with AI parsing: Feed scraped text into an LLM API (OpenAI, Anthropic, or a local model via Ollama) to extract structured fields from unstructured descriptions — powerful for lead enrichment and competitive intelligence.
  • Scale with a scraping framework: When you outgrow a single script, look at Crawlee (built on Playwright) for built-in request queuing, concurrency management, and proxy rotation.

If you're building a data pipeline or automation system and want a team to architect it properly from day one, get in touch with Lenka Studio. We work with SMBs across Australia, Singapore, Canada, and the US to design and build scalable digital products — and we're happy to talk through your specific use case.