Web scraping is the automated extraction of data from websites. This guide covers tools, strategies, legal considerations, and production patterns.

Web Scraping for AI & LLMs

Web scraping powers a significant share of the modern data economy: industry estimates put the global market above $1 billion in 2025, and roughly 70% of generative AI models are reportedly trained primarily on scraped web data.

Common Use Cases

| Use Case | What You Extract |
|---|---|
| Price monitoring | Product prices, availability across e-commerce sites |
| LLM knowledge bases | Full-text content as clean markdown for RAG |
| Lead generation | Contact info, company details from directories |
| Content archival | Paid courses, docs, wikis as organized offline files |
| Training data | Text, images, metadata for ML models |
| SEO monitoring | Rankings, backlinks, keyword positions |

The Tool Landscape

Python dominates (an estimated 67% share), followed by JavaScript/Node.js (around 23%).

| Tool | JS Rendering | Best For | Difficulty |
|---|---|---|---|
| Crawl4AI | Native (Playwright) | AI-optimized markdown extraction | Medium |
| Playwright | Native | Modern SPAs, browser automation | Medium |
| httpx | None | Async API calls, direct HTTP | Low |
| BeautifulSoup | None | HTML parsing, beginner projects | Low |
| Scrapy | Requires addon | Large-scale crawling (1000s of pages) | Steep |
| Firecrawl | Handled (API) | LLM-ready markdown/JSON at scale | Low |

Scraping Strategies

Many websites use internal APIs returning structured JSON. If you find these endpoints, skip browser rendering entirely — 5-10x faster and more reliable.

How to find APIs:

  1. Open DevTools Network tab, filter XHR/Fetch
  2. Navigate — APIs appear as JSON requests
  3. Copy URL pattern (e.g., /api/courses/{id}/lessons)
  4. Note auth header format (Bearer <token> or cookie)

import httpx

async def fetch_lesson(jwt_token: str) -> dict:
    headers = {"Authorization": f"Bearer {jwt_token}"}
    async with httpx.AsyncClient(headers=headers) as client:
        resp = await client.get("https://example.com/api/lessons/42")
        resp.raise_for_status()
        return resp.json()

When content is rendered by JavaScript:

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def scrape_lesson(lesson_url: str) -> str:
    # Strip navigation chrome and wait until the main content node exists
    config = CrawlerRunConfig(
        excluded_selector="nav, .sidebar, .footer",
        wait_for="css:.lesson-content",
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=lesson_url, config=config)
        return result.markdown

For server-rendered HTML, no browser needed:

import httpx
from bs4 import BeautifulSoup

resp = httpx.get("https://example.com/article")
soup = BeautifulSoup(resp.text, "lxml")
title = soup.select_one("h1").text
content = soup.select_one(".article-body").get_text()

Strategy Selection Flowchart

Does View Source show full content?
├── YES → Static HTML
│   └── Does it have API endpoints (XHR in Network tab)?
│       ├── YES → API fetch (fastest)
│       └── NO  → Direct HTTP + parse HTML
└── NO → JavaScript-rendered (SPA)
    └── Does it have __NEXT_DATA__?
        ├── YES → Extract JSON directly (no browser)
        └── NO  → Browser rendering (Crawl4AI / Playwright)
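When the flowchart lands on `__NEXT_DATA__`, the page's entire data payload sits in a single script tag and can be pulled with the standard library alone, no browser required. A minimal sketch (the function name is illustrative):

```python
import json
import re

def extract_next_data(html: str) -> dict:
    """Pull the JSON payload Next.js embeds in its __NEXT_DATA__ script tag."""
    match = re.search(
        r'<script id="__NEXT_DATA__"[^>]*>(.*?)</script>',
        html,
        re.DOTALL,
    )
    if not match:
        raise ValueError("No __NEXT_DATA__ script tag found")
    return json.loads(match.group(1))
```

The interesting content usually lives under `props.pageProps` in the returned dict.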

Authentication

| Auth Type | Scraping Approach |
|---|---|
| Session cookies | Save cookies from browser, pass to httpx |
| JWT tokens | Extract from localStorage, use as Bearer header |
| OAuth | Log in manually, save session profile |
| API key | Store in .env, pass as header |
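When you copy a session out of DevTools, the cookies usually arrive as one raw `Cookie` header string, while httpx wants a dict. A small converter (the helper name is illustrative):

```python
def parse_cookie_header(raw: str) -> dict[str, str]:
    """Split a raw 'name=value; name2=value2' Cookie header into a dict."""
    cookies: dict[str, str] = {}
    for part in raw.split(";"):
        part = part.strip()
        if "=" in part:
            name, _, value = part.partition("=")
            cookies[name.strip()] = value.strip()
    return cookies
```

The result can then be passed as `httpx.get(url, cookies=parse_cookie_header(raw))`.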

AI-Powered Scraping

| Approach | Description |
|---|---|
| AI as coding assistant | Give Claude your target site's structure; it generates the complete scraper |
| AI as extraction engine | Send raw HTML to an LLM, ask for structured output; handles messy HTML automatically |

AI-Native Scraping Tools

| Tool | Approach | Best For |
|---|---|---|
| Crawl4AI | Playwright + local LLM inference | Markdown extraction without API costs |
| Firecrawl | Cloud API, returns markdown/JSON | LLM-ready content at scale |
| ScrapeGraphAI | LLM pipelines for extraction | Complex structured data |
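Whatever model sits behind the extraction-engine approach, replies often come back wrapped in a markdown code fence even when you ask for bare JSON, so a tolerant parser saves retries. A sketch (the helper name is illustrative):

```python
import json
import re

def parse_llm_json(reply: str) -> dict:
    """Extract a JSON object from an LLM reply, tolerating ```json fences."""
    fenced = re.search(r"```(?:json)?\s*(\{.*?\})\s*```", reply, re.DOTALL)
    payload = fenced.group(1) if fenced else reply.strip()
    return json.loads(payload)
```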

Anti-Bot Protection

| Method | Difficulty to Bypass |
|---|---|
| IP rate limiting | Low; use delays and proxies |
| User-Agent filtering | Low; set realistic headers |
| TLS fingerprinting | High; requires real browser |
| Browser fingerprinting | High; real browser with stealth plugins |
| JS challenges | Medium; headless browsers pass most |
| Behavior analysis | Very high |
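For the low-difficulty layers, realistic headers plus jittered delays go a long way. A sketch; the header values are illustrative and should mirror a real, current browser:

```python
import random

# Headers that mimic a mainstream browser; keep these current with
# whatever browser you actually use (copy from DevTools if in doubt).
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}

def jittered_delay(base: float = 3.0, jitter: float = 1.5) -> float:
    """Randomize the inter-request delay so timing looks less robotic."""
    return base + random.uniform(0, jitter)
```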
Legal & Ethical Considerations

| Practice | Guidance |
|---|---|
| Respect robots.txt | Required in the EU, recommended everywhere |
| Rate limiting | Always (3s+ delay between requests) |
| Avoid PII collection | GDPR fines up to 4% of annual revenue |
| Respect authentication | Bypassing login protections raises legal risk |
| Content you paid for | Generally lower risk to archive |
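Respecting robots.txt is easy to automate with the standard library's urllib.robotparser. A sketch that checks a URL against an already-fetched robots.txt body:

```python
from urllib.robotparser import RobotFileParser

def allowed_by_robots(robots_txt: str, url: str, agent: str = "*") -> bool:
    """Check whether robots.txt permits fetching a URL for a given agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)
```

Fetch `https://example.com/robots.txt` once per host and reuse the parsed result for every URL you plan to crawl.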

Production Pipeline

1. Authenticate: log in, save session/cookies.
2. Discover API: find endpoints, extract tokens from the DevTools Network tab.
3. Discover Content: map the full hierarchy (Course > Module > Lesson).
4. Scrape Content: extract text as clean markdown.
5. Download Media: images, PDFs, videos (yt-dlp for video).
6. Organize Output: clean markdown, YAML frontmatter, generate indexes.
7. Validate: quality audit, coverage report. Target >95% completion rate.
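The validation step can start as a simple completion-rate check over the lesson_id -> status mapping kept in the scrape state. A minimal sketch (field names are illustrative):

```python
def coverage_report(lessons: dict[str, str], target: float = 0.95) -> dict:
    """Summarize scrape completion from a lesson_id -> status mapping."""
    total = len(lessons)
    completed = sum(1 for s in lessons.values() if s == "completed")
    rate = completed / total if total else 0.0
    return {
        "total": total,
        "completed": completed,
        "rate": rate,
        "meets_target": rate >= target,
    }
```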

Essential Patterns

State Checkpointing

The most important production pattern. Without it, a crash at item 300 of 500 means starting over.

from pathlib import Path
from pydantic import BaseModel

class ScrapeState(BaseModel):
    lessons: dict[str, str]  # lesson_id -> status ("pending", "completed", "failed")
    completed: int = 0
    failed: int = 0

def save_state(path: Path, state: ScrapeState) -> None:
    path.write_text(state.model_dump_json(indent=2))

Save after every 5-10 items. On restart, skip completed items.
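Restart logic is the other half of checkpointing: load the state file and skip anything already completed. A minimal sketch using a plain JSON state file (the structure mirrors the state model, but names are illustrative):

```python
import json
from pathlib import Path

def load_pending(state_path: Path, all_ids: list[str]) -> list[str]:
    """Return only the item ids not yet marked completed in the state file."""
    if state_path.exists():
        state = json.loads(state_path.read_text())
    else:
        state = {"lessons": {}}
    done = {i for i, status in state["lessons"].items() if status == "completed"}
    return [i for i in all_ids if i not in done]
```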

Two-Layer Markdown Cleaning

Layer 1 — CSS Exclusion (at scrape time):

config = CrawlerRunConfig(
    excluded_selector="nav, .sidebar, .footer, [class*='navigation']"
)

Layer 2 — Regex Post-Processing (after extraction):

import re
def clean_markdown(text: str) -> str:
    text = re.sub(r"^\d+ (?:likes?|comments?).*$", "", text, flags=re.MULTILINE)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()

Rate Limiting

import asyncio, time

class RateLimiter:
    def __init__(self, min_delay: float = 3.0):
        self._min_delay = min_delay
        self._last_request = 0.0
        self._lock = asyncio.Lock()

    async def wait(self):
        async with self._lock:
            elapsed = time.monotonic() - self._last_request
            if elapsed < self._min_delay:
                await asyncio.sleep(self._min_delay - elapsed)
            self._last_request = time.monotonic()

Common Mistakes

| Mistake | Fix |
|---|---|
| No rate limiting | Enforce 3s+ delay between requests |
| No checkpointing | Save state every 5-10 items |
| Hardcoded CSS selectors | Prefer API extraction; use fallback chains |
| networkidle wait on SPAs | Use domcontentloaded + CSS selector waits |
| Saving error pages as content | Check error markers and minimum content length |
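The fallback-chain fix is small enough to inline. A sketch, where page is anything with a BeautifulSoup-style select_one method:

```python
def select_with_fallback(page, selectors: list[str]):
    """Try CSS selectors in priority order; return the first match or None."""
    for selector in selectors:
        node = page.select_one(selector)
        if node is not None:
            return node
    return None
```

Typical use: `select_with_fallback(soup, [".article-body", "main article", "article"])`, so a site redesign degrades gracefully instead of crashing the run.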

Essential Python Packages

pip install crawl4ai httpx beautifulsoup4 lxml pydantic python-dotenv yt-dlp pdfplumber

| Package | Purpose |
|---|---|
| crawl4ai | Browser rendering + markdown extraction |
| httpx | Async HTTP client for API calls |
| beautifulsoup4 + lxml | HTML parsing (static sites) |
| pydantic | Data models, state serialization |
| yt-dlp | Video downloading |
| pdfplumber | PDF text extraction |