# Web Scraping for AI & LLMs
Web scraping powers a significant share of the modern data economy: the global market exceeded $1 billion in 2025, and an estimated 70% of generative AI models are trained primarily on scraped web data.
## Common Use Cases
| Use Case | What You Extract |
|---|---|
| Price monitoring | Product prices, availability across e-commerce sites |
| LLM knowledge bases | Full-text content as clean markdown for RAG |
| Lead generation | Contact info, company details from directories |
| Content archival | Paid courses, docs, wikis as organized offline files |
| Training data | Text, images, metadata for ML models |
| SEO monitoring | Rankings, backlinks, keyword positions |
## The Tool Landscape

Python dominates the scraping ecosystem with roughly 67% market share, followed by JavaScript/Node.js at 23%.
| Tool | JS Rendering | Best For | Difficulty |
|---|---|---|---|
| Crawl4AI | Native (Playwright) | AI-optimized markdown extraction | Medium |
| Playwright | Native | Modern SPAs, browser automation | Medium |
| httpx | None | Async API calls, direct HTTP | Low |
| BeautifulSoup | None | HTML parsing, beginner projects | Low |
| Scrapy | Requires addon | Large-scale crawling (1000s of pages) | Steep |
| Firecrawl | Handled (API) | LLM-ready markdown/JSON at scale | Low |
## Scraping Strategies

Many websites load their data through internal APIs that return structured JSON. If you can find these endpoints, skip browser rendering entirely — API calls are 5-10x faster and more reliable.
How to find APIs:

- Open the DevTools Network tab and filter by XHR/Fetch
- Navigate the site — APIs appear as JSON requests
- Copy the URL pattern (e.g., `/api/courses/{id}/lessons`)
- Note the auth header format (`Bearer <token>` or cookie)
```python
import httpx

async def fetch_lesson(jwt_token: str) -> dict:
    headers = {"Authorization": f"Bearer {jwt_token}"}
    async with httpx.AsyncClient(headers=headers) as client:
        resp = await client.get("https://example.com/api/lessons/42")
        resp.raise_for_status()
        return resp.json()
```
When content is rendered by JavaScript:
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

config = CrawlerRunConfig(
    excluded_selector="nav, .sidebar, .footer",  # strip page chrome at scrape time
    wait_for="css:.lesson-content",  # wait until the real content has rendered
)

async def scrape_lesson(lesson_url: str) -> str:
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=lesson_url, config=config)
        return result.markdown
```
For server-rendered HTML, no browser needed:
```python
import httpx
from bs4 import BeautifulSoup

resp = httpx.get("https://example.com/article")
soup = BeautifulSoup(resp.text, "lxml")
title = soup.select_one("h1").text
content = soup.select_one(".article-body").get_text()
```
### Strategy Selection Flowchart

```
Does View Source show full content?
├── YES → Static HTML
│   └── Does it have API endpoints (XHR in Network tab)?
│       ├── YES → API fetch (fastest)
│       └── NO → Direct HTTP + parse HTML
└── NO → JavaScript-rendered (SPA)
    └── Does it have __NEXT_DATA__?
        ├── YES → Extract JSON directly (no browser)
        └── NO → Browser rendering (Crawl4AI / Playwright)
```
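The `__NEXT_DATA__` branch deserves a sketch: Next.js serializes a page's full props as JSON inside a `<script id="__NEXT_DATA__">` tag, so structured data can be pulled with a plain HTTP request and a parser. The payload's internal shape is site-specific; the helper below only extracts the raw JSON.

```python
import json

from bs4 import BeautifulSoup

def extract_next_data(html: str) -> dict:
    """Pull the embedded JSON payload from a Next.js page (no browser needed)."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("script", id="__NEXT_DATA__")
    if tag is None:
        raise ValueError("No __NEXT_DATA__ script tag found")
    return json.loads(tag.string)
```

From there, drill into `data["props"]["pageProps"]` (or wherever the site keeps its content) with ordinary dict access.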
## Authentication
| Auth Type | Scraping Approach |
|---|---|
| Session cookies | Save cookies from browser, pass to httpx |
| JWT tokens | Extract from localStorage, use as Bearer header |
| OAuth | Log in manually, save session profile |
| API key | Store in .env, pass as header |
## AI-Powered Scraping
| Approach | Description |
|---|---|
| AI as coding assistant | Give Claude your target site’s structure, it generates the complete scraper |
| AI as extraction engine | Send raw HTML to LLM, ask for structured output. Handles messy HTML automatically |
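The "AI as extraction engine" row boils down to building a prompt that pairs raw HTML with a target schema and asking for JSON back. A minimal sketch, with `call_llm` as a deliberate placeholder for whatever chat-completion client you use (OpenAI, Anthropic, a local model):

```python
import json

EXTRACTION_PROMPT = """Extract the following fields from the HTML below.
Return ONLY valid JSON matching this schema: {schema}

HTML:
{html}"""

def build_extraction_prompt(html: str, schema: dict) -> str:
    """Pair raw HTML with a JSON schema; the LLM tolerates messy markup."""
    return EXTRACTION_PROMPT.format(schema=json.dumps(schema), html=html)

def extract_with_llm(html: str, schema: dict, call_llm) -> dict:
    # call_llm: any function taking a prompt string and returning the model's text.
    raw = call_llm(build_extraction_prompt(html, schema))
    return json.loads(raw)
```

In production you would also validate the returned JSON (e.g., with a Pydantic model) and retry on malformed output.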
### AI-Native Scraping Tools
| Tool | Approach | Best For |
|---|---|---|
| Crawl4AI | Playwright + local LLM inference | Markdown extraction without API costs |
| Firecrawl | Cloud API, returns markdown/JSON | LLM-ready content at scale |
| ScrapeGraphAI | LLM pipelines for extraction | Complex structured data |
## Anti-Bot Protection
| Method | Difficulty to Bypass |
|---|---|
| IP rate limiting | Low — use delays and proxies |
| User-Agent filtering | Low — set realistic headers |
| TLS fingerprinting | High — requires real browser |
| Browser fingerprinting | High — real browser with stealth plugins |
| JS challenges | Medium — headless browsers pass most |
| Behavior analysis | Very high |
## Legal and Ethical Rules
| Practice | Guidance |
|---|---|
| Respect robots.txt | Required in the EU (GDPR); recommended everywhere |
| Rate limiting | Always (3s+ delay between requests) |
| Avoid PII collection | GDPR fines up to 4% annual revenue |
| Respect authentication | Bypassing login protections raises legal risk |
| Content you paid for | Generally lower risk to archive |
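Checking robots.txt can be automated with the standard library's `urllib.robotparser`, so compliance costs one call per URL:

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against robots.txt rules.

    Parses raw robots.txt text here; RobotFileParser can also fetch the
    file itself via set_url() + read() if you prefer.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Call `is_allowed` before queuing each URL and skip (or log) anything disallowed for your user agent.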
## Production Pipeline
1. Authenticate
2. Discover API
3. Discover Content
4. Scrape Content
5. Download Media
6. Organize Output
7. Validate
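One way to wire these stages together is a thin orchestrator that runs named stages in order and records which ones completed, so a failure report can say exactly where the run broke. The stage names and shared-context shape below are hypothetical, a sketch rather than a fixed framework:

```python
def run_pipeline(stages, ctx=None):
    """Run (name, callable) pipeline stages in order against a shared context.

    Each stage mutates the context dict (tokens, discovered endpoints,
    scraped content, ...). Completed stage names are recorded so a crash
    can be attributed to a specific step.
    """
    ctx = ctx if ctx is not None else {}
    completed = []
    for name, stage in stages:
        stage(ctx)
        completed.append(name)
    ctx["completed_stages"] = completed
    return ctx
```

A real run would pass seven stages matching the list above; each stage can then lean on the checkpointing and rate-limiting patterns that follow.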
## Essential Patterns
### State Checkpointing
The most important production pattern. Without it, a crash at item 300 of 500 means starting over.
```python
from pathlib import Path

from pydantic import BaseModel

class ScrapeState(BaseModel):
    lessons: dict[str, str]  # lesson_id -> status ("pending" / "done" / "failed")
    completed: int = 0
    failed: int = 0

def save_state(path: Path, state: ScrapeState) -> None:
    path.write_text(state.model_dump_json(indent=2))
```
Save after every 5-10 items. On restart, skip completed items.
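Resuming is the mirror image of saving: load the checkpoint file if it exists, otherwise start fresh, and only queue items not yet marked done. A sketch (it redefines the same `ScrapeState` model above so the snippet is self-contained):

```python
from pathlib import Path

from pydantic import BaseModel

class ScrapeState(BaseModel):
    lessons: dict[str, str]  # lesson_id -> "pending" / "done" / "failed"
    completed: int = 0
    failed: int = 0

def load_state(path: Path, all_ids: list[str]) -> ScrapeState:
    """Resume from a checkpoint file if present, else start fresh."""
    if path.exists():
        return ScrapeState.model_validate_json(path.read_text())
    return ScrapeState(lessons={i: "pending" for i in all_ids})

def pending_ids(state: ScrapeState) -> list[str]:
    """Everything not yet done — failed items get retried too."""
    return [i for i, status in state.lessons.items() if status != "done"]
```

The main loop then iterates `pending_ids(state)` instead of the full ID list, and a crash at item 300 resumes at item 301.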
### Two-Layer Markdown Cleaning
Layer 1 — CSS Exclusion (at scrape time):
```python
config = CrawlerRunConfig(
    excluded_selector="nav, .sidebar, .footer, [class*='navigation']"
)
```
Layer 2 — Regex Post-Processing (after extraction):
```python
import re

def clean_markdown(text: str) -> str:
    # Drop social-widget residue (like/comment counters)
    text = re.sub(r"^\d+ (?:likes?|comments?).*$", "", text, flags=re.MULTILINE)
    # Collapse runs of 3+ newlines into a single blank line
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```
### Rate Limiting
```python
import asyncio
import time

class RateLimiter:
    """Enforce a minimum delay between requests, shared across tasks."""

    def __init__(self, min_delay: float = 3.0):
        self._min_delay = min_delay
        self._last_request = 0.0
        self._lock = asyncio.Lock()

    async def wait(self):
        async with self._lock:
            elapsed = time.monotonic() - self._last_request
            if elapsed < self._min_delay:
                await asyncio.sleep(self._min_delay - elapsed)
            self._last_request = time.monotonic()
```
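In use, the key point is to share one limiter instance across all workers so the delay is global rather than per-task. A sketch (it includes a compact copy of the same limiter so the snippet stands alone; `fetch` is any async callable you supply):

```python
import asyncio
import time

class RateLimiter:
    """Same limiter as defined above, repeated here for self-containment."""
    def __init__(self, min_delay: float = 3.0):
        self._min_delay = min_delay
        self._last_request = 0.0
        self._lock = asyncio.Lock()

    async def wait(self):
        async with self._lock:
            elapsed = time.monotonic() - self._last_request
            if elapsed < self._min_delay:
                await asyncio.sleep(self._min_delay - elapsed)
            self._last_request = time.monotonic()

async def scrape_all(urls, fetch, limiter):
    # One shared limiter serializes the delay across concurrent tasks.
    async def worker(url):
        await limiter.wait()
        return await fetch(url)
    return await asyncio.gather(*(worker(u) for u in urls))
```

Spawning a fresh `RateLimiter` inside each task would silently defeat the delay, which is the most common mistake with this pattern.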
## Common Mistakes
| Mistake | Fix |
|---|---|
| No rate limiting | Enforce 3s+ delay between requests |
| No checkpointing | Save state every 5-10 items |
| Hardcoded CSS selectors | Prefer API extraction; use fallback chains |
| `networkidle` wait on SPAs | Use `domcontentloaded` + CSS selector waits |
| Saving error pages as content | Check error markers and minimum content length |
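A minimal validator for the last row: reject pages that are too short or that contain error markers before writing them to disk. The marker list and length threshold below are illustrative; extend them per site and content type.

```python
# Markers suggesting an error or login page was captured instead of
# real content (illustrative list — extend per target site).
ERROR_MARKERS = ("page not found", "access denied", "please log in")
MIN_CONTENT_LENGTH = 200  # characters; tune per content type

def is_valid_content(markdown: str) -> bool:
    """Return False for pages that look like errors rather than content."""
    if len(markdown.strip()) < MIN_CONTENT_LENGTH:
        return False
    lowered = markdown.lower()
    return not any(marker in lowered for marker in ERROR_MARKERS)
```

Run this check before saving each page and before marking its checkpoint status as done, so failed items are retried rather than silently archived as junk.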
## Essential Python Packages

```bash
pip install crawl4ai httpx beautifulsoup4 lxml pydantic python-dotenv yt-dlp pdfplumber
```
| Package | Purpose |
|---|---|
| `crawl4ai` | Browser rendering + markdown extraction |
| `httpx` | Async HTTP client for API calls |
| `beautifulsoup4` + `lxml` | HTML parsing (static sites) |
| `pydantic` | Data models, state serialization |
| `python-dotenv` | Loading credentials from .env files |
| `yt-dlp` | Video downloading |
| `pdfplumber` | PDF text extraction |