# Web Scraping for AI & LLMs
Web scraping powers a significant share of the modern data economy: the global market exceeded $1 billion in 2025, and an estimated 70% of generative AI models are trained primarily on scraped web data.
## Common Use Cases
| Use Case | What You Extract |
|---|---|
| Price monitoring | Product prices, availability across e-commerce sites |
| LLM knowledge bases | Full-text content as clean markdown for RAG |
| Lead generation | Contact info, company details from directories |
| Content archival | Paid courses, docs, wikis as organized offline files |
| Training data | Text, images, metadata for ML models |
| SEO monitoring | Rankings, backlinks, keyword positions |
## The Tool Landscape

Python dominates the scraping ecosystem with roughly 67% market share, followed by JavaScript/Node.js at 23%.
| Tool | JS Rendering | Best For | Difficulty |
|---|---|---|---|
| Crawl4AI | Native (Playwright) | AI-optimized markdown extraction | Medium |
| Playwright | Native | Modern SPAs, browser automation | Medium |
| httpx | None | Async API calls, direct HTTP | Low |
| BeautifulSoup | None | HTML parsing, beginner projects | Low |
| Scrapy | Requires addon | Large-scale crawling (1000s of pages) | Steep |
| Firecrawl | Handled (API) | LLM-ready markdown/JSON at scale | Low |
## Scraping Strategies

Many websites load their data through internal APIs that return structured JSON. If you can find these endpoints, skip browser rendering entirely — API calls are 5-10x faster and more reliable.
How to find APIs:

- Open the DevTools Network tab and filter by XHR/Fetch
- Navigate the site — APIs appear as JSON requests
- Copy the URL pattern (e.g., `/api/courses/{id}/lessons`)
- Note the auth header format (`Bearer <token>` or cookie)
```python
import httpx

async def fetch_lesson(jwt_token: str) -> dict:
    headers = {"Authorization": f"Bearer {jwt_token}"}
    async with httpx.AsyncClient(headers=headers) as client:
        resp = await client.get("https://example.com/api/lessons/42")
        resp.raise_for_status()
        return resp.json()
```
When content is rendered by JavaScript:
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

config = CrawlerRunConfig(
    excluded_selector="nav, .sidebar, .footer",  # strip page chrome at scrape time
    wait_for="css:.lesson-content",  # wait until the real content has rendered
)

async def scrape_lesson(lesson_url: str) -> str:
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=lesson_url, config=config)
        return result.markdown
```
For server-rendered HTML, no browser needed:
```python
import httpx
from bs4 import BeautifulSoup

resp = httpx.get("https://example.com/article")
soup = BeautifulSoup(resp.text, "lxml")
title = soup.select_one("h1").text
content = soup.select_one(".article-body").get_text()
```
### Strategy Selection Flowchart

```
Does View Source show full content?
├── YES → Static HTML
│   └── Does it have API endpoints (XHR in Network tab)?
│       ├── YES → API fetch (fastest)
│       └── NO → Direct HTTP + parse HTML
└── NO → JavaScript-rendered (SPA)
    └── Does it have __NEXT_DATA__?
        ├── YES → Extract JSON directly (no browser)
        └── NO → Browser rendering (Crawl4AI / Playwright)
```
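The `__NEXT_DATA__` branch deserves a sketch: Next.js serializes a page's full props as JSON inside a `<script id="__NEXT_DATA__">` tag, so structured data can be pulled with a plain HTTP request and a parser. The payload's internal shape is site-specific; the helper below only extracts the raw JSON.

```python
import json

from bs4 import BeautifulSoup

def extract_next_data(html: str) -> dict:
    """Pull the embedded JSON payload from a Next.js page (no browser needed)."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("script", id="__NEXT_DATA__")
    if tag is None:
        raise ValueError("No __NEXT_DATA__ script tag found")
    return json.loads(tag.string)
```

From there, drill into `data["props"]["pageProps"]` (or wherever the site keeps its content) with ordinary dict access.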
## Authentication
| Auth Type | Scraping Approach |
|---|---|
| Session cookies | Save cookies from browser, pass to httpx |
| JWT tokens | Extract from localStorage, use as Bearer header |
| OAuth | Log in manually, save session profile |
| API key | Store in .env, pass as header |
## AI-Powered Scraping
| Approach | Description |
|---|---|
| AI as coding assistant | Give Claude your target site’s structure, it generates the complete scraper |
| AI as extraction engine | Send raw HTML to LLM, ask for structured output. Handles messy HTML automatically |
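The "AI as extraction engine" row boils down to building a prompt that pairs raw HTML with a target schema and asking for JSON back. A minimal sketch, with `call_llm` as a deliberate placeholder for whatever chat-completion client you use (OpenAI, Anthropic, a local model):

```python
import json

EXTRACTION_PROMPT = """Extract the following fields from the HTML below.
Return ONLY valid JSON matching this schema: {schema}

HTML:
{html}"""

def build_extraction_prompt(html: str, schema: dict) -> str:
    """Pair raw HTML with a JSON schema; the LLM tolerates messy markup."""
    return EXTRACTION_PROMPT.format(schema=json.dumps(schema), html=html)

def extract_with_llm(html: str, schema: dict, call_llm) -> dict:
    # call_llm: any function taking a prompt string and returning the model's text.
    raw = call_llm(build_extraction_prompt(html, schema))
    return json.loads(raw)
```

In production you would also validate the returned JSON (e.g., with a Pydantic model) and retry on malformed output.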
### AI-Native Scraping Tools
| Tool | Approach | Best For |
|---|---|---|
| Crawl4AI | Playwright + local LLM inference | Markdown extraction without API costs |
| Firecrawl | Cloud API, returns markdown/JSON | LLM-ready content at scale |
| ScrapeGraphAI | LLM pipelines for extraction | Complex structured data |
## Anti-Bot Protection
| Method | Difficulty to Bypass |
|---|---|
| IP rate limiting | Low — use delays and proxies |
| User-Agent filtering | Low — set realistic headers |
| TLS fingerprinting | High — requires real browser |
| Browser fingerprinting | High — real browser with stealth plugins |
| JS challenges | Medium — headless browsers pass most |
| Behavior analysis | Very high |
## Legal and Ethical Rules
| Practice | Guidance |
|---|---|
| Respect robots.txt | Required in the EU (GDPR); recommended everywhere |
| Rate limiting | Always (3s+ delay between requests) |
| Avoid PII collection | GDPR fines up to 4% annual revenue |
| Respect authentication | Bypassing login protections raises legal risk |
| Content you paid for | Generally lower risk to archive |
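Checking robots.txt can be automated with the standard library's `urllib.robotparser`, so compliance costs one call per URL:

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against robots.txt rules.

    Parses raw robots.txt text here; RobotFileParser can also fetch the
    file itself via set_url() + read() if you prefer.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Call `is_allowed` before queuing each URL and skip (or log) anything disallowed for your user agent.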
## Production Pipeline
1. Authenticate
2. Discover API
3. Discover Content
4. Scrape Content
5. Download Media
6. Organize Output
7. Validate
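One way to wire these stages together is a thin orchestrator that runs named stages in order and records which ones completed, so a failure report can say exactly where the run broke. The stage names and shared-context shape below are hypothetical, a sketch rather than a fixed framework:

```python
def run_pipeline(stages, ctx=None):
    """Run (name, callable) pipeline stages in order against a shared context.

    Each stage mutates the context dict (tokens, discovered endpoints,
    scraped content, ...). Completed stage names are recorded so a crash
    can be attributed to a specific step.
    """
    ctx = ctx if ctx is not None else {}
    completed = []
    for name, stage in stages:
        stage(ctx)
        completed.append(name)
    ctx["completed_stages"] = completed
    return ctx
```

A real run would pass seven stages matching the list above; each stage can then lean on the checkpointing and rate-limiting patterns that follow.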
## Essential Patterns
### State Checkpointing
The most important production pattern. Without it, a crash at item 300 of 500 means starting over.
```python
from pathlib import Path

from pydantic import BaseModel

class ScrapeState(BaseModel):
    lessons: dict[str, str]  # lesson_id -> status ("pending" / "done" / "failed")
    completed: int = 0
    failed: int = 0

def save_state(path: Path, state: ScrapeState) -> None:
    path.write_text(state.model_dump_json(indent=2))
```
Save after every 5-10 items. On restart, skip completed items.
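Resuming is the mirror image of saving: load the checkpoint file if it exists, otherwise start fresh, and only queue items not yet marked done. A sketch (it redefines the same `ScrapeState` model above so the snippet is self-contained):

```python
from pathlib import Path

from pydantic import BaseModel

class ScrapeState(BaseModel):
    lessons: dict[str, str]  # lesson_id -> "pending" / "done" / "failed"
    completed: int = 0
    failed: int = 0

def load_state(path: Path, all_ids: list[str]) -> ScrapeState:
    """Resume from a checkpoint file if present, else start fresh."""
    if path.exists():
        return ScrapeState.model_validate_json(path.read_text())
    return ScrapeState(lessons={i: "pending" for i in all_ids})

def pending_ids(state: ScrapeState) -> list[str]:
    """Everything not yet done — failed items get retried too."""
    return [i for i, status in state.lessons.items() if status != "done"]
```

The main loop then iterates `pending_ids(state)` instead of the full ID list, and a crash at item 300 resumes at item 301.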
### Two-Layer Markdown Cleaning
Layer 1 — CSS Exclusion (at scrape time):
```python
config = CrawlerRunConfig(
    excluded_selector="nav, .sidebar, .footer, [class*='navigation']"
)
```
Layer 2 — Regex Post-Processing (after extraction):
```python
import re

def clean_markdown(text: str) -> str:
    # Drop social-widget residue (like/comment counters)
    text = re.sub(r"^\d+ (?:likes?|comments?).*$", "", text, flags=re.MULTILINE)
    # Collapse runs of 3+ newlines into a single blank line
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```
### Rate Limiting
```python
import asyncio
import time

class RateLimiter:
    """Enforce a minimum delay between requests, shared across tasks."""

    def __init__(self, min_delay: float = 3.0):
        self._min_delay = min_delay
        self._last_request = 0.0
        self._lock = asyncio.Lock()

    async def wait(self):
        async with self._lock:
            elapsed = time.monotonic() - self._last_request
            if elapsed < self._min_delay:
                await asyncio.sleep(self._min_delay - elapsed)
            self._last_request = time.monotonic()
```
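In use, the key point is to share one limiter instance across all workers so the delay is global rather than per-task. A sketch (it includes a compact copy of the same limiter so the snippet stands alone; `fetch` is any async callable you supply):

```python
import asyncio
import time

class RateLimiter:
    """Same limiter as defined above, repeated here for self-containment."""
    def __init__(self, min_delay: float = 3.0):
        self._min_delay = min_delay
        self._last_request = 0.0
        self._lock = asyncio.Lock()

    async def wait(self):
        async with self._lock:
            elapsed = time.monotonic() - self._last_request
            if elapsed < self._min_delay:
                await asyncio.sleep(self._min_delay - elapsed)
            self._last_request = time.monotonic()

async def scrape_all(urls, fetch, limiter):
    # One shared limiter serializes the delay across concurrent tasks.
    async def worker(url):
        await limiter.wait()
        return await fetch(url)
    return await asyncio.gather(*(worker(u) for u in urls))
```

Spawning a fresh `RateLimiter` inside each task would silently defeat the delay, which is the most common mistake with this pattern.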
## Common Mistakes
| Mistake | Fix |
|---|---|
| No rate limiting | Enforce 3s+ delay between requests |
| No checkpointing | Save state every 5-10 items |
| Hardcoded CSS selectors | Prefer API extraction; use fallback chains |
| `networkidle` wait on SPAs | Use `domcontentloaded` + CSS selector waits |
| Saving error pages as content | Check error markers and minimum content length |
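A minimal validator for the last row: reject pages that are too short or that contain error markers before writing them to disk. The marker list and length threshold below are illustrative; extend them per site and content type.

```python
# Markers suggesting an error or login page was captured instead of
# real content (illustrative list — extend per target site).
ERROR_MARKERS = ("page not found", "access denied", "please log in")
MIN_CONTENT_LENGTH = 200  # characters; tune per content type

def is_valid_content(markdown: str) -> bool:
    """Return False for pages that look like errors rather than content."""
    if len(markdown.strip()) < MIN_CONTENT_LENGTH:
        return False
    lowered = markdown.lower()
    return not any(marker in lowered for marker in ERROR_MARKERS)
```

Run this check before saving each page and before marking its checkpoint status as done, so failed items are retried rather than silently archived as junk.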
## Essential Python Packages

```bash
pip install crawl4ai httpx beautifulsoup4 lxml pydantic python-dotenv yt-dlp pdfplumber
```
| Package | Purpose |
|---|---|
| `crawl4ai` | Browser rendering + markdown extraction |
| `httpx` | Async HTTP client for API calls |
| `beautifulsoup4` + `lxml` | HTML parsing (static sites) |
| `pydantic` | Data models, state serialization |
| `python-dotenv` | Loading credentials from .env files |
| `yt-dlp` | Video downloading |
| `pdfplumber` | PDF text extraction |