When to Use This Prompt

Use this prompt when:

  • You need to scrape a website or web application and save all content as markdown
  • You are building a knowledge base from an e-learning platform, community site, wiki, or documentation portal
  • You want content organized for LLM consumption (RAG, fine-tuning, offline reference)
  • You need to handle authentication, SPAs, media, videos, or PDFs
  • You want resume-safe scraping with quality validation

Universal Web Scraping Framework Prompt

A structured, multi-phase prompt that guides an AI assistant to build a complete web scraper for any platform — producing clean, structured markdown files optimized for LLM ingestion, RAG pipelines, or offline knowledge bases.


What This Prompt Produces

| Output | Details |
| --- | --- |
| Complete Python project | 10+ files with clear pipeline steps (01_ through 06_) |
| Resume-safe scraping | JSON checkpointing at every step — interrupted runs resume exactly |
| Structured markdown | YAML frontmatter, folder hierarchy mirroring source structure |
| Quality validation | Automated audit with good/warning/broken classification |
| Media pipeline | Separate steps for images, PDFs, and videos |

The Complete Prompt

Copy everything below and paste it into a new AI session (Claude Code or any coding-capable AI):

You are a Senior Scraping Engineer. Your job is to build a complete, production-grade
web scraper that extracts all content from a target platform and saves it as clean,
structured markdown files optimized for LLM consumption.

CORE PRINCIPLES:
1. API-First: Always prefer direct API calls over browser rendering
2. Resume-Safe: Every pipeline step checkpoints progress to JSON
3. Respectful Scraping: Rate limiting (minimum 3s between requests), proper User-Agent
4. Quality-Validated: Every scraped file is audited for completeness
5. Separation of Concerns: Text, media, PDF, video are independent pipeline steps
6. Structured Output: YAML frontmatter, hierarchical folder organization

---

PHASE 0: DISCOVERY INTERVIEW

Ask the user:
1. "What is the target platform URL?"
2. "What type of site is it?" (E-learning / Community / Docs / Blog / Wiki / Forum)
3. "Does it require login?" (No / Username+password / OAuth / SSO)
4. "What content do you want to scrape?"
5. "Do you need media files?" (Text only / +images / +videos / +PDFs)
6. "What will you use the output for?" (RAG / Obsidian / Offline / Training data)

---

PHASE 1: RECONNAISSANCE

Guide the user through DevTools inspection:

1.1 SITE TYPE CLASSIFICATION
- View Source: full content = server-rendered; empty div = SPA
- Network tab XHR/Fetch: /api/ calls = REST API available
- __NEXT_DATA__ script tag = Next.js app
- /graphql POST = GraphQL

1.2 AUTH METHOD IDENTIFICATION
- Login response sets JWT? Cookie? OAuth redirect? X-API-Key header?

1.3 CONTENT STRUCTURE MAPPING
- Content hierarchy (Courses → Modules → Lessons? Categories → Posts?)
- Pagination patterns (?page= or ?offset=)
- Content types (text, video, PDF, quiz, interactive)

1.4 NETWORK TAB INSPECTION
- URL patterns, HTTP methods, auth headers, response structure
- Save API endpoint inventory
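The site-type checks in 1.1 can be automated as a rough first pass. A minimal sketch (heuristic only — the marker strings and the 200-character threshold are assumptions, and real pages will need manual DevTools confirmation):

```python
import re

def classify_page(html: str) -> str:
    """Rough site-type classification from raw HTML.

    Mirrors the manual checks above: a __NEXT_DATA__ script tag signals
    a Next.js app; a near-empty document after stripping tags suggests an
    SPA shell; substantial visible text suggests server rendering.
    """
    if "__NEXT_DATA__" in html:
        return "nextjs"
    # Drop scripts, then all tags, and measure what text remains.
    text = re.sub(r"<script.*?</script>", "", html, flags=re.S)
    text = re.sub(r"<[^>]+>", "", text).strip()
    if len(text) < 200:          # assumption: SPA shells carry almost no text
        return "spa"
    return "server_rendered"
```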

---

PHASE 2: PROJECT SCAFFOLD

Create the project structure:
{platform}-scraper/
├── 01_create_profile.py      # Browser auth
├── 02_discover_api.py        # API endpoint discovery
├── 03_discover_content.py    # Content hierarchy mapping
├── 04_scrape_content.py      # Main content scraping
├── 04a_audit_content.py      # Quality validation
├── 04b_download_media.py     # Image downloading
├── 04c_download_pdfs.py      # PDF conversion
├── 04d_download_videos.py    # Video downloading
├── 05_organize_output.py     # Cleanup and indexing
├── 06_validate.py            # Final quality report
├── config.py                 # Central configuration
├── models.py                 # Pydantic data models
└── utils/                    # Reusable utilities

---

PHASE 3: DATA MODELS

Pydantic models for: Lesson, Module, Course, DiscoveryState, ScrapeState,
MediaItem, MediaState, AuditResult
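A minimal sketch of what the Lesson/Module/Course models might look like — field names here are illustrative, not prescribed (requires the pydantic package):

```python
from typing import Optional
from pydantic import BaseModel

class Lesson(BaseModel):
    id: str
    title: str
    url: str
    # Handling strategy assigned in Phase 5.3:
    # api_fetch / browser_render / download_file / video_player / spa_content
    url_type: str = "browser_render"
    scraped: bool = False
    output_path: Optional[str] = None

class Module(BaseModel):
    id: str
    title: str
    lessons: list[Lesson] = []

class Course(BaseModel):
    id: str
    title: str
    slug: str
    modules: list[Module] = []
```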

---

PHASE 4: CONFIGURATION

config.py with: BASE_URL, API_BASE_URL, OUTPUT_DIR, REQUEST_DELAY_SECONDS (3.0 min),
MAX_CONCURRENT_SESSIONS, PAGE_TIMEOUT_MS, MAX_RETRIES, CHECKPOINT_EVERY,
ERROR_MARKERS, INCOMPLETE_MARKERS, MIN_CONTENT_LENGTH
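A config.py sketch with illustrative defaults (the URL and marker strings are placeholders to be replaced per platform):

```python
# config.py - central configuration (values are illustrative defaults)
from pathlib import Path

BASE_URL = "https://example-platform.com"   # placeholder: replace with target
API_BASE_URL = BASE_URL + "/api"
OUTPUT_DIR = Path("output")

REQUEST_DELAY_SECONDS = 3.0    # minimum, per the respectful-scraping principle
MAX_CONCURRENT_SESSIONS = 1
PAGE_TIMEOUT_MS = 30_000
MAX_RETRIES = 3
CHECKPOINT_EVERY = 10          # persist state every N items

ERROR_MARKERS = ["404", "Access Denied", "Page not found"]
INCOMPLETE_MARKERS = ["Loading...", "Please wait"]
MIN_CONTENT_LENGTH = 200       # chars of body text below which a file is suspect
```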

---

PHASE 5: PIPELINE IMPLEMENTATION

5.1 Authentication: browser profile with Crawl4AI
5.2 API Discovery: intercept XHR, extract JWT
5.3 Content Discovery: build hierarchy from API or DOM
5.4 Content Scraping: API fetch or browser render → markdown
5.5 Quality Audit: scan, classify good/warning/broken
5.6 Media Pipeline: images, PDFs, videos independently
5.7 Organization: cleanup, generate indexes
5.8 Validation: full quality report

Support flags: --limit N, --retry-failed, --type <url_type>
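The resume-safe checkpointing and the CLI flags might be sketched as follows (the state-file path and state shape are assumptions for illustration):

```python
import argparse
import json
from pathlib import Path

STATE_FILE = Path("state/scrape_state.json")   # hypothetical location

def load_state() -> dict:
    """Return previous progress, or a fresh state on first run."""
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text(encoding="utf-8"))
    return {"done": [], "failed": []}

def save_state(state: dict) -> None:
    """Persist progress so an interrupted run resumes exactly."""
    STATE_FILE.parent.mkdir(parents=True, exist_ok=True)
    STATE_FILE.write_text(json.dumps(state, indent=2), encoding="utf-8")

def parse_args(argv=None):
    p = argparse.ArgumentParser()
    p.add_argument("--limit", type=int, default=None, help="scrape at most N items")
    p.add_argument("--retry-failed", action="store_true", help="re-queue failed items")
    p.add_argument("--type", dest="url_type", default=None, help="only this url_type")
    return p.parse_args(argv)
```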

---

PHASE 6: UTILITY MODULES

auth.py, state_manager.py, rate_limiter.py, markdown_cleaner.py, media_downloader.py
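As one example, rate_limiter.py can be as small as a class that spaces out successive calls (a minimal sketch; a real implementation might add jitter):

```python
# utils/rate_limiter.py - minimal spacing-based rate limiter sketch
import time

class RateLimiter:
    """Blocks so that successive .wait() calls are at least `delay` seconds apart."""

    def __init__(self, delay: float = 3.0):
        self.delay = delay
        self._last = 0.0

    def wait(self) -> None:
        elapsed = time.monotonic() - self._last
        if elapsed < self.delay:
            time.sleep(self.delay - elapsed)   # sleep off the remaining interval
        self._last = time.monotonic()
```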

---

PHASE 7: PLATFORM-SPECIFIC ADAPTATION

Adapt points: login URL, API patterns, content structure, URL classification,
error markers, CSS exclusions, content field names, video detection,
cleanup regex, character replacements for slugification

---

ANTI-PATTERNS TO AVOID:
- No rate limiting → IP blocked
- No checkpointing → lose progress on crash
- Hardcoded CSS selectors → break on UI updates
- No error page detection → save garbage as content
- Browser per item → memory leaks
- networkidle wait on SPAs → never fires (websockets)

SPA CRITICAL PATTERN:
Hash-based SPAs accumulate DOM content. File sizes grow linearly.
Fix: batch processing (5 items), restart browser between batches.
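The batch-restart fix can be sketched like this, with the browser factory and per-item scrape function as placeholders for whatever crawler the project uses (e.g. Crawl4AI):

```python
BATCH_SIZE = 5  # restart the browser after this many items

def scrape_in_batches(items, new_browser, scrape_one):
    """Process items in fixed-size batches, restarting the browser between
    batches so accumulated SPA DOM state is discarded instead of bloating
    every subsequent save."""
    results = []
    for start in range(0, len(items), BATCH_SIZE):
        browser = new_browser()              # fresh process, empty DOM
        try:
            for item in items[start:start + BATCH_SIZE]:
                results.append(scrape_one(browser, item))
        finally:
            browser.close()                  # always release the session
    return results
```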

OUTPUT FORMAT:
YAML frontmatter (title, id, type, course_id, module_id, source_url, video metadata)
Folder hierarchy: output/courses/{course-slug}/{NN-module-slug}/{NN-lesson-slug}.md
Quality targets: >85% good files, <5% broken, all files have frontmatter
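A minimal sketch of a frontmatter writer matching this format (field list from above; the naive quoting assumes no embedded double quotes in values):

```python
def make_frontmatter(meta: dict) -> str:
    """Render a flat YAML frontmatter block; a real implementation
    might use PyYAML for proper escaping."""
    lines = ["---"]
    for key in ("title", "id", "type", "course_id", "module_id", "source_url"):
        if key in meta:
            lines.append(f'{key}: "{meta[key]}"')
    lines.append("---")
    return "\n".join(lines) + "\n\n"
```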

Pipeline Steps Explained

### Step 1: Authentication (01_create_profile.py)

Opens a visible browser window using Crawl4AI. You log in manually. The script waits for you to complete login, verifies the session by visiting a protected page, and saves the browser profile for reuse. This profile is used by all subsequent steps.

### Step 2: API Discovery (02_discover_api.py)

Opens an authenticated browser and navigates through key pages. Injects JavaScript to intercept XHR/Fetch requests and extract localStorage tokens. Captures all API URL patterns and saves them to a JSON state file. This step tells the scraper whether to use API calls or browser rendering.

### Step 3: Content Discovery (03_discover_content.py)

Builds the complete content hierarchy. Strategy A (preferred): call the content structure API and parse courses, modules, and lessons. Strategy B (fallback): render pages and parse the DOM. Each lesson URL is classified into a handling strategy: api_fetch, browser_render, download_file, video_player, or spa_content.

### Step 4: Content Scraping (04_scrape_content.py)

The main scraping loop. For each lesson: classify the URL, fetch via API or browser, validate against error pages and minimum length, add YAML frontmatter, clean the markdown (remove platform UI noise), and save. Checkpoints every N lessons. Supports `--limit 5` for testing and `--retry-failed` for recovery.

### Step 5: Quality Audit (04a_audit_content.py)

Scans all output files. Checks for valid frontmatter, error markers, minimum body length, HTML remnants, and broken images. Classifies each file as good, warning, or broken. Supports `--fix` to delete broken files and reset their state for retry.

### Step 6: Media Pipeline (04b, 04c, 04d)

Three independent scripts. 04b downloads images and rewrites markdown paths to local references. 04c downloads PDFs and converts them to markdown using pdfplumber or PyMuPDF. 04d downloads videos using yt-dlp (YouTube, Vimeo, Loom) or ffmpeg (HLS streams). Each runs independently and can be retried.

### Step 7: Organization (05_organize_output.py)

Cleans all markdown content with regex patterns (removes platform-specific UI artifacts). Generates _index.md files at each hierarchy level. Creates a master index with summary statistics: total files, total size, coverage percentage.

### Step 8: Validation (06_validate.py)

Final quality scan of all output files. Reports good/warning/broken counts by content type and coverage versus discovered content, and generates a quality_report.json. Target: more than 85% good files, less than 5% broken, all files have frontmatter.
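The audit classification at the heart of step 5 can be sketched in a few lines (the marker strings and length threshold are illustrative, and real audits would also check HTML remnants and broken images):

```python
# Sketch of the good/warning/broken classification from the quality audit.
ERROR_MARKERS = ["404", "Access Denied"]   # illustrative examples
MIN_CONTENT_LENGTH = 200                    # chars of body text

def audit_file(text: str) -> str:
    """Classify one scraped markdown file as good, warning, or broken."""
    has_frontmatter = text.startswith("---")
    body = text.split("---", 2)[-1] if has_frontmatter else text
    if not has_frontmatter or any(m in body for m in ERROR_MARKERS):
        return "broken"
    if len(body.strip()) < MIN_CONTENT_LENGTH:
        return "warning"
    return "good"
```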

Usage Tips

Best Results

  • Use Claude Code or a coding-capable AI session — the prompt generates complete Python files
  • Provide the target URL upfront to skip back-and-forth
  • Share screenshots of the Network tab and page structure for faster adaptation
  • Test each pipeline step individually before running the full pipeline
  • Always start with `python 04_scrape_content.py --limit 5` before a full run
Proven Framework

This prompt is derived from two production scrapers that successfully extracted 1,237 quality markdown files: an e-learning LMS (789 files, 96% quality) and a community platform (448 lessons, 271 media files).

Ethical Use

  • Always check robots.txt and terms of service before scraping
  • Use rate limiting (3s+ between requests) — never overload a server
  • Scrape only content you have legitimate access to
  • Respect copyright — scraped content is for personal or internal use only
Windows Compatibility

Set `PYTHONUTF8=1` and `PYTHONIOENCODING=utf-8` before running. Always use `python -u` for unbuffered output. Use `pathlib.Path` for cross-platform path handling.