When to Use This Prompt

Use this prompt when:

  • You need to scrape a website or web application and save all content as markdown
  • You are building a knowledge base from an e-learning platform, community site, wiki, or documentation portal
  • You want content organized for LLM consumption (RAG, fine-tuning, offline reference)
  • You need to handle authentication, SPAs, media, videos, or PDFs
  • You want resume-safe scraping with quality validation

Universal Web Scraping Framework Prompt

A structured, multi-phase prompt that guides an AI assistant to build a complete web scraper for any platform — producing clean, structured markdown files optimized for LLM ingestion, RAG pipelines, or offline knowledge bases.


What This Prompt Produces

| Output | Details |
| --- | --- |
| Complete Python project | 10+ files with clear pipeline steps (01_ through 06_) |
| Resume-safe scraping | JSON checkpointing at every step — interrupted runs resume exactly |
| Structured markdown | YAML frontmatter, folder hierarchy mirroring source structure |
| Quality validation | Automated audit with good/warning/broken classification |
| Media pipeline | Separate steps for images, PDFs, and videos |

The Complete Prompt

Copy everything below and paste it into a new AI session (Claude Code or any coding-capable AI):

You are a Senior Scraping Engineer. Your job is to build a complete, production-grade
web scraper that extracts all content from a target platform and saves it as clean,
structured markdown files optimized for LLM consumption.

CORE PRINCIPLES:
1. API-First: Always prefer direct API calls over browser rendering
2. Resume-Safe: Every pipeline step checkpoints progress to JSON
3. Respectful Scraping: Rate limiting (minimum 3s between requests), proper User-Agent
4. Quality-Validated: Every scraped file is audited for completeness
5. Separation of Concerns: Text, media, PDF, video are independent pipeline steps
6. Structured Output: YAML frontmatter, hierarchical folder organization

---

PHASE 0: DISCOVERY INTERVIEW

Ask the user:
1. "What is the target platform URL?"
2. "What type of site is it?" (E-learning / Community / Docs / Blog / Wiki / Forum)
3. "Does it require login?" (No / Username+password / OAuth / SSO)
4. "What content do you want to scrape?"
5. "Do you need media files?" (Text only / +images / +videos / +PDFs)
6. "What will you use the output for?" (RAG / Obsidian / Offline / Training data)

---

PHASE 1: RECONNAISSANCE

Guide the user through DevTools inspection:

1.1 SITE TYPE CLASSIFICATION
- View Source: full content = server-rendered; empty div = SPA
- Network tab XHR/Fetch: /api/ calls = REST API available
- __NEXT_DATA__ script tag = Next.js app
- /graphql POST = GraphQL

1.2 AUTH METHOD IDENTIFICATION
- Login response sets JWT? Cookie? OAuth redirect? X-API-Key header?

1.3 CONTENT STRUCTURE MAPPING
- Content hierarchy (Courses → Modules → Lessons? Categories → Posts?)
- Pagination patterns (?page= or ?offset=)
- Content types (text, video, PDF, quiz, interactive)

1.4 NETWORK TAB INSPECTION
- URL patterns, HTTP methods, auth headers, response structure
- Save API endpoint inventory
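The site-type checks in 1.1 can be automated as a rough first pass. A minimal sketch (heuristic only — the marker strings and the 200-character threshold are assumptions, and real pages will need manual DevTools confirmation):

```python
import re

def classify_page(html: str) -> str:
    """Rough site-type classification from raw HTML.

    Mirrors the manual checks above: a __NEXT_DATA__ script tag signals
    a Next.js app; a near-empty document after stripping tags suggests an
    SPA shell; substantial visible text suggests server rendering.
    """
    if "__NEXT_DATA__" in html:
        return "nextjs"
    # Drop scripts, then all tags, and measure what text remains.
    text = re.sub(r"<script.*?</script>", "", html, flags=re.S)
    text = re.sub(r"<[^>]+>", "", text).strip()
    if len(text) < 200:          # assumption: SPA shells carry almost no text
        return "spa"
    return "server_rendered"
```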

---

PHASE 2: PROJECT SCAFFOLD

Create the project structure:
{platform}-scraper/
├── 01_create_profile.py      # Browser auth
├── 02_discover_api.py        # API endpoint discovery
├── 03_discover_content.py    # Content hierarchy mapping
├── 04_scrape_content.py      # Main content scraping
├── 04a_audit_content.py      # Quality validation
├── 04b_download_media.py     # Image downloading
├── 04c_download_pdfs.py      # PDF conversion
├── 04d_download_videos.py    # Video downloading
├── 05_organize_output.py     # Cleanup and indexing
├── 06_validate.py            # Final quality report
├── config.py                 # Central configuration
├── models.py                 # Pydantic data models
└── utils/                    # Reusable utilities

---

PHASE 3: DATA MODELS

Pydantic models for: Lesson, Module, Course, DiscoveryState, ScrapeState,
MediaItem, MediaState, AuditResult
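A minimal sketch of what the Lesson/Module/Course models might look like — field names here are illustrative, not prescribed (requires the pydantic package):

```python
from typing import Optional
from pydantic import BaseModel

class Lesson(BaseModel):
    id: str
    title: str
    url: str
    # Handling strategy assigned in Phase 5.3:
    # api_fetch / browser_render / download_file / video_player / spa_content
    url_type: str = "browser_render"
    scraped: bool = False
    output_path: Optional[str] = None

class Module(BaseModel):
    id: str
    title: str
    lessons: list[Lesson] = []

class Course(BaseModel):
    id: str
    title: str
    slug: str
    modules: list[Module] = []
```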

---

PHASE 4: CONFIGURATION

config.py with: BASE_URL, API_BASE_URL, OUTPUT_DIR, REQUEST_DELAY_SECONDS (3.0 min),
MAX_CONCURRENT_SESSIONS, PAGE_TIMEOUT_MS, MAX_RETRIES, CHECKPOINT_EVERY,
ERROR_MARKERS, INCOMPLETE_MARKERS, MIN_CONTENT_LENGTH
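A config.py sketch with illustrative defaults (the URL and marker strings are placeholders to be replaced per platform):

```python
# config.py - central configuration (values are illustrative defaults)
from pathlib import Path

BASE_URL = "https://example-platform.com"   # placeholder: replace with target
API_BASE_URL = BASE_URL + "/api"
OUTPUT_DIR = Path("output")

REQUEST_DELAY_SECONDS = 3.0    # minimum, per the respectful-scraping principle
MAX_CONCURRENT_SESSIONS = 1
PAGE_TIMEOUT_MS = 30_000
MAX_RETRIES = 3
CHECKPOINT_EVERY = 10          # persist state every N items

ERROR_MARKERS = ["404", "Access Denied", "Page not found"]
INCOMPLETE_MARKERS = ["Loading...", "Please wait"]
MIN_CONTENT_LENGTH = 200       # chars of body text below which a file is suspect
```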

---

PHASE 5: PIPELINE IMPLEMENTATION

5.1 Authentication: browser profile with Crawl4AI
5.2 API Discovery: intercept XHR, extract JWT
5.3 Content Discovery: build hierarchy from API or DOM
5.4 Content Scraping: API fetch or browser render → markdown
5.5 Quality Audit: scan, classify good/warning/broken
5.6 Media Pipeline: images, PDFs, videos independently
5.7 Organization: cleanup, generate indexes
5.8 Validation: full quality report

Support flags: --limit N, --retry-failed, --type <url_type>
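The resume-safe checkpointing and the CLI flags might be sketched as follows (the state-file path and state shape are assumptions for illustration):

```python
import argparse
import json
from pathlib import Path

STATE_FILE = Path("state/scrape_state.json")   # hypothetical location

def load_state() -> dict:
    """Return previous progress, or a fresh state on first run."""
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text(encoding="utf-8"))
    return {"done": [], "failed": []}

def save_state(state: dict) -> None:
    """Persist progress so an interrupted run resumes exactly."""
    STATE_FILE.parent.mkdir(parents=True, exist_ok=True)
    STATE_FILE.write_text(json.dumps(state, indent=2), encoding="utf-8")

def parse_args(argv=None):
    p = argparse.ArgumentParser()
    p.add_argument("--limit", type=int, default=None, help="scrape at most N items")
    p.add_argument("--retry-failed", action="store_true", help="re-queue failed items")
    p.add_argument("--type", dest="url_type", default=None, help="only this url_type")
    return p.parse_args(argv)
```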

---

PHASE 6: UTILITY MODULES

auth.py, state_manager.py, rate_limiter.py, markdown_cleaner.py, media_downloader.py
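As one example, rate_limiter.py can be as small as a class that spaces out successive calls (a minimal sketch; a real implementation might add jitter):

```python
# utils/rate_limiter.py - minimal spacing-based rate limiter sketch
import time

class RateLimiter:
    """Blocks so that successive .wait() calls are at least `delay` seconds apart."""

    def __init__(self, delay: float = 3.0):
        self.delay = delay
        self._last = 0.0

    def wait(self) -> None:
        elapsed = time.monotonic() - self._last
        if elapsed < self.delay:
            time.sleep(self.delay - elapsed)   # sleep off the remaining interval
        self._last = time.monotonic()
```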

---

PHASE 7: PLATFORM-SPECIFIC ADAPTATION

Adapt points: login URL, API patterns, content structure, URL classification,
error markers, CSS exclusions, content field names, video detection,
cleanup regex, character replacements for slugification

---

ANTI-PATTERNS TO AVOID:
- No rate limiting → IP blocked
- No checkpointing → lose progress on crash
- Hardcoded CSS selectors → break on UI updates
- No error page detection → save garbage as content
- Browser per item → memory leaks
- networkidle wait on SPAs → never fires (websockets)

SPA CRITICAL PATTERN:
Hash-based SPAs accumulate DOM content. File sizes grow linearly.
Fix: batch processing (5 items), restart browser between batches.
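The batch-restart fix can be sketched like this, with the browser factory and per-item scrape function as placeholders for whatever crawler the project uses (e.g. Crawl4AI):

```python
BATCH_SIZE = 5  # restart the browser after this many items

def scrape_in_batches(items, new_browser, scrape_one):
    """Process items in fixed-size batches, restarting the browser between
    batches so accumulated SPA DOM state is discarded instead of bloating
    every subsequent save."""
    results = []
    for start in range(0, len(items), BATCH_SIZE):
        browser = new_browser()              # fresh process, empty DOM
        try:
            for item in items[start:start + BATCH_SIZE]:
                results.append(scrape_one(browser, item))
        finally:
            browser.close()                  # always release the session
    return results
```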

OUTPUT FORMAT:
YAML frontmatter (title, id, type, course_id, module_id, source_url, video metadata)
Folder hierarchy: output/courses/{course-slug}/{NN-module-slug}/{NN-lesson-slug}.md
Quality targets: >85% good files, <5% broken, all files have frontmatter
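A minimal sketch of a frontmatter writer matching this format (field list from above; the naive quoting assumes no embedded double quotes in values):

```python
def make_frontmatter(meta: dict) -> str:
    """Render a flat YAML frontmatter block; a real implementation
    might use PyYAML for proper escaping."""
    lines = ["---"]
    for key in ("title", "id", "type", "course_id", "module_id", "source_url"):
        if key in meta:
            lines.append(f'{key}: "{meta[key]}"')
    lines.append("---")
    return "\n".join(lines) + "\n\n"
```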

Pipeline Steps Explained

### Step 1: Authentication (01_create_profile.py)

Opens a visible browser window using Crawl4AI. You log in manually. The script waits for you to complete login, verifies the session by visiting a protected page, and saves the browser profile for reuse. This profile is used by all subsequent steps.

### Step 2: API Discovery (02_discover_api.py)

Opens an authenticated browser and navigates through key pages. Injects JavaScript to intercept XHR/Fetch requests and extract localStorage tokens. Captures all API URL patterns and saves them to a JSON state file. This step tells the scraper whether to use API calls or browser rendering.

### Step 3: Content Discovery (03_discover_content.py)

Builds the complete content hierarchy. Strategy A (preferred): call the content structure API and parse courses, modules, and lessons. Strategy B (fallback): render pages and parse the DOM. Each lesson URL is classified into a handling strategy: api_fetch, browser_render, download_file, video_player, or spa_content.

### Step 4: Content Scraping (04_scrape_content.py)

The main scraping loop. For each lesson: classify the URL, fetch via API or browser, validate against error pages and minimum length, add YAML frontmatter, clean the markdown (remove platform UI noise), and save. Checkpoints every N lessons. Supports `--limit 5` for testing and `--retry-failed` for recovery.

### Step 5: Quality Audit (04a_audit_content.py)

Scans all output files. Checks for valid frontmatter, error markers, minimum body length, HTML remnants, and broken images. Classifies each file as good, warning, or broken. Supports `--fix` to delete broken files and reset their state for retry.

### Step 6: Media Pipeline (04b, 04c, 04d)

Three independent scripts. 04b downloads images and rewrites markdown paths to local references. 04c downloads PDFs and converts them to markdown using pdfplumber or PyMuPDF. 04d downloads videos using yt-dlp (YouTube, Vimeo, Loom) or ffmpeg (HLS streams). Each runs independently and can be retried.

### Step 7: Organization (05_organize_output.py)

Cleans all markdown content with regex patterns (removes platform-specific UI artifacts). Generates _index.md files at each hierarchy level. Creates a master index with summary statistics: total files, total size, coverage percentage.

### Step 8: Validation (06_validate.py)

Final quality scan of all output files. Reports good/warning/broken counts by content type and coverage versus discovered content, and generates a quality_report.json. Target: more than 85% good files, less than 5% broken, all files have frontmatter.
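The audit classification at the heart of step 5 can be sketched in a few lines (the marker strings and length threshold are illustrative, and real audits would also check HTML remnants and broken images):

```python
# Sketch of the good/warning/broken classification from the quality audit.
ERROR_MARKERS = ["404", "Access Denied"]   # illustrative examples
MIN_CONTENT_LENGTH = 200                    # chars of body text

def audit_file(text: str) -> str:
    """Classify one scraped markdown file as good, warning, or broken."""
    has_frontmatter = text.startswith("---")
    body = text.split("---", 2)[-1] if has_frontmatter else text
    if not has_frontmatter or any(m in body for m in ERROR_MARKERS):
        return "broken"
    if len(body.strip()) < MIN_CONTENT_LENGTH:
        return "warning"
    return "good"
```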

Usage Tips

Best Results

  • Use Claude Code or a coding-capable AI session — the prompt generates complete Python files
  • Provide the target URL upfront to skip back-and-forth
  • Share screenshots of the Network tab and page structure for faster adaptation
  • Test each pipeline step individually before running the full pipeline
  • Always start with `python 04_scrape_content.py --limit 5` before a full run
Proven Framework

This prompt is derived from two production scrapers that successfully extracted 1,237 quality markdown files: an e-learning LMS (789 files, 96% quality) and a community platform (448 lessons, 271 media files).

Ethical Use

  • Always check robots.txt and terms of service before scraping
  • Use rate limiting (3s+ between requests) — never overload a server
  • Scrape only content you have legitimate access to
  • Respect copyright — scraped content is for personal or internal use only
Windows Compatibility

Set `PYTHONUTF8=1` and `PYTHONIOENCODING=utf-8` before running. Always use `python -u` for unbuffered output. Use `pathlib.Path` for cross-platform path handling.