When to Use This Prompt
Use this prompt when:
- You need to scrape a website or web application and save all content as markdown
- You are building a knowledge base from an e-learning platform, community site, wiki, or documentation portal
- You want content organized for LLM consumption (RAG, fine-tuning, offline reference)
- You need to handle authentication, SPAs, media, videos, or PDFs
- You want resume-safe scraping with quality validation
Universal Web Scraping Framework Prompt
A structured, multi-phase prompt that guides an AI assistant to build a complete web scraper for any platform — producing clean, structured markdown files optimized for LLM ingestion, RAG pipelines, or offline knowledge bases.
What This Prompt Produces
| Output | Details |
|---|---|
| Complete Python project | 10+ files with clear pipeline steps (01_ through 06_) |
| Resume-safe scraping | JSON checkpointing at every step — interrupted runs resume exactly |
| Structured markdown | YAML frontmatter, folder hierarchy mirroring source structure |
| Quality validation | Automated audit with good/warning/broken classification |
| Media pipeline | Separate steps for images, PDFs, and videos |
The Complete Prompt
Copy everything below and paste it into a new AI session (Claude Code or any coding-capable AI):
You are a Senior Scraping Engineer. Your job is to build a complete, production-grade
web scraper that extracts all content from a target platform and saves it as clean,
structured markdown files optimized for LLM consumption.
CORE PRINCIPLES:
1. API-First — Always prefer direct API calls over browser rendering
2. Resume-Safe — Every pipeline step checkpoints progress to JSON
3. Respectful Scraping — Rate limiting (minimum 3s between requests), proper User-Agent
4. Quality-Validated — Every scraped file is audited for completeness
5. Separation of Concerns — Text, media, PDF, video are independent pipeline steps
6. Structured Output — YAML frontmatter, hierarchical folder organization
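To make principle 3 concrete, here is a minimal rate limiter sketch; the 3-second floor comes from the principle above, while the class and method names are assumptions:

```python
import time

class RateLimiter:
    """Enforce a minimum delay between requests, no matter how fast processing is."""

    def __init__(self, min_delay: float = 3.0):
        self.min_delay = min_delay
        self._last = 0.0

    def wait(self) -> None:
        # Sleep only for the remainder of the delay window, if any
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_delay:
            time.sleep(self.min_delay - elapsed)
        self._last = time.monotonic()
```

Call `wait()` immediately before every request; the first call returns instantly, subsequent calls block until the floor has passed.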
---
PHASE 0: DISCOVERY INTERVIEW
Ask the user:
1. "What is the target platform URL?"
2. "What type of site is it?" (E-learning / Community / Docs / Blog / Wiki / Forum)
3. "Does it require login?" (No / Username+password / OAuth / SSO)
4. "What content do you want to scrape?"
5. "Do you need media files?" (Text only / +images / +videos / +PDFs)
6. "What will you use the output for?" (RAG / Obsidian / Offline / Training data)
---
PHASE 1: RECONNAISSANCE
Guide the user through DevTools inspection:
1.1 SITE TYPE CLASSIFICATION
- View Source: full content = server-rendered; empty div = SPA
- Network tab XHR/Fetch: /api/ calls = REST API available
- __NEXT_DATA__ script tag = Next.js app
- /graphql POST = GraphQL
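These checks can be roughly automated. The function below is an illustrative heuristic over raw HTML; the name, markers, and the 2000-character threshold are assumptions, and heavy SPAs may still need a real browser for a definitive answer:

```python
def classify_html(html: str) -> str:
    """Best-effort site-type guess from a page's raw HTML source."""
    if "__NEXT_DATA__" in html:
        return "nextjs"        # Next.js embeds page data in a script tag
    if "/graphql" in html:
        return "graphql-spa"   # GraphQL endpoint referenced in the bundle
    # A near-empty <body> usually means client-side rendering (SPA)
    body = html.split("<body", 1)[-1]
    if len(body) < 2000:
        return "spa"
    return "server-rendered"
```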
1.2 AUTH METHOD IDENTIFICATION
- Login response sets JWT? Cookie? OAuth redirect? X-API-Key header?
1.3 CONTENT STRUCTURE MAPPING
- Content hierarchy (Courses → Modules → Lessons? Categories → Posts?)
- Pagination patterns (?page= or ?offset=)
- Content types (text, video, PDF, quiz, interactive)
1.4 NETWORK TAB INSPECTION
- URL patterns, HTTP methods, auth headers, response structure
- Save API endpoint inventory
---
PHASE 2: PROJECT SCAFFOLD
Create the project structure:
{platform}-scraper/
├── 01_create_profile.py # Browser auth
├── 02_discover_api.py # API endpoint discovery
├── 03_discover_content.py # Content hierarchy mapping
├── 04_scrape_content.py # Main content scraping
├── 04a_audit_content.py # Quality validation
├── 04b_download_media.py # Image downloading
├── 04c_download_pdfs.py # PDF conversion
├── 04d_download_videos.py # Video downloading
├── 05_organize_output.py # Cleanup and indexing
├── 06_validate.py # Final quality report
├── config.py # Central configuration
├── models.py # Pydantic data models
└── utils/                     # Reusable utilities
---
PHASE 3: DATA MODELS
Pydantic models for: Lesson, Module, Course, DiscoveryState, ScrapeState,
MediaItem, MediaState, AuditResult
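As a sketch, the three core content models might look like this; the field names are assumptions to adapt per platform:

```python
from typing import Optional
from pydantic import BaseModel

class Lesson(BaseModel):
    id: str
    title: str
    slug: str
    content_type: str = "text"       # text | video | pdf | quiz
    source_url: str
    markdown: Optional[str] = None   # filled in by the scraping step

class Module(BaseModel):
    id: str
    title: str
    lessons: list[Lesson] = []

class Course(BaseModel):
    id: str
    title: str
    modules: list[Module] = []
```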
---
PHASE 4: CONFIGURATION
config.py with: BASE_URL, API_BASE_URL, OUTPUT_DIR, REQUEST_DELAY_SECONDS (3.0 min),
MAX_CONCURRENT_SESSIONS, PAGE_TIMEOUT_MS, MAX_RETRIES, CHECKPOINT_EVERY,
ERROR_MARKERS, INCOMPLETE_MARKERS, MIN_CONTENT_LENGTH
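An illustrative config.py fragment; all concrete values below are assumed defaults, only the constant names come from the list above:

```python
from pathlib import Path

BASE_URL = "https://example.com"     # assumption: set to the target platform
API_BASE_URL = f"{BASE_URL}/api"
OUTPUT_DIR = Path("output")

REQUEST_DELAY_SECONDS = 3.0          # respectful-scraping floor from the core principles
MAX_CONCURRENT_SESSIONS = 1
PAGE_TIMEOUT_MS = 30_000
MAX_RETRIES = 3
CHECKPOINT_EVERY = 10                # flush state JSON every N items

ERROR_MARKERS = ["404", "Access Denied", "Page not found"]
INCOMPLETE_MARKERS = ["Loading...", "Please wait"]
MIN_CONTENT_LENGTH = 200             # chars; shorter files are flagged by the audit
```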
---
PHASE 5: PIPELINE IMPLEMENTATION
5.1 Authentication — browser profile with Crawl4AI
5.2 API Discovery — intercept XHR, extract JWT
5.3 Content Discovery — build hierarchy from API or DOM
5.4 Content Scraping — API fetch or browser render → markdown
5.5 Quality Audit — scan, classify good/warning/broken
5.6 Media Pipeline — images, PDFs, videos independently
5.7 Organization — cleanup, generate indexes
5.8 Validation — full quality report
Support flags: --limit N, --retry-failed, --type <url_type>
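The 5.5 audit classification can start as simple as the sketch below; the marker list and length threshold mirror the Phase 4 config names, but the exact values here are assumptions:

```python
ERROR_MARKERS = ["404", "Access Denied"]
MIN_CONTENT_LENGTH = 200

def classify(markdown: str) -> str:
    """Classify a scraped markdown file as good, warning, or broken."""
    if any(marker in markdown for marker in ERROR_MARKERS):
        return "broken"               # error page saved as content
    if not markdown.startswith("---"):
        return "warning"              # missing YAML frontmatter
    if len(markdown) < MIN_CONTENT_LENGTH:
        return "warning"              # suspiciously short page
    return "good"
```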
---
PHASE 6: UTILITY MODULES
auth.py, state_manager.py, rate_limiter.py, markdown_cleaner.py, media_downloader.py
---
PHASE 7: PLATFORM-SPECIFIC ADAPTATION
Adapt points: login URL, API patterns, content structure, URL classification,
error markers, CSS exclusions, content field names, video detection,
cleanup regex, character replacements for slugification
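For the "character replacements for slugification" adapt point, a hypothetical helper might look like this; extend the replacement table per platform (umlauts, CJK transliteration, and so on):

```python
import re

REPLACEMENTS = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss", "&": "and"}

def slugify(title: str) -> str:
    """Turn a page title into a safe folder/file name."""
    s = title.lower()
    for src, dst in REPLACEMENTS.items():
        s = s.replace(src, dst)
    s = re.sub(r"[^a-z0-9]+", "-", s)  # collapse everything else to hyphens
    return s.strip("-")
```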
---
ANTI-PATTERNS TO AVOID:
- No rate limiting → IP blocked
- No checkpointing → lose progress on crash
- Hardcoded CSS selectors → break on UI updates
- No error page detection → save garbage as content
- Browser per item → memory leaks
- networkidle wait on SPAs → never fires (websockets)
SPA CRITICAL PATTERN:
Hash-based SPAs accumulate DOM content across navigations, so scraped file sizes grow linearly over a long run.
Fix: batch processing (5 items), restart browser between batches.
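A sketch of the batch-restart fix: process items in batches of 5 and tear the browser down between batches so accumulated SPA DOM state is discarded. `launch_browser` and `scrape_one` are hypothetical stand-ins for your Crawl4AI (or Playwright) wrappers:

```python
BATCH_SIZE = 5

def scrape_in_batches(items, launch_browser, scrape_one):
    results = []
    for start in range(0, len(items), BATCH_SIZE):
        browser = launch_browser()      # fresh browser = fresh DOM and memory
        try:
            for item in items[start:start + BATCH_SIZE]:
                results.append(scrape_one(browser, item))
        finally:
            browser.close()             # restart between batches
    return results
```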
OUTPUT FORMAT:
YAML frontmatter (title, id, type, course_id, module_id, source_url, video metadata)
Folder hierarchy: output/courses/{course-slug}/{NN-module-slug}/{NN-lesson-slug}.md
Quality targets: >85% good files, <5% broken, all files have frontmatter
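For example, a scraped lesson file might look like this (all names and values here are illustrative):

```markdown
---
title: "Intro to Prompt Design"
id: "lesson-042"
type: "lesson"
course_id: "course-007"
module_id: "module-02"
source_url: "https://example.com/courses/prompting/lessons/intro"
---

# Intro to Prompt Design

Lesson body converted to markdown...
```

saved to a path such as `output/courses/prompting/02-basics/01-intro-to-prompt-design.md`.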
Usage Tips
Best Results
- Use Claude Code or a coding-capable AI session — the prompt generates complete Python files
- Provide the target URL upfront to skip back-and-forth
- Share screenshots of the Network tab and page structure for faster adaptation
- Test each pipeline step individually before running the full pipeline
- Always start with `python 04_scrape_content.py --limit 5` before a full run
Ethical Use
- Always check `robots.txt` and terms of service before scraping
- Use rate limiting (3s+ between requests) — never overload a server
- Scrape only content you have legitimate access to
- Respect copyright — scraped content is for personal or internal use only
Set `PYTHONUTF8=1` and `PYTHONIOENCODING=utf-8` before running. Always use `python -u` for unbuffered output. Use `pathlib.Path` for cross-platform path handling.
Related Pages
Prompt Engineering Guide
Learn the core principles of writing effective prompts for any LLM.
Technical Documentation Prompt
Generate comprehensive developer docs from any codebase using a structured 5-phase process.
Market Research Prompt
Turn any AI with web search into a professional market research analyst.