line-today-scrape
===================

Prototype respectful crawler for https://today.line.me/th/ (Thai locale).

Overview
--------

This project contains a conservative, policy-first crawler prototype written in Python. It demonstrates:

- Robots.txt fetching and policy enforcement
- Rate-limited async fetching
- HTML extraction (meta tags + JSON-LD fallback)
- Local storage of raw snapshots and parsed JSON

Note: This is a prototype. Always review and run responsibly.

Quickstart
----------

1. Install dependencies (recommend using poetry or virtualenv)

       poetry install

2. Run the crawler in dry-run mode (fetch limited pages):

       python -m linetoday.cli --dry-run --limit 5

Files
-----

- `linetoday/robots.py` - Robots & policy manager
- `linetoday/fetcher.py` - Async HTTP fetcher with rate limiting
- `linetoday/frontier.py` - URL frontier and canonicalization
- `linetoday/extractor.py` - Article extraction logic
- `linetoday/storage.py` - Local storage for snapshots and parsed JSON
- `linetoday/cli.py` - CLI entrypoint

License
-------

Prototype for demonstration only.
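
Example: policy check before fetching
-------------------------------------

The policy-first idea described in the overview (check robots.txt rules, then pace requests) can be sketched with the standard library alone. This is a minimal illustration and not the code in `linetoday/robots.py` or `linetoday/fetcher.py`: the names `USER_AGENT`, `SAMPLE_ROBOTS`, `build_policy`, `RateLimiter`, and `plan_fetches` are hypothetical, the robots.txt body is made up, and no HTTP request is actually sent.

```python
import asyncio
import time
from urllib.robotparser import RobotFileParser

# Hypothetical values for illustration; not taken from the project.
USER_AGENT = "linetoday-prototype"
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""


def build_policy(robots_txt: str) -> RobotFileParser:
    """Parse a robots.txt body that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


class RateLimiter:
    """Space calls at least `min_interval` seconds apart.

    Assumes a single sequential caller; concurrent callers would need a lock.
    """

    def __init__(self, min_interval: float) -> None:
        self.min_interval = min_interval
        self._next_allowed = 0.0

    async def wait(self) -> None:
        delay = self._next_allowed - time.monotonic()
        if delay > 0:
            await asyncio.sleep(delay)
        self._next_allowed = time.monotonic() + self.min_interval


async def plan_fetches(urls, policy, limiter):
    """Return the URLs that pass the robots check, paced by the limiter."""
    allowed = []
    for url in urls:
        if not policy.can_fetch(USER_AGENT, url):
            continue  # disallowed by robots.txt: skip without waiting
        await limiter.wait()
        allowed.append(url)  # a real crawler would issue the request here
    return allowed


if __name__ == "__main__":
    candidates = [
        "https://today.line.me/th/",
        "https://today.line.me/private/x",
    ]
    policy = build_policy(SAMPLE_ROBOTS)
    print(asyncio.run(plan_fetches(candidates, policy, RateLimiter(1.0))))
```

Checking the policy before waiting on the limiter means disallowed URLs cost no request budget; whether the prototype orders these steps the same way is an assumption.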