line-today-scrape
Prototype respectful crawler for https://today.line.me/th/ (Thai locale).
Overview
This project contains a conservative, policy-first crawler prototype written in Python. It demonstrates:
- Robots.txt fetching and policy enforcement
- Rate-limited async fetching
- HTML extraction (meta tags + JSON-LD fallback)
- Local storage of raw snapshots and parsed JSON
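As a rough illustration of the robots.txt enforcement listed above, the following sketch uses only the standard library's `urllib.robotparser`. It is not the code in `linetoday/robots.py`. The user-agent string, the helper names `build_policy` and `is_allowed`, and the sample rules are all made up for the example; the sample is not the site's actual robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical user-agent string; the prototype may identify itself differently.
USER_AGENT = "linetoday-prototype"

# Made-up rules for illustration only, NOT the real robots.txt of today.line.me.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_policy(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, url: str, user_agent: str = USER_AGENT) -> bool:
    """Return True only if the parsed rules permit fetching `url`."""
    return parser.can_fetch(user_agent, url)


policy = build_policy(SAMPLE_ROBOTS)

if __name__ == "__main__":
    for url in ("https://today.line.me/th/", "https://today.line.me/private/x"):
        print(url, is_allowed(policy, url))
    print("crawl delay:", policy.crawl_delay(USER_AGENT))
```

A policy-first crawler would call a check like `is_allowed` before every request and skip any URL that fails it, rather than filtering after the fetch.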
Note: This is a prototype. Always review the code before running it, and run it responsibly.
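One part of running responsibly is spacing requests out. The sketch below shows one possible way to rate-limit async fetches with `asyncio`. It is an assumption about the general technique, not the implementation in `linetoday/fetcher.py`. The names `RateLimiter` and `fetch_all` are hypothetical, and `fetch` stands in for whatever coroutine actually performs the HTTP request.

```python
import asyncio
import time
from typing import Awaitable, Callable, List


class RateLimiter:
    """Space out callers so that starts are at least `min_interval` seconds apart."""

    def __init__(self, min_interval: float) -> None:
        self.min_interval = min_interval
        self._lock = asyncio.Lock()
        self._next_allowed = 0.0  # monotonic time of the next permitted start

    async def wait(self) -> None:
        # The lock serializes callers, so each one sleeps until its own slot.
        async with self._lock:
            now = time.monotonic()
            delay = self._next_allowed - now
            if delay > 0:
                await asyncio.sleep(delay)
            self._next_allowed = max(now, self._next_allowed) + self.min_interval


async def fetch_all(
    urls: List[str],
    fetch: Callable[[str], Awaitable[str]],
    min_interval: float = 1.0,
) -> List[str]:
    """Fetch every URL through `fetch`, never starting two requests too close together."""
    # Created inside the coroutine so the lock belongs to the running event loop.
    limiter = RateLimiter(min_interval)

    async def fetch_one(url: str) -> str:
        await limiter.wait()
        return await fetch(url)

    # gather() returns results in the same order as the input URLs.
    return list(await asyncio.gather(*(fetch_one(u) for u in urls)))
```

In a real crawler the interval could be taken from the site's `Crawl-delay` value when one is present, with a conservative default otherwise.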
Quickstart
- Install dependencies (Poetry or a virtualenv is recommended):
    poetry install
- Run the crawler in dry-run mode (fetches a limited number of pages):
    python -m linetoday.cli --dry-run --limit 5
Files
- linetoday/robots.py: Robots and policy manager
- linetoday/fetcher.py: Async HTTP fetcher with rate limiting
- linetoday/frontier.py: URL frontier and canonicalization
- linetoday/extractor.py: Article extraction logic
- linetoday/storage.py: Local storage for snapshots and parsed JSON
- linetoday/cli.py: CLI entrypoint
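To make the extraction step more concrete, here is a minimal sketch of reading meta tags first and falling back to JSON-LD, using only the standard library. It is not the code in `linetoday/extractor.py`. The function name `extract_article`, the output keys, and the choice of Open Graph and schema.org fields are assumptions made for the example, and the sample HTML is invented.

```python
import json
from html.parser import HTMLParser
from typing import Any, Dict, List, Optional


class _ArticleParser(HTMLParser):
    """Collect <meta> tags and any JSON-LD <script> payloads from a page."""

    def __init__(self) -> None:
        super().__init__()
        self.meta: Dict[str, str] = {}
        self.json_ld: List[Any] = []
        self._in_json_ld = False
        self._buffer: List[str] = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta":
            key = attr.get("property") or attr.get("name")
            if key and attr.get("content"):
                self.meta.setdefault(key, attr["content"])  # keep first occurrence
        elif tag == "script" and attr.get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            try:
                self.json_ld.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # ignore malformed JSON-LD rather than failing the page


def extract_article(html: str) -> Dict[str, Optional[str]]:
    """Prefer meta tags; fall back to the first JSON-LD object for missing fields."""
    parser = _ArticleParser()
    parser.feed(html)
    parser.close()
    ld = next((d for d in parser.json_ld if isinstance(d, dict)), {})
    meta = parser.meta
    return {
        "title": meta.get("og:title") or ld.get("headline"),
        "description": meta.get("og:description") or ld.get("description"),
        "published": meta.get("article:published_time") or ld.get("datePublished"),
    }


# Invented sample page for illustration.
SAMPLE_HTML = """
<html><head>
<meta property="og:title" content="Sample headline">
<script type="application/ld+json">
{"headline": "LD headline", "datePublished": "2024-01-01T00:00:00+07:00"}
</script>
</head><body></body></html>
"""

if __name__ == "__main__":
    print(extract_article(SAMPLE_HTML))
```

For the sample page, the title comes from the meta tag, the publication date comes from JSON-LD because no matching meta tag exists, and the description is `None` because neither source provides one.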
License
Prototype for demonstration only.