line-today-scrape
===================

A prototype crawler for https://today.line.me/th/ (Thai locale) that respects robots.txt and rate-limits its requests.

Overview
--------
This project contains a conservative, policy-first crawler prototype written in Python.
It demonstrates:
- Fetching robots.txt and enforcing its policy
- Rate-limited async fetching
- HTML extraction (meta tags + JSON-LD fallback)
- Local storage of raw snapshots and parsed JSON
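
The extraction step listed above (meta tags first, JSON-LD as a fallback) could be sketched roughly as follows. This is a standard-library illustration, not the actual code in `linetoday/extractor.py`; the names `extract_article` and `_ArticleParser`, the chosen Open Graph keys, and the output fields are assumptions.

```python
import json
from html.parser import HTMLParser


class _ArticleParser(HTMLParser):
    """Collects <meta> values and the raw bodies of JSON-LD <script> tags."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.jsonld_blocks = []
        self._in_jsonld = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            key = attrs.get("property") or attrs.get("name")
            if key and attrs.get("content"):
                # Keep the first value seen for each key.
                self.meta.setdefault(key, attrs["content"])
        elif tag == "script" and attrs.get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buf = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.jsonld_blocks.append("".join(self._buf))
            self._in_jsonld = False


def _first_jsonld_article(blocks):
    """Return the first JSON-LD object that has a headline, or {}."""
    for raw in blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed blocks instead of failing the page
        items = data if isinstance(data, list) else [data]
        for item in items:
            if isinstance(item, dict) and item.get("headline"):
                return item
    return {}


def extract_article(html):
    """Prefer meta tags; fall back to JSON-LD for any missing field."""
    parser = _ArticleParser()
    parser.feed(html)
    ld = _first_jsonld_article(parser.jsonld_blocks)
    meta = parser.meta
    return {
        "title": meta.get("og:title") or ld.get("headline"),
        "description": meta.get("og:description") or ld.get("description"),
        "published": meta.get("article:published_time") or ld.get("datePublished"),
    }
```

Each field is resolved independently, so a page with an `og:title` but no `article:published_time` still gets its date from JSON-LD when one is present.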

Note: This is a prototype. Review the code and the target site's robots.txt and terms of use before running it, and keep request rates low.
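
One way to make responsible running concrete is to gate every request on the robots.txt policy and to space requests out. The following is a minimal sketch using only the standard library, not the actual contents of `linetoday/robots.py` or `linetoday/fetcher.py`. The user agent string, the `RateLimiter` and `polite_fetch` names, the sample robots.txt rules, and the injected `fetch` coroutine are all assumptions made for illustration.

```python
import asyncio
import time
from urllib.robotparser import RobotFileParser

USER_AGENT = "linetoday-prototype"  # hypothetical user agent string

# Made-up rules for illustration; not the real robots.txt of any site.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""


def load_policy(robots_txt):
    """Build a policy object from robots.txt text that was already fetched."""
    policy = RobotFileParser()
    policy.parse(robots_txt.splitlines())
    return policy


class RateLimiter:
    """Enforces a minimum interval between consecutive requests."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._lock = asyncio.Lock()
        self._next_allowed = 0.0

    async def wait(self):
        async with self._lock:
            now = time.monotonic()
            delay = self._next_allowed - now
            if delay > 0:
                await asyncio.sleep(delay)
            # Schedule the next slot relative to when this one was granted.
            self._next_allowed = max(now, self._next_allowed) + self.min_interval


async def polite_fetch(url, policy, limiter, fetch):
    """Return fetch(url), or None when robots.txt disallows the URL."""
    if not policy.can_fetch(USER_AGENT, url):
        return None
    await limiter.wait()
    return await fetch(url)


async def demo():
    policy = load_policy(SAMPLE_ROBOTS)
    limiter = RateLimiter(min_interval=0.05)  # created inside the running loop

    async def fake_fetch(url):  # stand-in for a real HTTP client call
        return f"body of {url}"

    urls = [
        "https://example.com/news/1",
        "https://example.com/private/x",
        "https://example.com/news/2",
    ]
    start = time.monotonic()
    results = [await polite_fetch(u, policy, limiter, fake_fetch) for u in urls]
    return results, time.monotonic() - start
```

Calling `asyncio.run(demo())` returns the fetched bodies with `None` in place of the disallowed URL. Passing `fetch` in as a parameter keeps the policy and pacing logic testable without network access.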

Quickstart
----------
1. Install dependencies (Poetry or a virtualenv is recommended):

   poetry install

2. Run the crawler in dry-run mode, fetching only a limited number of pages:

   python -m linetoday.cli --dry-run --limit 5

Files
-----
- `linetoday/robots.py` - Robots & policy manager
- `linetoday/fetcher.py` - Async HTTP fetcher with rate limiting
- `linetoday/frontier.py` - URL frontier and canonicalization
- `linetoday/extractor.py` - Article extraction logic
- `linetoday/storage.py` - Local storage for snapshots and parsed JSON
- `linetoday/cli.py` - CLI entrypoint
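
To illustrate what a URL frontier with canonicalization typically does, here is a minimal sketch. It is not the actual code in `linetoday/frontier.py`; the `canonicalize` rules (lowercased host, dropped fragment, sorted query, removed tracking parameters), the `TRACKING_PARAMS` set, and the `Frontier` class are assumptions.

```python
from collections import deque
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed list of parameters that do not change page content.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign"}
DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalize(url):
    """Normalize a URL so that equivalent forms compare equal."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only when it is not the scheme's default.
    if parts.port and parts.port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{parts.port}"
    path = parts.path or "/"
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    query = urlencode(sorted((k, v) for k, v in pairs if k not in TRACKING_PARAMS))
    return urlunsplit((scheme, host, path, query, ""))  # fragment dropped


class Frontier:
    """FIFO queue of canonical URLs that never enqueues a URL twice."""

    def __init__(self):
        self._queue = deque()
        self._seen = set()

    def add(self, url):
        """Enqueue the URL; return False if it was already seen."""
        canon = canonicalize(url)
        if canon in self._seen:
            return False
        self._seen.add(canon)
        self._queue.append(canon)
        return True

    def pop(self):
        """Return the next URL to crawl, or None when the queue is empty."""
        return self._queue.popleft() if self._queue else None

    def __len__(self):
        return len(self._queue)
```

Deduplicating on the canonical form means that two links differing only in fragment, parameter order, or tracking parameters are fetched once.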

License
-------
Prototype for demonstration only.