line-today-scrape/README.md
Sosokker 0b5b9d98c5
add main files
2025-10-29 16:12:55 +07:00


line-today-scrape
===================
Prototype respectful crawler for https://today.line.me/th/ (Thai locale).

Overview
--------

This project contains a conservative, policy-first crawler prototype written in Python.
It demonstrates:
- Robots.txt fetching and policy enforcement
- Rate-limited async fetching
- HTML extraction (meta tags + JSON-LD fallback)
- Local storage of raw snapshots and parsed JSON
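
As a rough illustration of the first two items, the sketch below checks each URL against robots.txt rules and spaces requests with a simple async rate limiter. It uses only the standard library and is not the project's actual code: `USER_AGENT`, `SAMPLE_ROBOTS`, `RateLimiter`, `crawl`, `fake_fetch`, and the `/private/` path are all hypothetical names and values chosen for the example, and `fake_fetch` stands in for a real HTTP client so that nothing touches the network.

```python
import asyncio
import time
from urllib.robotparser import RobotFileParser

# Hypothetical values for illustration; not taken from the project or the site.
USER_AGENT = "linetoday-prototype"
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""


def build_policy(robots_text: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


class RateLimiter:
    """Allow at most one request start per `min_interval` seconds."""

    def __init__(self, min_interval: float) -> None:
        self.min_interval = min_interval
        self._lock = asyncio.Lock()
        self._next_slot = 0.0

    async def wait(self) -> None:
        async with self._lock:
            now = time.monotonic()
            delay = self._next_slot - now
            if delay > 0:
                await asyncio.sleep(delay)
            # Reserve the next slot one interval after the current one.
            self._next_slot = max(now, self._next_slot) + self.min_interval


async def crawl(urls, policy, fetch, min_interval=0.05):
    """Fetch only the URLs robots.txt allows, spaced by the rate limiter."""
    limiter = RateLimiter(min_interval)  # created inside the running loop
    results = {}
    for url in urls:
        if not policy.can_fetch(USER_AGENT, url):
            results[url] = None  # skipped by policy, never requested
            continue
        await limiter.wait()
        results[url] = await fetch(url)
    return results


async def fake_fetch(url: str) -> str:
    # Stand-in for a real HTTP client call; no network access happens here.
    return f"<html>{url}</html>"


policy = build_policy(SAMPLE_ROBOTS)
urls = ["https://today.line.me/th/", "https://today.line.me/private/x"]
results = asyncio.run(crawl(urls, policy, fake_fetch))
for url, body in results.items():
    print(url, "fetched" if body is not None else "skipped by robots.txt")
```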
Note: This is a prototype. Review the code and the target site's robots.txt before running it, and run it responsibly.

Quickstart
----------

1. Install dependencies (Poetry or a virtualenv is recommended):

       poetry install

2. Run the crawler in dry-run mode (fetches a limited number of pages):

       python -m linetoday.cli --dry-run --limit 5

Files
-----
- `linetoday/robots.py` - Robots & policy manager
- `linetoday/fetcher.py` - Async HTTP fetcher with rate limiting
- `linetoday/frontier.py` - URL frontier and canonicalization
- `linetoday/extractor.py` - Article extraction logic
- `linetoday/storage.py` - Local storage for snapshots and parsed JSON
- `linetoday/cli.py` - CLI entrypoint
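
To make the frontier and extractor roles more concrete, here is a minimal standard-library sketch of URL canonicalization and of extraction that prefers meta tags and falls back to JSON-LD. It is an assumption about how such modules could work, not a copy of `linetoday/frontier.py` or `linetoday/extractor.py`: the names `canonicalize`, `ArticleParser`, `extract`, and `SAMPLE_HTML`, the choice of Open Graph keys, and the canonicalization rules (lowercase host, drop fragment) are all illustrative.

```python
import json
from html.parser import HTMLParser
from urllib.parse import urlsplit, urlunsplit


def canonicalize(url: str) -> str:
    """Lowercase scheme and host, drop the fragment, default the path to "/"."""
    parts = urlsplit(url)
    path = parts.path or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


class ArticleParser(HTMLParser):
    """Collect <meta> values and the text of JSON-LD <script> blocks."""

    def __init__(self) -> None:
        super().__init__()
        self.meta = {}
        self.json_ld = []
        self._in_json_ld = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            key = attrs.get("property") or attrs.get("name")
            if key and attrs.get("content"):
                self.meta[key] = attrs["content"]
        elif tag == "script" and attrs.get("type") == "application/ld+json":
            self._in_json_ld = True
            self.json_ld.append("")

    def handle_data(self, data):
        # Script text may arrive in several chunks, so append to the last block.
        if self._in_json_ld:
            self.json_ld[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_json_ld = False


def extract(html: str) -> dict:
    """Prefer Open Graph meta tags; fall back to JSON-LD for missing fields."""
    parser = ArticleParser()
    parser.feed(html)
    parser.close()
    linked = {}
    for raw in parser.json_ld:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # ignore malformed JSON-LD blocks
        if isinstance(data, dict):
            linked = data
            break
    return {
        "title": parser.meta.get("og:title") or linked.get("headline"),
        "description": parser.meta.get("og:description")
        or linked.get("description"),
    }


# Made-up page: the title comes from a meta tag, the description only from JSON-LD.
SAMPLE_HTML = """
<html><head>
<meta property="og:title" content="Sample headline">
<script type="application/ld+json">
{"@type": "NewsArticle", "headline": "JSON-LD headline", "description": "From JSON-LD"}
</script>
</head><body></body></html>
"""

article = extract(SAMPLE_HTML)
canonical = canonicalize("HTTPS://Today.Line.Me/th/#top")
print(article)
print(canonical)
```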

License
-------

Prototype for demonstration only.