line-today-scrape
===================

Prototype respectful crawler for https://today.line.me/th/ (Thai locale).

Overview
--------
This project contains a conservative, policy-first crawler prototype written in Python.
It demonstrates:

- `robots.txt` fetching and policy enforcement
- Rate-limited async fetching
- HTML extraction (meta tags + JSON-LD fallback)
- Local storage of raw snapshots and parsed JSON
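
As a rough illustration of the first two items above, the sketch below gates each request on a `robots.txt` check and then spaces requests out with a simple async rate limiter. It uses only the standard library. `USER_AGENT`, `RateLimiter`, `polite_fetch`, `fake_fetch`, and the sample rules are hypothetical names and data chosen for this example; they are not the project's actual API, and the real modules may be organized differently.

```python
import asyncio
import time
from typing import Awaitable, Callable, List, Optional
from urllib.robotparser import RobotFileParser

USER_AGENT = "linetoday-prototype"  # hypothetical user agent string

# Illustrative rules only; these are NOT the rules of any real site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_policy(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


class RateLimiter:
    """Space request starts at least `min_interval` seconds apart."""

    def __init__(self, min_interval: float) -> None:
        self.min_interval = min_interval
        self._next_slot = 0.0

    async def wait(self) -> None:
        # Reserve a slot before awaiting, so concurrent tasks on the same
        # event loop each get a distinct start time.
        now = time.monotonic()
        slot = max(now, self._next_slot)
        self._next_slot = slot + self.min_interval
        await asyncio.sleep(slot - now)


async def polite_fetch(
    url: str,
    policy: RobotFileParser,
    limiter: RateLimiter,
    fetch: Callable[[str], Awaitable[str]],
) -> Optional[str]:
    """Fetch `url` only if the policy allows it, after waiting for a slot."""
    if not policy.can_fetch(USER_AGENT, url):
        return None  # disallowed: skip without touching the network
    await limiter.wait()
    return await fetch(url)


async def demo() -> List[Optional[str]]:
    policy = build_policy(ROBOTS_TXT)
    limiter = RateLimiter(min_interval=0.05)

    async def fake_fetch(url: str) -> str:
        # Stand-in for a real HTTP client call.
        return f"<html>{url}</html>"

    urls = ["https://example.com/a", "https://example.com/private/b"]
    return [await polite_fetch(u, policy, limiter, fake_fetch) for u in urls]


if __name__ == "__main__":
    print(asyncio.run(demo()))
```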

Note: This is a prototype. Always review it before use and run it responsibly.

Quickstart
----------
1. Install dependencies (Poetry or a virtualenv is recommended):

       poetry install

2. Run the crawler in dry-run mode (fetches a limited number of pages):

       python -m linetoday.cli --dry-run --limit 5

Files
-----
- `linetoday/robots.py` - Robots & policy manager
- `linetoday/fetcher.py` - Async HTTP fetcher with rate limiting
- `linetoday/frontier.py` - URL frontier and canonicalization
- `linetoday/extractor.py` - Article extraction logic
- `linetoday/storage.py` - Local storage for snapshots and parsed JSON
- `linetoday/cli.py` - CLI entrypoint
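
To make the extraction step more concrete, here is a minimal sketch of reading article fields from meta tags and filling any gaps from JSON-LD. It assumes Open Graph property names (`og:title`, `og:description`, `article:published_time`), schema.org keys (`headline`, `description`, `datePublished`), and that meta tags take priority over JSON-LD. The names `extract_article` and `_ArticleParser` are hypothetical, and the real `linetoday/extractor.py` may differ on all of these points.

```python
import json
from html.parser import HTMLParser
from typing import Dict, List, Optional


class _ArticleParser(HTMLParser):
    """Collect <meta> values and the raw text of JSON-LD <script> blocks."""

    def __init__(self) -> None:
        super().__init__()
        self.meta: Dict[str, str] = {}
        self.json_ld_blocks: List[str] = []
        self._in_json_ld = False
        self._buffer: List[str] = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "meta":
            key = attr_map.get("property") or attr_map.get("name")
            content = attr_map.get("content")
            if key and content:
                self.meta[key] = content
        elif tag == "script" and attr_map.get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self.json_ld_blocks.append("".join(self._buffer))
            self._in_json_ld = False


def extract_article(html: str) -> Dict[str, Optional[str]]:
    """Return title, description, and published date, preferring meta tags."""
    parser = _ArticleParser()
    parser.feed(html)
    parser.close()

    # Assumed field mapping: Open Graph meta tags first.
    article: Dict[str, Optional[str]] = {
        "title": parser.meta.get("og:title"),
        "description": parser.meta.get("og:description"),
        "published": parser.meta.get("article:published_time"),
    }

    # Fallback: fill only the fields the meta tags did not provide.
    json_ld_keys = {
        "title": "headline",
        "description": "description",
        "published": "datePublished",
    }
    for block in parser.json_ld_blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # skip malformed JSON-LD rather than fail the page
        if not isinstance(data, dict):
            continue
        for field, key in json_ld_keys.items():
            if article[field] is None and isinstance(data.get(key), str):
                article[field] = data[key]
    return article


SAMPLE_HTML = """
<html><head>
<meta property="og:title" content="Sample headline">
<script type="application/ld+json">
{"@type": "NewsArticle", "headline": "Ignored", "datePublished": "2024-01-01T00:00:00Z"}
</script>
</head><body></body></html>
"""

if __name__ == "__main__":
    print(extract_article(SAMPLE_HTML))
```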

License
-------
Prototype for demonstration only.