sosokker/line-today-scrape

Sosokker 0b5b9d98c5

add main files

2025-10-29 16:12:55 +07:00

line-today-scrape

Prototype respectful crawler for https://today.line.me/th/ (Thai locale).

Overview

This project contains a conservative, policy-first crawler prototype written in Python. It demonstrates:

Robots.txt fetching and policy enforcement
Rate-limited async fetching
HTML extraction (meta tags + JSON-LD fallback)
Local storage of raw snapshots and parsed JSON
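The first two items above, robots.txt enforcement and rate limiting, can be sketched with the standard library alone. This is a minimal illustration and not the project's actual code: the user-agent string, the sample robots.txt body, and the names build_policy, is_allowed, delay_seconds, and MinIntervalLimiter are all assumptions made for the example.

```python
import asyncio
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body; a real run would download the site's own file.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

USER_AGENT = "linetoday-prototype"  # assumed name, not taken from the project


def build_policy(robots_text: str) -> RobotFileParser:
    """Parse a robots.txt body that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def is_allowed(policy: RobotFileParser, url: str) -> bool:
    """Return True if the policy lets USER_AGENT fetch this URL."""
    return policy.can_fetch(USER_AGENT, url)


def delay_seconds(policy: RobotFileParser, default: float = 1.0) -> float:
    """Use the site's Crawl-delay if it declares one, else a default."""
    delay = policy.crawl_delay(USER_AGENT)
    return float(delay) if delay is not None else default


class MinIntervalLimiter:
    """Spaces out calls so that at least `interval` seconds separate them."""

    def __init__(self, interval: float) -> None:
        self.interval = interval
        self._lock = asyncio.Lock()
        self._next_time = 0.0

    async def wait(self) -> None:
        async with self._lock:
            now = time.monotonic()
            sleep_for = self._next_time - now
            if sleep_for > 0:
                await asyncio.sleep(sleep_for)
            # Schedule the earliest moment the next caller may proceed.
            self._next_time = max(now, self._next_time) + self.interval
```

A fetch loop would check is_allowed(policy, url) first, then await limiter.wait() before each request, with the limiter interval taken from delay_seconds(policy).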

Note: This is a prototype. Review the code and the target site's robots.txt before running it, and run it responsibly.
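The extraction step listed in the overview (meta tags with a JSON-LD fallback) might look roughly like the sketch below. It uses only the standard library. The field names (title, description), the choice of Open Graph keys, and the names _ArticleParser and extract_article are assumptions for illustration. The real linetoday/extractor.py may work differently.

```python
import json
from html.parser import HTMLParser


class _ArticleParser(HTMLParser):
    """Collects <meta> tags and the text of JSON-LD <script> blocks."""

    def __init__(self) -> None:
        super().__init__()
        self.meta = {}
        self.jsonld_blocks = []
        self._in_jsonld = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "meta":
            key = attr_map.get("property") or attr_map.get("name")
            content = attr_map.get("content")
            if key and content:
                # Keep the first value seen for each key.
                self.meta.setdefault(key, content)
        elif tag == "script" and attr_map.get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.jsonld_blocks.append("".join(self._buffer))
            self._in_jsonld = False


def extract_article(html: str) -> dict:
    """Prefer meta tags; fill any missing field from JSON-LD."""
    parser = _ArticleParser()
    parser.feed(html)
    parser.close()
    result = {
        "title": parser.meta.get("og:title"),
        "description": parser.meta.get("og:description")
        or parser.meta.get("description"),
    }
    for block in parser.jsonld_blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # skip malformed JSON-LD rather than fail the page
        if not isinstance(data, dict):
            continue
        result["title"] = result["title"] or data.get("headline")
        result["description"] = result["description"] or data.get("description")
    return result
```

Meta tags are read first because they are cheap to parse. JSON-LD is consulted only for fields that are still empty.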

Quickstart

Install dependencies (Poetry or a virtualenv is recommended):

poetry install

Run the crawler in dry-run mode (fetches a limited number of pages):

python -m linetoday.cli --dry-run --limit 5
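
For readers curious how the two flags in that command could be wired up, here is a minimal argparse sketch. Only the flag names --dry-run and --limit come from the command above. The help texts, the default of no limit, and the functions build_parser and main are assumptions, not the contents of linetoday/cli.py.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="linetoday")
    parser.add_argument(
        "--dry-run",
        action="store_true",
        help="assumed meaning: run without a full crawl",
    )
    parser.add_argument(
        "--limit",
        type=int,
        default=None,  # assumed default: no explicit page limit
        help="maximum number of pages to fetch",
    )
    return parser


def main(argv=None) -> argparse.Namespace:
    """Parse arguments and report the chosen mode; crawling is omitted."""
    args = build_parser().parse_args(argv)
    mode = "dry-run" if args.dry_run else "live"
    print(f"mode={mode} limit={args.limit}")
    return args
```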

Files

linetoday/robots.py - Robots & policy manager
linetoday/fetcher.py - Async HTTP fetcher with rate limiting
linetoday/frontier.py - URL frontier and canonicalization
linetoday/extractor.py - Article extraction logic
linetoday/storage.py - Local storage for snapshots and parsed JSON
linetoday/cli.py - CLI entrypoint
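
As an illustration of what a URL frontier with canonicalization usually involves, the sketch below normalizes URLs so that trivially different forms of the same page are queued only once. The normalization rules (lowercase host, dropped fragment, sorted query, removed tracking parameters) and the names canonicalize, Frontier, and TRACKING_PARAMS are assumptions. The rules used in linetoday/frontier.py are not documented here.

```python
from collections import deque
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed set of parameters that do not change page content.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign"}


def canonicalize(url: str) -> str:
    """Return a normalized form of the URL for de-duplication."""
    parts = urlsplit(url)
    pairs = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in TRACKING_PARAMS
    ]
    query = urlencode(sorted(pairs))
    path = parts.path or "/"
    # The fragment is dropped because it is never sent to the server.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, query, ""))


class Frontier:
    """FIFO queue of canonical URLs that ignores ones already seen."""

    def __init__(self) -> None:
        self._seen = set()
        self._queue = deque()

    def add(self, url: str) -> bool:
        canonical = canonicalize(url)
        if canonical in self._seen:
            return False
        self._seen.add(canonical)
        self._queue.append(canonical)
        return True

    def pop(self):
        """Return the next URL, or None when the queue is empty."""
        return self._queue.popleft() if self._queue else None
```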

License

Prototype for demonstration only.