line-today-scrape

Prototype respectful crawler for https://today.line.me/th/ (Thai locale).

Overview

This project contains a conservative, policy-first crawler prototype written in Python. It demonstrates:

  • Robots.txt fetching and policy enforcement
  • Rate-limited async fetching
  • HTML extraction (meta tags + JSON-LD fallback)
  • Local storage of raw snapshots and parsed JSON
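As an illustration of the first two points, the sketch below combines a robots.txt check with a minimum-interval rate limiter using only the standard library. It is not the project's actual implementation: `USER_AGENT`, `build_policy`, `MinIntervalLimiter`, and `crawl` are hypothetical names, and the robots.txt rules in the usage note are invented for the example.

```python
import asyncio
import time
from urllib.robotparser import RobotFileParser

# Hypothetical user agent; the project's real value is not shown in this README.
USER_AGENT = "linetoday-prototype"


def build_policy(robots_txt: str) -> RobotFileParser:
    """Build a policy object from robots.txt text that was already fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


class MinIntervalLimiter:
    """Let callers through at most once per `interval` seconds."""

    def __init__(self, interval: float) -> None:
        self.interval = interval
        self._lock = asyncio.Lock()
        self._next_allowed = 0.0

    async def wait(self) -> None:
        # The lock serializes concurrent callers so they queue up in turn.
        async with self._lock:
            delay = self._next_allowed - time.monotonic()
            if delay > 0:
                await asyncio.sleep(delay)
            self._next_allowed = time.monotonic() + self.interval


async def crawl(urls, policy: RobotFileParser, limiter: MinIntervalLimiter):
    """Return the URLs that pass the policy, pacing each one through the limiter."""
    fetched = []
    for url in urls:
        if not policy.can_fetch(USER_AGENT, url):
            continue  # disallowed by robots.txt: never request it
        await limiter.wait()
        fetched.append(url)  # a real crawler would issue the HTTP request here
    return fetched
```

For example, with an invented rule set such as `User-agent: *` followed by `Disallow: /private/`, `crawl` would skip any URL under `/private/` and space the remaining requests by at least the configured interval.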

Note: This is a prototype. Review the code and the target site's policies before running it, and crawl responsibly.

Quickstart

  1. Install dependencies (Poetry or a virtualenv is recommended):

    poetry install

  2. Run the crawler in dry-run mode (fetches only a limited number of pages):

    python -m linetoday.cli --dry-run --limit 5
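For readers curious how flags like these are typically wired up, here is a minimal `argparse` sketch. Only the flag names `--dry-run` and `--limit` come from the command above; the `build_parser` function, the help texts, and the `None` default are assumptions and may not match what `linetoday/cli.py` actually does.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the two flags shown in the Quickstart (hypothetical wiring)."""
    parser = argparse.ArgumentParser(prog="linetoday")
    parser.add_argument(
        "--dry-run",
        action="store_true",
        help="run in dry-run mode (assumed meaning: a small, limited crawl)",
    )
    parser.add_argument(
        "--limit",
        type=int,
        default=None,  # assumption: no limit unless one is given
        help="maximum number of pages to fetch",
    )
    return parser


# Parse the same arguments as the Quickstart command.
args = build_parser().parse_args(["--dry-run", "--limit", "5"])
print(args.dry_run, args.limit)
```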

Files

  • linetoday/robots.py - Robots & policy manager
  • linetoday/fetcher.py - Async HTTP fetcher with rate limiting
  • linetoday/frontier.py - URL frontier and canonicalization
  • linetoday/extractor.py - Article extraction logic
  • linetoday/storage.py - Local storage for snapshots and parsed JSON
  • linetoday/cli.py - CLI entrypoint
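To make the frontier and extractor roles more concrete, the following standard-library sketch shows one possible way to canonicalize URLs and to extract article fields from meta tags with a JSON-LD fallback. It is an illustration under assumptions, not the code in `frontier.py` or `extractor.py`: the function names, the choice of Open Graph keys, and the output fields are all hypothetical.

```python
import json
from html.parser import HTMLParser
from urllib.parse import urlsplit, urlunsplit


def canonicalize(url: str) -> str:
    """Lowercase the scheme and host and drop the fragment (assumed rules)."""
    parts = urlsplit(url)
    path = parts.path or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


class _ArticleParser(HTMLParser):
    """Collect <meta> tags and JSON-LD script payloads from a page."""

    def __init__(self) -> None:
        super().__init__()
        self.meta = {}
        self.json_ld = []
        self._in_json_ld = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            key = attrs.get("property") or attrs.get("name")
            if key and attrs.get("content"):
                self.meta[key] = attrs["content"]
        elif tag == "script" and attrs.get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            try:
                self.json_ld.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # ignore malformed JSON-LD instead of failing the page


def extract_article(html: str) -> dict:
    """Prefer meta tags; fall back to the first JSON-LD object for missing fields."""
    parser = _ArticleParser()
    parser.feed(html)
    parser.close()
    ld = next((item for item in parser.json_ld if isinstance(item, dict)), {})
    meta = parser.meta
    return {
        "title": meta.get("og:title") or ld.get("headline"),
        "description": meta.get("og:description") or ld.get("description"),
        "published": meta.get("article:published_time") or ld.get("datePublished"),
    }
```

Canonicalizing before enqueueing lets the frontier treat `https://today.line.me/th/#top` and `https://today.line.me/th/` as the same page, and the fallback order keeps extraction working when a page carries only one of the two metadata sources.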

License

Prototype for demonstration only.