From fc16ff8f33dea5a25df8b46d498c780f0276eba2 Mon Sep 17 00:00:00 2001
From: Sirin Puenggun
Date: Sat, 12 Jul 2025 16:56:24 +0000
Subject: [PATCH] chore: add README

---
 README.md | 144 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 144 insertions(+)
 create mode 100644 README.md

diff --git a/README.md b/README.md
new file mode 100644
index 0000000..8c0b257
--- /dev/null
+++ b/README.md
@@ -0,0 +1,144 @@
+# Site to LLMs.txt - Generate llms.txt from any docs site
+
+[![Go](https://img.shields.io/badge/Go-1.24+-blue.svg)](https://golang.org/)
+
+> ⚠️ I vibe-coded this project. Quality isn't guaranteed but it is usable.
+
+A Go-based web crawler that scrapes websites and outputs documentation in [LLMs.txt format](https://llmstxt.org/) and markdown files.
+
+---
+
+## Features
+
+* Generate `llms.txt` and `llms-full.txt` for specific websites
+* Convert HTML pages to Markdown
+* Filter out unwanted pages, files, and external domains
+* Categorize main vs. secondary documentation
+* Track crawl progress
+* Save content to structured files
+
+---
+
+## Installation
+
+```bash
+git clone https://github.com/Sosokker/site-to-llmstxt.git
+cd site-to-llmstxt
+make build
+```
+
+---
+
+## Quick Start
+
+```bash
+./bin/site-to-llmstxt --url https://docs.example.com
+```
+
+```bash
+# Custom settings
+./bin/site-to-llmstxt --url https://example.com --output ./my-docs --workers 3 --verbose
+```
+
+---
+
+## CLI Options
+
+* `--url, -u`: Start URL (required)
+* `--output, -o`: Output directory (default: `./output`)
+* `--workers, -w`: Number of concurrent workers (default: `1`)
+* `--verbose`: Enable progress logging
+
+---
+
+## Output
+
+```
+output/
+├── llms.txt # Overview
+├── llms-full.txt # Full crawl
+└── pages/ # Per-page Markdown
+```
+
+---
+
+## Architecture
+
+```
+internal/
+├── config/ # CLI and env config
+├── crawler/ # Fetch and parse logic
+├── filters/ # URL + content filter
+├── generator/ # Output file creator
+├── models/ # Shared types
+├── progress/ # Progress reporting
+└── utils/ # Misc functions
+```
+
+---
+
+## Filtering Rules
+
+**Excluded:**
+
+* Language folders (`/en/`, `/zh/`, etc.)
+* File types: `.pdf`, `.docx`, `.zip`, `.jpg`, `.mp4`, `.exe`, etc.
+* Paths: `/blog`, `/news`, `/about`, `/contact`, etc.
+
+**Categorization:**
+
+* **Main:** Guides, API, setup
+* **Secondary:** Blog, news, legal
+
+---
+
+## Development Setup
+
+```bash
+make dev-setup
+```
+
+### Common Commands
+
+```bash
+make build
+make test
+make lint
+make demo
+make clean
+```
+
+---
+
+## Code Style
+
+* `gofmt`, `goimports`
+* Idiomatic Go patterns
+* `%w` for error wrapping
+* Table-driven tests
+* Package names are short and focused
+
+---
+
+## Testing
+
+```bash
+make test
+make test-coverage
+go test -v ./internal/filters/
+```
+
+---
+
+## Dependencies
+
+* `colly` – web crawling
+* `html-to-markdown` – content conversion
+* `progressbar` – terminal progress
+* `urfave/cli` – CLI handling
+
+---
+
+## License
+
+MIT – see [LICENSE](LICENSE)
\ No newline at end of file