mirror of
https://github.com/Sosokker/site-to-llmstxt.git
synced 2025-12-18 13:34:06 +01:00
Site to LLMs.txt - Generate llms.txt from any docs site
⚠️ I vibe-coded this project. Quality isn't guaranteed but it is usable.
A Go-based web crawler that scrapes websites and outputs documentation in LLMs.txt format and markdown files.
Features
- Generate llms.txt and llms-full.txt for specific websites
- Convert HTML pages to Markdown
- Filter out unwanted pages, files, and external domains
- Categorize main vs. secondary documentation
- Track crawl progress
- Save content to structured files
Installation
git clone https://github.com/Sosokker/site-to-llmstxt.git
cd site-to-llmstxt
make build
Quick Start
./bin/site-to-llmstxt --url https://docs.example.com
# Custom settings
./bin/site-to-llmstxt --url https://example.com --output ./my-docs --workers 3 --verbose
CLI Options
- --url, -u: Start URL (required)
- --output, -o: Output directory (default: ./output)
- --workers, -w: Number of concurrent workers (default: 1)
- --verbose: Enable progress logging
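The options above could be parsed roughly as follows. This is a minimal sketch that uses Go's standard flag package instead of urfave/cli (the library the project actually depends on), so it is not the project's real code. The Options struct and the parseArgs function are hypothetical names; the defaults mirror the list above.

```go
package main

import (
	"errors"
	"flag"
	"io"
)

// Options mirrors the documented CLI flags. The struct itself is hypothetical.
type Options struct {
	URL     string
	Output  string
	Workers int
	Verbose bool
}

// parseArgs parses args (without the program name) into Options.
func parseArgs(args []string) (Options, error) {
	var o Options
	fs := flag.NewFlagSet("site-to-llmstxt", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep the sketch quiet on parse errors

	// Register the long and short names against the same variable.
	fs.StringVar(&o.URL, "url", "", "start URL (required)")
	fs.StringVar(&o.URL, "u", "", "shorthand for --url")
	fs.StringVar(&o.Output, "output", "./output", "output directory")
	fs.StringVar(&o.Output, "o", "./output", "shorthand for --output")
	fs.IntVar(&o.Workers, "workers", 1, "number of concurrent workers")
	fs.IntVar(&o.Workers, "w", 1, "shorthand for --workers")
	fs.BoolVar(&o.Verbose, "verbose", false, "enable progress logging")

	if err := fs.Parse(args); err != nil {
		return Options{}, err
	}
	if o.URL == "" {
		return Options{}, errors.New("--url is required")
	}
	return o, nil
}
```

Calling parseArgs(os.Args[1:]) from a main function would give the same behavior as the documented flags: a missing --url is an error, and every other option falls back to its default.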
Output
output/
├── llms.txt # Overview
├── llms-full.txt # Full crawl
└── pages/ # Per-page Markdown
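To illustrate what the overview file might contain, here is a minimal sketch that renders an index in the public llms.txt convention: an H1 title, a blockquote summary, then a section of Markdown links. The Page type, the renderIndex function, and the "Docs" section name are assumptions for illustration, not the project's actual generator.

```go
package main

import (
	"fmt"
	"strings"
)

// Page is a hypothetical record for one crawled page.
type Page struct {
	Title       string
	URL         string
	Description string // optional; omitted from the link line when empty
}

// renderIndex builds an llms.txt-style overview from the crawled pages.
func renderIndex(site, summary string, pages []Page) string {
	var b strings.Builder
	fmt.Fprintf(&b, "# %s\n\n> %s\n\n## Docs\n\n", site, summary)
	for _, p := range pages {
		fmt.Fprintf(&b, "- [%s](%s)", p.Title, p.URL)
		if p.Description != "" {
			fmt.Fprintf(&b, ": %s", p.Description)
		}
		b.WriteByte('\n')
	}
	return b.String()
}
```

The returned string could be written to output/llms.txt with os.WriteFile; llms-full.txt would presumably append each page's converted Markdown instead of only linking to it.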
Architecture
internal/
├── config/ # CLI and env config
├── crawler/ # Fetch and parse logic
├── filters/ # URL + content filter
├── generator/ # Output file creator
├── models/ # Shared types
├── progress/ # Progress reporting
└── utils/ # Misc functions
Filtering Rules
Excluded:
- Language folders (/en/, /zh/, etc.)
- File types: .pdf, .docx, .zip, .jpg, .mp4, .exe, etc.
- Paths: /blog, /news, /about, /contact, etc.
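The exclusion rules above can be sketched as a single predicate. This is an illustration under stated assumptions, not the code in internal/filters: the shouldSkip function and the three lookup tables are hypothetical, and they contain only the entries named above (the real lists are longer, per the "etc.").

```go
package main

import (
	"net/url"
	"path"
	"strings"
)

// Illustrative subsets only; the project's real lists are longer.
var (
	langFolders  = map[string]bool{"en": true, "zh": true}
	skippedExts  = map[string]bool{".pdf": true, ".docx": true, ".zip": true, ".jpg": true, ".mp4": true, ".exe": true}
	skippedPaths = []string{"/blog", "/news", "/about", "/contact"}
)

// shouldSkip reports whether rawURL falls under one of the exclusion rules.
// allowedHost is the host of the start URL; anything else counts as external.
func shouldSkip(rawURL, allowedHost string) bool {
	u, err := url.Parse(rawURL)
	if err != nil || u.Hostname() != allowedHost {
		return true // unparsable URL or external domain
	}
	if skippedExts[strings.ToLower(path.Ext(u.Path))] {
		return true // binary or media file
	}
	for _, p := range skippedPaths {
		// Match whole path segments so /blogging-guide is not treated as /blog.
		if u.Path == p || strings.HasPrefix(u.Path, p+"/") {
			return true
		}
	}
	for _, seg := range strings.Split(u.Path, "/") {
		if langFolders[seg] {
			return true // language folder such as /en/ or /zh/
		}
	}
	return false
}
```

Checking the host first keeps the crawl on the start domain, and comparing whole segments avoids dropping pages whose names merely begin with an excluded word.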
Categorization:
- Main: Guides, API, setup
- Secondary: Blog, news, legal
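One way the categorization could work is a first-match rule table over URL path segments. This is only a sketch: the Category type, the rules table, and the categorize function are hypothetical, the keywords are taken from the two bullets above, and because the README does not say how unmatched pages are classified, the caller supplies the fallback.

```go
package main

import "strings"

// Category labels a page as main or secondary documentation.
type Category string

const (
	Main      Category = "main"
	Secondary Category = "secondary"
)

// rules maps a path-segment prefix to a category. Hypothetical keywords
// derived from the examples in the list above.
var rules = []struct {
	prefix string
	cat    Category
}{
	{"guide", Main}, {"api", Main}, {"setup", Main},
	{"blog", Secondary}, {"news", Secondary}, {"legal", Secondary},
}

// categorize returns the category of the first path segment that starts
// with a known prefix, or fallback when nothing matches.
func categorize(urlPath string, fallback Category) Category {
	for _, seg := range strings.Split(strings.ToLower(urlPath), "/") {
		for _, r := range rules {
			if strings.HasPrefix(seg, r.prefix) {
				return r.cat
			}
		}
	}
	return fallback
}
```

For example, categorize("/guides/install", Secondary) yields Main, while a path with no known keyword simply returns whatever fallback was passed in.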
Development Setup
make dev-setup
Common Commands
make build
make test
make lint
make demo
make clean
Code Style
- gofmt, goimports
- Idiomatic Go patterns
- %w for error wrapping
- Table-driven tests
- Package names are short and focused
Testing
make test
make test-coverage
go test -v ./internal/filters/
Dependencies
- colly – web crawling
- html-to-markdown – content conversion
- progressbar – terminal progress
- urfave/cli – CLI handling
License
MIT – see LICENSE