mirror of
https://github.com/Sosokker/site-to-llmstxt.git
synced 2025-12-18 13:34:06 +01:00
# Site to LLMs.txt - Generate llms.txt from any docs site

> ⚠️ I vibe-coded this project. Quality isn't guaranteed but it is usable.

A Go-based web crawler that scrapes websites and outputs documentation in [LLMs.txt format](https://llmstxt.org/) and markdown files.

---

## Features

* Generate `llms.txt` and `llms-full.txt` for specific websites
* Convert HTML pages to Markdown
* Filter out unwanted pages, files, and external domains
* Categorize main vs. secondary documentation
* Track crawl progress
* Save content to structured files

---

## Installation

```bash
git clone https://github.com/Sosokker/site-to-llmstxt.git
cd site-to-llmstxt
make build
```

Or build and run with the Makefile:

```bash
make build
make run URL=https://example.com
make run URL=https://httpbin.org WORKERS=2 OUTPUT=./test-output
```

---

## Quick Start

```bash
./bin/site-to-llmstxt --url https://docs.example.com
```

```bash
# Custom settings
./bin/site-to-llmstxt --url https://example.com --output ./my-docs --workers 3 --verbose
```

---

## CLI Options

* `--url, -u`: Start URL (required)
* `--output, -o`: Output directory (default: `./output`)
* `--workers, -w`: Number of concurrent workers (default: `1`)
* `--verbose`: Enable progress logging
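
The options above could be parsed roughly as follows. This is a minimal sketch using the standard library `flag` package rather than `urfave/cli`, which the project actually depends on, so it is not the project's implementation. The `options` struct, the `parseOptions` function, and the "workers must be at least 1" check are assumptions made for illustration. `main` uses fixed example arguments so the sketch runs without a real command line.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"io"
	"os"
)

// options mirrors the documented CLI flags. The struct and its field names
// are hypothetical, not taken from the project source.
type options struct {
	URL     string
	Output  string
	Workers int
	Verbose bool
}

// parseOptions parses args (without the program name) using the documented
// long and short flag names and their documented defaults.
func parseOptions(args []string) (options, error) {
	var o options
	fs := flag.NewFlagSet("site-to-llmstxt", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // errors are returned instead of printed

	// Long and short names share one variable; the flag package accepts
	// both "-name" and "--name".
	fs.StringVar(&o.URL, "url", "", "start URL (required)")
	fs.StringVar(&o.URL, "u", "", "shorthand for --url")
	fs.StringVar(&o.Output, "output", "./output", "output directory")
	fs.StringVar(&o.Output, "o", "./output", "shorthand for --output")
	fs.IntVar(&o.Workers, "workers", 1, "number of concurrent workers")
	fs.IntVar(&o.Workers, "w", 1, "shorthand for --workers")
	fs.BoolVar(&o.Verbose, "verbose", false, "enable progress logging")

	if err := fs.Parse(args); err != nil {
		return options{}, fmt.Errorf("parse flags: %w", err)
	}
	if o.URL == "" {
		return options{}, errors.New("--url is required")
	}
	// Assumption: a worker count below 1 is rejected.
	if o.Workers < 1 {
		return options{}, fmt.Errorf("--workers must be at least 1, got %d", o.Workers)
	}
	return o, nil
}

func main() {
	// Fixed example arguments so the sketch runs without a real command line.
	args := []string{"--url", "https://docs.example.com", "-w", "3"}
	opts, err := parseOptions(args)
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
	fmt.Printf("%+v\n", opts)
}
```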

---

## Output

```
output/
├── llms.txt          # Overview
├── llms-full.txt     # Full crawl
└── pages/            # Per-page Markdown
```
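
To give a feel for what the `llms.txt` overview may contain, here is a small sketch that renders the general layout described at llmstxt.org: an H1 title, a blockquote summary, and H2 sections holding Markdown link lists. The `page` and `section` types, the `renderLLMsTxt` function, and the section names are hypothetical. The exact file this tool writes may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// page is a hypothetical record for one crawled page.
type page struct {
	Title, URL, Description string
}

// section groups pages under one H2 heading.
type section struct {
	Name  string
	Pages []page
}

// renderLLMsTxt builds an llms.txt style overview: H1 title, optional
// blockquote summary, then one H2 section per group with a link list.
func renderLLMsTxt(title, summary string, sections []section) string {
	var b strings.Builder
	fmt.Fprintf(&b, "# %s\n\n", title)
	if summary != "" {
		fmt.Fprintf(&b, "> %s\n\n", summary)
	}
	for _, s := range sections {
		if len(s.Pages) == 0 {
			continue // skip empty groups rather than emit a bare heading
		}
		fmt.Fprintf(&b, "## %s\n\n", s.Name)
		for _, p := range s.Pages {
			fmt.Fprintf(&b, "- [%s](%s)", p.Title, p.URL)
			if p.Description != "" {
				fmt.Fprintf(&b, ": %s", p.Description)
			}
			b.WriteByte('\n')
		}
		b.WriteByte('\n')
	}
	// Collapse trailing blank lines to a single final newline.
	return strings.TrimRight(b.String(), "\n") + "\n"
}

func main() {
	out := renderLLMsTxt("Example Docs", "Documentation crawled from docs.example.com.", []section{
		{Name: "Docs", Pages: []page{
			{Title: "Getting Started", URL: "https://docs.example.com/start", Description: "Install and first run"},
		}},
		{Name: "Optional", Pages: []page{
			{Title: "Changelog", URL: "https://docs.example.com/changelog"},
		}},
	})
	fmt.Print(out)
}
```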

---

## Architecture

```
internal/
├── config/       # CLI and env config
├── crawler/      # Fetch and parse logic
├── filters/      # URL + content filter
├── generator/    # Output file creator
├── models/       # Shared types
├── progress/     # Progress reporting
└── utils/        # Misc functions
```

---

## Filtering Rules

**Excluded:**

* Language folders (`/en/`, `/zh/`, etc.)
* File types: `.pdf`, `.docx`, `.zip`, `.jpg`, `.mp4`, `.exe`, etc.
* Paths: `/blog`, `/news`, `/about`, `/contact`, etc.

**Categorization:**

* **Main:** Guides, API, setup
* **Secondary:** Blog, news, legal
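
The rules above might be expressed along these lines. This is only a sketch: `shouldCrawl`, `categorize`, and the lookup tables are hypothetical names, the tables contain just the entries listed above (the "etc." items are left out), and treating anything without a secondary keyword as "main" is an assumption. The real logic in `internal/filters` may work differently.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
	"strings"
)

// Illustrative tables holding only the entries named in the README.
var (
	langFolders  = map[string]bool{"en": true, "zh": true}
	skippedExts  = map[string]bool{".pdf": true, ".docx": true, ".zip": true, ".jpg": true, ".mp4": true, ".exe": true}
	skippedPaths = []string{"/blog", "/news", "/about", "/contact"}
)

// shouldCrawl reports whether rawURL is on startHost and passes the
// exclusion rules (file type, path prefix, language folder).
func shouldCrawl(rawURL, startHost string) bool {
	u, err := url.Parse(rawURL)
	if err != nil || !strings.EqualFold(u.Hostname(), startHost) {
		return false // unparsable or external domain
	}
	p := strings.ToLower(u.Path)
	if skippedExts[path.Ext(p)] {
		return false
	}
	for _, prefix := range skippedPaths {
		if p == prefix || strings.HasPrefix(p, prefix+"/") {
			return false
		}
	}
	// Only directory segments count as language folders, so the final
	// path element is not checked.
	segs := strings.Split(p, "/")
	for _, seg := range segs[:len(segs)-1] {
		if langFolders[seg] {
			return false
		}
	}
	return true
}

// categorize labels a URL path "secondary" when it contains a secondary
// keyword and "main" otherwise (the default is an assumption).
func categorize(urlPath string) string {
	p := strings.ToLower(urlPath)
	for _, kw := range []string{"blog", "news", "legal"} {
		if strings.Contains(p, kw) {
			return "secondary"
		}
	}
	return "main"
}

func main() {
	host := "docs.example.com"
	for _, raw := range []string{
		"https://docs.example.com/guide/setup",
		"https://docs.example.com/files/manual.pdf",
		"https://docs.example.com/zh/guide",
		"https://other.example.org/guide",
	} {
		fmt.Printf("%s crawl=%t\n", raw, shouldCrawl(raw, host))
	}
	fmt.Println(categorize("/api/reference"), categorize("/legal/terms"))
}
```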

---

## Development Setup

```bash
make dev-setup
```

### Common Commands

```bash
make build
make test
make lint
make demo
make clean
```

---

## Code Style

* `gofmt`, `goimports`
* Idiomatic Go patterns
* `%w` for error wrapping
* Table-driven tests
* Package names are short and focused

---

## Testing

```bash
make test
make test-coverage
go test -v ./internal/filters/
```

---

## Dependencies

* `colly` – web crawling
* `html-to-markdown` – content conversion
* `progressbar` – terminal progress
* `urfave/cli` – CLI handling

---

## License

MIT – see [LICENSE](LICENSE)