mirror of
https://github.com/Sosokker/site-to-llmstxt.git
synced 2025-12-19 14:04:06 +01:00
chore: add README
parent
c54a27e458
commit
fc16ff8f33
144
README.md
Normal file
@@ -0,0 +1,144 @@
# Site to LLMs.txt - Generate llms.txt from any docs site

> ⚠️ I vibe-coded this project. Quality isn't guaranteed but it is usable.

A Go-based web crawler that scrapes websites and outputs documentation in [LLMs.txt format](https://llmstxt.org/) and Markdown files.

---
## Features

* Generate `llms.txt` and `llms-full.txt` for specific websites
* Convert HTML pages to Markdown
* Filter out unwanted pages, files, and external domains
* Categorize main vs. secondary documentation
* Track crawl progress
* Save content to structured files

---

## Installation

```bash
git clone https://github.com/Sosokker/site-to-llmstxt.git
cd site-to-llmstxt
make build
```

---

## Quick Start

```bash
./bin/site-to-llmstxt --url https://docs.example.com
```

```bash
# Custom settings
./bin/site-to-llmstxt --url https://example.com --output ./my-docs --workers 3 --verbose
```

---

## CLI Options

* `--url, -u`: Start URL (required)
* `--output, -o`: Output directory (default: `./output`)
* `--workers, -w`: Number of concurrent workers (default: `1`)
* `--verbose`: Enable progress logging

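
The options above could be parsed roughly as follows. This is a minimal sketch that uses only the standard library `flag` package rather than `urfave/cli`, which is what the project depends on. The `Options` struct, the `parseOptions` helper, and the check that `--workers` is at least 1 are assumptions made for illustration; only the flag names and defaults come from the list above.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"io"
	"os"
)

// Options mirrors the documented flags. The struct and its field names are
// hypothetical; the project itself uses urfave/cli.
type Options struct {
	URL     string
	Output  string
	Workers int
	Verbose bool
}

// parseOptions parses args (without the program name) into Options.
func parseOptions(args []string) (Options, error) {
	var opts Options
	fs := flag.NewFlagSet("site-to-llmstxt", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // the caller reports errors itself

	// Register each long flag and its short alias against the same variable.
	fs.StringVar(&opts.URL, "url", "", "start URL (required)")
	fs.StringVar(&opts.URL, "u", "", "shorthand for --url")
	fs.StringVar(&opts.Output, "output", "./output", "output directory")
	fs.StringVar(&opts.Output, "o", "./output", "shorthand for --output")
	fs.IntVar(&opts.Workers, "workers", 1, "number of concurrent workers")
	fs.IntVar(&opts.Workers, "w", 1, "shorthand for --workers")
	fs.BoolVar(&opts.Verbose, "verbose", false, "enable progress logging")

	if err := fs.Parse(args); err != nil {
		return Options{}, fmt.Errorf("parse flags: %w", err)
	}
	if opts.URL == "" {
		return Options{}, errors.New("--url is required")
	}
	if opts.Workers < 1 { // assumed validation, not documented in the README
		return Options{}, errors.New("--workers must be at least 1")
	}
	return opts, nil
}

func main() {
	opts, err := parseOptions(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
	fmt.Printf("%+v\n", opts)
}
```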
---

## Output

```
output/
├── llms.txt       # Overview
├── llms-full.txt  # Full crawl
└── pages/         # Per-page Markdown
```

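
To give a feel for what the `llms.txt` overview can look like, here is a small sketch that renders an index in the shape described at llmstxt.org: an H1 title, a blockquote summary, then H2 sections of Markdown links. The `Page` type, the `renderLLMsTxt` function, and the `Docs` and `Optional` section names are assumptions for illustration; the exact layout this tool writes may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// Page is a hypothetical record for one crawled page.
type Page struct {
	Title       string
	URL         string
	Description string
	Main        bool // true for main documentation, false for secondary
}

// renderLLMsTxt builds an llms.txt index: an H1 title, a blockquote summary,
// then one H2 section of links per category. Empty categories are omitted.
func renderLLMsTxt(site, summary string, pages []Page) string {
	var b strings.Builder
	fmt.Fprintf(&b, "# %s\n\n> %s\n", site, summary)

	writeSection := func(heading string, main bool) {
		wroteHeading := false
		for _, p := range pages {
			if p.Main != main {
				continue
			}
			if !wroteHeading {
				fmt.Fprintf(&b, "\n## %s\n\n", heading)
				wroteHeading = true
			}
			fmt.Fprintf(&b, "- [%s](%s)", p.Title, p.URL)
			if p.Description != "" {
				fmt.Fprintf(&b, ": %s", p.Description)
			}
			b.WriteString("\n")
		}
	}
	writeSection("Docs", true)      // assumed section name for main pages
	writeSection("Optional", false) // assumed section name for secondary pages
	return b.String()
}

func main() {
	pages := []Page{
		{Title: "Getting Started", URL: "https://docs.example.com/start", Description: "Install and run", Main: true},
		{Title: "Changelog", URL: "https://docs.example.com/changelog"},
	}
	fmt.Print(renderLLMsTxt("Example Docs", "Documentation for Example.", pages))
}
```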
---

## Architecture

```
internal/
├── config/      # CLI and env config
├── crawler/     # Fetch and parse logic
├── filters/     # URL + content filter
├── generator/   # Output file creator
├── models/      # Shared types
├── progress/    # Progress reporting
└── utils/       # Misc functions
```

---

## Filtering Rules

**Excluded:**

* Language folders (`/en/`, `/zh/`, etc.)
* File types: `.pdf`, `.docx`, `.zip`, `.jpg`, `.mp4`, `.exe`, etc.
* Paths: `/blog`, `/news`, `/about`, `/contact`, etc.

**Categorization:**

* **Main:** Guides, API, setup
* **Secondary:** Blog, news, legal

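
The exclusion rules above might be expressed along these lines. This is an illustrative sketch, not the code in `internal/filters/`: the `shouldSkip` function, the rule variables, and the two-letter regular expression for language folders are assumptions, and the lists contain only the examples named above. The two-letter heuristic would also match non-language folders such as `/go/`, so a real filter would likely need an explicit language list.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
	"regexp"
	"strings"
)

// Illustrative rule sets built from the examples in the README.
var (
	skippedExtensions = map[string]bool{
		".pdf": true, ".docx": true, ".zip": true,
		".jpg": true, ".mp4": true, ".exe": true,
	}
	skippedPrefixes = []string{"/blog", "/news", "/about", "/contact"}
	// Matches a two-letter first path segment such as /en/ or /zh/ (assumed heuristic).
	languageFolder = regexp.MustCompile(`^/[a-z]{2}/`)
)

// shouldSkip reports whether rawURL should be excluded from a crawl of host.
func shouldSkip(rawURL, host string) bool {
	u, err := url.Parse(rawURL)
	if err != nil || u.Hostname() != host {
		return true // unparsable URL or external domain
	}
	p := strings.ToLower(u.Path)
	if skippedExtensions[path.Ext(p)] || languageFolder.MatchString(p) {
		return true
	}
	for _, prefix := range skippedPrefixes {
		// Match /blog and /blog/..., but not /blogging-tips.
		if p == prefix || strings.HasPrefix(p, prefix+"/") {
			return true
		}
	}
	return false
}

func main() {
	urls := []string{
		"https://docs.example.com/guide/install",
		"https://docs.example.com/files/manual.pdf",
		"https://docs.example.com/en/guide",
		"https://docs.example.com/blog/post-1",
		"https://other.example.org/guide",
	}
	for _, u := range urls {
		fmt.Printf("%-45s skip=%v\n", u, shouldSkip(u, "docs.example.com"))
	}
}
```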
---

## Development Setup

```bash
make dev-setup
```

### Common Commands

```bash
make build
make test
make lint
make demo
make clean
```

---

## Code Style

* `gofmt`, `goimports`
* Idiomatic Go patterns
* `%w` for error wrapping
* Table-driven tests
* Package names are short and focused

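
As a short illustration of the `%w` convention listed above, the sketch below wraps a sentinel error with `fmt.Errorf` and then recovers it with `errors.Is`. The names `ErrExcluded`, `checkURL`, `visit`, and `isExcluded` are hypothetical and do not come from this repository. Running it prints the wrapped message `visit https://docs.example.com/blog: url excluded by filter` followed by `true`.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrExcluded is a hypothetical sentinel error.
var ErrExcluded = errors.New("url excluded by filter")

// checkURL is a stand-in for a filter step that rejects one fixed URL.
func checkURL(u string) error {
	if u == "https://docs.example.com/blog" {
		return ErrExcluded
	}
	return nil
}

// visit adds context with %w so callers can still inspect the cause.
func visit(u string) error {
	if err := checkURL(u); err != nil {
		return fmt.Errorf("visit %s: %w", u, err)
	}
	return nil
}

// isExcluded reports whether err wraps ErrExcluded anywhere in its chain.
func isExcluded(err error) bool {
	return errors.Is(err, ErrExcluded)
}

func main() {
	err := visit("https://docs.example.com/blog")
	fmt.Println(err)
	fmt.Println(isExcluded(err))
}
```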
---

## Testing

```bash
make test
make test-coverage
go test -v ./internal/filters/
```

---

## Dependencies

* `colly` – web crawling
* `html-to-markdown` – content conversion
* `progressbar` – terminal progress
* `urfave/cli` – CLI handling

---

## License

MIT – see [LICENSE](LICENSE)