site-to-llmstxt/README.md
2025-07-12 16:59:59 +00:00


Site to LLMs.txt - Generate llms.txt from any docs site

⚠️ I vibe-coded this project. Quality isn't guaranteed but it is usable.

A Go-based web crawler that scrapes a website and writes its documentation as llms.txt files and per-page Markdown.


Features

  • Generate llms.txt and llms-full.txt for specific websites
  • Convert HTML pages to Markdown
  • Filter out unwanted pages, files, and external domains
  • Categorize main vs. secondary documentation
  • Track crawl progress
  • Save content to structured files

Installation

git clone https://github.com/Sosokker/site-to-llmstxt.git
cd site-to-llmstxt
make build

Or build and run through the Makefile targets:

make build
make run URL=https://example.com
make run URL=https://httpbin.org WORKERS=2 OUTPUT=./test-output

Quick Start

./bin/site-to-llmstxt --url https://docs.example.com
# Custom settings
./bin/site-to-llmstxt --url https://example.com --output ./my-docs --workers 3 --verbose

CLI Options

  • --url, -u: Start URL (required)
  • --output, -o: Output directory (default: ./output)
  • --workers, -w: Number of concurrent workers (default: 1)
  • --verbose: Enable progress logging

Output

output/
├── llms.txt         # Overview
├── llms-full.txt    # Full crawl
└── pages/           # Per-page Markdown

Architecture

internal/
├── config/        # CLI and env config
├── crawler/       # Fetch and parse logic
├── filters/       # URL + content filter
├── generator/     # Output file creator
├── models/        # Shared types
├── progress/      # Progress reporting
└── utils/         # Misc functions

Filtering Rules

Excluded:

  • Language folders (/en/, /zh/, etc.)
  • File types: .pdf, .docx, .zip, .jpg, .mp4, .exe, etc.
  • Paths: /blog, /news, /about, /contact, etc.
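
The exclusion rules above could be expressed roughly as follows. This is a sketch under assumptions, not the code in internal/filters: the lists are short illustrative subsets, `shouldSkip` is a hypothetical name, and treating any path segment equal to a language code as a language folder is a guess.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
	"strings"
)

// Illustrative subsets of the documented exclusions; the real lists are longer.
var (
	skipExts     = map[string]bool{".pdf": true, ".docx": true, ".zip": true, ".jpg": true, ".mp4": true, ".exe": true}
	skipPrefixes = []string{"/blog", "/news", "/about", "/contact"}
	skipLangs    = map[string]bool{"en": true, "zh": true}
)

// shouldSkip reports whether rawURL should be excluded from a crawl rooted at host.
func shouldSkip(rawURL, host string) bool {
	u, err := url.Parse(rawURL)
	if err != nil || u.Hostname() != host {
		return true // unparsable URL or external domain
	}
	p := strings.ToLower(u.Path)
	if skipExts[path.Ext(p)] {
		return true // unwanted file type
	}
	for _, prefix := range skipPrefixes {
		// Match the section itself or anything beneath it, but not /blogging.
		if p == prefix || strings.HasPrefix(p, prefix+"/") {
			return true
		}
	}
	for _, seg := range strings.Split(p, "/") {
		if skipLangs[seg] {
			return true // language folder such as /en/ or /zh/
		}
	}
	return false
}

func main() {
	host := "docs.example.com"
	for _, u := range []string{
		"https://docs.example.com/guide/install",
		"https://docs.example.com/files/manual.pdf",
		"https://docs.example.com/blog/release-notes",
		"https://other.example.org/guide",
	} {
		fmt.Printf("%-50s skip=%v\n", u, shouldSkip(u, host))
	}
}
```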

Categorization:

  • Main: Guides, API, setup
  • Secondary: Blog, news, legal
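
One plausible way to implement this split is keyword matching on path segments. The sketch below is an assumption about the approach, with a hypothetical `categorize` function: segments named blog, news, or legal mark a page as secondary, and everything else is treated as main.

```go
package main

import (
	"fmt"
	"strings"
)

// secondarySegments is a guess based on the categories named in this README.
var secondarySegments = map[string]bool{"blog": true, "news": true, "legal": true}

// categorize returns "secondary" when any path segment is a secondary keyword,
// and "main" otherwise (assumed default).
func categorize(urlPath string) string {
	for _, seg := range strings.Split(strings.ToLower(urlPath), "/") {
		if secondarySegments[seg] {
			return "secondary"
		}
	}
	return "main"
}

func main() {
	for _, p := range []string{"/guides/setup", "/api/users", "/legal/privacy"} {
		fmt.Println(p, "->", categorize(p))
	}
}
```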

Development Setup

make dev-setup

Common Commands

make build
make test
make lint
make demo
make clean

Code Style

  • gofmt, goimports
  • Idiomatic Go patterns
  • %w for error wrapping
  • Table-driven tests
  • Package names are short and focused

Testing

make test
make test-coverage
go test -v ./internal/filters/

Dependencies

  • colly: web crawling
  • html-to-markdown: content conversion
  • progressbar: terminal progress
  • urfave/cli: CLI handling

License

MIT. See LICENSE.