temp-deepwiki/README.md
2025-10-01 18:01:57 +07:00

DeepWiki Local

Turn your folders and repos into a browsable "wiki" with search, graphs, and Q&A.

Status: Steps 0-3 Complete

This implementation includes the foundation of the DeepWiki pipeline:

  • Step 0: Core data structures for files, documents, symbols, and chunks
  • Step 1: File discovery with ignore patterns and fingerprinting
  • Step 2: Symbol extraction using tree-sitter for Python, Rust, and TypeScript/JavaScript
  • Step 3: Document chunking by semantic units (functions, sections)
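
As a rough illustration of the Step 0 data structures, here is a minimal sketch of how the four types named in types.rs (FileRecord, Document, Symbol, Chunk) might relate to each other. Only the type names come from this README; every field, the SymbolKind enum, and the line_count helper are assumptions made for illustration, not the project's actual definitions.

```rust
use std::path::PathBuf;

/// A file found during discovery (Step 1). Fields are illustrative assumptions.
#[derive(Debug, Clone, PartialEq)]
pub struct FileRecord {
    pub path: PathBuf,
    pub size_bytes: u64,
    /// Hex-encoded content hash used for change detection.
    pub fingerprint: String,
}

/// Kind of symbol extracted by the parser (hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SymbolKind {
    Function,
    Class,
    Struct,
    Import,
}

/// A named symbol and the 1-based, inclusive line range it spans.
#[derive(Debug, Clone, PartialEq)]
pub struct Symbol {
    pub name: String,
    pub kind: SymbolKind,
    pub start_line: usize,
    pub end_line: usize,
}

/// A parsed file: its record plus the symbols extracted from it (Step 2).
#[derive(Debug, Clone, PartialEq)]
pub struct Document {
    pub file: FileRecord,
    pub symbols: Vec<Symbol>,
}

/// A searchable slice of a document (Step 3).
#[derive(Debug, Clone, PartialEq)]
pub struct Chunk {
    pub path: PathBuf,
    pub start_line: usize,
    pub end_line: usize,
    /// Symbol name for code, heading text for Markdown.
    pub label: Option<String>,
    pub text: String,
}

impl Chunk {
    /// Number of lines covered by this chunk (the range is inclusive).
    pub fn line_count(&self) -> usize {
        self.end_line - self.start_line + 1
    }
}
```

Keeping line ranges on both Symbol and Chunk is one way to make later citations to exact file:line-line ranges straightforward.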

Quick Start

# Build and run
cargo build
cargo run

# Run tests
cargo test

What It Does

1. Discovers files in your project (respects .gitignore)
   └─► 273 files found, 21 skipped

2. Parses files to extract symbols and imports
   └─► Functions, classes, imports identified

3. Chunks documents into searchable pieces
   └─► Per-function chunks for code, per-section for docs

Example Output

=== DeepWiki Local - Steps 0-3 ===

Step 1: Discovery
Scanning directory: .
Discovery complete: 273 files found, 21 skipped

Step 2: Parsing
Parsed: example/orders.py (4 symbols)
  - class OrderService
  - function create_order
  - function get_order
  - function list_orders

Step 3: Chunking
Created 4 chunks from example/orders.py
  Chunk 1: lines 5-24 (function create_order)
  Chunk 2: lines 26-28 (function get_order)

Features

Discovery

  • Gitignore-aware file walking
  • Smart ignore patterns (node_modules, target, .git, etc.)
  • BLAKE3 fingerprinting for change detection
  • Size filtering (max 2MB per file)
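
The discovery rules above can be sketched with the standard library alone. This is a simplified illustration, not the code in discover.rs: the function names and the ignored-directory subset are hypothetical, gitignore handling is left out (the project uses the ignore crate for that), and std's DefaultHasher stands in for BLAKE3 so the example needs no external crates.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::path::{Component, Path};

/// Size cap from the feature list: 2 MB per file.
const MAX_FILE_BYTES: u64 = 2 * 1024 * 1024;

/// Directory names skipped outright (an illustrative subset).
const IGNORED_DIRS: [&str; 3] = ["node_modules", "target", ".git"];

/// Returns true if discovery should skip this file, either because it is
/// too large or because any path component is an ignored directory name.
pub fn should_skip(path: &Path, size_bytes: u64) -> bool {
    if size_bytes > MAX_FILE_BYTES {
        return true;
    }
    path.components().any(|c| match c {
        Component::Normal(name) => IGNORED_DIRS.iter().any(|d| name.to_str() == Some(*d)),
        _ => false,
    })
}

/// Stand-in content fingerprint.
/// NOTE: the project lists BLAKE3; DefaultHasher is used here only as a
/// placeholder and is not guaranteed stable across Rust releases.
pub fn fingerprint(bytes: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    bytes.hash(&mut hasher);
    hasher.finish()
}

/// Records the file's fingerprint and returns true if the file is new or
/// its content differs from the previously recorded fingerprint.
pub fn has_changed(seen: &mut HashMap<String, u64>, path: &str, bytes: &[u8]) -> bool {
    let fp = fingerprint(bytes);
    seen.insert(path.to_string(), fp) != Some(fp)
}
```

With a persisted fingerprint map, a later run could re-parse only the files for which has_changed returns true.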

Parsing

  • Tree-sitter based symbol extraction
  • Python: functions, classes, imports
  • Rust: functions, structs, use declarations
  • TypeScript/JavaScript: functions, classes, ES6 imports
  • JSON: package.json scripts and dependencies
  • Secret redaction (API keys, tokens)
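
To make two of the parsing features concrete, here is a small sketch of import extraction and secret redaction. It is deliberately a line-based heuristic rather than the tree-sitter approach the project uses, and the function names, the sensitive-key list, and the [REDACTED] placeholder are all assumptions made for illustration.

```rust
/// Extracts imported module names from Python source.
/// NOTE: a line-based stand-in for tree-sitter queries; it only handles
/// single-line `import x` and `from x import y` statements.
pub fn python_imports(source: &str) -> Vec<String> {
    let mut modules = Vec::new();
    for line in source.lines() {
        let line = line.trim();
        if let Some(rest) = line.strip_prefix("import ") {
            // `import a, b as c` yields "a" and "b".
            for part in rest.split(',') {
                if let Some(name) = part.split_whitespace().next() {
                    modules.push(name.to_string());
                }
            }
        } else if let Some(rest) = line.strip_prefix("from ") {
            // `from pkg.mod import name` yields "pkg.mod".
            if let Some(name) = rest.split_whitespace().next() {
                modules.push(name.to_string());
            }
        }
    }
    modules
}

/// Key-name fragments treated as sensitive (hypothetical list).
const SENSITIVE_KEYS: [&str; 4] = ["api_key", "token", "secret", "password"];

/// Replaces the value in a `KEY=value` or `KEY: value` line with a
/// placeholder when the key name looks sensitive; other lines pass through.
pub fn redact_line(line: &str) -> String {
    let pos = match line.find(|c: char| c == '=' || c == ':') {
        Some(p) => p,
        None => return line.to_string(),
    };
    let key = line[..pos].to_lowercase();
    if SENSITIVE_KEYS.iter().any(|k| key.contains(*k)) {
        // Keep the key and separator, drop the value.
        format!("{}[REDACTED]", &line[..=pos])
    } else {
        line.to_string()
    }
}
```

A syntax-tree based extractor would also handle multi-line and conditional imports, which this heuristic misses.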

Chunking

  • Code: one chunk per symbol (function/class)
  • Markdown: one chunk per heading section
  • Line ranges and headings preserved
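
The Markdown strategy (one chunk per heading section, with line ranges preserved) could look roughly like the sketch below. MdChunk and chunk_markdown are hypothetical names, not the API of chunker.rs, and the heading detection is simplified: it recognizes only ATX headings at the start of a line and does not skip `#` lines inside fenced code blocks.

```rust
/// One heading section of a Markdown document with its 1-based,
/// inclusive line range. (Hypothetical type for illustration.)
#[derive(Debug, PartialEq)]
pub struct MdChunk {
    pub heading: Option<String>,
    pub start_line: usize,
    pub end_line: usize,
}

/// Splits Markdown into one chunk per ATX heading (`#`, `##`, ...).
/// Text before the first heading becomes a chunk with no heading.
pub fn chunk_markdown(source: &str) -> Vec<MdChunk> {
    let mut chunks: Vec<MdChunk> = Vec::new();
    for (idx, line) in source.lines().enumerate() {
        let line_no = idx + 1;
        // A heading is one or more '#' followed by a space.
        let after_hashes = line.trim_start_matches('#');
        let is_heading = line.starts_with('#') && after_hashes.starts_with(' ');
        if is_heading {
            chunks.push(MdChunk {
                heading: Some(after_hashes.trim().to_string()),
                start_line: line_no,
                end_line: line_no,
            });
        } else if let Some(last) = chunks.last_mut() {
            // Extend the current section to include this line.
            last.end_line = line_no;
        } else {
            // Preamble before any heading.
            chunks.push(MdChunk { heading: None, start_line: line_no, end_line: line_no });
        }
    }
    chunks
}
```

Code chunking would follow the same shape, with symbol line ranges from the parser taking the place of heading boundaries.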

Architecture

src/
├── main.rs          # Pipeline orchestration
├── types.rs         # Data structures (FileRecord, Document, Symbol, Chunk)
├── discover.rs      # File discovery with ignore patterns
├── parser.rs        # Tree-sitter parsing and symbol extraction
└── chunker.rs       # Document chunking strategies

Documentation

Dependencies

blake3 = "1.8.2"              # Fast hashing
ignore = "0.4"                # Gitignore support
tree-sitter = "0.24"          # Language parsing
serde_json = "1.0"            # JSON parsing
anyhow = "1.0"                # Error handling

Testing

All 6 tests pass, covering:

  • Pattern matching for ignore rules
  • Secret redaction
  • Import parsing (Python, Rust)
  • Markdown and code chunking

Next Steps (Steps 4-7)

  • Step 4: BM25 keyword indexing with Tantivy
  • Step 5: Vector embeddings with ONNX
  • Step 6: Symbol graph building
  • Step 7: Wiki page synthesis

Design Philosophy

  1. Fast: BLAKE3 hashing, tree-sitter parsing, incremental updates
  2. Local-first: No cloud dependencies, runs offline
  3. Language-agnostic: Tree-sitter grammars exist for 40+ languages
  4. Precise: Citations to exact file:line-line ranges
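
The file:line-line citation format mentioned in the last point can be sketched as a small type that parses and prints such references. The Citation type and its methods are hypothetical and are not part of the current codebase; the sketch assumes 1-based, inclusive line ranges.

```rust
use std::fmt;

/// A reference to a line range in a file, e.g. `README.md:19-28`.
#[derive(Debug, PartialEq)]
pub struct Citation {
    pub path: String,
    pub start_line: usize,
    pub end_line: usize,
}

impl Citation {
    /// Parses `path:start-end`. Returns None if the input is malformed,
    /// the path is empty, a line number is 0, or start exceeds end.
    pub fn parse(s: &str) -> Option<Citation> {
        // Split on the last ':' so paths containing ':' still work.
        let (path, range) = s.rsplit_once(':')?;
        let (start, end) = range.split_once('-')?;
        let start_line: usize = start.parse().ok()?;
        let end_line: usize = end.parse().ok()?;
        if path.is_empty() || start_line == 0 || start_line > end_line {
            return None;
        }
        Some(Citation { path: path.to_string(), start_line, end_line })
    }
}

impl fmt::Display for Citation {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}:{}-{}", self.path, self.start_line, self.end_line)
    }
}
```

Rejecting reversed or zero-based ranges at parse time keeps invalid citations from reaching the answer-rendering stage.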

Performance

  • Discovery: ~50ms for 273 files
  • Parsing: ~20ms for 5 files
  • Chunking: <1ms per document

Example Use Cases

Once complete, DeepWiki is intended to answer questions such as:

  • "How do I run this project?" → README.md:19-28
  • "Where is create_order defined?" → api/orders.py:12-27
  • "What calls this function?" → Graph analysis
  • "Generate a flow diagram for checkout" → Synthesized from symbols

License

[Specify your license]

Contributing

This is an early-stage implementation. Contributions welcome!