# DeepWiki Local
Turn your folders and repos into a browsable "wiki" with search, graphs, and Q&A.
Status: Steps 0-3 Complete ✅
This implementation includes the foundation of the DeepWiki pipeline:
- Step 0: Core data structures for files, documents, symbols, and chunks
- Step 1: File discovery with ignore patterns and fingerprinting
- Step 2: Symbol extraction using tree-sitter for Python, Rust, TypeScript
- Step 3: Document chunking by semantic units (functions, sections)
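The Step 0 types named above (`FileRecord`, `Document`, `Symbol`, `Chunk`) live in `src/types.rs`. As a rough sketch of their shape (the field names here are illustrative assumptions, not the exact definitions):

```rust
use std::path::PathBuf;

// Illustrative sketch of the Step 0 data structures; the real definitions
// in src/types.rs may differ in field names and detail.
pub struct FileRecord {
    pub path: PathBuf,       // path relative to the project root
    pub fingerprint: String, // BLAKE3 hash used for change detection
    pub size_bytes: u64,
}

pub struct Symbol {
    pub name: String, // e.g. "create_order"
    pub kind: String, // "function", "class", "struct", ...
    pub start_line: usize,
    pub end_line: usize,
}

pub struct Document {
    pub file: FileRecord,
    pub language: Option<String>, // "python", "rust", "markdown", ...
    pub symbols: Vec<Symbol>,
}

pub struct Chunk {
    pub doc_path: PathBuf,
    pub heading: Option<String>, // enclosing symbol or Markdown heading
    pub start_line: usize,
    pub end_line: usize,
    pub text: String,
}
```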
## Quick Start

```bash
# Build and run
cargo build
cargo run

# Run tests
cargo test
```
## What It Does

```text
1. Discovers files in your project (respects .gitignore)
   └─► 273 files found, 21 skipped

2. Parses files to extract symbols and imports
   └─► Functions, classes, imports identified

3. Chunks documents into searchable pieces
   └─► Per-function chunks for code, per-section for docs
```
## Example Output

```text
=== DeepWiki Local - Steps 0-3 ===

Step 1: Discovery
Scanning directory: .
Discovery complete: 273 files found, 21 skipped

Step 2: Parsing
Parsed: example/orders.py (4 symbols)
  - class OrderService
  - function create_order
  - function get_order
  - function list_orders

Step 3: Chunking
Created 4 chunks from example/orders.py
  Chunk 1: lines 5-24 (function create_order)
  Chunk 2: lines 26-28 (function get_order)
```
## Features

### Discovery
- ✅ Gitignore-aware file walking
- ✅ Smart ignore patterns (node_modules, target, .git, etc.)
- ✅ BLAKE3 fingerprinting for change detection
- ✅ Size filtering (max 2MB per file)
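A minimal sketch of how this stage can be wired together with the `ignore` and `blake3` crates from the dependency list; the actual `discover.rs` also applies the extra ignore patterns and tracks skipped files, which this sketch leaves out:

```rust
use ignore::WalkBuilder;

const MAX_FILE_SIZE: u64 = 2 * 1024 * 1024; // 2 MB cap per file

/// Walk a directory tree and return (path, BLAKE3 fingerprint) pairs.
fn discover(root: &str) -> anyhow::Result<Vec<(String, String)>> {
    let mut records = Vec::new();
    // WalkBuilder honors .gitignore (and .ignore) files by default.
    for entry in WalkBuilder::new(root).build() {
        let entry = entry?;
        let path = entry.path();
        // Skip directories and anything over the size cap.
        if !path.is_file() || std::fs::metadata(path)?.len() > MAX_FILE_SIZE {
            continue;
        }
        // Fingerprint the contents so unchanged files can be skipped on re-runs.
        let bytes = std::fs::read(path)?;
        let fingerprint = blake3::hash(&bytes).to_hex().to_string();
        records.push((path.display().to_string(), fingerprint));
    }
    Ok(records)
}
```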
### Parsing
- ✅ Tree-sitter based symbol extraction
- ✅ Python: functions, classes, imports
- ✅ Rust: functions, structs, use declarations
- ✅ TypeScript/JavaScript: functions, classes, ES6 imports
- ✅ JSON: package.json scripts and dependencies
- ✅ Secret redaction (API keys, tokens)
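A minimal sketch of the Python side of this pass, assuming the `tree_sitter_python` grammar crate (recent releases expose a `LANGUAGE` constant; older ones use `language()`). Unlike the real parser, it only walks top-level definitions and ignores imports and class bodies:

```rust
use tree_sitter::Parser;

/// Extract top-level Python functions and classes as (kind, name, line).
fn python_symbols(source: &str) -> anyhow::Result<Vec<(String, String, usize)>> {
    let mut parser = Parser::new();
    parser.set_language(&tree_sitter_python::LANGUAGE.into())?;

    let tree = parser
        .parse(source, None)
        .ok_or_else(|| anyhow::anyhow!("tree-sitter could not parse the source"))?;

    let mut symbols = Vec::new();
    let mut cursor = tree.root_node().walk();
    for node in tree.root_node().children(&mut cursor) {
        if matches!(node.kind(), "function_definition" | "class_definition") {
            // The "name" field holds the identifier node of the definition.
            if let Some(name) = node.child_by_field_name("name") {
                symbols.push((
                    node.kind().to_string(),
                    name.utf8_text(source.as_bytes())?.to_string(),
                    node.start_position().row + 1, // rows are 0-based
                ));
            }
        }
    }
    Ok(symbols)
}
```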
### Chunking
- ✅ Code: one chunk per symbol (function/class)
- ✅ Markdown: one chunk per heading section
- ✅ Line ranges and headings preserved
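A sketch of the Markdown strategy, splitting on headings while keeping 1-based line ranges for citations (a real implementation would also handle text before the first heading and `#` lines inside fenced code blocks):

```rust
/// One chunk per Markdown heading section, with its heading and line range.
struct MdChunk {
    heading: String,
    start_line: usize,
    end_line: usize,
    text: String,
}

fn chunk_markdown(source: &str) -> Vec<MdChunk> {
    let mut chunks: Vec<MdChunk> = Vec::new();
    for (idx, line) in source.lines().enumerate() {
        let line_no = idx + 1;
        if line.starts_with('#') {
            // A heading closes the previous section and opens a new one.
            chunks.push(MdChunk {
                heading: line.trim_start_matches('#').trim().to_string(),
                start_line: line_no,
                end_line: line_no,
                text: String::new(),
            });
        }
        if let Some(current) = chunks.last_mut() {
            current.text.push_str(line);
            current.text.push('\n');
            current.end_line = line_no;
        }
    }
    chunks
}
```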
## Architecture

```text
src/
├── main.rs      # Pipeline orchestration
├── types.rs     # Data structures (FileRecord, Document, Symbol, Chunk)
├── discover.rs  # File discovery with ignore patterns
├── parser.rs    # Tree-sitter parsing and symbol extraction
└── chunker.rs   # Document chunking strategies
```
## Documentation
- IMPLEMENTATION_SUMMARY.md - Quick overview of what's implemented
- README_STEPS_0_3.md - Detailed documentation with examples
## Dependencies

```toml
blake3 = "1.8.2"      # Fast hashing
ignore = "0.4"        # Gitignore support
tree-sitter = "0.24"  # Language parsing
serde_json = "1.0"    # JSON parsing
anyhow = "1.0"        # Error handling
```
## Testing
All tests passing (6/6):
- Pattern matching for ignore rules
- Secret redaction
- Import parsing (Python, Rust)
- Markdown and code chunking
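For illustration, a self-contained test in the same style; `should_ignore` is a hypothetical stand-in for the project's real ignore-pattern check, not its actual API:

```rust
// Stand-in for the real pattern-matching helper exercised by the suite.
fn should_ignore(path: &str) -> bool {
    const IGNORED_DIRS: [&str; 3] = ["node_modules", "target", ".git"];
    path.split('/').any(|segment| IGNORED_DIRS.contains(&segment))
}

#[test]
fn skips_common_build_directories() {
    assert!(should_ignore("web/node_modules/react/index.js"));
    assert!(should_ignore("target/debug/deepwiki"));
    assert!(should_ignore(".git/HEAD"));
    assert!(!should_ignore("src/main.rs"));
}
```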
## Next Steps (Steps 4-7)
- Step 4: BM25 keyword indexing with Tantivy
- Step 5: Vector embeddings with ONNX
- Step 6: Symbol graph building
- Step 7: Wiki page synthesis
## Design Philosophy
- Fast: BLAKE3 hashing, tree-sitter parsing, incremental updates
- Local-first: No cloud dependencies, runs offline
- Language-agnostic: Tree-sitter supports 40+ languages
- Precise: Citations point to exact file and start-end line ranges
## Performance
- Discovery: ~50ms for 273 files
- Parsing: ~20ms for 5 files
- Chunking: <1ms per document
## Example Use Cases
Once complete, DeepWiki will answer:
- "How do I run this project?" → README.md:19-28
- "Where is create_order defined?" → api/orders.py:12-27
- "What calls this function?" → Graph analysis
- "Generate a flow diagram for checkout" → Synthesized from symbols
## License
[Specify your license]
## Contributing
This is an early-stage implementation. Contributions welcome!