temp-deepwiki/README.md
2025-10-01 18:01:57 +07:00

# DeepWiki Local
Turn your folders and repos into a browsable "wiki" with search, graphs, and Q&A.
## Status: Steps 0-3 Complete ✅
This implementation includes the foundation of the DeepWiki pipeline:
- **Step 0**: Core data structures for files, documents, symbols, and chunks
- **Step 1**: File discovery with ignore patterns and fingerprinting
- **Step 2**: Symbol extraction using tree-sitter for Python, Rust, and TypeScript/JavaScript
- **Step 3**: Document chunking by semantic units (functions, sections)
## Quick Start
```bash
# Build and run
cargo build
cargo run
# Run tests
cargo test
```
## What It Does
```
1. Discovers files in your project (respects .gitignore)
└─► 273 files found, 21 skipped
2. Parses files to extract symbols and imports
└─► Functions, classes, imports identified
3. Chunks documents into searchable pieces
└─► Per-function chunks for code, per-section for docs
```
## Example Output
```
=== DeepWiki Local - Steps 0-3 ===
Step 1: Discovery
Scanning directory: .
Discovery complete: 273 files found, 21 skipped
Step 2: Parsing
Parsed: example/orders.py (4 symbols)
- class OrderService
- function create_order
- function get_order
- function list_orders
Step 3: Chunking
Created 4 chunks from example/orders.py
Chunk 1: lines 5-24 (function create_order)
Chunk 2: lines 26-28 (function get_order)
```
## Features
### Discovery
- ✅ Gitignore-aware file walking
- ✅ Smart ignore patterns (node_modules, target, .git, etc.)
- ✅ BLAKE3 fingerprinting for change detection
- ✅ Size filtering (max 2MB per file)
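
The ignore and size rules above can be sketched with the standard library alone. This is a minimal illustration, not the code in `discover.rs`: the `should_skip` helper, the `IGNORED_DIRS` list, and the reading of "2MB" as 2 * 1024 * 1024 bytes are all assumptions, and the real pipeline additionally honors `.gitignore` through the `ignore` crate.

```rust
use std::path::{Component, Path};

/// Directory names skipped outright (illustrative list taken from the README's examples).
const IGNORED_DIRS: &[&str] = &["node_modules", "target", ".git"];

/// Size cap per file; assumed here to mean 2 * 1024 * 1024 bytes.
const MAX_FILE_SIZE: u64 = 2 * 1024 * 1024;

/// Returns true when a path should be skipped during discovery.
/// Hypothetical helper; the project's actual rules may differ.
pub fn should_skip(path: &Path, size_bytes: u64) -> bool {
    // Match whole path components so that `target_list.rs` is not
    // mistaken for the `target` directory.
    let in_ignored_dir = path.components().any(|c| match c {
        Component::Normal(name) => name
            .to_str()
            .map_or(false, |n| IGNORED_DIRS.contains(&n)),
        _ => false,
    });
    in_ignored_dir || size_bytes > MAX_FILE_SIZE
}
```

Comparing whole components rather than substrings keeps files such as `src/target_list.rs` in the index while still excluding everything under `target/`.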
### Parsing
- ✅ Tree-sitter based symbol extraction
- ✅ Python: functions, classes, imports
- ✅ Rust: functions, structs, use declarations
- ✅ TypeScript/JavaScript: functions, classes, ES6 imports
- ✅ JSON: package.json scripts and dependencies
- ✅ Secret redaction (API keys, tokens)
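
As a rough illustration of secret redaction, the sketch below masks the value on `name = value` or `name: value` lines whose name looks sensitive. The `redact_secrets` function, the hint list, and the `[REDACTED]` placeholder are hypothetical; the project's redactor may use different rules and patterns.

```rust
/// Key-name fragments that mark a value as sensitive (illustrative, not exhaustive).
const SENSITIVE_HINTS: &[&str] = &["api_key", "apikey", "token", "secret", "password"];

/// Replaces the value on `name = value` or `name: value` lines with a placeholder
/// when the name looks sensitive. Hypothetical helper, not the project's redactor.
/// Note: a trailing newline in the input is not preserved.
pub fn redact_secrets(text: &str) -> String {
    text.lines()
        .map(|line| {
            // Split at the first '=' or ':' so the key and separator stay intact.
            match line.find(|c: char| c == '=' || c == ':') {
                Some(idx) => {
                    let key = line[..idx].to_lowercase();
                    if SENSITIVE_HINTS.iter().any(|hint| key.contains(*hint)) {
                        format!("{} [REDACTED]", &line[..=idx])
                    } else {
                        line.to_string()
                    }
                }
                None => line.to_string(),
            }
        })
        .collect::<Vec<_>>()
        .join("\n")
}
```

Running redaction before chunking means secrets never reach an index or a generated wiki page, at the cost of occasionally masking a harmless value whose name happens to contain one of the hints.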
### Chunking
- ✅ Code: one chunk per symbol (function/class)
- ✅ Markdown: one chunk per heading section
- ✅ Line ranges and headings preserved
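
Per-heading Markdown chunking with preserved line ranges might look like the following. This is a sketch under assumptions: `MdChunk`, `heading_text`, and `chunk_markdown` are illustrative names, only ATX (`#`) headings are recognized, and the logic in `chunker.rs` may differ.

```rust
/// One heading section of a Markdown document with its 1-based, inclusive line range.
/// Field names are illustrative; the project's `Chunk` type may differ.
#[derive(Debug, Clone, PartialEq)]
pub struct MdChunk {
    pub heading: Option<String>,
    pub start_line: usize,
    pub end_line: usize,
    pub text: String,
}

/// Returns the heading title if `line` is an ATX heading (1 to 6 `#` then a space).
fn heading_text(line: &str) -> Option<&str> {
    let rest = line.trim_start_matches('#');
    let level = line.len() - rest.len();
    if (1..=6).contains(&level) && (rest.is_empty() || rest.starts_with(' ')) {
        Some(rest.trim())
    } else {
        None
    }
}

/// Splits a document into one chunk per heading section.
pub fn chunk_markdown(source: &str) -> Vec<MdChunk> {
    let mut chunks = Vec::new();
    let mut heading: Option<String> = None;
    let mut start_line = 1;
    let mut buf: Vec<&str> = Vec::new();
    let mut in_fence = false;

    for (i, line) in source.lines().enumerate() {
        let line_no = i + 1;
        // Track fenced code blocks so `# comment` lines inside them are not headings.
        if line.trim_start().starts_with("```") {
            in_fence = !in_fence;
            buf.push(line);
            continue;
        }
        let new_heading = if in_fence { None } else { heading_text(line) };
        if let Some(title) = new_heading {
            // Close the section that was open before this heading.
            if !buf.is_empty() {
                chunks.push(MdChunk {
                    heading: heading.take(),
                    start_line,
                    end_line: line_no - 1,
                    text: buf.join("\n"),
                });
                buf.clear();
            }
            heading = Some(title.to_string());
            start_line = line_no;
        }
        buf.push(line);
    }
    // Flush the final section.
    if !buf.is_empty() {
        let end_line = start_line + buf.len() - 1;
        chunks.push(MdChunk { heading, start_line, end_line, text: buf.join("\n") });
    }
    chunks
}
```

Tracking fence state matters for READMEs like this one, where shell comments such as `# Build and run` would otherwise be mistaken for headings.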
## Architecture
```
src/
├── main.rs # Pipeline orchestration
├── types.rs # Data structures (FileRecord, Document, Symbol, Chunk)
├── discover.rs # File discovery with ignore patterns
├── parser.rs # Tree-sitter parsing and symbol extraction
└── chunker.rs # Document chunking strategies
```
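
To make the data flow between these modules concrete, here is one possible shape for the four types named in `types.rs`, plus a per-symbol chunker. Only the type names come from this README; every field, the `SymbolKind` enum, and the `citation` and `chunk_by_symbol` helpers are assumptions for illustration.

```rust
use std::path::PathBuf;

/// A discovered file plus the metadata needed for change detection.
#[derive(Debug, Clone, PartialEq)]
pub struct FileRecord {
    pub path: PathBuf,
    pub size_bytes: u64,
    pub fingerprint: String, // hex digest of the file contents
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum SymbolKind {
    Function,
    Class,
    Struct,
    Import,
}

/// A named item found by the parser, with a 1-based inclusive line range.
#[derive(Debug, Clone, PartialEq)]
pub struct Symbol {
    pub name: String,
    pub kind: SymbolKind,
    pub start_line: usize,
    pub end_line: usize,
}

/// A parsed file: its record, full text, and extracted symbols.
#[derive(Debug, Clone, PartialEq)]
pub struct Document {
    pub file: FileRecord,
    pub text: String,
    pub symbols: Vec<Symbol>,
}

/// A searchable slice of a document.
#[derive(Debug, Clone, PartialEq)]
pub struct Chunk {
    pub path: PathBuf,
    pub start_line: usize,
    pub end_line: usize,
    pub label: String,
    pub text: String,
}

impl Chunk {
    /// Formats a `file:start-end` citation.
    pub fn citation(&self) -> String {
        format!("{}:{}-{}", self.path.display(), self.start_line, self.end_line)
    }
}

/// Emits one chunk per non-import symbol, skipping ranges outside the text.
pub fn chunk_by_symbol(doc: &Document) -> Vec<Chunk> {
    let lines: Vec<&str> = doc.text.lines().collect();
    doc.symbols
        .iter()
        .filter(|s| s.kind != SymbolKind::Import)
        .filter(|s| s.start_line >= 1 && s.start_line <= s.end_line && s.end_line <= lines.len())
        .map(|s| Chunk {
            path: doc.file.path.clone(),
            start_line: s.start_line,
            end_line: s.end_line,
            label: s.name.clone(),
            text: lines[s.start_line - 1..s.end_line].join("\n"),
        })
        .collect()
}
```

Keeping line ranges on both `Symbol` and `Chunk` is what lets later steps cite exact locations without re-parsing the file.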
## Documentation
- **[IMPLEMENTATION_SUMMARY.md](IMPLEMENTATION_SUMMARY.md)** - Quick overview of what's implemented
- **[README_STEPS_0_3.md](README_STEPS_0_3.md)** - Detailed documentation with examples
## Dependencies
```toml
blake3 = "1.8.2" # Fast hashing
ignore = "0.4" # Gitignore support
tree-sitter = "0.24" # Language parsing
serde_json = "1.0" # JSON parsing
anyhow = "1.0" # Error handling
```
## Testing
All 6 tests pass, covering:
- Pattern matching for ignore rules
- Secret redaction
- Import parsing (Python, Rust)
- Markdown and code chunking
## Next Steps (Steps 4-7)
- **Step 4**: BM25 keyword indexing with Tantivy
- **Step 5**: Vector embeddings with ONNX
- **Step 6**: Symbol graph building
- **Step 7**: Wiki page synthesis
## Design Philosophy
1. **Fast**: BLAKE3 hashing, tree-sitter parsing, incremental updates
2. **Local-first**: No cloud dependencies, runs offline
3. **Language-agnostic**: Tree-sitter supports 40+ languages (Python, Rust, and TypeScript/JavaScript are wired up so far)
4. **Precise**: Citations to exact file:line-line ranges
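
The "incremental updates" idea in point 1 comes down to comparing fingerprints between runs and reprocessing only what changed. The sketch below shows that comparison. To stay dependency-free it swaps in the standard library's `DefaultHasher` for BLAKE3, which is not what the project uses; `fingerprint` and `changed_files` are hypothetical names.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Stand-in fingerprint built on the standard library's hasher.
/// The project uses BLAKE3; this substitute is not guaranteed stable across
/// Rust releases, so its output should not be persisted between builds.
fn fingerprint(bytes: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    bytes.hash(&mut hasher);
    hasher.finish()
}

/// Returns the paths that are new or whose content changed since the previous run.
/// `previous` maps each path to the fingerprint recorded last time.
pub fn changed_files<'a>(
    previous: &HashMap<String, u64>,
    current: &'a [(String, Vec<u8>)],
) -> Vec<&'a str> {
    current
        .iter()
        .filter(|(path, bytes)| previous.get(path) != Some(&fingerprint(bytes)))
        .map(|(path, _)| path.as_str())
        .collect()
}
```

Only the returned paths need to go back through parsing and chunking; unchanged files can reuse their earlier results.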
## Performance
- Discovery: ~50ms for 273 files
- Parsing: ~20ms for 5 files
- Chunking: <1ms per document
## Example Use Cases
Once complete, DeepWiki will answer:
- "How do I run this project?" → README.md:19-28
- "Where is create_order defined?" → api/orders.py:12-27
- "What calls this function?" → Graph analysis
- "Generate a flow diagram for checkout" → Synthesized from symbols
## License
[Specify your license]
## Contributing
This is an early-stage implementation. Contributions welcome!