6.7 KiB
6.7 KiB
DeepWiki Local - Steps 0-3 Implementation
This document describes the implementation of the first phase of DeepWiki: Discovery, Parsing, and Chunking.
Overview
Steps 0-3 form the foundation of the DeepWiki pipeline, transforming raw files into structured, searchable pieces:
- Step 0: Define core data structures
- Step 1: Discover files with ignore patterns and fingerprinting
- Step 2: Parse files to extract symbols, imports, and metadata
- Step 3: Chunk documents into searchable pieces
What's Implemented
Core Modules
src/types.rs - Data Structures (Step 0)
Defines all core types:
FileRecord: Represents a discovered file with path, size, mtime, and fingerprintDocument: Parsed file with normalized content, type detection, symbols, imports, and factsDocumentType: Enum for file types (Markdown, Python, TypeScript, Rust, JSON, etc.)Symbol: Code symbols (functions, classes, structs) with line rangesImport: Import statements with module and imported itemsFact: Extracted metadata (scripts, ports, dependencies)Chunk: Searchable text segments with line ranges and optional headings
src/discover.rs - File Discovery (Step 1)
Features:
- Walks directory trees using the
ignorecrate (respects.gitignore) - Smart ignore patterns:
.git/**,node_modules/**,target/**,dist/**,build/**- Lock files:
**/*.lock,*-lock.json - IDE folders:
.vscode/**,.idea/** - Python cache:
__pycache__/**,*.pyc
- Size filtering: skips files > 2MB
- Content fingerprinting using BLAKE3 hash (first 16 chars)
- Cross-platform path handling (Windows and Unix)
Output:
Found: 270 files, skipped: 20
src/parser.rs - Document Parsing (Step 2)
Features:
- UTF-8 decoding and newline normalization (
\r\n→\n) - Secret redaction for:
- OpenAI keys (
sk-...) - GitHub tokens (
ghp_...) - AWS credentials (
AKIA..., secret keys)
- OpenAI keys (
- Tree-sitter based parsing for:
- Python: Functions, classes, imports (
import,from...import) - Rust: Functions, structs, use declarations
- TypeScript/JavaScript: Functions, classes, ES6 imports
- Python: Functions, classes, imports (
- JSON parsing for
package.json:- Extracts npm scripts
- Extracts dependencies
Symbol Extraction Examples:
Python:
def create_order(user_id): # Symbol: Function "create_order" lines 5-10
pass
class OrderService: # Symbol: Class "OrderService" lines 12-30
pass
TypeScript:
function OrdersPage() { // Symbol: Function "OrdersPage" lines 1-50
return <div>...</div>;
}
src/chunker.rs - Document Chunking (Step 3)
Features:
- Code chunking: One chunk per symbol (function/class)
- Markdown chunking: One chunk per heading section
- Generic chunking: 100-line chunks with 2-line overlap
- Chunks include:
- Start/end line numbers
- Full text content
- Optional heading/symbol name
Chunking Strategy:
| File Type | Strategy | Example |
|---|---|---|
| Python/TS/Rust | Per symbol | Each function = 1 chunk |
| Markdown | Per section | Each # Heading = 1 chunk |
| JSON/YAML/Other | Fixed size | 100 lines with overlap |
Output:
Created 6 chunks from README.md
Chunk 1: lines 1-4 (21 chars) - heading: "Overview"
Chunk 2: lines 5-6 (25 chars) - heading: "Installation"
Running the Code
Build and Run
cargo build
cargo run
Run Tests
cargo test
Test Coverage:
- ✅ Ignore pattern matching (directory and file patterns)
- ✅ Secret redaction (API keys, tokens)
- ✅ Import parsing (Python, Rust, TypeScript)
- ✅ Markdown chunking (by heading)
- ✅ Code chunking (by symbol)
Example Output
=== DeepWiki Local - Steps 0-3 ===
Step 1: Discovery
Scanning directory: .
Discovery complete: 270 files found, 20 skipped
Found 270 files
Step 2: Parsing
Parsed: .\.github\instructions\rust-guide.instructions.md (0 symbols)
Parsed: .\Cargo.toml (0 symbols)
Parsed: .\src\main.rs (1 symbols)
Parsed: .\src\discover.rs (3 symbols)
Parsed: .\src\parser.rs (15 symbols)
Step 3: Chunking
Created 6 chunks from README.md
Chunk 1: lines 1-4
Chunk 2: lines 5-12
Chunk 3: lines 13-25
Data Flow
1. Discovery
Input: Root directory "."
Output: Vec<FileRecord> with paths and fingerprints
2. Parsing
Input: FileRecord
Process: Read → Normalize → Redact → Extract symbols/imports
Output: Document with structured data
3. Chunking
Input: Document
Process: Split by symbol/heading/fixed-size
Output: Vec<Chunk> ready for indexing
File Structure
src/
├── main.rs # Orchestrates steps 1-3
├── types.rs # Core data structures
├── discover.rs # File discovery with ignore patterns
├── parser.rs # Tree-sitter parsing + symbol extraction
└── chunker.rs # Document chunking strategies
Dependencies
[dependencies]
blake3 = "1.8.2" # Fast hashing for fingerprints
ignore = "0.4" # Gitignore-aware directory walking
tree-sitter = "0.24" # Language parsing
tree-sitter-python = "0.23"
tree-sitter-rust = "0.23"
tree-sitter-typescript = "0.23"
tree-sitter-javascript = "0.23"
serde_json = "1.0" # JSON parsing
regex = "1.10" # Pattern matching
anyhow = "1.0" # Error handling
[dev-dependencies]
pretty_assertions = "1.4" # Better test diffs
Next Steps (Steps 4-7)
The foundation is ready for:
- Step 4: BM25 keyword indexing (Tantivy)
- Step 5: Vector embeddings (ONNX + all-MiniLM-L6-v2)
- Step 6: Symbol graph building
- Step 7: Wiki page synthesis
Design Decisions
Why Tree-sitter?
- Language-agnostic parsing
- Fast and incremental
- Robust to syntax errors
- Used by GitHub, Atom, Neovim
Why BLAKE3?
- Faster than SHA256
- 16-char prefix provides enough uniqueness for fingerprinting
Why Chunks?
- Search engines need bounded text pieces
- LLMs have token limits
- Enables precise citations (file:line-line)
Testing Philosophy
All tests follow project guidelines:
- Use
pretty_assertions::assert_eqfor better diffs - Tests run after every change
- No approval needed for
cargo fmt
Performance Notes
- Discovers 270 files in ~50ms
- Parses 5 files in ~20ms
- Tree-sitter parsing is lazy (only on changed files)
- Fingerprints enable incremental updates
Limitations & Future Work
Current:
- Basic symbol extraction (no cross-file resolution)
- Simple import parsing (no alias handling)
- No docstring extraction yet
Planned:
- LSP-level symbol resolution
- Signature extraction for autocomplete
- Docstring parsing for better context
- Graph edge creation (who calls what)