# DeepWiki Local - Steps 0-3 Implementation

This document describes the implementation of the first phase of DeepWiki: **Discovery, Parsing, and Chunking**.

## Overview

Steps 0-3 form the foundation of the DeepWiki pipeline, transforming raw files into structured, searchable pieces:

1. **Step 0**: Define core data structures
2. **Step 1**: Discover files with ignore patterns and fingerprinting
3. **Step 2**: Parse files to extract symbols, imports, and metadata
4. **Step 3**: Chunk documents into searchable pieces

## What's Implemented

### Core Modules

#### `src/types.rs` - Data Structures (Step 0)

Defines all core types:

- **`FileRecord`**: Represents a discovered file with path, size, mtime, and fingerprint
- **`Document`**: Parsed file with normalized content, type detection, symbols, imports, and facts
- **`DocumentType`**: Enum for file types (Markdown, Python, TypeScript, Rust, JSON, etc.)
- **`Symbol`**: Code symbols (functions, classes, structs) with line ranges
- **`Import`**: Import statements with module and imported items
- **`Fact`**: Extracted metadata (scripts, ports, dependencies)
- **`Chunk`**: Searchable text segments with line ranges and optional headings

#### `src/discover.rs` - File Discovery (Step 1)

**Features:**

- Walks directory trees using the `ignore` crate (respects `.gitignore`)
- Smart ignore patterns:
  - `.git/**`, `node_modules/**`, `target/**`, `dist/**`, `build/**`
  - Lock files: `**/*.lock`, `*-lock.json`
  - IDE folders: `.vscode/**`, `.idea/**`
  - Python cache: `__pycache__/**`, `*.pyc`
- Size filtering: skips files > 2MB
- Content fingerprinting using BLAKE3 hash (first 16 chars)
- Cross-platform path handling (Windows and Unix)

**Output:**

```
Found: 270 files, skipped: 20
```
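The discovery walk can be pictured roughly as follows. This is a minimal sketch, not the actual `src/discover.rs`: it assumes a simplified `FileRecord` (path, size, fingerprint only), a hypothetical `discover` function, and relies on the `ignore` crate's default `.gitignore` handling rather than the full custom pattern list described above.

```rust
use ignore::WalkBuilder;

/// Simplified stand-in for the real `FileRecord` in `src/types.rs`
/// (the actual type also carries mtime).
struct FileRecord {
    path: String,
    size: u64,
    fingerprint: String,
}

const MAX_SIZE: u64 = 2 * 1024 * 1024; // skip files larger than 2 MB

fn discover(root: &str) -> Vec<FileRecord> {
    let mut records = Vec::new();
    // WalkBuilder honors .gitignore by default; the real implementation
    // layers the extra patterns (node_modules/**, lock files, ...) on top.
    for entry in WalkBuilder::new(root).build().flatten() {
        let path = entry.path();
        if !path.is_file() {
            continue;
        }
        let Ok(meta) = path.metadata() else { continue };
        if meta.len() > MAX_SIZE {
            continue; // size filter
        }
        let Ok(bytes) = std::fs::read(path) else { continue };
        // BLAKE3 hash of the content, truncated to the first 16 hex chars.
        let hex = blake3::hash(&bytes).to_hex().to_string();
        records.push(FileRecord {
            path: path.display().to_string(),
            size: meta.len(),
            fingerprint: hex[..16].to_string(),
        });
    }
    records
}
```

Truncating the hex digest to 16 characters matches the fingerprint scheme noted under Design Decisions below.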
#### `src/parser.rs` - Document Parsing (Step 2)

**Features:**

- UTF-8 decoding and newline normalization (`\r\n` → `\n`)
- **Secret redaction** for:
  - OpenAI keys (`sk-...`)
  - GitHub tokens (`ghp_...`)
  - AWS credentials (`AKIA...`, secret keys)
- **Tree-sitter** based parsing for:
  - **Python**: Functions, classes, imports (`import`, `from...import`)
  - **Rust**: Functions, structs, use declarations
  - **TypeScript/JavaScript**: Functions, classes, ES6 imports
- **JSON parsing** for `package.json`:
  - Extracts npm scripts
  - Extracts dependencies

**Symbol Extraction Examples:**

Python:

```python
def create_order(user_id):   # Symbol: Function "create_order" lines 5-10
    pass

class OrderService:          # Symbol: Class "OrderService" lines 12-30
    pass
```

TypeScript:

```typescript
function OrdersPage() {      // Symbol: Function "OrdersPage" lines 1-50
    return ...;
}
```
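Normalization and redaction are simple string passes that run before Tree-sitter. The sketch below is illustrative only: the exact regexes in `src/parser.rs` are not shown here, so the patterns and the `[REDACTED]` placeholder are assumptions that merely match the key formats listed above.

```rust
use regex::Regex;

/// Normalize Windows line endings so line numbers are stable across platforms.
fn normalize(content: &str) -> String {
    content.replace("\r\n", "\n")
}

/// Replace likely secrets before the content is stored or indexed.
fn redact_secrets(content: &str) -> String {
    // Illustrative patterns for the key formats mentioned above; the real
    // module may use different (and more) regexes.
    let patterns = [
        r"sk-[A-Za-z0-9]{20,}", // OpenAI-style API keys
        r"ghp_[A-Za-z0-9]{36}", // GitHub personal access tokens
        r"AKIA[0-9A-Z]{16}",    // AWS access key IDs
    ];
    let mut redacted = content.to_string();
    for pattern in patterns {
        // The patterns are static, so compilation cannot fail here.
        let re = Regex::new(pattern).unwrap();
        redacted = re.replace_all(&redacted, "[REDACTED]").into_owned();
    }
    redacted
}
```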
#### `src/chunker.rs` - Document Chunking (Step 3)

**Features:**

- **Code chunking**: One chunk per symbol (function/class)
- **Markdown chunking**: One chunk per heading section
- **Generic chunking**: 100-line chunks with 2-line overlap
- Chunks include:
  - Start/end line numbers
  - Full text content
  - Optional heading/symbol name

**Chunking Strategy:**

| File Type | Strategy | Example |
|-----------|----------|---------|
| Python/TS/Rust | Per symbol | Each function = 1 chunk |
| Markdown | Per section | Each `# Heading` = 1 chunk |
| JSON/YAML/Other | Fixed size | 100 lines with overlap |

**Output:**

```
Created 6 chunks from README.md
Chunk 1: lines 1-4 (21 chars) - heading: "Overview"
Chunk 2: lines 5-6 (25 chars) - heading: "Installation"
```
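As a rough picture of the generic (fixed-size) strategy, the sketch below cuts 100-line chunks with a 2-line overlap. The `Chunk` shape is simplified (the real type in `src/types.rs` also carries an optional heading), and `chunk_generic` is a hypothetical name, not necessarily the function in `src/chunker.rs`.

```rust
/// Simplified stand-in for the real `Chunk` in `src/types.rs`
/// (the actual type also carries an optional heading/symbol name).
struct Chunk {
    start_line: usize,
    end_line: usize,
    text: String,
}

const CHUNK_LINES: usize = 100; // fixed chunk size in lines
const OVERLAP: usize = 2;       // lines shared between neighbouring chunks

/// Fixed-size chunking for files with no symbols or headings to split on.
fn chunk_generic(content: &str) -> Vec<Chunk> {
    let lines: Vec<&str> = content.lines().collect();
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < lines.len() {
        let end = (start + CHUNK_LINES).min(lines.len());
        chunks.push(Chunk {
            start_line: start + 1, // 1-based, matching the output shown above
            end_line: end,
            text: lines[start..end].join("\n"),
        });
        if end == lines.len() {
            break;
        }
        start = end - OVERLAP; // step back so chunks overlap by two lines
    }
    chunks
}
```

The overlap keeps a couple of lines of shared context on each side of a cut, so a search hit near a chunk boundary still reads coherently.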
## Running the Code

### Build and Run

```bash
cargo build
cargo run
```

### Run Tests

```bash
cargo test
```

**Test Coverage:**

- ✅ Ignore pattern matching (directory and file patterns)
- ✅ Secret redaction (API keys, tokens)
- ✅ Import parsing (Python, Rust, TypeScript)
- ✅ Markdown chunking (by heading)
- ✅ Code chunking (by symbol)

## Example Output

```
=== DeepWiki Local - Steps 0-3 ===

Step 1: Discovery
Scanning directory: .
Discovery complete: 270 files found, 20 skipped
Found 270 files

Step 2: Parsing
Parsed: .\.github\instructions\rust-guide.instructions.md (0 symbols)
Parsed: .\Cargo.toml (0 symbols)
Parsed: .\src\main.rs (1 symbols)
Parsed: .\src\discover.rs (3 symbols)
Parsed: .\src\parser.rs (15 symbols)

Step 3: Chunking
Created 6 chunks from README.md
Chunk 1: lines 1-4
Chunk 2: lines 5-12
Chunk 3: lines 13-25
```

## Data Flow

```
1. Discovery
   Input:   Root directory "."
   Output:  Vec<FileRecord> with paths and fingerprints

2. Parsing
   Input:   FileRecord
   Process: Read → Normalize → Redact → Extract symbols/imports
   Output:  Document with structured data

3. Chunking
   Input:   Document
   Process: Split by symbol/heading/fixed-size
   Output:  Vec<Chunk> ready for indexing
```

## File Structure

```
src/
├── main.rs       # Orchestrates steps 1-3
├── types.rs      # Core data structures
├── discover.rs   # File discovery with ignore patterns
├── parser.rs     # Tree-sitter parsing + symbol extraction
└── chunker.rs    # Document chunking strategies
```

## Dependencies

```toml
[dependencies]
blake3 = "1.8.2"                # Fast hashing for fingerprints
ignore = "0.4"                  # Gitignore-aware directory walking
tree-sitter = "0.24"            # Language parsing
tree-sitter-python = "0.23"
tree-sitter-rust = "0.23"
tree-sitter-typescript = "0.23"
tree-sitter-javascript = "0.23"
serde_json = "1.0"              # JSON parsing
regex = "1.10"                  # Pattern matching
anyhow = "1.0"                  # Error handling

[dev-dependencies]
pretty_assertions = "1.4"       # Better test diffs
```

## Next Steps (Steps 4-7)

The foundation is ready for:

- **Step 4**: BM25 keyword indexing (Tantivy)
- **Step 5**: Vector embeddings (ONNX + all-MiniLM-L6-v2)
- **Step 6**: Symbol graph building
- **Step 7**: Wiki page synthesis

## Design Decisions

### Why Tree-sitter?

- Language-agnostic parsing
- Fast and incremental
- Robust to syntax errors
- Used by GitHub, Atom, Neovim

### Why BLAKE3?

- Faster than SHA256
- 16-char prefix provides enough uniqueness for fingerprinting

### Why Chunks?

- Search engines need bounded text pieces
- LLMs have token limits
- Enables precise citations (file:line-line)

## Testing Philosophy

All tests follow project guidelines:

- Use `pretty_assertions::assert_eq` for better diffs (an example test is sketched at the end of this document)
- Tests run after every change
- No approval needed for `cargo fmt`

## Performance Notes

- Discovers 270 files in ~50ms
- Parses 5 files in ~20ms
- Tree-sitter parsing is lazy (only on changed files)
- Fingerprints enable incremental updates

## Limitations & Future Work

**Current:**

- Basic symbol extraction (no cross-file resolution)
- Simple import parsing (no alias handling)
- No docstring extraction yet

**Planned:**

- LSP-level symbol resolution
- Signature extraction for autocomplete
- Docstring parsing for better context
- Graph edge creation (who calls what)
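For reference, the kind of test the Testing Philosophy section refers to might look like the sketch below. The test name and assertion are illustrative, not taken from the actual test suite; only the use of `pretty_assertions::assert_eq` reflects the project guideline.

```rust
#[cfg(test)]
mod tests {
    // Drop-in replacement for std's assert_eq! with colored diffs on failure.
    use pretty_assertions::assert_eq;

    #[test]
    fn normalizes_windows_newlines() {
        // Hypothetical example: newline normalization as described in Step 2.
        let input = "# Title\r\nBody text\r\n";
        let normalized = input.replace("\r\n", "\n");
        assert_eq!(normalized, "# Title\nBody text\n");
    }
}
```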