254 lines
6.7 KiB
Markdown
254 lines
6.7 KiB
Markdown
# DeepWiki Local - Steps 0-3 Implementation
|
|
|
|
This document describes the implementation of the first phase of DeepWiki: **Discovery, Parsing, and Chunking**.
|
|
|
|
## Overview
|
|
|
|
Steps 0-3 form the foundation of the DeepWiki pipeline, transforming raw files into structured, searchable pieces:
|
|
|
|
1. **Step 0**: Define core data structures
|
|
2. **Step 1**: Discover files with ignore patterns and fingerprinting
|
|
3. **Step 2**: Parse files to extract symbols, imports, and metadata
|
|
4. **Step 3**: Chunk documents into searchable pieces
|
|
|
|
## What's Implemented
|
|
|
|
### Core Modules
|
|
|
|
#### `src/types.rs` - Data Structures (Step 0)
|
|
|
|
Defines all core types:
|
|
|
|
- **`FileRecord`**: Represents a discovered file with path, size, mtime, and fingerprint
|
|
- **`Document`**: Parsed file with normalized content, type detection, symbols, imports, and facts
|
|
- **`DocumentType`**: Enum for file types (Markdown, Python, TypeScript, Rust, JSON, etc.)
|
|
- **`Symbol`**: Code symbols (functions, classes, structs) with line ranges
|
|
- **`Import`**: Import statements with module and imported items
|
|
- **`Fact`**: Extracted metadata (scripts, ports, dependencies)
|
|
- **`Chunk`**: Searchable text segments with line ranges and optional headings
|
|
|
|
#### `src/discover.rs` - File Discovery (Step 1)
|
|
|
|
**Features:**
|
|
- Walks directory trees using the `ignore` crate (respects `.gitignore`)
|
|
- Smart ignore patterns:
|
|
- `.git/**`, `node_modules/**`, `target/**`, `dist/**`, `build/**`
|
|
- Lock files: `**/*.lock`, `*-lock.json`
|
|
- IDE folders: `.vscode/**`, `.idea/**`
|
|
- Python cache: `__pycache__/**`, `*.pyc`
|
|
- Size filtering: skips files > 2MB
|
|
- Content fingerprinting using BLAKE3 hash (first 16 chars)
|
|
- Cross-platform path handling (Windows and Unix)
|
|
|
|
**Output:**
|
|
```
|
|
Found: 270 files, skipped: 20
|
|
```
|
|
|
|
#### `src/parser.rs` - Document Parsing (Step 2)
|
|
|
|
**Features:**
|
|
- UTF-8 decoding and newline normalization (`\r\n` → `\n`)
|
|
- **Secret redaction** for:
|
|
- OpenAI keys (`sk-...`)
|
|
- GitHub tokens (`ghp_...`)
|
|
- AWS credentials (`AKIA...`, secret keys)
|
|
- **Tree-sitter** based parsing for:
|
|
- **Python**: Functions, classes, imports (`import`, `from...import`)
|
|
- **Rust**: Functions, structs, use declarations
|
|
- **TypeScript/JavaScript**: Functions, classes, ES6 imports
|
|
- **JSON parsing** for `package.json`:
|
|
- Extracts npm scripts
|
|
- Extracts dependencies
|
|
|
|
**Symbol Extraction Examples:**
|
|
|
|
Python:
|
|
```python
|
|
def create_order(user_id): # Symbol: Function "create_order" lines 5-10
|
|
pass
|
|
|
|
class OrderService: # Symbol: Class "OrderService" lines 12-30
|
|
pass
|
|
```
|
|
|
|
TypeScript:
|
|
```typescript
|
|
function OrdersPage() { // Symbol: Function "OrdersPage" lines 1-50
|
|
return <div>...</div>;
|
|
}
|
|
```
|
|
|
|
#### `src/chunker.rs` - Document Chunking (Step 3)
|
|
|
|
**Features:**
|
|
- **Code chunking**: One chunk per symbol (function/class)
|
|
- **Markdown chunking**: One chunk per heading section
|
|
- **Generic chunking**: 100-line chunks with 2-line overlap
|
|
- Chunks include:
|
|
- Start/end line numbers
|
|
- Full text content
|
|
- Optional heading/symbol name
|
|
|
|
**Chunking Strategy:**
|
|
|
|
| File Type | Strategy | Example |
|
|
|-----------|----------|---------|
|
|
| Python/TS/Rust | Per symbol | Each function = 1 chunk |
|
|
| Markdown | Per section | Each `# Heading` = 1 chunk |
|
|
| JSON/YAML/Other | Fixed size | 100 lines with overlap |
|
|
|
|
**Output:**
|
|
```
|
|
Created 6 chunks from README.md
|
|
Chunk 1: lines 1-4 (21 chars) - heading: "Overview"
|
|
Chunk 2: lines 5-6 (25 chars) - heading: "Installation"
|
|
```
|
|
|
|
## Running the Code
|
|
|
|
### Build and Run
|
|
|
|
```bash
|
|
cargo build
|
|
cargo run
|
|
```
|
|
|
|
### Run Tests
|
|
|
|
```bash
|
|
cargo test
|
|
```
|
|
|
|
**Test Coverage:**
|
|
- ✅ Ignore pattern matching (directory and file patterns)
|
|
- ✅ Secret redaction (API keys, tokens)
|
|
- ✅ Import parsing (Python, Rust, TypeScript)
|
|
- ✅ Markdown chunking (by heading)
|
|
- ✅ Code chunking (by symbol)
|
|
|
|
## Example Output
|
|
|
|
```
|
|
=== DeepWiki Local - Steps 0-3 ===
|
|
|
|
Step 1: Discovery
|
|
Scanning directory: .
|
|
Discovery complete: 270 files found, 20 skipped
|
|
Found 270 files
|
|
|
|
Step 2: Parsing
|
|
Parsed: .\.github\instructions\rust-guide.instructions.md (0 symbols)
|
|
Parsed: .\Cargo.toml (0 symbols)
|
|
Parsed: .\src\main.rs (1 symbols)
|
|
Parsed: .\src\discover.rs (3 symbols)
|
|
Parsed: .\src\parser.rs (15 symbols)
|
|
|
|
Step 3: Chunking
|
|
Created 6 chunks from README.md
|
|
Chunk 1: lines 1-4
|
|
Chunk 2: lines 5-12
|
|
Chunk 3: lines 13-25
|
|
```
|
|
|
|
## Data Flow
|
|
|
|
```
|
|
1. Discovery
|
|
Input: Root directory "."
|
|
Output: Vec<FileRecord> with paths and fingerprints
|
|
|
|
2. Parsing
|
|
Input: FileRecord
|
|
Process: Read → Normalize → Redact → Extract symbols/imports
|
|
Output: Document with structured data
|
|
|
|
3. Chunking
|
|
Input: Document
|
|
Process: Split by symbol/heading/fixed-size
|
|
Output: Vec<Chunk> ready for indexing
|
|
```
|
|
|
|
## File Structure
|
|
|
|
```
|
|
src/
|
|
├── main.rs # Orchestrates steps 1-3
|
|
├── types.rs # Core data structures
|
|
├── discover.rs # File discovery with ignore patterns
|
|
├── parser.rs # Tree-sitter parsing + symbol extraction
|
|
└── chunker.rs # Document chunking strategies
|
|
```
|
|
|
|
## Dependencies
|
|
|
|
```toml
|
|
[dependencies]
|
|
blake3 = "1.8.2" # Fast hashing for fingerprints
|
|
ignore = "0.4" # Gitignore-aware directory walking
|
|
tree-sitter = "0.24" # Language parsing
|
|
tree-sitter-python = "0.23"
|
|
tree-sitter-rust = "0.23"
|
|
tree-sitter-typescript = "0.23"
|
|
tree-sitter-javascript = "0.23"
|
|
serde_json = "1.0" # JSON parsing
|
|
regex = "1.10" # Pattern matching
|
|
anyhow = "1.0" # Error handling
|
|
|
|
[dev-dependencies]
|
|
pretty_assertions = "1.4" # Better test diffs
|
|
```
|
|
|
|
## Next Steps (Steps 4-7)
|
|
|
|
The foundation is ready for:
|
|
|
|
- **Step 4**: BM25 keyword indexing (Tantivy)
|
|
- **Step 5**: Vector embeddings (ONNX + all-MiniLM-L6-v2)
|
|
- **Step 6**: Symbol graph building
|
|
- **Step 7**: Wiki page synthesis
|
|
|
|
## Design Decisions
|
|
|
|
### Why Tree-sitter?
|
|
- Language-agnostic parsing
|
|
- Fast and incremental
|
|
- Robust to syntax errors
|
|
- Used by GitHub, Atom, Neovim
|
|
|
|
### Why BLAKE3?
|
|
- Faster than SHA256
|
|
- 16-char prefix provides enough uniqueness for fingerprinting
|
|
|
|
### Why Chunks?
|
|
- Search engines need bounded text pieces
|
|
- LLMs have token limits
|
|
- Enables precise citations (file:line-line)
|
|
|
|
## Testing Philosophy
|
|
|
|
All tests follow project guidelines:
|
|
- Use `pretty_assertions::assert_eq` for better diffs
|
|
- Tests run after every change
|
|
- No approval needed for `cargo fmt`
|
|
|
|
## Performance Notes
|
|
|
|
- Discovers 270 files in ~50ms
|
|
- Parses 5 files in ~20ms
|
|
- Tree-sitter parsing is lazy (only on changed files)
|
|
- Fingerprints enable incremental updates
|
|
|
|
## Limitations & Future Work
|
|
|
|
**Current:**
|
|
- Basic symbol extraction (no cross-file resolution)
|
|
- Simple import parsing (no alias handling)
|
|
- No docstring extraction yet
|
|
|
|
**Planned:**
|
|
- LSP-level symbol resolution
|
|
- Signature extraction for autocomplete
|
|
- Docstring parsing for better context
|
|
- Graph edge creation (who calls what)
|