temp-deepwiki/README_STEPS_0_3.md

# DeepWiki Local - Steps 0-3 Implementation

This document describes the implementation of the first phase of DeepWiki: **Discovery, Parsing, and Chunking**.

## Overview

Steps 0-3 form the foundation of the DeepWiki pipeline, transforming raw files into structured, searchable pieces:

1. **Step 0**: Define core data structures
2. **Step 1**: Discover files with ignore patterns and fingerprinting
3. **Step 2**: Parse files to extract symbols, imports, and metadata
4. **Step 3**: Chunk documents into searchable pieces

## What's Implemented

### Core Modules

#### `src/types.rs` - Data Structures (Step 0)

Defines all core types:

- **`FileRecord`**: Represents a discovered file with path, size, mtime, and fingerprint
- **`Document`**: Parsed file with normalized content, type detection, symbols, imports, and facts
- **`DocumentType`**: Enum for file types (Markdown, Python, TypeScript, Rust, JSON, etc.)
- **`Symbol`**: Code symbols (functions, classes, structs) with line ranges
- **`Import`**: Import statements with module and imported items
- **`Fact`**: Extracted metadata (scripts, ports, dependencies)
- **`Chunk`**: Searchable text segments with line ranges and optional headings

#### `src/discover.rs` - File Discovery (Step 1)

**Features:**
- Walks directory trees using the `ignore` crate (respects `.gitignore`)
- Smart ignore patterns:
  - `.git/**`, `node_modules/**`, `target/**`, `dist/**`, `build/**`
  - Lock files: `**/*.lock`, `*-lock.json`
  - IDE folders: `.vscode/**`, `.idea/**`
  - Python cache: `__pycache__/**`, `*.pyc`
- Size filtering: skips files > 2MB
- Content fingerprinting using BLAKE3 hash (first 16 chars)
- Cross-platform path handling (Windows and Unix)

**Output:**
```
Found: 270 files, skipped: 20
```

#### `src/parser.rs` - Document Parsing (Step 2)

**Features:**
- UTF-8 decoding and newline normalization (`\r\n` → `\n`)
- **Secret redaction** for:
  - OpenAI keys (`sk-...`)
  - GitHub tokens (`ghp_...`)
  - AWS credentials (`AKIA...`, secret keys)
- **Tree-sitter** based parsing for:
  - **Python**: Functions, classes, imports (`import`, `from...import`)
  - **Rust**: Functions, structs, use declarations
  - **TypeScript/JavaScript**: Functions, classes, ES6 imports
- **JSON parsing** for `package.json`:
  - Extracts npm scripts
  - Extracts dependencies

**Symbol Extraction Examples:**

Python:
```python
def create_order(user_id):  # Symbol: Function "create_order" lines 5-10
    pass

class OrderService:          # Symbol: Class "OrderService" lines 12-30
    pass
```

TypeScript:
```typescript
function OrdersPage() {      // Symbol: Function "OrdersPage" lines 1-50
    return <div>...</div>;
}
```

#### `src/chunker.rs` - Document Chunking (Step 3)

**Features:**
- **Code chunking**: One chunk per symbol (function/class)
- **Markdown chunking**: One chunk per heading section
- **Generic chunking**: 100-line chunks with 2-line overlap
- Chunks include:
  - Start/end line numbers
  - Full text content
  - Optional heading/symbol name

**Chunking Strategy:**

| File Type | Strategy | Example |
|-----------|----------|---------|
| Python/TS/Rust | Per symbol | Each function = 1 chunk |
| Markdown | Per section | Each `# Heading` = 1 chunk |
| JSON/YAML/Other | Fixed size | 100 lines with overlap |

**Output:**
```
Created 6 chunks from README.md
  Chunk 1: lines 1-4 (21 chars) - heading: "Overview"
  Chunk 2: lines 5-6 (25 chars) - heading: "Installation"
```

## Running the Code

### Build and Run

```bash
cargo build
cargo run
```

### Run Tests

```bash
cargo test
```

**Test Coverage:**
- ✅ Ignore pattern matching (directory and file patterns)
- ✅ Secret redaction (API keys, tokens)
- ✅ Import parsing (Python, Rust, TypeScript)
- ✅ Markdown chunking (by heading)
- ✅ Code chunking (by symbol)

## Example Output

```
=== DeepWiki Local - Steps 0-3 ===

Step 1: Discovery
Scanning directory: .
Discovery complete: 270 files found, 20 skipped
Found 270 files

Step 2: Parsing
Parsed: .\.github\instructions\rust-guide.instructions.md (0 symbols)
Parsed: .\Cargo.toml (0 symbols)
Parsed: .\src\main.rs (1 symbols)
Parsed: .\src\discover.rs (3 symbols)
Parsed: .\src\parser.rs (15 symbols)

Step 3: Chunking
Created 6 chunks from README.md
  Chunk 1: lines 1-4
  Chunk 2: lines 5-12
  Chunk 3: lines 13-25
```

## Data Flow

```
1. Discovery
   Input:  Root directory "."
   Output: Vec<FileRecord> with paths and fingerprints

2. Parsing
   Input:  FileRecord
   Process: Read → Normalize → Redact → Extract symbols/imports
   Output: Document with structured data

3. Chunking
   Input:  Document
   Process: Split by symbol/heading/fixed-size
   Output: Vec<Chunk> ready for indexing
```

## File Structure

```
src/
├── main.rs          # Orchestrates steps 1-3
├── types.rs         # Core data structures
├── discover.rs      # File discovery with ignore patterns
├── parser.rs        # Tree-sitter parsing + symbol extraction
└── chunker.rs       # Document chunking strategies
```

## Dependencies

```toml
[dependencies]
blake3 = "1.8.2"              # Fast hashing for fingerprints
ignore = "0.4"                # Gitignore-aware directory walking
tree-sitter = "0.24"          # Language parsing
tree-sitter-python = "0.23"
tree-sitter-rust = "0.23"
tree-sitter-typescript = "0.23"
tree-sitter-javascript = "0.23"
serde_json = "1.0"            # JSON parsing
regex = "1.10"                # Pattern matching
anyhow = "1.0"                # Error handling

[dev-dependencies]
pretty_assertions = "1.4"     # Better test diffs
```

## Next Steps (Steps 4-7)

The foundation is ready for:

- **Step 4**: BM25 keyword indexing (Tantivy)
- **Step 5**: Vector embeddings (ONNX + all-MiniLM-L6-v2)
- **Step 6**: Symbol graph building
- **Step 7**: Wiki page synthesis

## Design Decisions

### Why Tree-sitter?
- Language-agnostic parsing
- Fast and incremental
- Robust to syntax errors
- Used by GitHub, Atom, Neovim

### Why BLAKE3?
- Faster than SHA256
- 16-char prefix provides enough uniqueness for fingerprinting

### Why Chunks?
- Search engines need bounded text pieces
- LLMs have token limits
- Enables precise citations (file:line-line)

## Testing Philosophy

All tests follow project guidelines:
- Use `pretty_assertions::assert_eq` for better diffs
- Tests run after every change
- No approval needed for `cargo fmt`

## Performance Notes

- Discovers 270 files in ~50ms
- Parses 5 files in ~20ms
- Tree-sitter parsing is lazy (only on changed files)
- Fingerprints enable incremental updates

## Limitations & Future Work

**Current:**
- Basic symbol extraction (no cross-file resolution)
- Simple import parsing (no alias handling)
- No docstring extraction yet

**Planned:**
- LSP-level symbol resolution
- Signature extraction for autocomplete
- Docstring parsing for better context
- Graph edge creation (who calls what)