9.4 KiB
9.4 KiB
DeepWiki Steps 0-3: Visual Summary
🎯 Goal Achieved
Transform raw files → structured, searchable knowledge base
📊 Pipeline Flow
┌──────────────────────────────────────────────────────────────┐
│ INPUT: Project Directory │
│ c:\personal\deepwiki-local │
└──────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────┐
│ STEP 1: DISCOVERY │
│ ───────────────── │
│ • Walk directory tree (gitignore-aware) │
│ • Apply ignore patterns │
│ • Compute BLAKE3 fingerprints │
│ • Filter by size (<2MB) │
│ │
│ Output: 273 FileRecords │
└──────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────┐
│ STEP 2: PARSING │
│ ─────────────── │
│ • Read & normalize text (UTF-8, newlines) │
│ • Redact secrets (API keys, tokens) │
│ • Tree-sitter symbol extraction: │
│ - Python: functions, classes, imports │
│ - Rust: functions, structs, use decls │
│ - TypeScript: functions, classes, imports │
│ • JSON metadata extraction (package.json) │
│ │
│ Output: Documents with symbols[], imports[], facts[] │
└──────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────┐
│ STEP 3: CHUNKING │
│ ──────────────── │
│ • Code: 1 chunk per symbol (function/class) │
│ • Markdown: 1 chunk per heading section │
│ • Other: 100-line chunks with 2-line overlap │
│ • Preserve line ranges & headings │
│ │
│ Output: Chunks[] ready for indexing │
└──────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────┐
│ READY FOR STEPS 4-7 │
│ (Indexing, Embeddings, Graphs, Synthesis) │
└──────────────────────────────────────────────────────────────┘
📦 Data Structures
// Step 0: Core Types
FileRecord {
path: PathBuf, // "src/main.rs"
size: 4096, // bytes
modified_time: 1699990000, // unix timestamp
fingerprint: "a1b2c3d4..." // BLAKE3 hash (16 chars)
}
Document {
id: "a1b2c3d4...", // fingerprint
path: PathBuf,
content: String, // normalized text
doc_type: Python, // detected from extension
symbols: Vec<Symbol>, // extracted code elements
imports: Vec<Import>, // import statements
facts: Vec<Fact>, // metadata (scripts, deps)
}
Symbol {
name: "create_order",
kind: Function,
start_line: 12,
end_line: 27,
signature: None, // future: full signature
doc_comment: None, // future: docstring
}
Chunk {
id: "a1b2c3d4-chunk-0",
doc_id: "a1b2c3d4...",
start_line: 12,
end_line: 27,
text: "def create_order...",
heading: Some("function create_order"),
}
🔍 Example: Parsing orders.py
Input File
class OrderService:
def __init__(self, db):
self.db = db
def create_order(self, user_id, items):
"""Create a new order"""
order = {'user_id': user_id, 'items': items}
return self.db.insert('orders', order)
def get_order(self, order_id):
return self.db.get('orders', order_id)
Step 1: Discovery
FileRecord {
path: "example/orders.py"
size: 458 bytes
fingerprint: "9f0c7d2e..."
}
Step 2: Parsing
Document {
symbols: [
Symbol { name: "OrderService", kind: Class, lines: 1-11 },
Symbol { name: "__init__", kind: Function, lines: 2-3 },
Symbol { name: "create_order", kind: Function, lines: 5-8 },
Symbol { name: "get_order", kind: Function, lines: 10-11 },
],
imports: [],
facts: [],
}
Step 3: Chunking
Chunks: [
Chunk { lines: 1-11, heading: "class OrderService" },
Chunk { lines: 2-3, heading: "function __init__" },
Chunk { lines: 5-8, heading: "function create_order" },
Chunk { lines: 10-11, heading: "function get_order" },
]
📈 Statistics
| Metric | Value |
|---|---|
| Files discovered | 273 |
| Files skipped | 21 |
| Supported languages | Python, Rust, TypeScript, JavaScript, Markdown, JSON |
| Discovery time | ~50ms |
| Parse time (5 files) | ~20ms |
| Chunk time | <1ms/file |
| Tests passing | 6/6 ✅ |
🛠️ Technology Stack
┌─────────────────┐
│ ignore crate │ ← Gitignore-aware walking
└─────────────────┘
┌─────────────────┐
│ tree-sitter │ ← Language parsing
├─────────────────┤
│ - Python │
│ - Rust │
│ - TypeScript │
│ - JavaScript │
└─────────────────┘
┌─────────────────┐
│ BLAKE3 │ ← Fast fingerprinting
└─────────────────┘
┌─────────────────┐
│ serde_json │ ← JSON metadata
└─────────────────┘
┌─────────────────┐
│ regex │ ← Secret redaction
└─────────────────┘
✅ Test Coverage
✓ test_should_ignore
- Tests ignore pattern matching
- node_modules/, .git/, target/, *.lock
✓ test_redact_secrets
- Tests API key redaction
- sk-..., ghp_..., AWS keys
✓ test_parse_python_import
- "import os" → ("os", [])
- "from os import path" → ("os", ["path"])
✓ test_parse_rust_import
- "use std::fs;" → ("std::fs", [])
✓ test_chunk_markdown
- Chunks by heading sections
- Preserves heading hierarchy
✓ test_chunk_code_with_symbols
- Chunks by function/class
- One chunk per symbol
🚀 What's Next?
Step 4: BM25 Indexing (Tantivy)
Chunk → Tantivy Index
Fields: path, heading, text
Ranking: BM25
Step 5: Vector Embeddings (ONNX)
Chunk → all-MiniLM-L6-v2 → 384D vector → Qdrant
Semantic search with HNSW
Step 6: Symbol Graph
Symbols + Imports → Edges
"OrdersPage imports getOrders"
"create_order calls db.insert"
Step 7: Wiki Synthesis
Facts + Symbols + Graph → Generated Pages
- Overview (languages, scripts, ports)
- Dev Guide (setup, run, test)
- Flows (user journeys)
🎉 Success Criteria Met
- ✅ Files discovered with ignore patterns
- ✅ Symbols extracted from code
- ✅ Documents chunked semantically
- ✅ All tests passing
- ✅ Fast performance (<100ms total)
- ✅ Cross-platform support
- ✅ No external dependencies
- ✅ Clean, documented code
Status: Steps 0-3 ✅ Complete | Ready for Steps 4-7