
DeepWiki Local - Steps 0-3 Implementation

This document describes the implementation of the first phase of DeepWiki: Discovery, Parsing, and Chunking.

Overview

Steps 0-3 form the foundation of the DeepWiki pipeline, transforming raw files into structured, searchable pieces:

  1. Step 0: Define core data structures
  2. Step 1: Discover files with ignore patterns and fingerprinting
  3. Step 2: Parse files to extract symbols, imports, and metadata
  4. Step 3: Chunk documents into searchable pieces

What's Implemented

Core Modules

src/types.rs - Data Structures (Step 0)

Defines all core types (a rough sketch follows the list):

  • FileRecord: Represents a discovered file with path, size, mtime, and fingerprint
  • Document: Parsed file with normalized content, type detection, symbols, imports, and facts
  • DocumentType: Enum for file types (Markdown, Python, TypeScript, Rust, JSON, etc.)
  • Symbol: Code symbols (functions, classes, structs) with line ranges
  • Import: Import statements with module and imported items
  • Fact: Extracted metadata (scripts, ports, dependencies)
  • Chunk: Searchable text segments with line ranges and optional headings
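
A rough sketch of how a few of these types might be declared (field names are illustrative; see src/types.rs for the actual definitions):

```rust
/// Illustrative shapes only; the real definitions live in src/types.rs.
pub struct FileRecord {
    pub path: std::path::PathBuf,
    pub size: u64,
    pub mtime: std::time::SystemTime,
    pub fingerprint: String, // first 16 hex chars of the BLAKE3 hash
}

pub struct Symbol {
    pub name: String,
    pub kind: String, // "function", "class", "struct", ...
    pub start_line: usize,
    pub end_line: usize,
}

pub struct Chunk {
    pub start_line: usize,
    pub end_line: usize,
    pub text: String,
    pub heading: Option<String>, // heading or symbol name, if any
}
```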

src/discover.rs - File Discovery (Step 1)

Features:

  • Walks directory trees using the ignore crate (respects .gitignore)
  • Smart ignore patterns:
    • .git/**, node_modules/**, target/**, dist/**, build/**
    • Lock files: **/*.lock, *-lock.json
    • IDE folders: .vscode/**, .idea/**
    • Python cache: __pycache__/**, *.pyc
  • Size filtering: skips files > 2MB
  • Content fingerprinting using BLAKE3 hash (first 16 chars)
  • Cross-platform path handling (Windows and Unix)

Output:

Found: 270 files, skipped: 20
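
A condensed sketch of the walk-and-filter loop, assuming only the ignore crate listed under Dependencies (the custom ignore patterns and FileRecord construction are elided; fingerprinting is shown under "Why BLAKE3?" below):

```rust
use ignore::WalkBuilder;

const MAX_SIZE: u64 = 2 * 1024 * 1024; // skip files larger than 2 MB

fn discover(root: &str) -> Vec<std::path::PathBuf> {
    let mut found = Vec::new();
    // WalkBuilder respects .gitignore by default; the extra overrides for
    // node_modules/**, target/**, lock files, etc. are omitted here.
    for entry in WalkBuilder::new(root).build().flatten() {
        let Ok(meta) = entry.metadata() else { continue };
        if !meta.is_file() || meta.len() > MAX_SIZE {
            continue; // directories and oversized files are skipped
        }
        found.push(entry.into_path());
    }
    found
}
```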

src/parser.rs - Document Parsing (Step 2)

Features:

  • UTF-8 decoding and newline normalization (\r\n → \n)
  • Secret redaction (sketched after this list) for:
    • OpenAI keys (sk-...)
    • GitHub tokens (ghp_...)
    • AWS credentials (AKIA..., secret keys)
  • Tree-sitter based parsing for:
    • Python: Functions, classes, imports (import, from...import)
    • Rust: Functions, structs, use declarations
    • TypeScript/JavaScript: Functions, classes, ES6 imports
  • JSON parsing for package.json:
    • Extracts npm scripts
    • Extracts dependencies
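
As a rough illustration of the redaction step (the exact patterns in parser.rs may be stricter), secrets can be masked with the regex crate before content is stored:

```rust
use regex::Regex;

/// Replace obvious credential patterns with a placeholder.
/// Patterns are illustrative, not the exact ones used by parser.rs.
fn redact_secrets(text: &str) -> String {
    let patterns = [
        r"sk-[A-Za-z0-9_\-]{20,}", // OpenAI-style API keys
        r"ghp_[A-Za-z0-9]{36}",    // GitHub personal access tokens
        r"AKIA[0-9A-Z]{16}",       // AWS access key IDs
    ];
    let mut out = text.to_owned();
    for pattern in patterns {
        let re = Regex::new(pattern).expect("pattern is valid");
        out = re.replace_all(&out, "[REDACTED]").into_owned();
    }
    out
}
```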

Symbol Extraction Examples:

Python:

def create_order(user_id):  # Symbol: Function "create_order" lines 5-10
    pass

class OrderService:          # Symbol: Class "OrderService" lines 12-30
    pass

TypeScript:

function OrdersPage() {      // Symbol: Function "OrdersPage" lines 1-50
    return <div>...</div>;
}
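
Internally, these symbols come from walking the tree-sitter parse tree. A minimal sketch for top-level Python functions (the real parser.rs also handles classes, other languages, and more node kinds):

```rust
use tree_sitter::Parser;

/// Return (name, start_line, end_line) for top-level Python functions.
fn python_functions(source: &str) -> Vec<(String, usize, usize)> {
    let mut parser = Parser::new();
    parser
        .set_language(&tree_sitter_python::LANGUAGE.into())
        .expect("load Python grammar");
    let tree = parser.parse(source, None).expect("parse source");
    let root = tree.root_node();

    let mut symbols = Vec::new();
    let mut cursor = root.walk();
    for node in root.children(&mut cursor) {
        if node.kind() == "function_definition" {
            if let Some(name) = node.child_by_field_name("name") {
                let name = name.utf8_text(source.as_bytes()).unwrap_or("").to_string();
                // tree-sitter rows are 0-based; convert to 1-based lines
                symbols.push((name, node.start_position().row + 1, node.end_position().row + 1));
            }
        }
    }
    symbols
}
```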

src/chunker.rs - Document Chunking (Step 3)

Features:

  • Code chunking: One chunk per symbol (function/class)
  • Markdown chunking: One chunk per heading section
  • Generic chunking: 100-line chunks with 2-line overlap
  • Chunks include:
    • Start/end line numbers
    • Full text content
    • Optional heading/symbol name

Chunking Strategy:

| File Type | Strategy | Example |
| --- | --- | --- |
| Python/TS/Rust | Per symbol | Each function = 1 chunk |
| Markdown | Per section | Each # Heading = 1 chunk |
| JSON/YAML/Other | Fixed size | 100 lines with overlap |

Output:

Created 6 chunks from README.md
  Chunk 1: lines 1-4 (21 chars) - heading: "Overview"
  Chunk 2: lines 5-6 (25 chars) - heading: "Installation"
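
For files without symbols or headings, the generic strategy falls back to fixed-size windows. A sketch of that path, using the 100-line / 2-line numbers quoted above (returned as simple (start_line, end_line, text) triples rather than the real Chunk type):

```rust
/// Fixed-size chunking: 100-line windows with a 2-line overlap.
/// Line numbers are 1-based and inclusive.
fn chunk_generic(content: &str) -> Vec<(usize, usize, String)> {
    const WINDOW: usize = 100;
    const OVERLAP: usize = 2;

    let lines: Vec<&str> = content.lines().collect();
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < lines.len() {
        let end = (start + WINDOW).min(lines.len());
        chunks.push((start + 1, end, lines[start..end].join("\n")));
        if end == lines.len() {
            break; // final window reached
        }
        start = end - OVERLAP; // step back so consecutive chunks share 2 lines
    }
    chunks
}
```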

Running the Code

Build and Run

cargo build
cargo run

Run Tests

cargo test

Test Coverage:

  • Ignore pattern matching (directory and file patterns)
  • Secret redaction (API keys, tokens)
  • Import parsing (Python, Rust, TypeScript)
  • Markdown chunking (by heading)
  • Code chunking (by symbol)

Example Output

=== DeepWiki Local - Steps 0-3 ===

Step 1: Discovery
Scanning directory: .
Discovery complete: 270 files found, 20 skipped
Found 270 files

Step 2: Parsing
Parsed: .\.github\instructions\rust-guide.instructions.md (0 symbols)
Parsed: .\Cargo.toml (0 symbols)
Parsed: .\src\main.rs (1 symbols)
Parsed: .\src\discover.rs (3 symbols)
Parsed: .\src\parser.rs (15 symbols)

Step 3: Chunking
Created 6 chunks from README.md
  Chunk 1: lines 1-4
  Chunk 2: lines 5-12
  Chunk 3: lines 13-25

Data Flow

1. Discovery
   Input:  Root directory "."
   Output: Vec<FileRecord> with paths and fingerprints

2. Parsing
   Input:  FileRecord
   Process: Read → Normalize → Redact → Extract symbols/imports
   Output: Document with structured data

3. Chunking
   Input:  Document
   Process: Split by symbol/heading/fixed-size
   Output: Vec<Chunk> ready for indexing
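
A hedged sketch of how main.rs might wire the three steps together; module and function names below are illustrative, not the actual API:

```rust
use anyhow::Result;

// Hypothetical signatures standing in for the real functions in
// discover.rs, parser.rs, and chunker.rs.
fn main() -> Result<()> {
    let records = discover::discover_files(".")?; // Step 1: Vec<FileRecord>
    for record in &records {
        let document = parser::parse_file(record)?; // Step 2: Document
        let chunks = chunker::chunk_document(&document); // Step 3: Vec<Chunk>
        println!("{} -> {} chunks", record.path.display(), chunks.len());
    }
    Ok(())
}
```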

File Structure

src/
├── main.rs          # Orchestrates steps 1-3
├── types.rs         # Core data structures
├── discover.rs      # File discovery with ignore patterns
├── parser.rs        # Tree-sitter parsing + symbol extraction
└── chunker.rs       # Document chunking strategies

Dependencies

[dependencies]
blake3 = "1.8.2"              # Fast hashing for fingerprints
ignore = "0.4"                # Gitignore-aware directory walking
tree-sitter = "0.24"          # Language parsing
tree-sitter-python = "0.23"
tree-sitter-rust = "0.23"
tree-sitter-typescript = "0.23"
tree-sitter-javascript = "0.23"
serde_json = "1.0"            # JSON parsing
regex = "1.10"                # Pattern matching
anyhow = "1.0"                # Error handling

[dev-dependencies]
pretty_assertions = "1.4"     # Better test diffs

Next Steps (Steps 4-7)

The foundation is ready for:

  • Step 4: BM25 keyword indexing (Tantivy)
  • Step 5: Vector embeddings (ONNX + all-MiniLM-L6-v2)
  • Step 6: Symbol graph building
  • Step 7: Wiki page synthesis

Design Decisions

Why Tree-sitter?

  • Language-agnostic parsing
  • Fast and incremental
  • Robust to syntax errors
  • Used by GitHub, Atom, Neovim

Why BLAKE3?

  • Faster than SHA256
  • 16-char prefix provides enough uniqueness for fingerprinting
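
With the blake3 crate, a fingerprint helper in this style is only a couple of lines:

```rust
/// Hash the file contents and keep the first 16 hex characters.
fn fingerprint(bytes: &[u8]) -> String {
    let hex = blake3::hash(bytes).to_hex();
    hex.as_str()[..16].to_string()
}
```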

Why Chunks?

  • Search engines need bounded text pieces
  • LLMs have token limits
  • Enables precise citations (file:line-line)

Testing Philosophy

All tests follow project guidelines:

  • Use pretty_assertions::assert_eq for better diffs
  • Tests run after every change
  • No approval needed for cargo fmt
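
For example, a chunker test in this style might look like the following (chunk_markdown and the chunk fields are assumptions standing in for the real API):

```rust
#[cfg(test)]
mod tests {
    use pretty_assertions::assert_eq;

    #[test]
    fn markdown_chunks_split_on_headings() {
        let doc = "# Overview\nIntro text.\n\n# Installation\nRun cargo build.\n";
        // chunk_markdown is a hypothetical helper for the real chunker entry point.
        let chunks = crate::chunker::chunk_markdown(doc);
        assert_eq!(chunks.len(), 2);
        assert_eq!(chunks[0].heading.as_deref(), Some("Overview"));
    }
}
```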

Performance Notes

  • Discovers 270 files in ~50ms
  • Parses 5 files in ~20ms
  • Tree-sitter parsing is lazy (only on changed files)
  • Fingerprints enable incremental updates

Limitations & Future Work

Current:

  • Basic symbol extraction (no cross-file resolution)
  • Simple import parsing (no alias handling)
  • No docstring extraction yet

Planned:

  • LSP-level symbol resolution
  • Signature extraction for autocomplete
  • Docstring parsing for better context
  • Graph edge creation (who calls what)