sosokker/temp-deepwiki

Fork 0

sirin.ph 57bcc60d3c temp commit

2025-10-01 18:01:57 +07:00

6.7 KiB

Raw Blame History

DeepWiki Local - Steps 0-3 Implementation

This document describes the implementation of the first phase of DeepWiki: Discovery, Parsing, and Chunking.

Overview

Steps 0-3 form the foundation of the DeepWiki pipeline, transforming raw files into structured, searchable pieces:

Step 0: Define core data structures
Step 1: Discover files with ignore patterns and fingerprinting
Step 2: Parse files to extract symbols, imports, and metadata
Step 3: Chunk documents into searchable pieces

What's Implemented

Core Modules

`src/types.rs` - Data Structures (Step 0)

Defines all core types:

FileRecord: Represents a discovered file with path, size, mtime, and fingerprint
Document: Parsed file with normalized content, type detection, symbols, imports, and facts
DocumentType: Enum for file types (Markdown, Python, TypeScript, Rust, JSON, etc.)
Symbol: Code symbols (functions, classes, structs) with line ranges
Import: Import statements with module and imported items
Fact: Extracted metadata (scripts, ports, dependencies)
Chunk: Searchable text segments with line ranges and optional headings

`src/discover.rs` - File Discovery (Step 1)

Features:

Walks directory trees using the ignore crate (respects .gitignore)
Smart ignore patterns:
- .git/**, node_modules/**, target/**, dist/**, build/**
- Lock files: **/*.lock, *-lock.json
- IDE folders: .vscode/**, .idea/**
- Python cache: __pycache__/**, *.pyc
Size filtering: skips files > 2MB
Content fingerprinting using BLAKE3 hash (first 16 chars)
Cross-platform path handling (Windows and Unix)

Output:

Found: 270 files, skipped: 20

`src/parser.rs` - Document Parsing (Step 2)

Features:

UTF-8 decoding and newline normalization (\r\n → \n)
Secret redaction for:
- OpenAI keys (sk-...)
- GitHub tokens (ghp_...)
- AWS credentials (AKIA..., secret keys)
Tree-sitter based parsing for:
- Python: Functions, classes, imports (import, from...import)
- Rust: Functions, structs, use declarations
- TypeScript/JavaScript: Functions, classes, ES6 imports
JSON parsing for package.json:
- Extracts npm scripts
- Extracts dependencies

Symbol Extraction Examples:

Python:

def create_order(user_id):  # Symbol: Function "create_order" lines 5-10
    pass

class OrderService:          # Symbol: Class "OrderService" lines 12-30
    pass

TypeScript:

function OrdersPage() {      // Symbol: Function "OrdersPage" lines 1-50
    return <div>...</div>;
}

`src/chunker.rs` - Document Chunking (Step 3)

Features:

Code chunking: One chunk per symbol (function/class)
Markdown chunking: One chunk per heading section
Generic chunking: 100-line chunks with 2-line overlap
Chunks include:
- Start/end line numbers
- Full text content
- Optional heading/symbol name

Chunking Strategy:

File Type	Strategy	Example
Python/TS/Rust	Per symbol	Each function = 1 chunk
Markdown	Per section	Each `# Heading` = 1 chunk
JSON/YAML/Other	Fixed size	100 lines with overlap

Output:

Created 6 chunks from README.md
  Chunk 1: lines 1-4 (21 chars) - heading: "Overview"
  Chunk 2: lines 5-6 (25 chars) - heading: "Installation"

Running the Code

Build and Run

cargo build
cargo run

Run Tests

cargo test

Test Coverage:

✅ Ignore pattern matching (directory and file patterns)
✅ Secret redaction (API keys, tokens)
✅ Import parsing (Python, Rust, TypeScript)
✅ Markdown chunking (by heading)
✅ Code chunking (by symbol)

Example Output

=== DeepWiki Local - Steps 0-3 ===

Step 1: Discovery
Scanning directory: .
Discovery complete: 270 files found, 20 skipped
Found 270 files

Step 2: Parsing
Parsed: .\.github\instructions\rust-guide.instructions.md (0 symbols)
Parsed: .\Cargo.toml (0 symbols)
Parsed: .\src\main.rs (1 symbols)
Parsed: .\src\discover.rs (3 symbols)
Parsed: .\src\parser.rs (15 symbols)

Step 3: Chunking
Created 6 chunks from README.md
  Chunk 1: lines 1-4
  Chunk 2: lines 5-12
  Chunk 3: lines 13-25

Data Flow

1. Discovery
   Input:  Root directory "."
   Output: Vec<FileRecord> with paths and fingerprints

2. Parsing
   Input:  FileRecord
   Process: Read → Normalize → Redact → Extract symbols/imports
   Output: Document with structured data

3. Chunking
   Input:  Document
   Process: Split by symbol/heading/fixed-size
   Output: Vec<Chunk> ready for indexing

File Structure

src/
├── main.rs          # Orchestrates steps 1-3
├── types.rs         # Core data structures
├── discover.rs      # File discovery with ignore patterns
├── parser.rs        # Tree-sitter parsing + symbol extraction
└── chunker.rs       # Document chunking strategies

Dependencies

[dependencies]
blake3 = "1.8.2"              # Fast hashing for fingerprints
ignore = "0.4"                # Gitignore-aware directory walking
tree-sitter = "0.24"          # Language parsing
tree-sitter-python = "0.23"
tree-sitter-rust = "0.23"
tree-sitter-typescript = "0.23"
tree-sitter-javascript = "0.23"
serde_json = "1.0"            # JSON parsing
regex = "1.10"                # Pattern matching
anyhow = "1.0"                # Error handling

[dev-dependencies]
pretty_assertions = "1.4"     # Better test diffs

Next Steps (Steps 4-7)

The foundation is ready for:

Step 4: BM25 keyword indexing (Tantivy)
Step 5: Vector embeddings (ONNX + all-MiniLM-L6-v2)
Step 6: Symbol graph building
Step 7: Wiki page synthesis

Design Decisions

Why Tree-sitter?

Language-agnostic parsing
Fast and incremental
Robust to syntax errors
Used by GitHub, Atom, Neovim

Why BLAKE3?

Faster than SHA256
16-char prefix provides enough uniqueness for fingerprinting

Why Chunks?

Search engines need bounded text pieces
LLMs have token limits
Enables precise citations (file:line-line)

Testing Philosophy

All tests follow project guidelines:

Use pretty_assertions::assert_eq for better diffs
Tests run after every change
No approval needed for cargo fmt

Performance Notes

Discovers 270 files in ~50ms
Parses 5 files in ~20ms
Tree-sitter parsing is lazy (only on changed files)
Fingerprints enable incremental updates

Limitations & Future Work

Current:

Basic symbol extraction (no cross-file resolution)
Simple import parsing (no alias handling)
No docstring extraction yet

Planned:

LSP-level symbol resolution
Signature extraction for autocomplete
Docstring parsing for better context
Graph edge creation (who calls what)

6.7 KiB Raw Blame History

DeepWiki Local - Steps 0-3 Implementation

Overview

What's Implemented

Core Modules

src/types.rs - Data Structures (Step 0)

src/discover.rs - File Discovery (Step 1)

src/parser.rs - Document Parsing (Step 2)

src/chunker.rs - Document Chunking (Step 3)

Running the Code

Build and Run

Run Tests

Example Output

Data Flow

File Structure

Dependencies

Next Steps (Steps 4-7)

Design Decisions

Why Tree-sitter?

Why BLAKE3?

Why Chunks?

Testing Philosophy

Performance Notes

Limitations & Future Work

6.7 KiB

Raw Blame History

`src/types.rs` - Data Structures (Step 0)

`src/discover.rs` - File Discovery (Step 1)

`src/parser.rs` - Document Parsing (Step 2)

`src/chunker.rs` - Document Chunking (Step 3)