sosokker/temp-deepwiki

sirin.ph 57bcc60d3c temp commit

2025-10-01 18:01:57 +07:00

DeepWiki Local

Turn your folders and repos into a browsable "wiki" with search, graphs, and Q&A.

Status: Steps 0-3 Complete ✅

This implementation includes the foundation of the DeepWiki pipeline:

Step 0: Core data structures for files, documents, symbols, and chunks
Step 1: File discovery with ignore patterns and fingerprinting
Step 2: Symbol extraction using tree-sitter for Python, Rust, and TypeScript/JavaScript
Step 3: Document chunking by semantic units (functions, sections)

Quick Start

# Build and run
cargo build
cargo run

# Run tests
cargo test

What It Does

1. Discovers files in your project (respects .gitignore)
   └─► 273 files found, 21 skipped

2. Parses files to extract symbols and imports
   └─► Functions, classes, imports identified

3. Chunks documents into searchable pieces
   └─► Per-function chunks for code, per-section for docs

Example Output

=== DeepWiki Local - Steps 0-3 ===

Step 1: Discovery
Scanning directory: .
Discovery complete: 273 files found, 21 skipped

Step 2: Parsing
Parsed: example/orders.py (4 symbols)
  - class OrderService
  - function create_order
  - function get_order
  - function list_orders

Step 3: Chunking
Created 4 chunks from example/orders.py
  Chunk 1: lines 5-24 (function create_order)
  Chunk 2: lines 26-28 (function get_order)

Features

Discovery

✅ Gitignore-aware file walking
✅ Smart ignore patterns (node_modules, target, .git, etc.)
✅ BLAKE3 fingerprinting for change detection
✅ Size filtering (max 2MB per file)
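
The ignore and size rules above can be sketched as a single predicate. This is a minimal illustration using only the standard library, not the code in discover.rs: the helper name `should_skip` is hypothetical, the directory list is taken from the bullet above, and "2MB" is assumed to mean 2 * 1024 * 1024 bytes. Gitignore handling (done by the `ignore` crate in the project) is left out.

```rust
use std::path::Path;

/// Directory names skipped outright (from the feature list above).
const IGNORED_DIRS: [&str; 3] = ["node_modules", "target", ".git"];

/// Size cap from the feature list. Assumption: 2 MB means 2 * 1024 * 1024 bytes.
const MAX_FILE_BYTES: u64 = 2 * 1024 * 1024;

/// Returns true when a file should be skipped during discovery.
/// Hypothetical helper: the name and signature are not taken from the project.
fn should_skip(path: &Path, size_bytes: u64) -> bool {
    // A path is ignored if any of its components is exactly an ignored directory name.
    let in_ignored_dir = path.components().any(|component| {
        component
            .as_os_str()
            .to_str()
            .map_or(false, |name| IGNORED_DIRS.contains(&name))
    });
    in_ignored_dir || size_bytes > MAX_FILE_BYTES
}
```

Matching whole path components rather than substrings keeps a file such as src/targets.rs from being skipped just because its name contains "target".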

Parsing

✅ Tree-sitter based symbol extraction
✅ Python: functions, classes, imports
✅ Rust: functions, structs, use declarations
✅ TypeScript/JavaScript: functions, classes, ES6 imports
✅ JSON: package.json scripts and dependencies
✅ Secret redaction (API keys, tokens)
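
The README does not say how secrets are detected, so the following is only one possible approach: a key-name heuristic over `key=value` and `key: value` lines. The key list, the `[REDACTED]` marker, and the function names `redact_line` and `redact` are all assumptions made for this sketch.

```rust
/// Key names treated as sensitive. Illustrative list, not the project's.
const SENSITIVE_KEYS: [&str; 3] = ["api_key", "token", "secret"];

/// Redacts the value of a single `key=value` or `key: value` line
/// when the key looks sensitive. Other lines are returned unchanged.
fn redact_line(line: &str) -> String {
    // Find the first separator; lines without one pass through untouched.
    let idx = match line.find(|c: char| c == '=' || c == ':') {
        Some(idx) => idx,
        None => return line.to_string(),
    };
    let key = line[..idx].to_ascii_lowercase();
    if SENSITIVE_KEYS.iter().any(|k| key.contains(*k)) {
        // Keep the key and separator, drop the value.
        format!("{}[REDACTED]", &line[..=idx])
    } else {
        line.to_string()
    }
}

/// Applies `redact_line` to every line of a text.
/// Note: a trailing newline in the input is not preserved.
fn redact(text: &str) -> String {
    text.lines().map(redact_line).collect::<Vec<_>>().join("\n")
}
```

With this heuristic, `API_KEY=sk-12345` becomes `API_KEY=[REDACTED]`, while `name = demo` is left alone. A pattern-based detector that recognizes token formats would catch secrets this sketch misses.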

Chunking

✅ Code: one chunk per symbol (function/class)
✅ Markdown: one chunk per heading section
✅ Line ranges and headings preserved
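
As a rough illustration of "one chunk per heading section", the sketch below splits Markdown on lines that start with `#` and records a 1-based inclusive line range for each section. The `SectionChunk` type and `chunk_markdown` function are hypothetical names, and the sketch makes simplifying assumptions: text before the first heading is dropped, and `#` lines inside fenced code blocks are treated as headings.

```rust
/// One Markdown section. Hypothetical type for this sketch.
#[derive(Debug, PartialEq)]
struct SectionChunk {
    heading: String,   // heading text without the leading '#' marks
    start_line: usize, // 1-based, inclusive
    end_line: usize,   // 1-based, inclusive
}

/// Splits Markdown text into one chunk per heading section.
fn chunk_markdown(text: &str) -> Vec<SectionChunk> {
    let mut chunks: Vec<SectionChunk> = Vec::new();
    for (i, line) in text.lines().enumerate() {
        let line_no = i + 1;
        if line.starts_with('#') {
            // A heading opens a new section that initially spans one line.
            chunks.push(SectionChunk {
                heading: line.trim_start_matches('#').trim().to_string(),
                start_line: line_no,
                end_line: line_no,
            });
        } else if let Some(current) = chunks.last_mut() {
            // Any other line extends the most recent section.
            current.end_line = line_no;
        }
        // Lines before the first heading are ignored in this sketch.
    }
    chunks
}
```

For the input `"# Title\nintro\n\n## Usage\nrun it\n"` this yields two chunks: "Title" covering lines 1-3 and "Usage" covering lines 4-5.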

Architecture

src/
├── main.rs          # Pipeline orchestration
├── types.rs         # Data structures (FileRecord, Document, Symbol, Chunk)
├── discover.rs      # File discovery with ignore patterns
├── parser.rs        # Tree-sitter parsing and symbol extraction
└── chunker.rs       # Document chunking strategies
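
To make the data flow between these modules concrete, here is a hedged sketch of the four types named in types.rs. Only the type names (FileRecord, Document, Symbol, Chunk) come from this README; every field, the `SymbolKind` enum, and the `chunks_for` helper are assumptions for illustration and may not match the real definitions.

```rust
use std::path::PathBuf;

/// A discovered file. Field names are illustrative assumptions.
#[derive(Debug, Clone, PartialEq)]
struct FileRecord {
    path: PathBuf,
    size_bytes: u64,
    fingerprint: String, // hex digest of the contents, e.g. from BLAKE3
}

/// Kinds of symbols mentioned in the feature list. Hypothetical enum.
#[derive(Debug, Clone, Copy, PartialEq)]
enum SymbolKind {
    Function,
    Class,
    Struct,
    Import,
}

impl SymbolKind {
    fn label(self) -> &'static str {
        match self {
            SymbolKind::Function => "function",
            SymbolKind::Class => "class",
            SymbolKind::Struct => "struct",
            SymbolKind::Import => "import",
        }
    }
}

/// A named symbol with its 1-based inclusive line range.
#[derive(Debug, Clone, PartialEq)]
struct Symbol {
    name: String,
    kind: SymbolKind,
    start_line: usize,
    end_line: usize,
}

/// A parsed file together with the symbols found in it.
#[derive(Debug, Clone, PartialEq)]
struct Document {
    file: FileRecord,
    symbols: Vec<Symbol>,
}

/// A searchable piece of a document.
#[derive(Debug, Clone, PartialEq)]
struct Chunk {
    path: PathBuf,
    start_line: usize,
    end_line: usize,
    heading: Option<String>,
}

/// One chunk per non-import symbol, mirroring "one chunk per symbol".
fn chunks_for(doc: &Document) -> Vec<Chunk> {
    doc.symbols
        .iter()
        .filter(|s| s.kind != SymbolKind::Import)
        .map(|s| Chunk {
            path: doc.file.path.clone(),
            start_line: s.start_line,
            end_line: s.end_line,
            heading: Some(format!("{} {}", s.kind.label(), s.name)),
        })
        .collect()
}
```

Under these assumptions, a `create_order` function spanning lines 5-24 of example/orders.py produces a chunk headed "function create_order" with the same line range, which matches the shape of the example output.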

Documentation

IMPLEMENTATION_SUMMARY.md - Quick overview of what's implemented
README_STEPS_0_3.md - Detailed documentation with examples

Dependencies

blake3 = "1.8.2"              # Fast hashing
ignore = "0.4"                # Gitignore support
tree-sitter = "0.24"          # Language parsing
serde_json = "1.0"            # JSON parsing
anyhow = "1.0"                # Error handling

Testing

All 6 tests pass. They cover:

Pattern matching for ignore rules
Secret redaction
Import parsing (Python, Rust)
Markdown and code chunking

Next Steps (Steps 4-7)

Step 4: BM25 keyword indexing with Tantivy
Step 5: Vector embeddings with ONNX
Step 6: Symbol graph building
Step 7: Wiki page synthesis

Design Philosophy

Fast: BLAKE3 hashing, tree-sitter parsing, incremental updates
Local-first: No cloud dependencies, runs offline
Language-agnostic: Tree-sitter has grammars for 40+ languages, so support can grow beyond the current Python, Rust, and TypeScript/JavaScript parsers
Precise: Citations to exact file:line-line ranges
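
The "incremental updates" idea can be sketched as a comparison of content fingerprints between runs: only files whose fingerprint is new or different need to be re-parsed. The project uses BLAKE3 for fingerprints; to keep this sketch free of external crates it substitutes the standard library's `DefaultHasher`, which is not cryptographic and not guaranteed stable across Rust releases. The function names `fingerprint` and `changed_paths` are hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Stand-in fingerprint. Assumption: the real pipeline would call BLAKE3 here;
/// DefaultHasher is used only so the sketch has no dependencies.
fn fingerprint(contents: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    contents.hash(&mut hasher);
    hasher.finish()
}

/// Returns the paths (sorted) whose contents are new or changed
/// compared with the fingerprints stored from the previous run.
fn changed_paths(
    previous: &HashMap<String, u64>,
    current: &[(String, Vec<u8>)],
) -> Vec<String> {
    let mut changed: Vec<String> = current
        .iter()
        .filter(|(path, bytes)| previous.get(path) != Some(&fingerprint(bytes)))
        .map(|(path, _)| path.clone())
        .collect();
    changed.sort();
    changed
}
```

Files that were deleted since the previous run are not reported by this sketch; a fuller version would also diff the key sets.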

Performance

Discovery: ~50ms for 273 files
Parsing: ~20ms for 5 files
Chunking: <1ms per document

Example Use Cases

Once complete, DeepWiki will answer:

"How do I run this project?" → README.md:19-28
"Where is create_order defined?" → api/orders.py:12-27
"What calls this function?" → Graph analysis
"Generate a flow diagram for checkout" → Synthesized from symbols
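
The `file:line-line` answers above suggest a small citation type plus a lookup by symbol name. The sketch below is one way that could look; `Citation`, `find_definition`, and the slice-of-pairs index are hypothetical and stand in for whatever index Steps 4-7 end up building.

```rust
use std::fmt;

/// A pointer to an exact line range in a file. Hypothetical type.
struct Citation {
    path: String,
    start_line: usize,
    end_line: usize,
}

impl fmt::Display for Citation {
    /// Renders the citation as `path:start-end`.
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}:{}-{}", self.path, self.start_line, self.end_line)
    }
}

/// Looks up where a symbol is defined in a (name, citation) index.
/// A linear scan is enough for a sketch; a real index would use a map.
fn find_definition<'a>(index: &'a [(String, Citation)], name: &str) -> Option<&'a Citation> {
    index
        .iter()
        .find(|(symbol, _)| symbol.as_str() == name)
        .map(|(_, citation)| citation)
}
```

With an index entry for `create_order` at lines 12-27 of api/orders.py, `find_definition(&index, "create_order")` formats as `api/orders.py:12-27`, the form used in the list above.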

License

[Specify your license]

Contributing

This is an early-stage implementation. Contributions welcome!
