# DeepWiki Local

Turn your folders and repos into a browsable "wiki" with search, graphs, and Q&A.

## Status: Steps 0-3 Complete ✅

This implementation includes the foundation of the DeepWiki pipeline:

- **Step 0**: Core data structures for files, documents, symbols, and chunks
- **Step 1**: File discovery with ignore patterns and fingerprinting
- **Step 2**: Symbol extraction using tree-sitter for Python, Rust, and TypeScript/JavaScript
- **Step 3**: Document chunking by semantic units (functions, sections)

## Quick Start

```bash
# Build and run
cargo build
cargo run

# Run tests
cargo test
```

## What It Does

```
1. Discovers files in your project (respects .gitignore)
   └─► 273 files found, 21 skipped

2. Parses files to extract symbols and imports
   └─► Functions, classes, imports identified

3. Chunks documents into searchable pieces
   └─► Per-function chunks for code, per-section for docs
```

## Example Output

```
=== DeepWiki Local - Steps 0-3 ===

Step 1: Discovery
Scanning directory: .
Discovery complete: 273 files found, 21 skipped

Step 2: Parsing
Parsed: example/orders.py (4 symbols)
  - class OrderService
  - function create_order
  - function get_order
  - function list_orders

Step 3: Chunking
Created 4 chunks from example/orders.py
  Chunk 1: lines 5-24 (function create_order)
  Chunk 2: lines 26-28 (function get_order)
```
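
The chunk ranges in the output above are 1-based, inclusive line spans taken from the parser's symbols. As a rough illustration of per-symbol chunking, here is a minimal sketch using only the standard library. `SymbolSpan`, `CodeChunk`, and `chunk_by_symbol` are hypothetical names chosen for this example and are not taken from `chunker.rs`.

```rust
/// A symbol with a 1-based, inclusive line range, as a parser might report it.
/// (Hypothetical type for illustration.)
struct SymbolSpan {
    kind: &'static str,
    name: String,
    start_line: usize,
    end_line: usize,
}

/// One chunk of source text, labeled with the symbol it came from.
struct CodeChunk {
    label: String,
    start_line: usize,
    end_line: usize,
    text: String,
}

/// Produce one chunk per symbol by slicing the source by line range.
fn chunk_by_symbol(source: &str, symbols: &[SymbolSpan]) -> Vec<CodeChunk> {
    let lines: Vec<&str> = source.lines().collect();
    symbols
        .iter()
        .filter_map(|s| {
            // Skip spans that are empty or fall outside the file.
            if s.start_line == 0 || s.start_line > s.end_line || s.end_line > lines.len() {
                return None;
            }
            Some(CodeChunk {
                label: format!("{} {}", s.kind, s.name),
                start_line: s.start_line,
                end_line: s.end_line,
                // Convert the 1-based inclusive range to a 0-based half-open slice.
                text: lines[s.start_line - 1..s.end_line].join("\n"),
            })
        })
        .collect()
}
```

Keeping the original line numbers on each chunk is what later allows answers to cite a `file:line-line` range.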

## Features

### Discovery
- ✅ Gitignore-aware file walking
- ✅ Smart ignore patterns (node_modules, target, .git, etc.)
- ✅ BLAKE3 fingerprinting for change detection
- ✅ Size filtering (max 2MB per file)
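
The discovery rules above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code in `discover.rs`: the function names are hypothetical, "2MB" is read as 2 × 1024 × 1024 bytes, paths are assumed to use `/` separators, and a 64-bit FNV-1a hash stands in for BLAKE3 so the example needs no external crates.

```rust
use std::collections::HashMap;

/// Directory names skipped outright (the examples named in the list above).
const IGNORED_DIRS: [&str; 3] = ["node_modules", "target", ".git"];

/// Size cap for indexed files; assumes "2MB" means 2 * 1024 * 1024 bytes.
const MAX_FILE_BYTES: u64 = 2 * 1024 * 1024;

/// True if any component of a `/`-separated path is an ignored directory.
fn is_ignored(path: &str) -> bool {
    path.split('/').any(|part| IGNORED_DIRS.contains(&part))
}

/// A file is kept only if it is not ignored and not over the size cap.
fn should_index(path: &str, size_bytes: u64) -> bool {
    !is_ignored(path) && size_bytes <= MAX_FILE_BYTES
}

/// Stand-in fingerprint: 64-bit FNV-1a, used here instead of BLAKE3
/// purely to keep the sketch dependency-free.
fn fingerprint(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325;
    for b in bytes {
        hash ^= *b as u64;
        hash = hash.wrapping_mul(0x100000001b3);
    }
    hash
}

/// A file needs re-processing when it is new or its fingerprint changed.
fn is_changed(previous: &HashMap<String, u64>, path: &str, bytes: &[u8]) -> bool {
    previous.get(path) != Some(&fingerprint(bytes))
}
```

Comparing stored fingerprints against fresh ones is what makes incremental updates possible: unchanged files can skip parsing and chunking entirely.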

### Parsing
- ✅ Tree-sitter based symbol extraction
- ✅ Python: functions, classes, imports
- ✅ Rust: functions, structs, use declarations
- ✅ TypeScript/JavaScript: functions, classes, ES6 imports
- ✅ JSON: package.json scripts and dependencies
- ✅ Secret redaction (API keys, tokens)
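
To give a feel for secret redaction, here is a deliberately narrow sketch. It assumes secrets appear as `key = value` or `key: value` lines whose key looks like an identifier, and the key list is illustrative. `SECRET_KEYS`, `is_secret_key`, `redact_line`, and `redact` are hypothetical names; the rules in `parser.rs` may differ.

```rust
/// Key-name fragments treated as secret markers (an illustrative list).
const SECRET_KEYS: [&str; 4] = ["api_key", "token", "secret", "password"];

/// True if the last word before the separator looks like an identifier
/// and contains one of the secret markers.
fn is_secret_key(key_part: &str) -> bool {
    // Only the last word matters, so `export API_KEY=...` still matches.
    let word = match key_part.split_whitespace().last() {
        Some(w) => w.trim_matches('"').to_ascii_lowercase(),
        None => return false,
    };
    // Requiring identifier characters avoids matching code such as `def f(token):`.
    let identifier_like = word
        .chars()
        .all(|c| c.is_ascii_alphanumeric() || c == '_' || c == '-');
    identifier_like && SECRET_KEYS.iter().any(|k| word.contains(*k))
}

/// Replace the value of a secret-looking `key = value` or `key: value` line.
fn redact_line(line: &str) -> String {
    let sep = match line.find(|c: char| c == '=' || c == ':') {
        Some(i) => i,
        None => return line.to_string(),
    };
    let rest = &line[sep + 1..];
    let value = rest.trim_start();
    if value.trim_end().is_empty() || !is_secret_key(&line[..sep]) {
        return line.to_string();
    }
    // Keep the key, the separator, and the whitespace after it.
    let keep = sep + 1 + (rest.len() - value.len());
    format!("{}[REDACTED]", &line[..keep])
}

/// Redact line by line so the line count, and any line-range citation, is preserved.
fn redact(text: &str) -> String {
    text.lines().map(redact_line).collect::<Vec<_>>().join("\n")
}
```

For example, `redact_line("auth_token: abc123")` yields `auth_token: [REDACTED]`, while `def f(token): pass` is left untouched.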

### Chunking
- ✅ Code: one chunk per symbol (function/class)
- ✅ Markdown: one chunk per heading section
- ✅ Line ranges and headings preserved
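
The Markdown rule (one chunk per heading section, with line ranges kept) can be sketched as follows. This is a simplified illustration, not the code in `chunker.rs`: `MdChunk`, `is_heading`, and `chunk_markdown` are hypothetical names, only ATX headings (`#` through `######`) are recognized, and every heading level starts a new chunk.

```rust
/// A Markdown chunk with its heading and 1-based, inclusive line range.
#[derive(Debug, PartialEq)]
struct MdChunk {
    heading: Option<String>,
    start_line: usize,
    end_line: usize,
    text: String,
}

/// ATX heading check: one to six `#` characters followed by a space.
fn is_heading(line: &str) -> bool {
    let hashes = line.chars().take_while(|c| *c == '#').count();
    (1..=6).contains(&hashes) && line[hashes..].starts_with(' ')
}

/// Split Markdown into one chunk per heading section. Text before the first
/// heading becomes a chunk with `heading: None`. Fenced code blocks are
/// tracked so `#` comments inside them are not mistaken for headings.
fn chunk_markdown(source: &str) -> Vec<MdChunk> {
    let lines: Vec<&str> = source.lines().collect();
    // (0-based start index, heading text) for each section.
    let mut starts: Vec<(usize, Option<String>)> = Vec::new();
    let mut in_fence = false;
    for (i, line) in lines.iter().enumerate() {
        let trimmed = line.trim_start();
        let heading_here = !in_fence && is_heading(trimmed);
        if trimmed.starts_with("```") {
            in_fence = !in_fence;
        }
        if heading_here {
            let title = trimmed.trim_start_matches('#').trim().to_string();
            starts.push((i, Some(title)));
        } else if starts.is_empty() {
            // Content before the first heading gets an untitled section.
            starts.push((i, None));
        }
    }
    let mut chunks = Vec::new();
    for (n, (start, heading)) in starts.iter().enumerate() {
        // A section ends just before the next one starts, or at end of file.
        let end = starts.get(n + 1).map_or(lines.len(), |next| next.0);
        chunks.push(MdChunk {
            heading: heading.clone(),
            start_line: start + 1,
            end_line: end,
            text: lines[*start..end].join("\n"),
        });
    }
    chunks
}
```

Splitting on every heading level keeps chunks small; a real chunker might instead merge very short sections or carry the parent heading along for context.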

## Architecture

```
src/
├── main.rs          # Pipeline orchestration
├── types.rs         # Data structures (FileRecord, Document, Symbol, Chunk)
├── discover.rs      # File discovery with ignore patterns
├── parser.rs        # Tree-sitter parsing and symbol extraction
└── chunker.rs       # Document chunking strategies
```
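
For orientation, the four types named for `types.rs` might look roughly like this. The field names and the `SymbolKind` enum are assumptions made for illustration; check `types.rs` for the actual definitions.

```rust
// Hypothetical shapes for the core types; fields are illustrative assumptions.

/// A file found during discovery.
#[derive(Debug, Clone, PartialEq)]
pub struct FileRecord {
    pub path: String,
    pub size_bytes: u64,
    /// Content hash as a hex string, used for change detection.
    pub fingerprint: String,
}

/// Kind of symbol extracted by the parser.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SymbolKind {
    Function,
    Class,
    Struct,
    Import,
}

/// A named symbol with its 1-based, inclusive line range.
#[derive(Debug, Clone, PartialEq)]
pub struct Symbol {
    pub kind: SymbolKind,
    pub name: String,
    pub start_line: usize,
    pub end_line: usize,
}

/// A parsed file: its record plus the symbols found in it.
#[derive(Debug, Clone, PartialEq)]
pub struct Document {
    pub file: FileRecord,
    pub symbols: Vec<Symbol>,
}

/// A searchable slice of a document.
#[derive(Debug, Clone, PartialEq)]
pub struct Chunk {
    pub path: String,
    pub start_line: usize,
    pub end_line: usize,
    /// Symbol name or Markdown heading the chunk belongs to, if any.
    pub heading: Option<String>,
    pub text: String,
}

impl Chunk {
    /// Render a citation in the `file:line-line` form.
    pub fn citation(&self) -> String {
        format!("{}:{}-{}", self.path, self.start_line, self.end_line)
    }
}
```

In this sketch each stage hands the next one a richer type: discovery yields `FileRecord`s, parsing wraps them into `Document`s with `Symbol`s, and chunking emits `Chunk`s that can cite their source range.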

## Documentation

- **[IMPLEMENTATION_SUMMARY.md](IMPLEMENTATION_SUMMARY.md)** - Quick overview of what's implemented
- **[README_STEPS_0_3.md](README_STEPS_0_3.md)** - Detailed documentation with examples

## Dependencies

```toml
blake3 = "1.8.2"              # Fast hashing
ignore = "0.4"                # Gitignore support
tree-sitter = "0.24"          # Language parsing
serde_json = "1.0"            # JSON parsing
anyhow = "1.0"                # Error handling
```

## Testing

All 6 tests pass. They cover:
- Pattern matching for ignore rules
- Secret redaction
- Import parsing (Python, Rust)
- Markdown and code chunking

## Next Steps (Steps 4-7)

- **Step 4**: BM25 keyword indexing with Tantivy
- **Step 5**: Vector embeddings with ONNX
- **Step 6**: Symbol graph building
- **Step 7**: Wiki page synthesis

## Design Philosophy

1. **Fast**: BLAKE3 hashing, tree-sitter parsing, incremental updates
2. **Local-first**: No cloud dependencies, runs offline
3. **Language-agnostic**: Tree-sitter supports 40+ languages; Steps 0-3 wire up Python, Rust, and TypeScript/JavaScript
4. **Precise**: Citations to exact file:line-line ranges

## Performance

- Discovery: ~50ms for 273 files
- Parsing: ~20ms for 5 files
- Chunking: <1ms per document

## Example Use Cases

Once complete, DeepWiki aims to answer questions like these:

- "How do I run this project?" → README.md:19-28
- "Where is create_order defined?" → api/orders.py:12-27
- "What calls this function?" → Graph analysis
- "Generate a flow diagram for checkout" → Synthesized from symbols

## License

[Specify your license]

## Contributing

This is an early-stage implementation. Contributions welcome!