Indexing
This page explains the indexing process in detail.
Overview
Code Search uses Zoekt, a trigram-based search engine, to provide fast code search. The indexing process converts source code into a searchable index.
Trigram Indexing
What are Trigrams?
Trigrams are sequences of three consecutive characters. For example, the word “search” produces these trigrams:

    sea
    ear
    arc
    rch

How It Works
- Tokenization: Source code is split into trigrams
- Inverted Index: Each trigram maps to file locations
- Posting Lists: Each location stores line/byte offsets
- Compression: Index is compressed for efficiency
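The first three steps above can be sketched in a few lines of Python. This is an illustrative sketch, not Zoekt's implementation: the function names `trigrams` and `build_index` are made up for the example, offsets are character offsets rather than byte offsets, and compression is left out.

```python
from collections import defaultdict


def trigrams(text):
    """Return (offset, trigram) pairs for every 3-character window in text."""
    return [(i, text[i:i + 3]) for i in range(len(text) - 2)]


def build_index(files):
    """Map each trigram to a posting list of (path, offset) locations.

    `files` is a dict of path -> content. A real index would store byte
    offsets and compress the posting lists; this sketch does neither.
    """
    index = defaultdict(list)
    for path, content in files.items():
        for offset, gram in trigrams(content):
            index[gram].append((path, offset))
    return index


print([gram for _, gram in trigrams("search")])
# ['sea', 'ear', 'arc', 'rch']
```

Keeping the offset alongside the path is what lets a posting list point at a position inside a file rather than only at the file.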
Search Process
When you search for handleRequest:
- Query split into trigrams: han, and, ndl, dle, leR, eRe, Req, equ, que, ues, est
- Each trigram’s posting list retrieved
- Intersection finds files containing all trigrams
- Verify actual match in content
- Return matching lines
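The search steps above can be sketched as follows. This is a simplified sketch under stated assumptions, not Zoekt's code: the `search` function and its dict-of-files input are hypothetical, posting lists are reduced to sets of paths, and the index is rebuilt on every call to keep the example short.

```python
def trigrams(text):
    """Return the set of distinct 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def search(query, files):
    """Return (path, line_number, line) for every line containing query.

    `files` is a dict of path -> content. Queries shorter than three
    characters have no trigrams, so this sketch then checks every file.
    """
    # Index: trigram -> set of paths whose content contains it.
    index = {}
    for path, content in files.items():
        for gram in trigrams(content):
            index.setdefault(gram, set()).add(path)

    # Split the query into trigrams, fetch each posting list, intersect.
    candidates = set(files)
    for gram in trigrams(query):
        candidates &= index.get(gram, set())

    # Verify the actual match in content and collect matching lines.
    results = []
    for path in sorted(candidates):
        for number, line in enumerate(files[path].splitlines(), start=1):
            if query in line:
                results.append((path, number, line))
    return results


files = {
    "server.go": "package main\nfunc handleRequest() {}\n",
    "util.go": "func handle() {}\n",
}
print(search("handleRequest", files))
# [('server.go', 2, 'func handleRequest() {}')]
```

The verification step matters because the intersection only yields candidates: a file containing `abc bcd` holds both trigrams of the query `abcd` without containing the query itself.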
Indexing Pipeline
- Repository Discovery - The scheduler checks each connection for new or updated repositories based on poll interval.
- Git Clone/Fetch - New repositories are cloned; existing ones are fetched for updates.
- File Enumeration - All files in the repository are listed, respecting .gitignore.
- File Filtering - Binary files and large files are excluded from indexing.
- Language Detection - File language is detected from extension and content.
- Content Extraction - File content is read and prepared for indexing.
- Trigram Generation - Content is split into trigrams for the inverted index.
- Index Building - Zoekt builds the compressed index file.
- Index Deployment - The new index atomically replaces the old one.
File Filtering
Included Files
By default, text files are indexed:
- Source code (.go, .py, .js, etc.)
- Configuration (.yaml, .json, etc.)
- Documentation (.md, .txt, etc.)
Excluded Files
These are excluded by default:
- Binary files (executables, images, etc.)
- Large files (>1MB by default)
- Vendor directories
- Build outputs
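A minimal sketch of how such default exclusions could be checked is shown here. Everything in it is an assumption made for illustration: the function names, the NUL-byte heuristic for detecting binary content, and the directory names in `EXCLUDED_DIRS` are hypothetical, and the indexer's real rules may differ.

```python
from pathlib import PurePosixPath

MAX_FILE_SIZE = 1_048_576  # 1 MB, the default size limit listed above

# Hypothetical names standing in for "vendor directories" and "build
# outputs"; the actual default list is not given on this page.
EXCLUDED_DIRS = {"vendor", "node_modules", "build", "dist"}


def looks_binary(data, sample_size=8000):
    """Heuristic: treat content with a NUL byte near its start as binary."""
    return b"\x00" in data[:sample_size]


def should_index(path, data):
    """Return True if a file passes the default exclusion checks."""
    if len(data) > MAX_FILE_SIZE:
        return False
    # Compare only the directory components, not the file name itself.
    if EXCLUDED_DIRS.intersection(PurePosixPath(path).parts[:-1]):
        return False
    return not looks_binary(data)
```

The size check runs first because it is the cheapest and avoids scanning the content of files that would be rejected anyway.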
Custom Filters
Configure file patterns in config.yaml:

    indexer:
      include_patterns:
        - "*.go"
        - "*.py"
        - "*.js"
      exclude_patterns:
        - "*_test.go"
        - "vendor/**"
        - "node_modules/**"
      max_file_size: 1048576 # 1MB

Branch Handling
Default Branch
By default, only the default branch is indexed.
Multiple Branches
Configure additional branches per repository:

    connections:
      - name: github
        type: github
        branches:
          - main
          - develop
          - "release/*"

Branch Indexing
Each branch has its own index:

    index/
    ├── myorg_api_main.zoekt
    ├── myorg_api_develop.zoekt
    └── myorg_api_release_v1.zoekt

Incremental Updates
Full Index vs Incremental
- Full: Complete rebuild of the index
- Incremental: Only changed files re-indexed
When Full Index Runs
- Initial repository indexing
- Major schema changes
- Manual force re-index
Incremental Detection
- Git fetch to update local clone
- Compare current HEAD to indexed commit
- List changed files via git diff
- Re-index only changed files
- Merge updates into existing index
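The comparison step can be sketched with `git diff --name-status`, which prints one tab-separated status and path per changed file. The helper names `parse_name_status` and `changed_files` are hypothetical, and this sketch assumes paths without special characters (Git quotes unusual paths unless `-z` is used); it is not the indexer's actual code.

```python
import subprocess


def parse_name_status(output):
    """Split `git diff --name-status` output into changed and deleted paths."""
    changed, deleted = [], []
    for line in output.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        status = fields[0]
        if status.startswith("D"):
            deleted.append(fields[1])
        else:
            # Renames and copies list the old and the new path; the last
            # field is the path that exists in the new commit.
            changed.append(fields[-1])
            if status.startswith("R"):
                deleted.append(fields[1])
    return changed, deleted


def changed_files(repo_dir, indexed_commit, head="HEAD"):
    """Return (changed, deleted) paths between the indexed commit and head."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "diff", "--name-status", indexed_commit, head],
        check=True, capture_output=True, text=True,
    )
    return parse_name_status(result.stdout)
```

Tracking deletions separately matters for the merge step: files in the changed list need re-indexing, while deleted paths need their entries dropped from the existing index.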
Index Storage
Directory Structure

    /var/lib/code-search/data/
    ├── repos/                  # Git repositories
    │   ├── github.com/
    │   │   └── myorg/
    │   │       └── api/
    │   │           └── .git/
    │   └── gitlab.com/
    │       └── ...
    └── index/                  # Zoekt indexes
        ├── myorg_api_main_v1.zoekt
        └── ...

Index Files
Each .zoekt file contains:
- Trigram inverted index
- File metadata
- Content for snippet display
- Compression metadata
Disk Usage
Typical index size: 10-30% of source code size
| Repo Size | Index Size |
|---|---|
| 10 MB | 1-3 MB |
| 100 MB | 10-30 MB |
| 1 GB | 100-300 MB |
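For capacity planning, the 10-30% rule of thumb can be turned into a quick estimate. The helper `estimate_index_size` is hypothetical, and the result is only as good as the rule of thumb itself; actual index sizes depend on the content being indexed.

```python
def estimate_index_size(source_bytes, low_pct=10, high_pct=30):
    """Return the (low, high) expected index size in bytes."""
    return source_bytes * low_pct // 100, source_bytes * high_pct // 100


MB = 1_000_000
low, high = estimate_index_size(100 * MB)
print(f"{low / MB:.0f}-{high / MB:.0f} MB")  # 10-30 MB
```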
Performance Optimization
Concurrent Indexing
Multiple repositories can be indexed in parallel:

    indexer:
      concurrency: 4 # Number of workers

Memory Management
Index building is memory-intensive. Configure limits:

    indexer:
      max_memory_mb: 4096

CPU Usage
Indexing is CPU-bound. Consider:
- Running the indexer on a separate machine
- Scheduling during off-hours
- Rate limiting large repos
Monitoring
Index Statistics
Via API:

    GET /api/stats/index

    {
      "total_repos": 45,
      "indexed_repos": 42,
      "total_files": 156789,
      "total_size_bytes": 4567890123,
      "index_size_bytes": 1234567890,
      "last_full_index": "2024-01-15T02:00:00Z"
    }

Job Metrics
Monitor indexing jobs:
- Jobs queued
- Jobs running
- Jobs completed
- Average duration
- Error rate
Troubleshooting
Index Not Updating
- Check job status in Web UI
- Verify connection is healthy
- Check for errors in job logs
- Try force re-index
Slow Indexing
- Check repository size
- Review file filters
- Increase worker concurrency
- Check disk I/O
Search Missing Results
- Verify repository is indexed
- Check file was not excluded
- Try force re-index
- Check query syntax
Next Steps
- Data Flow - Request flows
- Components - Component details