Indexing

This page explains the indexing process in detail.

Code Search uses Zoekt, a trigram-based search engine, to provide fast code search. The indexing process converts source code into a searchable index.

Trigrams are sequences of three consecutive characters. For example, the word “search” produces these trigrams:

sea
ear
arc
rch

The index is built in four stages:

  1. Tokenization: Source code is split into trigrams
  2. Inverted Index: Each trigram maps to file locations
  3. Posting Lists: Each location stores line/byte offsets
  4. Compression: Index is compressed for efficiency
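
The trigram split and the inverted index described above can be sketched in a few lines of Python. This is an illustration of the idea only, not Zoekt's implementation: the function names are hypothetical, compression is omitted, and offsets are character positions (equal to byte offsets only for ASCII content).

```python
from collections import defaultdict


def trigrams(text):
    """Return every run of three consecutive characters in text, in order."""
    return [text[i:i + 3] for i in range(len(text) - 2)]


def build_index(files):
    """Map each trigram to a posting list of (path, offset) pairs.

    `files` is a dict of path -> content (hypothetical input shape).
    """
    index = defaultdict(list)
    for path, content in files.items():
        for offset in range(len(content) - 2):
            index[content[offset:offset + 3]].append((path, offset))
    return index


print(trigrams("search"))  # ['sea', 'ear', 'arc', 'rch']
```

Strings shorter than three characters produce no trigrams, which is why very short queries cannot be narrowed by the index in this sketch.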

When you search for handleRequest:

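  1. The query is split into trigrams: han, and, ndl, dle, leR, eRe, Req, equ, que, ues, est
  2. Each trigram’s posting list is retrieved
  3. The posting lists are intersected to find files containing all trigrams
  4. Each candidate file is checked for an actual match in its content, because a file can contain every trigram without containing the full string
  5. Matching lines are returned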
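
The query steps above can be sketched as follows. This is a simplified model, not Zoekt's code: the index here maps each trigram to a set of file paths rather than to offsets, and `build_index` and `search` are hypothetical names.

```python
def trigrams(text):
    """Return every run of three consecutive characters in text."""
    return [text[i:i + 3] for i in range(len(text) - 2)]


def build_index(files):
    """Map each trigram to the set of paths whose content contains it."""
    index = {}
    for path, content in files.items():
        for gram in trigrams(content):
            index.setdefault(gram, set()).add(path)
    return index


def search(files, index, query):
    """Return (path, line number, line) for every line containing query."""
    grams = trigrams(query)
    if grams:
        # Steps 1-3: intersect the posting lists of all query trigrams.
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    else:
        # Queries under three characters have no trigrams: scan every file.
        candidates = set(files)

    # Steps 4-5: verify the real match and collect matching lines.
    results = []
    for path in sorted(candidates):
        for lineno, line in enumerate(files[path].splitlines(), start=1):
            if query in line:
                results.append((path, lineno, line))
    return results
```

The verification step matters: a file containing "abc bcd" holds both trigrams of the query "abcd" and survives the intersection, but it is rejected once its content is checked.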

The indexing pipeline runs these steps for each repository:

  1. Repository Discovery - The scheduler checks each connection for new or updated repositories based on poll interval.
  2. Git Clone/Fetch - New repositories are cloned; existing ones are fetched for updates.
  3. File Enumeration - All files in the repository are listed, respecting .gitignore.
  4. File Filtering - Binary files and large files are excluded from indexing.
  5. Language Detection - File language is detected from extension and content.
  6. Content Extraction - File content is read and prepared for indexing.
  7. Trigram Generation - Content is split into trigrams for the inverted index.
  8. Index Building - Zoekt builds the compressed index file.
  9. Index Deployment - New index replaces old, atomically.

## File Filtering

By default, text files are indexed:

  • Source code (.go, .py, .js, etc.)
  • Configuration (.yaml, .json, etc.)
  • Documentation (.md, .txt, etc.)

These are excluded by default:

  • Binary files (executables, images, etc.)
  • Large files (>1MB by default)
  • Vendor directories
  • Build outputs

Configure file patterns in config.yaml:

indexer:
  include_patterns:
    - "*.go"
    - "*.py"
    - "*.js"
  exclude_patterns:
    - "*_test.go"
    - "vendor/**"
    - "node_modules/**"
  max_file_size: 1048576 # 1MB
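
To show how these three settings might combine, here is a minimal sketch. It assumes Python's `fnmatch` glob semantics (where `*` also matches `/`), patterns matched against the repository-relative path, and exclude patterns taking precedence over include patterns. The indexer's actual glob rules and precedence may differ, and `should_index` is a hypothetical name.

```python
from fnmatch import fnmatchcase

# Values copied from the config.yaml example above.
INCLUDE_PATTERNS = ["*.go", "*.py", "*.js"]
EXCLUDE_PATTERNS = ["*_test.go", "vendor/**", "node_modules/**"]
MAX_FILE_SIZE = 1048576  # 1MB


def should_index(path, size):
    """Decide whether a file is indexed (assumed precedence: size, exclude, include)."""
    if size > MAX_FILE_SIZE:
        return False
    if any(fnmatchcase(path, pattern) for pattern in EXCLUDE_PATTERNS):
        return False
    return any(fnmatchcase(path, pattern) for pattern in INCLUDE_PATTERNS)


for path, size in [
    ("cmd/main.go", 2048),
    ("cmd/main_test.go", 2048),
    ("vendor/lib/util.go", 2048),
    ("data/big.py", 5_000_000),
]:
    print(path, should_index(path, size))
```

Under these assumptions only `cmd/main.go` is indexed: the others are rejected by the test-file pattern, the vendor pattern, and the size limit respectively.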

By default, only the default branch is indexed.

Configure additional branches per repository:

connections:
  - name: github
    type: github
    branches:
      - main
      - develop
      - "release/*"

Each branch has its own index:

index/
├── myorg_api_main.zoekt
├── myorg_api_develop.zoekt
└── myorg_api_release_v1.zoekt
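
As a rough illustration of how branch patterns could select branches and map them to index files, consider the sketch below. The naming rule (repository and branch joined with `_`, with `/` replaced by `_`) is inferred from the listing above and is an assumption, as are the function names and the use of `fnmatch`-style matching.

```python
from fnmatch import fnmatchcase


def branches_to_index(all_branches, patterns):
    """Keep the branches that match at least one configured pattern."""
    return [b for b in all_branches
            if any(fnmatchcase(b, p) for p in patterns)]


def index_filename(repo, branch):
    """Assumed naming rule: '<repo>_<branch>.zoekt' with '/' replaced by '_'."""
    return f"{repo}_{branch}".replace("/", "_") + ".zoekt"


patterns = ["main", "develop", "release/*"]
branches = ["main", "develop", "release/v1", "feature/login"]

for branch in branches_to_index(branches, patterns):
    print(index_filename("myorg/api", branch))
```

With the branches shown, `feature/login` matches no pattern and is skipped, while the other three produce the file names in the listing above.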

There are two indexing modes:

  • Full: Complete rebuild of the index
  • Incremental: Only changed files re-indexed

A full index is performed for:

  • Initial repository indexing
  • Major schema changes
  • Manual force re-index

An incremental index follows these steps:

  1. Git fetch to update local clone
  2. Compare current HEAD to indexed commit
  3. List changed files via git diff
  4. Re-index only changed files
  5. Merge updates into existing index
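
Steps 2 and 3 of the incremental update (comparing the indexed commit to HEAD and listing changed files) could look roughly like this. The sketch shells out to `git diff --name-status`, which is a standard Git option; the function names and the split into "re-index" and "remove" lists are assumptions for illustration, not the indexer's actual code.

```python
import subprocess


def parse_name_status(output):
    """Split `git diff --name-status` output into (paths to re-index, paths to remove)."""
    reindex, remove = [], []
    for line in output.splitlines():
        if not line.strip():
            continue
        status, *paths = line.split("\t")
        if status.startswith("D"):
            # Deleted file: drop it from the index.
            remove.append(paths[0])
        else:
            if status.startswith("R"):
                # Rename: the old path leaves the index, the new one enters it.
                remove.append(paths[0])
            reindex.append(paths[-1])
    return reindex, remove


def changed_files(repo_dir, indexed_commit, head="HEAD"):
    """Run git diff between the indexed commit and head in repo_dir."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "diff", "--name-status", indexed_commit, head],
        check=True, capture_output=True, text=True,
    )
    return parse_name_status(result.stdout)
```

Tracking deletions and renames separately matters for step 5: merging updates has to remove stale paths as well as add new content.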

Data is stored on disk in the following layout:

/var/lib/code-search/data/
├── repos/                  # Git repositories
│   ├── github.com/
│   │   └── myorg/
│   │       └── api/
│   │           └── .git/
│   └── gitlab.com/
│       └── ...
└── index/                  # Zoekt indexes
    ├── myorg_api_main_v1.zoekt
    └── ...

Each .zoekt file contains:

  • Trigram inverted index
  • File metadata
  • Content for snippet display
  • Compression metadata

Typical index size: 10-30% of source code size

| Repo Size | Index Size |
|-----------|------------|
| 10 MB     | 1-3 MB     |
| 100 MB    | 10-30 MB   |
| 1 GB      | 100-300 MB |

Multiple repositories can be indexed in parallel:

indexer:
  concurrency: 4 # Number of workers
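
The worker model behind this setting can be pictured as a fixed-size pool that pulls repositories from a queue. The sketch below uses a thread pool for brevity; `index_repo` is a hypothetical placeholder for the real per-repository work, and a CPU-heavy implementation in Python would more likely use processes than threads.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

CONCURRENCY = 4  # mirrors indexer.concurrency in the config above


def index_repo(repo):
    """Hypothetical stand-in for indexing a single repository."""
    return f"indexed {repo}"


def index_all(repos, workers=CONCURRENCY):
    """Index repositories with at most `workers` running at once."""
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(index_repo, repo): repo for repo in repos}
        for future in as_completed(futures):
            repo = futures[future]
            try:
                results[repo] = future.result()
            except Exception as exc:
                # One failing repository should not stop the others.
                results[repo] = f"failed: {exc}"
    return results


print(index_all(["myorg/api", "myorg/web", "myorg/docs"]))
```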

Index building is memory-intensive. Configure limits:

indexer:
  max_memory_mb: 4096

Indexing is also CPU-bound. To limit its impact on other workloads, consider:

  • Running indexer on separate machine
  • Scheduling during off-hours
  • Rate limiting large repos

Index statistics are available via the API:

GET /api/stats/index

{
  "total_repos": 45,
  "indexed_repos": 42,
  "total_files": 156789,
  "total_size_bytes": 4567890123,
  "index_size_bytes": 1234567890,
  "last_full_index": "2024-01-15T02:00:00Z"
}
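
As a small worked example, the response above can be checked against the typical 10-30% index-to-source ratio. The snippet parses the sample JSON as a literal string rather than calling the endpoint, so it makes no assumptions about host, port, or authentication.

```python
import json

# Sample response body copied from the example above.
response = """
{
  "total_repos": 45,
  "indexed_repos": 42,
  "total_files": 156789,
  "total_size_bytes": 4567890123,
  "index_size_bytes": 1234567890,
  "last_full_index": "2024-01-15T02:00:00Z"
}
"""

stats = json.loads(response)
ratio = stats["index_size_bytes"] / stats["total_size_bytes"]
pending = stats["total_repos"] - stats["indexed_repos"]

print(f"index is {ratio:.1%} of source size")  # index is 27.0% of source size
print(f"{pending} repositories not yet indexed")  # 3 repositories not yet indexed
```

Here the index is about 27% of the source size, inside the typical range, and three of the 45 repositories have not been indexed.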

Monitor indexing jobs:

  • Jobs queued
  • Jobs running
  • Jobs completed
  • Average duration
  • Error rate
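
If job records are exported for monitoring, the metrics listed above could be derived along these lines. The record shape (`status` and `duration_s` fields) and the status names are hypothetical; the document does not specify how jobs are represented.

```python
from collections import Counter


def summarize(jobs):
    """Compute queue counts, average duration, and error rate from job records."""
    counts = Counter(job["status"] for job in jobs)
    durations = [job["duration_s"] for job in jobs if job["status"] == "completed"]
    finished = counts["completed"] + counts["failed"]
    return {
        "queued": counts["queued"],
        "running": counts["running"],
        "completed": counts["completed"],
        "avg_duration_s": sum(durations) / len(durations) if durations else 0.0,
        # Error rate is measured over finished jobs only (assumed definition).
        "error_rate": counts["failed"] / finished if finished else 0.0,
    }


jobs = [
    {"status": "queued"},
    {"status": "running"},
    {"status": "completed", "duration_s": 10.0},
    {"status": "completed", "duration_s": 30.0},
    {"status": "failed"},
]
print(summarize(jobs))
```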

If a repository is not being indexed:

  1. Check job status in Web UI
  2. Verify connection is healthy
  3. Check for errors in job logs
  4. Try force re-index

If indexing is slow:

  1. Check repository size
  2. Review file filters
  3. Increase worker concurrency
  4. Check disk I/O

If a file is missing from search results:

  1. Verify repository is indexed
  2. Check file was not excluded
  3. Try force re-index
  4. Check query syntax