Indexing

Name: Code Search
Author: Code Search

This page explains the indexing process in detail.

Overview

Code Search uses Zoekt, a trigram-based search engine, to provide fast code search. The indexing process converts source code into a searchable index.

Trigram Indexing

What are Trigrams?

Trigrams are sequences of three consecutive characters. For example, the word “search” produces these trigrams:

sea
ear
arc
rch

How It Works

Tokenization: Source code is split into trigrams
Inverted Index: Each trigram maps to file locations
Posting Lists: The list of locations for each trigram; every entry stores line/byte offsets
Compression: Index is compressed for efficiency
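
As a rough illustration of the first three steps, the following Python sketch builds a toy in-memory inverted index. It is not Zoekt's implementation: the function names are hypothetical, the offsets are character offsets rather than line/byte offsets, and the compression step is omitted.

```python
from collections import defaultdict


def trigrams(text: str) -> list[str]:
    """Return every run of three consecutive characters in text."""
    return [text[i:i + 3] for i in range(len(text) - 2)]


def build_index(files: dict[str, str]) -> dict[str, list[tuple[str, int]]]:
    """Map each trigram to a posting list of (path, character offset) pairs."""
    index = defaultdict(list)
    for path, content in files.items():
        for offset, tri in enumerate(trigrams(content)):
            index[tri].append((path, offset))
    return dict(index)


# Two tiny "files" that share the trigrams "arc" and "rch".
files = {"a.go": "search", "b.go": "arch"}
index = build_index(files)
```

Here `index["arc"]` holds one entry per file, which is what lets a search narrow the candidate set without reading every file.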

Search Process

When you search for handleRequest:

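The query is split into trigrams: han, and, ndl, dle, leR, eRe, Req, equ, que, ues, est
Each trigram’s posting list is retrieved
Intersecting the posting lists yields candidate files that contain every trigram
Each candidate's content is checked for an actual match, since a file can contain all the trigrams without containing the query string
Matching lines are returned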
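
The steps above can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not Zoekt's code: posting lists are reduced to sets of file paths, matching is case-sensitive substring matching, and all function names are hypothetical.

```python
def trigrams(text: str) -> set[str]:
    """Return the set of three-character runs in text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def build_index(files: dict[str, str]) -> dict[str, set[str]]:
    """Map each trigram to the set of files that contain it."""
    index: dict[str, set[str]] = {}
    for path, content in files.items():
        for tri in trigrams(content):
            index.setdefault(tri, set()).add(path)
    return index


def search(query: str, files: dict[str, str], index: dict[str, set[str]]):
    """Return (path, line number, line) for every line containing query."""
    query_trigrams = trigrams(query)
    if query_trigrams:
        # Intersect the posting lists to get candidate files.
        candidates = set.intersection(
            *(index.get(tri, set()) for tri in query_trigrams)
        )
    else:
        # Queries shorter than three characters cannot use the index.
        candidates = set(files)
    results = []
    for path in sorted(candidates):
        # Verify the actual match; trigram overlap alone is not proof.
        for number, line in enumerate(files[path].splitlines(), start=1):
            if query in line:
                results.append((path, number, line))
    return results


files = {
    "server.go": "func handleRequest(w, r) {\n}\n",
    "notes.md": "handle the request queue and test it\n",
}
index = build_index(files)
matches = search("handleRequest", files, index)
```

In this example `notes.md` is dropped at the intersection step because it lacks trigrams such as `leR`, so only `server.go` is read during verification.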

Indexing Pipeline

Repository Discovery - The scheduler checks each connection for new or updated repositories at the configured poll interval.
Git Clone/Fetch - New repositories are cloned; existing ones are fetched for updates.
File Enumeration - All files in the repository are listed, respecting .gitignore.
File Filtering - Binary files and large files are excluded from indexing.
Language Detection - File language is detected from extension and content.
Content Extraction - File content is read and prepared for indexing.
Trigram Generation - Content is split into trigrams for the inverted index.
Index Building - Zoekt builds the compressed index file.
Index Deployment - New index replaces old, atomically.

File Filtering

Included Files

By default, text files are indexed:

Source code (.go, .py, .js, etc.)
Configuration (.yaml, .json, etc.)
Documentation (.md, .txt, etc.)

Excluded Files

These are excluded by default:

Binary files (executables, images, etc.)
Large files (>1MB by default)
Vendor directories
Build outputs

Custom Filters

Configure file patterns in config.yaml:

indexer:
  include_patterns:
    - "*.go"
    - "*.py"
    - "*.js"
  exclude_patterns:
    - "*_test.go"
    - "vendor/**"
    - "node_modules/**"
  max_file_size: 1048576 # 1 MiB (1024 * 1024 bytes)
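
To make the effect of these settings concrete, here is a minimal Python sketch of a filter decision. It is an illustration only: the function names are hypothetical, the patterns are treated as shell-style globs via `fnmatch`, exclusion is assumed to take precedence over inclusion, and a NUL byte is used as a simple binary-file heuristic. The real indexer may differ on each of these points.

```python
from fnmatch import fnmatchcase

INCLUDE_PATTERNS = ["*.go", "*.py", "*.js"]
EXCLUDE_PATTERNS = ["*_test.go", "vendor/**", "node_modules/**"]
MAX_FILE_SIZE = 1048576  # bytes, the value shown in the config


def looks_binary(sample: bytes) -> bool:
    """Heuristic (an assumption): a NUL byte suggests binary content."""
    return b"\x00" in sample


def should_index(path: str, size: int, sample: bytes = b"") -> bool:
    """Decide whether a file passes the size, binary, and pattern filters."""
    if size > MAX_FILE_SIZE:
        return False
    if looks_binary(sample):
        return False
    # Assumed precedence: an exclude match always wins over an include match.
    if any(fnmatchcase(path, pattern) for pattern in EXCLUDE_PATTERNS):
        return False
    return any(fnmatchcase(path, pattern) for pattern in INCLUDE_PATTERNS)


print(should_index("cmd/main.go", 2048))       # passes every filter
print(should_index("cmd/main_test.go", 2048))  # matches an exclude pattern
```

Note that `fnmatchcase` lets `*` match path separators, so `*.go` matches files in any directory under this sketch.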

Branch Handling

Default Branch

Only the repository's default branch is indexed unless additional branches are configured.

Multiple Branches

Configure additional branches per repository:

connections:
  - name: github
    type: github
    branches:
      - main
      - develop
      - "release/*"

Branch Indexing

Each branch has its own index:

index/
├── myorg_api_main.zoekt
├── myorg_api_develop.zoekt
└── myorg_api_release_v1.zoekt
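
A minimal sketch of how branch patterns such as `release/*` might be expanded and mapped to index file names like those listed above. Both helpers are hypothetical; the glob semantics and the naming rule (slashes replaced by underscores) are assumptions inferred from the examples on this page, not documented behavior.

```python
from fnmatch import fnmatchcase


def select_branches(remote_branches: list[str], patterns: list[str]) -> list[str]:
    """Keep the branches that match at least one configured pattern."""
    return [
        branch
        for branch in remote_branches
        if any(fnmatchcase(branch, pattern) for pattern in patterns)
    ]


def index_filename(repo: str, branch: str) -> str:
    """Assumed naming rule: join repo and branch, turn '/' into '_'."""
    return f"{repo}_{branch}".replace("/", "_") + ".zoekt"


remote = ["main", "develop", "release/v1", "feature/login"]
selected = select_branches(remote, ["main", "develop", "release/*"])
names = [index_filename("myorg/api", branch) for branch in selected]
```

With these inputs `feature/login` is skipped, and the remaining three branches map to the three file names shown in the listing.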

Incremental Updates

Full Index vs Incremental

Full: Complete rebuild of the index
Incremental: Only changed files re-indexed

When Full Index Runs

Initial repository indexing
Major schema changes
Manual force re-index

Incremental Detection

Git fetch to update local clone
Compare current HEAD to indexed commit
List changed files via git diff
Re-index only changed files
Merge updates into existing index
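
The detection steps can be approximated with plain Git commands. The sketch below is an assumption about how such a step could look, not the indexer's actual code: the function names are hypothetical, and only the standard `git diff --name-status` output format is relied on.

```python
import subprocess


def parse_name_status(output: str) -> tuple[list[str], list[str]]:
    """Split `git diff --name-status` output into (re-index, remove) paths."""
    reindex, remove = [], []
    for line in output.splitlines():
        if not line.strip():
            continue
        status, *paths = line.split("\t")
        if status.startswith("D"):
            remove.append(paths[0])
        elif status.startswith(("R", "C")):
            # Renames and copies list the old path, then the new path.
            if status.startswith("R"):
                remove.append(paths[0])
            reindex.append(paths[1])
        else:
            # Added, modified, or type-changed files.
            reindex.append(paths[0])
    return reindex, remove


def changed_files(repo_dir: str, indexed_commit: str, head: str = "HEAD"):
    """Diff the last indexed commit against the current HEAD."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "diff", "--name-status", indexed_commit, head],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_name_status(result.stdout)
```

Deleted and renamed-away paths are returned separately because they must be dropped from the existing index rather than re-indexed. Paths containing unusual characters may be quoted by Git; a production version would likely use the `-z` option and parse NUL-separated output instead.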

Index Storage

Directory Structure

/var/lib/code-search/data/
├── repos/                    # Git repositories
│   ├── github.com/
│   │   └── myorg/
│   │       └── api/
│   │           └── .git/
│   └── gitlab.com/
│       └── ...
└── index/                    # Zoekt indexes
    ├── myorg_api_main_v1.zoekt
    └── ...

Index Files

Each .zoekt file contains:

Trigram inverted index
File metadata
Content for snippet display
Compression metadata

Disk Usage

Typical index size: 10-30% of source code size

Repo Size	Index Size
10 MB	1-3 MB
100 MB	10-30 MB
1 GB	100-300 MB
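
For capacity planning, the 10-30% rule of thumb translates into a simple range calculation. The helper name below is hypothetical, and the result is only as accurate as the rule itself.

```python
def estimate_index_size(source_bytes: int) -> tuple[int, int]:
    """Return the (low, high) index size estimate at 10% and 30% of source size."""
    low = source_bytes * 10 // 100
    high = source_bytes * 30 // 100
    return low, high


MB = 1_000_000
low, high = estimate_index_size(100 * MB)
print(f"{low // MB}-{high // MB} MB")
```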

Performance Optimization

Concurrent Indexing

Multiple repositories can be indexed in parallel:

indexer:
  concurrency: 4 # Number of workers
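
The effect of the `concurrency` setting can be pictured as a bounded worker pool. The sketch below uses Python threads purely to show the bounding behavior; `index_repository` is a hypothetical placeholder, and a real CPU-bound indexer would more likely run each job in a separate process.

```python
from concurrent.futures import ThreadPoolExecutor

CONCURRENCY = 4  # mirrors indexer.concurrency


def index_repository(name: str) -> str:
    """Placeholder for the real clone/fetch and index build work."""
    return f"{name}: indexed"


def index_all(repos: list[str], workers: int = CONCURRENCY) -> list[str]:
    """Index repositories with at most `workers` jobs running at once."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the input list.
        return list(pool.map(index_repository, repos))


results = index_all(["myorg/api", "myorg/web", "myorg/docs"])
```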

Memory Management

Index building is memory-intensive. Configure limits:

indexer:
  max_memory_mb: 4096

CPU Usage

Indexing is CPU-bound. Consider:

Running the indexer on a separate machine
Scheduling indexing during off-hours
Rate limiting large repositories

Monitoring

Index Statistics

Via API:

GET /api/stats/index

{
  "total_repos": 45,
  "indexed_repos": 42,
  "total_files": 156789,
  "total_size_bytes": 4567890123,
  "index_size_bytes": 1234567890,
  "last_full_index": "2024-01-15T02:00:00Z"
}
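
A short sketch of how a monitoring script might summarize this response. It parses the sample payload directly rather than calling the API, and it assumes that `total_size_bytes` is the source size and `index_size_bytes` the on-disk index size; the `summarize` helper is hypothetical.

```python
import json

SAMPLE_RESPONSE = """
{
  "total_repos": 45,
  "indexed_repos": 42,
  "total_files": 156789,
  "total_size_bytes": 4567890123,
  "index_size_bytes": 1234567890,
  "last_full_index": "2024-01-15T02:00:00Z"
}
"""


def summarize(stats: dict) -> dict:
    """Derive coverage and index-to-source ratio from the stats payload."""
    return {
        "unindexed_repos": stats["total_repos"] - stats["indexed_repos"],
        "repo_coverage": stats["indexed_repos"] / stats["total_repos"],
        "index_ratio": stats["index_size_bytes"] / stats["total_size_bytes"],
    }


summary = summarize(json.loads(SAMPLE_RESPONSE))
print(f"{summary['unindexed_repos']} repositories not yet indexed")
print(f"index is {summary['index_ratio']:.0%} of source size")
```

For the sample values the ratio falls inside the typical 10-30% range, and three repositories remain unindexed.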

Job Metrics

Monitor indexing jobs:

Jobs queued
Jobs running
Jobs completed
Average duration
Error rate
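
As an illustration of how these metrics relate to each other, the sketch below computes them from a list of job records. The `Job` structure and status names are hypothetical and are not taken from the Code Search API; error rate is assumed to mean failed jobs divided by all finished jobs.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Job:
    status: str  # assumed values: "queued", "running", "completed", "failed"
    duration_s: float = 0.0


def job_metrics(jobs: list[Job]) -> dict:
    """Summarize queue depth, throughput, duration, and error rate."""
    counts = Counter(job.status for job in jobs)
    finished = [job for job in jobs if job.status in ("completed", "failed")]
    total = len(finished)
    return {
        "queued": counts["queued"],
        "running": counts["running"],
        "completed": counts["completed"],
        "average_duration_s": (
            sum(job.duration_s for job in finished) / total if total else 0.0
        ),
        "error_rate": counts["failed"] / total if total else 0.0,
    }


jobs = [
    Job("queued"),
    Job("running"),
    Job("completed", 10.0),
    Job("completed", 20.0),
    Job("failed", 30.0),
]
metrics = job_metrics(jobs)
```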

Troubleshooting

Index Not Updating

Check the job status in the Web UI
Verify the connection is healthy
Check for errors in the job logs
Force a re-index

Slow Indexing

Check repository size
Review file filters
Increase worker concurrency
Check disk I/O

Search Missing Results

Verify the repository is indexed
Check that the file was not excluded by a filter
Force a re-index
Check the query syntax

Next Steps

Data Flow - Request flows
Components - Component details