Indexer Configuration

The indexer section configures the background indexer worker that clones and indexes repositories.

indexer:
  concurrency: 2
  clone_timeout: "10m"
  index_timeout: "30m"

concurrency - Number of concurrent indexing jobs.

Property      Value
Type          integer
Default       2
Environment   CS_INDEXER_CONCURRENCY

Higher concurrency means faster indexing but more resource usage (CPU, memory, disk I/O).

Recommendations:

  • Small deployments (< 100 repos): 1-2
  • Medium deployments (100-1000 repos): 2-4
  • Large deployments (> 1000 repos): 4-8

clone_timeout - Maximum time allowed for cloning a repository.

Property      Value
Type          duration
Default       "10m"
Environment   CS_INDEXER_CLONE_TIMEOUT

Increase this for large repositories or slow network connections.

Examples:

  • "10m" - 10 minutes (default)
  • "30m" - 30 minutes (for large repos)
  • "1h" - 1 hour (for very large monorepos)
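
Judging by the examples, duration values use Go-style duration strings (an assumption; the doc does not name the syntax). A minimal Python sketch of how such strings map to seconds, for a single integer value plus a unit suffix:

```python
# Convert a Go-style duration string such as "10m" or "1h" to seconds.
# Illustrative only: handles a single integer value plus one unit suffix.
UNITS = {"s": 1, "m": 60, "h": 3600}

def duration_seconds(value: str) -> int:
    number, unit = int(value[:-1]), value[-1]
    if unit not in UNITS:
        raise ValueError(f"unsupported duration unit: {unit!r}")
    return number * UNITS[unit]

print(duration_seconds("10m"))  # 600
print(duration_seconds("30m"))  # 1800
print(duration_seconds("1h"))   # 3600
```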

index_timeout - Maximum time allowed for indexing a repository with Zoekt.

Property      Value
Type          duration
Default       "30m"
Environment   CS_INDEXER_INDEX_TIMEOUT

Increase this for very large repositories.

These settings can also be provided as environment variables:

CS_INDEXER_CONCURRENCY="2"
CS_INDEXER_CLONE_TIMEOUT="10m"
CS_INDEXER_INDEX_TIMEOUT="30m"
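
A common convention is that an environment variable, when set, overrides the config-file default. A Python sketch of that lookup (the variable names come from the tables above; the override order is an assumption):

```python
import os

# Config-file defaults, overridden by environment variables if set.
# The precedence (env wins over file) is an assumption, not documented behavior.
DEFAULTS = {"concurrency": 2, "clone_timeout": "10m", "index_timeout": "30m"}

def load_indexer_config(environ=os.environ) -> dict:
    return {
        "concurrency": int(environ.get("CS_INDEXER_CONCURRENCY",
                                       DEFAULTS["concurrency"])),
        "clone_timeout": environ.get("CS_INDEXER_CLONE_TIMEOUT",
                                     DEFAULTS["clone_timeout"]),
        "index_timeout": environ.get("CS_INDEXER_INDEX_TIMEOUT",
                                     DEFAULTS["index_timeout"]),
    }

print(load_indexer_config({"CS_INDEXER_CONCURRENCY": "4"}))
# {'concurrency': 4, 'clone_timeout': '10m', 'index_timeout': '30m'}
```
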

The indexer processes each job in five steps:

  1. Poll Queue - The indexer polls Redis for pending jobs
  2. Clone Repository - Clone or fetch the repository with Git
  3. Run Zoekt Index - Create the search index
  4. Update Database - Mark the repository as indexed
  5. Notify - Signal Zoekt to reload its indexes

Jobs come in three types:

Type      Description
index     Initial indexing of a new repository
sync      Re-sync of an existing repository (fetch + re-index)
replace   Execution of a search-and-replace operation
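
The pipeline and job-type dispatch above can be sketched as a loop. This is a minimal Python sketch: the queue, job shape, and step bodies are stand-ins for the real Redis, Git, and Zoekt calls:

```python
from collections import deque

# In-memory stand-in for the Redis job queue (the real worker polls Redis).
# Each job carries a type and a repository name.
queue = deque([
    {"type": "index", "repo": "org/app"},
    {"type": "sync", "repo": "org/lib"},
])

def process(job):
    repo = job["repo"]
    if job["type"] in ("index", "sync"):  # both clone/fetch, then re-index
        print(f"clone {repo}")            # step 2: git clone/fetch
        print(f"zoekt-index {repo}")      # step 3: build the search index
        print(f"mark {repo} indexed")     # step 4: update the database
        print("reload zoekt")             # step 5: signal Zoekt
    elif job["type"] == "replace":
        print(f"search-and-replace in {repo}")

while queue:                              # step 1: poll the queue
    process(queue.popleft())
```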

There are three approaches to scaling indexing:

Run multiple indexer instances sharing the same storage:

# Docker Compose
docker compose up -d --scale indexer=4
# Kubernetes (requires ReadWriteMany PVC)
kubectl scale deployment code-search-indexer --replicas=4

Each indexer processes jobs independently from the Redis queue. All workers share the same index and repos directories.

Requirements: ReadWriteMany (RWX) storage class (NFS, CephFS, EFS, Azure Files)

For very large deployments or when RWX storage isn’t available, use hash-based sharding:

sharding:
  enabled: true
  total_shards: 3
  federated_access: true

Each shard:

  • Has its own PersistentVolume (ReadWriteOnce)
  • Processes only repositories assigned to it via consistent hashing
  • Runs its own Zoekt instance
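
The hash-based assignment can be sketched as a modulo over a stable hash. The hash function and key are assumptions; a true consistent-hashing ring additionally minimizes reassignment when total_shards changes:

```python
import zlib

def shard_for(repo: str, total_shards: int) -> int:
    """Deterministically map a repository to a shard (illustrative scheme)."""
    return zlib.crc32(repo.encode()) % total_shards

# Every indexer computes the same assignment, so each repository
# is cloned and indexed by exactly one shard.
print(shard_for("org/app", total_shards=3))
```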

See Sharding Configuration for details.

For smaller deployments (< 1000 repos), a single indexer handles everything:

  • Simpler to operate
  • No shared storage requirements
  • Scale vertically with more CPU/memory

Indexing is CPU-intensive. Each concurrent job uses approximately 1 CPU core.

Memory usage depends on repository size:

  • Small repos (< 100 MB): ~512 MB per job
  • Medium repos (100 MB - 1 GB): ~1 GB per job
  • Large repos (> 1 GB): ~2-4 GB per job
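
Taken together with concurrency, these figures give a rough worst-case budget: concurrency times per-job usage. A quick sketch, with size classes mirroring the list above:

```python
# Approximate per-job memory by repository size class, in GB (from above).
PER_JOB_GB = {"small": 0.5, "medium": 1, "large": 4}

def memory_budget_gb(concurrency: int, repo_class: str) -> float:
    """Worst case: every concurrent slot busy with a repo of this class."""
    return concurrency * PER_JOB_GB[repo_class]

print(memory_budget_gb(4, "large"))  # 16
```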

Indexing is disk I/O intensive. Use SSDs for best performance.

Initial clones download the full repository. Subsequent syncs only fetch changes.

The indexer uses these Git settings:

# Shallow clone for initial index (faster)
git clone --depth 1 --single-branch
# Full history for sync operations
git fetch --all

Git authentication is handled via the connection’s access token. The token is used for HTTPS cloning:

https://oauth2:{token}@github.com/org/repo.git
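
Assembling that URL can be sketched as follows. The helper name is hypothetical; only the oauth2:{token} pattern comes from the doc:

```python
def https_clone_url(host: str, org: str, repo: str, token: str) -> str:
    """Embed the connection's access token for HTTPS cloning (illustrative)."""
    return f"https://oauth2:{token}@{host}/{org}/{repo}.git"

print(https_clone_url("github.com", "org", "repo", "s3cret"))
# https://oauth2:s3cret@github.com/org/repo.git
```

Note that tokens embedded in URLs can end up in logs or shell history; production setups often prefer Git credential helpers for this reason.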

By default, only the default branch (usually main or master) is indexed. Zoekt supports indexing multiple branches using the -branches flag.

When a repository is indexed, the indexer runs:

zoekt-git-index -index /data/index -branches main,develop /data/repos/myrepo

This creates searchable indexes for each specified branch.

Use the branch: filter to search specific branches:

FOO branch:develop

Omitting the branch: filter searches the default branch (HEAD).

  • Tags are not currently indexed (only branches)
  • Multi-branch indexing requires manual configuration per repository
  • The default behavior indexes only the default branch for efficiency

Planned improvements include:

  • Per-repository branch configuration
  • Tag indexing support
  • Automatic branch discovery based on patterns

If a job fails with:

job failed: clone timeout after 10m

Increase clone_timeout for large repositories:

indexer:
  clone_timeout: "30m"

If a job fails with:

job failed: index timeout after 30m

Increase index_timeout for very large repositories:

indexer:
  index_timeout: "1h"

If the indexer is killed by OOM:

  • Reduce concurrency
  • Increase container/pod memory limits
  • Exclude very large repositories

The indexer needs space for:

  • Git clones (repos_dir)
  • Zoekt indexes (index_dir)
  • Temporary files during indexing

Monitor disk usage and increase storage as needed.

If jobs are stuck:

  1. Check indexer logs for errors
  2. Restart the indexer: docker compose restart indexer
  3. Failed jobs will be retried