Indexer Configuration

The indexer section configures the background indexer worker that clones and indexes repositories.

indexer:
  concurrency: 2
  clone_timeout: "10m"
  index_timeout: "30m"

concurrency - Number of concurrent indexing jobs.

Property      Value
Type          integer
Default       2
Environment   CS_INDEXER_CONCURRENCY

Higher concurrency means faster indexing but more resource usage (CPU, memory, disk I/O).

Recommendations:

  • Small deployments (< 100 repos): 1-2
  • Medium deployments (100-1000 repos): 2-4
  • Large deployments (> 1000 repos): 4-8

clone_timeout - Maximum time allowed for cloning a repository.

Property      Value
Type          duration
Default       "10m"
Environment   CS_INDEXER_CLONE_TIMEOUT

Increase this for large repositories or slow network connections.

Examples:

  • "10m" - 10 minutes (default)
  • "30m" - 30 minutes (for large repos)
  • "1h" - 1 hour (for very large monorepos)
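
Judging by the examples, duration values use Go-style duration strings (an assumption; the doc does not name the syntax). A minimal Python sketch of how such strings map to seconds, for a single integer value plus a unit suffix:

```python
# Convert a Go-style duration string such as "10m" or "1h" to seconds.
# Illustrative only: handles a single integer value plus one unit suffix.
UNITS = {"s": 1, "m": 60, "h": 3600}

def duration_seconds(value: str) -> int:
    number, unit = int(value[:-1]), value[-1]
    if unit not in UNITS:
        raise ValueError(f"unsupported duration unit: {unit!r}")
    return number * UNITS[unit]

print(duration_seconds("10m"))  # 600
print(duration_seconds("30m"))  # 1800
print(duration_seconds("1h"))   # 3600
```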

index_timeout - Maximum time allowed for indexing a repository with Zoekt.

Property      Value
Type          duration
Default       "30m"
Environment   CS_INDEXER_INDEX_TIMEOUT

Increase this for very large repositories.

These settings can also be provided as environment variables:

CS_INDEXER_CONCURRENCY="2"
CS_INDEXER_CLONE_TIMEOUT="10m"
CS_INDEXER_INDEX_TIMEOUT="30m"
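
A common convention is that an environment variable, when set, overrides the config-file default. A Python sketch of that lookup (the variable names come from the tables above; the override order is an assumption):

```python
import os

# Config-file defaults, overridden by environment variables if set.
# The precedence (env wins over file) is an assumption, not documented behavior.
DEFAULTS = {"concurrency": 2, "clone_timeout": "10m", "index_timeout": "30m"}

def load_indexer_config(environ=os.environ) -> dict:
    return {
        "concurrency": int(environ.get("CS_INDEXER_CONCURRENCY",
                                       DEFAULTS["concurrency"])),
        "clone_timeout": environ.get("CS_INDEXER_CLONE_TIMEOUT",
                                     DEFAULTS["clone_timeout"]),
        "index_timeout": environ.get("CS_INDEXER_INDEX_TIMEOUT",
                                     DEFAULTS["index_timeout"]),
    }

print(load_indexer_config({"CS_INDEXER_CONCURRENCY": "4"}))
# {'concurrency': 4, 'clone_timeout': '10m', 'index_timeout': '30m'}
```
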

The indexer processes each job in five steps:

  1. Poll Queue - The indexer polls Redis for pending jobs
  2. Clone Repository - Clone or fetch the repository with Git
  3. Run Zoekt Index - Create the search index
  4. Update Database - Mark the repository as indexed
  5. Notify - Signal Zoekt to reload its indexes

Jobs come in three types:

Type      Description
index     Initial indexing of a new repository
sync      Re-sync of an existing repository (fetch + re-index)
replace   Execution of a search-and-replace operation
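
The pipeline and job-type dispatch above can be sketched as a loop. This is a minimal Python sketch: the queue, job shape, and step bodies are stand-ins for the real Redis, Git, and Zoekt calls:

```python
from collections import deque

# In-memory stand-in for the Redis job queue (the real worker polls Redis).
# Each job carries a type and a repository name.
queue = deque([
    {"type": "index", "repo": "org/app"},
    {"type": "sync", "repo": "org/lib"},
])

def process(job):
    repo = job["repo"]
    if job["type"] in ("index", "sync"):  # both clone/fetch, then re-index
        print(f"clone {repo}")            # step 2: git clone/fetch
        print(f"zoekt-index {repo}")      # step 3: build the search index
        print(f"mark {repo} indexed")     # step 4: update the database
        print("reload zoekt")             # step 5: signal Zoekt
    elif job["type"] == "replace":
        print(f"search-and-replace in {repo}")

while queue:                              # step 1: poll the queue
    process(queue.popleft())
```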

There are three approaches to scaling indexing:

Run multiple indexer instances sharing the same storage:

# Docker Compose
docker compose up -d --scale indexer=4
# Kubernetes (requires ReadWriteMany PVC)
kubectl scale deployment code-search-indexer --replicas=4

Each indexer processes jobs independently from the Redis queue. All workers share the same index and repos directories.

Requirements: ReadWriteMany (RWX) storage class (NFS, CephFS, EFS, Azure Files)

For very large deployments or when RWX storage isn’t available, use hash-based sharding:

sharding:
  enabled: true
  total_shards: 3
  federated_access: true

Each shard:

  • Has its own PersistentVolume (ReadWriteOnce)
  • Processes only repositories assigned to it via consistent hashing
  • Runs its own Zoekt instance
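
The hash-based assignment can be sketched as a modulo over a stable hash. The hash function and key are assumptions; a true consistent-hashing ring additionally minimizes reassignment when total_shards changes:

```python
import zlib

def shard_for(repo: str, total_shards: int) -> int:
    """Deterministically map a repository to a shard (illustrative scheme)."""
    return zlib.crc32(repo.encode()) % total_shards

# Every indexer computes the same assignment, so each repository
# is cloned and indexed by exactly one shard.
print(shard_for("org/app", total_shards=3))
```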

See Sharding Configuration for details.

For smaller deployments (< 1000 repos), a single indexer handles everything:

  • Simpler to operate
  • No shared storage requirements
  • Scale vertically with more CPU/memory

Indexing is CPU-intensive. Each concurrent job uses approximately 1 CPU core.

Memory usage depends on repository size:

  • Small repos (< 100 MB): ~512 MB per job
  • Medium repos (100 MB - 1 GB): ~1 GB per job
  • Large repos (> 1 GB): ~2-4 GB per job
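
Taken together with concurrency, these figures give a rough worst-case budget: concurrency times per-job usage. A quick sketch, with size classes mirroring the list above:

```python
# Approximate per-job memory by repository size class, in GB (from above).
PER_JOB_GB = {"small": 0.5, "medium": 1, "large": 4}

def memory_budget_gb(concurrency: int, repo_class: str) -> float:
    """Worst case: every concurrent slot busy with a repo of this class."""
    return concurrency * PER_JOB_GB[repo_class]

print(memory_budget_gb(4, "large"))  # 16
```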

Indexing is disk I/O intensive. Use SSDs for best performance.

Initial clones download the full repository. Subsequent syncs only fetch changes.

The indexer uses these Git settings:

# Shallow clone for initial index (faster)
git clone --depth 1 --single-branch
# Full history for sync operations
git fetch --all

Git authentication is handled via the connection’s access token. The token is used for HTTPS cloning:

https://oauth2:{token}@github.com/org/repo.git
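
Assembling that URL can be sketched as follows. The helper name is hypothetical; only the oauth2:{token} pattern comes from the doc:

```python
def https_clone_url(host: str, org: str, repo: str, token: str) -> str:
    """Embed the connection's access token for HTTPS cloning (illustrative)."""
    return f"https://oauth2:{token}@{host}/{org}/{repo}.git"

print(https_clone_url("github.com", "org", "repo", "s3cret"))
# https://oauth2:s3cret@github.com/org/repo.git
```

Note that tokens embedded in URLs can end up in logs or shell history; production setups often prefer Git credential helpers for this reason.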

By default, only the default branch (usually main or master) is indexed. Zoekt supports indexing multiple branches using the -branches flag.

When a repository is indexed, the indexer runs:

zoekt-git-index -index /data/index -branches main,develop /data/repos/myrepo

This creates searchable indexes for each specified branch.

Use the branch: filter to search specific branches:

FOO branch:develop

Omitting the branch: filter searches the default branch (HEAD).

  • Tags are not currently indexed (only branches)
  • Multi-branch indexing requires manual configuration per repository
  • The default behavior indexes only the default branch for efficiency

Planned improvements include:

  • Per-repository branch configuration
  • Tag indexing support
  • Automatic branch discovery based on patterns

If a job fails with:

job failed: clone timeout after 10m

Increase clone_timeout for large repositories:

indexer:
  clone_timeout: "30m"

If a job fails with:

job failed: index timeout after 30m

Increase index_timeout for very large repositories:

indexer:
  index_timeout: "1h"

If the indexer is killed by OOM:

  • Reduce concurrency
  • Increase container/pod memory limits
  • Exclude very large repositories

The indexer needs space for:

  • Git clones (repos_dir)
  • Zoekt indexes (index_dir)
  • Temporary files during indexing

Monitor disk usage and increase storage as needed.

If jobs are stuck:

  1. Check indexer logs for errors
  2. Restart the indexer: docker compose restart indexer
  3. Failed jobs will be retried