Indexer Configuration

The indexer section configures the background indexer worker that clones and indexes repositories.

indexer:
  concurrency: 2
  index_path: "./data/index"
  repos_path: "./data/repos"
  reindex_interval: 1h
  zoekt_bin: "zoekt-git-index"
  ctags_bin: "ctags"
  require_ctags: true
  # index_all_branches: false
  # index_timeout: 0
  # max_repo_size_mb: 0

concurrency

Number of concurrent indexing jobs.

Property     Value
Type         integer
Default      2
Environment  CS_INDEXER_CONCURRENCY

Higher concurrency means faster indexing but more resource usage (CPU, memory, disk I/O).

Recommendations:

  • Small deployments (< 100 repos): 1-2
  • Medium deployments (100-1000 repos): 2-4
  • Large deployments (> 1000 repos): 4-8
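The tiers above can be captured in a small capacity-planning helper. This is a minimal sketch: the function name `recommended_concurrency` is hypothetical and not part of the indexer, and the handling of exactly 100 and 1000 repositories is an assumption, since the recommendations do not say which tier those boundaries belong to.

```python
def recommended_concurrency(repo_count: int) -> tuple[int, int]:
    """Return the (min, max) concurrency range suggested for a deployment size."""
    if repo_count < 0:
        raise ValueError("repo_count must be non-negative")
    if repo_count < 100:       # small deployment
        return (1, 2)
    if repo_count <= 1000:     # medium deployment (boundary handling assumed)
        return (2, 4)
    return (4, 8)              # large deployment
```

Start at the low end of the returned range and raise the value only while CPU, memory and disk I/O have headroom.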

index_path

Path to the directory where Zoekt search index shards are stored.

Property     Value
Type         string
Default      "./data/index"
Environment  CS_INDEXER_INDEX_PATH

repos_path

Path to the directory where cloned repositories are stored.

Property     Value
Type         string
Default      "./data/repos"
Environment  CS_INDEXER_REPOS_PATH

reindex_interval

How often the indexer re-indexes repositories. Works in conjunction with the scheduler.

Property     Value
Type         duration
Default      1h
Environment  CS_INDEXER_REINDEX_INTERVAL

zoekt_bin

Path to the Zoekt indexer binary.

Property     Value
Type         string
Default      "zoekt-git-index"
Environment  CS_INDEXER_ZOEKT_BIN

ctags_bin

Path to the universal-ctags binary for symbol indexing.

Property     Value
Type         string
Default      "ctags"
Environment  CS_INDEXER_CTAGS_BIN

require_ctags

If true, a ctags failure fails the indexing job. If false, indexing continues without symbol data.

Property     Value
Type         boolean
Default      true
Environment  CS_INDEXER_REQUIRE_CTAGS

index_all_branches

When true, index all branches (not just the default branch). Increases storage and index time.

Property     Value
Type         boolean
Default      false
Environment  YAML only

index_timeout

Timeout for zoekt-git-index operations. Set to 0 for no timeout (infinite).

Property     Value
Type         duration
Default      0 (no timeout)
Environment  YAML only

For large repos (millions of lines), indexing can take hours. The default of 0 allows unlimited time.

Examples:

  • 0 - No timeout (default)
  • "30m" - 30 minutes
  • "2h" - 2 hours (for large monorepos)
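To make the duration values concrete, the sketch below converts them to seconds and maps 0 to "wait forever". It assumes the config accepts Go-style duration strings built from `s`, `m` and `h` units, as the examples suggest. `parse_duration` and `subprocess_timeout` are hypothetical helper names, not the indexer's own code.

```python
import re

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}
_TOKEN = re.compile(r"(\d+)([smh])")

def parse_duration(value) -> int:
    """Convert 0, "30m", "2h" or "1h30m" to seconds; 0 means no timeout."""
    if value in (0, "0"):
        return 0
    text = str(value)
    tokens = _TOKEN.findall(text)
    # Reject input with characters that no number+unit token accounts for.
    if not tokens or "".join(n + u for n, u in tokens) != text:
        raise ValueError(f"invalid duration: {value!r}")
    return sum(int(n) * _UNIT_SECONDS[u] for n, u in tokens)

def subprocess_timeout(value):
    """Map a config value to subprocess.run's timeout argument (None waits forever)."""
    seconds = parse_duration(value)
    return seconds if seconds > 0 else None
```

Returning `None` for 0 matches the convention of Python's `subprocess.run`, where a `None` timeout never expires.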

max_repo_size_mb

Skip indexing repositories larger than this size in MB. Set to 0 for no limit.

Property     Value
Type         integer
Default      0 (no limit)
Environment  YAML only

Useful to avoid indexing extremely large monorepos that would consume too much memory or time.

Example: 10000 (skip repos larger than 10 GB)
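The size check can be sketched as follows. This is an illustration under assumptions: it measures the on-disk size of a cloned directory, and it treats 1 MB as 1,000,000 bytes so that 10000 lines up with the "10 GB" example. How the indexer actually measures repository size is not specified here, and both function names are hypothetical.

```python
import os

BYTES_PER_MB = 1_000_000  # assumption: decimal megabytes

def directory_size_bytes(path: str) -> int:
    """Sum the sizes of all regular files under path, ignoring symlinks."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total

def should_skip(size_bytes: int, max_repo_size_mb: int) -> bool:
    """Return True when the repository exceeds the limit; 0 disables the limit."""
    if max_repo_size_mb <= 0:
        return False
    return size_bytes > max_repo_size_mb * BYTES_PER_MB
```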

CS_INDEXER_CONCURRENCY="2"
CS_INDEXER_INDEX_PATH="./data/index"
CS_INDEXER_REPOS_PATH="./data/repos"
CS_INDEXER_REINDEX_INTERVAL="1h"
CS_INDEXER_ZOEKT_BIN="zoekt-git-index"
CS_INDEXER_CTAGS_BIN="ctags"
CS_INDEXER_REQUIRE_CTAGS="true"

Note: index_timeout, max_repo_size_mb, and index_all_branches are configurable via config.yaml only (no environment variable bindings).
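The mapping between option names and `CS_INDEXER_*` variables can be illustrated with a small loader. This is a sketch only: it assumes environment variables take precedence over the documented defaults, and the accepted boolean spellings are an assumption. `load_indexer_config` is a hypothetical name.

```python
import os

# Defaults as documented above; YAML-only options are intentionally absent.
DEFAULTS = {
    "concurrency": 2,
    "index_path": "./data/index",
    "repos_path": "./data/repos",
    "reindex_interval": "1h",
    "zoekt_bin": "zoekt-git-index",
    "ctags_bin": "ctags",
    "require_ctags": True,
}

def load_indexer_config(env=None) -> dict:
    """Overlay CS_INDEXER_* variables on the documented defaults."""
    env = os.environ if env is None else env
    config = dict(DEFAULTS)
    for key, default in DEFAULTS.items():
        raw = env.get(f"CS_INDEXER_{key.upper()}")
        if raw is None:
            continue
        if isinstance(default, bool):      # check bool first: bool is a subclass of int
            config[key] = raw.strip().lower() in ("1", "true", "yes")
        elif isinstance(default, int):
            config[key] = int(raw)
        else:
            config[key] = raw
    return config
```

Passing a plain dict as `env` makes the overlay easy to exercise without touching the real process environment.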

  1. Poll Queue - Indexer polls Redis for pending jobs
  2. Clone Repository - Git clone/fetch the repository
  3. Run Zoekt Index - Create search index
  4. Update Database - Mark repository as indexed
  5. Notify - Signal Zoekt to reload indexes

Type     Description
index    Initial indexing of a new repository
sync     Re-sync an existing repository (fetch + re-index)
replace  Execute a search-and-replace operation
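The five steps and the job types can be pictured as a dispatch loop. The sketch below is a toy model, not the indexer's implementation: a `deque` stands in for the Redis queue, each step only appends a log line, the `replace` job type is omitted, and every name in it is hypothetical.

```python
from collections import deque

def run_index(job, log):
    log.append(f"clone {job['repo']}")          # 2. clone repository
    log.append(f"zoekt index {job['repo']}")    # 3. run Zoekt index
    log.append(f"mark indexed {job['repo']}")   # 4. update database
    log.append("notify zoekt reload")           # 5. notify

def run_sync(job, log):
    log.append(f"fetch {job['repo']}")          # sync fetches instead of cloning
    log.append(f"zoekt index {job['repo']}")
    log.append(f"mark indexed {job['repo']}")
    log.append("notify zoekt reload")

HANDLERS = {"index": run_index, "sync": run_sync}

def process_next_job(queue: deque, handlers: dict, log: list) -> bool:
    """Pop one pending job and dispatch it by type. Returns False when idle."""
    if not queue:                               # 1. poll queue
        return False
    job = queue.popleft()
    handler = handlers.get(job["type"])
    if handler is None:
        log.append(f"unknown job type: {job['type']}")
    else:
        handler(job, log)
    return True
```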

There are three approaches to scale indexing:

Run multiple indexer instances sharing the same storage:

# Docker Compose
docker compose up -d --scale indexer=4
# Kubernetes (requires ReadWriteMany PVC)
kubectl scale deployment code-search-indexer --replicas=4

Each indexer processes jobs independently from the Redis queue. All workers share the same index and repos directories.

Requirements: ReadWriteMany (RWX) storage class (NFS, CephFS, EFS, Azure Files)

For very large deployments or when RWX storage isn’t available, use hash-based sharding:

sharding:
  enabled: true
  total_shards: 3
  federated_access: true

Each shard:

  • Has its own PersistentVolume (ReadWriteOnce)
  • Processes only repositories assigned to it via consistent hashing
  • Runs its own Zoekt instance

See Sharding Configuration for details.
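To illustrate how hash-based assignment keeps each repository on exactly one shard, here is a minimal sketch using a stable hash modulo the shard count. It is a simplification and not a true consistent-hash ring, so changing `total_shards` reassigns most repositories. The actual algorithm is described in Sharding Configuration and may differ. `shard_for_repo` is a hypothetical name.

```python
import hashlib

def shard_for_repo(repo_name: str, total_shards: int) -> int:
    """Map a repository name to a shard index in [0, total_shards)."""
    if total_shards < 1:
        raise ValueError("total_shards must be at least 1")
    # SHA-256 is stable across processes, unlike Python's built-in hash().
    digest = hashlib.sha256(repo_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % total_shards
```

Each indexer would then process a job only when `shard_for_repo(name, total_shards)` equals its own shard index.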

For smaller deployments (< 1000 repos), a single indexer handles everything:

  • Simpler to operate
  • No shared storage requirements
  • Scale vertically with more CPU/memory

Indexing is CPU-intensive. Each concurrent job uses approximately 1 CPU core.

Memory usage depends on repository size:

  • Small repos (< 100 MB): ~512 MB per job
  • Medium repos (100 MB - 1 GB): ~1 GB per job
  • Large repos (> 1 GB): ~2-4 GB per job
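These rules of thumb can be combined into a rough sizing estimate. The sketch below takes a worst-case view: every concurrent job is assumed to be working on the largest repository, and the upper end of each memory range is used. Both function names are hypothetical.

```python
def job_memory_mb(repo_size_mb: float) -> int:
    """Approximate per-job memory, using the upper end of each documented range."""
    if repo_size_mb < 100:
        return 512
    if repo_size_mb <= 1024:
        return 1024
    return 4096  # upper end of the ~2-4 GB range

def estimate_resources(concurrency: int, largest_repo_mb: float) -> dict:
    """Worst-case CPU and memory for the given number of concurrent jobs."""
    return {
        "cpu_cores": concurrency,  # roughly one core per job
        "memory_mb": concurrency * job_memory_mb(largest_repo_mb),
    }
```

For example, `estimate_resources(4, 500)` suggests about 4 cores and 4096 MB for four concurrent jobs on repositories of up to 500 MB.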

Indexing is disk I/O intensive. Use SSDs for best performance.

Initial clones download the full repository. Subsequent syncs only fetch changes.

The indexer uses these Git settings:

# Shallow clone for initial index (faster)
git clone --depth 1 --single-branch
# Full history for sync operations
git fetch --all

Git authentication is handled via the connection’s access token. The token is used for HTTPS cloning:

https://oauth2:{token}@github.com/org/repo.git
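A clone URL of this shape can be built safely with the standard library. The sketch below percent-encodes the token so that characters such as `/` or `@` cannot corrupt the URL, and adds a helper that masks the token before a URL is logged. Both function names are hypothetical and this is not the indexer's own code.

```python
from urllib.parse import quote, urlsplit, urlunsplit

def authenticated_url(repo_url: str, token: str) -> str:
    """Insert oauth2:<token> credentials into an HTTPS repository URL."""
    parts = urlsplit(repo_url)
    if parts.scheme != "https" or not parts.hostname:
        raise ValueError("expected an https:// repository URL")
    host = parts.hostname
    if parts.port:
        host = f"{host}:{parts.port}"
    netloc = f"oauth2:{quote(token, safe='')}@{host}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))

def redact(url: str) -> str:
    """Replace the password part of a URL with *** for safe logging."""
    parts = urlsplit(url)
    if parts.password is None:
        return url
    host = parts.hostname or ""
    if parts.port:
        host = f"{host}:{parts.port}"
    netloc = f"{parts.username}:***@{host}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))
```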

By default, only the default branch (usually main or master) is indexed. Zoekt supports indexing multiple branches using the -branches flag.

When a repository is indexed, the indexer runs:

zoekt-git-index -index /data/index -branches main,develop /data/repos/myrepo

This creates searchable indexes for each specified branch.
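The command line above can be assembled programmatically as an argument list, which avoids shell quoting problems. This is a minimal sketch using only the `-index` and `-branches` flags shown above. `build_index_command` is a hypothetical helper.

```python
def build_index_command(zoekt_bin, index_path, repo_dir, branches=None):
    """Return the argv list for one zoekt-git-index run."""
    cmd = [zoekt_bin, "-index", index_path]
    if branches:
        # Zoekt takes the branch list as a single comma-separated value.
        cmd += ["-branches", ",".join(branches)]
    cmd.append(repo_dir)
    return cmd
```

The resulting list can be passed to `subprocess.run`, optionally with a `timeout` derived from `index_timeout`.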

Use the branch: filter to search specific branches:

FOO branch:develop

Omitting the branch: filter searches the default branch (HEAD).

  • Tags are not currently indexed (only branches)
  • Multi-branch indexing requires manual configuration per repository
  • The default behavior indexes only the default branch for efficiency

Planned improvements include:

  • Per-repository branch configuration
  • Tag indexing support
  • Automatic branch discovery based on patterns

The indexer can automatically run SCIP code intelligence indexing after Zoekt search indexing completes. This provides precise go-to-definition and find-references without manual API calls.

scip:
  enabled: false     # Master switch
  auto_index: true   # Auto-index after Zoekt indexing
  timeout: 10m       # Aggregate timeout for all SCIP indexing per repo
  # work_dir: ""     # Temp directory for checkouts (default: system temp)
  # cache_dir: ""    # SCIP database directory (default: <repos_path>/../scip)

Option      Type      Default  Environment         Description
enabled     boolean   false    CS_SCIP_ENABLED     Enable SCIP indexing
auto_index  boolean   true     CS_SCIP_AUTO_INDEX  Auto-index after Zoekt indexing
timeout     duration  10m      CS_SCIP_TIMEOUT     Aggregate timeout for all SCIP indexing per repo (across all languages)
work_dir    string    ""       CS_SCIP_WORK_DIR    Working directory for temporary checkouts
cache_dir   string    ""       CS_SCIP_CACHE_DIR   Directory for SCIP SQLite databases

Languages are grouped into two tiers:

Standalone (auto-enabled when scip.enabled=true and binary is in PATH):

  • Go (scip-go)
  • TypeScript/JavaScript (scip-typescript via npx)
  • Python (scip-python)

Build-dependent (require explicit opt-in):

  • Java (scip-java)
  • Rust (rust-analyzer)
  • PHP (scip-php — per-project Composer dependency)

Override defaults for specific languages:

scip:
  enabled: true
  languages:
    go:
      enabled: true # Standalone: auto-enabled
    java:
      enabled: true # Build-dependent: explicit opt-in
      binary_path: "/usr/local/bin/scip-java" # Optional: custom binary path
    rust:
      enabled: false # Disabled (default for build-dependent)
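The interaction between the two tiers and per-language overrides can be expressed as a small decision function. This is a sketch of the rules as described on this page, not the indexer's code. The short language keys (`go`, `typescript`, and so on) are assumptions, and binary availability is passed in as a flag rather than looked up in PATH.

```python
STANDALONE = {"go", "typescript", "python"}
BUILD_DEPENDENT = {"java", "rust", "php"}

def language_enabled(language, scip_enabled, overrides, binary_available):
    """Decide whether SCIP indexing runs for one language."""
    if not scip_enabled or not binary_available:
        return False
    explicit = overrides.get(language, {}).get("enabled")
    if explicit is not None:           # an explicit setting always wins
        return bool(explicit)
    return language in STANDALONE      # build-dependent languages need opt-in
```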

The default indexer image does not include SCIP binaries. Use the SCIP-enabled image:

# docker-compose.yml
indexer:
  build:
    dockerfile: docker/indexer-scip.Dockerfile
  environment:
    CS_SCIP_ENABLED: "true"

The indexer-scip.Dockerfile includes scip-go, scip-typescript (via Node.js/npx), and scip-python.

  1. After Zoekt search indexing completes successfully, the indexer checks if SCIP is enabled
  2. It detects all languages present in the repository from marker files (go.mod, package.json, Cargo.toml, etc.) using git ls-tree — no full checkout needed at this stage. For monorepos, marker files in subdirectories (up to 3 levels deep) are also detected.
  3. For each detected language that is enabled and has an available indexer binary, it discovers project directories and runs SCIP indexing. Results from all languages are merged into a single index per repository.
  4. The SCIP index is stored in a per-repo SQLite database with file paths correctly mapped to the full repository path
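The marker-file detection in step 2 can be sketched over a flat list of repository paths, such as the output of `git ls-tree -r --name-only HEAD`. Only the three marker files named above are included, and the remaining markers are left out rather than guessed. Reading "3 levels deep" as at most three directory components is an assumption, as are the function name and the label `typescript` for `package.json`.

```python
# Marker files named in the text; the full list used by the indexer is longer.
MARKERS = {
    "go.mod": "go",
    "package.json": "typescript",
    "Cargo.toml": "rust",
}
MAX_DEPTH = 3  # assumption: at most three directory components above the marker

def detect_languages(paths, max_depth=MAX_DEPTH):
    """Return the set of languages whose marker files appear within max_depth."""
    found = set()
    for path in paths:
        *dirs, filename = path.split("/")
        if len(dirs) <= max_depth and filename in MARKERS:
            found.add(MARKERS[filename])
    return found
```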

SCIP indexing failure is non-fatal — it never fails the Zoekt index job. Errors are logged as warnings.

Repositories containing multiple languages (e.g., a Go backend with a TypeScript frontend) are fully supported. The indexer detects and indexes all languages, merging results into a single SCIP index.

Project discovery behavior depends on the language:

  • Independent projects (Go, PHP): Each marker file (e.g., go.mod) represents a separate project. All directories with markers are discovered and indexed independently, even if the root also has one. This is correct for Go modules, where nested go.mod files are distinct modules.
  • Workspace-aware (TypeScript, JavaScript, Java, Rust, Python): A root-level marker means the tool handles the entire tree (e.g., TypeScript workspaces, Cargo workspaces, Maven multi-module). Subdirectories are only searched when no root marker exists.
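The two discovery modes can be summarized in a few lines. In this sketch `marker_dirs` is the list of directories that contain the language's marker file, with the empty string standing for the repository root. The function name and the language keys are hypothetical.

```python
INDEPENDENT = {"go", "php"}

def discover_projects(language, marker_dirs):
    """Return the directories to index for one language ("" is the repo root)."""
    dirs = sorted(set(marker_dirs))
    if language in INDEPENDENT:
        return dirs          # every marker is its own project
    if "" in dirs:
        return [""]          # a root marker covers the whole tree
    return dirs              # no root marker: fall back to subdirectories
```

For a Go repository with `go.mod` at the root and in `tools/`, both directories are returned. For a TypeScript repository with a root `package.json`, only the root is.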

The scip.timeout config sets an aggregate timeout for all SCIP indexing within a single repository — across all detected languages and all project directories. Each individual indexer invocation also has its own per-execution timeout (scip.indexer_timeout internally), but the outer aggregate timeout caps the total time spent on SCIP indexing for one repo.
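One common way to combine an aggregate deadline with a per-execution timeout is to give each invocation the smaller of its own limit and the time left on the deadline. The sketch below shows that pattern. Whether the indexer combines the two timeouts exactly this way is an assumption, and `invocation_timeout` is a hypothetical name.

```python
import time

def invocation_timeout(deadline, per_exec_timeout, now=None):
    """Seconds one indexer run may take; 0.0 means the aggregate budget is spent.

    deadline is a time.monotonic() value computed once per repository as
    start + scip.timeout.
    """
    now = time.monotonic() if now is None else now
    remaining = deadline - now
    if remaining <= 0:
        return 0.0
    return min(per_exec_timeout, remaining)
```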

job failed: clone timeout after 10m

Clone timeouts during replace operations can be adjusted via the replace config:

replace:
  clone_timeout: "30m"

job failed: index timeout

Increase index_timeout for very large repositories:

indexer:
  index_timeout: "2h"

If the indexer is killed by OOM:

  • Reduce concurrency
  • Increase container/pod memory limits
  • Exclude very large repositories

The indexer needs space for:

  • Git clones (repos_path)
  • Zoekt indexes (index_path)
  • Temporary files during indexing

Monitor disk usage and increase storage as needed.
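A basic disk check for the indexer's directories can be written with the standard library. This is a minimal sketch: the 85% threshold is an arbitrary illustration and both function names are hypothetical.

```python
import shutil

def disk_usage_percent(path: str) -> float:
    """Percentage of the filesystem holding path that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total

def paths_over_threshold(paths, threshold_percent=85.0):
    """Return the paths whose filesystem usage is at or above the threshold."""
    return [p for p in paths if disk_usage_percent(p) >= threshold_percent]
```

Calling `paths_over_threshold(["./data/repos", "./data/index"])` from a scheduled job gives early warning before indexing fails for lack of space.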

If jobs are stuck:

  1. Check indexer logs for errors
  2. Restart the indexer: docker compose restart indexer
  3. Failed jobs will be retried