Sharding Configuration

The sharding section configures horizontal scaling for large deployments with thousands of repositories.

Code Search supports three deployment modes for scaling:

1. Single Indexer

The simplest deployment, with one indexer and its Zoekt sidecar:

  • Storage: Single PersistentVolume (ReadWriteOnce)
  • Features: Full search, file browsing, replace jobs (with shared storage to API)
  • Scale: Handles thousands of repositories
  • Best for: Most deployments

# Default - no sharding config needed
sharding:
  enabled: false

2. Shared Storage

Multiple indexer workers sharing the same PersistentVolume:

  • Storage: ReadWriteMany (NFS, CephFS, EFS, Azure Files)
  • Features: All features work - search, file browsing, replace jobs
  • Scale: Parallel indexing with queue-based work distribution
  • Best for: Faster indexing without complex configuration

# In Kubernetes/Helm
indexer:
  replicaCount: 4 # Multiple workers
  persistence:
    accessMode: ReadWriteMany # Shared storage
# No sharding config needed
sharding:
  enabled: false

3. Hash-Based Sharding with Federated Access


Each shard handles a subset of repositories using consistent FNV hashing:

  • Storage: Each shard has its own PersistentVolume (ReadWriteOnce)
  • Features: Search via parallel Zoekt queries, file browsing/replace via federated proxy
  • Scale: Extreme horizontal scaling without shared storage
  • Best for: Very large deployments, cloud environments without good RWX options

sharding:
  enabled: true
  total_shards: 3
  indexer_api_port: 8081
  indexer_service: "code-search-indexer-headless"
  federated_access: true

enabled

Enable hash-based sharding mode.

| Property | Value |
|---|---|
| Type | boolean |
| Default | false |
| Environment | CS_SHARDING_ENABLED |

When enabled:

  • Repositories are distributed across shards using consistent FNV hashing
  • Each indexer only processes repositories assigned to its shard
  • Search queries all Zoekt instances in parallel
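
The parallel search fan-out can be sketched as follows. This is a minimal illustration, not the project's implementation: `search_all_shards`, the `search_one` callback, and the `score` field used for ordering are all hypothetical names introduced here.

```python
from __future__ import annotations

from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable


def search_all_shards(
    query: str,
    shard_urls: Iterable[str],
    search_one: Callable[[str, str], list[dict]],
) -> list[dict]:
    """Send `query` to every shard in parallel and merge the hits.

    `search_one(url, query)` is a hypothetical callback that queries a
    single Zoekt instance and returns a list of hit dictionaries.
    """
    urls = list(shard_urls)
    if not urls:
        return []
    with ThreadPoolExecutor(max_workers=len(urls)) as pool:
        # One task per shard; map() yields results in input order.
        per_shard = list(pool.map(lambda url: search_one(url, query), urls))
    merged = [hit for hits in per_shard for hit in hits]
    # Assumed merge policy: highest score first.
    return sorted(merged, key=lambda hit: hit.get("score", 0), reverse=True)
```

Passing the per-shard query function in as a callback keeps the sketch independent of any particular HTTP client or Zoekt API.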

total_shards

Total number of indexer shards. Must match the number of indexer replicas.

| Property | Value |
|---|---|
| Type | integer |
| Default | 1 |
| Environment | CS_SHARDING_TOTAL_SHARDS |

The shard index is determined from the pod’s ordinal number (e.g., indexer-0 is shard 0).
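
As a rough sketch of that rule, the ordinal can be read from the trailing number of the pod name and checked against total_shards. The function name `shard_index_from_pod_name` and the error handling are assumptions made for illustration.

```python
import re


def shard_index_from_pod_name(pod_name: str, total_shards: int) -> int:
    """Derive the shard index from a StatefulSet-style pod name.

    Hypothetical helper: "indexer-0" maps to shard 0, "indexer-2" to shard 2.
    """
    match = re.search(r"-(\d+)$", pod_name)
    if match is None:
        raise ValueError(f"pod name {pod_name!r} has no ordinal suffix")
    index = int(match.group(1))
    # A pod whose ordinal is outside the shard range means the replica
    # count and total_shards disagree.
    if not 0 <= index < total_shards:
        raise ValueError(
            f"ordinal {index} is outside 0..{total_shards - 1}; "
            "total_shards must match the number of indexer replicas"
        )
    return index
```

Failing fast on an out-of-range ordinal surfaces a mismatch between total_shards and the replica count at startup rather than as silently unindexed repositories.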

indexer_api_port

HTTP API port exposed by each indexer for federated file/replace access.

| Property | Value |
|---|---|
| Type | integer |
| Default | 8081 |
| Environment | CS_SHARDING_INDEXER_API_PORT |

This port is used internally for API-to-indexer communication. The endpoints are:

  • GET /files/{repo}/tree - List directory contents
  • GET /files/{repo}/blob - Get file contents
  • GET /files/{repo}/exists - Check if repo exists on shard
  • POST /replace/execute - Execute replace job on shard

indexer_service

Headless Kubernetes service name for pod discovery.

| Property | Value |
|---|---|
| Type | string |
| Default | "code-search-indexer-headless" |
| Environment | CS_SHARDING_INDEXER_SERVICE |

The API uses this service to construct pod DNS names for routing requests:

{indexer_service}.{namespace}.svc.cluster.local

For shard N:

code-search-indexer-N.{indexer_service}.{namespace}.svc.cluster.local
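
The naming pattern above can be expressed as a small helper. This is a sketch under assumptions: the function names, the `default` namespace, and the plain `http` scheme are illustrative and not confirmed by the configuration reference.

```python
def shard_host(
    shard: int,
    service: str = "code-search-indexer-headless",
    namespace: str = "default",  # assumption: set this to your namespace
    pod_prefix: str = "code-search-indexer",
) -> str:
    """Build the pod DNS name for shard N behind the headless service."""
    return f"{pod_prefix}-{shard}.{service}.{namespace}.svc.cluster.local"


def shard_base_url(shard: int, port: int = 8081, **kwargs: str) -> str:
    """Base URL of the indexer API on a shard (scheme is an assumption)."""
    return f"http://{shard_host(shard, **kwargs)}:{port}"
```

For example, `shard_base_url(1, namespace="search")` yields `http://code-search-indexer-1.code-search-indexer-headless.search.svc.cluster.local:8081`.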

federated_access

Enable federated file browsing and replace jobs via proxy.

| Property | Value |
|---|---|
| Type | boolean |
| Default | false |
| Environment | CS_SHARDING_FEDERATED_ACCESS |

When enabled:

  • File browser requests are proxied to the indexer that owns the repository
  • Replace jobs are grouped by shard, executed on the shards in parallel, and the results are merged
  • Each indexer runs an HTTP API on indexer_api_port

Equivalent environment variables:

CS_SHARDING_ENABLED="true"
CS_SHARDING_TOTAL_SHARDS="3"
CS_SHARDING_INDEXER_API_PORT="8081"
CS_SHARDING_INDEXER_SERVICE="code-search-indexer-headless"
CS_SHARDING_FEDERATED_ACCESS="true"
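
To show how these variables map onto the documented defaults, here is a minimal loader sketch. `ShardingConfig`, `load_sharding_config`, and the set of strings accepted as true are assumptions made for this example; the real parser may behave differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class ShardingConfig:
    # Defaults taken from the option tables in this section.
    enabled: bool = False
    total_shards: int = 1
    indexer_api_port: int = 8081
    indexer_service: str = "code-search-indexer-headless"
    federated_access: bool = False


def _as_bool(value: str) -> bool:
    # Assumed truthy spellings; not confirmed by the reference.
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_sharding_config(env: Mapping[str, str] = os.environ) -> ShardingConfig:
    """Read CS_SHARDING_* variables, falling back to the defaults."""
    d = ShardingConfig()
    prefix = "CS_SHARDING_"

    def get(name: str, default, convert):
        raw = env.get(prefix + name)
        return default if raw is None else convert(raw)

    return ShardingConfig(
        enabled=get("ENABLED", d.enabled, _as_bool),
        total_shards=get("TOTAL_SHARDS", d.total_shards, int),
        indexer_api_port=get("INDEXER_API_PORT", d.indexer_api_port, int),
        indexer_service=get("INDEXER_SERVICE", d.indexer_service, str),
        federated_access=get("FEDERATED_ACCESS", d.federated_access, _as_bool),
    )
```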

Repositories are assigned to shards using consistent FNV hashing:

shard = FNV32(repoName) % totalShards

This ensures:

  • Same repository always goes to the same shard
  • Even distribution across shards
  • No coordination needed between shards
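
A minimal sketch of the assignment formula, assuming the 32-bit FNV-1a variant over the UTF-8 bytes of the repository name. The reference only says "FNV32", so the exact variant and encoding are assumptions, and `fnv1a_32` and `shard_for` are illustrative names.

```python
FNV32_OFFSET_BASIS = 0x811C9DC5
FNV32_PRIME = 0x01000193


def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a: XOR each byte in, then multiply by the FNV prime."""
    h = FNV32_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV32_PRIME) & 0xFFFFFFFF  # keep the value in 32 bits
    return h


def shard_for(repo_name: str, total_shards: int) -> int:
    """shard = FNV32(repoName) % totalShards"""
    if total_shards < 1:
        raise ValueError("total_shards must be at least 1")
    return fnv1a_32(repo_name.encode("utf-8")) % total_shards
```

Because the result depends on total_shards, changing the shard count reassigns many repositories, which is why shards re-index their assigned repositories after such a change.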

Search flow:

  1. API receives search query
  2. Query is sent to all Zoekt instances in parallel
  3. Results are merged and returned to the client

File browsing flow:

  1. API receives file browse request for repository X
  2. API calculates: shard = hash(X) % totalShards
  3. API proxies request to indexer-{shard} via headless service
  4. Indexer returns file contents from its local storage

Replace job flow:

  1. API receives replace request with matches from multiple repos
  2. API groups matches by shard (based on repository hash)
  3. API sends execute request to each shard in parallel
  4. Each shard processes its assigned repositories
  5. API merges results and returns to client

For Kubernetes deployments, use the Helm chart’s sharding configuration:

# values.yaml
sharding:
  enabled: true
  replicas: 3
  federatedAccess:
    enabled: true
    indexerAPIPort: 8081

This automatically:

  • Creates a StatefulSet with the specified replicas
  • Configures the headless service for pod discovery
  • Sets up environment variables for sharding
  • Exposes the indexer API port

| Scenario | Mode | Configuration |
|---|---|---|
| < 1000 repos | Single | Default (no sharding) |
| 1000-5000 repos, fast indexing | Shared Storage | RWX + multiple replicas |
| > 5000 repos, no RWX available | Sharding | Hash-based with federated access |
| Cloud with limited storage options | Sharding | Hash-based with federated access |

Single indexer to shared storage:

  1. Change storage class to RWX
  2. Increase indexer.replicaCount
  3. No data migration needed

Single indexer or shared storage to sharding:

  1. Enable sharding with total_shards matching desired replicas
  2. Each shard will re-index its assigned repositories
  3. Consider running re-index during off-peak hours

Sharding back to a single indexer:

  1. Disable sharding
  2. Scale to single replica
  3. Re-index all repositories to single storage