A good read for additional operating-system-level tuning: https://community.progress.com/s/article/Chef-Automate-Deployment-Planning-and-Performance-tuning-transcribed-from-Scaling-Chef-Automate-Beyond-100-000-nodes
These recommendations assume a combined cluster running with at least the following minimum server specs:
- 7 FE nodes:
  - 8-16 CPU cores, 32GB RAM
- 3 BE PGSQL nodes:
  - 8-16 CPU cores, 32-64GB RAM, 1TB SSD disk space
- 5 BE OpenSearch nodes:
  - 16 CPU cores, 64GB RAM, 15TB SSD disk space
You will also get more mileage by creating separate clusters for Chef Infra Server and Automate, which allows each application its own PGSQL and OpenSearch clusters.
# PGSQL connections
[postgresql.v1.sys.pg]
max_connections = 1500
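To apply, save the snippet above to a file and patch it in from a frontend node (the filename is illustrative; on HA deployments check chef-automate config patch --help for version-specific flags that target the backends):
chef-automate config patch pg_connections.toml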
The HAProxy service on the PGSQL servers is not configurable via chef-automate config patch.
Below are the steps to update the HAProxy service by hand.
Note: run the following on a DB backend node, normally a follower:
source /hab/sup/default/SystemdEnvironmentFile.sh
automate-backend-ctl applied --svc=automate-ha-haproxy | tail -n +2 > haproxy_config.toml
# note: haproxy_config.toml may be blank; this step only captures any local customisations that may already exist
Edit haproxy_config.toml and add the following settings:
# HAProxy config
# Global
maxconn = 2000
# Backend servers
[server]
maxconn = 1500
hab config apply automate-ha-haproxy.default $(date '+%s') haproxy_config.toml
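The applied config is gossiped to the ring; you can watch it land on each backend with the same journald follow used below (exact log wording varies by version):
journalctl -fu hab-sup | grep -i haproxy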
Note: this will propagate to all three backend DBs and restart the HAProxy service on each backend, causing a short outage (only a few minutes). However, a complete DB restart is still required, as follows (the only robust way is to restart all DB backends; do not skip the steps below):
# on each follower, one at a time:
systemctl stop hab-sup
systemctl start hab-sup
journalctl -fu hab-sup
# then on the leader:
systemctl stop hab-sup
# wait until a leader is elected from the other two (old follower) nodes; only then do the start
systemctl start hab-sup
journalctl -fu hab-sup
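Before moving on to the next node, it can help to confirm the supervisor's services are back up (hab svc status is part of the standard Habitat CLI):
hab svc status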
After the restart, cat the following file on all three BE PGSQL nodes to be sure the settings have taken (i.e. confirm the "maxconn = 1500" setting is present):
/hab/svc/automate-ha-haproxy/config/haproxy.conf
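A quicker check of the same file:
grep maxconn /hab/svc/automate-ha-haproxy/config/haproxy.conf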
Fix for knife search when the node count exceeds 10k. First, run this on an FE node against the embedded OpenSearch:
curl -XPUT "http://127.0.0.1:10144/chef/_settings" -d '{"index": {"max_result_window": 100000}}' -H "Content-Type: application/json"
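To confirm the setting took, read it back via the standard OpenSearch settings API:
curl -s "http://127.0.0.1:10144/chef/_settings?pretty"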
Then run chef-automate config patch with the TOML file below:
# knife search fix for nodes over 10k
[erchef.v1.sys.index] # For Automate version 4.13.76 and newer
track_total_hits = true
# Cluster Ingestion
[opensearch.v1.sys.cluster]
max_shards_per_node = 6000
# JVM Heap
[opensearch.v1.sys.runtime]
heapsize = "32g" # 50% of total memory up to 32GB
# Worker Processes
[load_balancer.v1.sys.ngx.main]
worker_processes = 10 # do not exceed 10 or the number of CPU cores, whichever is lower
[cs_nginx.v1.sys.ngx.main]
worker_processes = 10 # do not exceed 10 or the number of CPU cores, whichever is lower
[esgateway.v1.sys.ngx.main]
worker_processes = 10 # do not exceed 10 or the number of CPU cores, whichever is lower
# Cookbook Version Cache
[erchef.v1.sys.api]
cbv_cache_enabled = true
# CB Depsolver
# Depsolver tuning parameters assume a Chef workload using roles/environments/cookbooks.
# If you only use Policyfiles instead of roles/environments, depsolver tuning is not required.
[erchef.v1.sys.depsolver]
timeout = 10000
pool_init_size = 32
pool_max_size = 32
pool_queue_max = 512
pool_queue_timeout = 10000
# Connection Pools
[erchef.v1.sys.data_collector]
pool_init_size = 100
pool_max_size = 100
[erchef.v1.sys.sql]
timeout = 5000
pool_init_size = 80
pool_max_size = 80
pool_queue_max = 512
pool_queue_timeout = 10000
[bifrost.v1.sys.sql]
timeout = 5000
pool_init_size = 80
pool_max_size = 80
pool_queue_max = 512
pool_queue_timeout = 10000
[erchef.v1.sys.authz]
timeout = 10000
pool_init_size = 100
pool_max_size = 100
pool_queue_max = 512
pool_queue_timeout = 10000
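A minimal sketch of applying and verifying the tunings above, assuming the TOML snippets are collected into a single file named fe_tuning.toml (the filename and the search query are illustrative):
# from a frontend node
chef-automate config patch fe_tuning.toml
# confirm the merged running config contains the new values
chef-automate config show
# re-test the knife search fix across all nodes
knife search node '*:*' -i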