The primary data store of the refurbed platform is a PostgreSQL database. Since the platform runs on Google Cloud, the obvious choice was Cloud SQL for PostgreSQL, a database service fully managed by Google. As we become more popular and expand to other markets, our customer base grows significantly, and with it the traffic.
Our PostgreSQL cluster is configured for high availability, with primary and standby instances deployed in different zones. Each instance has a single read replica in its zone. Given this setup, we want to distribute the workload across all nodes. The nature of the refurbed platform shapes the traffic in such a way that most of the workload is read-only. Some of it can be cached, but the rest cannot, so we need the ability to transparently send writes to the primary node and spread reads across all the nodes in the zone.
Our goal: add the ability to load-balance SQL queries while transparently splitting reads and writes between the primary and replica nodes, and introduce this change without (significantly) touching the existing code.
We considered three solutions to achieve this goal:
- Odyssey is a new and promising high-performance solution developed by Yandex, but its load-balancing capabilities are not flexible enough for our use case.
- PgBouncer is a battle-tested connection pooler, but it lacks (by design) load-balancing features.
- Pgpool-II is a well-tested solution with built-in support for routing write and read queries to the primary and replicas respectively. It is the option that avoids building the read-write split into the refurbed platform itself.
Welcome to Pgpool-II
Pgpool has been around for quite some time (version 1.0 was released in 2003); originally it was just a simple connection-pooling proxy. In 2006 it went from a personal project to one owned by the Pgpool Development Group and was renamed Pgpool-II. Currently, its major supported features are:
- Connection pooling
- Query load balancing
- Automated failover
- Watchdog (HA of Pgpool-II)
- In-memory query caching
To get a better understanding of how Pgpool-II works, let's go through a step-by-step guide: we start a PostgreSQL cluster, configure Pgpool-II to load-balance queries and do a read-write split, and validate it all works as expected. We use Docker for this exercise. The following Docker Compose file does all the setup:
```yaml
version: '3.8'
services:
  pgprimary:
    image: bitnami/postgresql:13.1.0
    ports:
      - 5432
    volumes:
      - pgprimary_data:/bitnami/postgresql
    environment:
      - POSTGRESQL_REPLICATION_MODE=master
      - POSTGRESQL_REPLICATION_USER=postgres
      - POSTGRESQL_REPLICATION_PASSWORD=s3cret
      - POSTGRESQL_PASSWORD=s3cret
      - POSTGRESQL_DATABASE=test_db
  pgreplica:
    image: bitnami/postgresql:13.1.0
    ports:
      - 5432
    depends_on:
      - pgprimary
    environment:
      - POSTGRESQL_REPLICATION_MODE=slave
      - POSTGRESQL_REPLICATION_USER=postgres
      - POSTGRESQL_REPLICATION_PASSWORD=s3cret
      - POSTGRESQL_MASTER_HOST=pgprimary
      - POSTGRESQL_PASSWORD=s3cret
  pgpool:
    image: bitnami/pgpool:4.2.1
    ports:
      - 5432:5432
    depends_on:
      - pgprimary
    environment:
      - PGPOOL_BACKEND_NODES=0:pgprimary:5432:4:primary:ALWAYS_PRIMARY,1:pgreplica:5432:6:replica
      - PGPOOL_POSTGRES_USERNAME=postgres
      - PGPOOL_POSTGRES_PASSWORD=s3cret
      - PGPOOL_ADMIN_USERNAME=admin
      - PGPOOL_ADMIN_PASSWORD=s3cret
      - PGPOOL_ENABLE_LOAD_BALANCING=yes
      - PGPOOL_ENABLE_STATEMENT_LOAD_BALANCING=yes
      - PGPOOL_NUM_INIT_CHILDREN=10
      - PGPOOL_MAX_POOL=1
      - PGPOOL_SR_CHECK_USER=postgres
      - PGPOOL_SR_CHECK_PASSWORD=s3cret
volumes:
  pgprimary_data:
```
The PostgreSQL cluster is configured with streaming replication enabled. The `pgprimary` service acts as the primary node and the `pgreplica` service as a replica. Both nodes listen on port `5432`, but these ports are not exposed to the host machine, as the `pgpool` service should be used to connect to the cluster.
Pgpool-II is configured to recognise both the primary and replica nodes via `PGPOOL_BACKEND_NODES`, and its load-balancing feature is enabled with the `PGPOOL_ENABLE_LOAD_BALANCING` environment variable.
By default, Pgpool-II uses session-level load balancing: the target node is determined when a client connects and stays the same until the session is closed. While this is useful in many cases, we can also configure Pgpool-II to load-balance each statement regardless of the session life cycle. This is called statement-level load balancing, and we enable it with the `PGPOOL_ENABLE_STATEMENT_LOAD_BALANCING` environment variable.
Since Pgpool-II acts as a connection pool, there are a few options to tweak its settings. We instruct Pgpool-II to use 10 child processes (workers) to handle connections (`PGPOOL_NUM_INIT_CHILDREN`), where each worker caches a single connection per backend database/user pair (`PGPOOL_MAX_POOL`). This constrains the pool to a maximum of 10 * 1 = 10 connections to a single backend node. If no worker is available to process a query, it is queued and processed once a worker becomes free.
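The sizing rule above can be sketched with a small calculation (illustrative only, not Pgpool-II code; the variable names mirror the environment variables):

```python
# How the connection ceiling is derived from the pool settings above.
num_init_children = 10  # PGPOOL_NUM_INIT_CHILDREN: worker processes
max_pool = 1            # PGPOOL_MAX_POOL: cached connections per worker and backend/user pair

# Each worker holds at most `max_pool` connections to a given backend node,
# so a single backend never sees more than this many connections from one Pgpool-II:
max_connections_per_backend = num_init_children * max_pool
print(max_connections_per_backend)  # 10
```

This ceiling matters later when running multiple Pgpool-II instances, since each instance opens its own pool against the same backends.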
Use the following commands to start the cluster, connect to it, seed some data, and execute a few queries:
```
docker-compose up -d
psql -U postgres -h localhost -p 5432 -d test_db

test_db=# CREATE TABLE users (id SERIAL PRIMARY KEY, email TEXT NOT NULL);
test_db=# INSERT INTO users (email) VALUES ('email@example.com'), ('firstname.lastname@example.org'), ('email@example.com');
test_db=# SELECT * FROM users;
... repeat a couple of times
test_db=# SELECT * FROM users;
```
Now check the Pgpool-II stats to see that load-balancing works as expected:
```
test_db=# SHOW pool_nodes;
 node_id | hostname  | port | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+-----------+------+--------+-----------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0       | pgprimary | 5432 | up     | 0.400000  | primary | 22         | false             | 0                 |                   |                        | 2021-01-12 12:05:11
 1       | pgreplica | 5432 | up     | 0.600000  | standby | 43         | true              | 0                 |                   |                        | 2021-01-12 12:05:11
(2 rows)
```
`SHOW pool_nodes` is a special SQL-type command accepted by Pgpool-II. Special SQL commands are not forwarded to the backends but are processed by Pgpool-II itself. From the output we can see that Pgpool-II recognises two nodes (`pgprimary` and `pgreplica`). `pgprimary` is the primary node and all writes are sent exclusively to it, while reads are split between both nodes using a 4:6 ratio (see the `lb_weight` column).
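The weighted read split can be illustrated with a short sketch (our simplified model, not Pgpool-II's actual selection code; node names and weights come from the Compose file above):

```python
import random

# Backend weights as configured in PGPOOL_BACKEND_NODES: primary=4, replica=6.
weights = {"pgprimary": 4, "pgreplica": 6}
total = sum(weights.values())

# Expected share of read queries each node receives after normalisation,
# matching the lb_weight column in SHOW pool_nodes (0.4 and 0.6):
shares = {node: w / total for node, w in weights.items()}
print(shares)  # {'pgprimary': 0.4, 'pgreplica': 0.6}

# Picking a target node for a single read statement, proportionally to weight:
node = random.choices(list(weights), weights=list(weights.values()))[0]
```

Over many statements the observed `select_cnt` values converge on these shares, which is what the 22/43 split in the output above roughly shows.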
NOTE: In this example we use the Bitnami Docker images, as we have had good experience with them. However, there is also an official Pgpool-II Docker image created and maintained by the Pgpool-II developers. Feel free to use it, but bear in mind that the configuration options may differ slightly and you may need to adjust your setup accordingly.
Pgpool-II in Kubernetes
As already mentioned, we run on Google Cloud Platform, and the refurbed platform itself is deployed into Google Kubernetes Engine. We selected the components whose SQL queries should be load-balanced across the PostgreSQL cluster and configured them to access the database via Pgpool-II. Generally speaking, there are two ways to deploy Pgpool-II into a Kubernetes cluster:
- Separate service
- Sidecar container
To verify that our Pgpool-II configuration works as expected, we chose to run Pgpool-II as a sidecar container alongside our main backend API service containers. There have been a couple of issues with the sidecar approach: cleanly deploying changes to the API container while ensuring no downtime at deployment time took some effort in tweaking the Pgpool-II settings. Furthermore, if we used the Pgpool-II sidecar in more services and scaled them by adding replicas, we would end up with too many connections to our PostgreSQL servers. We are currently evaluating running Pgpool-II as a separate service.
NOTE: If you want to see how to configure Pgpool-II as a service, please refer to this excellent blog post authored by one of the Pgpool-II core development team members.
Using the sidecar approach, our current deployment manifest for the main API service looks as follows:
```yaml
kind: Deployment
apiVersion: apps/v1
metadata:
  name: api
  labels:
    service: api
spec:
  replicas: 4
  selector:
    matchLabels:
      service: api
  template:
    metadata:
      labels:
        service: api
    spec:
      containers:
        ...
        - name: api
          ...
          env:
            ...
            - name: API_POSTGRES_HOST
              value: localhost
            - name: API_POSTGRES_PORT
              value: "5432"
            ...
          lifecycle:
            preStop:
              exec:
                command: [ "sh", "-c", "sleep 12" ]
          ...
        - name: pgpool
          image: bitnami/pgpool:4.1.5
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 5432
              protocol: TCP
          lifecycle:
            preStop:
              exec:
                command: [ "sh", "-c", "sleep 15" ]
          env:
            - name: PGPOOL_BACKEND_NODES
              value: "0:PRIMARY_IP_ADDR:5432:4:primary:ALWAYS_PRIMARY|DISALLOW_TO_FAILOVER,1:REPLICA_IP_ADDR:5432:6:repl01:DISALLOW_TO_FAILOVER"
            - name: PGPOOL_POSTGRES_USERNAME
              valueFrom:
                secretKeyRef:
                  name: credentials
                  key: postgres.username
            - name: PGPOOL_POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: credentials
                  key: postgres.password
            - name: PGPOOL_SR_CHECK_PERIOD
              value: "0"
            - name: PGPOOL_HEALTH_CHECK_USER
              valueFrom:
                secretKeyRef:
                  name: credentials
                  key: postgres.username
            - name: PGPOOL_HEALTH_CHECK_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: credentials
                  key: postgres.password
            - name: PGPOOL_NUM_INIT_CHILDREN
              value: "13"
            - name: PGPOOL_MAX_POOL
              value: "1"
            - name: PGPOOL_HEALTH_CHECK_PERIOD
              value: "5"
            - name: PGPOOL_HEALTH_CHECK_MAX_RETRIES
              value: "20"
            - name: PGPOOL_HEALTH_CHECK_RETRY_DELAY
              value: "1"
            - name: PGPOOL_ENABLE_LOAD_BALANCING
              value: "yes"
            - name: PGPOOL_ENABLE_STATEMENT_LOAD_BALANCING
              value: "yes"
            - name: PGPOOL_ADMIN_USERNAME
              value: "iamnotused"
            - name: PGPOOL_ADMIN_PASSWORD
              value: "s3cret"
          readinessProbe:
            tcpSocket:
              port: 5432
            initialDelaySeconds: 5
            periodSeconds: 2
          livenessProbe:
            tcpSocket:
              port: 5432
            initialDelaySeconds: 10
            periodSeconds: 2
      ...
```
Similarly to the Docker Compose example, we need to configure Pgpool-II with the PostgreSQL backend nodes. The `PGPOOL_BACKEND_NODES` config option has two parts, each being the configuration of a single backend node. The node configurations are separated by a comma, and the individual config options within each part by a colon. Let's break down each option of the primary node:
- `0`: index of the primary backend node
- `PRIMARY_IP_ADDR`: IP address of the primary backend node
- `5432`: port the primary backend node listens on
- `4`: primary backend node weight
- `primary`: primary backend node name
- `ALWAYS_PRIMARY|DISALLOW_TO_FAILOVER`: primary backend node flags
We use the `ALWAYS_PRIMARY` flag to let Pgpool-II know which node is the primary (remember, our Cloud SQL setup uses streaming replication and the primary node is fixed), so it does not need to inspect the backends. We also use the `DISALLOW_TO_FAILOVER` backend node flag (for the `primary` as well as the `repl01` node), as failover is managed by Cloud SQL.
The replica node options follow the same pattern:

- `1`: index of the replica backend node
- `REPLICA_IP_ADDR`: IP address of the replica backend node
- `5432`: port the replica backend node listens on
- `6`: replica backend node weight
- `repl01`: replica backend node name
- `DISALLOW_TO_FAILOVER`: replica backend node flags
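To make the colon/comma format concrete, here is a hypothetical parser (not part of Pgpool-II, written just for illustration) applied to the value from the manifest above:

```python
# Hypothetical helper that parses the PGPOOL_BACKEND_NODES format:
# comma-separated node entries, colon-separated fields within each entry.
def parse_backend_nodes(value):
    nodes = []
    for entry in value.split(","):
        index, host, port, weight, name, flags = entry.split(":")
        nodes.append({
            "index": int(index),
            "host": host,
            "port": int(port),
            "weight": int(weight),
            "name": name,
            "flags": flags.split("|"),  # multiple flags are pipe-separated
        })
    return nodes

nodes = parse_backend_nodes(
    "0:PRIMARY_IP_ADDR:5432:4:primary:ALWAYS_PRIMARY|DISALLOW_TO_FAILOVER,"
    "1:REPLICA_IP_ADDR:5432:6:repl01:DISALLOW_TO_FAILOVER"
)
print(nodes[0]["flags"])  # ['ALWAYS_PRIMARY', 'DISALLOW_TO_FAILOVER']
print(nodes[1]["name"])   # repl01
```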
The `PGPOOL_HEALTH_CHECK_*` config options tweak Pgpool-II's health check mechanism, which periodically connects to the configured backends to detect errors on the servers or in the network. `PGPOOL_HEALTH_CHECK_PERIOD` ensures the check is performed every 5 seconds (`0` means health checks are disabled), `PGPOOL_HEALTH_CHECK_RETRY_DELAY` specifies a 1-second wait between failed health check attempts, and `PGPOOL_HEALTH_CHECK_MAX_RETRIES` specifies that at most 20 retries are performed before giving up and marking the failed node as down. Such nodes are no longer considered for load balancing.
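A back-of-the-envelope estimate (our simplified reading of these settings, not an exact model of Pgpool-II's internals) shows how long a dead backend could go undetected:

```python
# Health check settings from the manifest above.
health_check_period = 5        # seconds between periodic checks
health_check_max_retries = 20  # retries before the node is marked down
health_check_retry_delay = 1   # seconds between failed retries

# Rough worst case: the node dies right after a successful check, so Pgpool-II
# waits one full period and then burns through all retries before marking it down
# (ignoring per-attempt connect timeouts, which add to this in practice).
worst_case_seconds = health_check_period + health_check_max_retries * health_check_retry_delay
print(worst_case_seconds)  # 25
```

Tightening the period or retry count shortens this window, at the cost of more check traffic and a higher chance of flagging a node down on a transient blip.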
Also, notice the containers' lifecycle `preStop` hooks. We put these in place to help ensure the API container shuts down before Pgpool-II does. Pgpool-II implements a graceful shutdown (called smart shutdown) in which it waits for all clients to disconnect. However, in our Kubernetes environment we occasionally experienced a situation where, on shutdown, the API container (which also implements a graceful shutdown) blocked Pgpool-II's shutdown until `terminationGracePeriodSeconds` kicked in and the containers were forcefully killed. We found that cascading the shutdown procedure, using different `preStop` hook values, got rid of these situations (although we never found the root cause).
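The intended ordering can be sketched as a simple timeline (timings taken from the manifest above; Kubernetes runs the `preStop` hook before sending SIGTERM to each container):

```python
# preStop sleep values from the deployment manifest.
api_prestop_sleep = 12     # API container: sleep, then receive SIGTERM and drain
pgpool_prestop_sleep = 15  # Pgpool-II container: sleeps longer

# Both hooks start at roughly the same time when the Pod terminates, so
# Pgpool-II keeps serving for about this many seconds after the API begins
# its graceful shutdown, letting in-flight queries finish first:
head_start = pgpool_prestop_sleep - api_prestop_sleep
print(head_start)  # 3
```

Both sleeps must still fit within `terminationGracePeriodSeconds`, otherwise the kubelet force-kills the containers regardless.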
Our journey was quite straightforward and smooth, but we hit some bumps along the way. Our goal of enabling query load balancing and read-write splitting without touching the code was not met completely. Pgpool-II ensures that load balancing does not happen for queries executed within a transaction; all of them are sent to the primary node. This is very useful where we need to guarantee consistency in read-after-write sequences: putting these sequences in a transaction was a small code change that gave us consistent reads in those cases. Unfortunately, that was not enough, and after a short while we realised there were still occasional stale reads. Our investigation revealed that they were happening in critical places such as some of our webhook callback handlers. These are logically coupled handlers where a write in the first webhook triggers another webhook whose handler expects to read exactly the data written by the first one. We again needed to ensure that these parts of the codebase read only from the primary node; once we did, the issues were fixed.
We set out to transparently load-balance SQL queries between our PostgreSQL cluster nodes and to split reads and writes between the primary and replica nodes without touching the code. As far as we know, Pgpool-II is the only way to achieve this goal (but if you know of any open-source projects that do it better, we are open to ideas). Generally speaking, it was a smooth ride, though we encountered a few bumps. The main ones concerned configuration: it is not trivial to configure Pgpool-II properly, and while the documentation helps, it could be improved with more targeted examples and full-text search. Pgpool-II does not seem to be widely adopted, and most articles available online are either obsolete or lack a comprehensive explanation of the config options and values. Fortunately, things are improving with recent blog posts from Bo Peng, a Pgpool-II development team member; these articles are well written and explain various Pgpool-II capabilities in a simple and clear way.
January 29, 2021
Michal is leading our architecture and infrastructure development.