The primary data store of the refurbed platform is a PostgreSQL database. Since the platform runs on Google Cloud, the obvious choice was Cloud SQL for PostgreSQL, a database service fully managed by Google. As we become more popular and expand to other markets, our customer base grows significantly, and with it the traffic.
Our PostgreSQL cluster is configured for high availability, with primary and standby instances deployed in different zones. Each instance has a single read replica in its zone. Given this setup, we want to distribute the workload across all nodes. The nature of the refurbed platform shapes the traffic in such a way that most of the workload is read-only. Some of it can be cached, but the rest cannot, so we need the ability to transparently send writes to the primary node and spread reads across all the nodes in the zone.
Our goal: add the ability to load-balance SQL queries while transparently splitting reads and writes between the primary and replica nodes, and introduce this change without (significantly) touching the existing code.
We considered three solutions to achieve this goal:
- Odyssey is a new and promising high-performance solution developed by Yandex, but its load-balancing capabilities are not flexible enough for our use case.
- PgBouncer is a battle-tested connection pooler, but it lacks (by design) load-balancing features.
- Pgpool-II is a well-tested solution with built-in support for routing write and read queries to the primary and replicas respectively. It is the option that avoids building the read-write split into the refurbed platform itself.
Welcome to Pgpool-II
Pgpool has been around for quite some time (version 1.0 was released in 2003); originally it was just a simple connection-pooling proxy. In 2006 it went from a personal project to one owned by the Pgpool Development Group and was renamed Pgpool-II. Currently, its major supported features are:
- Connection pooling
- Query load balancing
- Automated failover
- Watchdog (HA of Pgpool-II)
- In-memory query caching
To get a better understanding of how Pgpool-II works, let's go through a step-by-step guide: we start a PostgreSQL cluster, configure Pgpool-II to load-balance queries and do a read-write split, and validate it all works as expected. We use Docker for this exercise. The following Docker Compose file does all the setup:
```yaml
version: '3.8'
services:
  pgprimary:
    image: bitnami/postgresql:13.1.0
    ports:
      - 5432
    volumes:
      - pgprimary_data:/bitnami/postgresql
    environment:
      - POSTGRESQL_REPLICATION_MODE=master
      - POSTGRESQL_REPLICATION_USER=postgres
      - POSTGRESQL_REPLICATION_PASSWORD=s3cret
      - POSTGRESQL_PASSWORD=s3cret
      - POSTGRESQL_DATABASE=test_db
  pgreplica:
    image: bitnami/postgresql:13.1.0
    ports:
      - 5432
    depends_on:
      - pgprimary
    environment:
      - POSTGRESQL_REPLICATION_MODE=slave
      - POSTGRESQL_REPLICATION_USER=postgres
      - POSTGRESQL_REPLICATION_PASSWORD=s3cret
      - POSTGRESQL_MASTER_HOST=pgprimary
      - POSTGRESQL_PASSWORD=s3cret
  pgpool:
    image: bitnami/pgpool:4.2.1
    ports:
      - 5432:5432
    depends_on:
      - pgprimary
    environment:
      - PGPOOL_BACKEND_NODES=0:pgprimary:5432:4:primary:ALWAYS_PRIMARY,1:pgreplica:5432:6:replica
      - PGPOOL_POSTGRES_USERNAME=postgres
      - PGPOOL_POSTGRES_PASSWORD=s3cret
      - PGPOOL_ADMIN_USERNAME=admin
      - PGPOOL_ADMIN_PASSWORD=s3cret
      - PGPOOL_ENABLE_LOAD_BALANCING=yes
      - PGPOOL_ENABLE_STATEMENT_LOAD_BALANCING=yes
      - PGPOOL_NUM_INIT_CHILDREN=10
      - PGPOOL_MAX_POOL=1
      - PGPOOL_SR_CHECK_USER=postgres
      - PGPOOL_SR_CHECK_PASSWORD=s3cret
volumes:
  pgprimary_data:
```
The PostgreSQL cluster is configured with streaming replication enabled. The `pgprimary` service acts as the primary node and the `pgreplica` service as a replica. Both nodes listen on port `5432`, but these ports are not exposed to the host machine, as the `pgpool` service should be used to connect to the cluster.
Pgpool-II is configured to recognise both the primary and replica nodes via `PGPOOL_BACKEND_NODES`, and its load-balancing feature is enabled with the `PGPOOL_ENABLE_LOAD_BALANCING` environment variable.
By default, Pgpool-II uses session-level load balancing: the target node is determined when a client connects and stays the same until the session is closed. While this is useful in many cases, we can also configure Pgpool-II to load-balance each statement regardless of the session life cycle. This is called statement-level load balancing, and we enable it with the `PGPOOL_ENABLE_STATEMENT_LOAD_BALANCING` environment variable.
Since Pgpool-II acts as a connection pool, there are a few options to tweak its settings. We instruct Pgpool-II to use 10 child processes (workers) to handle connections (`PGPOOL_NUM_INIT_CHILDREN`), where each worker caches a single connection per backend database/user pair (`PGPOOL_MAX_POOL`). This constrains the pool to a maximum of 10 * 1 = 10 connections to a single backend node. If no worker is available to process a query, it is queued and processed once a worker becomes free.
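The sizing rule above can be sketched with a small calculation (illustrative only, not Pgpool-II code; the variable names mirror the environment variables):

```python
# How the connection ceiling is derived from the pool settings above.
num_init_children = 10  # PGPOOL_NUM_INIT_CHILDREN: worker processes
max_pool = 1            # PGPOOL_MAX_POOL: cached connections per worker and backend/user pair

# Each worker holds at most `max_pool` connections to a given backend node,
# so a single backend never sees more than this many connections from one Pgpool-II:
max_connections_per_backend = num_init_children * max_pool
print(max_connections_per_backend)  # 10
```

This ceiling matters later when running multiple Pgpool-II instances, since each instance opens its own pool against the same backends.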
Use the following commands to start the cluster, connect to it, seed some data, and execute a few queries:
```
docker-compose up -d
psql -U postgres -h localhost -p 5432 -d test_db

test_db=# CREATE TABLE users (id SERIAL PRIMARY KEY, email TEXT NOT NULL);
test_db=# INSERT INTO users (email) VALUES ('email@example.com'), ('firstname.lastname@example.org'), ('email@example.com');
test_db=# SELECT * FROM users;
... repeat a couple of times
test_db=# SELECT * FROM users;
```
Now check the Pgpool-II stats to see that load-balancing works as expected:
```
test_db=# SHOW pool_nodes;
 node_id | hostname  | port | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+-----------+------+--------+-----------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0       | pgprimary | 5432 | up     | 0.400000  | primary | 22         | false             | 0                 |                   |                        | 2021-01-12 12:05:11
 1       | pgreplica | 5432 | up     | 0.600000  | standby | 43         | true              | 0                 |                   |                        | 2021-01-12 12:05:11
(2 rows)
```
`SHOW pool_nodes` is a special SQL-type command accepted by Pgpool-II. Special SQL commands are not forwarded to the backends but are processed by Pgpool-II itself. From the output we can see that Pgpool-II recognises two nodes (`pgprimary` and `pgreplica`). `pgprimary` is the primary node and all writes are sent exclusively to it, while reads are split between both nodes using a 4:6 ratio (see the `lb_weight` column).
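The weighted read split can be illustrated with a short sketch (our simplified model, not Pgpool-II's actual selection code; node names and weights come from the Compose file above):

```python
import random

# Backend weights as configured in PGPOOL_BACKEND_NODES: primary=4, replica=6.
weights = {"pgprimary": 4, "pgreplica": 6}
total = sum(weights.values())

# Expected share of read queries each node receives after normalisation,
# matching the lb_weight column in SHOW pool_nodes (0.4 and 0.6):
shares = {node: w / total for node, w in weights.items()}
print(shares)  # {'pgprimary': 0.4, 'pgreplica': 0.6}

# Picking a target node for a single read statement, proportionally to weight:
node = random.choices(list(weights), weights=list(weights.values()))[0]
```

Over many statements the observed `select_cnt` values converge on these shares, which is what the 22/43 split in the output above roughly shows.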
NOTE: In this example we use the Bitnami Docker images, as we have had good experience with them. However, there is also an official Pgpool-II Docker image created and maintained by the Pgpool-II developers. Feel free to use it, but bear in mind that the configuration options may differ slightly and you may need to adjust your setup accordingly.
Pgpool-II in Kubernetes
As already mentioned, we run on Google Cloud Platform, and the refurbed platform itself is deployed into Google Kubernetes Engine. We selected the components whose SQL queries should be load-balanced across the PostgreSQL cluster and configured them to access the database via Pgpool-II. Generally speaking, there are two ways to deploy Pgpool-II into a Kubernetes cluster:
- Separate service
- Sidecar container
To verify that our Pgpool-II configuration works as expected, we chose to run Pgpool-II as a sidecar container alongside our main backend API service containers. There have been a couple of issues with the sidecar approach: cleanly deploying changes to the API container while ensuring no downtime at deployment time took some effort in tweaking the Pgpool-II settings. Furthermore, if we used the Pgpool-II sidecar in more services and scaled them by adding replicas, we would end up with too many connections to our PostgreSQL servers. We are currently evaluating running Pgpool-II as a separate service.
NOTE: If you want to see how to configure Pgpool-II as a service, please refer to this excellent blog post authored by one of the Pgpool-II core development team members.
Using the sidecar approach, our current deployment manifest for the main API service looks as follows:
```yaml
kind: Deployment
apiVersion: apps/v1
metadata:
  name: api
  labels:
    service: api
spec:
  replicas: 4
  selector:
    matchLabels:
      service: api
  template:
    metadata:
      labels:
        service: api
    spec:
      containers:
        ...
        - name: api
          ...
          env:
            ...
            - name: API_POSTGRES_HOST
              value: localhost
            - name: API_POSTGRES_PORT
              value: "5432"
            ...
          lifecycle:
            preStop:
              exec:
                command: [ "sh", "-c", "sleep 12" ]
          ...
        - name: pgpool
          image: bitnami/pgpool:4.1.5
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 5432
              protocol: TCP
          lifecycle:
            preStop:
              exec:
                command: [ "sh", "-c", "sleep 15" ]
          env:
            - name: PGPOOL_BACKEND_NODES
              value: "0:PRIMARY_IP_ADDR:5432:4:primary:ALWAYS_PRIMARY|DISALLOW_TO_FAILOVER,1:REPLICA_IP_ADDR:5432:6:repl01:DISALLOW_TO_FAILOVER"
            - name: PGPOOL_POSTGRES_USERNAME
              valueFrom:
                secretKeyRef:
                  name: credentials
                  key: postgres.username
            - name: PGPOOL_POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: credentials
                  key: postgres.password
            - name: PGPOOL_SR_CHECK_PERIOD
              value: "0"
            - name: PGPOOL_HEALTH_CHECK_USER
              valueFrom:
                secretKeyRef:
                  name: credentials
                  key: postgres.username
            - name: PGPOOL_HEALTH_CHECK_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: credentials
                  key: postgres.password
            - name: PGPOOL_NUM_INIT_CHILDREN
              value: "13"
            - name: PGPOOL_MAX_POOL
              value: "1"
            - name: PGPOOL_HEALTH_CHECK_PERIOD
              value: "5"
            - name: PGPOOL_HEALTH_CHECK_MAX_RETRIES
              value: "20"
            - name: PGPOOL_HEALTH_CHECK_RETRY_DELAY
              value: "1"
            - name: PGPOOL_ENABLE_LOAD_BALANCING
              value: "yes"
            - name: PGPOOL_ENABLE_STATEMENT_LOAD_BALANCING
              value: "yes"
            - name: PGPOOL_ADMIN_USERNAME
              value: "iamnotused"
            - name: PGPOOL_ADMIN_PASSWORD
              value: "s3cret"
          readinessProbe:
            tcpSocket:
              port: 5432
            initialDelaySeconds: 5
            periodSeconds: 2
          livenessProbe:
            tcpSocket:
              port: 5432
            initialDelaySeconds: 10
            periodSeconds: 2
      ...
```
Similarly to the Docker Compose example, we need to configure Pgpool-II with the PostgreSQL backend nodes. The `PGPOOL_BACKEND_NODES` config option has two parts, each being the configuration of a single backend node. The node configurations are separated by a comma, and the individual config options within each part by a colon. Let's break down each option of the primary node:
- `0`: index of the primary backend node
- `PRIMARY_IP_ADDR`: IP address of the primary backend node
- `5432`: port the primary backend node listens on
- `4`: primary backend node weight
- `primary`: primary backend node name
- `ALWAYS_PRIMARY|DISALLOW_TO_FAILOVER`: primary backend node flags
We use the `ALWAYS_PRIMARY` flag to let Pgpool-II know which node is the primary (remember, our Cloud SQL setup uses streaming replication and the primary node is fixed), so it does not need to inspect the backends. We also use the `DISALLOW_TO_FAILOVER` backend node flag (for the `primary` as well as the `repl01` node), as failover is managed by Cloud SQL.
The replica node options follow the same pattern:

- `1`: index of the replica backend node
- `REPLICA_IP_ADDR`: IP address of the replica backend node
- `5432`: port the replica backend node listens on
- `6`: replica backend node weight
- `repl01`: replica backend node name
- `DISALLOW_TO_FAILOVER`: replica backend node flags
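To make the colon/comma format concrete, here is a hypothetical parser (not part of Pgpool-II, written just for illustration) applied to the value from the manifest above:

```python
# Hypothetical helper that parses the PGPOOL_BACKEND_NODES format:
# comma-separated node entries, colon-separated fields within each entry.
def parse_backend_nodes(value):
    nodes = []
    for entry in value.split(","):
        index, host, port, weight, name, flags = entry.split(":")
        nodes.append({
            "index": int(index),
            "host": host,
            "port": int(port),
            "weight": int(weight),
            "name": name,
            "flags": flags.split("|"),  # multiple flags are pipe-separated
        })
    return nodes

nodes = parse_backend_nodes(
    "0:PRIMARY_IP_ADDR:5432:4:primary:ALWAYS_PRIMARY|DISALLOW_TO_FAILOVER,"
    "1:REPLICA_IP_ADDR:5432:6:repl01:DISALLOW_TO_FAILOVER"
)
print(nodes[0]["flags"])  # ['ALWAYS_PRIMARY', 'DISALLOW_TO_FAILOVER']
print(nodes[1]["name"])   # repl01
```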
The `PGPOOL_HEALTH_CHECK_*` config options tweak Pgpool-II's health check mechanism, which periodically connects to the configured backends to detect errors on the servers or in the network. `PGPOOL_HEALTH_CHECK_PERIOD` ensures the check is performed every 5 seconds (`0` means health checks are disabled), `PGPOOL_HEALTH_CHECK_RETRY_DELAY` specifies a 1-second wait between failed health check attempts, and `PGPOOL_HEALTH_CHECK_MAX_RETRIES` specifies that at most 20 retries are performed before giving up and marking the failed node as down. Such nodes are no longer considered for load balancing.
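A back-of-the-envelope estimate (our simplified reading of these settings, not an exact model of Pgpool-II's internals) shows how long a dead backend could go undetected:

```python
# Health check settings from the manifest above.
health_check_period = 5        # seconds between periodic checks
health_check_max_retries = 20  # retries before the node is marked down
health_check_retry_delay = 1   # seconds between failed retries

# Rough worst case: the node dies right after a successful check, so Pgpool-II
# waits one full period and then burns through all retries before marking it down
# (ignoring per-attempt connect timeouts, which add to this in practice).
worst_case_seconds = health_check_period + health_check_max_retries * health_check_retry_delay
print(worst_case_seconds)  # 25
```

Tightening the period or retry count shortens this window, at the cost of more check traffic and a higher chance of flagging a node down on a transient blip.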
Also, notice the containers' lifecycle `preStop` hooks. We put these in place to help ensure the API container shuts down before Pgpool-II does. Pgpool-II implements a graceful shutdown (called smart shutdown) in which it waits for all clients to disconnect. However, in our Kubernetes environment we occasionally experienced a situation where, on shutdown, the API container (which also implements a graceful shutdown) blocked Pgpool-II's shutdown until `terminationGracePeriodSeconds` kicked in and the containers were forcefully killed. We found that cascading the shutdown procedure, using different `preStop` hook values, got rid of these situations (although we never found the root cause).
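The intended ordering can be sketched as a simple timeline (timings taken from the manifest above; Kubernetes runs the `preStop` hook before sending SIGTERM to each container):

```python
# preStop sleep values from the deployment manifest.
api_prestop_sleep = 12     # API container: sleep, then receive SIGTERM and drain
pgpool_prestop_sleep = 15  # Pgpool-II container: sleeps longer

# Both hooks start at roughly the same time when the Pod terminates, so
# Pgpool-II keeps serving for about this many seconds after the API begins
# its graceful shutdown, letting in-flight queries finish first:
head_start = pgpool_prestop_sleep - api_prestop_sleep
print(head_start)  # 3
```

Both sleeps must still fit within `terminationGracePeriodSeconds`, otherwise the kubelet force-kills the containers regardless.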
Our journey was quite straightforward and smooth, but we hit some bumps along the way. Our goal of enabling query load balancing and read-write splitting without touching the code was not met completely. Pgpool-II ensures that load balancing does not happen for queries executed within a transaction; all of them are sent to the primary node. This is very useful where we need to guarantee consistency in read-after-write sequences: putting these sequences in a transaction was a small code change that gave us consistent reads in those cases. Unfortunately, that was not enough, and after a short while we realised there were still occasional stale reads. Our investigation revealed that they were happening in critical places such as some of our webhook callback handlers. These are logically coupled handlers where a write in the first webhook triggers another webhook whose handler expects to read exactly the data written by the first one. We again needed to ensure that these parts of the codebase read only from the primary node; once we did, the issues were fixed.
We set out to transparently load-balance SQL queries between our PostgreSQL cluster nodes and to split reads and writes between the primary and replica nodes without touching the code. As far as we know, Pgpool-II is the only way to achieve this goal (but if you know of any open-source projects that do it better, we are open to ideas). Generally speaking, it was a smooth ride, though we encountered a few bumps. The main ones concerned configuration: it is not trivial to configure Pgpool-II properly, and while the documentation helps, it could be improved with more targeted examples and full-text search. Pgpool-II does not seem to be widely adopted, and most articles available online are either obsolete or lack a comprehensive explanation of the config options and values. Fortunately, things are improving with recent blog posts from Bo Peng, a Pgpool-II development team member; these articles are well written and explain various Pgpool-II capabilities in a simple and clear way.
January 29, 2021
Michal is leading our architecture and infrastructure development.