CloudFlare 522 Error. Server does not respond.

This is probably the longest time I have spent on debugging something (4 days) so its worth the effort to write about it.

The problem: We built a wordpress site but got CloudFlare 522 error when trying to connect to it. Our logs were completely empty.

The Setup: Our setup consisted of a VM in a private network with no public IP. This VM ran 3 docker containers connected together in a user-defined bridge network. The first container was an nginx server which would forward requests for PHP resources to the wordpress container. The third container was running MySQL 5.7. Nginx server configuration was borrowed from here. If the VM has no public IP how does one access it? The VM was connected to a layer-4 load balancer in a DMZ with a public IP. Further the load balancer would only accept traffic from CloudFlare CDN. The purpose of CloudFlare was to filter out malicious traffic and protect the servers.

CloudFlare -> Load Balancer -> NGINX -> WordPress -> MySQL

How we debugged: We tried all the usual things. We tested that the host is forwarding http (port 80) and https (port 443) traffic to the nginx container.

$ docker ps
CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS                                      NAMES
ec8600d9c4b4        nginx:1.17          "nginx -g 'daemon of…"   24 hours ago        Up 24 hours         0.0.0.0:80->80/tcp, 0.0.0.0:443->443/tcp   nginx
feb535607eb2        wordpress:php7.4-fpm-alpine       "docker-entrypoint.s…"   25 hours ago        Up 25 hours         9000/tcp                                   wordpress
48b2dea3706b        mysql:5.7           "docker-entrypoint.s…"   25 hours ago        Up 25 hours         0.0.0.0:3306->3306/tcp, 33060/tcp          mysql

We tested that wordpress container is listening on port 9000

$ docker exec wordpress netstat -tpln
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 127.0.0.11:40662        0.0.0.0:*               LISTEN      -
tcp        0      0 :::9000                 :::*                    LISTEN      1/php-fpm.conf)

We tested that nginx container can connect to wordpress container

$ docker exec -it nginx /bin/bash
root@ec8600d9c4b4:/# apt-get install netcat
root@ec8600d9c4b4:/# nc -zv wordpress 9000
DNS fwd/rev mismatch: wordpress != wordpress.wordpress_net
wordpress [172.18.0.3] 9000 (?) open

We edit our wp-config.php to enable debug logs in wordpress

// Enable WP_DEBUG mode
define( 'WP_DEBUG', true );

// Enable Debug logging to the /wp-content/debug.log file
define( 'WP_DEBUG_LOG', true );

// Disable display of errors and warnings
define( 'WP_DEBUG_DISPLAY', false );
@ini_set( 'display_errors', 0 );

// Use dev versions of core JS and CSS files (only needed if you are modifying these core files)
define( 'SCRIPT_DEBUG', true );

We then tailed the following logs:

$ docker logs -f nginx
$ docker logs -f wordpress
$ docker exec wordpress tail -f /var/www/html/wp-content/debug.log

Logs were empty.

Next thing we tried was to run a bare-bones nginx server without the wordpress setup. So just run

$ docker run -d -p 80:80 nginx:alpine

Now the error vanished! This led us to believe that CloudFlare was working properly. So the problem had to be with wordpress. But then why was there no error in the logs? Without anything in the logs to give a clue, we were stuck for a long time. Then it dawned on us to access the VM directly using its private IP from the VPN and lo and behold the server responded and wordpress loaded up! So it couldn’t be a problem with wordpress!

When 2+2 does not equal 4: If CloudFlare is working properly as well as WordPress, then we are led to a logical contradiction and the 522 cannot be explained.

We then contacted CloudFlare when we were out of moves and they told us that all requests to the site were timing out leading them to believe there is something blocking requests from CloudFlare’s IP

Source IP: Y.Y.Y.Y
nc: connect to X.X.X.X port 443 (tcp) timed out: Operation now in progress
[exit code 1]
Source IP: Y.Y.Y.Y
nc: connect to X.X.X.X port 443 (tcp) timed out: Operation now in progress
[exit code 1]
Source IP: Y.Y.Y.Y
nc: connect to X.X.X.X port 443 (tcp) timed out: Operation now in progress
[exit code 1]
Source IP: Y.Y.Y.Y
nc: connect to X.X.X.X port 443 (tcp) timed out: Operation now in progress
[exit code 1]
Source IP: Y.Y.Y.Y
nc: connect to X.X.X.X port 443 (tcp) timed out: Operation now in progress
[exit code 1]

But we knew there was nothing blocking CloudFlare as the requests did succeed when we ran a bare-bones NGINX server.

Finally, the light struck us on the 4th day. The load balancer in Azure uses health probes to know if a machine is healthy. It construes any 200 response as indication of an unhealthy server. This is all documented here

An HTTP / HTTPS probe fails when:

Probe endpoint returns an HTTP response code other than 200 (for example, 403, 404, or 500). This will mark down the health probe immediately.

but it was the first time I was using a load balancer up close and personal and I didn’t know this. The endpoint it was using to test was / to which nginx was responding with a redirect 302. As soon as I changed the health probe to a custom endpoint to which I configured NGINX to return 200 the error vanished!

    # this special section is for the load balancer.
    # The elastic load balancer in azure needs to know if a machine is healthy. 
    # We assume it does that by making a request to the /health-probe endpoint.
    # If the load balancer gets a non-200 response it will mark the machine as unhealthy
    # and not send requests to it. 
    location /health-probe {
        return 200 OK;
    }

It was typical example when you need out-of-the-box thinking to fix a problem. For the longest time I kept thinking maybe the problem is in the docker network as I have struggled with it in the past. The fact that CouldFlare worked as well as WordPress (when we hit the VM using its private IP) but then why we were getting was the thing that puzzled me the most. It left us with no clue.

This entry was posted in Software and tagged . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s