Tuning HTTP Keep-Alive in Node.js

We struggled with some hard-to-track-down ECONNRESET errors in our Node.js proxy service until we dug into how HTTP persistent connections work and what the Node.js HTTP defaults are.

Without keep-alive, an HTTP client opens a new TCP connection for every request. HTTP keep-alive allows clients to reuse a connection for multiple requests, and relies on timeout settings on both the client and the target server to decide when to close idle TCP sockets.

In Node.js clients, you can use a module like agentkeepalive to tell your HTTP/HTTPS clients to use persistent HTTP connections.
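To make socket reuse concrete, here is a minimal sketch that uses Node's built-in http.Agent with keepAlive: true as a stand-in for agentkeepalive (it is not agentkeepalive's API). The throwaway local server and the helper names get and demoReuse are assumptions made up for this illustration.

```javascript
const http = require("http");

// Hypothetical helper: send one GET through `agent` and resolve with
// req.reusedSocket (true when the request went out on a pooled socket).
function get(agent, port) {
  return new Promise((resolve, reject) => {
    const req = http.get({ host: "127.0.0.1", port, agent }, (res) => {
      res.resume(); // drain the body so the socket can go back to the pool
      res.on("end", () => resolve(req.reusedSocket));
    });
    req.on("error", reject);
  });
}

// Start a throwaway local server, send two sequential requests through one
// keep-alive agent, and report whether each request reused a socket.
async function demoReuse() {
  const server = http.createServer((req, res) => res.end("ok"));
  await new Promise((resolve) => server.listen(0, "127.0.0.1", resolve));
  const { port } = server.address();
  const agent = new http.Agent({ keepAlive: true });
  try {
    const first = await get(agent, port);
    // Give the agent a turn of the event loop to return the socket to its pool.
    await new Promise((resolve) => setImmediate(resolve));
    const second = await get(agent, port);
    return { first, second };
  } finally {
    agent.destroy(); // close pooled sockets so the process can exit
    server.close();
  }
}
```

Calling demoReuse() should resolve with first set to false and second set to true, because the second request travels over the socket that the first one opened.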

At ConnectReport, we have a proxy service that acts as the gateway for all requests across our services, with targets including our management API and our core server. We use agentkeepalive in the proxy to reduce latency on connections from the proxy to its targets, but we periodically saw ECONNRESET errors thrown by the proxy. The errors were fairly rare, occurring in about 1 in 1,000 requests. Reproducing them was nearly impossible, and we only really understood how they arose once we did a deep dive into how HTTP persistent connections work.

HTTP persistent connections in a nutshell

Under the hood, HTTP persistent connections use OS-level sockets. In general, a client configures a limited number of sockets and an idle timeout. If nothing is sent over a socket for the duration of the timeout, the client closes it. On the other end of the connection is the target server, which may have its own keep-alive timeout. In Node.js, the server keep-alive timeout (server.keepAliveTimeout) is 5 seconds by default. Node.js also has a headers timeout (server.headersTimeout, which should be about 1 second greater than the keep-alive timeout) that contributes to the persistent connection timeout behavior.
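If you want to see which values your own Node.js version uses, a small sketch like the following reads both settings from a server object. The helper name describeTimeouts is hypothetical, and the printed numbers depend on your Node.js version, so none are shown here.

```javascript
const http = require("http");

// Hypothetical helper: collect a server's keep-alive-related timeouts (ms).
function describeTimeouts(server) {
  return {
    keepAliveTimeout: server.keepAliveTimeout,
    headersTimeout: server.headersTimeout,
  };
}

// The server is created but never started, so this only reads settings.
const exampleServer = http.createServer((req, res) => res.end("ok"));
console.log(describeTimeouts(exampleServer));
```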

502 Errors and their sources in your application

What happened in the real world (and for our product) is that the default free-socket timeout of the widely used agentkeepalive was 15 seconds, much longer than the Node.js server timeout of 5 seconds. How does this look in practice?

Figure: Failed HTTP Keep-Alive connection

What happens in the figure above:

  1. Client requests a resource from the proxy
  2. Proxy opens a reusable TCP socket to the target application with a 15 second timeout
  3. Request is sent down the socket to the target application, and a successful response comes back
  4. More than 5 seconds later, no requests have hit the target application, so the application closes the socket.
  5. The proxy receives another request and believes it has an open connection that it can use. It reuses its reference to the same TCP socket as before, but the socket has already been closed on the target application's end.
  6. The proxy sends the request down the dead socket, and Node.js throws an ECONNRESET error.
  7. The proxy responds to the client with a 502 Bad Gateway error.
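The steps above can be summarized as a small decision function. This is only a simplified model of the reasoning, not real networking: it ignores race windows and TCP details, and the function name, parameter names, and outcome labels are all made up for illustration.

```javascript
// Simplified model of the steps above; it does no real networking.
// All three inputs are hypothetical parameters, in milliseconds.
function nextRequestOutcome({ clientTimeoutMs, serverTimeoutMs, idleMs }) {
  if (idleMs >= clientTimeoutMs) {
    return "new-socket"; // the client already discarded its idle socket
  }
  if (idleMs >= serverTimeoutMs) {
    return "stale-socket"; // the server closed it, the client still trusts it
  }
  return "reused-socket"; // both ends still consider the socket open
}

// The scenario from the list: 15 s client timeout, 5 s server timeout,
// and the next request arriving after 6 s of idle time.
console.log(
  nextRequestOutcome({ clientTimeoutMs: 15000, serverTimeoutMs: 5000, idleMs: 6000 })
); // prints "stale-socket"
```

With a 4-second client timeout the same 6-second gap yields "new-socket", since the client gives up on the idle socket before the server does.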

Getting around ECONNRESET

The solution to the dead socket issue is for your client to have a shorter socket timeout than the target. If your target has a 5 second timeout, and your client has a 4 second timeout, the client will never try to send a request down a dead socket. If your client has a larger timeout than the target, at some point, the client will think it has an open socket that is actually dead.
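One way to keep this rule from drifting is a startup check in the proxy. The sketch below is hypothetical: the function name and the 1-second default margin are assumptions. The first argument would be whatever your client uses as its idle-socket timeout (freeSocketTimeout in agentkeepalive), and the second would be the target's server.keepAliveTimeout.

```javascript
// Hypothetical startup check: throws when the client's idle-socket timeout
// is not at least `marginMs` shorter than the server's keep-alive timeout.
function assertClientTimeoutIsShorter(
  clientFreeSocketTimeoutMs,
  serverKeepAliveTimeoutMs,
  marginMs = 1000 // assumed safety margin, tune for your environment
) {
  if (clientFreeSocketTimeoutMs + marginMs > serverKeepAliveTimeoutMs) {
    throw new RangeError(
      `client timeout ${clientFreeSocketTimeoutMs} ms must be at least ` +
        `${marginMs} ms shorter than server timeout ${serverKeepAliveTimeoutMs} ms`
    );
  }
}

assertClientTimeoutIsShorter(4000, 5000); // passes: 4 s client, 5 s server
```

Passing 15000 and 5000 instead throws a RangeError, which matches the misconfiguration described earlier.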

In response to reports on this issue, agentkeepalive maintainers have smartly (and finally) decided to set the default freeSocketTimeout to 4 seconds in version 4.2.0 of agentkeepalive, released on December 30th 2020.

💡
Tip: Target server HTTP keep alive timeouts must always be greater than client timeouts to prevent sending requests to closed sockets.

If you want to use longer client timeouts, or you're using a version of agentkeepalive earlier than 4.2.0, you can raise the default timeouts on your target Node.js app as follows:

const express = require("express");
const http = require("http");
const app = express();

const server = http.createServer({}, app).listen(3000);

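// This is the important stuff: keep idle sockets open slightly longer than a
// 60-second client timeout, and keep headersTimeout above keepAliveTimeout.
server.keepAliveTimeout = (60 * 1000) + 1000; // 61 seconds
server.headersTimeout = (60 * 1000) + 2000; // 62 seconds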

Notes on AWS ELB 502's with Node.js servers

If you're using an AWS ELB to route traffic to your Node.js apps, as we are, you should know that it has a default idle timeout of 60 seconds. That is greater than the Node.js default of 5 seconds, so the load balancer can intermittently return 502 errors.

To be compatible with the Node.js defaults, set your load balancer's "Idle timeout" attribute to 4 seconds.

Figure: Edit your ELB attributes to set the idle timeout to something compatible with Node.js

Alternatively, you can edit the timeouts on your target Node.js app as indicated in the previous section.