It’s not that uncommon to push Web access logs into Elastic/OpenSearch or Splunk, but I personally prefer using SQL for log processing and analysis. There are, of course, other options, like Apache Hive or a fully managed solution like AWS Athena (which is built on Presto/Trino) or Google BigQuery. But for me, in most cases it’s just simpler to push logs into an SQL database. And that’s a completely fine solution unless you have terabytes of logs.

This article describes my preferred way of doing this, which is a pretty opinionated setup, though I try to highlight alternatives where they exist. My preferred setup is nginx with a JSON log format, logging directly to syslog (via a Unix datagram or UDP socket), and ingesting the data into PostgreSQL with syslog-ng. Syslog-ng is not the default syslog daemon in most Linux distros, but most of them ship it, along with all the required plugins, in the distro repository. I usually prefer it over rsyslog, which is the default in most cases, but all of this can be done with rsyslog as well.

So, the first step is adding an nginx JSON log formatter. That’s not a hard requirement: syslog-ng supports parsing Apache-style access logs, and nginx by default escapes the " and \ characters, so you can reliably process and ingest logs with syslog-ng without using JSON. JSON log messages add some data processing overhead, but they have their own benefits as well.

I usually use a formatter like this:

log_format  json escape=json '{'
        '"host":"$host",'
        '"time":"$time_iso8601",'
        '"remote_addr":"$remote_addr",'
        '"request":"$request",'
        '"status":"$status",'
        '"body_bytes_sent":"$body_bytes_sent",'
        '"referer":"$http_referer",'
        '"user_agent":"$http_user_agent",'
        '"request_time":"$request_time",'
        '"upstream_response_time":"$upstream_response_time",'
        '"upstream_addr":"$upstream_addr",'
        '"upstream_status":"$upstream_status",'
        '"protocol":"$server_protocol",'
        '"scheme":"$scheme",'
        '"server_name":"$server_name",'
        '"uri":"$request_uri",'
        '"method":"$request_method"'
    '}';

Add this to your top-level nginx config (inside the http block). For more info, check the log_format manual. You can also add some fields of your own; check the nginx variables index to see which variables are available. Afterward, check your config with the nginx -t command.

Now the next step is a syslog-ng listener and parser configuration. I prefer having a separate syslog listener for nginx. Here’s an example config:

# configure nginx listener on a UNIX datagram socket
source s_nginx {
    unix-dgram("/run/nginx-log.socket"
                   max-connections(16)
                   flags(no-hostname)
    );
};

parser p_nginx_json {
        json-parser(prefix("nginx."));
};

parser nginx_geoip_country_code {
    geoip2(
        "${nginx.remote_addr}",
        prefix("geoip.")
        database("/var/lib/GeoLite2-Country.mmdb")
    );
};

parser nginx_geoip_asn {
    geoip2(
        "${nginx.remote_addr}",
        prefix("geoip.")
        database("/var/lib/GeoLite2-ASN.mmdb")
    );
};

# configure debugging output destination
destination d_nginx_debug {
    file("/var/log/syslog-nginx.log"
        perm(0640)
        template("$(format-json nginx.* geoip.*)\n")
    );
};

# assemble pipeline
log {
        source(s_nginx);
        parser(p_nginx_json);
        parser(nginx_geoip_country_code);
        parser(nginx_geoip_asn);
        destination(d_nginx_debug);
};

For debugging purposes, I added a destination that should help you verify the configuration: it writes nginx logs in JSON format to the /var/log/syslog-nginx.log file. In addition, parsed log messages will include GeoIP information taken from the MaxMind databases (MMDB). This part is completely optional; if you don’t want it, just remove the geoip2 parser blocks and the corresponding parser statements from the log block. Now validate the config with the syslog-ng -s command, and once you’ve made sure the config is valid, reload it with syslog-ng-ctl reload. You can also check the manual for extra options for unix-dgram and json-parser.

Now add the following line to nginx configs to enable logging into the syslog socket:

access_log  syslog:server=unix:/run/nginx-log.socket,nohostname,tag=nginx  json;

You can add it to the global http section (in that case it becomes the default logger) or to individual server sections (in that case it is used only by that virtual host). Verify the config with nginx -t and, if everything is fine, reload with nginx -s reload. You should now see logs in JSON format in the /var/log/syslog-nginx.log file.

There is one thing you should be aware of: pushing logs over a UNIX datagram socket is not reliable. On an overloaded system it can lose messages. That can be a feature, since the web server will never block when writing to this socket, but it also means that if the system can’t keep up with processing messages, some of them will be lost.

Now it’s time to configure PostgreSQL. Syslog-ng supports ingesting data into PostgreSQL, but it has some limitations. Therefore, I usually prefer to handle ingested logs with a PL/pgSQL trigger. This allows me to have more complex data processing on INSERT. But for the beginning, let’s start with a simple configuration. So, we will have a table like this:

CREATE TABLE nginx_syslog (
    id BIGSERIAL PRIMARY KEY,
    ts TIMESTAMPTZ NOT NULL,
    vhost VARCHAR(255) NULL,
    host VARCHAR(255) NULL,
    status VARCHAR(6) NULL,
    ssl VARCHAR(8) NULL,
    protocol VARCHAR(16) NULL,
    "method" VARCHAR(64) NULL,
    uri text NULL,
    user_agent text NULL,
    referer text NULL,
    remote_addr inet NULL,
    request_time VARCHAR(32) NULL,
    body_bytes_sent VARCHAR(32) NULL,
    upstream_status VARCHAR(6) NULL,
    upstream_response_time VARCHAR(32) NULL,
    upstream_addr VARCHAR(64) NULL,
    remote_country_code CHAR(2) NULL,
    asn VARCHAR(16) NULL
);

You might be thinking we are being wasteful: why the heck do we use varchar(64) for the method column, which is intended for the HTTP method, and why is the status column varchar(6) and not some kind of integer type? The main issue with the method column is that while most clients use well-established HTTP methods like GET and POST, there are a bunch of HTTP extensions, like WebDAV and CardDAV, with much longer method strings. Moreover, a remote client can send almost any string as an HTTP method, even one longer than 64 characters. Once we get to the solution with a PL/pgSQL trigger, we will address this wastefulness. For now, let’s proceed with the simple configuration.

Another question you might be asking is why we need two fields for a host: host and vhost. The host field is intended to store the Host HTTP header, while vhost is supposed to store the virtual server name, also known as the virtual host. These columns can differ when a default or wildcard server name is used. So in the host column I’m going to store the $host nginx variable, while in the vhost column I’ll store the $server_name variable.
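For example, once logs are ingested, a query like this (a sketch against the nginx_syslog table defined above) shows where the two values diverge:

SELECT host, vhost, count(*) AS requests
FROM nginx_syslog
WHERE host IS DISTINCT FROM vhost
GROUP BY host, vhost
ORDER BY requests DESC;

Requests that hit a default or wildcard server will show up here with mismatched values.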

Ok, now let’s add this table as a logging destination to the syslog-ng configuration:

destination d_nginx_sql {
  sql(type(pgsql)
  host("localhost") username("syslog") password("yourpassword")
  database("syslog")
  table("nginx_syslog")
  columns(
     "ts timestamptz",
     "vhost varchar(255)",
     "host varchar(255)",
     "status varchar(6)",
     "ssl varchar(8)",
     "protocol varchar(16)",
     "method varchar(64)",
     "uri text",
     "user_agent text",
     "referer text",
     "remote_addr inet",
     "request_time varchar(32)",
     "body_bytes_sent varchar(32)",
     "upstream_status varchar(6)",
     "upstream_response_time varchar(32)",
     "upstream_addr varchar(64)",
     "remote_country_code char(2)",
     "asn varchar(16)"
  )
  values(
        "${nginx.time}",
        "${nginx.server_name}",
        "${nginx.host}",
        "${nginx.status}",
        '$(if ("${nginx.scheme}" == "https") "true" "false" )',
        "${nginx.protocol}",
        "${nginx.method}",
        "${nginx.uri}",
        "${nginx.user_agent}",
        "${nginx.referer}",
        "${nginx.remote_addr}",
        "${nginx.request_time}",
        "${nginx.body_bytes_sent}",
        "${nginx.upstream_status}",
        "${nginx.upstream_response_time}",
        "${nginx.upstream_addr}",
        "${geoip.country.iso_code}",
        "${geoip.autonomous_system_number}"
  ));
};

log {
        source(s_nginx);
        parser(p_nginx_json);
        parser(nginx_geoip_country_code);
        parser(nginx_geoip_asn);
        destination(d_nginx_sql);
        flags(final);
};

I’m also assuming that you can create a PostgreSQL user on your own and set appropriate permissions for that user; don’t forget to replace yourpassword with your actual password 😂. And of course, you can now remove the debug logging destination. Validate the config with syslog-ng -s and, assuming everything is fine, reload it with syslog-ng-ctl reload. If you hit any errors, they will most likely be logged to /var/log/error. Assuming everything is fine, you should see that logs have started being written to the nginx_syslog table:

syslog=# select count(*) from nginx_syslog;
 count
-------
   160
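You can also peek at the most recent rows to confirm the fields are parsed as expected; a quick sanity check might look like:

SELECT ts, host, status, method, uri
FROM nginx_syslog
ORDER BY id DESC
LIMIT 5;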

And now it’s time to deal with data normalization. I prefer to use the PostgreSQL ENUM data type for protocol and separate tables for HTTP method and vhost columns.

-- create enum for http protocols
CREATE TYPE http_protocol AS ENUM
('HTTP/0.9', 'HTTP/1.0', 'HTTP/1.1', 'HTTP/2.0', 'HTTP/3.0');

-- table for http methods
CREATE TABLE http_methods (
    id SMALLSERIAL PRIMARY KEY,
    name VARCHAR(64) NOT NULL UNIQUE
);
-- PL/pgSQL function to get or create http method
CREATE OR REPLACE FUNCTION get_or_create_http_method(method VARCHAR(64))
RETURNS SMALLINT AS $$
DECLARE
    method_id smallint;
BEGIN
    method := lower(method);
    SELECT id into method_id FROM http_methods h where lower(h.name) = method;
    IF FOUND THEN
        RETURN method_id;
    ELSE
        INSERT INTO http_methods (name) VALUES (method) ON CONFLICT (name) DO NOTHING;
        SELECT id INTO method_id FROM http_methods h where lower(h.name) = method;
        RETURN method_id;
    END IF;
END;
$$ LANGUAGE plpgsql;

-- table for vhosts
CREATE TABLE http_server_names (
    id SERIAL PRIMARY KEY,
    vhost VARCHAR(255) NOT NULL UNIQUE
);

-- PL/pgSQL function to get or create vhost (same pattern as get_or_create_http_method)
CREATE OR REPLACE FUNCTION get_or_create_vhost(host VARCHAR(255))
RETURNS INT AS $$
DECLARE
    host_id int;
BEGIN
    host := lower(host);
    SELECT id into host_id FROM http_server_names h where lower(h.vhost) = host;
    IF FOUND THEN
        RETURN host_id;
    ELSE
        INSERT INTO http_server_names (vhost) VALUES (host) ON CONFLICT (vhost) DO NOTHING;
        SELECT id INTO host_id FROM http_server_names h where lower(h.vhost) = host;
        RETURN host_id;
    END IF;
END;
$$ LANGUAGE plpgsql;
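To see the get-or-create pattern in action, you can call the functions directly (example.com is just a placeholder hostname here):

-- inserts the value on the first call, returns the existing id afterwards
SELECT get_or_create_vhost('example.com');
-- methods are stored lowercase, so 'GET' becomes 'get'
SELECT get_or_create_http_method('GET');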

-- create table for normalized logs
CREATE TABLE nginx (
    id BIGSERIAL PRIMARY KEY,
    ts TIMESTAMPTZ NOT NULL,
    vhost_id INT REFERENCES http_server_names (id),
    protocol HTTP_PROTOCOL,
    method_id SMALLINT REFERENCES http_methods (id),
    status SMALLINT,
    upstream_status SMALLINT,
    request_time REAL,
    upstream_response_time REAL,
    remote_country_code CHAR(2),
    ssl BOOLEAN NOT NULL,
    asn INTEGER,
    body_bytes_sent BIGINT,
    remote_addr INET NOT NULL,
    upstream_addr VARCHAR(64),
    host VARCHAR(255),
    uri TEXT NOT NULL,
    referer TEXT NOT NULL,
    user_agent TEXT NOT NULL
);

-- PL/pgSQL trigger function that called on insert
CREATE OR REPLACE FUNCTION nginx_syslog_insert_row()
RETURNS TRIGGER AS $$
DECLARE
    method_id smallint;
    vhost_id int;
    status smallint;
    upstream_status smallint;
    request_time REAL;
    upstream_response_time REAL;
    body_bytes_sent BIGINT;
    asn integer;
    protocol http_protocol;
    host text;
BEGIN
    status := NULLIF(NEW.status, '')::smallint;
    upstream_status := NULLIF(NEW.upstream_status, '')::smallint;
    request_time := NULLIF(NEW.request_time, '')::real;
    upstream_response_time := NULLIF(NEW.upstream_response_time, '')::real;
    body_bytes_sent := NULLIF(NEW.body_bytes_sent, '')::bigint;
    asn := NULLIF(NEW.asn, '')::integer;
    protocol := NULLIF(NEW.protocol, '')::http_protocol;
    host := NULLIF(NULLIF(NEW.host, ''), '_');

    -- method_id is null by default
    IF length(NEW.method) > 0 THEN
        method_id := get_or_create_http_method(NEW.method);
    END IF;
    -- vhost_id is NULL by default
    IF NEW.vhost NOT IN ('_', '') THEN
        vhost_id := get_or_create_vhost(NEW.vhost);
    END IF;

    INSERT INTO nginx (
        ts, vhost_id, host, status, ssl, protocol, method_id, uri, user_agent, referer,
        remote_addr, request_time, body_bytes_sent, upstream_status,
        upstream_response_time, upstream_addr, remote_country_code, asn
    ) VALUES (
        NEW.ts, vhost_id, host, status, NEW.ssl::boolean, protocol, method_id, NEW.uri,
        NEW.user_agent, NEW.referer, NEW.remote_addr, request_time, body_bytes_sent,
        upstream_status, upstream_response_time, NEW.upstream_addr,
        NEW.remote_country_code, asn
    );
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

-- register our trigger function to be called on insert
CREATE TRIGGER nginx_insert_normalize BEFORE INSERT ON nginx_syslog
FOR EACH ROW EXECUTE PROCEDURE nginx_syslog_insert_row();
-- I prefer to have consistent IDs for the main methods across many environments
INSERT INTO http_methods (id, name)
VALUES (1, 'get'), (2, 'head'), (3, 'post'), (4, 'put'), (5, 'delete');

So first we create the enum type for the HTTP protocol field. The corresponding column is nullable, since when nginx receives an invalid request it logs an empty protocol field, and this happens pretty often because of bots that try to “hack” a web server by sending various binary garbage. I’m emphasizing this because some people prefer a non-nullable field and add an empty value to the enum type instead. I’m also creating three tables: http_methods, http_server_names, and the main nginx table, which will contain the records. And the main part is the nginx_syslog_insert_row trigger, which is called each time an INSERT runs on the nginx_syslog table (created previously). The trigger returns NULL, cancelling the INSERT on that table, but it inserts the normalized data into the nginx table instead.
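To smoke-test the trigger, you can insert a row into nginx_syslog by hand (all values below are made up) and check that it lands in the nginx table instead:

INSERT INTO nginx_syslog (ts, vhost, host, status, ssl, protocol, method, uri,
    user_agent, referer, remote_addr, request_time, body_bytes_sent,
    upstream_status, upstream_response_time, upstream_addr,
    remote_country_code, asn)
VALUES (now(), 'example.com', 'example.com', '200', 'false', 'HTTP/1.1', 'GET',
    '/', 'curl/8.0', '', '192.0.2.1', '0.001', '512', '', '', '', 'US', '64512');

-- the trigger cancels the insert here, so this count is unchanged
SELECT count(*) FROM nginx_syslog;
-- while the normalized row shows up here
SELECT ts, status, uri FROM nginx ORDER BY id DESC LIMIT 1;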

You might be thinking that this is a lot of SQL for such a simple task, and in fact it can be done more simply, since syslog-ng has a way to set NULL values. But it lacks a straightforward way to handle related values. And honestly, I prefer dealing with data normalization in a trigger, since I can update it live. This allows me to change the database schema without touching the syslog-ng configuration.

Now a few words about performance: if you are going to do heavy analytics on this table, you will need to add indexes. An even better idea might be columnar storage; PostgreSQL has a few extensions that provide it: TimescaleDB and Citus columnar. And yes, you can use TimescaleDB for more than just time-series data. From my own experience, TimescaleDB provides better performance and Citus columnar provides better compression, although the difference in compression is pretty minor. Either way it’s much better than no compression at all, since you can achieve 7-9x data reduction.
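As a starting point for indexing, since most queries below filter by virtual host and time range, something like this is reasonable (the index names are arbitrary; adjust to your own query patterns):

CREATE INDEX nginx_vhost_ts_idx ON nginx (vhost_id, ts);
-- for append-only, time-ordered data a BRIN index on ts is far smaller than a btree
CREATE INDEX nginx_ts_brin_idx ON nginx USING brin (ts);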

At the same time, for large amounts of data a dedicated domain-specific solution might be better, like ClickHouse, which was originally built for web analytics. It uses columnar storage by default and compression that works on columnar data blocks, unlike the per-value TOAST compression available in stock PostgreSQL. Both TimescaleDB and Citus columnar likewise provide block-level compression per column.

And now a few examples of what we can do. If I wanted to check how many requests came to this site’s RSS link for the last week, I could use SQL like this:

SELECT count(*) FROM nginx WHERE
    vhost_id = (SELECT id FROM http_server_names WHERE vhost = 'slava.cc') AND
    uri = '/atom.xml' AND
    ts > (now() - '1 week'::interval);

Or if I want to check per-protocol stats for this website:

SELECT protocol, (count(*) / sum(count(*)) OVER()) * 100 AS percent FROM nginx WHERE
    vhost_id = (SELECT id FROM http_server_names WHERE vhost = 'slava.cc')
    GROUP BY protocol
    ORDER BY percent DESC;
 protocol | percent
----------+---------
 HTTP/1.1 |   80.91
 HTTP/3.0 |   12.20
 NULL     |    3.06
 HTTP/2.0 |    2.56
 HTTP/1.0 |    1.27
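Here’s one more example along the same lines: top countries by request count over the last day:

SELECT remote_country_code, count(*) AS requests
FROM nginx
WHERE ts > now() - '1 day'::interval
GROUP BY remote_country_code
ORDER BY requests DESC
LIMIT 10;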

Anyway, I think you got the idea.