Visualizing performance from the server logs

by Oleg Kikin

This is the visualization system I created for the company I work for. We use it in production, and it has proven very useful for tracking down performance issues.

The system is based on a custom Apache LogFormat. We have the %D format directive ("The time taken to serve the request, in microseconds") enabled, so request time is measured by the server itself.
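To show what reading such a log might look like, here is a minimal sketch. It is written in Python rather than the PHP the real system uses, and the LogFormat string, the regular expression, and the sample line are all assumptions made for illustration; the actual format we use is not shown in this post. It assumes %D is the last field on the line and converts it from microseconds to milliseconds.

```python
import re

# Hypothetical LogFormat (an assumption, not the real one); %D is appended last:
#   LogFormat "%h %l %u %t \"%r\" %>s %b %D" timed
LINE_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) (?P<usec>\d+)$'
)

def parse_line(line):
    """Return (timestamp string, duration in milliseconds), or None if no match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    # %D is reported in microseconds; the charts use milliseconds.
    return m.group("time"), int(m.group("usec")) / 1000.0

# Made-up sample line for the hypothetical format above.
sample = ('203.0.113.7 - - [10/Oct/2012:13:55:36 -0700] '
          '"GET /index.php HTTP/1.1" 200 2326 48213')
print(parse_line(sample))
```

Lines that do not match the expected format return None, so a caller can skip them instead of failing on a malformed entry.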

The visualization system grabs server logs for various apps hourly, builds charts from them, and saves them as PNG files. A simple page then lets us browse the charts.

The code is written in PHP with the GD library.

There are three things displayed on these graphs:

Each black dot is a request. The horizontal axis is time. The vertical axis (logarithmic) is the number of milliseconds the request took.
Red is the average request time, smoothed.
Cyan is the number of requests. It is drawn on a linear scale with no labels, and was added only to make traffic spikes visible.
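The three layers described above can be sketched as three small calculations. This is a Python illustration, not the PHP/GD code itself, and the details are assumptions: the plotted range of 1 ms to 100,000 ms, the one-minute buckets, and the centered moving average are stand-ins, since the post does not say how the real charts bucket or smooth the data.

```python
import math
from collections import defaultdict

def log_y(ms, height, min_ms=1.0, max_ms=100000.0):
    """Map a duration in ms to a pixel row on a log axis; row 0 is the top."""
    ms = min(max(ms, min_ms), max_ms)  # clamp into the plotted range
    span = math.log10(max_ms) - math.log10(min_ms)
    frac = (math.log10(ms) - math.log10(min_ms)) / span
    return round((height - 1) * (1.0 - frac))

def bucket(requests, bucket_seconds=60):
    """Group (unix_time, ms) pairs; return {bucket_start: (count, mean_ms)}."""
    groups = defaultdict(list)
    for t, ms in requests:
        groups[int(t) // bucket_seconds * bucket_seconds].append(ms)
    return {start: (len(v), sum(v) / len(v)) for start, v in sorted(groups.items())}

def smooth(values, window=5):
    """Centered moving average; the window shrinks at both edges."""
    half = window // 2
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out

# Black dots: one log_y() row per request.
# Cyan: the per-bucket counts. Red: the smoothed per-bucket means.
stats = bucket([(0, 10.0), (30, 30.0), (60, 100.0)])
counts = [c for c, _ in stats.values()]
means = smooth([m for _, m in stats.values()], window=3)
print(counts, means)
```

The log axis is what keeps a 50 ms request and a 20-second request readable on the same chart; on a linear axis the fast requests would collapse into a single row of pixels.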

Examples of charts:

Normal app behavior:

[Chart: normal app behavior]

Traffic Spikes:

Here you can see two types of spikes. The one on the left hit something that was not cached well, and the app became slightly slower for a short period of time.

The spike on the right is more like an SEM hit produced by our ad-ops team. It hit pages that were cached, so we got more cache hits than usual, and the average request time (red line) went down for the duration of the spike.
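The reason more cache hits pull the average down can be shown with a little arithmetic. The numbers in this Python sketch are entirely hypothetical (20 ms for a cached page, 400 ms for an uncached one, and the two hit ratios); they are not measurements from our logs.

```python
def mean_ms(hit_ratio, hit_ms=20.0, miss_ms=400.0):
    """Average request time as a weighted mix of cache hits and misses."""
    return hit_ratio * hit_ms + (1.0 - hit_ratio) * miss_ms

# Hypothetical hit ratios: ordinary traffic versus a spike on cached pages.
normal = mean_ms(0.70)
spike = mean_ms(0.90)
print(round(normal, 1), round(spike, 1))
```

Under these assumed numbers the average falls by more than half during the spike even though the servers are handling more requests, which matches the dip in the red line.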

[Chart: two traffic spikes]

Traffic Spike that the servers could not handle:

Here our app was hit with a traffic spike and became slower and slower until the servers stopped responding, resulting in a 9-minute outage.

[Chart: traffic spike followed by an outage]

Drupal data export layer:

Here's a graph of our Drupal export app (it exports Drupal nodes in XML format). We deployed Varnish to alleviate the server load. As you can see, the number of requests reaching the app dropped significantly.

[Chart: Drupal export app after deploying Varnish]

Fast ad-serving app (with a huge traffic spike):

This app is much faster: the average response time is around 50 ms, and it handles traffic spikes pretty well.

[Chart: ad-serving app during a traffic spike]