Thursday, December 23, 2010

Quick and Dirty Reporting with Awk

A few years ago, I wrote about starting to use awk after a long time. Over the last couple of years, I've used awk on and off, although not very frequently, and never beyond the basic pattern described in that post.

Last week, I described an AspectJ aspect that wrote out timing information into the server logs. I initially thought of using OpenOffice Calc, but then stumbled upon Jadu Saikia's post on implementing group-by with awk, and I realized that awk could do the job just as easily, and would be easier to use (for multiple invocations).

Here is the script I came up with.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
#!/usr/bin/env awk -f
# Builds a report out of aggregate elapsed time data similar to the Spring 
# StopWatch prettyPrint report.
# Usage:
# my_profile_report.awk [-v qt=query_term] filename
#
BEGIN {
  FS="|";
}
(qt == "" || qt == $3) {
  counts[$4]++;
  times[$4] += $5;
  if (mins[$4] + 0 != 0) {
    if (mins[$4] > $5) {
      mins[$4] = $5;
    }
  } else {
    mins[$4] = $5;
  }
  if (maxs[$4] + 0 != 0) {
    if (maxs[$4] < $5) {
      maxs[$4] = $5;
    }
  } else {
    maxs[$4] = $5;
  }
  totals += $5;
}
END {
  totavg = 0;
  for (t in times) {
    totavg += times[t]/counts[t];
  }
  printf("---------------------------------------------------------------------\n");
  printf("%-32s %8s %8s %8s %8s\n", "Controller", "Avg(ms)", "%", "Min(ms)", "Max(ms)");
  printf("---------------------------------------------------------------------\n");
  for (t in times) {
    avg = times[t]/counts[t];
    perc = avg * 100 / totavg;
    printf("%-32s %8.2f %8.2f %8.2f %8.2f\n", t, avg, perc, mins[t], maxs[t]);
  }
  printf("---------------------------------------------------------------------\n");
  printf("%-32s %8.2f\n", "Average Elapsed (ms):", totavg);
  printf("---------------------------------------------------------------------\n");
}

As shown in my previous post, the input file looks something like this (without the header, I just grep the server log file with the aspect name).

1
2
3
4
5
6
# aspect_name|request_uuid|query|tile_name|elapsed_time_millis
MyAspect2|ed9ce777-263e-4fe5-a8af-4ea84d78add3|some query|TileAController|132
MyAspect2|ed9ce777-263e-4fe5-a8af-4ea84d78add3|some query|TileBController|49
MyAspect2|ed9ce777-263e-4fe5-a8af-4ea84d78add3|some query|TileCController|2
MyAspect2|ed9ce777-263e-4fe5-a8af-4ea84d78add3|some query|TileDController|3
...

The script can be invoked with or without a qt parameter. Without parameters, the script computes the average, minimum and maximum elapsed times across all the queries that were given to the application during the profiling run. The report looks like this:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
sujit@cyclone:aspects$ ./my_profile_report.awk input_file
---------------------------------------------------------------------
Controller                        Avg(ms)        %  Min(ms)  Max(ms)
---------------------------------------------------------------------
TileFController                      1.42     0.02     0.00    42.00
TileAController                    187.67     2.83    27.00   685.00
TileJController                    169.97     2.56     7.00  3140.00
TileEController                      9.14     0.14     2.00    44.00
TileNController                     45.91     0.69     4.00   234.00
TileIController                    444.91     6.71     0.00  3427.00
TileDController                   1506.30    22.72   792.00 12140.00
TileMController                    184.88     2.79     0.00  3078.00
TileHController                     13.67     0.21     7.00    50.00
TileCController                     34.06     0.51    14.00   108.00
TileLController                    759.73    11.46     3.00  9921.00
TileGController                   2473.24    37.31   579.00 10119.00
TileBController                     24.48     0.37     7.00   132.00
TileKController                    773.97    11.67     0.00  8554.00
---------------------------------------------------------------------
Average Elapsed (ms):             6629.35
---------------------------------------------------------------------

If we want only the times for a specific query term, then we can specify it on the command line as shown below. If the query has been invoked multiple times, then the report will show the average of the calls for the query. In this case, there is only a single invocation for the query, so the minimum and maximum times are redundant.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
sujit@cyclone:aspects$ ./my_profile_report.awk -v qt=some_query input_file
---------------------------------------------------------------------
Controller                        Avg(ms)        %  Min(ms)  Max(ms)
---------------------------------------------------------------------
TileFController                      0.00     0.00     0.00     0.00
TileAController                    106.00     4.57   106.00   106.00
TileJController                      7.00     0.30     7.00     7.00
TileEController                     10.00     0.43    10.00    10.00
TileNController                      8.00     0.34     8.00     8.00
TileIController                      0.00     0.00     0.00     0.00
TileDController                   1309.00    56.42  1309.00  1309.00
TileMController                      0.00     0.00     0.00     0.00
TileHController                     12.00     0.52    12.00    12.00
TileCController                     25.00     1.08    25.00    25.00
TileLController                     64.00     2.76    64.00    64.00
TileGController                    767.00    33.06   767.00   767.00
TileBController                     12.00     0.52    12.00    12.00
TileKController                      0.00     0.00     0.00     0.00
---------------------------------------------------------------------
Average Elapsed (ms):             2320.00
---------------------------------------------------------------------

The script uses awk's associative arrays and the post processing block. Before this script, I had read about awk's BEGIN/END blocks but didn't know why I would want to use them, and I didn't know about awk's associative arrays at all. Thanks to Daniel Robbin's three part series - Common threads: Awk by example, Part 1, 2 and 3, I now have a greater understanding and appreciation of awk.

2 comments (moderated to prevent spam):

Ashwin Jayaprakash said...

You can also use an in-mem database like H2 to import the CSV directly into a table and run all the aggregates and sorts you need - http://www.h2database.com/html/functions.html?highlight=CSVREAD&search=csv#csvread

Sujit Pal said...

Thanks Ashwin, did not know about H2, this is definitely a much nicer approach (more flexible) than the one I described. I read up a bit about H2, it looks very promising, thanks for the pointer.