Tuesday 2 September 2014

Big data analytics with grep and sed

This was written a couple of years ago, when a gigabyte of logs could reasonably be called "large scale". Nevertheless, the approach still works for many log monitoring tasks.

Part 1.

A few jobs ago I was on a team tasked with creating a system to monitor security events for a Swiss bank with 20k+ (all physical!) Unix hosts plus 60k Windows desktops and servers.

Hadoop did not yet exist outside of Yahoo's research labs, so we had to make do with MySQL, a bit of C, and Perl. Incidentally, the resulting system kept chugging along for at least 5 years, if not 10, before finally succumbing to the likes of ArcSight.

We already had a centralised syslog setup, which produced tons of logs. To start with, we needed to come up with:
  • a list of things to ignore, 
  • a list of things to definitely alert on, 
  • and an anomaly detection algorithm. 
This turned out to be simpler than it sounded (hurray for 20/20 hindsight!), thanks to some ideas borrowed from Marcus Ranum's "Artificial Ignorance" and "Never Before Seen".
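The gist of "artificial ignorance" fits in a few lines of shell. A toy sketch, with invented file names and log lines — anything the stoplist does not match is, by definition, novel and worth a human look:

```shell
# Known-boring patterns, one regex per line (contents invented for illustration).
cat > stoplist.txt <<'EOF'
session opened for user root
CRON
EOF

# A made-up log sample.
cat > sample.log <<'EOF'
session opened for user root by (uid=0)
CRON[123]: pam_unix(cron:session): session closed for user root
Failed password for invalid user admin from 192.0.2.7
EOF

# Survivors are the candidate anomalies; in a real setup they would be
# counted, ranked, and alerted on.
grep -v -f stoplist.txt sample.log
```

Here only the failed login survives the filter; everything matched by the stoplist is silently ignored.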

Part 2.

Back at my current job, I'd spent a couple of hours digging into auth.log from our (then) new and shiny SaaS infrastructure manager node (~1 GB), and it boiled down nicely to 38k of "interesting" events. Once you've created your own stoplist, you can leave the filter running in a terminal in the background. It produces 10-20 messages an hour:
  • Manual (as opposed to scripted) ssh logins,
  • Unscheduled scripts,
  • Ad-hoc fix work,
  • Auth failures,
  • Various unique ("never before seen") alarms.
I'm not including a sample of the filtered results - try this at home instead.

What I did was a dozen iterations of

cat auth.* | grep -v -f stoplist.txt | sed -f sedcommands.txt | sort | uniq -c | sort -r -n > stuff

starting with an empty stoplist and the following sedcommands.txt:

s/^.\{15\}\ [a-z0-9.-]*\ //

This cuts off the timestamp and hostname at the start of each log line, so that similar lines from different hosts collapse together in the uniq -c counts. Looking at the sorted results, I add a few lines to the stoplist to get rid of the entries least interesting from a security/anomaly point of view (usually the top hits), then rinse and repeat.
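To make the sed step concrete, here is what it does to a single made-up syslog line (the hostname and address are invented):

```shell
# A made-up syslog line: the first 15 characters are the timestamp,
# followed by a space and the hostname.
line='Sep  2 10:15:01 web-01 sshd[1234]: Accepted publickey for deploy from 192.0.2.5'
printf '%s\n' "$line" | sed 's/^.\{15\}\ [a-z0-9.-]*\ //'
# -> sshd[1234]: Accepted publickey for deploy from 192.0.2.5
```

With the timestamp and hostname gone, identical events from any host at any time become identical lines.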

The final stoplist (slightly censored):

last message repeated .* times
Accepted publickey for .* from 59\.167\.XX\.XX
Accepted publickey for .* from 10\.65\..*
Accepted publickey for .* from 207\.223\.XX\.XX
session opened for user root by (uid=0)
session opened for user postgres by (uid=0)
session opened for user YYYYYYYY by (uid=0)
session opened for user www-data by (uid=0)
session closed for user
Received disconnect from .*: 11: disconnected by user
Accepted publickey for YYYYYYYY from 67\.221\.XX\.XX
USER=root ; COMMAND=/usr/lib/nagios/plugins/uc_.*
USER=root ; COMMAND=/usr/share/pyshared/pycm/bin/nagios/
USER=root ; COMMAND=/usr/lib/container-manager/.*
USER=root ; COMMAND=/opt/.*
USER=root ; COMMAND=/usr/bin/ZZZZZZ
PWD=/usr/share/pyshared/pycm/bin/vz ; USER=root ;
PWD=/opt/unicorn-repl-tools/var/queue ; USER=root ; COMMAND=/bin/rm \./.*\.net
PWD=/opt/unicorn-repl-tools/var/queue ; USER=root ; COMMAND=/bin/rm \./.*\.com
PWD=/home/hal-uc-1 ; USER=root ; COMMAND=/usr/share/pyshared/pycm/
running '/sbin/halt' with root privileges on behalf of 'root'

This is only for illustration - do not copy it verbatim; the point of this exercise is to come up with a stoplist that best fits your own environment. If you deal with extremely complex logfiles, you could throw in some AWK :)
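As a hypothetical example of where awk earns its keep: counting events per program name (field 5 in classic syslog format) instead of per whole line. The sample lines are invented:

```shell
# A made-up log sample in classic syslog format.
cat > sample.log <<'EOF'
Sep  2 10:15:01 web-01 sshd[1234]: Accepted publickey for deploy
Sep  2 10:16:44 web-01 sshd[1240]: Failed password for root
Sep  2 10:17:02 web-01 CRON[99]: pam_unix(cron:session): session opened
EOF

# Field 5 is "program[pid]:"; strip everything from "[" to get the bare
# program name, then count occurrences per program.
awk '{ sub(/\[.*/, "", $5); counts[$5]++ }
     END { for (p in counts) print counts[p], p }' sample.log | sort -r -n
# -> 2 sshd
#    1 CRON
```

A field-based summary like this survives changes in message text that would defeat a purely line-based sort | uniq -c.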

All in all, it took about two hours from getting the logs to producing the summary, including remembering how this stuff was done.