Mon, 14 Sep 2009

Quick Log File Processing with Perl

A common thing to want to do as a sysadmin is match and print text from a file in a particular output format. There are lots of ways to do this using shell tools - grep, sed and awk are used frequently - but I’d like to show you a common Perl idiom for doing this type of task.

Perl was originally designed to be a replacement for the various shell tools, and while it has grown into much more over the years, it is still a great tool to have in your command line toolbox. Here’s an example. Let’s say you want to print the date, time, IP address and URL each time your website is crawled by a Googlebot. The Apache access log will look something like this:

...
10.249.66.234 - - [12/Sep/2009:19:22:51 -0400] "GET /robots.txt HTTP/1.1" 404 424 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
10.249.66.234 - - [12/Sep/2009:19:22:51 -0400] "GET / HTTP/1.1" 200 - "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
...

A quick solution is this, all in one line:

serenity:~# perl -wnle 'print "Googlebot accessed \"$4\" from $1 on $2 at $3" if (/^ (\d+\.\d+\.\d+\.\d+) .+? \[ (.+?) : (.+?) \s .+? GET\s+(.+?)\s+HTTP .+ Googlebot/x)' /var/log/apache2/access.log
Googlebot accessed "/robots.txt" from 10.249.66.234 on 12/Sep/2009 at 19:22:51
Googlebot accessed "/" from 10.249.66.234 on 12/Sep/2009 at 19:22:51
serenity:~#

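There are four command line options used here:

-w  turns on warnings.
-n  wraps the code in a loop that reads the input one line at a time, without printing each line automatically.
-l  strips the trailing newline from each input line and adds one back to everything that is printed.
-e  tells Perl that the next argument is the code to run.

See the perlrun manpage for details; there is much more to Perl’s command line processing.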
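
To make the effect of those switches concrete, here is a rough longhand sketch of what -wnle sets up. The file name sample.log, the temp directory and the second log line are made up for illustration, and the explicit loop is only an approximation of what perl does internally, not its exact expansion.

```shell
# Sample input: the first line is from the post, the second is a
# hypothetical non-Googlebot request added so the filter has something to skip.
tmp=$(mktemp -d)
cat > "$tmp/sample.log" <<'EOF'
10.249.66.234 - - [12/Sep/2009:19:22:51 -0400] "GET /robots.txt HTTP/1.1" 404 424 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
10.249.66.234 - - [12/Sep/2009:19:22:52 -0400] "GET /index.html HTTP/1.1" 200 512 "-" "Mozilla/5.0 (X11; Linux i686)"
EOF

# Short form: -n loops over the input, -l handles newlines, -e supplies the code.
short=$(perl -wnle 'print "hit: $1" if /GET\s+(\S+).+Googlebot/' "$tmp/sample.log")

# Longhand approximation of the same thing without -n and -l.
long=$(perl -e '
    use warnings;                 # roughly what -w asks for
    while (my $line = <>) {       # -n: read every input line in a loop
        chomp $line;              # -l: strip the trailing newline on input
        print "hit: $1\n"         # -l: add the newline back on output
            if $line =~ /GET\s+(\S+).+Googlebot/;
    }
' "$tmp/sample.log")

echo "$short"
echo "$long"
rm -rf "$tmp"
```

Both commands print the same single line, hit: /robots.txt, which is the point: the switches save you from typing the loop and the newline handling every time.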

I build the regular expression by picking a target line and going through it from left to right, adding expressions as I go. I use the /x modifier to make it easier to read; it tells Perl to ignore whitespace in the regexp. I also use Perl’s non-greedy quantifier quite a bit. This is the question mark in expressions like .+? \[, a snippet that matches one or more of any character followed by a left bracket. The question mark ensures that the first such left bracket is matched. Without it, Perl’s regexp engine would happily chomp away to the end of the line and then backtrack, matching the last left bracket it found. The greedy form .+ \[ would also work for us, since there is only one left bracket in each line, but the non-greedy form can be faster on large text files because it avoids that trip to the end of the line and back. (For more info, I encourage you to read Mastering Regular Expressions by Jeffrey Friedl, or start with the Regular Expression Tutorial.)

This method has a few advantages. For one, it relies on just one tool, not a few disparate ones. Perl is portable to many operating systems, so you could use this to parse text files on Windows, for example. You also have the ability to load modules on the command line with the ‘-M’ switch. This gives you access to all of CPAN, potentially a huge time-saver.

posted at: 21:20

Sun, 23 Aug 2009

Move Over, Grep. Hello, Ack

As someone who has been using grep and its variants like egrep for years, I admit they have been insanely useful. But every once in a while something comes along that improves an idea so much, you can’t ignore it. Such a thing is Ack, the grep replacement.

I do a lot of software development in large codebases, and the ability to find snippets of text is paramount. Tags can be used and integrated with Emacs (or Vim, we’re not all perfect), which is great for function names, but not useful for general text searches. Using grep in a code repository is a pain, and usually means some sort of hack to ignore VC directories like .svn and RCS. Enter ack: similar to grep, but with some more thought behind it. It ignores VC meta-data directories by default and is written in pure Perl, so it’s portable and supports the full Perl regexp syntax. Having a pure-Perl version available with no dependencies also means it’s easy to install in shared hosting environments, where you don’t have root access.
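
For comparison, this is roughly what the grep workaround looks like. It is only a sketch: it assumes a grep that supports --exclude-dir (GNU grep does), and the proj directory and its files are made up for the example.

```shell
# Build a tiny fake checkout with a source file and a copy under .svn.
tmp=$(mktemp -d)
mkdir -p "$tmp/proj/src" "$tmp/proj/.svn"
echo 'getrlimit (RLIMIT_AS, &rlimit);' > "$tmp/proj/src/vm-limit.c"
echo 'getrlimit (RLIMIT_AS, &rlimit);' > "$tmp/proj/.svn/vm-limit.c.svn-base"

# Plain recursive grep also reports the copy inside .svn.
all=$(cd "$tmp" && grep -rli 'limit_as' proj | sort)

# The workaround: name every directory that grep should skip.
filtered=$(cd "$tmp" && grep -rli --exclude-dir=.svn --exclude-dir=RCS 'limit_as' proj)

echo "plain grep:"
echo "$all"
echo "with --exclude-dir:"
echo "$filtered"
rm -rf "$tmp"
```

The filtered search lists only proj/src/vm-limit.c, but you have to remember the exclusions on every run (or hide them in an alias), which is exactly the chore ack takes care of by default.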

Install ack by downloading the standalone version and putting it in your command path, by using CPAN (cpan App::Ack), or by installing a pre-packaged binary (on Debian/Ubuntu systems, the package name is ack-grep). Ack output is very readable, with highlighted matches by default as well as line numbers and file names. Here is an example:


dmaxwell@kaylee:~/tmp$ ack-grep -ai 'limit_as.+?\&rlimit' emacs-22.3
emacs-22.3/src/vm-limit.c
76: getrlimit (RLIMIT_AS, &rlimit);

Here is a screenshot so you can see the highlighting and colorization:

[Screenshot: Ack usage and output]

The -ai switches mean "search all, case insensitively": -a tells Ack to search all file types (while still skipping common VCS directories and files), and -i ignores case. Ack searches are recursive by default, so there is no need for a -r switch. You can see we used Perl’s non-greedy match quantifier in the search regexp, something egrep can’t do. On long lines this can speed the search up.

There is much more to ack; read the docs and give it a try. I hope you’ll find it as useful as I have.

posted at: 19:20