Mon, 14 Sep 2009
Quick Log File Processing with Perl
A common thing to want to do as a sysadmin is match and print text from a file in a particular output format. There are lots of ways to do this using shell tools - grep, sed and awk are used frequently - but I’d like to show you a common Perl idiom for doing this type of task.
Perl was originally designed to be a replacement for the various shell tools, and while it has grown into much more over the years, it is still a great tool to have in your command line toolbox. Here’s an example. Let’s say you want to print the date, time, IP address and URL each time your website is crawled by Googlebot. The Apache access log will look something like this:
...
10.249.66.234 - - [12/Sep/2009:19:22:51 -0400] "GET /robots.txt HTTP/1.1" 404 424 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
10.249.66.234 - - [12/Sep/2009:19:22:51 -0400] "GET / HTTP/1.1" 200 - "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
...
A quick solution is this, all in one line:
serenity:~# perl -wnle 'print "Googlebot accessed \"$4\" from $1 on $2 at $3" if (/^ (\d+\.\d+\.\d+\.\d+) .+? \[ (.+?) : (.+?) \s .+? GET\s+(.+?)\s+HTTP .+ Googlebot/x)' /var/log/apache2/access.log
Googlebot accessed "/robots.txt" from 10.249.66.234 on 12/Sep/2009 at 19:22:51
Googlebot accessed "/" from 10.249.66.234 on 12/Sep/2009 at 19:22:51
serenity:~#
There are four command line options used here:
- w: Turn on warnings
- n: Loop through the supplied file one line at a time
- l: Strip the newline from each input line and print a newline after each line of output
- e: Execute the Perl code that follows
I build the regular expression by picking a target line and going through it from left to right, adding expressions as I go. I use the /x modifier to make it easier to read: it tells Perl to ignore whitespace in the regexp. I also use Perl’s non-greedy quantifier quite a bit; this is the question mark in expressions like .+? \[. This little snippet matches one or more of any character, followed by a left bracket. The question mark ensures that the first such left bracket is matched. Normally Perl’s regexp engine would happily chomp away at characters and match the last left bracket it found in the line. Using the greedy form .+ \[ would also work for us, since there is only one left bracket in each line, but the non-greedy form turns out to be a performance improvement when parsing large text files. (For more info, I encourage you to read Mastering Regular Expressions by Jeffrey Friedl, or start with the Regular Expression Tutorial.)
This method has a few advantages. For one, it relies on just one tool, not a few disparate ones. Perl is portable to many operating systems, so you could use this to parse text files on Windows, for example. You also have the ability to load modules on the command line with the ‘-M’ switch. This gives you access to all of CPAN, potentially a huge time-saver.
posted at: 21:20 | tags: Perl Sysadmin Tips Logs Regexps
Sun, 23 Aug 2009
Move Over, Grep. Hello, Ack
As someone who has been using grep and its variants like egrep for years, I admit they have been insanely useful. But every once in a while something comes along that improves an idea so much, you can’t ignore it. Such a thing is Ack, the grep replacement.
I do a lot of software development in large codebases, and the ability to find snippets of text is paramount. Tags can be used and integrated with Emacs (or Vim, we’re not all perfect), which is great for function names, but not useful for general text searches. Using grep in a code repository is a pain, and usually means some sort of hack to ignore VC directories like .svn and RCS. Enter ack - similar to grep but with some more thought behind it. It ignores VC meta-data directories by default and is written in pure Perl - so it’s portable and supports the full Perl regexp syntax. Having a pure-Perl version available with no dependencies also means it’s easy to install in shared hosting environments, where you don’t have root access.
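For anyone who hasn’t hit this problem, here is a sketch of the kind of grep workaround I mean. It assumes GNU grep, which provides the --exclude-dir option; the throwaway directory tree and file names are invented purely for the demonstration.

```shell
# Build a tiny throwaway tree with a fake .svn directory (illustrative only).
tree=$(mktemp -d)
mkdir -p "$tree/src" "$tree/.svn"
echo 'getrlimit (RLIMIT_AS, &rlimit);' > "$tree/src/vm-limit.c"
echo 'getrlimit (RLIMIT_AS, &rlimit);' > "$tree/.svn/vm-limit.c.svn-base"

# Plain recursive grep also reports the copy under .svn.
plain=$(grep -rl 'RLIMIT_AS' "$tree" | wc -l | tr -d ' ')

# The usual workaround: name every VC directory you want skipped.
filtered=$(grep -rl --exclude-dir=.svn --exclude-dir=RCS 'RLIMIT_AS' "$tree" | wc -l | tr -d ' ')

echo "plain: $plain, filtered: $filtered"   # plain: 2, filtered: 1
rm -rf "$tree"
```

It works, but you have to remember the exclusions every time, or bury them in an alias.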
Install ack by downloading the standalone version and putting it in your command path, by using CPAN (cpan App::Ack), or by installing a pre-packaged binary (on Debian/Ubuntu systems, the package name is ack-grep). Ack output is very readable, with highlighted matches by default as well as line numbers and file names. Here is an example:
dmaxwell@kaylee:~/tmp$ ack-grep -ai 'limit_as.+?\&rlimit' emacs-22.3
emacs-22.3/src/vm-limit.c
76: getrlimit (RLIMIT_AS, &rlimit);
Here is a screenshot so you can see the highlighting and colorization:
The -ai switches mean ‘search all, case insensitively’: -a tells Ack to search all file types (while still skipping common VCS directories and files), and -i tells it to ignore case. Ack searches are recursive by default, so there is no need for a -r switch. You can see we used Perl’s non-greedy match quantifier in the search regexp, something egrep can’t do, and which speeds the search up considerably.
There is much more to ack; read the docs and give it a try. I hope you’ll find it as useful as I have.
posted at: 19:20 | tags: Perl Grep Ack Tips