The Wandering Geek

Linux and Unix Lore, Sysadmin, Coding, and Hacks

Counting Words in Files With HTML Markup

Sep 13th, 2007 by Doug

I write blog posts with HTML markup, and I sometimes want to get a fairly accurate word count of my posts. By accurate I mean that HTML tags themselves, as well as quoted attribute values, are not counted as words. There are lots of utilities and scripts that do word counting, from the venerable Unix wc to an elisp subroutine in the FSF's An Introduction to Programming in Emacs Lisp. The ones I looked at all suffered from the same problem: they counted markup as 'words'. If there was some way to strip out or ignore markup, the various methods of word counting would work.
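To make the problem concrete, here is a small Perl sketch that counts whitespace-separated tokens the way wc -w does. The sample line and the naive_word_count helper are made up for illustration; they are not taken from any of the tools mentioned above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count whitespace-separated tokens, which is essentially what `wc -w` does.
# (naive_word_count is a hypothetical helper name used only in this sketch.)
sub naive_word_count {
    my ($text) = @_;
    my @tokens = split ' ', $text;   # split ' ' splits on runs of whitespace
    return scalar @tokens;
}

# Hypothetical sample line: the only real word a reader sees is "home".
my $html = '<a href="index.html" title="my home page">home</a>';

# Prints 5, because the tag and its attribute values are counted as words.
print naive_word_count($html), "\n";
```

One visible word becomes five "words" once the tag name and attribute values are included, which is the kind of inflation I wanted to avoid.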

First I tried a few ready-made utilities. The Unix text-mode browser lynx has a 'dump' option that will output the text content of a given HTML file (lynx -dump -nolist foo.html). However, the output is formatted text, and some of the formatting characters are themselves counted as words by the wc utility. w3m is similar in its output, so it has the same problem. I found a Debian package called unhtml that seemed to do what I wanted, but after experimenting with it a bit, I found that it could not handle multiple opening and closing tags on the same line (it treated them as one tag, meaning any real words on that line were skipped). Thinking I might have to write my own utility, but not wanting to reinvent the wheel, I did a CPAN search and had success on the first hit. After a few tests I found that HTML::Strip did indeed handle multiple tags on a line, as well as HTML comments and attribute values, properly.

The next step was to write a wrapper around HTML::Strip for command line use. After a bit of hacking, I came up with unhtml.pl. From the script header:

Script that strips HTML tags from text. It uses HTML::Strip to do the real work; this is a wrapper around that module that allows you to specify command line arguments. Standard input/output is assumed if no args are given. If only one arg is given, it is assumed to be the input pathname. Requires HTML::Strip (perl -MCPAN -e 'install HTML::Strip' as root on any Unix-based OS will work).

Examples (the following have equivalent results):

    unhtml.pl < foo.html > foo.txt
    unhtml.pl foo.html > foo.txt
    unhtml.pl foo.html foo.txt

I also needed a way to integrate this into Emacs. Here is an elisp snippet you can put in your .emacs (don't forget to modify the path to the script):

    (defun word-count nil
      "Count words in region"
      (interactive)
      (shell-command-on-region (point) (mark) "~/bin/unhtml.pl | wc -w"))
    (global-set-key "\C-c=" 'word-count)

As a bonus, it also handles XML and SGML properly. To use it while editing, just type C-c = (the key bound in the snippet above) to get a word count of the current region, minus HTML tags. Use C-x h first to make the region the entire buffer.
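For a sense of what a wrapper like this involves, here is a minimal sketch written from the description in the script header. It is not the original unhtml.pl: the strip_html and main names, the error messages, and the choice to read the whole input at once are my assumptions. The only HTML::Strip calls used are new, parse, and eof.

```perl
#!/usr/bin/perl
# Sketch of a command-line wrapper around HTML::Strip.
# NOT the original unhtml.pl; helper names and structure are assumptions.
use strict;
use warnings;
use HTML::Strip;

# Return the text of $html with tags, comments, and attribute values removed.
sub strip_html {
    my ($html) = @_;
    my $hs   = HTML::Strip->new();
    my $text = $hs->parse($html);
    $hs->eof;    # tell the parser this document is finished
    return $text;
}

sub main {
    my ($in_path, $out_path) = @ARGV;

    # Standard input/output is assumed when a path is not given.
    my ($in, $out) = (\*STDIN, \*STDOUT);
    if (defined $in_path) {
        open my $fh, '<', $in_path or die "Cannot read $in_path: $!\n";
        $in = $fh;
    }
    if (defined $out_path) {
        open my $fh, '>', $out_path or die "Cannot write $out_path: $!\n";
        $out = $fh;
    }

    # Read the whole input at once so tags spanning lines stay intact.
    my $html = do { local $/; <$in> };
    $html = '' unless defined $html;

    print {$out} strip_html($html), "\n";
}

# Run main only when executed as a script, not when loaded from another file.
main() unless caller;
```

Saved under a name of your choosing and made executable, it should accept the same three invocation forms listed in the examples above, and its output can be piped to wc -w in the same way.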

The Wandering Geek © 2009 All Rights Reserved.