Counting Words in Files With HTML Markup
Sep 13th, 2007 by Doug
I write blog posts with HTML markup
(http://blog.unixlore.net/2006/03/using-emacs-to-edit-blog-posts.html),
and I sometimes want to get a fairly accurate word count of my posts.
By accurate I mean that HTML tags themselves, as well as quoted
attribute values, are not counted as words. There are lots of
utilities and scripts that do word counting, from the venerable Unix
‘wc’ to an elisp subroutine in the FSF’s An Introduction to
Programming in Emacs Lisp
(http://www.gnu.org/software/emacs/emacs-lisp-intro/html_node/Whitespace-Bug.html).
The ones I looked at all suffered from the same problem: they counted
markup as ‘words’. If there were some way to strip out or ignore
markup, the various methods of word counting would work.
First I tried a few ready-made utilities. The Unix text-mode browser
lynx has a ‘dump’ option that outputs the formatted text content of a
given HTML file (lynx -dump -nolist foo.html). However, some of the
formatting it adds is itself counted as a word by the ‘wc’ utility.
w3m is similar in its output, so it has the same problem. I found a
Debian package called unhtml
(http://packages.debian.org/stable/text/unhtml) that seemed to do
what I wanted, but after experimenting a bit with it, I found that it
could not handle multiple opening and closing tags on the same line
(it counted them as one tag, meaning any real words in that line were
skipped). Thinking I might have to write my own utility, but not
wanting to reinvent the wheel, I did a CPAN search and had success on
the first hit. After a few tests I found that HTML::Strip
(http://search.cpan.org/~kilinrax/HTML-Strip-1.06/Strip.pm) did
indeed handle multiple tags on a line, as well as HTML comments and
quoted attribute values, properly.
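To make the strip-then-count idea concrete, here is a minimal Python sketch. It is not HTML::Strip and not the script described in this post: it uses Python's standard html.parser module as a stand-in, and the names TextOnly, strip_markup and count_words are invented for illustration. It makes no attempt to skip the contents of script or style elements.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect text nodes; tags, attribute values and comments are dropped."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called only for text between tags, never for the markup itself.
        self.chunks.append(data)


def strip_markup(html):
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    # Join with spaces so words separated only by a tag do not fuse together.
    return " ".join(parser.chunks)


def count_words(html):
    return len(strip_markup(html).split())


# Several tags, a quoted attribute value and a comment, all on one line.
sample = ('<p>One <a href="http://example.com/" title="a b c">two</a> '
          '<em>three</em><!-- not counted --></p>')

print(len(sample.split()))  # 10: whitespace splitting, as wc -w does, counts markup
print(count_words(sample))  # 3: only One, two, three remain
```

Joining the text chunks with spaces is a trade-off: it keeps adjacent elements from running together, but punctuation that directly follows a closing tag ends up counted as its own word.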
The next step was to write a wrapper around HTML::Strip for command
line use. After a bit of hacking, I came up with unhtml.pl
(http://unixlore.net/downloads/unhtml.pl.txt). From the script
header:
  Script that strips HTML tags from text. It uses HTML::Strip to do
  the real work; this is a wrapper around that module that allows you
  to specify command line arguments - standard input/output is assumed
  if no args are given. If only one arg is given, it is assumed to be
  the input pathname. Requires HTML::Strip (perl -MCPAN -e 'install
  HTML::Strip' as root on any Unix-based OS will work).

  Examples (the following have equivalent results):

    unhtml.pl < foo.html > foo.txt
    unhtml.pl foo.html > foo.txt
    unhtml.pl foo.html foo.txt

I also needed a way to integrate this into Emacs. Here is an elisp
snippet you can put in your .emacs (don’t forget to modify the path
to the script):
(defun word-count ()
  "Count words in region"
  (interactive)
  ;; Pipe the region through the tag stripper, then count what is left.
  (shell-command-on-region (region-beginning) (region-end)
                           "/home/dmaxwell/bin/unhtml.pl | wc -w"))
(global-set-key "\C-c=" 'word-count)
As a bonus, it also handles XML and SGML properly. To use it while
editing, just type C-c = to get a word count of the current region
(use C-x h to make the region the entire buffer), minus HTML tags.
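The argument convention quoted from the script header (no arguments: standard input to standard output; one argument: input path; two arguments: input and output paths) can be sketched as follows. This is a hypothetical Python analogue, not the contents of unhtml.pl. The name run_filter is invented, and the transform parameter stands in for whatever tag-stripping function is used.

```python
import sys
from contextlib import ExitStack


def run_filter(args, transform):
    """Read input, apply transform (a text -> text function), write output.

    No args:  standard input  -> standard output
    One arg:  args[0]         -> standard output
    Two args: args[0]         -> args[1]
    """
    if len(args) > 2:
        raise SystemExit("usage: filter [input [output]]")
    with ExitStack() as stack:
        # Only files opened here are closed on exit; stdin/stdout are left alone.
        src = stack.enter_context(open(args[0])) if args else sys.stdin
        dst = (stack.enter_context(open(args[1], "w"))
               if len(args) == 2 else sys.stdout)
        dst.write(transform(src.read()))
```

Saved as a script, this would be driven by a call such as run_filter(sys.argv[1:], transform), with a tag-stripping function passed as transform. Because standard output is the default destination, the result can still be piped straight into wc -w.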