Imagine a programmer finds chunks of (almost) identical code duplicated across multiple files. Such a programmer should be itching to extract those chunks. It doesn't really matter what programming language we're talking about: every non-trivial language has a way of referring to external code.
It's easy to do a simple find/replace operation where the text to be replaced does not run over any line boundaries. Any half-decent text editor can do it, as can tools like sed or awk, with minimal tweaking.
When it comes to chunks of text that span multiple lines or paragraphs, it gets a little harder, but not that much. A language such as Perl has more than enough power to handle the job.
I inherited a bunch of HTML files that each featured the following chunk of code near the top:
<table width="100%">
  <tr>
    <td width="60"><a href="home.html"><img src="img/home.gif"></a></td>
    <td width="88">
      <a href="about.html"><img alt="About Us" src="img/about.gif"></a>
    </td>
    <td align="right">Search
    </td>
    <td align="right" width="104">
      <input size="40" name="terms" type="text">
    </td>
    <td width="45">
      <input alt="Go" src="img/go.gif" type="image" name="go">
    </td>
  </tr>
</table>
I wanted to replace all occurrences of this chunk with something more like:
<!--#include file="navigation.inc"-->
All I needed was a regular expression that would match the chunk, and then I could apply Perl's substitution operator, s///. It's not that hard to write such a regular expression, but it is tedious and therefore error-prone. To relieve the tedium, I wrote a small Perl script to create the regular expression for me:
#! /usr/bin/perl -w
use strict;

print "s%\n";
while (<>) {
    # escape any regex meta-chars
    s/([].[\\^#|\$%*+?(){}])/\\$1/g;
    # match trailing whitespace (incl. newlines) on non-empty lines
    s/(.)$/$1\\s+/;
    # match any internal whitespace
    s/(\S)[ \t]+/$1\\s+/g;
    print $_;
}
print <<EOT;
%PUT REPLACEMENT TEXT HERE
%six
EOT
(I put this code into a file called genreg.pl.)
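To see what the generator does to individual lines, here is a minimal sketch that applies the same three substitutions as genreg.pl to two sample lines. The helper name pattern_line and the sample lines are made up for illustration; they are not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: the same three substitutions as genreg.pl,
# applied to one line and returned instead of printed.
sub pattern_line {
    my ($line) = @_;
    $line =~ s/([].[\\^#|\$%*+?(){}])/\\$1/g;   # escape regex meta-chars
    $line =~ s/(.)$/$1\\s+/;                    # trailing whitespace (incl. newline)
    $line =~ s/(\S)[ \t]+/$1\\s+/g;             # internal whitespace
    return $line;
}

# Sample input lines (made up for this example).
print pattern_line(qq{<table width="100%">\n});
# prints: <table\s+width="100\%">\s+
print pattern_line(qq{  <td>Search (all)</td>\n});
# prints:   <td>Search\s+\(all\)</td>\s+
```

The leading spaces on the second line survive untouched, which is harmless because the x flag makes the engine ignore them.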
It is important to notice the "six" flags at the end. These flags change the way the regular expression engine works.
- s allows the regular expression to match over multiple lines. In particular, it lets "." match newlines.
- i makes the regular expression case-insensitive.
- x is not strictly necessary and it arguably complicates things a bit. It makes the engine ignore unescaped whitespace in the regular expression and any comments that follow a # symbol. I like to use \s to match spaces anyway, so I don't mind the extra complication. Besides, this script takes care of it for me.
(Gory details can be found on the perlre man page.)
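The effect of each flag can be checked in isolation. The following is a small sketch with a made-up sample string; it is only meant to illustrate the three flags, not to reproduce the generated script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up sample: upper-case tags with a newline inside the cell.
my $text = "<TD>Search\n</TD>";

# i: a lower-case pattern matches the upper-case tag
my $i_ok = $text =~ m{<td>search}i ? 1 : 0;

# s: "." may match the newline between "Search" and "</TD>"
my $dot_without_s = $text =~ m{Search.</TD>}  ? 1 : 0;
my $dot_with_s    = $text =~ m{Search.</TD>}s ? 1 : 0;

# x: unescaped whitespace and "#" comments in the pattern are ignored
my $x_ok = $text =~ m{
    <TD> Search   # the spaces in this pattern are not matched literally
    \s+           # this matches the newline in the sample text
    </TD>
}x ? 1 : 0;

print "i: $i_ok, without s: $dot_without_s, with s: $dot_with_s, x: $x_ok\n";
# prints: i: 1, without s: 0, with s: 1, x: 1
```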
A couple of extra notes:
- % is not normally a regular expression meta-character, but I am using it to delimit my regular expression, so I want to escape it anyway.
- The way the generated regular expression works, \s+ will match across multiple newlines if necessary.
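As an illustration of the second note, here is a sketch with a hand-written pattern in the style the generator produces, run against text whose layout differs from the pattern. The HTML fragment and variable names are invented for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hand-written pattern in the style genreg.pl generates
# (the fragment itself is made up for this example).
my $pattern = qr%
<tr>\s+
<td>Search\s+
</td>\s+
%six;

# Same tags, but with blank lines, a tab and different indentation.
my $html = "<TR>\n\n\n    <TD>Search\t\n</TD>\n<p>rest</p>\n";

(my $out = $html) =~ s/$pattern/<!--#include file="navigation.inc"-->\n/;
print $out;

# \s+ needs at least one whitespace character, so a layout with
# no whitespace at all between the tags is not matched.
my $tight = "<tr><td>Search</td>\n";
print "tight layout matches: ", ($tight =~ $pattern ? "yes" : "no"), "\n";
```

One consequence worth keeping in mind: because \s+ requires at least one whitespace character, a copy of the chunk that has been squeezed onto one line with no spaces between tags would be left alone.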
Here's how I ran the Perl code. First, I put the chunk of HTML code I wanted to replace in a file called find.html. Then I used genreg.pl to create a script called sub.pl.
$ perl genreg.pl < find.html > sub.pl
I edited the sub.pl file and put in the replacement text I wanted, so the last two lines of the file became:
%<!--#include file="navigation.inc"-->
%six
Next I checked it by running it on one of my input files:
$ perl -p -0777 sub.pl < file01.html | less
Finally, I ran it on all my input files.
$ perl -p -0777 -i.bak sub.pl file*.html
The flags -p, -0777 and -i are explained on the perlrun man page. Suffice it to say that they allow me to process whole files at a time and save a backup copy of each file with a .bak extension.
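For readers who prefer to see it spelled out, here is a rough long-hand sketch of what those three flags arrange. It is an approximation for illustration, not what perl does internally; the function name rewrite_file and the stand-in substitution are made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Rough long-hand equivalent of: perl -p -0777 -i.bak sub.pl file*.html
# rewrite_file is a hypothetical name used only in this sketch.
sub rewrite_file {
    my ($file) = @_;

    local $/;    # like -0777: no record separator, so one read gets the whole file
    open my $in, '<', $file or die "Cannot read $file: $!";
    my $content = <$in>;
    close $in;

    # like -i.bak: keep the original under a .bak name
    rename $file, "$file.bak" or die "Cannot back up $file: $!";

    # Stand-in substitution; the body of sub.pl would go here.
    $content =~ s%<hr>\s+%<!--rule-->\n%six;

    # like -p: print the (possibly modified) record
    open my $out, '>', $file or die "Cannot write $file: $!";
    print {$out} $content;
    close $out or die "Cannot close $file: $!";
    return;
}

rewrite_file($_) for @ARGV;
```

Reading the whole file as one record is what lets a pattern span line boundaries; in the default line-by-line mode the substitution would only ever see one line at a time.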
In more complicated cases, I would want to edit the regular expression. For example, I could use parentheses () to capture variant parts of the text and use them in my replacement text.
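A minimal sketch of that idea, assuming the size of the search box differed from file to file: the literal size is replaced by a capture group and reused as $1 in the replacement. The input line and the trailing HTML comment are invented for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up input: the same search box, but with a size that varies per file.
my $html = qq{<td> <input size="25" name="terms" type="text"> </td>\n};

# Edited pattern: the literal size has been replaced by a capture group.
(my $out = $html) =~ s%
<td>\s+
<input\s+size="(\d+)"\s+name="terms"\s+type="text">\s+   # capture the size
</td>\s+
%<!--#include file="navigation.inc"--><!-- search box size was $1 -->
%six;

print $out;
# prints: <!--#include file="navigation.inc"--><!-- search box size was 25 -->
```

Note that comments inside the pattern must not contain $ or @, since variables are interpolated before the x flag strips comments.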
Thanks, but I did work out one thing: on Windows you need to stick a BEGIN {@ARGV = map { glob } @ARGV } on the very first line of sub.pl.
Good point and thanks for the comment. I normally rely on the shell to do the globbing before the file names are passed to the script. On Windows you can't make that assumption.