Replacing Large Chunks of Text with Perl

Imagine a programmer finds chunks of (almost) identical code duplicated across multiple files. Such a programmer should be itching to extract those chunks. It doesn't really matter what programming language we're talking about: every non-trivial language has a way of referring to external code.

It's easy to do a simple find/replace operation where the text to be replaced does not run over any line boundaries. Any half-decent text editor can do it, as can tools like sed or awk, with minimal tweaking.

When it comes to chunks of text that span multiple lines or paragraphs, it gets a little harder, but not that much. A language such as Perl has more than enough power to handle the job.

I inherited a bunch of HTML files that each featured a chunk of code like this near the top:

<table width="100%">
 <tr>
  <td width="60"><a href="home.html"><img src="img/home.gif"></a></td>
  <td width="88">
   <a href="about.html"><img alt="About Us" src="img/about.gif"></a>
  </td>
  <td align="right">Search </td>
  <td align="right" width="104">
   <input size="40" name="terms" type="text">
  </td>
  <td width="45">
   <input alt="Go" src="img/go.gif" type="image" name="go">
  </td>
 </tr>
</table>

I wanted to replace all occurrences of this chunk with something more like

<!--#include file="navigation.inc"-->

All I needed was a regular expression that would match the chunk, and then I could apply Perl's substitution operator, s///. It's not that hard to write such a regular expression, but it is tedious and therefore error-prone. To relieve the tedium, I wrote a small Perl script to create the regular expression for me:

#! /usr/bin/perl -w
use strict;
print "s%\n";
while (<>) {
	# escape any regex meta-chars
	s/([].[\\^#|\$%*+?(){}])/\\$1/g;
	# match trailing whitespace (incl. newlines) on non-empty lines
	s/(.)$/$1\\s+/;
	# match any internal whitespace
	s/(\S)[ \t]+/$1\\s+/g;
	print $_;
}
print <<EOT;
%PUT
REPLACEMENT
TEXT
HERE
%six
EOT

(I put this code into a file called genreg.pl.)
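
To see what the three substitutions in the loop do to a single line, here is a minimal sketch that wraps them in a function. The name line_to_pattern and the sample input line are assumptions made for illustration; they are not part of genreg.pl.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: applies the same three substitutions as the
# loop in genreg.pl to one line and returns the regex fragment.
sub line_to_pattern {
    my ($line) = @_;
    # escape any regex meta-chars (plus # and %, which matter under
    # the x flag and with % as the delimiter)
    $line =~ s/([].[\\^#|\$%*+?(){}])/\\$1/g;
    # match trailing whitespace (incl. newlines) on non-empty lines
    $line =~ s/(.)$/$1\\s+/;
    # match any internal whitespace
    $line =~ s/(\S)[ \t]+/$1\\s+/g;
    return $line;
}

# Made-up sample line, not taken from the HTML files above.
my $sample = qq{  <td width="60">Go (now)</td>\n};
print line_to_pattern($sample);
# prints:   <td\s+width="60">Go\s+\(now\)</td>\s+
```

The leading indentation is left untouched. That is harmless, because the x flag makes the engine ignore unescaped whitespace in the pattern, and the \s+ at the end of the previous line already covers the newline and the indentation in the input.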

Note the "six" flags at the end of the generated substitution. These flags change the way the regular expression engine works.

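  • s makes the engine treat the text as a single line: it lets "." match newlines. The generated expression crosses line boundaries with \s+ and so does not depend on this, but it matters as soon as you add a "." by hand.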
  • i makes the regular expression case-insensitive.
  • The x flag is not strictly necessary and it arguably complicates things a bit. It makes the engine ignore un-escaped whitespace in the regular expression and any comments that follow a # symbol. I like to use \s to match spaces anyway, so I don't mind the extra complication. Besides, this script takes care of it for me.

(Gory details can be found on the perlre man page.) A couple of extra notes:

  • % is not normally a regular expression meta-character, but I am using it to delimit my regular expression, so I want to escape it anyway.
  • The way the generated regular expression works, \s+ will match across multiple newlines if necessary.
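
The effect of the flags and of \s+ can be checked on a small made-up fragment. In this sketch the variable names and the sample text are assumptions; the pattern is written by hand in the style that genreg.pl generates.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up input: upper-case tags and different line breaks
# than the pattern below was written for.
my $text = "<TD>\n   Search\n\n</TD>\n";

# i: <td> also matches <TD>
# x: the line breaks and indentation inside the pattern are ignored
# \s+: matches any run of whitespace in the input, newlines included
(my $result = $text) =~ s%
<td>\s+
  Search\s+
</td>\s+
%<!--#include file="navigation.inc"-->
%six;

print $result;
```

The whole fragment is replaced by the include line, even though neither the case of the tags nor the layout of the input agrees with the way the pattern is written.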

Here's how I ran the Perl code. First, I put the chunk of HTML code I wanted to replace in a file called find.html. Then I used genreg.pl to create a script called sub.pl.

$ perl genreg.pl < find.html > sub.pl

I edited the sub.pl file and put in the replacement text I wanted, so the last two lines of the file became:

%<!--#include file="navigation.inc"-->
%six

Next I checked it by running it on one of my input files:

$ perl -p -0777 sub.pl < file01.html | less

Finally, I ran it on all my input files.

$ perl -p -0777 -i.bak sub.pl file*.html

The flags -p, -0777 and -i are explained on the perlrun man page. Suffice it to say that they allow me to process whole files at a time and save a backup copy of each file with a .bak extension.
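
The reason -0777 matters is that -p normally feeds the script one line at a time, so a pattern spanning several lines could never match. Here is a rough sketch of the difference, using an in-memory filehandle and made-up text (both assumptions for illustration). Undefining $/ inside a script has the same effect as -0777 on the command line.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = "<tr>\n <td>Search</td>\n</tr>\n";

# Default behaviour: one record per line.
open my $fh, '<', \$text or die "cannot open in-memory file: $!";
my @lines = <$fh>;
close $fh;

# Slurp mode: with the record separator undefined, the whole
# input is read as a single record.
open $fh, '<', \$text or die "cannot open in-memory file: $!";
my @records = do { local $/; <$fh> };
close $fh;

printf "%d records line by line, %d record when slurped\n",
    scalar @lines, scalar @records;

# Only the slurped record can satisfy a pattern that crosses lines.
my $pattern = qr%<tr>\s+<td>Search</td>\s+</tr>%six;
my $hits_by_line = grep { $_ =~ $pattern } @lines;
my $hits_slurped = grep { $_ =~ $pattern } @records;
print "matches line by line: $hits_by_line, slurped: $hits_slurped\n";
```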

In more complicated cases, I would want to edit the regular expression. For example, I could use brackets () to capture variant parts of the text and use them in my replacement text.
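
As a sketch of that idea, suppose the link target in the chunk differs from file to file. The sample text, the replacement markup, and the variable names below are all made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up fragment in which the href is the part that varies.
my $page = qq{<a href="index.html"><img src="img/home.gif"></a>\n};

# The parentheses capture the href value; the replacement reuses it via $1.
(my $result = $page) =~
    s%<a\s+href="([^"]+)"><img\s+src="img/home\.gif"></a>%<a href="$1">Home</a>%six;

print $result;
```

In a generated sub.pl, the equivalent edit would be to replace the escaped literal value with a group such as ([^"]+) and to refer to $1 in the replacement text.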

2 Comments

thanks .. but i did work out 1 thing .. in windows you need to stick a

BEGIN {@ARGV = map { glob } @ARGV }

on the very first line of the sub.pl

Good point and thanks for the comment. I normally rely on the shell to do the globbing before the file names are passed to the script. On Windows you can't make that assumption.

About this Entry

This page contains a single entry by Christian published on December 11, 2003 4:16 PM.
