Using Perl and Regular Expressions to Process HTML Files - Part 1

Like many web content authors, over the past few years I've had many occasions when I've needed to clean up a bunch of HTML files that have been generated by a word processor or publishing package. Initially, I used to clean up the files manually, opening each one in turn and making the same set of updates to each one. This works fine when you only have a few files to fix, but when you have hundreds or even thousands to do, you can very quickly be looking at weeks or even months of work.

A few years ago someone put me on to the idea of using Perl and regular expressions to perform this 'cleaning up' process.

Why write an article about Perl and regular expressions, I hear you say? Well, that's a good point. After all, the web is full of tutorials on Perl and regular expressions. What I found, though, was that when I was trying to find out how I could process HTML files, it was difficult to find tutorials that met my criteria. I'm not saying they don't exist; I just couldn't find them. Sure, I could find tutorials that explained everything I needed to know about regular expressions, and I could find plenty of tutorials about how to program in Perl, and even how to use regular expressions within Perl scripts. What I couldn't find, though, was a tutorial that explained how to open one or more HTML or text files, make updates to those files using regular expressions, and then save and close the files.

The Goal

When converting documents into HTML, the goal is always to achieve a seamless conversion from the source document (for example, a word processor document) to HTML. The last thing you need is for your content authors to be spending hours, or even days, fixing untidy HTML code after it has been converted.

Many applications offer excellent tools for converting documents to HTML and, in combination with a well-designed cascading style sheet (CSS), can often produce perfect results. Sometimes, though, there are little bits of HTML code that are a bit messy, normally caused by authors not applying paragraph tags or styles correctly in the source document.

Why Perl?

Perl is such a good language to use for this task because it is excellent at processing text files, which, let's face it, is all HTML files are. Perl is also the de facto standard for the use of regular expressions, which you can use to search for, and replace or change, bits of text or code in a file.

What is Perl?

Perl (often expanded as Practical Extraction and Report Language) is a general-purpose programming language, which means it can, in principle, be used for the same kinds of task as any other programming language. Having said that, Perl is very good at doing certain things, and not so good at others. Although you could do it, you wouldn't normally develop a user interface in Perl, as it would be much easier to use a language like Visual Basic to do this. What Perl is really good at is processing text. This makes it a great choice for manipulating HTML files.

What is a Regular Expression?

A regular expression is a string that describes or matches a set of strings, according to certain syntax rules. Regular expressions are not unique to Perl - many languages, including JavaScript and PHP, can use them - but Perl, which builds them into the language itself, handles them particularly well.
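
To make the idea concrete, here is a minimal sketch of the workflow described earlier: open one or more HTML files, update them with regular expressions, then save and close them. It is only an illustration, not a finished clean-up script. The helper name clean_html and the three clean-up rules are assumptions of mine, chosen as typical examples of untidy converted HTML; your own files will need their own rules.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from any library): applies a few
# illustrative clean-up rules to a string of HTML and returns the result.
sub clean_html {
    my ($html) = @_;

    # Remove paragraphs that contain only whitespace or &nbsp; entities.
    $html =~ s{<p>(?:\s|&nbsp;)*</p>\n?}{}gi;

    # Collapse runs of two or more spaces or tabs into a single space.
    $html =~ s{[ \t]{2,}}{ }g;

    # Replace <b>...</b> with <strong>...</strong>.
    # The 's' modifier lets '.' match newlines inside the tag pair.
    $html =~ s{<b>(.*?)</b>}{<strong>$1</strong>}gis;

    return $html;
}

# Process every file named on the command line.
for my $file (@ARGV) {
    # Open the file and read its whole contents into one string.
    open(my $in, '<', $file) or die "Cannot read $file: $!";
    my $html = do { local $/; <$in> };
    close($in);

    my $cleaned = clean_html($html);

    # Save the updated text back to the same file, then close it.
    open(my $out, '>', $file) or die "Cannot write $file: $!";
    print {$out} $cleaned;
    close($out) or die "Cannot close $file: $!";
}
```

Saved as, say, cleanup.pl (a made-up name), it could be run as `perl cleanup.pl page1.html page2.html`. Because the script overwrites each file in place, it is safest to try it on copies first. Given the input `<p>&nbsp;</p>` followed on the next line by `<p>Some  <b>bold</b> text</p>`, clean_html returns the single line `<p>Some <strong>bold</strong> text</p>`.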