Like many web content authors, I have often needed over the past few years to clean up batches of HTML files generated by a word processor or publishing package. Initially, I cleaned up the files manually, opening each one in turn and making the same set of updates to each. This works fine when you only have a few files to fix, but when you have hundreds or even thousands to do, you can very quickly be looking at weeks or even months of work. A few years ago someone put me on to the idea of using Perl and regular expressions to perform this 'cleaning up' process.

Why write an article about Perl and regular expressions, I hear you say? Well, that's a good point. After all, the web is full of tutorials on Perl and regular expressions. What I found, though, was that when I was trying to work out how to process HTML files, it was difficult to find tutorials that met my criteria. I'm not saying they don't exist, I just couldn't find them. Sure, I could find tutorials that explained everything I needed to know about regular expressions, and I could find plenty of tutorials about how to program in Perl, and even how to use regular expressions within Perl scripts. What I couldn't find was a tutorial that explained how to open one or more HTML or text files, make updates to those files using regular expressions, and then save and close the files.

The Goal

When converting documents into HTML, the goal is always to achieve a seamless conversion from the source document (for example, a word processor document) to HTML. The last thing you need is for your content authors to be spending hours, or even days, fixing untidy HTML code after it has been converted.

Many applications offer excellent tools for converting documents to HTML and, in combination with a well-designed cascading style sheet (CSS), can often produce perfect results. Sometimes, though, there are little bits of HTML code that are a bit messy, normally caused by authors not applying paragraph tags or styles correctly in the source document.

Why Perl?

Perl is such a good language for this task because it is excellent at processing text files which, let's face it, is all HTML files are. Perl is also the de facto standard for the use of regular expressions, which you can use to search for, and replace or change, bits of text or code in a file.

What is Perl?

Perl (often expanded as Practical Extraction and Report Language) is a general purpose programming language, which means it can be used for most tasks that other programming languages can handle. Having said that, Perl is very good at doing certain things, and not so good at others. Although you could do it, you wouldn't normally develop a user interface in Perl, as it would be much easier to use a language like Visual Basic to do this. What Perl is really good at is processing text. This makes it a great choice for manipulating HTML files.

What is a Regular Expression?

A regular expression is a string that describes or matches a set of strings, according to certain syntax rules. Regular expressions are not unique to Perl - many languages, including JavaScript and PHP, can use them - but they are built directly into Perl's syntax, which makes Perl particularly convenient for this kind of work.
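
To make the idea of "a string that matches a set of strings" more concrete, here is a minimal sketch of the kind of clean-up a regular expression can do in Perl. The sample HTML, the `MsoNormal` class name, and the `clean_html` subroutine are all invented for illustration; real converted files will need patterns tailored to whatever mess your conversion tool leaves behind.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: tidies a string of converted HTML and returns the result.
sub clean_html {
    my ($html) = @_;

    # Strip the class attribute from opening <p> tags.
    # /g replaces every match; /i ignores case, so <P CLASS="..."> matches too.
    $html =~ s{<p\s+class="[^"]*"\s*>}{<p>}gi;

    # Normalise closing tags such as </P> to lower case.
    $html =~ s{</p>}{</p>}gi;

    # Remove paragraphs that contain nothing but a non-breaking space,
    # along with the newline that follows them.
    $html =~ s{<p>\s*&nbsp;\s*</p>\n?}{}gi;

    return $html;
}

# Sample input, made up for this example.
my $messy = <<'END_HTML';
<p class="MsoNormal">First paragraph.</p>
<p>&nbsp;</p>
<P CLASS="MsoNormal">Second paragraph.</P>
END_HTML

print clean_html($messy);
```

Each `s{pattern}{replacement}` line is one search-and-replace rule. The first pattern, for example, does not name one specific tag; it describes every opening paragraph tag that carries a class attribute, whatever the class happens to be. Simple patterns like these work well on predictable, machine-generated markup, but they are not a general HTML parser.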