I am writing a web crawler in Perl. Yes, I know there are a million web crawlers out there, but I am purposely writing one from scratch, in Perl, without stealing code from anyone else. My only resource is the Beginning Perl book on perl.org.
The concept of a web crawler is pretty easy, but the mechanics make it an excellent learning tool. To build mine, I need regular expressions to find the URLs in a page, hash manipulation both to store the URLs I find and to keep track of URLs I've already seen, and some good old-fashioned input/output routines and flow control. Not rocket science, but certainly rocket science has these little bits in it.
The entire program is just a few lines and the most important one is the regular expression that finds the URLs. This little snippet does the following:
- It gets the URL from the command line or fails if one is not supplied
- It uses the LWP::Simple Perl library to fetch the page from the URL
- If the resulting page has at least one line, it goes into a foreach which examines each line and extracts URLs from it.
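Here is a minimal sketch of those three steps, not the original program. The name links_on_page, the optional $fetch argument, and the simple quoted-href pattern are all assumptions made for illustration. The only LWP::Simple call used is get, which returns undef when the fetch fails.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the quoted href values found on the page at $url.
# $fetch is a hypothetical hook (not part of LWP::Simple) that lets the
# sketch run without network access; by default the page is fetched with
# LWP::Simple::get, which returns undef on failure.
sub links_on_page {
    my ($url, $fetch) = @_;
    die "Usage: $0 <url>\n" unless defined $url && length $url;

    $fetch ||= sub { require LWP::Simple; return LWP::Simple::get($_[0]); };
    my $page = $fetch->($url);
    return () unless defined $page;

    my @links;
    my @lines = split /\n/, $page;
    if (@lines) {                       # the page has at least one line
        foreach my $line (@lines) {
            # Stand-in pattern: quoted href values only.
            while ($line =~ /href\s*=\s*["']([^"']+)["']/gi) {
                push @links, $1;
            }
        }
    }
    return @links;
}

# Take the URL from the command line. The call is guarded so the file can
# also be loaded without arguments; links_on_page itself dies when it is
# called without a URL.
if (@ARGV) {
    print "$_\n" for links_on_page($ARGV[0]);
}
```

Saved under a name of your choosing and run with a URL as its only argument, this prints one link per line. Because it reads the page line by line, an href that is split across two lines would be missed.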
One of the more challenging parts was identifying what a URL looks like. It is pretty clear that I want something with href near the beginning, but then what? The value of the HTML href attribute could:
- Start with a " or not
- Start with a ' or not
- Start with http or not
- Start with https or not
- Start with / or not
- Start with any alphanumeric character or not
- End with a TLD or not
- End with a file extension or not
- End with a slash or not
- End with any alphanumeric character or not
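One way to cope with that many combinations is to stop describing how a URL starts or ends and instead match on how the attribute value is delimited: double quotes, single quotes, or no quotes at all. The sketch below assumes that approach, which may well differ from the expression I ended up with. The names $HREF_RE and find_urls are made up for the example, and the %seen hash shows the usual idiom for reporting each URL only once.

```perl
use strict;
use warnings;

# Match an href attribute whose value is double-quoted, single-quoted,
# or unquoted. Exactly one of the three capture groups is set per match.
my $HREF_RE = qr{
    \bhref \s* = \s*
    (?:
        "([^"]*)"          # group 1: double-quoted value
      | '([^']*)'          # group 2: single-quoted value
      | ([^\s"'<>]+)       # group 3: unquoted value, ends at whitespace or a bracket
    )
}xi;

# Return the href values in $html, each one once, in order of first appearance.
sub find_urls {
    my ($html) = @_;
    my (%seen, @urls);
    while ($html =~ /$HREF_RE/g) {
        # Pick whichever capture group took part in this match.
        my ($url) = grep { defined $_ } ($1, $2, $3);
        next unless defined $url && length $url;   # skip empty values such as href=""
        push @urls, $url unless $seen{$url}++;     # %seen remembers what was already found
    }
    return @urls;
}
```

Calling find_urls on a chunk of HTML returns the values as a list, so relative links such as /docs/ come back exactly as written. Turning them into absolute URLs before fetching is a separate step that this sketch does not attempt.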
It took some time, and it has not been tested very heavily, but it seems to succeed in returning only URLs from a given page.
January 19, 2012
