post Category: Linux,coding — Jon Watson @ 2:25 pm — post Comments (0)

I am writing a web crawler in Perl. Yes, I know there are a million web crawlers out there, but I am purposely writing one from scratch, in Perl, without stealing code from anyone else. My only resource is the Beginning Perl book on perl.org.

The concept of the web crawler is pretty easy, but the mechanics of it make it an excellent learning tool. To build it, I need to know how to use regular expressions to find the URLs in a page, hash manipulation to store the URLs I find and keep track of the URLs I’ve seen before, and some good old-fashioned input/output routines and flow control. Not rocket science, but certainly rocket science has these little bits in it.
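As a rough illustration of the hash part, here is a minimal sketch of tracking URLs that have already been seen. The names %seen, @queue and remember_url are my own placeholders for this example, not taken from the finished crawler.

```perl
use strict;
use warnings;

my %seen;     # URL => number of times it has been encountered
my @queue;    # URLs still waiting to be fetched

# Hypothetical helper: queue a URL only the first time it shows up.
sub remember_url {
    my ($url) = @_;
    return 0 if $seen{$url}++;   # already known, so just bump the count
    push @queue, $url;
    return 1;
}

remember_url($_) for qw(
    http://example.com/
    http://example.com/about
    http://example.com/
);

print "queued: @queue\n";
print "times seen http://example.com/: $seen{'http://example.com/'}\n";
```

The post-increment in `$seen{$url}++` returns the old count, so the test is false only on the first sighting. That lets one hash serve as both the store and the "seen before" check.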

The entire program is just a few lines and the most important one is the regular expression that finds the URLs. This little snippet does the following:

  1. It gets the URL from the command line or fails if one is not supplied
  2. It uses the LWP::Simple Perl library to fetch the page from the URL
  3. If the resulting page has at least one line, it goes into a foreach loop that examines each line and extracts URLs from it.
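The three steps above might be laid out roughly as follows. This is a sketch under assumptions rather than the actual program: the sub names links_in_page and crawl_page are made up for the example, and the pattern inside the loop is a deliberately loose placeholder. LWP::Simple is loaded only when a page is actually fetched.

```perl
use strict;
use warnings;

# Step 3: examine the page line by line and collect whatever follows href=.
# The pattern here is a loose placeholder, not the post's regular expression.
sub links_in_page {
    my ($page) = @_;
    my @found;
    foreach my $line (split /\n/, $page) {
        while ($line =~ /href=["']?([^"'\s>]+)/g) {
            push @found, $1;
        }
    }
    return @found;
}

# Steps 1 and 2: take the URL from the argument list and fetch the page.
sub crawl_page {
    my ($url) = @_;
    die "Usage: $0 <url>\n" unless defined $url && length $url;

    require LWP::Simple;                  # loaded only when fetching
    my $page = LWP::Simple::get($url);    # returns undef on failure
    die "Could not fetch $url\n" unless defined $page && length $page;

    print "$_\n" for links_in_page($page);
    return;
}
```

Ending the script with `crawl_page(@ARGV);` would make it runnable as `perl crawler.pl http://example.com/`, where crawler.pl is just a stand-in file name.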

One of the more challenging parts was identifying what a URL looks like. It is pretty clear that I want something with href near the beginning, but then what? The value of the href attribute in an HTML anchor tag could:

  • Start with a " or not
  • Start with a ' or not
  • Start with http or not
  • Start with https or not
  • Start with / or not
  • Start with any alphanumeric character or not
  • End with a TLD or not
  • End with a file extension or not
  • End with a slash or not
  • End with any alphanumeric character or not

It took some time, and this has not been tested very heavily, but it seems to return only URLs from a given page.

$line =~ /href=[\"\']?((https?:\/\/)?\W?([[:alnum:]]+.)*[[:alnum:]]+(\/[[:alnum:]]*)*)/
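To see what the expression captures, here is a small sketch that runs it against a few made-up lines. The helper first_url and the sample lines are assumptions for illustration. The pattern itself is copied unchanged from the line above.

```perl
use strict;
use warnings;

# The pattern from the post, compiled once. The "." after the first
# [[:alnum:]]+ is unescaped, so it matches any character, not only a dot.
my $href_re = qr/href=[\"\']?((https?:\/\/)?\W?([[:alnum:]]+.)*[[:alnum:]]+(\/[[:alnum:]]*)*)/;

# Hypothetical helper: return the first captured URL on a line, or undef.
sub first_url {
    my ($line) = @_;
    return $line =~ $href_re ? $1 : undef;
}

my @samples = (
    '<a href="http://example.com/about">About</a>',   # absolute, double quotes
    "<a href='page.html'>Page</a>",                   # relative, single quotes
    '<a href="/contact/">Contact</a>',                # rooted path, trailing slash
    '<a href="#">Top</a>',                            # fragment only
    '<a href=https://example.org>x</a>',              # unquoted attribute
);

for my $line (@samples) {
    my $url = first_url($line);
    printf "%-50s %s\n", $line, defined $url ? $url : '(no match)';
}
```

The three quoted links come back intact and the bare "#" is skipped. On the unquoted sample the capture runs past the ">" because the unescaped "." accepts it, so that case may deserve a test of its own.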

post Category: coding,sysadmin — Jon Watson @ 5:26 pm — post Comments (0)

I’m brushing up on my Perl and, just to make sure I don’t carry any bad assumptions into the learning process, I am starting from scratch. A couple of interesting things that I forgot (or maybe didn’t know, not sure) about Perl follow. One is neat (the numbers); the other is just basic non-tech cleverness:

First, the Pumpkin

The Perl gatekeeper is nicknamed the Pumpkin. Or, more correctly, the “patch pumpkin holder”, also known as “the pumpking”.

The role of the gatekeeper is clear: to coordinate the various changes in the Perl project, to ensure that only good code gets into it, and to ensure that nobody’s work conflicts with someone else’s. So, why is the Perl gatekeeper called the Pumpking? It stems from a story from David Croy, who once worked in a place that used a single tape drive to back up all of the systems within the organization. In order to prevent these systems from randomly starting backups while others were in progress, a decidedly non-tech device was used. A stuffed pumpkin. You were only allowed to back up your system if you were in possession of the “backup pumpkin”. If you didn’t have it, you didn’t dare do a backup because whoever had the pumpkin was undoubtedly doing their backup at that time.

Next, the Numbers

Perl is smart enough to recognize numbers in decimal, octal, binary and hex. It will correctly determine what format a number is in as long as you give it a little hint.

  • 255 will be recognized as a decimal number
  • 0377 will be recognized as an octal number
  • 0b11111111 will be recognized as a binary number
  • 0xFF will be recognized as a hex number
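A quick sketch to confirm that the four forms above are the same value. The prefixes are recognized in literals written in the source code. For strings read at run time, the built-in hex() and oct() functions do the conversion. The variable names are only for this example.

```perl
use strict;
use warnings;

# All four literals denote the same value, 255.
my @values = (255, 0377, 0b11111111, 0xFF);
print "$_\n" for @values;

# Strings need an explicit conversion: hex() for hexadecimal, and oct()
# for octal as well as strings carrying a 0b or 0x prefix.
my $from_hex = hex("FF");
my $from_oct = oct("0377");
my $from_bin = oct("0b11111111");
print "$from_hex $from_oct $from_bin\n";

# printf can format a number back into each base.
printf "%o %b %X\n", 255, 255, 255;
```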

Onwards…

post Category: sysadmin — Jon Watson @ 12:29 pm — post Comments (0)

By now we all know that keeping backups of your data is important. Whether we are business folk or personal folk, we’ve all lost something important at one time and we’ve never forgotten the pain of that lesson. There are approximately one billion companies on the Internet that also do not want you to forget that pain because they offer…you guessed it…backups.

The backup products on the market simply boggle the mind. How are you supposed to know whether you want incrementals, fulls, CDPs or any other backup scheme? I’m a tech guy and even I have problems trying to figure out which acronyms are actual technical terms that mean something and which are marketing buzzwords that mean absolutely nothing.

Once you get past the feature set, you’re then stuck with the task of trying to determine which companies are reliable. Which ones will actually deliver on the promised feature set? Which ones will actually be there in that all-critical moment when you need to do a recovery?

Well, the fact of the matter is that you can’t tell. Much like any other service-based offering out there, it’s really not possible to know which service providers will live up to their marketing hype without actually using them. Or, at least, without actually getting some feedback from people who use them.

A logical place to start is here with the best online backup reviews. It’s pretty much self-explanatory: it lists the top 10 online backup services based on features, support, speed, pricing and reliability. Each of the top 10 has its own dedicated page on the site where you can see individual reviews from actual users of the service. In these heady days of crowdsourcing, independent reviews from actual users are like gold. A shoddy service won’t survive very long.

So, you know you need backups. Go get ’em.
