How to write crawlers and parse a page using Perl (Part 1)

Hello all Perl freaks,

One of the most powerful things we can do with Perl is extract any content we want from a website. For example, you can use Perl to extract information about all the artists on All Music, or about all cricket players and matches on CricInfo. In the past I have used Perl to build web crawlers for Altertunes, and most recently I used it to extract news from Google News.

Here I will try to explain how you can efficiently extract information by parsing HTML pages with Perl.

To start with, let's revise some Perl basics.

Let's first see how we can get the HTML content of a website:

Example 1

use strict;
use warnings;
require LWP::UserAgent;

#~ Call the gethtmlpage function by passing the url we want to save
gethtmlpage("http://abhinavsingh.com");

sub gethtmlpage {
  my ($url) = @_;
  my $ua = LWP::UserAgent->new;
  #~ Uncomment the line below if you connect through a proxy
  #~ $ua->proxy('http','http://[PROXY_URL]:[PROXY_PORT]/');
  #~ Fetch the page with a GET request
  my $response = $ua->get($url);

  if ($response->is_success) {
    my $output = $response->content;
    #~ Save the page content to a file in the current folder
    open(my $fh, '>', 'abhinavsingh.com.html') or die "Cannot open output file: $!";
    print $fh $output;
    close($fh);
  }
  else {
    print "Error in getting HTML page: ".$response->status_line."\n";
  }
}

If you are using PXPerl on Windows, copy and paste the above code into the SciTE Perl editor (which comes packaged with PXPerl) and simply press Ctrl+F7. This will produce an HTML file named ‘abhinavsingh.com.html’ in your folder.

The most important feature that makes Perl and Python the default choices for web crawlers is their regular expression support. Let's look at some of the regular expression features we will be using to parse an HTML page.

Example 2

$sentence = "This is a perl tutorial by Abhinav Singh at http://abhinavsingh.com";

#~ Matching $sentence for 'Abhinav Singh'
$sentence =~ m/Abhinav Singh/i;

print "Pre-Match: ".$`."\n";
print "Match: ".$&."\n";
print "Post-Match: ".$'."\n";

Copy the above code into the SciTE Perl editor and press Ctrl+F7. You should see a result similar to this:

Output 2

>perl example2.pl
Pre-Match: This is a perl tutorial by
Match: Abhinav Singh
Post-Match:  at http://abhinavsingh.com
>Exit code: 0    Time: 0.962
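
The pre-match, match and post-match variables are handy for a quick look, but when parsing a page you usually want one specific piece of the matched text. Example 3 relies on capture groups ($1, $2, ...) for that. Below is a minimal sketch of the idea, reusing the sentence from Example 2. The helper name extract_author_and_url and its pattern are made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original examples): returns the
# author name and the URL captured from a sentence, or an empty list
# when the pattern does not match.
sub extract_author_and_url {
    my ($sentence) = @_;
    # In list context a match returns the contents of its capture groups.
    my ($author, $url) = $sentence =~ m/by (.+?) at (http:\/\/\S+)/i;
    return defined $author ? ($author, $url) : ();
}

my $sentence = "This is a perl tutorial by Abhinav Singh at http://abhinavsingh.com";
my ($author, $url) = extract_author_and_url($sentence);
print "Author: $author\n";
print "URL: $url\n";
```

Each pair of parentheses in the pattern becomes one captured value, so the same match that locates the text also hands back the parts you care about.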

Now let's see how we can extract relevant information from a page. Suppose we are interested in extracting all the information about the artist Metallica from the AllMusic website. Below I will first show you my code and then its result. Finally I will discuss how I built all those regular expressions:

Example 3

require LWP::UserAgent;

$bandname = "Metallica";
getartistinfo($bandname);

sub getartistinfo {
  my %formdata;
  my $ua = LWP::UserAgent->new;
  #~ $ua->proxy('http','http://[PROXY_URL]:[PROXY_PORT]/');

  $formdata{'sql'}=$_[0];
  $formdata{'opt1'}=1;
  $formdata{'P'}='amg';

  print "Sending HTTP request for ".$_[0]."...\n";
  my $response = $ua->post('http://www.allmusic.com/cg/amg.dll',\%formdata);

  if ($response->is_success) {
    print "Got HTTP response... parsing output for ".$_[0]."...\n\n";
    $output=$response->content;

    # Extracting Overview, Biography, Discography, Songs, Credit, Charts & Awards link for the artist
    $output =~ m/cg\/amg\.dll\?p=amg&searchlink=(.*)">/;
    $BaseLink = "http://www.allmusic.com/cg/amg.dll?p=amg&searchlink=";
    $OverviewLink = $1;
    $DiscographyMainAlbumLink = $BaseLink.$OverviewLink;
    $DiscographyMainAlbumLink =~ s/T0/T20/;
    print "Discography Main Album: ".$DiscographyMainAlbumLink."\n";
    $DiscographySinglesEPLink = $BaseLink.$OverviewLink;
    $DiscographySinglesEPLink =~ s/T0/T22/;
    print "Discography Singles&EP: ".$DiscographySinglesEPLink."\n";
    $DiscographyDvDVideosLink = $BaseLink.$OverviewLink;
    $DiscographyDvDVideosLink =~ s/T0/T23/;
    print "Discography DVD Videos: ".$DiscographyDvDVideosLink."\n";
    $DiscographyAllSongsLink = $BaseLink.$OverviewLink;
    $DiscographyAllSongsLink =~ s/T0/T31/;
    print "Songs All Songs: ".$DiscographyAllSongsLink."\n";
    $DiscographyCnAAlbumsLink = $BaseLink.$OverviewLink;
    $DiscographyCnAAlbumsLink =~ s/T0/T50/;
    print "Charts & Awards Billboard Albums: ".$DiscographyCnAAlbumsLink."\n";
    $DiscographyCnASinglesLink = $BaseLink.$OverviewLink;
    $DiscographyCnASinglesLink =~ s/T0/T51/;
    print "Charts & Awards Billboard Singles: ".$DiscographyCnASinglesLink."\n";
    $DiscographyGrammyLink = $BaseLink.$OverviewLink;
    $DiscographyGrammyLink =~ s/T0/T52/;
    print "Charts & Awards Grammy Awards: ".$DiscographyGrammyLink."\n\n";

    # Extracting Title Bar
    $output =~ m/<td class="titlebar"><span class="title">(.*)<\/span><br \/>/;
    $titlebar = $1;
    print "Titlebar:\n".$titlebar."\n\n";
    $output = $';

    # Extracting Formed-Sub
    $output =~ m/Begin Formed(.*)<span>(.*)End Formed/;
    $output = $';
    $formedsub = $2;
    $formedsub =~ m/<a href=(.*)>(.*)<\/a>(.*)<a href=(.*)>(.*?)<\/a>/; # Parse $formedsub for exact string
    print "Formed: ".$2.$3.$5."\n\n";

    # Extracting timelinesubactive
    while($output =~ m/class="timeline-sub-active">(\d+)<\/div>/) {
      print "ActiveYear:".$1."\n";
      $output = $';
    }
    print "\n";

    # Extract Genre, Style titles
    $output =~ m/id="left-sidebar-title-small"(.*?)<\/tr>/;
    $suboutput = $&;
    $output = $';
    while($suboutput =~ m/id="left-sidebar-title-small"><span>(.*?)<\/span>/) {
      #~ print "Subclasses:".$1."\n";
      push(@GSM,$1);
      $suboutput = $';
    }
    #~ print "\n";

    # Extract Genre contents
    $output =~ m/<td class="list-cell"(.*?)<\/td>/;
    $suboutput = $&;
    $output = $';
    while($suboutput =~ m/<li>(.*?)<\/li>/) {
      #~ print "Genres:".$1."\n";
      $suboutput = $';
      $1 =~ m/<a href=(.*)>(.*)<\/a>/;
      push(@G,$2);
    }
    #~ print "\n";

    # Extract Style contents
    $output =~ m/<td class="list-cell"(.*?)<\/td>/;
    $suboutput = $&;
    $output = $';
    while($suboutput =~ m/<li>(.*?)<\/li>/) {
      #~ print "Styles:".$1."\n";
      $suboutput = $';
      $1 =~ m/<a href=(.*)>(.*)<\/a>/;
      push(@S,$2);
    }
    #~ print "\n";

    # Extract Mood subclass
    $output =~ m/id="left-sidebar-title-small"><span>(.*?)<\/span>/;
    $output = $';
    #~ print "Subclasses:".$1."\n\n";
    push(@GSM,$1);

    # Extract Mood Contents
    $output =~ m/id="left-sidebar-list"(.*?)<\/div>/;
    $suboutput = $&;
    $output = $';
    while($suboutput =~ m/<li>(.*?)<\/li>/) {
      #~ print "Moods:".$1."\n";
      $suboutput = $';
      $1 =~ m/<a href=(.*)>(.*)<\/a>/;
      push(@M,$2);
    }
    print "\n";

    # Print the @GSM and @G,@S,@M content
    print $GSM[0].":";
    foreach $gen (@G) {
      print $gen."\t";
    }
    print "\n\n".$GSM[1].":";
    foreach $gen (@S) {
      print $gen."\t";
    }
    print "\n\n".$GSM[2].":";
    foreach $gen (@M) {
      print $gen."\t";
    }
    print "\n\n";

    # Extract AMG Artist ID
    $output =~ m/<td class="sub-text"(.*?)<\/pre>/;
    $output = $';
    $1 =~ m/<pre>(.*)/;
    print "AMG Artist ID:".$1."\n\n";

    # Extracting Artist Mini Bio
    $output =~ m/id="artistminibio"><p>(.*)<\/p>/;
    $artistminibio = $1;
    $artistminibio =~ s/<a href(.*?)>//g; # Filtering out any link or html tags
    $artistminibio =~ s/<\/a>//g;
    $artistminibio =~ s/<i>//g;
    $artistminibio =~ s/<\/i>//g;
    print "ArtistMiniBio:\n".$artistminibio."\n\n";

    # Extracting Other Entries, Group Members, Similar Artists, Influenced By and Follower
    $output =~ m/id="large-list"><tr>(.*?)<\/table>/;
    $suboutput = $&;
    $output = $';
    # Extracting two part of the table
    $suboutput =~ m/<td valign="top" width="266px">(.*)<\/td><td/;
    $lefthalftemp = $1;
    $righthalftemp = $';

    while($lefthalftemp =~ m/<div class="large-list-subtitle">(.*?)<\/div>/) {
      print $1.":\n";
      $' =~ m/<ul>(.*?)<\/ul>/;
      $lefthalftemp = $';
      $li = $1;
      while($li =~ m/<li>(.*?)<\/li>/) {
        $li = $';
        $1 =~ m/<span class="libg"><a href=(.*)>(.*)<\/a><\/span>/i;
        print $2."\n";
      }
      print "\n\n";
    }

    while($righthalftemp =~ m/<div class="large-list-subtitle">(.*?)<\/div>/) {
      print $1.":\n";
      $' =~ m/<ul>(.*?)<\/ul>/;
      $righthalftemp = $';
      $li = $1;
      while($li =~ m/<li>(.*?)<\/li>/) {
        $li = $';
        $1 =~ m/<span class="libg"><a href=(.*)>(.*)<\/a><\/span>/i;
        print $2."\n";
      }
      print "\n\n";
    }
  }
}

Copy the above code into the SciTE Perl editor and press Ctrl+F7. You should see output like the following, which contains all the extracted data about the artist Metallica.

Output 3

>perl example4.pl
Sending HTTP request for Metallica...
Got HTTP response... parsing output for Metallica...

Discography Main Album: http://www.allmusic.com/cg/amg.dll?p=amg&searchlink=METALLICA&sql=11:kifpxqe5ldte~T20
Discography Singles&EP: http://www.allmusic.com/cg/amg.dll?p=amg&searchlink=METALLICA&sql=11:kifpxqe5ldte~T22
Discography DVD Videos: http://www.allmusic.com/cg/amg.dll?p=amg&searchlink=METALLICA&sql=11:kifpxqe5ldte~T23
Songs All Songs: http://www.allmusic.com/cg/amg.dll?p=amg&searchlink=METALLICA&sql=11:kifpxqe5ldte~T31
Charts & Awards Billboard Albums: http://www.allmusic.com/cg/amg.dll?p=amg&searchlink=METALLICA&sql=11:kifpxqe5ldte~T50
Charts & Awards Billboard Singles: http://www.allmusic.com/cg/amg.dll?p=amg&searchlink=METALLICA&sql=11:kifpxqe5ldte~T51
Charts & Awards Grammy Awards: http://www.allmusic.com/cg/amg.dll?p=amg&searchlink=METALLICA&sql=11:kifpxqe5ldte~T52

Titlebar:
Metallica

Formed: 1981 in Los Angeles, CA

ActiveYear:80
ActiveYear:90
ActiveYear:2000

Genre:Rock

Styles:Thrash	Heavy Metal	Speed Metal	Hard Rock

Moods:Bitter	Suffocating	Fierce	Angry	Aggressive	Menacing
Gritty	Tense/Anxious	Hostile	Crunchy	Epic	Nihilistic	Fiery
Intense	Dramatic	Harsh	Ominous	Rebellious	Uncompromising
Searching	Gloomy

AMG Artist ID:P     4906

ArtistMiniBio:
Metallica was easily the best, most influential heavy metal band of the '80s,
responsible for bringing the music back to Earth.
Instead of playing the usual rock star games of metal stars of the early '80s,
the band looked and talked like they were from the street.
Metallica expanded the limits of thrash, using speed and volume not for their own sake,
but to enhance their intricately structured compositions.
The release of 1983's Kill 'Em All marked the beginning of the legitimization
of heavy metal's underground, bringing new complexity and depth to thrash metal.
With each album, the band's playing and writing improved;
James Hetfield developed a signature rhythm playing that matched his growl,
while lead guitarist Kirk Hammett... Read More...

Other Entries:
Movie Entry
Classical Music Entry

Group Members:
Kirk Hammett
James Hetfield
Dave Mustaine
Jason Newsted
Lars Ulrich
Cliff Burton
Robert Trujillo
Ron McGovney

Similar Artists:
Slayer
Anthrax
Sepultura
Machine Head
Coroner
Death
Dio
Danzig
King Diamond
Mercyful Fate
Metal Church
Overkill
Voivod
Death Angel
Queensrÿche
Cancer
Corrosion of Conformity
White Zombie
Rollins Band
Melvins
Soundgarden

See Also:
Megadeth
Flotsam & Jetsam
Exodus
Rock Star Supernova

Influenced By:
Motörhead
The Misfits
Diamond Head
Black Sabbath
Judas Priest
Angel Witch
Iron Maiden
Saxon
Accept
Budgie
Deep Purple
Rush
AC/DC
Led Zeppelin
G.B.H.
Fear
Ted Nugent
Lynyrd Skynyrd
UFO
Thin Lizzy
Queen

Followers:
Carcass
Grindcrusher
At War
Crowbar
The Beyond
Sevendust
Boy Hits Car
Queens of the Stone Age
Roachpowder
Ossiris
Avenged Sevenfold
Trapt
Hurt
Scenes from a Movie
Sick City
Saving Abel

Performed Songs By:
James Hetfield
Lars Ulrich
Kirk Hammett
Cliff Burton
Bob Rock
Dave Mustaine
Brian Tatler
Sean Harris
Roger Taylor
"Fast" Eddie Clarke
Glenn Danzig
Jason Newsted
John Deacon
Brian May
Freddie Mercury
Lemmy
Lemmy Kilmister
Phil "Philthy Animal" Taylor
Burke Shelley

>Exit code: 0    Time: 5.940

Thus, on running the above script you get all the information about the artist Metallica from All Music’s Metallica page. For demonstration purposes I have extracted information only from Metallica’s main page; however, you can write similar code to extract information from Metallica’s other sub-pages on All Music.
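
The sub-page links in Example 3 differ only in the code that replaces T0, so one way to reach those other pages is to keep the codes in a table and build the links in a loop. The sketch below assumes the codes used in Example 3 (T20, T22 and so on) and an overview link that ends in T0, like the one behind Output 3. The helper subpage_links is hypothetical, and the site's URL scheme may well have changed since this was written.

```perl
use strict;
use warnings;

# Label and suffix code for each sub-page, copied from Example 3.
my @SUBPAGES = (
    [ 'Discography Main Album'            => 'T20' ],
    [ 'Discography Singles&EP'            => 'T22' ],
    [ 'Discography DVD Videos'            => 'T23' ],
    [ 'Songs All Songs'                   => 'T31' ],
    [ 'Charts & Awards Billboard Albums'  => 'T50' ],
    [ 'Charts & Awards Billboard Singles' => 'T51' ],
    [ 'Charts & Awards Grammy Awards'     => 'T52' ],
);

# Hypothetical helper: returns [label, url] pairs, one per sub-page.
# Assumes $overview ends in "T0", as the link in Output 3 appears to.
sub subpage_links {
    my ($base, $overview) = @_;
    my @links;
    for my $page (@SUBPAGES) {
        my ($label, $code) = @$page;
        my $link = $base . $overview;
        $link =~ s/T0$/$code/;    # swap the trailing overview code
        push @links, [ $label, $link ];
    }
    return @links;
}

my $base     = 'http://www.allmusic.com/cg/amg.dll?p=amg&searchlink=';
my $overview = 'METALLICA&sql=11:kifpxqe5ldte~T0';
for my $entry (subpage_links($base, $overview)) {
    print $entry->[0], ": ", $entry->[1], "\n";
}
```

Each returned URL could then be passed to a fetch routine like the one in Example 1 and parsed with its own set of regular expressions.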

Meanwhile, if you are wondering how my Perl script extracts the artist information, what method I used to make sure only the relevant information is parsed from the page, or how I built all those regular expression matches, watch out for Part 2 of this blog. For now I leave it up to you to figure out how it is all done.
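
As a small head start, two details carry much of Example 3: the difference between the greedy (.*) and the non-greedy (.*?), and the loop that walks through repeated tags. The sketch below shows both on a made-up HTML fragment. The helper names are hypothetical, and the /g loop is a swapped-in alternative to the "$output = $'" step used in Example 3, not the method used there.

```perl
use strict;
use warnings;

# Made-up fragment shaped like the lists parsed in Example 3.
my $html = '<ul><li><a href="/a1">Slayer</a></li><li><a href="/a2">Anthrax</a></li></ul>';

# Hypothetical helper: greedy (.*) runs on to the LAST </li> in the string.
sub greedy_capture {
    my ($s) = @_;
    return $s =~ m/<li>(.*)<\/li>/ ? $1 : undef;
}

# Hypothetical helper: non-greedy (.*?) stops at the FIRST </li>.
sub non_greedy_capture {
    my ($s) = @_;
    return $s =~ m/<li>(.*?)<\/li>/ ? $1 : undef;
}

# Hypothetical helper: collects the link text of every <li> entry.
sub list_items {
    my ($s) = @_;
    my @items;
    # With /g, each pass resumes right after the previous match, so the
    # input string does not have to be trimmed by hand.
    while ($s =~ m/<li>(.*?)<\/li>/g) {
        my $li = $1;    # copy $1 before the inner match overwrites it
        push @items, $1 if $li =~ m/<a href=[^>]*>(.*?)<\/a>/;
    }
    return @items;
}

print "Greedy:     ", greedy_capture($html), "\n";
print "Non-greedy: ", non_greedy_capture($html), "\n";
print "Items:      ", join(", ", list_items($html)), "\n";
```

The greedy capture swallows both list items in one go, while the non-greedy one returns only the first, which is why the per-item patterns in Example 3 use (.*?).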

Here are a few important links which will help you build crawlers similar to those of Altertunes, and also understand the methods I have used above.
1. Leeds University Perl page
2. Tizag
3. Finally, the documentation which comes with PXPerl is in itself a complete guide for everything.

Hope I helped a little in your quest to build crawlers.

In the next blog I will try to wrap up this section (I am tired of writing this one as of now) 😉

All the best.