Web Development

How to write crawlers and parse a page using Perl (Part 1)

Hello all perl freaks,

One of the most powerful thing which we can achieve using perl is, extracting any content from a website you want to. For example, you can use perl to extract information of all the artists from All Music, extract information about all cricket players and matches from CricInfo. In the past I have used perl for making web crawlers for Altertunes and most recently I used perl to extract news from Google News.

Here I will try to explain how efficiently you can extract information by parsing html pages using perl.

To start with lets revise some basic stuffs about perl.

Lets first see how can we get HTML content of a website:

Example 1

require LWP::UserAgent;

#~ Call the gethtmlpage function by passing the url we want to save

sub gethtmlpage {
  my $ua = LWP::UserAgent->new;
  #~ Use below line of code for proxied net connection
  my $response = $ua->post("$_[0]");

  if ($response->is_success) {
    $output = $response->content;
    print $fh $output;
  else {
    print "Error in getting HTML page";

If you are using PXPerl on windows, copy paste the above code in the SciTE perl editor (which comes in packaged with PXPerl) and simply press CNTR+F7. This will result into an html file named ‘abhinavsingh.com.html’ in your folder.

Most important feature which makes PERL and Python as default choice for web crawlers, is their ability of regular expression match. Lets see at some of the regular expression we will be using for parsing an HTML page.

Example 2

$sentence = "This is a perl tutorial by Abhinav Singh at http://abhinavsingh.com";

#~ Matching $sentence for 'Abhinav Singh'
$sentence =~ m/Abhinav Singh/i;

print "Pre-Match: ".$`."n";
print "Match: ".$&."n";
print "Post-Match: ".$'."n";

Copy the above code in SciTE perl editor and press CNTR+F7. You should see a result similar to this:

Output 2

>perl example2.pl
Pre-Match: This is a perl tutorial by
Match: Abhinav Singh
Post-Match:  at http://abhinavsingh.com
>Exit code: 0    Time: 0.962

Now lets see how can we extract relevant information from a page. Suppose we are interested in extracting all information about the artist Metallica from AllMusic website. Below I will first show you my code for the same and then its result. Finally I will discuss as to how did I made all those regular expressions:

Example 3

require LWP::UserAgent;

$bandname = "Metallica";

sub getartistinfo {
  my %formdata;
  my $ua = LWP::UserAgent->new;
  #~ $ua->proxy('http','http://[PROXY_URL]:[PROXY_PORT]/');


  print "Sending HTTP request for ".$_[0]."...n";
  my $response = $ua->post('http://www.allmusic.com/cg/amg.dll',%formdata);

  if ($response->is_success) {
    print "Got HTTP response... parsing output for ".$_[0]."...nn";

    # Extracting Overview, Biography, Discography, Songs, Credit, Charts & Awards link for the artist
    $output =~ m/cg/amg.dll?p=amg&searchlink=(.*)">/;
    $BaseLink = "http://www.allmusic.com/cg/amg.dll?p=amg&searchlink=";
    $OverviewLink = $1;
    $DiscographyMainAlbumLink = $BaseLink.$OverviewLink;
    $DiscographyMainAlbumLink =~ s/T0/T20/;
    print "Discography Main Album: ".$DiscographyMainAlbumLink."n";
    $DiscographySinglesEPLink = $BaseLink.$OverviewLink;
    $DiscographySinglesEPLink =~ s/T0/T22/;
    print "Discography Singles&EP: ".$DiscographySinglesEPLink."n";
    $DiscographyDvDVideosLink = $BaseLink.$OverviewLink;
    $DiscographyDvDVideosLink =~ s/T0/T23/;
    print "Discography DVD Videos: ".$DiscographyDvDVideosLink."n";
    $DiscographyAllSongsLink = $BaseLink.$OverviewLink;
    $DiscographyAllSongsLink =~ s/T0/T31/;
    print "Songs All Songs: ".$DiscographyAllSongsLink."n";
    $DiscographyCnAAlbumsLink = $BaseLink.$OverviewLink;
    $DiscographyCnAAlbumsLink =~ s/T0/T50/;
    print "Charts & Awards Billboard Albums: ".$DiscographyCnAAlbumsLink."n";
    $DiscographyCnASinglesLink = $BaseLink.$OverviewLink;
    $DiscographyCnASinglesLink =~ s/T0/T51/;
    print "Charts & Awards Billboard Singles: ".$DiscographyCnASinglesLink."n";
    $DiscographyGrammyLink = $BaseLink.$OverviewLink;
    $DiscographyGrammyLink =~ s/T0/T52/;
    print "Charts & Awards Grammy Awards: ".$DiscographyGrammyLink."nn";

    # Extracting Title Bar
    $output =~ m/<td class="titlebar"><span class="title">(.*)</span><br />/;
    $titlebar = $1;
    print "Titlebar:n".$titlebar."nn";
    $output = $';

    # Extracting Formed-Sub
    $output =~ m/Begin Formed(.*)<span>(.*)End Formed/;
    $output = $';
    $formedsub = $2;
    $formedsub =~ m/<a href=(.*)>(.*)</a>(.*)<a href=(.*)>(.*?)</a>/; # Parse $formedsub for exact string
    print "Formed: ".$2.$3.$5."nn";

    # Extracting timelinesubactive
    while($output =~ m/class="timeline-sub-active">(d+)</div>/) {
      print "ActiveYear:".$1."n";
      $output = $';
    print "n";

    # Extract Genre, Style titles
    $output =~ m/id="left-sidebar-title-small"(.*?)</tr>/;
    $suboutput = $&;
    $output = $';
    while($suboutput =~ m/id="left-sidebar-title-small"><span>(.*?)</span>/) {
      #~ print "Subclasses:".$1."n";
      $suboutput = $';
    #~ print "n";

    # Extract Genre contents
    $output =~ m/<td class="list-cell"(.*?)</td>/;
    $suboutput = $&;
    $output = $';
    while($suboutput =~ m/<li>(.*?)</li>/) {
      #~ print "Genres:".$1."n";
      $suboutput = $';
      $1 =~ m/<a href=(.*)>(.*)</a>/;
    #~ print "n";

    # Extract Style contents
    $output =~ m/<td class="list-cell"(.*?)</td>/;
    $suboutput = $&;
    $output = $';
    while($suboutput =~ m/<li>(.*?)</li>/) {
      #~ print "Styles:".$1."n";
      $suboutput = $';
      $1 =~ m/<a href=(.*)>(.*)</a>/;
    #~ print "n";

    # Extract Mood subclass
    $output =~ m/id="left-sidebar-title-small"><span>(.*?)</span>/;
    $output = $';
    #~ print "Subclasses:".$1."nn";

    # Extract Mood Contents
    $output =~ m/id="left-sidebar-list"(.*?)</div>/;
    $suboutput = $&;
    $output = $';
    while($suboutput =~ m/<li>(.*?)</li>/) {
      #~ print "Moods:".$1."n";
      $suboutput = $';
      $1 =~ m/<a href=(.*)>(.*)</a>/;
    print "n";

    # Print the @GSM and @G,@S,@M content
    print $GSM[0].":";
    foreach $gen (@G) {
      print $gen."t";
    print "nn".$GSM[1].":";
    foreach $gen (@S) {
      print $gen."t";
    print "nn".$GSM[2].":";
    foreach $gen (@M) {
      print $gen."t";
    print "nn";

    # Extract AMG Artist ID
    $output =~ m/<td class="sub-text"(.*?)</pre>/;
    $output = $';
    $1 =~ m/<pre>(.*)/;
    print "AMG Artist ID:".$1."nn";

    # Extracting Artist Mini Bio
    $output =~ m/id="artistminibio"><p>(.*)</p>/;
    $artistminibio = $1;
    $artistminibio =~ s/<a href(.*?)>//g; # Filtering out any link or html tags
    $artistminibio =~ s/</a>//g;
    $artistminibio =~ s/<i>//g;
    $artistminibio =~ s/</i>//g;
    print "ArtistMiniBio:n".$artistminibio."nn";

    # Extracting Other Entries, Group Members, Similar Artists, Influenced By and Follower
    $output =~ m/id="large-list"><tr>(.*?)</table>/;
    $suboutput = $&;
    $output = $';
    # Extracting two part of the table
    $suboutput =~ m/<td valign="top" width="266px">(.*)</td><td/;
    $lefthalftemp = $1;
    $righthalftemp = $';

    while($lefthalftemp =~ m/<div class="large-list-subtitle">(.*?)</div>/) {
      print $1.":n";
      $' =~ m/<ul>(.*?)</ul>/;
      $lefthalftemp = $';
      $li = $1;
      while($li =~ m/<li>(.*?)</li>/) {
        $li = $';
        $1 =~ m/<span class="libg"><a href=(.*)>(.*)</a></span>/i;
        print $2."n";
      print "nn";

    while($righthalftemp =~ m/<div class="large-list-subtitle">(.*?)</div>/) {
      print $1.":n";
      $' =~ m/<ul>(.*?)</ul>/;
      $righthalftemp = $';
      $li = $1;
      while($li =~ m/<li>(.*?)</li>/) {
        $li = $';
        $1 =~ m/<span class="libg"><a href=(.*)>(.*)</a></span>/i;
        print $2."n";
      print "nn";

Copy the above code into the SciTE perl editor and press CNTR+F7. You should see an output as below, which contains all the extracted data about the artist Metallica.

Output 3

>perl example4.pl
Sending HTTP request for Metallica...
Got HTTP response... parsing output for Metallica...

Discography Main Album: http://www.allmusic.com/cg/amg.dll?p=amg&searchlink=METALLICA&sql=11:kifpxqe5ldte~T20
Discography Singles&EP: http://www.allmusic.com/cg/amg.dll?p=amg&searchlink=METALLICA&sql=11:kifpxqe5ldte~T22
Discography DVD Videos: http://www.allmusic.com/cg/amg.dll?p=amg&searchlink=METALLICA&sql=11:kifpxqe5ldte~T23
Songs All Songs: http://www.allmusic.com/cg/amg.dll?p=amg&searchlink=METALLICA&sql=11:kifpxqe5ldte~T31
Charts & Awards Billboard Albums: http://www.allmusic.com/cg/amg.dll?p=amg&searchlink=METALLICA&sql=11:kifpxqe5ldte~T50
Charts & Awards Billboard Singles: http://www.allmusic.com/cg/amg.dll?p=amg&searchlink=METALLICA&sql=11:kifpxqe5ldte~T51
Charts & Awards Grammy Awards: http://www.allmusic.com/cg/amg.dll?p=amg&searchlink=METALLICA&sql=11:kifpxqe5ldte~T52


Formed: 1981 in Los Angeles, CA



Styles:Thrash	Heavy Metal	Speed Metal	Hard Rock

Moods:Bitter	Suffocating	Fierce	Angry	Aggressive	Menacing
Gritty	Tense/Anxious	Hostile	Crunchy	Epic	Nihilistic	Fiery
Intense	Dramatic	Harsh	Ominous	Rebellious	Uncompromising
Searching	Gloomy

AMG Artist ID:P     4906

Metallica was easily the best, most influential heavy metal band of the '80s,
responsible for bringing the music back to Earth.
Instead of playing the usual rock star games of metal stars of the early '80s,
the band looked and talked like they were from the street.
Metallica expanded the limits of thrash, using speed and volume not for their own sake,
but to enhance their intricately structured compositions.
The release of 1983's Kill 'Em All marked the beginning of the legitimization
of heavy metal's underground, bringing new complexity and depth to thrash metal.
With each album, the band's playing and writing improved;
James Hetfield developed a signature rhythm playing that matched his growl,
while lead guitarist Kirk Hammett...

Other Entries:
Movie Entry
Classical Music Entry

Group Members:
Kirk Hammett
James Hetfield
Dave Mustaine
Jason Newsted
Lars Ulrich
Cliff Burton
Robert Trujillo
Ron McGovney

Similar Artists:
Machine Head
King Diamond
Mercyful Fate
Metal Church
Death Angel
Corrosion of Conformity
White Zombie
Rollins Band

See Also:
Flotsam & Jetsam
Rock Star Supernova

Influenced By:
The Misfits
Diamond Head
Black Sabbath
Judas Priest
Angel Witch
Iron Maiden
Deep Purple
Led Zeppelin
Ted Nugent
Lynyrd Skynyrd
Thin Lizzy

At War
The Beyond
Boy Hits Car
Queens of the Stone Age
Avenged Sevenfold
Scenes from a Movie
Sick City
Saving Abel

Performed Songs By:
James Hetfield
Lars Ulrich
Kirk Hammett
Cliff Burton
Bob Rock
Dave Mustaine
Brian Tatler
Sean Harris
Roger Taylor
"Fast" Eddie Clarke
Glenn Danzig
Jason Newsted
John Deacon
Brian May
Freddie Mercury
Lemmy Kilmister
Phil "Philthy Animal" Taylor
Burke Shelley

>Exit code: 0    Time: 5.940

Thus, on running the above script you get all the insformation about the artist Metallica from the All Music’s Metallica page. For demonstration purpose I have just extracted information from Metallica’s main page, however you can write similar code to extract information from metallica’s other sub-pages on All Music.

Meanwhile, if you are just thinging as to, How come my perl script extract the artist information? What method have i used to make sure only the relevant information is parsed from the page? or How did I made all those regular expression matches? , watch out for Part 2 of this blog. As of now I leave up on you, to figure out how is it all done.

Here are a few important links which will help you in making crawlers similar to those of Altertunes, and also understand the methods I have used above.
1. Leed’s University perl page
2. Tizag
3. Finally the documentation which comes in with PXPerl, is in itself a complete guide for everything.

Hope I helped a little in your quest of making crawlers.

In next blog I will try to wrap up this section (I am tried writing this one as of now) 😉

All the best.

12 thoughts on “How to write crawlers and parse a page using Perl (Part 1)

    1. Sorry dost, Left coding in perl long back. Use it very rarely at times. But I guess most of it is still covered above. I wrote several parsers including the one for altertunes using similar methods.

      1. hi,

        what if the website uses a program to publish a particular information at the precise time? ie, before 10am, the page will show “not available yet” and then at 10am, the page has the content. can i make the retrieval program to wait?

    2. i am from speech and signal processing background. My college project was on automatic music information retrieval. Also being a musician myself helps.

      But all this helps in getting the initial list of artists only, the list of artists grows after i started including related artists from crawlers.

  1. Hi namespace,

    Yeah I have tried extracting keywords out of a webpage. However the algorithm I used was not mine and I remember it taking from someone’s blog, which I am unable to recall as of now. Will try to digg deep into my code repository and see if I can get that piece of code out 🙂

  2. Very nice explanation of the concepts. But when you crawl a web-page you will need to identify keywords out of the page. Have you ever tried exploring that?

Leave a Reply to Harit Himanshu Cancel reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.