Cuil: New search engine from former Googlers.

I just came across this search engine, which is supposedly from former Googlers, or should I say rebellious Googlers. Cuil Inc. (pronounced "cool") claims that it can index a far larger portion of the web than Google, faster and more cheaply.

The would-be Google rival says its service goes beyond prevailing search techniques that focus on Web links and audience traffic patterns and instead analyzes the context of each page and the concepts behind each user search request.

In my experience, it doesn’t give better search results for proper nouns. For example, querying my name doesn’t return my site at all. And when I query, it gives some Chinese characters.

One interesting feature, however, is that it shows search results with images, an idea derived from Yahoo’s Glue search. Glue is Yahoo’s upcoming search, which brings the whole experience for your query together on one page. For instance, try searching for Metallica and you get the following beautiful results:

Read more about the latest search engine on the web here:

I thought Google had killed innovation outside Google, but no, here are people proving that wrong.

How to write crawlers and parse a page using Perl (Part 1)

Hello all Perl freaks,

One of the most powerful things we can do with Perl is extract any content we want from a website. For example, you can use Perl to extract information about all the artists from All Music, or about all the cricket players and matches from CricInfo. In the past I have used Perl to build web crawlers for Altertunes, and most recently I used it to extract news from Google News.

Here I will try to explain how you can efficiently extract information by parsing HTML pages using Perl.

To start with, let's revise some Perl basics.

Let's first see how we can get the HTML content of a website:

Example 1

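require LWP::UserAgent;

#~ Call the gethtmlpage function by passing the url we want to save
#~ and the name of the file to save it in.
#~ Replace [URL] and [FILE_NAME] with your own values.
gethtmlpage('[URL]', '[FILE_NAME]');

sub gethtmlpage {
  my ($url, $filename) = @_;
  my $ua = LWP::UserAgent->new;
  #~ Use below line of code for proxied net connection
  #~ $ua->proxy('http','http://[PROXY_URL]:[PROXY_PORT]/');
  my $response = $ua->get($url);

  if ($response->is_success) {
    my $output = $response->content;
    #~ Open the output file for writing and save the page in it
    open(my $fh, '>', $filename) or die "Cannot open $filename: $!";
    print $fh $output;
    close($fh);
  }
  else {
    print "Error in getting HTML page\n";
  }
}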

If you are using PXPerl on Windows, copy and paste the above code into the SciTE Perl editor (which comes packaged with PXPerl) and simply press Ctrl+F7. This will create an HTML file named ‘’ in your folder.

The most important feature that makes Perl and Python the default choice for web crawlers is their regular expression matching. Let's look at some of the regular expressions we will be using to parse an HTML page.

Example 2

$sentence = "This is a perl tutorial by Abhinav Singh at";

#~ Matching $sentence for 'Abhinav Singh'
$sentence =~ m/Abhinav Singh/i;

#~ $` holds the text before the match, $& the match itself, $' the text after it
print "Pre-Match: ".$`."\n";
print "Match: ".$&."\n";
print "Post-Match: ".$'."\n";

Copy the above code into the SciTE Perl editor and press Ctrl+F7. You should see a result similar to this:

Output 2

Pre-Match: This is a perl tutorial by
Match: Abhinav Singh
Post-Match:  at
>Exit code: 0    Time: 0.962
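
Example 3 further down leans on two more regex features: capture groups, whose contents are available as $1, $2 and so on, and the non-greedy quantifier .*?, which stops at the first possible match instead of the last. Here is a small sketch of both. The HTML snippet and the variable names are made up for illustration and are not taken from any real page.

```perl
use strict;
use warnings;

# Made-up HTML snippet, for illustration only
my $html = '<li><a href="/a">Thrash</a></li><li><a href="/b">Speed Metal</a></li>';

# In list context a match returns its capture groups ($1, $2, ...)

# Greedy: (.*) runs up to the LAST </li> in the string
my ($greedy) = $html =~ m/<li>(.*)<\/li>/;

# Non-greedy: (.*?) stops at the FIRST </li>
my ($lazy) = $html =~ m/<li>(.*?)<\/li>/;

# Two capture groups: the first is the link target, the second the link text
my ($href, $text) = $lazy =~ m/<a href="(.*?)">(.*?)<\/a>/;

print "Greedy: $greedy\n";
print "Non-greedy: $lazy\n";
print "Link: $href, Text: $text\n";
```

The greedy version swallows both list items in one go, which is why most of the patterns that pick out a single item use (.*?) instead.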

Now let's see how we can extract relevant information from a page. Suppose we are interested in extracting all the information about the artist Metallica from the AllMusic website. Below I will first show you my code and then its result. Finally, I will discuss how I made all those regular expressions:

Example 3

require LWP::UserAgent;

$bandname = "Metallica";

sub getartistinfo {
  my %formdata;
  my $ua = LWP::UserAgent->new;
  #~ $ua->proxy('http','http://[PROXY_URL]:[PROXY_PORT]/');

  #~ Fill %formdata with the fields of the search form before posting

  print "Sending HTTP request for ".$_[0]."...\n";
  my $response = $ua->post('',\%formdata);

  if ($response->is_success) {
    print "Got HTTP response... parsing output for ".$_[0]."...\n\n";
    #~ Keep the page content in $output; every match below works on it
    $output = $response->content;

    # Extracting Overview, Biography, Discography, Songs, Credit, Charts & Awards link for the artist
    $output =~ m/cg\/amg\.dll\?p=amg&searchlink=(.*)">/;
    $BaseLink = "";
    $OverviewLink = $1;
    # Each sub-page link is the overview link with the T0 code replaced
    $DiscographyMainAlbumLink = $BaseLink.$OverviewLink;
    $DiscographyMainAlbumLink =~ s/T0/T20/;
    print "Discography Main Album: ".$DiscographyMainAlbumLink."\n";
    $DiscographySinglesEPLink = $BaseLink.$OverviewLink;
    $DiscographySinglesEPLink =~ s/T0/T22/;
    print "Discography Singles&EP: ".$DiscographySinglesEPLink."\n";
    $DiscographyDvDVideosLink = $BaseLink.$OverviewLink;
    $DiscographyDvDVideosLink =~ s/T0/T23/;
    print "Discography DVD Videos: ".$DiscographyDvDVideosLink."\n";
    $DiscographyAllSongsLink = $BaseLink.$OverviewLink;
    $DiscographyAllSongsLink =~ s/T0/T31/;
    print "Songs All Songs: ".$DiscographyAllSongsLink."\n";
    $DiscographyCnAAlbumsLink = $BaseLink.$OverviewLink;
    $DiscographyCnAAlbumsLink =~ s/T0/T50/;
    print "Charts & Awards Billboard Albums: ".$DiscographyCnAAlbumsLink."\n";
    $DiscographyCnASinglesLink = $BaseLink.$OverviewLink;
    $DiscographyCnASinglesLink =~ s/T0/T51/;
    print "Charts & Awards Billboard Singles: ".$DiscographyCnASinglesLink."\n";
    $DiscographyGrammyLink = $BaseLink.$OverviewLink;
    $DiscographyGrammyLink =~ s/T0/T52/;
    print "Charts & Awards Grammy Awards: ".$DiscographyGrammyLink."\n\n";

    # Extracting Title Bar
    $output =~ m/<td class="titlebar"><span class="title">(.*)<\/span><br \/>/;
    $titlebar = $1;
    print "Titlebar:\n".$titlebar."\n\n";
    $output = $';

    # Extracting Formed-Sub
    $output =~ m/Begin Formed(.*)<span>(.*)End Formed/;
    $output = $';
    $formedsub = $2;
    $formedsub =~ m/<a href=(.*)>(.*)<\/a>(.*)<a href=(.*)>(.*?)<\/a>/; # Parse $formedsub for exact string
    print "Formed: ".$2.$3.$5."\n\n";

    # Extracting timelinesubactive
    while($output =~ m/class="timeline-sub-active">(\d+)<\/div>/) {
      print "ActiveYear:".$1."\n";
      $output = $';
    }
    print "\n";

    # Extract Genre, Style titles
    $output =~ m/id="left-sidebar-title-small"(.*?)<\/tr>/;
    $suboutput = $&;
    $output = $';
    while($suboutput =~ m/id="left-sidebar-title-small"><span>(.*?)<\/span>/) {
      #~ print "Subclasses:".$1."\n";
      push(@GSM, $1); # Remember the title for the summary printed later
      $suboutput = $';
    }
    #~ print "\n";

    # Extract Genre contents
    $output =~ m/<td class="list-cell"(.*?)<\/td>/;
    $suboutput = $&;
    $output = $';
    while($suboutput =~ m/<li>(.*?)<\/li>/) {
      #~ print "Genres:".$1."\n";
      $suboutput = $';
      $1 =~ m/<a href=(.*)>(.*)<\/a>/;
      push(@G, $2); # Keep only the link text, i.e. the genre name
    }
    #~ print "\n";

    # Extract Style contents
    $output =~ m/<td class="list-cell"(.*?)<\/td>/;
    $suboutput = $&;
    $output = $';
    while($suboutput =~ m/<li>(.*?)<\/li>/) {
      #~ print "Styles:".$1."\n";
      $suboutput = $';
      $1 =~ m/<a href=(.*)>(.*)<\/a>/;
      push(@S, $2); # Keep only the link text, i.e. the style name
    }
    #~ print "\n";

    # Extract Mood subclass
    $output =~ m/id="left-sidebar-title-small"><span>(.*?)<\/span>/;
    $output = $';
    push(@GSM, $1); # Third title, printed before the moods later
    #~ print "Subclasses:".$1."\n\n";

    # Extract Mood Contents
    $output =~ m/id="left-sidebar-list"(.*?)<\/div>/;
    $suboutput = $&;
    $output = $';
    while($suboutput =~ m/<li>(.*?)<\/li>/) {
      #~ print "Moods:".$1."\n";
      $suboutput = $';
      $1 =~ m/<a href=(.*)>(.*)<\/a>/;
      push(@M, $2); # Keep only the link text, i.e. the mood name
    }
    print "\n";

    # Print the @GSM and @G,@S,@M content
    print $GSM[0].":";
    foreach $gen (@G) {
      print $gen."\t";
    }
    print "\n\n".$GSM[1].":";
    foreach $gen (@S) {
      print $gen."\t";
    }
    print "\n\n".$GSM[2].":";
    foreach $gen (@M) {
      print $gen."\t";
    }
    print "\n\n";

    # Extract AMG Artist ID
    $output =~ m/<td class="sub-text"(.*?)<\/pre>/;
    $output = $';
    $1 =~ m/<pre>(.*)/;
    print "AMG Artist ID:".$1."\n\n";

    # Extracting Artist Mini Bio
    $output =~ m/id="artistminibio"><p>(.*)<\/p>/;
    $artistminibio = $1;
    $artistminibio =~ s/<a href(.*?)>//g; # Filtering out any link or html tags
    $artistminibio =~ s/<\/a>//g;
    $artistminibio =~ s/<i>//g;
    $artistminibio =~ s/<\/i>//g;
    print "ArtistMiniBio:\n".$artistminibio."\n\n";

    # Extracting Other Entries, Group Members, Similar Artists, Influenced By and Follower
    $output =~ m/id="large-list"><tr>(.*?)<\/table>/;
    $suboutput = $&;
    $output = $';
    # Extracting two part of the table
    $suboutput =~ m/<td valign="top" width="266px">(.*)<\/td><td/;
    $lefthalftemp = $1;
    $righthalftemp = $';

    # Left half: each subtitle is followed by a <ul> holding its entries
    while($lefthalftemp =~ m/<div class="large-list-subtitle">(.*?)<\/div>/) {
      print $1.":\n";
      $' =~ m/<ul>(.*?)<\/ul>/;
      $lefthalftemp = $';
      $li = $1;
      while($li =~ m/<li>(.*?)<\/li>/) {
        $li = $';
        $1 =~ m/<span class="libg"><a href=(.*)>(.*)<\/a><\/span>/i;
        print $2."\n";
      }
      print "\n\n";
    }

    # Right half: same structure as the left half
    while($righthalftemp =~ m/<div class="large-list-subtitle">(.*?)<\/div>/) {
      print $1.":\n";
      $' =~ m/<ul>(.*?)<\/ul>/;
      $righthalftemp = $';
      $li = $1;
      while($li =~ m/<li>(.*?)<\/li>/) {
        $li = $';
        $1 =~ m/<span class="libg"><a href=(.*)>(.*)<\/a><\/span>/i;
        print $2."\n";
      }
      print "\n\n";
    }
  }
}

#~ Run the extraction for the band set at the top
getartistinfo($bandname);

Copy the above code into the SciTE Perl editor and press Ctrl+F7. You should see output like the following, which contains all the extracted data about the artist Metallica.

Output 3

Sending HTTP request for Metallica...
Got HTTP response... parsing output for Metallica...

Discography Main Album:
Discography Singles&EP:
Discography DVD Videos:
Songs All Songs:
Charts & Awards Billboard Albums:
Charts & Awards Billboard Singles:
Charts & Awards Grammy Awards:


Formed: 1981 in Los Angeles, CA



Styles:Thrash	Heavy Metal	Speed Metal	Hard Rock

Moods:Bitter	Suffocating	Fierce	Angry	Aggressive	Menacing
Gritty	Tense/Anxious	Hostile	Crunchy	Epic	Nihilistic	Fiery
Intense	Dramatic	Harsh	Ominous	Rebellious	Uncompromising
Searching	Gloomy

AMG Artist ID:P     4906

Metallica was easily the best, most influential heavy metal band of the '80s,
responsible for bringing the music back to Earth.
Instead of playing the usual rock star games of metal stars of the early '80s,
the band looked and talked like they were from the street.
Metallica expanded the limits of thrash, using speed and volume not for their own sake,
but to enhance their intricately structured compositions.
The release of 1983's Kill 'Em All marked the beginning of the legitimization
of heavy metal's underground, bringing new complexity and depth to thrash metal.
With each album, the band's playing and writing improved;
James Hetfield developed a signature rhythm playing that matched his growl,
while lead guitarist Kirk Hammett...

Other Entries:
Movie Entry
Classical Music Entry

Group Members:
Kirk Hammett
James Hetfield
Dave Mustaine
Jason Newsted
Lars Ulrich
Cliff Burton
Robert Trujillo
Ron McGovney

Similar Artists:
Machine Head
King Diamond
Mercyful Fate
Metal Church
Death Angel
Corrosion of Conformity
White Zombie
Rollins Band

See Also:
Flotsam & Jetsam
Rock Star Supernova

Influenced By:
The Misfits
Diamond Head
Black Sabbath
Judas Priest
Angel Witch
Iron Maiden
Deep Purple
Led Zeppelin
Ted Nugent
Lynyrd Skynyrd
Thin Lizzy

At War
The Beyond
Boy Hits Car
Queens of the Stone Age
Avenged Sevenfold
Scenes from a Movie
Sick City
Saving Abel

Performed Songs By:
James Hetfield
Lars Ulrich
Kirk Hammett
Cliff Burton
Bob Rock
Dave Mustaine
Brian Tatler
Sean Harris
Roger Taylor
"Fast" Eddie Clarke
Glenn Danzig
Jason Newsted
John Deacon
Brian May
Freddie Mercury
Lemmy Kilmister
Phil "Philthy Animal" Taylor
Burke Shelley

>Exit code: 0    Time: 5.940

Thus, on running the above script you get all the information about the artist Metallica from All Music’s Metallica page. For demonstration purposes I have only extracted information from Metallica’s main page; however, you can write similar code to extract information from Metallica’s other sub-pages on All Music.
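
As a starting point for those sub-pages, note that Example 3 builds each sub-page link by replacing the T0 code in the overview link. A rough sketch of the same idea driven by a list follows. The overview link value here is a made-up placeholder (the real one comes from the parsed page), and the names @subpages and %links are mine, for illustration only.

```perl
use strict;
use warnings;

# Hypothetical overview link; the real value is captured from the page
my $overview_link = 'cg/amg.dll?p=amg&searchlink=EXAMPLE~T0';

# Sub-page labels and codes as they appear in Example 3
my @subpages = (
  [ 'Discography Main Album'            => 'T20' ],
  [ 'Discography Singles&EP'            => 'T22' ],
  [ 'Discography DVD Videos'            => 'T23' ],
  [ 'Songs All Songs'                   => 'T31' ],
  [ 'Charts & Awards Billboard Albums'  => 'T50' ],
  [ 'Charts & Awards Billboard Singles' => 'T51' ],
  [ 'Charts & Awards Grammy Awards'     => 'T52' ],
);

my %links;
foreach my $page (@subpages) {
  my ($label, $code) = @$page;
  # Copy the overview link, then swap the first T0 for this page's code
  (my $link = $overview_link) =~ s/T0/$code/;
  $links{$label} = $link;
  print "$label: $link\n";
}
```

Each link in %links could then be fetched and parsed with its own set of regular expressions, the same way the main page is handled above.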

Meanwhile, if you are wondering how my Perl script extracts the artist information, what method I used to make sure only the relevant information is parsed from the page, or how I made all those regular expression matches, watch out for Part 2 of this blog. For now I leave it up to you to figure out how it is all done.
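
As a small hint until then, one pattern shows up again and again in Example 3: match once, take what you need from $1, then carry on from $' (the text after the match). Below is a rough sketch of that loop on a made-up list fragment, next to the /g modifier, which does the same walk in one statement. The fragment and the variable names are illustrative assumptions, not taken from the real page.

```perl
use strict;
use warnings;

# Made-up fragment, for illustration only
my $suboutput = '<ul><li>Thrash</li><li>Heavy Metal</li><li>Speed Metal</li></ul>';

# Style used in Example 3: match, keep $1, continue from the post-match text
my @styles;
my $rest = $suboutput;
while ($rest =~ m/<li>(.*?)<\/li>/) {
  push @styles, $1;
  $rest = $';   # drop everything up to and including this match
}

# Same walk with /g: in list context it returns every captured group
my @styles_g = $suboutput =~ m/<li>(.*?)<\/li>/g;

print join("\t", @styles), "\n";
print join("\t", @styles_g), "\n";
```

Both loops collect the same three items. The $' version has the side effect of shrinking the string as it goes, which is how Example 3 makes sure later matches only look at the part of the page that has not been parsed yet.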

Here are a few important links which will help you make crawlers similar to those of Altertunes, and also understand the methods I have used above.
1. Leeds University Perl page
2. Tizag
3. Finally, the documentation that comes with PXPerl is in itself a complete guide to everything.

I hope I helped a little in your quest to make crawlers.

In the next blog I will try to wrap up this section (I am tired of writing this one for now) 😉

All the best.