Re: HELP: parsing unicode web sites Posted by andrewwan1980 on 4 Aug 2008 at 3:44 AM
Thanks to those who helped. Here's my working script:

#!/usr/bin/perl
# tom365crawl2.pl
# http://www.cs.utk.edu/cs594ipm/perl/crawltut.html
# http://perldoc.perl.org/Encode.html
# http://juerd.nl/site.plp/perluniadvice
# http://www.perlmonks.org/?node_id=620068

use warnings;
use strict;

use File::stat;
use Tie::File;

use LWP::Simple;
use LWP::UserAgent;
use HTTP::Request;
use HTTP::Response;
use HTML::LinkExtor; # Allows you to extract the links off of an HTML page.
#use File::Slurp;

use Encode;

my $site1 = "http://www.tom365.com/"; # Full url like http://www.tom365.com/movie_2004/html/????.html
my $delim1a = "\<div class=\"movie\"\>\<img src=\"";
my $delim1b = "\" class=\"mp\" \/\>";
my $folder1 = "movie_2004/html/";
my $url1;
my $start1 = 1000;
my $end1 = 1000;
my $contents1;
my $image1;

my $browser1 = LWP::UserAgent->new();
$browser1->timeout(10);
my $request1;
my $response1;

for my $count ($start1 .. $end1) {
  $url1 = $site1 . $folder1 . $count . ".html";
  printf "Downloading %s\n", $url1;

  # Method 1
  #$contents1 = get($url1);

  # Method 2
  $request1 = HTTP::Request->new(GET => $url1);
  $response1 = $browser1->request($request1);
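  if ($response1->is_error()) {
    printf "%s\n", $response1->status_line;
    next; # Skip this page; an error response has no content worth parsing.
  }
  $contents1 = $response1->decoded_content();
  next unless defined $contents1; # decoded_content() returns undef if decoding fails.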

  #open(NEWFILE1, "> Debug.txt");
  #(print NEWFILE1 $contents1)    or die "Can't write to Debug.txt: $!";
  #close(NEWFILE1);

  #print $contents1;

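  # [^"]+ stops at the closing quote of src; a greedy (.*) could run on to a later tag.
  if ($contents1 =~ /<div class="movie"><img src="([^"]+)" class="mp" \/>/) {
    $image1 = $1;
    printf "Downloading %s\n", $image1;
    # List-form system() bypasses the shell, so special characters in the URL are harmless.
    system("wget", "-q", "-O", "$count.jpg", $image1) == 0
      or warn "wget failed for $image1\n";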

    #if ($image1 =~ /\/([^\/]*)$/m) {
    #  printf "Renaming %s to $count.jpg\n", $1;
    #} else {
    #  printf "Could not rename %s to $count.jpg\n", $image1;
    #}
  } else {
    #open(NEWFILE1, "> $count.txt");
    #(print NEWFILE1 "Download failed.\n")    or die "Can't write to $image1: $!";
    #close(NEWFILE1);
  }
}
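
For anyone adapting this: below is a minimal sketch of just the matching step, pulled out into a sub so it can be tried without hitting the network. The sub name extract_image_url and the example.com sample page are made up for illustration; the markup pattern is the one the script above looks for, and it will only work while the site keeps that exact markup.

```perl
use strict;
use warnings;

# Return the src of the <img class="mp"> that directly follows
# <div class="movie">, or nothing when the page has no such tag.
# (Hypothetical helper, not part of the original script.)
sub extract_image_url {
    my ($html) = @_;
    return unless defined $html;
    if ($html =~ m{<div class="movie"><img src="([^"]+)" class="mp" />}) {
        return $1;
    }
    return;
}

# Hypothetical sample page; the example.com URLs are placeholders.
my $sample = join "\n",
    '<html><body>',
    '<div class="movie"><img src="http://example.com/a.jpg" class="mp" /></div>',
    '<div class="other"><img src="http://example.com/b.jpg" class="mp" /></div>',
    '</body></html>';

my $url = extract_image_url($sample);
print defined $url ? "Found $url\n" : "No image found\n";
```

Keeping the match in its own sub means a change in the site's markup only has to be fixed in one place, and the download loop can simply skip pages where nothing is returned.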

Thread Tree
andrewwan1980 HELP: parsing unicode web sites on 31 Jul 2008 at 6:56 AM
Jonathan Re: HELP: parsing unicode web sites on 31 Jul 2008 at 8:28 AM
andrewwan1980 Re: HELP: parsing unicode web sites on 4 Aug 2008 at 3:44 AM



 
