Perl

HELP: parsing unicode web sites Posted by andrewwan1980 on 31 Jul 2008 at 6:56 AM
I need help parsing Unicode web pages and downloading JPEG image files with Perl scripts.

I read http://www.cs.utk.edu/cs594ipm/perl/crawltut.html about using the LWP and HTTP libraries and the get($url) function. But the content returned is always garbled. When I use get($url) on a non-Unicode web page, the content comes back as clean ASCII.

But now I want to parse http://www.tom365.com/movie_2004/html/5507.html, and the page I get back is garbled. I have read about Encode but don't know how to use it.

I need a Perl script that parses the above page and extracts the image URL from markup matching this pattern:

<div class="movie"><img src="http://pic.tom365.com/imgs/tongjifan.jpg" class="mp" />

If anyone knows how to parse Unicode web pages like this, I'd be very grateful.

Thank you
Re: HELP: parsing unicode web sites Posted by Jonathan on 31 Jul 2008 at 8:28 AM
Hi,

There's a Unicode advice page here:
http://juerd.nl/site.plp/perluniadvice

It notes that recent versions of LWP are Unicode-aware. Are you running a fairly recent version of the module and/or a recent version of Perl? It also suggests looking at HTTP::Response::Charset.
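
As a rough sketch of what "Unicode-aware" means in practice: decoded_content() on an HTTP::Response uses the charset declared in the Content-Type header to turn the raw bytes into Perl characters. The response below is built by hand so it runs without a network connection; the UTF-8 charset and the sample text are assumptions for illustration, and the real page may well declare a different charset.

```perl
use strict;
use warnings;
use Encode qw(encode);
use HTTP::Response;

# Two CJK characters, written as code points so this source stays ASCII.
my $text  = "\x{7535}\x{5F71}";
my $bytes = encode('UTF-8', $text);    # the bytes that would travel over the wire

# Hand-built response standing in for what a real LWP request would return.
my $response = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html; charset=UTF-8' ],
    $bytes,
);

my $raw     = $response->content;            # undecoded bytes
my $decoded = $response->decoded_content;    # Perl character string

printf "raw length: %d bytes, decoded length: %d characters\n",
    length $raw, length $decoded;
```

If you print the decoded string to a terminal, you will probably also want something like binmode(STDOUT, ':encoding(UTF-8)') first, otherwise Perl warns about wide characters.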

Thanks,

Jonathan
###
for(74,117,115,116){$::a.=chr};(($_.='qwertyui')&&
(tr/yuiqwert/her anot/))for($::b);for($::c){$_.=$^X;
/(p.{2}l)/;$_=$1}$::b=~/(..)$/;print("$::a$::b $::c hack$1.");
Re: HELP: parsing unicode web sites Posted by andrewwan1980 on 4 Aug 2008 at 3:44 AM
Thanks to those who helped. Here's my working script:

#!/usr/bin/perl
# tom365crawl2.pl
# http://www.cs.utk.edu/cs594ipm/perl/crawltut.html
# http://perldoc.perl.org/Encode.html
# http://juerd.nl/site.plp/perluniadvice
# http://www.perlmonks.org/?node_id=620068

use warnings;
use strict;

use File::stat;
use Tie::File;

use LWP::Simple;
use LWP::UserAgent;
use HTTP::Request;
use HTTP::Response;
use HTML::LinkExtor; # Allows you to extract the links off of an HTML page.
#use File::Slurp;

use Encode;

my $site1 = "http://www.tom365.com/"; # Full url like http://www.tom365.com/movie_2004/html/????.html
my $delim1a = "\<div class=\"movie\"\>\<img src=\"";
my $delim1b = "\" class=\"mp\" \/\>";
my $folder1 = "movie_2004/html/";
my $url1;
my $start1 = 1000;
my $end1 = 1000;
my $contents1;
my $image1;

my $browser1 = LWP::UserAgent->new();
$browser1->timeout(10);
my $request1;
my $response1;

my $count;
for ($count=$start1; $count<=$end1; $count++) {
  $url1 = $site1 . $folder1 . $count . ".html";
  printf "Downloading %s\n", $url1;

  # Method 1
  #$contents1 = get($url1);

  # Method 2
  $request1 = HTTP::Request->new(GET => $url1);
  $response1 = $browser1->request($request1);
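  if ($response1->is_error()) {
    printf "%s\n", $response1->status_line;
    next;    # skip pages that failed to download
  }
  $contents1 = $response1->decoded_content();
  next unless defined $contents1;    # decoding can fail and return undef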

  #open(NEWFILE1, "> Debug.txt");
  #(print NEWFILE1 $contents1)    or die "Can't write to Debug.txt: $!";
  #close(NEWFILE1);

  #print $contents1;

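  # [^"]+ stops at the closing quote, so the match cannot run past the src attribute
  if ($contents1 =~ m{<div class="movie"><img src="([^"]+)" class="mp" />}) {
    $image1 = $1;
    printf "Downloading %s\n", $image1;
    # List form of system() bypasses the shell, so odd characters in the URL are safe
    system('wget', '-q', '-O', "$count.jpg", $image1) == 0
      or warn "wget failed for $image1\n";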

    #if ($image1 =~ /\/([^\/]*)$/m) {
    #  printf "Renaming %s to $count.jpg\n", $1;
    #} else {
    #  printf "Could not rename %s to $count.jpg\n", $image1;
    #}
  } else {
    #open(NEWFILE1, "> $count.txt");
    #(print NEWFILE1 "Download failed.\n")    or die "Can't write to $image1: $!";
    #close(NEWFILE1);
  }
}
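
For anyone reusing this, here is a small sketch of the extraction step on its own, so it can be tried without hitting the site. The helper name extract_image_url is hypothetical (it is not in the script above), and the sample HTML is the line quoted in the original question.

```perl
use strict;
use warnings;

# Hypothetical helper: return the image URL from the movie <div>,
# or undef when the page does not contain the expected markup.
sub extract_image_url {
    my ($html) = @_;
    return $html =~ m{<div class="movie"><img src="([^"]+)" class="mp" />}
        ? $1
        : undef;
}

# Sample markup taken from the question; no network access needed.
my $sample = '<div class="movie"><img src="http://pic.tom365.com/imgs/tongjifan.jpg" class="mp" />';

my $url = extract_image_url($sample);
print defined $url ? "Found image: $url\n" : "No image found\n";
```

If you would rather not depend on wget, getstore($url, $file) from LWP::Simple is one option for saving the image; it returns the HTTP status code, which you can check with is_success().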




 
