Perl


Extract comments from blog Posted by omar1968 on 22 Feb 2010 at 8:30 AM
Hello,
I need a Perl script that extracts just the comments from a blog. Thanks for any help.
Re: Extract comments from blog Posted by Trizen on 2 Mar 2011 at 7:29 AM
Here is an example for googledocs.blogspot.com.
It may also work for other blogs hosted on blogspot.com.

#!/usr/bin/perl

use strict;
use warnings;

use LWP::UserAgent;
use Term::ANSIColor;
use HTML::Entities;
use HTML::Strip;

binmode(STDOUT, ':encoding(UTF-8)');  # decoded pages may contain wide characters

my $url = 'http://googledocs.blogspot.com/';  # replace here with another URL

my $hs  = HTML::Strip->new;
my $lwp = LWP::UserAgent->new;

$lwp->agent('Mozilla/5.0 (X11; Linux i686) AppleWebKit/534.23 (KHTML, like Gecko) Chrome/11.0.686.1 Safari/534.23');
$lwp->timeout(10);
$lwp->env_proxy;

# Fetch the blog's front page.
my $response = $lwp->get($url);
die "Cannot fetch $url: ", $response->status_line, "\n" unless $response->is_success;

# Each post title sits in its own <h3> block.
my @chunks = split(/<h3>/, $response->decoded_content);  # on some other blogs try /<h1>/

foreach my $chunk (@chunks) {
    # A post block starts with a link to the post, followed by its title.
    next unless $chunk =~ /^\s*<a href='(https?:[^']+)'>([^<]+)/;
    my ($post_url, $title) = ($1, decode_entities($2));

    my $post = $lwp->get($post_url);
    next unless $post->is_success;

    # Capture everything between the "comments:" line and the last "Post a Comment".
    my $comments = '';
    if ($post->decoded_content =~ /\n\s*comments:\s*\n(.+)Post a Comment/s) {
        $comments = decode_entities($1);
    }

    my $clean_text = $hs->parse($comments);
    $hs->eof;                          # reset the parser before the next post
    $clean_text =~ s/\n{3,}/\n\n/g;    # collapse runs of blank lines

    print color('bold red');
    print "\n\n=>> $title\n";
    print color('reset');
    print $clean_text;
}
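
If you want to try the extraction step without hitting the network, here is a minimal sketch that uses only core Perl. It assumes the page has a line containing just "comments:" and a later "Post a Comment" marker, which is what the script above looks for. The helper name extract_comments and the sample HTML are made up for illustration, and the simple tag-removal regex is only a rough stand-in for HTML::Strip.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the text between a "comments:" line and the
# last "Post a Comment" marker, or an empty string if the markers are absent.
sub extract_comments {
    my ($html) = @_;
    return '' unless $html =~ /\n\s*comments:\s*\n(.+)Post a Comment/s;
    my $text = $1;
    $text =~ s/<[^>]*>//g;       # crude tag removal (stand-in for HTML::Strip)
    $text =~ s/\n{3,}/\n\n/g;    # collapse runs of blank lines
    return $text;
}

# Made-up sample fragment, not copied from a real blog.
my $sample = "<h4>\n2\ncomments:\n</h4>\n<p>First!</p>\n\n\n\n"
           . "<p>Nice post.</p>\nPost a Comment\n";

print extract_comments($sample);
```

The /s modifier lets the dot match newlines, so the capture can span the whole comment section. Real blog templates vary, so the two markers will probably need adjusting for other sites.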




 
