Importing data from a webpage

I'm writing a program that will go to a number of websites, extract some useful data, then display and print what it collects.

My program is at the point where it works if I go to the page and save the source as a text file. However, since the point of this program is to access the websites and get the data automatically, me going there and saving the source by hand doesn't help much.

Is there a way to automate the collection process?

Comments

  • You can save data from a URL using the [b]java.net.URL.openStream()[/b] method.
    http://java.sun.com/j2se/1.4.2/docs/api/java/net/URL.html
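A minimal sketch of the openStream() approach, for anyone following along. The class name PageFetcher, the helper names and the UTF-8 charset are assumptions made for illustration, not something from the question; the page address is passed on the command line.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;

class PageFetcher {

    // Copies everything from in to out; read() returns -1 at end of stream.
    static void copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[1024];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
    }

    // Downloads whatever the URL points at and returns it as text.
    // The charset is an assumption; a real page may declare a different one.
    static String fetch(URL url, String charset) throws IOException {
        InputStream in = url.openStream();
        try {
            ByteArrayOutputStream page = new ByteArrayOutputStream();
            copy(in, page);
            return page.toString(charset);
        } finally {
            in.close();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: PageFetcher <url>");
            return;
        }
        System.out.println(fetch(new URL(args[0]), "UTF-8"));
    }
}
```

The returned string can be handed to the same parsing code that currently reads the saved text file.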

    ---------------------------------
    [size=1]HOWTO ask questions: http://catb.org/~esr/faqs/smart-questions.html[/size]

  • Thanks, that helped a lot.

    One of the pages I need to get data from is generated from a script and won't post anything useful if I link directly to it. Is there a way I can connect to another page, and have my program 'click' the link to get to the desired page?
  • : One of the pages I need to get data from is generated from a script and won't post anything useful if I link directly to it. Is there a way I can connect to another page, and have my program 'click' the link to get to the desired page?
    :

    To be honest, I'm not 100% sure what the server is expecting or what the Java API will do. The most likely thing the server is expecting is a session cookie; I don't really know (I don't have a web server handy) whether the Java runtime keeps track of session cookies (try connecting to both pages in turn). You might be able to do something with the URLConnection. Alternatively, it may be expecting a method other than GET (POST or PUT); have a look at HttpURLConnection, you might be able to do something there.

    If it's a link, it's probably the former; if it's a form submit (through a button or JavaScript), it's probably the latter.

    If you get stuck, post again; if you get a solution, post it for others (and my own curiosity).
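If the session-cookie guess is right, something like the sketch below might be enough. It relies on java.net.CookieManager, which only exists from Java 6 onward (so not in the 1.4.2 API linked earlier). The class name SessionFetch and both URLs are hypothetical, and whether the real site uses a cookie at all is an assumption.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.HttpURLConnection;
import java.net.URL;

class SessionFetch {

    // Installs a JVM-wide in-memory cookie store. Once set, HttpURLConnection
    // saves cookies from responses and sends them back on later requests.
    static CookieManager enableCookies() {
        CookieManager manager = new CookieManager(null, CookiePolicy.ACCEPT_ALL);
        CookieHandler.setDefault(manager);
        return manager;
    }

    // Requests a page with GET, drains the body, and returns the HTTP status.
    static int visit(URL url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        try {
            int status = conn.getResponseCode();
            InputStream in = status < 400 ? conn.getInputStream() : conn.getErrorStream();
            if (in != null) {
                byte[] buffer = new byte[1024];
                while (in.read(buffer) != -1) {
                    // discard; only the cookies and the status matter here
                }
                in.close();
            }
            return status;
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("usage: SessionFetch <entry-url> <target-url>");
            return;
        }
        enableCookies();
        // Visit the entry page first so the server can hand out its cookie.
        System.out.println("entry page:  " + visit(new URL(args[0])));
        System.out.println("target page: " + visit(new URL(args[1])));
    }
}
```

If the second status is still an error, the server is probably expecting a form submit rather than a cookie.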


  • This is the form that leads to the site. I've cut everything away except for what's required. This is the first time I've programmed anything that accesses anything over the internet, so I'm totally clueless.



    var pgf_Site="Country";

    xxx

  • This isn't brilliant code, but hopefully you get the idea:
    [code]
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class URLTest {

        public static void main(String[] args) throws IOException {
            FileOutputStream fout = new FileOutputStream("result.txt");
            try {
                writeResult(fout);
            } finally {
                fout.close();
            }
        }

        // POSTs the form fields to the servlet and copies the response body to result.
        private static void writeResult(OutputStream result) throws IOException {
            HttpURLConnection urlConnection = null;
            InputStream in = null;
            OutputStream out = null;
            byte[] buffer = new byte[1024];

            try {
                URL url = new URL("http://localhost:7080/test/MyServlet");

                urlConnection = (HttpURLConnection) url.openConnection();
                urlConnection.setRequestMethod("POST");

                // Form fields as name=value pairs separated by '&'.
                String requestContent =
                    "TypeOfInquiryNumber=T"
                    + "&nums_displayed=5"
                    + "&HTMLVersion=5.0"
                    + "&AgreeToTermsAndConditions=yes"
                    + "&loc=en_US"
                    + "&sort_by=status"
                    + "&line1=DetailInfo"
                    + "&InquiryNumber1=123456789"
                    + "&NumberDetailLines=1"
                    + "&Requester=Home";
                byte[] bytes = requestContent.getBytes("UTF-8");

                urlConnection.setDoInput(true);
                urlConnection.setDoOutput(true);
                urlConnection.setRequestProperty(
                    "Content-Type",
                    "application/x-www-form-urlencoded");
                urlConnection.setRequestProperty(
                    "Content-Length",
                    String.valueOf(bytes.length));

                urlConnection.connect();

                // Send the request body; closing the stream flushes it.
                out = urlConnection.getOutputStream();
                out.write(bytes);
                out.close();
                out = null;

                // Copy the response; read() returns -1 at end of stream.
                in = urlConnection.getInputStream();
                while (true) {
                    int r = in.read(buffer);
                    if (r < 0)
                        break;
                    result.write(buffer, 0, r);
                }
            } finally {
                if (in != null) {
                    in.close();
                }
                if (out != null) {
                    out.close();
                }
                if (urlConnection != null) {
                    urlConnection.disconnect();
                }
            }
        }

    }
    [/code]
    With the HTTP "POST" method, the client sends the headers, followed by a blank line, then the body of data. By default, HTML forms submit data of MIME type "application/x-www-form-urlencoded". I recommend looking at how HTTP works. You'll be surprised at how simple it is; it wouldn't take much work to do the above with a regular network socket.
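To show how little is involved, here is a rough sketch of the same kind of POST over a plain java.net.Socket. The class name RawPost, the shortened form body and the choice of HTTP/1.0 (so the server closes the connection when the response is done) are assumptions for illustration; a real server may want more headers than this.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;

class RawPost {

    // Builds the bytes of a POST request: request line, headers, blank line, body.
    static byte[] buildRequest(String host, String path, String formBody) throws IOException {
        byte[] body = formBody.getBytes("UTF-8");
        String head =
            "POST " + path + " HTTP/1.0\r\n"
            + "Host: " + host + "\r\n"
            + "Content-Type: application/x-www-form-urlencoded\r\n"
            + "Content-Length: " + body.length + "\r\n"
            + "\r\n"; // the empty line that separates the headers from the body
        byte[] headBytes = head.getBytes("US-ASCII");
        byte[] request = new byte[headBytes.length + body.length];
        System.arraycopy(headBytes, 0, request, 0, headBytes.length);
        System.arraycopy(body, 0, request, headBytes.length, body.length);
        return request;
    }

    // Sends the request and copies the raw response (status line, headers, body) to out.
    static void send(String host, int port, byte[] request, OutputStream out) throws IOException {
        Socket socket = new Socket(host, port);
        try {
            OutputStream toServer = socket.getOutputStream();
            toServer.write(request);
            toServer.flush();
            InputStream fromServer = socket.getInputStream();
            byte[] buffer = new byte[1024];
            int n;
            while ((n = fromServer.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        } finally {
            socket.close();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 3) {
            System.out.println("usage: RawPost <host> <port> <path>");
            return;
        }
        // Shortened, hypothetical form body; the field names come from the code above.
        byte[] request = buildRequest(args[0], args[2], "InquiryNumber1=123456789&Requester=Home");
        send(args[0], Integer.parseInt(args[1]), request, System.out);
    }
}
```

Unlike HttpURLConnection, this prints the response headers too, which is handy when comparing against what a browser gets back.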


  • I tried pulling in your code directly, modifying only the input data and URL. It didn't work, but it did get an error page sent in response. I took this as a sign that something was getting through. I tried modifying a few things but realized that I was "blind" as to what was going on. Then I used a packet sniffer so I could see what the normal web browser request looked like in comparison to mine. After that it was only a matter of altering my request to match theirs, adding in an input for the x and y of the click and some other minor things. It worked.

    Thanks a lot for all the help. Problem is, now I've got to clean this up and finish the rest of the program.
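A packet sniffer shows the request as it goes over the wire. From inside the program you can at least see the server's side of the exchange, which may help with the "blind" feeling. This is only a sketch; the class name ResponseDump and the output format are made up for illustration.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.List;
import java.util.Map;

class ResponseDump {

    // Formats a status code and a header map as text, one header value per line.
    static String describe(int status, Map<String, List<String>> headers) {
        StringBuilder text = new StringBuilder("status: " + status + "\n");
        for (Map.Entry<String, List<String>> header : headers.entrySet()) {
            if (header.getKey() == null) {
                continue; // HttpURLConnection files the status line under a null key
            }
            for (String value : header.getValue()) {
                text.append(header.getKey()).append(": ").append(value).append("\n");
            }
        }
        return text.toString();
    }

    // Reads the status and response headers from a connection.
    static String dump(HttpURLConnection conn) throws IOException {
        return describe(conn.getResponseCode(), conn.getHeaderFields());
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: ResponseDump <url>");
            return;
        }
        HttpURLConnection conn = (HttpURLConnection) new URL(args[0]).openConnection();
        try {
            System.out.print(dump(conn));
        } finally {
            conn.disconnect();
        }
    }
}
```

A Set-Cookie header or a redirect status in this output is a hint about what the server expects on the next request.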
  • :input for the x and y of the click and some other minor things. It worked.

    Doh! Forgot that an image input returns the click coordinates (surprised the server cared). Glad it worked; check out servlets if you get the time; believe me, you're halfway there, and it'll help you debug future problems. You can get a free servlet container (Tomcat) from http://jakarta.apache.org or a more complete free J2EE dev environment in the form of the Java Web Services Kit from http://java.sun.com .
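For the click coordinates: when a form is submitted through an image button, the browser sends two extra fields, name.x and name.y (or just x and y if the button has no name). A small sketch of adding them to a form body follows; the class name FormBody, the button name "go" and the coordinate values are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

class FormBody {

    // Joins name=value pairs with '&', URL-encoding both names and values.
    static String encode(Map<String, String> fields) throws UnsupportedEncodingException {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> field : fields.entrySet()) {
            if (body.length() > 0) {
                body.append('&');
            }
            body.append(URLEncoder.encode(field.getKey(), "UTF-8"))
                .append('=')
                .append(URLEncoder.encode(field.getValue(), "UTF-8"));
        }
        return body.toString();
    }

    // Adds the coordinate fields a browser sends for <input type="image">.
    static void addImageClick(Map<String, String> fields, String buttonName, int x, int y) {
        String prefix = buttonName.isEmpty() ? "" : buttonName + ".";
        fields.put(prefix + "x", String.valueOf(x));
        fields.put(prefix + "y", String.valueOf(y));
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        Map<String, String> fields = new LinkedHashMap<String, String>();
        fields.put("InquiryNumber1", "123456789");
        addImageClick(fields, "go", 12, 7); // hypothetical button name and click position
        System.out.println(encode(fields)); // prints InquiryNumber1=123456789&go.x=12&go.y=7
    }
}
```

The encoded string can be used as the request body in place of a hand-built one.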

    Which packet sniffer did you use? I wrote my own proxy server to do the same job, but you get different headers since it's at the application layer.


  • I used sniff'em, for no other reason than that it was the first Google result that looked promising.

    Thank you again for all the help.