dongxichan8627 2011-05-31 10:39

Java: HtmlUnit efficiency compared with PHP cURL?

I have a spider class that, on a user request, spiders websites for content. Each search loads about 30 websites, spiders them for information, and then standardizes that information.

I have written this in PHP using cURL. Since PHP lacks multitasking, I would like to switch to Java (I am aware of multi-process cURL, which does not suit my needs). I need an HTTP client that can POST/GET, receive and set cookies, and modify HTTP headers.

I have found HtmlUnit, which seems nifty but also exceeds my needs. Since the package is relatively big and I will have many hundreds of requests a minute, I don't want an overkill solution slowing down my servers.

Do you think this would be an issue, and do you have other suggestions to replace cURL in Java? Should I use the Java cURL binding? This is a question of efficiency and server load.


3 answers

  • douyue8191 2011-05-31 11:30

    Perhaps take a look at Apache HttpClient?

    You can create an HttpClient per thread and use it to make your requests:

    // Apache HttpClient 4.x classes (org.apache.http.*)
    while (running) {
        HttpClient client = new DefaultHttpClient();
        HttpGet get = new HttpGet("");  // URL of the page to fetch
        HttpResponse response = client.execute(get);
        // do stuff with response
    }

    Even better, if you reuse the HttpClient between requests, it will remember the cookies sent back in previous responses and automatically apply them to your next request. In that sense, a single HttpClient models an HTTP conversation.

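    So if you did

      HttpResponse response1 = client.execute(GET1);
      // cookies received in response1 are stored by the client
      HttpResponse response2 = client.execute(GET2);
      // the second GET sends back the cookies received in the GET1 response.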
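
    To see the "one client, one conversation" behaviour without pulling in any jars, here is a minimal sketch that swaps Apache HttpClient for the JDK's own java.net.http.HttpClient (Java 11+) with a CookieManager attached. The class name, the throwaway local server, the cookie value and the User-Agent string are all made up for the demo; treat it as an illustration of the idea, not as the Apache API.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.CookieManager;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class CookieConversation {

    /** Starts a throwaway local server (demo only) that sets a cookie
     *  and echoes back whatever Cookie header it received. */
    public static HttpServer startEchoServer() throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/", exchange -> {
            String received = exchange.getRequestHeaders().getFirst("Cookie");
            exchange.getResponseHeaders().add("Set-Cookie", "session=abc123; Path=/");
            byte[] body = ("cookie=" + received).getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        return server;
    }

    /** Sends two GETs through ONE client and returns the body of the second. */
    public static String secondResponse(URI uri) throws Exception {
        // The CookieManager stores cookies from responses and replays them.
        HttpClient client = HttpClient.newBuilder()
                .cookieHandler(new CookieManager())
                .build();
        HttpRequest get = HttpRequest.newBuilder(uri)
                .header("User-Agent", "spider-demo") // custom request header
                .GET()
                .build();
        client.send(get, HttpResponse.BodyHandlers.ofString()); // server sets the cookie
        return client.send(get, HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) throws Exception {
        HttpServer server = startEchoServer();
        try {
            URI uri = URI.create("http://127.0.0.1:" + server.getAddress().getPort() + "/");
            System.out.println(secondResponse(uri));
        } finally {
            server.stop(0);
        }
    }
}
```

    The second response should report the session cookie, because the same client object carried it over from the first exchange. Creating a new client per request, as in the loop further up, would lose it.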

    You could then take a look at Java's ExecutorService, which makes it easy to submit spider jobs and run them on multiple threads.
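
    A rough sketch of the ExecutorService idea, assuming each spider job boils down to "URL in, page text out". The class name SpiderPool, the fetchAll helper and the stub fetcher are hypothetical; in a real spider the fetcher would issue the HTTP request.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class SpiderPool {

    /**
     * Runs one fetch job per URL on a fixed-size thread pool and returns
     * the results in the same order as the input URLs.
     */
    public static List<String> fetchAll(List<String> urls,
                                        Function<String, String> fetcher,
                                        int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<String>> jobs = new ArrayList<>();
            for (String url : urls) {
                jobs.add(() -> fetcher.apply(url));
            }
            List<String> pages = new ArrayList<>();
            // invokeAll blocks until every job has finished
            for (Future<String> future : pool.invokeAll(jobs)) {
                pages.add(future.get()); // throws ExecutionException if the job failed
            }
            return pages;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stub fetcher for illustration only; no network access happens here.
        List<String> pages = fetchAll(
                List.of("http://a.example/", "http://b.example/"),
                url -> "<html>" + url + "</html>",
                4);
        pages.forEach(System.out::println);
    }
}
```

    With about 30 sites per search, a pool of a few dozen threads lets the slow network waits overlap instead of running back to back. The right pool size is something to measure on your own servers.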

    Accepted answer (chosen by the asker as the best answer)
  • douchensou6495 2011-05-31 11:39

    Ultimately you will need to evaluate potential solutions to see what best suits your need.

    HtmlUnit offers a rich API for parsing web pages and for finding and evaluating elements on the page.

    A simpler solution would be to use HttpClient directly (HtmlUnit uses it under the hood). This downloads the entire page and returns it as an InputStream or String. You can then use regular expressions to find links etc., probably much like you are doing currently with cURL.
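
    As a sketch of the regular-expression step, assuming the page has already been downloaded into a String: the class name LinkExtractor and the pattern below are illustrative, and a regex like this only copes with straightforward quoted href attributes, not with every shape of real-world HTML.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractor {

    // Matches href="..." or href='...' inside an <a ...> tag; group 2 is the URL.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the href values of all anchor tags, in document order. */
    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher matcher = HREF.matcher(html);
        while (matcher.find()) {
            links.add(matcher.group(2));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"http://example.com/a\">A</a> "
                + "<A class='x' HREF='/b'>B</A></p>";
        System.out.println(extractLinks(html)); // prints [http://example.com/a, /b]
    }
}
```

    Relative links such as /b still need resolving against the page URL (for example with java.net.URI.resolve) before they can be fetched.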

  • dongyange1101 2011-05-31 17:11

    Try a simple and efficient solution when you don't need JavaScript.



