dongxichan8627 2011-05-31 10:39

Java: HtmlUnit efficiency compared with PHP cURL?

I have a spider class that, on a user request, crawls websites for content. Each search loads about 30 websites, extracts the information from them, and then standardizes it.

I have written this in PHP using cURL. Since PHP lacks multitasking, I would like to switch to Java (I am aware of multi-process cURL, which does not suit my needs). I need an HTTP client that can POST/GET, receive and set cookies, and modify HTTP headers.

I have found HtmlUnit, which seems nifty but also exceeds my needs. Since the package is relatively big and I will have many hundreds of requests a minute, I don't want an overkill solution slowing down my servers.

Do you think this would be an issue and do you have other suggestions to replace CURL in Java? Should I use the Java CURL binding? This is a question of efficiency and server load.

3 answers

  • douyue8191 2011-05-31 11:30

    Perhaps take a look at Apache HttpClient?

    You can create an HttpClient per thread and use that to do your requests:

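    // Uses the Apache HttpClient 4.x classes (org.apache.http.* packages).
    while (running) {
        HttpClient client = new DefaultHttpClient();
        // The URL needs an explicit scheme, otherwise execute() fails.
        HttpGet get = new HttpGet("http://mydomain.com/path.html");
        HttpResponse response = client.execute(get);
        // do stuff with response
    }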
    

    Even better, if you reuse the HttpClient between requests, it will remember the cookies sent back on previous responses and automatically apply them to your next request. In that sense, a single HttpClient models an HTTP conversation.

    So if you did

    client.execute(get1);
    // cookies are received in the first response
    client.execute(get2);
    // the second GET sends back the cookies received in the first response
    

    You could then take a look at Java's ExecutorService, which makes it easy to queue spider jobs and run them on multiple threads.
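To make the ExecutorService suggestion concrete, here is a minimal sketch of a fixed-size pool that runs one job per URL. `SpiderPool`, `fetchAll`, and the `fetcher` parameter are hypothetical names introduced for illustration. `fetcher` stands in for whatever actually performs the HTTP request (for example, a call into HttpClient), and the pool size is an arbitrary choice you would tune for your server.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper, not part of HttpClient or of the answer above.
final class SpiderPool {

    // Runs one fetch job per URL on a fixed-size thread pool and returns
    // the results in the same order as the input URLs.
    static List<String> fetchAll(List<String> urls,
                                 Function<String, String> fetcher,
                                 int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<String>> jobs = new ArrayList<>();
            for (String url : urls) {
                jobs.add(() -> fetcher.apply(url));
            }
            // invokeAll blocks until every job has finished.
            List<String> results = new ArrayList<>();
            for (Future<String> done : pool.invokeAll(jobs)) {
                // get() rethrows a job's failure as an ExecutionException.
                results.add(done.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

A call such as `SpiderPool.fetchAll(urls, url -> download(url), 10)` would then spider the roughly 30 sites of one search in parallel, where `download` is your own (assumed) method that wraps the HTTP request and returns the page body.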

    This answer was selected by the asker as the best answer.