tianchengbo1986 2011-05-31 12:57
627 views
Answer accepted

How can I crawl a large number of web pages with Java multithreading?

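Hi everyone. I have more than 50,000 URLs that I need to download one by one. They are all hosted overseas, so the downloads are very slow. I would like to fetch the pages with multiple threads, but my code does not run properly. Any pointers would be much appreciated. Here is my code: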
[code="java"]
import ......

public class Down2011CaseMeshTread extends Thread {

public static int count = 0;  
public static List<String> docDoiList = getCaseDoiList2();  
private static URL url;    
private static String doi;  
static int BUFFER_SIZE = 1024*10;  

public Down2011CaseMeshTread(String doi) throws MalformedURLException{  
    String urlStr = "http://www.iteye.com/" + doi;  
    this.url = new URL(urlStr);  
    this.doi = doi;  
}  

public static Connection getConnection() throws Exception {  
       // connects to the database
  }  

public static String getDocNameByDoi(String docDoi){  
       // returns the file name for docDoi, i.e. the name the downloaded file is saved under
}  

public void Test() throws IOException, InterruptedException{  

    StringBuffer sb = null;  
    BufferedReader in = null;  
    BufferedWriter out = null;  
    try {  
         sb = new StringBuffer();  
         int ch =0;  
         URLConnection conn = (HttpURLConnection)url.openConnection();  
         conn.setRequestProperty("User-Agent", "Mozilla/4.76");  
         conn.setDoOutput(true);  
         conn.setConnectTimeout(1000*60*10);  
         in = new BufferedReader(new InputStreamReader(url.openStream()));  
         FileOutputStream fo = new FileOutputStream("/home/" + getDocNameByDoi(doi));  
         OutputStreamWriter writer = new OutputStreamWriter(fo, "utf-8");  
         out = new BufferedWriter(writer);  
         while (!in.ready())  
         {  
              Thread.sleep(500); // wait for stream to be ready.  
         }  
         char[] buffer = new char[BUFFER_SIZE];  
         int charsRead = 0;  
         while ( (charsRead = in.read(buffer, 0, BUFFER_SIZE)) != -1 ) {  
              out.write(buffer, 0, charsRead);  
         }  
     out.close();  
     in.close();  
    }catch(Exception e){  
        e.printStackTrace();  
    }  
}  

public void run(){  
    try {  
        Test();  
        } catch (IOException e) {  
        e.printStackTrace();  
    } catch (InterruptedException e) {  
        e.printStackTrace();  
    }  
}  
    // The problem is probably here in run(), but I don't know how to fix it.

public static List<String> getCaseDoiList2(){  
       // returns the list of URLs, a little over 50,000 in total
}  

public static void main(String args[]) throws MalformedURLException{  
    for(String docDoi : docDoiList){  
            Down2011CaseMeshTread down = new Down2011CaseMeshTread(docDoi);  
            down.start();  
    }  
}  

}

[/code]

That is the situation.
How should I change this? Thanks a lot!

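The problem is that the program starts out very fast, downloading about two or three hundred pages a minute, but after roughly 800 downloads it begins to throw errors.
First it hits a "Premature EOF" exception while reading,
although at this point the downloads still continue intermittently: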
[code]
java.io.IOException: Premature EOF
at sun.net.www.http.ChunkedInputStream.fastRead(ChunkedInputStream.java:252)
at sun.net.www.http.ChunkedInputStream.read(ChunkedInputStream.java:680)
at java.io.FilterInputStream.read(FilterInputStream.java:133)
at sun.net.www.protocol.http.HttpURLConnection$HttpInputStream.read(HttpURLConnection.java:2582)
at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:282)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:324)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:176)
at java.io.InputStreamReader.read(InputStreamReader.java:184)
at java.io.BufferedReader.read1(BufferedReader.java:202)
at java.io.BufferedReader.read(BufferedReader.java:278)
at org.Down2011CaseMeshTread.Test(Down2011CaseMeshTread.java:138)
at org.Down2011CaseMeshTread.run(Down2011CaseMeshTread.java:150)
[/code]

Then connection timeout errors appear, and at that point all downloads stop:
[code]
java.net.ConnectException: Connection timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:327)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:193)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:180)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:384)
at java.net.Socket.connect(Socket.java:546)
at java.net.Socket.connect(Socket.java:495)
at sun.net.NetworkClient.doConnect(NetworkClient.java:178)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:409)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:530)
at sun.net.www.http.HttpClient.<init>(HttpClient.java:240)
at sun.net.www.http.HttpClient.New(HttpClient.java:321)
at sun.net.www.http.HttpClient.New(HttpClient.java:338)
at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:935)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:876)
at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:801)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1139)
at java.net.URL.openStream(URL.java:1029)
at org.Down2011CaseMeshTread.Test(Down2011CaseMeshTread.java:127)
at org.Down2011CaseMeshTread.run(Down2011CaseMeshTread.java:150)
[/code]

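This one is a connection error.
My guess is that, because I never limit the number of threads, things break when a large number of threads request connections at the same time,
but I don't know how to fix it yet.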


1 answer

  • iteye_16885 2011-05-31 14:20

    Could this error be because the Thread.sleep(500); // wait for stream to be ready. call waits too briefly? Try setting it to 1000*60.
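
A side note on the wait loop: Reader.read() already blocks until data arrives, the stream ends, or an error occurs, so polling in.ready() with a sleep may not be needed at all, and on an empty response ready() can stay false indefinitely. The following is only a minimal sketch of a copy loop without the polling step. The class name BlockingCopy and the in-memory streams in main are assumptions made for illustration; in the question's code the reader would wrap the HTTP response stream.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;

public class BlockingCopy {

    // Same buffer size as in the question's code.
    static final int BUFFER_SIZE = 1024 * 10;

    /** Copies everything from in to out and returns the number of chars copied. */
    public static long copy(Reader in, Writer out) throws IOException {
        char[] buffer = new char[BUFFER_SIZE];
        long total = 0;
        int charsRead;
        // read() blocks until some data is available and returns -1 at end of
        // stream, so no ready()/sleep loop is required before calling it.
        while ((charsRead = in.read(buffer, 0, BUFFER_SIZE)) != -1) {
            out.write(buffer, 0, charsRead);
            total += charsRead;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // In-memory streams stand in for the network and the output file.
        StringWriter out = new StringWriter();
        long n = copy(new StringReader("hello"), out);
        System.out.println(n + " chars copied: " + out); // prints "5 chars copied: hello"
    }
}
```

With this shape, a stalled server is handled by a read timeout on the connection rather than by a longer sleep.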

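    I suspect the multithreading is limited by the size of the connection pool. You could build a pool class whose only job is to create a fixed set of threads and hand the downloads to them.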
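
One way to read the pool suggestion is a fixed-size thread pool from java.util.concurrent, instead of starting one Thread per URL as the question's main() does. The sketch below assumes Java 8 or later. BoundedDownloader, runAll, the pool size of 2, and the stand-in download predicate are all hypothetical names and values chosen for illustration; the real download logic would go where the predicate is. Note that each task receives its own doi, whereas the question's code keeps url and doi in static fields that all threads share.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Predicate;

public class BoundedDownloader {

    /**
     * Runs one download task per doi on at most poolSize threads and
     * returns how many tasks reported success.
     */
    public static int runAll(List<String> dois, int poolSize, Predicate<String> download)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            List<Future<Boolean>> results = new ArrayList<>();
            for (final String doi : dois) {
                // Each task captures its own doi; nothing is shared via static fields.
                Callable<Boolean> job = () -> download.test(doi);
                results.add(pool.submit(job));
            }
            int succeeded = 0;
            for (Future<Boolean> result : results) {
                try {
                    if (result.get()) {
                        succeeded++;
                    }
                } catch (ExecutionException e) {
                    // A failed task is logged and does not stop the others.
                    System.err.println("Download failed: " + e.getCause());
                }
            }
            return succeeded;
        } finally {
            pool.shutdown(); // let the worker threads exit once the queue is drained
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Stand-in task: "succeeds" for any non-empty id, without touching the network.
        List<String> dois = Arrays.asList("doi-1", "doi-2", "doi-3");
        int ok = runAll(dois, 2, doi -> !doi.isEmpty());
        System.out.println(ok + " of " + dois.size() + " tasks succeeded");
    }
}
```

The pool queues the remaining tasks, so only poolSize connections are open at any moment no matter how many URLs are submitted.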

    Also, are your connection pool size and timeout set very low?
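
On the timeout question: in the question's code the User-Agent header and connect timeout are set on conn, but the data is then read through url.openStream(), which opens a separate connection, so those settings are never applied (the second stack trace passes through URL.openStream). A minimal sketch of configuring one connection and reading from that same connection follows. TimedFetch and the two timeout values are assumptions for illustration, not recommended settings.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class TimedFetch {

    // Hypothetical values; tune them for the servers being crawled.
    static final int CONNECT_TIMEOUT_MS = 10_000;
    static final int READ_TIMEOUT_MS = 30_000;

    /** Creates a configured connection; nothing is sent until it is used. */
    public static HttpURLConnection open(URL url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("User-Agent", "Mozilla/4.76");
        conn.setConnectTimeout(CONNECT_TIMEOUT_MS); // limit on establishing the connection
        conn.setReadTimeout(READ_TIMEOUT_MS);       // limit on each blocking read
        return conn;
    }

    /** Downloads url into target and returns the number of bytes written. */
    public static long download(URL url, Path target) throws IOException {
        HttpURLConnection conn = open(url);
        // Read from the configured connection, not from url.openStream().
        try (InputStream in = conn.getInputStream()) {
            return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // openConnection() does not contact the server, so this only shows the settings.
        HttpURLConnection conn = open(new URL("http://www.iteye.com/"));
        System.out.println(conn.getConnectTimeout() + " / " + conn.getReadTimeout());
    }
}
```

Copying raw bytes also avoids decoding the page with the platform default charset and re-encoding it as UTF-8, which the question's reader/writer pair does.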

    Selected as the best answer by the asker.
