大家好,我现在遇到一个问题,手里有50,000多url,需要一一下载下来,由于都是国外的网址,因此相当慢,现在想弄成多线程抓取网页,不过我的代码无法正常运行,请达人不吝指教,多谢了,以下是我的代码:
[code="java"]
import ......
public class Down2011CaseMeshTread extends Thread {
public static int count = 0;
public static List<String> docDoiList = getCaseDoiList2();
private static URL url;
private static String doi;
static int BUFFER_SIZE = 1024*10;
public Down2011CaseMeshTread(String doi) throws MalformedURLException{
String urlStr = "http://www.iteye.com/" + doi;
this.url = new URL(urlStr);
this.doi = doi;
}
public static Connection getConnection() throws Exception {
//连接数据库的
}
public static String getDocNameByDoi(String docDoi){
//获得docDoi对应的文件名,就是下载下来之后文件的存储名字
}
public void Test() throws IOException, InterruptedException{
StringBuffer sb = null;
BufferedReader in = null;
BufferedWriter out = null;
try {
sb = new StringBuffer();
int ch =0;
URLConnection conn = (HttpURLConnection)url.openConnection();
conn.setRequestProperty("User-Agent", "Mozilla/4.76");
conn.setDoOutput(true);
conn.setConnectTimeout(1000*60*10);
in = new BufferedReader(new InputStreamReader(url.openStream()));
FileOutputStream fo = new FileOutputStream("/home/" + getDocNameByDoi(doi));
OutputStreamWriter writer = new OutputStreamWriter(fo, "utf-8");
out = new BufferedWriter(writer);
while (!in.ready())
{
Thread.sleep(500); // wait for stream to be ready.
}
char[] buffer = new char[BUFFER_SIZE];
int charsRead = 0;
while ( (charsRead = in.read(buffer, 0, BUFFER_SIZE)) != -1 ) {
out.write(buffer, 0, charsRead);
}
out.close();
in.close();
}catch(Exception e){
e.printStackTrace();
}
}
public void run(){
try {
Test();
} catch (IOException e) {
e.printStackTrace();
} catch (InterruptedException e) {
e.printStackTrace();
}
}
//应该是run()这里的问题,但我不知道改怎么改????
public static List<String> getCaseDoiList2(){
//获得url链表,一共五万多
}
public static void main(String args[]) throws MalformedURLException{
for(String docDoi : docDoiList){
Down2011CaseMeshTread down = new Down2011CaseMeshTread(docDoi);
down.start();
}
}
}
[/code]
情况就是这么个情况,
请问这个该怎么改阿,非常感谢~
这个程序的问题是,开始执行特别快,大概一分钟能下两三百,但下到大约800的时候就报错
先是读到异常文件终止“EOF”
但这时还可以断断续续的下载
[code]
java.io.IOException: Premature EOF
at sun.net.www.http.ChunkedInputStream.fastRead(ChunkedInputStream.java:252)
at sun.net.www.http.ChunkedInputStream.read(ChunkedInputStream.java:680)
at java.io.FilterInputStream.read(FilterInputStream.java:133)
at sun.net.www.protocol.http.HttpURLConnection$HttpInputStream.read(HttpURLConnection.java:2582)
at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:282)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:324)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:176)
at java.io.InputStreamReader.read(InputStreamReader.java:184)
at java.io.BufferedReader.read1(BufferedReader.java:202)
at java.io.BufferedReader.read(BufferedReader.java:278)
at org.Down2011CaseMeshTread.Test(Down2011CaseMeshTread.java:138)
at org.Down2011CaseMeshTread.run(Down2011CaseMeshTread.java:150)
[/code]
然后再报连接超时错误,这个时候下载就都中断了
[code]
java.net.ConnectException: Connection timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:327)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:193)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:180)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:384)
at java.net.Socket.connect(Socket.java:546)
at java.net.Socket.connect(Socket.java:495)
at sun.net.NetworkClient.doConnect(NetworkClient.java:178)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:409)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:530)
at sun.net.www.http.HttpClient.(HttpClient.java:240)
at sun.net.www.http.HttpClient.New(HttpClient.java:321)
at sun.net.www.http.HttpClient.New(HttpClient.java:338)
at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:935)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:876)
at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:801)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1139)
at java.net.URL.openStream(URL.java:1029)
at org.Down2011CaseMeshTread.Test(Down2011CaseMeshTread.java:127)
at org.Down2011CaseMeshTread.run(Down2011CaseMeshTread.java:150)
[/code]
这个是连接的错误
估计是我没有设定线程的数量,所以在存在大量线程请求连接的时候出了问题,
但怎么修改我还不知道哦~