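I wrote a simple crawler for JD.com (京东). It used to run fine, but now it throws an error every time it reaches one particular URL. The error message and the relevant part of the code are below.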
Exception in thread "main" java.lang.IllegalArgumentException: Illegal character in path at index 38: http://vip.jd.com/fuli/detail/791.html
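// Imports used by this method (Apache HttpClient 4.x)
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.regex.Matcher;
import org.apache.http.Header;
import org.apache.http.HttpEntity;
import org.apache.http.HttpStatus;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.entity.BufferedHttpEntity;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

// charsetRegexp and visitedUrls are fields of the enclosing class (not shown)
public String getContent(CrawlerUrl url) throws Exception {
    String content = null;
    String urlString = url.getUrlString();
    CloseableHttpClient httpclient = HttpClients.createDefault();
    // The code below follows the official HttpClient example for downloading a page
    try {
        HttpGet httpget = new HttpGet(urlString); // the IllegalArgumentException is thrown here
        CloseableHttpResponse response = httpclient.execute(httpget);
        try {
            int statusCode = response.getStatusLine().getStatusCode();
            HttpEntity entity = response.getEntity();
            if ((statusCode == HttpStatus.SC_OK) && (entity != null)) {
                // Buffer the entity so its content can be read more than once
                entity = new BufferedHttpEntity(entity);
                StringBuilder sb = new StringBuilder();
                // The Content-Type header may be missing, so guard against null
                Header contentTypeHeader = entity.getContentType();
                String contentType = (contentTypeHeader == null) ? "" : contentTypeHeader.toString();
                int charsetStart = contentType.indexOf("charset=");
                if (charsetStart != -1) { // charset declared in the header: read the character stream
                    String charset = contentType.substring(charsetStart + 8);
                    BufferedReader reader = new BufferedReader(new InputStreamReader(entity.getContent(), charset));
                    int c;
                    while ((c = reader.read()) != -1) sb.append((char) c);
                    reader.close();
                }
                else { // scan the first lines of the HTML for the charset, then decode the whole page with it
                    BufferedReader firstReader = new BufferedReader(new InputStreamReader(entity.getContent()));
                    String charset = null;
                    String line = null;
                    int charsetStartInHtml;
                    while ((line = firstReader.readLine()) != null) {
                        charsetStartInHtml = line.indexOf("charset=");
                        if (charsetStartInHtml != -1) {
                            Matcher charsetMatcher = charsetRegexp.matcher(line);
                            while (charsetMatcher.find()) charset = charsetMatcher.group(1);
                            break;
                        }
                    }
                    firstReader.close();
                    // Fall back to UTF-8 if the page declares no charset; a null name would throw a NullPointerException
                    if (charset == null) charset = "UTF-8";
                    BufferedReader secondReader = new BufferedReader(new InputStreamReader(entity.getContent(), charset));
                    int c;
                    while ((c = secondReader.read()) != -1) sb.append((char) c);
                    secondReader.close();
                }
                content = sb.toString();
            }
        } finally {
            response.close();
        }
    } finally {
        httpclient.close();
    }
    visitedUrls.put(url.getUrlString(), url);
    url.setIsVisited();
    // System.out.println(content);
    return content;
}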
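On the charset handling: the method looks for `charset=` in two places, first the Content-Type header and then the HTML itself. One possible way to keep that logic in a single spot is a small helper that takes any text, extracts a charset name if one is present, and otherwise returns a default. The sketch below uses only the JDK. `CharsetResolver`, `resolve`, and `CHARSET_PATTERN` are hypothetical names, and the pattern is an assumption, since the original `charsetRegexp` is not shown in the post.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharsetResolver {

    // Assumed pattern (the post does not show charsetRegexp): matches
    // charset=xxx, charset="xxx" and charset='xxx', case-insensitively.
    private static final Pattern CHARSET_PATTERN =
            Pattern.compile("charset\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);

    // Returns the charset named in the given text (a Content-Type header value
    // or a line of HTML), or the fallback if none is named or the name is unknown.
    public static Charset resolve(String text, Charset fallback) {
        if (text == null) {
            return fallback;
        }
        Matcher m = CHARSET_PATTERN.matcher(text);
        if (!m.find()) {
            return fallback;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names
            return fallback;
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve("text/html; charset=ISO-8859-1", StandardCharsets.UTF_8));
        System.out.println(resolve("<meta charset=\"utf-8\">", StandardCharsets.ISO_8859_1));
        System.out.println(resolve("text/html", StandardCharsets.UTF_8));
    }
}
```

The returned `Charset` can be passed straight to `new InputStreamReader(entity.getContent(), charset)`, so the reader never receives a null or unknown charset name.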
Any feedback on the code itself is welcome too. Thanks in advance!
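One reading of the error: the URL printed in the message, `http://vip.jd.com/fuli/detail/791.html`, is exactly 38 characters long, so index 38 points just past its last visible character. That suggests an invisible trailing character (a space, a non-breaking space, a line break) came along when the link was extracted from the page. `new HttpGet(String)` parses its argument as a `java.net.URI`, which rejects such characters. The sketch below, using only the JDK, shows how one might confirm this and clean the URL before making the request. `UrlSanitizer`, `sanitize`, `isJunk`, and `describeCharAt` are hypothetical names, and the trailing space in `main` is an assumed stand-in for whatever character the real page contains.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class UrlSanitizer {

    // Strips leading and trailing whitespace and invisible characters.
    // Assumption: the offending character sits at an end of the URL,
    // as "index 38" on a 38-character URL suggests.
    public static String sanitize(String rawUrl) {
        int start = 0;
        int end = rawUrl.length();
        while (start < end && isJunk(rawUrl.charAt(start))) start++;
        while (end > start && isJunk(rawUrl.charAt(end - 1))) end--;
        return rawUrl.substring(start, end);
    }

    // True for ordinary whitespace, Unicode space separators (including the
    // non-breaking space), the zero-width space, and the byte order mark.
    public static boolean isJunk(char c) {
        return Character.isWhitespace(c) || Character.isSpaceChar(c)
                || c == '\u200B' || c == '\uFEFF';
    }

    // Reports the code point at the index named in the exception message,
    // which makes an invisible character visible for debugging.
    public static String describeCharAt(String s, int index) {
        if (index < 0 || index >= s.length()) {
            return "index is outside the string";
        }
        return String.format("U+%04X", (int) s.charAt(index));
    }

    public static void main(String[] args) throws URISyntaxException {
        // Assumed input: the URL from the error message plus a trailing space
        String raw = "http://vip.jd.com/fuli/detail/791.html ";
        System.out.println("length     = " + raw.length());
        System.out.println("char at 38 = " + describeCharAt(raw, 38));

        // After cleaning, the URI parses without an exception
        URI uri = new URI(sanitize(raw));
        System.out.println("cleaned    = " + uri);
    }
}
```

In `getContent`, this would amount to passing `sanitize(urlString)` to `new HttpGet(...)`. If some links contain spaces or other illegal characters in the middle of the path, trimming is not enough and those characters would need percent-encoding instead.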
Java crawler beginner question: Illegal character in path at index 38