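I wrote a simple crawler to crawl JD.com. It used to run fine, but now it fails with the same error every time it reaches one particular URL. The error message and the relevant part of the code are below.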
Exception in thread "main" java.lang.IllegalArgumentException: Illegal character in path at index 38: http://vip.jd.com/fuli/detail/791.html
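One observation about the message: the URL as printed is exactly 38 characters long, so index 38 points just past the last visible character. That suggests (but does not prove) that the extracted link carries a trailing space or another invisible character. `new HttpGet(String)` in HttpClient 4.x parses its argument with `URI.create`, which is where the `IllegalArgumentException` comes from. The sketch below uses only the JDK to inspect the character at the reported index and to trim such characters before parsing. The class and method names (`UrlSanitizer`, `describeCharAt`, `clean`, `toUriOrNull`) are hypothetical, and the trailing space in `main` is an assumption made for illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;

class UrlSanitizer {

    /** Describes the character at the given index, or reports that the index is out of range. */
    static String describeCharAt(String s, int index) {
        if (index < 0 || index >= s.length()) {
            return "index " + index + " is out of range (length " + s.length() + ")";
        }
        return String.format("index %d is U+%04X", index, (int) s.charAt(index));
    }

    /** Strips leading and trailing whitespace, including non-breaking and full-width spaces. */
    static String clean(String raw) {
        int start = 0;
        int end = raw.length();
        while (start < end && isBlank(raw.charAt(start))) start++;
        while (end > start && isBlank(raw.charAt(end - 1))) end--;
        return raw.substring(start, end);
    }

    private static boolean isBlank(char c) {
        return Character.isWhitespace(c) || Character.isSpaceChar(c);
    }

    /** Returns the parsed URI, or null when the cleaned string is still not a valid URI. */
    static URI toUriOrNull(String raw) {
        try {
            return new URI(clean(raw));
        } catch (URISyntaxException e) {
            return null; // the caller can log and skip this link instead of aborting the crawl
        }
    }

    public static void main(String[] args) {
        // Assumption: the link was extracted with one trailing space.
        String raw = "http://vip.jd.com/fuli/detail/791.html ";
        System.out.println(describeCharAt(raw, 38));
        System.out.println(toUriOrNull(raw));
    }
}
```

In the crawler, something like `toUriOrNull(url.getUrlString())` could be called before constructing the `HttpGet`, skipping the link when it returns null so that one malformed URL does not stop the whole run.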
public String getContent(CrawlerUrl url) throws Exception {
    String content = null;
    String urlString = url.getUrlString();
    CloseableHttpClient httpclient = HttpClients.createDefault();
    // The code below follows the official HttpClient example for downloading a page
    try {
        HttpGet httpget = new HttpGet(urlString);
        CloseableHttpResponse response = httpclient.execute(httpget);
        try {
            int statusCode = response.getStatusLine().getStatusCode();
            HttpEntity entity = response.getEntity();
            if ((statusCode == HttpStatus.SC_OK) && (entity != null)) {
                // Buffer the entity so that getContent() can be called more than once
                entity = new BufferedHttpEntity(entity);
                StringBuilder sb = new StringBuilder();
                String contentType = entity.getContentType().toString();
                int charsetStart = contentType.indexOf("charset=");
                if (charsetStart != -1) { // charset declared in the header: read the character stream
                    String charset = contentType.substring(charsetStart + 8);
                    BufferedReader reader = new BufferedReader(new InputStreamReader(entity.getContent(), charset));
                    int c;
                    while ((c = reader.read()) != -1) sb.append((char) c);
                    reader.close();
                }
                else { // scan the first lines of the HTML for the charset, then decode the whole file with it
                    BufferedReader firstReader = new BufferedReader(new InputStreamReader(entity.getContent()));
                    String charset = null;
                    String line = null;
                    int charsetStartInHtml;
                    while ((line = firstReader.readLine()) != null) {
                        charsetStartInHtml = line.indexOf("charset=");
                        if (charsetStartInHtml != -1) {
                            Matcher charsetMatcher = charsetRegexp.matcher(line);
                            while (charsetMatcher.find()) charset = charsetMatcher.group(1);
                            break;
                        }
                    }
                    firstReader.close();
                    BufferedReader secondReader = new BufferedReader(new InputStreamReader(entity.getContent(), charset));
                    int c;
                    while ((c = secondReader.read()) != -1) sb.append((char) c);
                    secondReader.close();
                }
                content = sb.toString();
            }
        } finally {
            response.close();
        }
    } finally {
        httpclient.close();
    }
    visitedUrls.put(url.getUrlString(), url);
    url.setIsVisited();
    // System.out.println(content);
    return content;
}
Any feedback on the code is also welcome. Thanks!
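On the code itself, one point that stands out is the charset handling: when neither the Content-Type header nor the HTML declares a charset, `charset` stays null and `new InputStreamReader(stream, charset)` throws a `NullPointerException`. A possible simplification, sketched below with the JDK only, is to read the body into a byte array once and then pick the charset from the header, then from the first few kilobytes of markup, and finally fall back to UTF-8. The class `CharsetSniffer`, its methods, and the regular expression are hypothetical (the post does not show how `charsetRegexp` is defined), and the UTF-8 fallback is an assumption.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CharsetSniffer {
    // Hypothetical pattern: matches charset=NAME in a header value or a <meta> tag.
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the first supported charset named in the text, or null if none is found. */
    static Charset find(String text) {
        if (text == null) return null;
        Matcher m = CHARSET.matcher(text);
        while (m.find()) {
            String name = m.group(1);
            try {
                if (Charset.isSupported(name)) return Charset.forName(name);
            } catch (IllegalArgumentException e) {
                // Syntactically illegal charset name: ignore it and keep looking.
            }
        }
        return null;
    }

    /** Decodes the body using the header charset, then the markup charset, then UTF-8. */
    static String decode(byte[] body, String contentTypeHeader) {
        Charset cs = find(contentTypeHeader);
        if (cs == null) {
            // ISO-8859-1 maps every byte to one char, so ASCII markup survives for sniffing.
            int n = Math.min(body.length, 4096);
            cs = find(new String(body, 0, n, StandardCharsets.ISO_8859_1));
        }
        if (cs == null) cs = StandardCharsets.UTF_8; // assumed default
        return new String(body, cs);
    }

    public static void main(String[] args) {
        byte[] body = "<html><head><meta charset=\"utf-8\"></head></html>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(decode(body, "text/html"));
    }
}
```

With HttpClient 4.x the byte array could come from `EntityUtils.toByteArray(entity)`, which would also remove the need for `BufferedHttpEntity` and the two separate readers.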
Java crawler beginner question: Illegal character in path at index 38