weixin_38674188 2017-05-24 11:44 Acceptance rate: 0%
Views: 2231

Java crawler beginner question: Illegal character in path at index 38

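I wrote a simple crawler to crawl JD.com. It used to run fine, but now it throws an error every time it reaches one particular URL. The error message and the relevant part of the code are below.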
Exception in thread "main" java.lang.IllegalArgumentException: Illegal character in path at index 38: http://vip.jd.com/fuli/detail/791.html
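
A note on reading this message: the URL as printed is 38 characters long (indices 0 through 37), so index 38 points at a character just after `.html` that is not visible in the output. That suggests, although the post does not confirm it, a trailing space, line break, or other invisible character picked up when the link was extracted. `HttpGet(String)` parses its argument with `java.net.URI.create`, so the failure can be reproduced with the JDK alone. The sketch below is one way to make the offending character visible; the class `UrlDiagnostics` and its method are hypothetical names, and the trailing space is an assumed stand-in for whatever the crawler actually collected.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class UrlDiagnostics {

    /**
     * Returns a description of the character that java.net.URI rejects,
     * or null when the string parses cleanly.
     */
    static String describeIllegalChar(String url) {
        try {
            new URI(url);
            return null;
        } catch (URISyntaxException e) {
            int index = e.getIndex();
            if (index < 0 || index >= url.length()) {
                // No usable position was reported; fall back to the parser's reason.
                return e.getReason();
            }
            // Print the code point in hex so invisible characters become visible.
            return String.format("index %d: U+%04X", index, url.codePointAt(index));
        }
    }

    public static void main(String[] args) {
        // Assumed input: the URL from the post with a trailing space appended.
        String suspect = "http://vip.jd.com/fuli/detail/791.html ";
        System.out.println(describeIllegalChar(suspect));
        System.out.println(describeIllegalChar(suspect.trim()));
    }
}
```

With the assumed input this prints `index 38: U+0020` and then `null`, since `trim()` removes the trailing space. Running the same check on the real URL string from the crawler would show which character is actually involved.
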
public String getContent(CrawlerUrl url) throws Exception {
    String content = null;
    String urlString = url.getUrlString();
    CloseableHttpClient httpclient = HttpClients.createDefault();
    // The code below follows the page-download example from the official HttpClient documentation.
    try {
        // HttpGet(String) parses the URL with URI.create; the reported
        // IllegalArgumentException is thrown here.
        HttpGet httpget = new HttpGet(urlString);
        CloseableHttpResponse response = httpclient.execute(httpget);
        try {
            int statusCode = response.getStatusLine().getStatusCode();
            HttpEntity entity = response.getEntity();
            if ((statusCode == HttpStatus.SC_OK) && (entity != null)) {
                // Buffer the entity so that getContent() can be called more than once.
                entity = new BufferedHttpEntity(entity);
                StringBuilder sb = new StringBuilder();
                // String.valueOf avoids a NullPointerException when the Content-Type header is missing.
                String contentType = String.valueOf(entity.getContentType());
                int charsetStart = contentType.indexOf("charset=");
                if (charsetStart != -1) { // Charset declared in the header: read the character stream
                    String charset = contentType.substring(charsetStart + 8);
                    BufferedReader reader = new BufferedReader(new InputStreamReader(entity.getContent(), charset));
                    int c;
                    while ((c = reader.read()) != -1) sb.append((char) c);
                    reader.close();
                } else { // Scan the first lines of the HTML for the charset, then decode the whole file with it
                    BufferedReader firstReader = new BufferedReader(new InputStreamReader(entity.getContent()));
                    String charset = null;
                    String line = null;
                    int charsetStartInHtml;
                    while ((line = firstReader.readLine()) != null) {
                        charsetStartInHtml = line.indexOf("charset=");
                        if (charsetStartInHtml != -1) {
                            Matcher charsetMatcher = charsetRegexp.matcher(line);
                            while (charsetMatcher.find()) charset = charsetMatcher.group(1);
                            break;
                        }
                    }
                    firstReader.close();
                    // Fall back to UTF-8 when no charset was found; a null name would throw a NullPointerException.
                    if (charset == null) charset = "UTF-8";
                    BufferedReader secondReader = new BufferedReader(new InputStreamReader(entity.getContent(), charset));
                    int c;
                    while ((c = secondReader.read()) != -1) sb.append((char) c);
                    secondReader.close();
                }
                content = sb.toString();
            }
        } finally {
            response.close();
        }
    } finally {
        httpclient.close();
    }
    visitedUrls.put(url.getUrlString(), url);
    url.setIsVisited();
    // System.out.println(content);
    return content;
}

Any other feedback on the code is welcome too. Thanks in advance!
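
On the request for feedback, one fragile spot in `getContent` is that it takes everything after `charset=` as the charset name. A header value with quotes or further parameters then yields a name that `InputStreamReader` rejects. A minimal sketch of a more defensive lookup, using only the JDK, is shown below; `CharsetSniffer`, `fromContentType`, and the choice of a caller-supplied fallback are illustrative assumptions, not part of the original program.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CharsetSniffer {

    // Matches charset=NAME, optionally quoted, and stops at the first delimiter.
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the charset named in the given text, or the fallback if none is usable. */
    static Charset fromContentType(String text, Charset fallback) {
        if (text == null) {
            return fallback;
        }
        Matcher m = CHARSET.matcher(text);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Covers both illegal and unsupported charset names.
                return fallback;
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(fromContentType("text/html; charset=\"ISO-8859-1\"; q=1", StandardCharsets.UTF_8));
        System.out.println(fromContentType("text/html", StandardCharsets.UTF_8));
    }
}
```

Because the pattern only looks for `charset=`, the same helper could also be applied to a `<meta charset="...">` line found while scanning the HTML. The resulting `Charset` can be passed to the `InputStreamReader(InputStream, Charset)` constructor.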

1 answer

  • 学习编程的小猫 2023-03-23 08:38

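    I ran into this problem before as well, although in my case the URL being passed in was empty. I worked it out with the help of a blog post.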
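
An empty URL, or one that carries stray whitespace such as a trailing space or line break, can be rejected before the string reaches `new HttpGet(...)`. The following is a minimal sketch of such a guard using only the JDK. `UrlGuard` and `toUri` are hypothetical names, and trimming whitespace while skipping everything else that fails to parse is an assumption about what the crawler needs.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Optional;

final class UrlGuard {

    /**
     * Strips surrounding whitespace and returns a parsed absolute URI,
     * or Optional.empty() for null, blank, relative, or unparsable input.
     */
    static Optional<URI> toUri(String raw) {
        if (raw == null) {
            return Optional.empty();
        }
        // trim() removes leading/trailing characters <= U+0020 (spaces, tabs, line breaks).
        String cleaned = raw.trim();
        if (cleaned.isEmpty()) {
            return Optional.empty();
        }
        try {
            URI uri = new URI(cleaned);
            // Relative links have no scheme or host and cannot be fetched as they are.
            return (uri.isAbsolute() && uri.getHost() != null)
                    ? Optional.of(uri)
                    : Optional.empty();
        } catch (URISyntaxException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(toUri("http://vip.jd.com/fuli/detail/791.html\n"));
        System.out.println(toUri("   "));
    }
}
```

In `getContent`, a present result could be handed to the `HttpGet(URI)` constructor, and URLs that come back empty could be logged and skipped instead of aborting the whole crawl.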
