weixin_38674188 2017-05-24 11:44

Java crawler beginner question: Illegal character in path at index 38

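I wrote a simple crawler to crawl JD.com. It used to run fine, but now it fails every time it reaches one particular URL. The error message and the relevant code are below.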
Exception in thread "main" java.lang.IllegalArgumentException: Illegal character in path at index 38: http://vip.jd.com/fuli/detail/791.html
public String getContent(CrawlerUrl url) throws Exception {
    String content = null;
    // Index 38 is one past the last visible character of the failing URL, so a trailing
    // character (most likely whitespace) is suspected; trim before building the request.
    String urlString = url.getUrlString().trim();
    CloseableHttpClient httpclient = HttpClients.createDefault();
    // The code below follows the page-download example from the official HttpClient documentation
    try {
        HttpGet httpget = new HttpGet(urlString);
        CloseableHttpResponse response = httpclient.execute(httpget);
        try {
            int statusCode = response.getStatusLine().getStatusCode();
            HttpEntity entity = response.getEntity();
            if ((statusCode == HttpStatus.SC_OK) && (entity != null)) {
                // Buffer the body so that getContent() can be called more than once
                entity = new BufferedHttpEntity(entity);
                StringBuilder sb = new StringBuilder();
                // getContentType() returns null when the header is absent
                String contentType = entity.getContentType() == null ? "" : entity.getContentType().toString();
                int charsetStart = contentType.indexOf("charset=");
                if (charsetStart != -1) { // charset declared in the header: read the character stream
                    String charset = contentType.substring(charsetStart + 8).trim();
                    BufferedReader reader = new BufferedReader(new InputStreamReader(entity.getContent(), charset));
                    int c;
                    while ((c = reader.read()) != -1) sb.append((char) c);
                    reader.close();
                } else { // scan the first lines of the HTML for the encoding, then decode the whole file with it
                    BufferedReader firstReader = new BufferedReader(new InputStreamReader(entity.getContent()));
                    String charset = null;
                    String line;
                    while ((line = firstReader.readLine()) != null) {
                        if (line.indexOf("charset=") != -1) {
                            Matcher charsetMatcher = charsetRegexp.matcher(line);
                            while (charsetMatcher.find()) charset = charsetMatcher.group(1);
                            break;
                        }
                    }
                    firstReader.close();
                    // No charset found in the page: fall back instead of passing null to InputStreamReader
                    if (charset == null) charset = "UTF-8";
                    BufferedReader secondReader = new BufferedReader(new InputStreamReader(entity.getContent(), charset));
                    int c;
                    while ((c = secondReader.read()) != -1) sb.append((char) c);
                    secondReader.close();
                }
                content = sb.toString();
            }
        } finally {
            response.close();
        }
    } finally {
        httpclient.close();
    }
    visitedUrls.put(url.getUrlString(), url);
    url.setIsVisited();
    // System.out.println(content);
    return content;
}

Any other comments on the code are welcome too. Thanks!
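On the charset handling in the code above: the following is a minimal sketch of pulling a charset name out of either a Content-Type value or a line of HTML, with a fallback when nothing usable is found. The class name `CharsetSniffer`, the method `detect`, and the regular expression are all assumptions made for illustration; the `charsetRegexp` field used in the post is not shown, so this is not necessarily the same pattern.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CharsetSniffer {
    // Hypothetical pattern: matches charset=utf-8, charset="utf-8", charset = 'utf-8'
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the charset named in text, or fallback if none is named or the name is unusable. */
    static Charset detect(String text, Charset fallback) {
        if (text == null) {
            return fallback;
        }
        Matcher m = CHARSET.matcher(text);
        if (!m.find()) {
            return fallback;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            // Covers both malformed names and charsets this JVM does not support
            return fallback;
        }
    }

    public static void main(String[] args) {
        System.out.println(detect("text/html; charset=utf-8", StandardCharsets.ISO_8859_1));
        System.out.println(detect("<meta charset=\"ISO-8859-1\">", StandardCharsets.UTF_8));
        System.out.println(detect("text/html", StandardCharsets.UTF_8));
    }
}
```

Returning a `Charset` rather than a raw string means an invalid name is caught once, here, instead of surfacing later as an exception from `InputStreamReader`.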

1 answer

  • 学习编程的小猫 2023-03-23 08:38
    I ran into this problem before as well, though in my case the URL being passed in was empty. I referred to a blog post for it.
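    The URL shown in the error message is exactly 38 characters long, so index 38 points just past its last visible character. That suggests a trailing character such as a space or line break, though this is an inference from the message and not something confirmed in the thread. A minimal sketch of cleaning and validating a URL string before it reaches `HttpGet`, using only `java.net.URI`; the class `UrlCheck` and its methods are hypothetical names for illustration:

```java
import java.net.URI;
import java.net.URISyntaxException;

final class UrlCheck {
    /** Strips surrounding whitespace; returns null if the input is null or nothing is left. */
    static String clean(String raw) {
        if (raw == null) {
            return null;
        }
        // String.trim() only removes characters up to U+0020, so also strip
        // non-breaking (U+00A0) and full-width (U+3000) spaces at either end.
        String s = raw.replaceAll("^[\\s\\u00A0\\u3000]+|[\\s\\u00A0\\u3000]+$", "");
        return s.isEmpty() ? null : s;
    }

    /** Returns the parsed URI, or null when the string is empty or still not a valid URI. */
    static URI toUri(String raw) {
        String s = clean(raw);
        if (s == null) {
            return null;
        }
        try {
            return new URI(s);
        } catch (URISyntaxException e) {
            // Report where parsing failed so the offending character can be inspected
            System.err.println("Skipping bad URL (" + e.getReason() + " at index "
                    + e.getIndex() + "): " + e.getInput());
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(toUri("http://vip.jd.com/fuli/detail/791.html "));
        System.out.println(toUri(""));
    }
}
```

    A crawler could call `toUri` on each extracted link and skip any that come back `null`, which handles both the empty-URL case and stray trailing characters without aborting the whole run.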
