weixin_42208050 2008-12-23 11:50
205 views
Accepted

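Problem downloading a UTF-8 encoded Japanese web page with Java

I am using Java to download a Japanese search results page from Yahoo. I have set the encoding to UTF-8, but the downloaded page string differs from the page I get in a browser: all the Japanese characters have turned into spaces. What is going wrong?
Here is the code I use to download the page: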
[code="java"]
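import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.Proxy;
import java.net.URL;

// The two methods below belong to the poster's class; the class declaration is not shown.
public static void main(String[] args) throws UnsupportedEncodingException {
    String strEncoding = "UTF-8";
    System.out.println(strEncoding);
    String strText = getHtmlText("http://search.yahoo.com/search?ei=UTF-8&&fr=yfp-t-501&fp_ip=CN&vm=p&b=1&n=10&va_vt=any&vo_vt=any&ve_vt=any&vp_vt=any&vd=m3&vf=pdf&fl=1&vl=lang_ja&vs=&p=123+"
            , 30 * 1000, strEncoding, null, null);
    System.out.println(strText);
}

public static String getHtmlText(String strUrl, int timeout, String strEnCoding, String cookies, Proxy proxy) {
    if (strUrl == null || strUrl.length() == 0) {
        return null;
    }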

    StringBuffer strHtml = null;
    String strLine = "";
    HttpURLConnection httpConnection = null;// can be declared as HttpURLConnection here
    InputStream urlStream = null;
    BufferedInputStream buff = null;
    BufferedReader br = null;
    boolean isError = false;
    try {
        // Connect to the network and fetch the page source
        URL url = new URL(strUrl);
        if (proxy != null) {
            httpConnection = (HttpURLConnection) url.openConnection(proxy);
        }
        else {
            httpConnection = (HttpURLConnection) url.openConnection();
        }
        httpConnection.addRequestProperty("User-Agent", "IcewolfHttp/1.0");
        httpConnection.addRequestProperty("Accept",
                                 "www/source; text/html; image/gif; */*");
        httpConnection.addRequestProperty("Accept-Language", "");
        if (cookies != null) {
            httpConnection.setRequestProperty("Cookie", cookies);
        }      
        httpConnection.setConnectTimeout(timeout);
        httpConnection.setReadTimeout(timeout);
        urlStream = httpConnection.getInputStream();
        buff = new BufferedInputStream(urlStream);
        Reader r = null;
        if (strEnCoding == null || strEnCoding.compareTo("null") == 0) {
            r = new InputStreamReader(buff);
        } else {
            try {
                r = new InputStreamReader(buff, strEnCoding);
            } catch (UnsupportedEncodingException e) {
                r = new InputStreamReader(buff);
            }
        }

        br = new BufferedReader(r);
        strHtml = new StringBuffer("");
        while ((strLine = br.readLine()) != null) {
            strHtml.append(strLine + "\r\n");
        }
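    }catch (java.lang.OutOfMemoryError out) {
        // strHtml is still null if the error occurred before the buffer was created
        System.out.println("Memory usage: " + (strHtml == null ? 0 : strHtml.capacity()));
        out.printStackTrace();
    }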
    catch (Exception e) {
        e.printStackTrace();
        System.out.println(e.getClass() + " downloading page " + strUrl + " failed");
        isError = true;
    } finally{      
        try{
            if (httpConnection != null)
                httpConnection.disconnect();
            if (br != null)
                br.close();
            if (buff != null)
                buff.close();
            if (urlStream != null)
                urlStream.close();
        }catch(Exception e){
            System.out.println(e.getClass() + " closing the connection for page " + strUrl + " failed");
            return null;
        }
    }

    if (strHtml == null || isError)
        return null;
    // fromNCR is the poster's own helper; its source is not shown in the post
    return fromNCR(strHtml.toString());
}

[/code]
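When a downloaded page decodes badly, one thing worth checking is which charset the server actually declares in its Content-Type response header, since the reader above always decodes with the caller-supplied encoding. The following is only a minimal sketch of such a check. The class ContentTypeCharset and the method charsetOf are hypothetical names that do not appear in the original post, and falling back to UTF-8 is an assumption, not something the post specifies.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper, not part of the original post.
class ContentTypeCharset {

    /** Returns the charset named in a Content-Type value, or UTF-8 if none is usable. */
    static Charset charsetOf(String contentType) {
        if (contentType == null) {
            return StandardCharsets.UTF_8; // assumed default
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                // Strip optional quotes around the charset name
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Malformed or unsupported charset name: use the assumed default
                    return StandardCharsets.UTF_8;
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=ISO-8859-1"));
        System.out.println(charsetOf("text/html"));
    }
}
```

In the code above, the value to inspect would be httpConnection.getContentType(), read after the connection has been opened. If the declared charset differs from strEnCoding, the page bytes are being decoded with the wrong encoding.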

1 answer

  • 不良校长 2008-12-23 23:01

    Add an Accept-Charset request header before calling getInputStream():

    httpConnection.addRequestProperty("Accept-Charset", "UTF-8");

    Hurry up and award the points, haha! It took me 10 minutes of testing.
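To show where that header fits, here is a small sketch of the request setup with Accept-Charset added next to the User-Agent header the question already sends. The class AcceptCharsetRequest and the method prepare are hypothetical names for illustration, and the shortened URL in main is a placeholder rather than the poster's full query. Why the server responds differently is not explained in the thread; presumably it picks its output encoding from this header, but that is an assumption.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical wrapper, not part of the original post.
class AcceptCharsetRequest {

    /**
     * Builds a connection with the Accept-Charset header set.
     * No network I/O happens here; the request is only sent once
     * getInputStream() (or connect()) is called on the result.
     */
    static HttpURLConnection prepare(String strUrl, int timeoutMillis) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(strUrl).openConnection();
        conn.addRequestProperty("User-Agent", "IcewolfHttp/1.0");
        // The header suggested in the answer; it must be set before connecting
        conn.addRequestProperty("Accept-Charset", "UTF-8");
        conn.setConnectTimeout(timeoutMillis);
        conn.setReadTimeout(timeoutMillis);
        return conn;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder URL; only the request headers are inspected, nothing is downloaded
        HttpURLConnection conn = prepare("http://search.yahoo.com/search?ei=UTF-8&p=123", 30 * 1000);
        System.out.println(conn.getRequestProperty("Accept-Charset"));
    }
}
```

Request properties can only be changed before the connection is opened, so in getHtmlText the line belongs with the other addRequestProperty calls, ahead of httpConnection.getInputStream().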

    This answer was selected by the asker as the best answer.