Problem downloading a UTF-8 encoded Japanese web page with Java

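I am using Java to download a Japanese search results page from Yahoo. I have set the encoding to UTF-8, but the downloaded page string differs from the page I get in a browser: every Japanese character has turned into a space. What is going wrong?
Here is the code I use to download the page: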
[code="java"]
import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.Proxy;
import java.net.URL;

public static void main(String[] args) {
    String strEncoding = "UTF-8";
    System.out.println(strEncoding);
    String strText = getHtmlText("http://search.yahoo.com/search?ei=UTF-8&&fr=yfp-t-501&fp_ip=CN&vm=p&b=1&n=10&va_vt=any&vo_vt=any&ve_vt=any&vp_vt=any&vd=m3&vf=pdf&fl=1&vl=lang_ja&vs=&p=123+"
            , 30 * 1000, strEncoding, null, null);
    System.out.println(strText);
}

public static String getHtmlText(String strUrl, int timeout, String strEnCoding, String cookies, Proxy proxy) {
    if (strUrl == null || strUrl.length() == 0) {
        return null;
    }

    StringBuilder strHtml = null;
    String strLine;
    HttpURLConnection httpConnection = null;
    InputStream urlStream = null;
    BufferedReader br = null;
    boolean isError = false;
    try {
        // Connect and fetch the page source
        URL url = new URL(strUrl);
        if (proxy != null) {
            httpConnection = (HttpURLConnection) url.openConnection(proxy);
        } else {
            httpConnection = (HttpURLConnection) url.openConnection();
        }
        httpConnection.addRequestProperty("User-Agent", "IcewolfHttp/1.0");
        // Media types in an Accept header are separated by commas, not semicolons
        httpConnection.addRequestProperty("Accept",
                "www/source, text/html, image/gif, */*");
        httpConnection.addRequestProperty("Accept-Language", "");
        if (cookies != null) {
            httpConnection.setRequestProperty("Cookie", cookies);
        }
        httpConnection.setConnectTimeout(timeout);
        httpConnection.setReadTimeout(timeout);
        urlStream = httpConnection.getInputStream();
        Reader r;
        if (strEnCoding == null || "null".equals(strEnCoding)) {
            // No encoding given: fall back to the platform default charset
            r = new InputStreamReader(urlStream);
        } else {
            try {
                r = new InputStreamReader(urlStream, strEnCoding);
            } catch (UnsupportedEncodingException e) {
                r = new InputStreamReader(urlStream);
            }
        }

        // BufferedReader already buffers, so no separate BufferedInputStream is needed
        br = new BufferedReader(r);
        strHtml = new StringBuilder();
        while ((strLine = br.readLine()) != null) {
            strHtml.append(strLine).append("\r\n");
        }
    } catch (OutOfMemoryError out) {
        // strHtml is still null if the error occurred before the buffer was created
        System.out.println("Buffer capacity: " + (strHtml == null ? 0 : strHtml.capacity()));
        out.printStackTrace();
    } catch (Exception e) {
        e.printStackTrace();
        System.out.println(e.getClass() + " failed to download page " + strUrl);
        isError = true;
    } finally {
        // Close the streams first, then release the connection
        try {
            if (br != null)
                br.close();
            if (urlStream != null)
                urlStream.close();
        } catch (Exception e) {
            System.out.println(e.getClass() + " failed to close connection for page " + strUrl);
            isError = true;
        }
        if (httpConnection != null)
            httpConnection.disconnect();
    }

    if (strHtml == null || isError)
        return null;
    // fromNCR is the asker's own helper and is not shown in the post
    return fromNCR(strHtml.toString());
}

[/code]
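One way to narrow a problem like this down is to check which charset the server actually declares instead of assuming UTF-8. The sketch below is not from the original post: `ContentTypeCharset` and `fromHeader` are hypothetical names, and it assumes you pass in the value returned by `httpConnection.getContentType()`. It only parses the header value, so it runs without a network connection.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper for illustration; not part of the asker's code.
class ContentTypeCharset {

    /**
     * Extracts the charset parameter from a Content-Type header value such as
     * "text/html; charset=Shift_JIS". Returns the fallback when the header is
     * null, has no charset parameter, or names a charset this JVM cannot use.
     */
    static Charset fromHeader(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim();
                // Strip optional surrounding quotes, as in charset="utf-8"
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Covers both illegal and unsupported charset names
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(fromHeader("text/html; charset=Shift_JIS", StandardCharsets.UTF_8));
        System.out.println(fromHeader("text/html", StandardCharsets.UTF_8));
    }
}
```

If the charset reported here differs from the one passed to `InputStreamReader`, the decoded text will not match what the browser shows.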

1 answer

不良校长 2008-12-23 23:01
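Add an Accept-Charset request header before opening the input stream:
httpConnection.addRequestProperty("Accept-Charset", "UTF-8");

Now hand over the points, quick! Haha. It took me 10 minutes of testing.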
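For reference, here is a minimal sketch of a download routine with that header applied. It is an illustration under assumptions, not the answerer's tested code: `Utf8Download`, `fetch`, and `readAll` are hypothetical names, and whether a given server honors `Accept-Charset` depends on the server. The `main` method only exercises the decoding step on in-memory bytes, so it runs offline.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical class for illustration; not part of the original post.
class Utf8Download {

    /** Requests the URL with Accept-Charset: UTF-8 and decodes the body as UTF-8. */
    static String fetch(String strUrl, int timeoutMillis) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(strUrl).openConnection();
        try {
            // The header suggested in the answer; it must be set before connecting
            conn.setRequestProperty("Accept-Charset", "UTF-8");
            conn.setConnectTimeout(timeoutMillis);
            conn.setReadTimeout(timeoutMillis);
            try (InputStream in = conn.getInputStream()) {
                return readAll(in, StandardCharsets.UTF_8);
            }
        } finally {
            conn.disconnect();
        }
    }

    /** Reads the whole stream line by line using the given charset. */
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader br = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = br.readLine()) != null) {
                sb.append(line).append("\r\n");
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Offline check: UTF-8 bytes for a Japanese string survive the round trip
        String japanese = "\u65E5\u672C\u8A9E";
        byte[] bytes = japanese.getBytes(StandardCharsets.UTF_8);
        String decoded = readAll(new ByteArrayInputStream(bytes), StandardCharsets.UTF_8);
        System.out.println(decoded.trim().equals(japanese)); // prints true
    }
}
```

Passing an explicit `Charset` object instead of a charset name avoids the `UnsupportedEncodingException` fallback branch in the question's code.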

The asker selected this answer as the best answer.

Question asked: 2008-12-23 11:50