Encoding problems when scraping non-English websites

Posted 2024-12-07 21:24:26

I'm trying to get the contents of a webpage as a string. I found this question addressing how to write a basic web crawler, which claims to (and seems to) handle the encoding issue; however, the code provided there, which works for US/English websites, fails to properly handle other languages.

Here is a full Java class that demonstrates what I'm referring to:

import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;


public class I18NScraper
{
    static
    {
        System.setProperty("http.agent", "");
    }

    public static final String IE8_USER_AGENT = "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; WOW64; Trident/4.0; SLCC1; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; InfoPath.2)";

  //https://stackoverflow.com/questions/1381617/simplest-way-to-correctly-load-html-from-web-page-into-a-string-in-java
    private static final Pattern CHARSET_PATTERN = Pattern.compile("text/html;\\s+charset=([^\\s]+)\\s*");
    public static String getPageContentsFromURL(String page) throws UnsupportedEncodingException, MalformedURLException, IOException {
        Reader r = null;
        try {
            URL url = new URL(page);
            HttpURLConnection con = (HttpURLConnection)url.openConnection();
            con.setRequestProperty("User-Agent", IE8_USER_AGENT);

            Matcher m = CHARSET_PATTERN.matcher(con.getContentType());
            /* If Content-Type doesn't match this pre-conception, choose default and 
             * hope for the best. */
            String charset = m.matches() ? m.group(1) : "ISO-8859-1";
            r = new InputStreamReader(con.getInputStream(),charset);
            StringBuilder buf = new StringBuilder();
            while (true) {
              int ch = r.read();
              if (ch < 0)
                break;
              buf.append((char) ch);
            }
            return buf.toString();
        } finally {
            if(r != null){
                r.close();
            }
        }
    }

    private static final Pattern TITLE_PATTERN = Pattern.compile("<title>([^<]*)</title>");
    public static String getDesc(String page){
        Matcher m = TITLE_PATTERN.matcher(page);
        if(m.find())
            return m.group(1);
        return page.contains("<title>")+"";
    }

    public static void main(String[] args) throws UnsupportedEncodingException, MalformedURLException, IOException{
        System.out.println(getDesc(getPageContentsFromURL("http://yandex.ru/yandsearch?text=%D0%A0%D0%B5%D0%B7%D1%83%D0%BB%D1%8C%D1%82%D0%B0%D1%82%D0%BE%D0%B2&lr=223")));
    }
}

Which outputs:

???????????&nbsp;&mdash; ??????: ??????? 360&nbsp;???&nbsp;???????

Though it ought to be:

Результатов&nbsp;&mdash; Яндекс: Нашлось 360&nbsp;млн&nbsp;ответов

Can you help me understand what I'm doing wrong? Trying things like forcing UTF-8 does not help, despite that being the charset listed in the page source and the HTTP header.

Comments (3)

网白 2024-12-14 21:24:26

Determining the right charset encoding can be tricky.

You need to use a combination of

a) the HTML META Content-Type tag:

<META http-equiv="Content-Type" content="text/html; charset=EUC-JP">

b) the HTTP response header:

Content-Type: text/html; charset=utf-8

c) Heuristics to detect charset from bytes (see this question)

The reason for using all three is:

  1. (a) and (b) might be missing
  2. the META Content-Type might be wrong (see this question)

What to do if (a) and (b) are both missing?

In that case you need to use some heuristics to determine the correct encoding - see this question.

I find this sequence to be the most reliable for robustly identifying the charset encoding of an HTML page:

  1. Use the HTTP response header Content-Type (if present)
  2. Use an encoding detector on the response content bytes
  3. Use the HTML META Content-Type

but you might choose to swap 2 and 3.
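
Here is a minimal sketch of that sequence in Java. It assumes juniversalchardet (org.mozilla.universalchardet.UniversalDetector) as the byte-level detector for step 2; the class and helper names are illustrative rather than a fixed API, and any charset-detection library could stand in.

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.mozilla.universalchardet.UniversalDetector;

public class CharsetSniffer {

    // Matches charset=... in an HTTP Content-Type header or an HTML META tag.
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?([\\w-]+)", Pattern.CASE_INSENSITIVE);

    // Step 1: HTTP response header Content-Type, if it names a charset.
    static String fromHttpHeader(String contentType) {
        if (contentType == null) return null;
        Matcher m = CHARSET.matcher(contentType);
        return m.find() ? m.group(1) : null;
    }

    // Step 2: byte-level detection (here via juniversalchardet).
    static String fromDetector(byte[] body) {
        UniversalDetector detector = new UniversalDetector(null);
        detector.handleData(body, 0, body.length);
        detector.dataEnd();
        return detector.getDetectedCharset();   // may be null
    }

    // Step 3: look for a charset= declaration (e.g. in an HTML META Content-Type
    // tag) by scanning the raw bytes leniently as ISO-8859-1 text.
    static String fromMetaTag(byte[] body) {
        Matcher m = CHARSET.matcher(new String(body, Charset.forName("ISO-8859-1")));
        return m.find() ? m.group(1) : null;
    }

    // Read the whole body once so all three strategies see the same bytes.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        for (int n; (n = in.read(chunk)) > 0; ) out.write(chunk, 0, n);
        return out.toByteArray();
    }

    public static String fetch(String page) throws IOException {
        HttpURLConnection con = (HttpURLConnection) new URL(page).openConnection();
        byte[] body = readAll(con.getInputStream());

        String charset = fromHttpHeader(con.getContentType()); // 1. HTTP header
        if (charset == null) charset = fromDetector(body);     // 2. byte detector
        if (charset == null) charset = fromMetaTag(body);      // 3. META tag
        if (charset == null) charset = "UTF-8";                // last-resort default

        return new String(body, Charset.forName(charset));
    }
}

Swapping steps 2 and 3, as suggested above, is just a matter of reordering the two fallback checks in fetch().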

∞梦里开花 2024-12-14 21:24:26

The problem you are seeing is that the encoding on your Mac doesn't support Cyrillic script. I'm not sure if it's true on an Oracle JVM, but when Apple was producing their own JVMs, the default character encoding for Java was MacRoman.

When you start your program, specify the file.encoding system property to set the character encoding to UTF-8 (which is what Mac OS X uses by default). Note that you have to set it when you launch: java -Dfile.encoding=UTF-8 ...; if you set it programmatically (with a call to System.setProperty()), it's too late, and the setting will be ignored.

Whenever Java needs to encode characters to bytes—for example, when it's converting text to bytes to write to the standard output or error streams—it will use the default unless you explicitly specify a different one. If the default encoding can't encode a particular character, a suitable replacement character is substituted.

If the encoding can handle the Unicode replacement character, U+FFFD (�), that is what's used. Otherwise, a question mark (?) is a commonly used replacement character.
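
For the output side specifically, one way to avoid depending on the launch-time flag is to write to a PrintStream constructed with an explicit charset. This is only a sketch, and it still assumes the terminal itself is configured for UTF-8:

import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class Utf8Out {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // Encode with UTF-8 explicitly instead of relying on the JVM's
        // default charset (file.encoding), which may be MacRoman here.
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.println("Результатов — Яндекс: Нашлось 360 млн ответов");
    }
}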

清晨说晚安 2024-12-14 21:24:26

Apache Tika contains an implementation of what you want here. Many people use it for this. You could also look into Apache Nutch, in which case you wouldn't have to implement your own crawler at all.
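
As a rough illustration (assuming the tika-core and tika-parsers dependencies are on the classpath, and using the generic Tika facade rather than any crawler-specific API), fetching a page and letting Tika handle type and charset detection might look like:

import java.net.URL;
import org.apache.tika.Tika;

public class TikaFetch {
    public static void main(String[] args) throws Exception {
        // Tika sniffs the content type and character encoding itself
        // and returns the extracted plain text.
        Tika tika = new Tika();
        String text = tika.parseToString(
                new URL("http://yandex.ru/yandsearch?text=%D0%A0%D0%B5%D0%B7%D1%83%D0%BB%D1%8C%D1%82%D0%B0%D1%82%D0%BE%D0%B2&lr=223"));
        System.out.println(text);
    }
}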
