Encoding problems when scraping non-English websites

Posted 2024-12-07 21:24:26

I'm trying to get the contents of a webpage as a string. I found this question addressing how to write a basic web crawler, which claims to (and seems to) handle the encoding issue; however, the code provided there, which works for US/English websites, fails to properly handle other languages.

Here is a full Java class that demonstrates what I'm referring to:

import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;


public class I18NScraper
{
    static
    {
        System.setProperty("http.agent", "");
    }

    public static final String IE8_USER_AGENT = "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; WOW64; Trident/4.0; SLCC1; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; InfoPath.2)";

  //https://stackoverflow.com/questions/1381617/simplest-way-to-correctly-load-html-from-web-page-into-a-string-in-java
    private static final Pattern CHARSET_PATTERN = Pattern.compile("text/html;\\s+charset=([^\\s]+)\\s*");
    public static String getPageContentsFromURL(String page) throws UnsupportedEncodingException, MalformedURLException, IOException {
        Reader r = null;
        try {
            URL url = new URL(page);
            HttpURLConnection con = (HttpURLConnection)url.openConnection();
            con.setRequestProperty("User-Agent", IE8_USER_AGENT);

            Matcher m = CHARSET_PATTERN.matcher(con.getContentType());
            /* If Content-Type doesn't match this pre-conception, choose default and 
             * hope for the best. */
            String charset = m.matches() ? m.group(1) : "ISO-8859-1";
            r = new InputStreamReader(con.getInputStream(),charset);
            StringBuilder buf = new StringBuilder();
            while (true) {
              int ch = r.read();
              if (ch < 0)
                break;
              buf.append((char) ch);
            }
            return buf.toString();
        } finally {
            if(r != null){
                r.close();
            }
        }
    }

    private static final Pattern TITLE_PATTERN = Pattern.compile("<title>([^<]*)</title>");
    public static String getDesc(String page){
        Matcher m = TITLE_PATTERN.matcher(page);
        if(m.find())
            return m.group(1);
        return page.contains("<title>")+"";
    }

    public static void main(String[] args) throws UnsupportedEncodingException, MalformedURLException, IOException{
        System.out.println(getDesc(getPageContentsFromURL("http://yandex.ru/yandsearch?text=%D0%A0%D0%B5%D0%B7%D1%83%D0%BB%D1%8C%D1%82%D0%B0%D1%82%D0%BE%D0%B2&lr=223")));
    }
}

Which outputs:

???????????&nbsp;&mdash; ??????: ??????? 360&nbsp;???&nbsp;???????

Though it ought to be:

Результатов&nbsp;&mdash; Яндекс: Нашлось 360&nbsp;млн&nbsp;ответов

Can you help me understand what I'm doing wrong? Trying things like forcing UTF-8 does not help, despite that being the charset listed in the page source and the HTTP header.

Comments (3)

网白 2024-12-14 21:24:26

Determining the right charset encoding can be tricky.

You need to use a combination of

a) the HTML META Content-Type tag:

<META http-equiv="Content-Type" content="text/html; charset=EUC-JP">

b) the HTTP response header:

Content-Type: text/html; charset=utf-8

c) Heuristics to detect charset from bytes (see this question)

The reason for using all three is:

  1. (a) and (b) might be missing
  2. the META Content-Type might be wrong (see this question)

What to do if (a) and (b) are both missing?

In that case you need to use some heuristics to determine the correct encoding - see this question.

I find this sequence to be the most reliable for robustly identifying the charset encoding of an HTML page:

  1. Use the HTTP response header Content-Type (if present)
  2. Use an encoding detector on the response content bytes
  3. Use the HTML META Content-Type

but you might choose to swap 2 and 3.
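
Here is a minimal sketch of that sequence in Java. It assumes juniversalchardet (org.mozilla.universalchardet.UniversalDetector) as the byte-level detector for step 2; the class and helper names are illustrative rather than a fixed API, and any charset-detection library could stand in.

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.mozilla.universalchardet.UniversalDetector;

public class CharsetSniffer {

    // Matches charset=... in an HTTP Content-Type header or an HTML META tag.
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?([\\w-]+)", Pattern.CASE_INSENSITIVE);

    // Step 1: HTTP response header Content-Type, if it names a charset.
    static String fromHttpHeader(String contentType) {
        if (contentType == null) return null;
        Matcher m = CHARSET.matcher(contentType);
        return m.find() ? m.group(1) : null;
    }

    // Step 2: byte-level detection (here via juniversalchardet).
    static String fromDetector(byte[] body) {
        UniversalDetector detector = new UniversalDetector(null);
        detector.handleData(body, 0, body.length);
        detector.dataEnd();
        return detector.getDetectedCharset();   // may be null
    }

    // Step 3: look for a charset= declaration (e.g. in an HTML META Content-Type
    // tag) by scanning the raw bytes leniently as ISO-8859-1 text.
    static String fromMetaTag(byte[] body) {
        Matcher m = CHARSET.matcher(new String(body, Charset.forName("ISO-8859-1")));
        return m.find() ? m.group(1) : null;
    }

    // Read the whole body once so all three strategies see the same bytes.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        for (int n; (n = in.read(chunk)) > 0; ) out.write(chunk, 0, n);
        return out.toByteArray();
    }

    public static String fetch(String page) throws IOException {
        HttpURLConnection con = (HttpURLConnection) new URL(page).openConnection();
        byte[] body = readAll(con.getInputStream());

        String charset = fromHttpHeader(con.getContentType()); // 1. HTTP header
        if (charset == null) charset = fromDetector(body);     // 2. byte detector
        if (charset == null) charset = fromMetaTag(body);      // 3. META tag
        if (charset == null) charset = "UTF-8";                // last-resort default

        return new String(body, Charset.forName(charset));
    }
}

Swapping steps 2 and 3, as suggested above, is just a matter of reordering the two fallback checks in fetch().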

∞梦里开花 2024-12-14 21:24:26

The problem you are seeing is that the encoding on your Mac doesn't support Cyrillic script. I'm not sure if it's true on an Oracle JVM, but when Apple was producing their own JVMs, the default character encoding for Java was MacRoman.

When you start your program, specify the file.encoding system property to set the character encoding to UTF-8 (which is what Mac OS X uses by default). Note that you have to set it when you launch: java -Dfile.encoding=UTF-8 ...; if you set it programmatically (with a call to System.setProperty()), it's too late, and the setting will be ignored.

Whenever Java needs to encode characters to bytes—for example, when it's converting text to bytes to write to the standard output or error streams—it will use the default unless you explicitly specify a different one. If the default encoding can't encode a particular character, a suitable replacement character is substituted.

If the encoding can handle the Unicode replacement character, U+FFFD (�), that is what's used. Otherwise, a question mark (?) is a commonly used replacement character.
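
For the output side specifically, one way to avoid depending on the launch-time flag is to write to a PrintStream constructed with an explicit charset. This is only a sketch, and it still assumes the terminal itself is configured for UTF-8:

import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class Utf8Out {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // Encode with UTF-8 explicitly instead of relying on the JVM's
        // default charset (file.encoding), which may be MacRoman here.
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.println("Результатов — Яндекс: Нашлось 360 млн ответов");
    }
}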

清晨说晚安 2024-12-14 21:24:26

Apache Tika contains an implementation of what you want here. Many people use it for this. You could also look into Apache Nutch, in which case you wouldn't have to implement your own crawler at all.
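
As a rough illustration (assuming the tika-core and tika-parsers dependencies are on the classpath, and using the generic Tika facade rather than any crawler-specific API), fetching a page and letting Tika handle type and charset detection might look like:

import java.net.URL;
import org.apache.tika.Tika;

public class TikaFetch {
    public static void main(String[] args) throws Exception {
        // Tika sniffs the content type and character encoding itself
        // and returns the extracted plain text.
        Tika tika = new Tika();
        String text = tika.parseToString(
                new URL("http://yandex.ru/yandsearch?text=%D0%A0%D0%B5%D0%B7%D1%83%D0%BB%D1%8C%D1%82%D0%B0%D1%82%D0%BE%D0%B2&lr=223"));
        System.out.println(text);
    }
}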
