java.util.Scanner and Wikipedia

Posted on 2024-07-13 11:04:08

I'm trying to use java.util.Scanner to fetch Wikipedia content and use it for word-based searches.
It mostly works, but for some words I get errors.
After looking at the code and running some checks, it turns out that for some words
the encoding does not seem to be recognized, and the content is no longer readable.
This is the code used to fetch the page:

// -Start-

try {
    connection = new URL("http://it.wikipedia.org/wiki/" + word).openConnection();
    Scanner scanner = new Scanner(connection.getInputStream());
    scanner.useDelimiter("\\Z");
    content = scanner.next();
    // if (word.equals("pubblico"))
    //     System.out.println(content);
    System.out.println("Doing: " + word);
//End

The problem arises with words such as "pubblico" on the Italian Wikipedia.
The result of the println for the word pubblico looks like this (truncated):
ï¿ï¿½]Ksr>�~E
�1A���E�ER3tHZ�4v��&PZjtc�¿½ï¿½D�7_|����=8��Ø}

Do you have any idea why? I looked at the page source and the headers, and they are the same, with the same encoding...

It turned out that the content is gzipped. Can I tell Wikipedia not to send me its pages compressed, or is decompressing them the only way? Thank you.

Comments (5)

送舟行 2024-07-20 11:04:08

Try using the Scanner with a specified character set:

public Scanner(InputStream source, String charsetName)

For the single-argument constructor Scanner(InputStream source), the documentation says:

Bytes from the stream are converted into characters using the underlying platform's default charset.

Scanner on java.sun.com
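
As a minimal sketch of the two-argument constructor, the snippet below decodes an in-memory UTF-8 byte stream instead of a live connection. The class name CharsetScannerDemo and the helper readAll are made up for illustration; they are not part of the question's code.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

class CharsetScannerDemo {
    // Hypothetical helper: reads the whole stream as text in the given charset.
    static String readAll(InputStream in, String charsetName) {
        try (Scanner scanner = new Scanner(in, charsetName)) {
            // "\\A" matches only at the start of input, so the first token is the entire stream.
            scanner.useDelimiter("\\A");
            return scanner.hasNext() ? scanner.next() : "";
        }
    }

    public static void main(String[] args) {
        // Non-ASCII characters are written as Unicode escapes to keep the source file ASCII-only.
        byte[] utf8 = "perch\u00e9 \u00e8 pubblico".getBytes(StandardCharsets.UTF_8);
        String text = readAll(new ByteArrayInputStream(utf8), "UTF-8");
        System.out.println(text.equals("perch\u00e9 \u00e8 pubblico"));
    }
}
```

Note that naming the charset only helps when the bytes really are text in that charset; it does nothing for a body that is still gzip-compressed.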

柠檬心 2024-07-20 11:04:08

Try using a Reader instead of an InputStream - I think it works something like this:

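connection = new URL("http://it.wikipedia.org/wiki/" + word).openConnection();
// The Content-Type header looks like "text/html; charset=UTF-8"; it may be null
String ctype = connection.getContentType();
int csi = (ctype == null) ? -1 : ctype.indexOf("charset=");
Scanner scanner;
if (csi >= 0)
    // Decode with the charset declared by the server
    scanner = new Scanner(new InputStreamReader(connection.getInputStream(), ctype.substring(csi + 8).trim()));
else
    // No charset declared: fall back to the platform default
    scanner = new Scanner(new InputStreamReader(connection.getInputStream()));
scanner.useDelimiter("\\Z");  // read up to the end of the input
content = scanner.next();
if (word.equals("pubblico"))
    System.out.println(content);
System.out.println("Doing: " + word);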

You could also just pass the charset to the Scanner constructor directly as indicated in another answer.

∞梦里开花 2024-07-20 11:04:08

You need to use a URLConnection so that you can read the Content-Type header of the response. This should tell you the character encoding to use when you create your Scanner.

Specifically, look at the "charset" parameter of the content-type header.


To inhibit gzip compression, set the accept-encoding header to "identity". See the HTTP specification for more information.
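
A rough sketch of both suggestions together, assuming a plain URLConnection: it requests an uncompressed body and reads the charset parameter from the Content-Type header. The class name IdentityFetchDemo, the helpers charsetFrom and fetch, and the UTF-8 fallback are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.util.Locale;
import java.util.Scanner;

class IdentityFetchDemo {
    // Hypothetical helper: extracts the charset parameter from a Content-Type
    // value such as "text/html; charset=UTF-8". Falls back to UTF-8 (an assumption).
    static String charsetFrom(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    return p.substring("charset=".length()).replace("\"", "").trim();
                }
            }
        }
        return "UTF-8";
    }

    // Hypothetical helper: downloads a page as text without asking for compression.
    static String fetch(String address) throws IOException {
        URLConnection connection = new URL(address).openConnection();
        // Must be set before the connection is opened; a server may still ignore it.
        connection.setRequestProperty("Accept-Encoding", "identity");
        String charset = charsetFrom(connection.getContentType());
        try (InputStream in = connection.getInputStream();
             Scanner scanner = new Scanner(in, charset)) {
            scanner.useDelimiter("\\A");  // first token is the whole body
            return scanner.hasNext() ? scanner.next() : "";
        }
    }
}
```

Calling fetch("http://it.wikipedia.org/wiki/" + word) would be the equivalent of the question's code. Since the server is free to compress anyway, checking getContentEncoding() on the response remains a sensible safeguard.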

迷爱 2024-07-20 11:04:08
connection = new URL("http://it.wikipedia.org/wiki/" + word).openConnection();
connection.addRequestProperty("Accept-Encoding", "");
System.out.println(connection.getContentEncoding());
Scanner scanner = new Scanner(new InputStreamReader(connection.getInputStream()));
scanner.useDelimiter("\\Z");
content = new String(scanner.next());

The encoding doesn't change. Why?

半岛未凉 2024-07-20 11:04:08
// Requires java.io.InputStream, java.net.URL, java.util.Scanner,
// and java.util.zip.GZIPInputStream, Inflater, InflaterInputStream
connection = new URL("http://it.wikipedia.org/wiki/" + word).openConnection();
//connection.addRequestProperty("Accept-Encoding","");
//System.out.println(connection.getContentEncoding());

InputStream resultingInputStream;                   // stream the downloaded page is read from
String encoding = connection.getContentEncoding();  // content encoding sent by the server (identity, gzip, deflate); may be null
// Choose the appropriate decompressor for reading the source
if ("gzip".equalsIgnoreCase(encoding)) {
    resultingInputStream = new GZIPInputStream(connection.getInputStream());
}
else if ("deflate".equalsIgnoreCase(encoding)) {
    // Inflater(true) expects raw deflate data without a zlib header
    resultingInputStream = new InflaterInputStream(connection.getInputStream(), new Inflater(true));
}
else {
    resultingInputStream = connection.getInputStream();
}

// Scanner that extracts the page from the stream into a string
Scanner scanner = new Scanner(resultingInputStream);
scanner.useDelimiter("\\Z");
content = scanner.next();

This works!
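
The same idea can be tried without a network connection. The sketch below compresses a string in memory as a stand-in for a gzipped response body, then unwraps and decodes it with an explicit charset. The class name GzipDecodeDemo and the helpers decode and gzip are hypothetical, and only the gzip case is handled here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipDecodeDemo {
    // Hypothetical helper: unwraps gzip when the Content-Encoding value says so,
    // then decodes the bytes as text in the given charset.
    static String decode(InputStream raw, String contentEncoding, String charsetName) throws IOException {
        InputStream in = "gzip".equalsIgnoreCase(contentEncoding) ? new GZIPInputStream(raw) : raw;
        try (Scanner scanner = new Scanner(in, charsetName)) {
            scanner.useDelimiter("\\A");  // first token is the whole stream
            return scanner.hasNext() ? scanner.next() : "";
        }
    }

    // Compresses a string in memory, standing in for a gzipped HTTP response body.
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] body = gzip("pubblico \u00e8 una parola");
        System.out.println(decode(new ByteArrayInputStream(body), "gzip", "UTF-8"));
    }
}
```

Passing the charset explicitly avoids depending on the platform default, which the Scanner(InputStream) constructor would otherwise use.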
