java.util.Scanner and Wikipedia
I'm trying to use java.util.Scanner to fetch Wikipedia pages and use their content for word-based searches.
It mostly works, but for some words the result is wrong.
After looking at the code and running some checks, it turned out that for some words the encoding does not seem to be recognized, and the content is no longer readable.
This is the code used to fetch the page:
// -Start-
try {
    connection = new URL("http://it.wikipedia.org/wiki/" + word).openConnection();
    Scanner scanner = new Scanner(connection.getInputStream());
    scanner.useDelimiter("\\Z");
    content = scanner.next();
    // if (word.equals("pubblico"))
    //     System.out.println(content);
    System.out.println("Doing: " + word);
//End
The problem arises with words such as "pubblico" on the Italian Wikipedia.
The result of the println for the word pubblico looks like this (truncated):
ï¿ï¿½]Ksr>�~E
�1A���E�ER3tHZ�4v��&PZjtc�¿½ï¿½D�7_|����=8��Ø}
Do you have any idea why? Yet I looked at the page source and the headers, and they are the same, with the same encoding...
It turned out that the content is gzipped. Can I tell Wikipedia not to send me its pages compressed, or is decompressing them the only way? Thank you.
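Try using the Scanner with an explicitly specified character set. The constructor that takes only an InputStream decodes the bytes with the platform's default charset. See the Scanner documentation on java.sun.com.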
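As a minimal sketch of that suggestion, the snippet below passes a charset name to the two-argument Scanner constructor. The class name, the `readAll` helper, and the choice of "UTF-8" are assumptions made for illustration; an in-memory byte array stands in for `connection.getInputStream()`.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

class ScannerCharsetExample {
    // Hypothetical helper: reads the whole stream as a single token,
    // decoding the bytes with the named charset instead of the platform default.
    static String readAll(InputStream in, String charsetName) {
        try (Scanner scanner = new Scanner(in, charsetName)) {
            // "\\A" matches only at the start of the input, so next() returns everything.
            scanner.useDelimiter("\\A");
            return scanner.hasNext() ? scanner.next() : "";
        }
    }

    public static void main(String[] args) {
        // Stand-in for the bytes of a UTF-8 encoded page ("citt\u00e0" is "città").
        byte[] page = "citt\u00e0".getBytes(StandardCharsets.UTF_8);
        System.out.println(readAll(new ByteArrayInputStream(page), "UTF-8"));
    }
}
```

Note that this only helps with the character encoding. If the bytes are still gzip-compressed, no charset will make them readable.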
Try using a Reader instead of an InputStream, so that the bytes are decoded with a charset you choose. You could also just pass the charset to the Scanner constructor directly, as indicated in another answer.
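One way the Reader approach might look is sketched here: the stream is wrapped in an `InputStreamReader` created with an explicit charset, and the Scanner reads from that Reader. The class name and the `readAll` helper are hypothetical, and ISO-8859-1 is picked only to show a non-default charset.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

class ReaderScannerExample {
    // Hypothetical helper: the Reader does the byte-to-character decoding,
    // so the Scanner only ever sees characters.
    static String readAll(InputStream in, Charset charset) {
        Reader reader = new InputStreamReader(in, charset);
        try (Scanner scanner = new Scanner(reader)) {
            scanner.useDelimiter("\\A"); // whole input as one token
            return scanner.hasNext() ? scanner.next() : "";
        }
    }

    public static void main(String[] args) {
        // Stand-in for a response body encoded as ISO-8859-1 ("perch\u00e9" is "perché").
        byte[] body = "perch\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(readAll(new ByteArrayInputStream(body), StandardCharsets.ISO_8859_1));
    }
}
```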
You need to use a URLConnection, so that you can determine the content-type header in the response. This should tell you the character encoding to use when you create your Scanner. Specifically, look at the "charset" parameter of the content-type header.
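A rough sketch of reading that "charset" parameter follows. The class name, the `charsetFrom` helper, and the fallback argument are assumptions for illustration, and the parsing is deliberately simple rather than a full implementation of the header grammar.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

class ContentTypeCharset {
    // Hypothetical helper: extracts the charset parameter from a Content-Type
    // value such as "text/html; charset=UTF-8", or returns the fallback.
    static Charset charsetFrom(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        String[] parts = contentType.split(";");
        // parts[0] is the media type itself; parameters start at index 1.
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = param.substring("charset=".length()).trim();
                // The value may be quoted, e.g. charset="utf-8".
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name: use the fallback.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        Charset cs = charsetFrom("text/html; charset=UTF-8", StandardCharsets.ISO_8859_1);
        System.out.println(cs.name());
    }
}
```

With a real connection, the header value would come from `connection.getContentType()`, and the resulting charset's name could then be passed to the Scanner constructor.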
To inhibit gzip compression, set the accept-encoding header to "identity". See the HTTP specification for more information.
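The following sketch shows how the request header might be set, under the assumption that the server honors it. As a precaution it also checks the response's Content-Encoding and decompresses when the body is gzipped anyway. The class name and the `open` and `decode` helpers are hypothetical; `main` exercises only `decode` with in-memory data, so it makes no network request.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class IdentityEncodingExample {
    // Hypothetical helper: asks for an uncompressed body, then decodes
    // defensively in case the server compresses the response anyway.
    static InputStream open(String url) throws IOException {
        URLConnection connection = URI.create(url).toURL().openConnection();
        connection.setRequestProperty("Accept-Encoding", "identity");
        return decode(connection.getInputStream(), connection.getContentEncoding());
    }

    // Wraps the body in a GZIPInputStream only when Content-Encoding says gzip.
    static InputStream decode(InputStream body, String contentEncoding) throws IOException {
        if (contentEncoding != null && contentEncoding.trim().equalsIgnoreCase("gzip")) {
            return new GZIPInputStream(body);
        }
        return body;
    }

    public static void main(String[] args) throws IOException {
        // Offline demonstration: gzip a small payload, then decode it again.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write("pubblico".getBytes(StandardCharsets.UTF_8));
        }
        InputStream in = decode(new ByteArrayInputStream(buffer.toByteArray()), "gzip");
        System.out.println(new String(in.readAllBytes(), StandardCharsets.UTF_8));
    }
}
```

The stream returned by `open` still contains raw bytes, so it should be read with an explicit charset as discussed in the other answers.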
The encoding doesn't change. Why?
So it works!!!