Download a web page without character replacement
I'm trying to download a web page in Java with the following:
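// needs: import java.io.*; import java.net.URL;
URL url = new URL("http://www.jksfljasdlfas.com"); // placeholder host; java.net.URL requires a scheme
File to = new File("/home/test/test.html");
Reader in = new InputStreamReader(url.openStream(), "UTF-8");
Writer out = new OutputStreamWriter(new FileOutputStream(to), "UTF-8");
int c;
// copy the page character by character
while ((c = in.read()) != -1) {
    out.write(c);
}
in.close();
out.close();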
When I download the page, some characters are replaced by entities.
This:
<a href="http://www.generation276.org/film/?m=200812&paged=2" >Pagina successiva »</a>
becomes this:
<a href="http://www.generation276.org/film/?m=200812&#038;paged=2" >Pagina successiva »</a>
When I download the same page with Chrome, the & remains &.
I'm new to charsets and encodings; can anybody explain the problem?
Comments (2)
The Java part is working perfectly fine.
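As a rough check of that claim, here is a minimal sketch that runs the same character-by-character loop over in-memory streams. StringReader and StringWriter stand in for the URL and the file, and the class name CopyDemo is made up for this illustration. A Reader/Writer copy only moves characters along; it neither adds nor decodes character references.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

public class CopyDemo {
    // Same loop as in the question; in-memory streams replace the
    // network and file streams (an assumption made for this demo).
    static String copy(String source) {
        try (Reader in = new StringReader(source);
             Writer out = new StringWriter()) {
            int c;
            while ((c = in.read()) != -1) {
                out.write(c);
            }
            return out.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String html = "<a href=\"?m=200812&#038;paged=2\">Pagina successiva</a>";
        // The copy is identical to the input, character reference included.
        System.out.println(copy(html).equals(html)); // prints "true"
    }
}
```

So whatever character references end up in the saved file were already in the bytes the server sent.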
Chrome is tricking you there. In Firefox, when I select View -> Page Source, I see this:
while with Firebug / Inspect Element I see this:
and it copies to the clipboard as this:
Browsers don't always show you what's really there.
The second part of your question is identical to this previous question:
And hence the answer is also the same:
Use StringEscapeUtils.unescapeHtml(String) from the Apache Commons Lang project (the method is named unescapeHtml4 in Commons Lang 3 and Commons Text).
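To illustrate what such unescaping does for the case in this question, here is a hand-rolled sketch using only the JDK. It is not the Commons implementation: the class and method names are made up for this example, and it handles only &amp; and numeric character references, leaving other named entities such as &raquo; untouched.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AmpersandUnescape {
    // Matches &amp; or a numeric reference such as &#38;, &#038; or &#x26;
    private static final Pattern REF = Pattern.compile(
            "&(?:(amp)|#(\\d{1,7})|#[xX]([0-9a-fA-F]{1,6}));");

    static String unescape(String s) {
        Matcher m = REF.matcher(s);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(s, last, m.start());
            if (m.group(1) != null) {
                out.append('&');
            } else {
                int cp = m.group(2) != null
                        ? Integer.parseInt(m.group(2))
                        : Integer.parseInt(m.group(3), 16);
                if (Character.isValidCodePoint(cp)) {
                    out.appendCodePoint(cp);
                } else {
                    out.append(m.group()); // leave invalid references as they are
                }
            }
            last = m.end();
        }
        return out.append(s, last, s.length()).toString();
    }

    public static void main(String[] args) {
        System.out.println(unescape("http://www.generation276.org/film/?m=200812&#038;paged=2"));
        // prints http://www.generation276.org/film/?m=200812&paged=2
    }
}
```

For real pages the library method is the better choice, since it knows the full table of named entities.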
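The actual source of that page does say:
<a href="http://www.generation276.org/film/?m=200812&#038;paged=2" >
and this is perfectly fine. &#038; is a valid character reference for a literal ampersand character in HTML, although the entity reference &amp; is generally more common.
<a href="http://www.generation276.org/film/?m=200812&paged=2" >
This, with a bare & in the URL, is invalid HTML.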
When you save 'HTML only', Chrome saves the original HTML source without change. When you save 'Complete', it has to rewrite the page to change references to other resources.
Unfortunately, the serialisation process involved appears to have a bug: it fails to &amp;-escape the ampersands in the URL. Whilst browsers typically let you get away with this, it will break (mangling your URL) if the word to the right of the ampersand happens to make a valid HTML entity name or character reference.
Other places where Chrome serialises attribute values, such as innerHTML, do not suffer from this rather poor bug.
ETA:
If you try to scrape information out of the source using regex, you will have to decode it manually using an HTML decoder. There isn't one built into Java, so you would need a third-party tool such as the one from Apache Commons linked by seanizer.
However, scraping with regex is crude and unreliable. I would strongly suggest using an HTML parser to load the file and pick out the data you want. It will deal with decoding attribute values and text content.
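Going back to the serialisation bug for a moment: the fix on the writing side is simply to escape ampersands whenever a URL is written into an attribute value. A minimal sketch follows, with a made-up helper name and a made-up example.com URL whose parameter is called copy, the kind of name that collides with an entity name.

```java
public class AttributeEscape {
    // Escapes the characters that are unsafe inside a double-quoted
    // HTML attribute value. The helper name is hypothetical.
    static String escapeAttribute(String value) {
        StringBuilder out = new StringBuilder(value.length() + 16);
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '"': out.append("&quot;"); break;
                case '<': out.append("&lt;"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String url = "http://example.com/?a=1&copy=2";
        System.out.println("<a href=\"" + escapeAttribute(url) + "\">next</a>");
        // prints <a href="http://example.com/?a=1&amp;copy=2">next</a>
    }
}
```

A parser reading that attribute back decodes &amp; to a plain &, so the URL survives the round trip regardless of what follows the ampersand.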