Encoding problem when reading a website: three different encodings
I have a problem with a WebRequest in C#. It's a Google page.
The HTTP response header states
text/html; charset=ISO-8859-1
The page itself states
<meta http-equiv=content-type content="text/html; charset=utf-8">
Finally, I only get the expected result, both in the debugger and in my regular expressions, when I use Encoding.Default, which resolves to System.Text.SBCSCodePageEncoding.
Now what do I do? Do you have any hints as to how this could happen or how I could solve it?
The actual encoding of the page seems to be UTF-8. At least Firefox displays it correctly as UTF-8, but not as Windows-Whatever and not as Latin1.
The URL is this
The problem is the € sign as well as all German umlauts.
Thanks in advance for your help on this problem which is making me seriously crazy!
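To illustrate why the € sign and the umlauts are the characters that break, here is a minimal sketch, assuming the server really sends UTF-8 bytes while the HTTP header claims ISO-8859-1. The sample text is hypothetical and not taken from the page in question; it only shows how the same bytes decode under the two charsets.

```csharp
using System;
using System.Text;

// Hypothetical sample text with a euro sign and two umlauts
// (written with \u escapes so the source file's own encoding does not matter).
string original = "Preis: 5 \u20AC f\u00FCr \u00C4pfel";

// The bytes a server would send if the page is really UTF-8 (assumption).
byte[] utf8Bytes = Encoding.UTF8.GetBytes(original);

// Decoding with the charset named in the HTTP header: every byte becomes one char,
// so each multi-byte UTF-8 sequence turns into two or three wrong characters.
string asLatin1 = Encoding.GetEncoding("ISO-8859-1").GetString(utf8Bytes);

// Decoding with the charset named in the meta tag.
string asUtf8 = Encoding.UTF8.GetString(utf8Bytes);

Console.WriteLine(asUtf8 == original);               // True
Console.WriteLine(asLatin1 == original);             // False
Console.WriteLine(asLatin1.Length - asUtf8.Length);  // 4
```

The four extra characters come from the euro sign (three bytes in UTF-8) and the two umlauts (two bytes each). Plain ASCII text decodes identically under both charsets, which would explain why only these characters look wrong.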
Update: when I output the string via
// create a writer and open the file
TextWriter tw = new StreamWriter("test.txt");
// write a line of text to the file
tw.WriteLine(html);
// close the stream
tw.Close();
everything works fine.
So it seems the problem is that the debugger does not show the correct encoding, and neither does the regular expression.
How do I tell C# to handle the RegEx as UTF-8?
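As far as I know, Regex has no encoding setting: it matches against .NET strings, which are already decoded text. So the encoding has to be chosen at the point where the response bytes are turned into a string. The following is only a sketch under that assumption. A MemoryStream with made-up HTML stands in for the stream returned by WebResponse.GetResponseStream(), and the patterns are hypothetical.

```csharp
using System;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

// Stand-in for the response body (hypothetical content, assumed to be UTF-8 bytes).
// In the real program the stream would come from WebResponse.GetResponseStream().
byte[] body = Encoding.UTF8.GetBytes("<span>12,50 \u20AC</span><span>M\u00FCnchen</span>");

string html;
using (var stream = new MemoryStream(body))
using (var reader = new StreamReader(stream, Encoding.UTF8)) // decode explicitly as UTF-8
{
    html = reader.ReadToEnd();
}

// The regex runs on the decoded string; \u20AC and \u00FC are regex escapes
// for the euro sign and the letter u with umlaut.
Match price = Regex.Match(html, @"(\d+,\d+) \u20AC");
Match city = Regex.Match(html, @"M\u00FCnchen");

Console.WriteLine(price.Success);          // True
Console.WriteLine(price.Groups[1].Value);  // 12,50
Console.WriteLine(city.Success);           // True
```

If the string had been decoded with the wrong charset instead, the same patterns would fail to match, because the euro sign and the umlaut would no longer be present in the string.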
Comments (1)
Rather than parsing HTML, why not use the Google Query API?
BTW, before parsing HTML using regexes, read this ;-)
EDIT: In answer to your comment:
as well.