Encoding problem when reading a website: three different encodings

Posted 2024-10-15 09:31:05 · 972 characters · 3 views · 0 comments

I have a problem with a WebRequest in C#. The page is a Google page.

The header states

text/html; charset=ISO-8859-1

The page itself states

<meta http-equiv=content-type content="text/html; charset=utf-8">

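And finally, I only get the expected result, both in the debugger and in regular expressions, when I use Encoding.Default, which resolves to System.Text.SBCSCodePageEncoding.

What do I do now? Do you have any hints on how this could happen, or how I could solve it?

The actual encoding of the page seems to be UTF-8. At least Firefox displays it correctly as UTF-8, but not as Windows-Whatever and not as Latin1.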
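When the header and the meta tag disagree, one option is to decode the raw response bytes yourself instead of trusting either declaration. The following is only a minimal sketch under stated assumptions: the class name BodyDecoder and the method DecodeBody are hypothetical, the body is assumed to have been read into a byte array already, and "valid UTF-8 means UTF-8" is a heuristic rather than a guarantee.

```csharp
using System;
using System.Text;

public static class BodyDecoder
{
    // Hypothetical helper: tries strict UTF-8 first and falls back to the
    // charset announced in the Content-Type header only if the bytes are
    // not valid UTF-8.
    public static string DecodeBody(byte[] body, string headerCharset)
    {
        try
        {
            // throwOnInvalidBytes: true makes invalid sequences raise an
            // exception instead of silently becoming replacement characters.
            var strictUtf8 = new UTF8Encoding(
                encoderShouldEmitUTF8Identifier: false,
                throwOnInvalidBytes: true);
            return strictUtf8.GetString(body);
        }
        catch (DecoderFallbackException)
        {
            // Not valid UTF-8, so use what the server claimed.
            return Encoding.GetEncoding(headerCharset).GetString(body);
        }
    }
}
```

Called as BodyDecoder.DecodeBody(bytes, "ISO-8859-1"), this would return the UTF-8 reading for a page like the one described here, because text containing € and umlauts that was encoded as UTF-8 decodes without errors. Text that consists only of ASCII characters decodes identically either way, so the heuristic cannot go wrong there.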

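The URL is this

The problem is the € sign as well as all German umlauts.

Thanks in advance for your help with this problem, which is driving me seriously crazy!

Update: when I output the string via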

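// create a writer and open the file; StreamWriter writes UTF-8
// unless another encoding is passed to the constructor
using (TextWriter tw = new StreamWriter("test.txt"))
{
    // write a line of text to the file
    tw.WriteLine(html);
}   // the using block closes the stream, even if writing throws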

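it all works fine.

So the problem seems to be that the debugger does not show the correct encoding, and neither does the regular expression.

How do I tell C# to handle the RegEx as UTF-8?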
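For context on that last question: .NET strings are sequences of UTF-16 code units, and Regex works on strings, so there is no encoding to set on the regular expression itself. What matters is which encoding turned the bytes into the string. The sketch below illustrates this; the class name RegexOnDecodedText, the method CountSpecialChars and the sample sentence are made up for the illustration.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class RegexOnDecodedText
{
    // Counts euro signs and German umlauts in an already decoded string.
    // \u20AC is the euro sign; the remaining escapes are ä ö ü Ä Ö Ü.
    public static int CountSpecialChars(string text)
    {
        return Regex.Matches(
            text, "[\u20AC\u00E4\u00F6\u00FC\u00C4\u00D6\u00DC]").Count;
    }

    // Decodes the same UTF-8 bytes twice and counts the matches each time.
    public static void Compare()
    {
        // "Preis: 5 € für Müller" as it would arrive over the wire in UTF-8.
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(
            "Preis: 5 \u20AC f\u00FCr M\u00FCller");

        string right = Encoding.UTF8.GetString(utf8Bytes);
        string wrong = Encoding.GetEncoding("ISO-8859-1").GetString(utf8Bytes);

        Console.WriteLine(CountSpecialChars(right)); // prints 3
        Console.WriteLine(CountSpecialChars(wrong)); // prints 0
    }
}
```

The second count is zero because decoding UTF-8 bytes as ISO-8859-1 turns each multi-byte character into two or three unrelated characters, so the pattern never sees a real € or umlaut. Once the string has been decoded correctly, the same pattern matches without any further settings.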



初与友歌 2024-10-22 09:31:05

Why not use the Google Query API instead of parsing the HTML?

By the way, read this before parsing HTML with regular expressions ;-)

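Edit: in answer to your comment:

The API works for Google Desktop as well.
Is this encoding problem specific to Google pages?
Apart from the problem you have now, who knows what problems you will run into later in production because of subtle changes in the HTML of those pages or in the headers the web server sends back. Web pages are meant to be friendly to the human eye, not to computers. The only thing you can expect to stay friendly is the look of the page and the content it presents, not the underlying HTML structure. An API, by contrast, is meant to be computer friendly.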