Encoding problem when reading a website: three different encodings

Posted on 2024-10-15 09:31:05

I have a problem with a WebRequest in C#. The page I am requesting is a Google page.

The header states

text/html; charset=ISO-8859-1

The website states

<meta http-equiv=content-type content="text/html; charset=utf-8">

Finally, I only get the expected result, both in the debugger and in my regular expression, when I use Encoding.Default, which resolves to System.Text.SBCSCodePageEncoding.

Now what do I do? Do you have any hints as to how this could happen, or how I could solve this problem?

The actual encoding of the page seems to be UTF-8. At least Firefox displays it correctly as UTF-8, but not as Windows-Whatever and not as Latin1.
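When the HTTP header and the page's own meta tag disagree like this, one option is to read the response as raw bytes and choose the encoding explicitly instead of trusting the header. The following is only a minimal sketch under assumptions: `CharsetSniffer`, `SniffMetaCharset` and `Decode` are hypothetical names, the meta tag is assumed to appear in the first 1024 bytes, and preferring the meta tag over the header is a choice made for this page, not a general rule.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of .NET: picks an encoding for a raw HTML body.
static class CharsetSniffer
{
    // Returns the charset named in the first 1024 bytes (e.g. in a <meta> tag), or null.
    public static string SniffMetaCharset(byte[] body)
    {
        // The tag itself is ASCII, so an ASCII decode of the prefix is enough;
        // any non-ASCII bytes simply turn into '?' and are ignored here.
        int n = Math.Min(body.Length, 1024);
        string head = Encoding.ASCII.GetString(body, 0, n);
        Match m = Regex.Match(head, @"charset\s*=\s*[""']?([A-Za-z0-9_\-]+)",
                              RegexOptions.IgnoreCase);
        return m.Success ? m.Groups[1].Value : null;
    }

    // Decodes the body using the meta charset if present, then the header
    // charset, then UTF-8 as a last resort (an assumption of this sketch).
    public static string Decode(byte[] body, string headerCharset)
    {
        string name = SniffMetaCharset(body) ?? headerCharset ?? "utf-8";
        Encoding enc;
        try
        {
            enc = Encoding.GetEncoding(name);
        }
        catch (ArgumentException)
        {
            // Unknown or unsupported charset name: fall back to UTF-8.
            enc = Encoding.UTF8;
        }
        return enc.GetString(body);
    }
}
```

The raw bytes could come from something like `WebClient.DownloadData`, and the header value from `HttpWebResponse.CharacterSet`; for the page described here, `Decode(body, "ISO-8859-1")` would pick UTF-8 because of the meta tag.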

The URL is this

The problem is the € sign as well as all German umlauts.

Thanks in advance for your help with this problem, which is driving me seriously crazy!

Update: when I output the string via

// create a writer and open the file; "using" closes it even if an exception is thrown
using (TextWriter tw = new StreamWriter("test.txt"))
{
    // write a line of text to the file
    tw.WriteLine(html);
}

everything works fine.

So it seems the problem is that the debugger does not show the correct encoding, and neither does the regular expression.

How do I tell C# to handle the RegEx as UTF-8?
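For context, a .NET `Regex` has no encoding setting: it matches against `string` values, which are already decoded text, so the encoding choice happens earlier, when the response bytes are turned into a string. The sketch below illustrates that under assumptions: `RegexEncodingDemo`, `DecodeAs`, `MatchesPrice` and the sample text are made up for illustration and are not taken from the original code.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical demo class: shows that the decode step, not the regex,
// decides whether "€" can be matched.
static class RegexEncodingDemo
{
    // Decodes raw bytes with the named charset, e.g. "utf-8" or "ISO-8859-1".
    public static string DecodeAs(byte[] raw, string charset)
    {
        return Encoding.GetEncoding(charset).GetString(raw);
    }

    // True if the text contains a number followed by a euro sign.
    // \u20AC is the euro sign, escaped so the source file's own encoding does not matter.
    public static bool MatchesPrice(string html)
    {
        return Regex.IsMatch(html, @"\d+\s*\u20AC");
    }

    // Sample bytes as a UTF-8 server would send them ("Preis: 20 € für Äpfel").
    public static byte[] SampleBytes()
    {
        return Encoding.UTF8.GetBytes("Preis: 20 \u20AC f\u00FCr \u00C4pfel");
    }
}
```

With these helpers, `MatchesPrice(DecodeAs(SampleBytes(), "utf-8"))` is true, while `MatchesPrice(DecodeAs(SampleBytes(), "ISO-8859-1"))` is false, because the three UTF-8 bytes of the euro sign are decoded as three separate Latin1 characters. The same pattern is used in both cases; only the decoding differs.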

Comments (1)

初与友歌 2024-10-22 09:31:05

Rather than parsing HTML, why not use the Google Query API?

BTW, before parsing HTML using regexes, read this ;-)

EDIT: In answer to your comment:

  1. The API works for Google Desktop as well.
  2. Is this encoding issue specific to the Google page?
  3. In addition to the problem you have now, who knows what problems you'll run into later, in production, due to subtle changes in the HTML of these pages or in the headers sent back by the web server. A web page is supposed to be friendly to the human eye, not to computers. The only thing you can expect to stay friendly is the appearance and rendered content of the page, not the underlying HTML structure. An API, by contrast, is supposed to be computer-friendly.