How do I convert HTML entities in UTF-8 pages to plain text?
I have a downloader program that downloads pages from the internet.
The encoding of each page is different: some are in UTF-8 and some are in Unicode.
For example: a
is displayed as the character 'a', and the pages are full of such sequences. We need to convert these encodings to normal text.
I used the UnicodeEncoding class in C#, but it did not help.
How can I decode these encodings to real characters? Is there a class or method that converts them?
Thanks.
Comments (3)
That is HTML-encoded; try HtmlDecode (you'll need a reference to System.Web.dll).
Text in HTML pages that starts with & and ends with ; is HTML-encoded.
You can decode these by using:
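A minimal sketch of such a decode call, assuming `System.Net.WebUtility.HtmlDecode` (available without a System.Web.dll reference; `HttpUtility.HtmlDecode` from System.Web behaves similarly). The class and method names `EntityDecoder` and `Decode` are hypothetical, chosen only for illustration:

```csharp
using System;
using System.Net;

public static class EntityDecoder
{
    // Turns HTML character references (named, decimal, or hexadecimal)
    // back into the characters they stand for.
    // EntityDecoder and Decode are illustrative names, not a library API.
    public static string Decode(string htmlEncoded)
    {
        return WebUtility.HtmlDecode(htmlEncoded);
    }
}
```

For example, `EntityDecoder.Decode("&#97;&#98;&#99; &amp; &#x4E2D;")` returns `abc & 中`: `&#97;` is the decimal reference for 'a', `&amp;` is the named reference for '&', and `&#x4E2D;` is a hexadecimal reference.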
Also see "Characters in string changed after downloading HTML from the internet" for code on how to make sure you download the page in the correct character set.
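Character set handling is a separate step from entity decoding: the downloaded bytes first have to be turned into a string with the right encoding. The following is only a sketch of that step, not the code from the linked question. It assumes the caller has already read the response body as bytes and extracted the charset name from the Content-Type header; `PageTextDecoder`, `DecodeBytes`, and the fall-back to UTF-8 are illustrative choices.

```csharp
using System;
using System.Text;

public static class PageTextDecoder
{
    // Decodes raw response bytes using a charset name such as "utf-8" or "utf-16".
    // Assumption: the name comes from the Content-Type header; when it is missing
    // or not recognised by the runtime, this sketch falls back to UTF-8.
    public static string DecodeBytes(byte[] body, string charset)
    {
        Encoding encoding;
        try
        {
            encoding = string.IsNullOrWhiteSpace(charset)
                ? Encoding.UTF8
                : Encoding.GetEncoding(charset.Trim('"', ' '));
        }
        catch (ArgumentException)
        {
            // Encoding.GetEncoding throws ArgumentException for unknown names.
            encoding = Encoding.UTF8;
        }
        return encoding.GetString(body);
    }
}
```

On .NET Core and later, legacy code-page encodings are likely to need the code pages encoding provider registered before `Encoding.GetEncoding` can find them; on .NET Framework they are built in.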
You're confusing HTML/XML escaping with UTF-8/Unicode encoding.
If the page is valid XML, life will be easier - you can just parse it as any other XML document, and then just get the relevant text nodes... all the XML escaping will be "unescaped" when you get the text.
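As a rough illustration of the XML case, here is a sketch using LINQ to XML. It assumes the page really is well-formed XML; `XmlTextExtractor` and `GetTextNodes` are hypothetical names.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class XmlTextExtractor
{
    // Parses the markup as XML and returns the value of every text node.
    // The parser has already unescaped character references by the time
    // the text is read, so no separate decode step is needed.
    public static string[] GetTextNodes(string xml)
    {
        return XDocument.Parse(xml)
            .DescendantNodes()
            .OfType<XText>()
            .Select(t => t.Value)
            .ToArray();
    }
}
```

For the input `<p>1 &lt; 2 &amp;&#97;</p>`, the text content comes back as `1 < 2 &a`. Note that without a DTD an XML parser only knows the five predefined entities and numeric references, so HTML-only names such as `&nbsp;` will make parsing fail.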
If it's arbitrary - and possibly invalid - HTML then life is a bit harder. You may well want to normalize it into valid HTML first, then parse it and again ask for the text nodes.
If you can give us a more concrete example, it will be easier to advise you.
The HtmlDecode method suggested in other answers may very well be all you need, but you should definitely try to understand what's going on first. For example, you may well want to decode only certain fragments of the HTML: if you decode the whole document, you could end up with text that looks like it contains HTML tags but was actually just text in the original document.
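To make the whole-document pitfall concrete, here is a small sketch, again assuming `System.Net.WebUtility.HtmlDecode`. The sample markup and the names `DecodeScopeDemo`, `WholeDocument`, and `TextOnly` are made up for illustration, and the text fragment is assumed to have been extracted by a parser beforehand.

```csharp
using System;
using System.Net;

public static class DecodeScopeDemo
{
    // Illustrative page source: the author wrote the literal text "<b>",
    // escaped so that the browser does not treat it as a tag.
    public const string Source = "<p>Use &lt;b&gt; for bold</p>";

    // Decoding everything mixes real markup with text that merely looks like markup.
    public static string WholeDocument()
    {
        return WebUtility.HtmlDecode(Source);
    }

    // Decoding only the text of the paragraph keeps the distinction:
    // the result is plain text and is never re-parsed as HTML.
    public static string TextOnly()
    {
        string fragment = "Use &lt;b&gt; for bold"; // assumed to come from a parser
        return WebUtility.HtmlDecode(fragment);
    }
}
```

`WholeDocument()` yields `<p>Use <b> for bold</p>`, in which `<b>` can no longer be told apart from a real tag, while `TextOnly()` yields the plain text `Use <b> for bold`.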