C# HtmlEncode - ISO-8859-1 entity names vs. entity numbers
According to the following table for the ISO-8859-1 standard, there seems to be an entity name and an entity number associated with each reserved HTML character.
So for example, for the character é:
Entity Name : &eacute;
Entity Number : &#233;
Similarly, for the character >:
Entity Name : &gt;
Entity Number : &#62;
For a given string, HttpUtility.HtmlEncode returns an HTML-encoded string, but I can't figure out how it works. Here is what I mean:
Console.WriteLine(HtmlEncode("é>"));
//Outputs &#233;&gt;
It seems to be using the entity number for the é character but the entity name for the > character.
So does the HtmlEncode method really work with the ISO-8859-1 standard? If it does, is there a reason why it sometimes uses the entity name and other times the entity number? More importantly, can I force it to give me the entity name reliably?
EDIT :
Thanks for the answers guys. I cannot decode the string before I perform the search though. Without getting into too many details, the text is stored in a SharePoint List and the "search" is done by SharePoint itself (using a CAML query). So basically, I can't.
I'm trying to think of a way to convert the entity numbers into names. Is there a function in .NET that does that? Or any other idea?
Comments (5)
That's how the method has been implemented. For a few known characters it uses the corresponding named entity, and for everything else that needs encoding it uses the corresponding numeric character reference (decimal, as in &#233;). There is not much you can do to modify this behavior, as can be seen in the implementation of System.Net.WebUtility.HtmlEncode (for example with Reflector).
That being said, you shouldn't care, as this method will always produce valid, safe and correctly encoded HTML.
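A quick way to check the behavior described in this answer is to encode the two characters from the question separately. This is only a minimal sketch: the class name HtmlEncodeDemo is made up for illustration, and it uses System.Net.WebUtility rather than System.Web.HttpUtility so that it does not need a reference to System.Web.

```csharp
using System;
using System.Net;

// Hypothetical demo class, not part of the original answer.
public static class HtmlEncodeDemo
{
    public static void Main()
    {
        // '>' is one of the few characters written as a named entity.
        Console.WriteLine(WebUtility.HtmlEncode(">"));        // &gt;

        // U+00E9 (e with acute accent) is written as a decimal numeric reference.
        Console.WriteLine(WebUtility.HtmlEncode("\u00E9"));   // &#233;

        // Both spellings decode back to the same character.
        Console.WriteLine(
            WebUtility.HtmlDecode("&eacute;") == WebUtility.HtmlDecode("&#233;")); // True
    }
}
```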
HtmlEncode is following the spec. The standard specifies both a name and a number for each of these entities, and the name and the number are equivalent. Therefore, a conforming implementation of HtmlEncode is free to encode all characters as numbers, or all as names, or some mixture of the two.
I suggest that you approach your problem from the other direction: call HtmlDecode on the target text, then search through the decoded text using the raw string.
ISO-8859-1 is not really relevant to HTML character encoding. From Wikipedia:
Only for undefined Unicode code points ISO-8859-1 is often used:
Now to answer your question: For search to work best, you should really search the unencoded HTML (stripping the HTML tags first) using an unencoded search string.
Matching encoded strings will lead to unexpected results, like hits based on HTML tags or comments, and hits missing because of differences in the HTML that are invisible in the text.
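Where decoding is possible, the decode-then-search approach suggested in the answers could look roughly like the sketch below. The class DecodedSearch and the helper ContainsDecoded are hypothetical names, and the sketch assumes the stored text contains no markup, so it does not strip HTML tags.

```csharp
using System;
using System.Net;

// Hypothetical helper, shown only to illustrate the decode-then-search idea.
public static class DecodedSearch
{
    // Decodes the stored text, then looks for the raw (unencoded) search term.
    public static bool ContainsDecoded(string encodedText, string rawTerm)
    {
        string decoded = WebUtility.HtmlDecode(encodedText);
        return decoded.IndexOf(rawTerm, StringComparison.Ordinal) >= 0;
    }

    public static void Main()
    {
        // The named and the numeric spelling both match the same raw term.
        Console.WriteLine(ContainsDecoded("caf&eacute;&gt;bar", "\u00E9>")); // True
        Console.WriteLine(ContainsDecoded("caf&#233;&gt;bar", "\u00E9>"));   // True
    }
}
```

Because the comparison happens on decoded text, it no longer matters which spelling the encoder chose.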
I made this function; I think it will help:
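As a minimal sketch of what such a function might look like (this is not the poster's original code), one option is to let WebUtility.HtmlEncode do the encoding and then rewrite the numeric references it produces into entity names. The class NamedEntityEncoder and its lookup table are assumptions made for illustration; the table holds only a handful of the HTML 4 Latin-1 names and would need to be extended for real use.

```csharp
using System;
using System.Collections.Generic;
using System.Net;

// Hypothetical helper: rewrites numeric references into named entities.
public static class NamedEntityEncoder
{
    // Illustrative subset of the HTML 4 Latin-1 entity names, keyed by code point.
    private static readonly Dictionary<int, string> Names = new Dictionary<int, string>
    {
        { 160, "nbsp" },   { 169, "copy" },   { 224, "agrave" }, { 231, "ccedil" },
        { 232, "egrave" }, { 233, "eacute" }, { 241, "ntilde" }, { 252, "uuml" }
    };

    public static string Encode(string input)
    {
        if (string.IsNullOrEmpty(input)) return input;

        // Encode first, so any literal '&' in the input is already "&amp;"
        // and cannot be mistaken for a numeric reference below.
        string encoded = WebUtility.HtmlEncode(input);

        foreach (KeyValuePair<int, string> pair in Names)
        {
            encoded = encoded.Replace("&#" + pair.Key + ";", "&" + pair.Value + ";");
        }
        return encoded;
    }

    public static void Main()
    {
        Console.WriteLine(Encode("\u00E9>")); // &eacute;&gt;
    }
}
```

Characters that are missing from the table simply keep the numeric reference that HtmlEncode produced.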
I developed the following code to keep a-z, A-Z and 0-9 unencoded and encode everything else:
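A minimal sketch of that idea (not the poster's original code) might look like this. It assumes that digits 0-9 are meant to stay unencoded and that every other character should become a decimal numeric reference; the class name AlphanumericPreservingEncoder is made up for illustration.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical encoder: ASCII letters and digits pass through, the rest is encoded.
public static class AlphanumericPreservingEncoder
{
    public static string Encode(string input)
    {
        if (string.IsNullOrEmpty(input)) return input;

        var sb = new StringBuilder(input.Length * 2);
        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];
            bool keep = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9');
            if (keep)
            {
                sb.Append(c);
                continue;
            }

            // Combine a surrogate pair into one code point so it yields one reference.
            // A lone surrogate is not handled specially in this sketch.
            int codePoint = c;
            if (char.IsHighSurrogate(c) && i + 1 < input.Length && char.IsLowSurrogate(input[i + 1]))
            {
                codePoint = char.ConvertToUtf32(c, input[i + 1]);
                i++;
            }

            sb.Append("&#")
              .Append(codePoint.ToString(CultureInfo.InvariantCulture))
              .Append(';');
        }
        return sb.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(Encode("a b"));      // a&#32;b
        Console.WriteLine(Encode("\u00E9>"));  // &#233;&#62;
    }
}
```

Note that this produces entity numbers only, so it gives consistent output but not the entity names asked about in the question.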