
C# HtmlEncode - ISO-8859-1 Entity Names vs Numbers

Posted on 2024-10-15 01:47:50


According to the following table for the ISO-8859-1 standard, there seems to be an entity name and an entity number associated with each reserved HTML character.

So for example, for the character é :

Entity Name : &eacute;

Entity Number : &#233;

Similarly, for the character > :

Entity Name : &gt;

Entity Number : &#62;

For a given string, HttpUtility.HtmlEncode returns an HTML-encoded string, but I can't figure out how it works. Here is what I mean:

Console.WriteLine(HtmlEncode("é>"));
//Outputs &#233;&gt;

It seems to be using the entity number for the é character but the entity name for the > character.

So does the HtmlEncode method really work with the ISO-8859-1 standard? If it does, is there a reason why it sometimes uses the entity name and other times the entity number? More importantly, can I force it to give me the entity name reliably?

EDIT :
Thanks for the answers guys. I cannot decode the string before I perform the search though. Without getting into too many details, the text is stored in a SharePoint List and the "search" is done by SharePoint itself (using a CAML query). So basically, I can't.

I'm trying to think of a way to convert the entity numbers into names. Is there a function in .NET that does that? Or any other idea?


无风消散 2024-10-22 01:47:50

That's how the method has been implemented. For a few known characters it writes the corresponding named entity (the apostrophe is written as a number), and for characters from U+00A0 through U+00FF it writes the decimal character code, not hex. There is not much you can do to modify this behavior. Excerpt from the implementation of System.Net.WebUtility.HtmlEncode (as seen with Reflector):

...
if (ch <= '>')
{
    switch (ch)
    {
        case '&':
        {
            output.Write("&amp;");
            continue;
        }
        case '\'':
        {
            output.Write("&#39;");
            continue;
        }
        case '"':
        {
            output.Write("&quot;");
            continue;
        }
        case '<':
        {
            output.Write("&lt;");
            continue;
        }
        case '>':
        {
            output.Write("&gt;");
            continue;
        }
    }
    output.Write(ch);
    continue;
}
// U+00A0..U+00FF: written as a decimal numeric character reference
if ((ch >= '\x00a0') && (ch < '\u0100'))
{
    output.Write("&#");
    output.Write(((int) ch).ToString(NumberFormatInfo.InvariantInfo));
    output.Write(';');
}
...

That said, you shouldn't need to care, as this method will always produce valid, safe and correctly encoded HTML.
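If named entities are really required, one option is to post-process the encoder's output. The following is only a minimal sketch under stated assumptions: the class name NamedEntityEncoder and its lookup table are hypothetical and deliberately partial (a complete version would need every entity name you care about), and it relies on WebUtility.HtmlEncode emitting decimal references of the form &#NNN; as described above.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

public static class NamedEntityEncoder
{
    // Hypothetical, partial table: Unicode code point -> HTML entity name.
    private static readonly Dictionary<int, string> Names = new Dictionary<int, string>
    {
        { 160, "nbsp" }, { 169, "copy" }, { 174, "reg" },
        { 224, "agrave" }, { 231, "ccedil" }, { 232, "egrave" },
        { 233, "eacute" }, { 241, "ntilde" }, { 246, "ouml" }, { 252, "uuml" },
    };

    // Matches decimal numeric character references such as &#233;
    private static readonly Regex DecimalReference = new Regex("&#([0-9]+);");

    public static string Encode(string text)
    {
        // Let the framework do the actual escaping first.
        string encoded = WebUtility.HtmlEncode(text) ?? string.Empty;

        // Swap a numeric reference for its name only when the table knows it;
        // anything else (for example &#39;) is left exactly as produced.
        return DecimalReference.Replace(encoded, match =>
            int.TryParse(match.Groups[1].Value, out int codePoint)
            && Names.TryGetValue(codePoint, out string name)
                ? "&" + name + ";"
                : match.Value);
    }
}
```

With this sketch, NamedEntityEncoder.Encode("é>") should give &eacute;&gt;. A literal "&#233;" typed into the input is not affected, because HtmlEncode has already escaped its ampersand before the replacement runs.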


时间海 2024-10-22 01:47:50

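HtmlEncode is following the spec. The ISO standard specifies both a name and a number for every entity, and the name and the number are equivalent. Therefore, a conforming implementation of HtmlEncode is free to encode all points as numbers, or all as names, or some mixture of the two.

I suggest that you approach your problem from the other direction: call HtmlDecode on the target text, then search through the decoded text using the raw string.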
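The decode-then-search idea can be sketched as follows. This is an illustration only: the class name DecodedSearch and its Contains helper are made up for the example, and it assumes the stored text can be read and decoded in your own code before matching.

```csharp
using System;
using System.Net;

public static class DecodedSearch
{
    // Hypothetical helper: decode the stored HTML-encoded text once,
    // then look for the raw (unencoded) query with an ordinal comparison.
    public static bool Contains(string encodedText, string rawQuery)
    {
        string decoded = WebUtility.HtmlDecode(encodedText) ?? string.Empty;
        return decoded.IndexOf(rawQuery, StringComparison.Ordinal) >= 0;
    }
}
```

Because HtmlDecode maps both &eacute; and &#233; to the same character, the search no longer depends on which form the encoder happened to choose.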


丑丑阿 2024-10-22 01:47:50

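ISO-8859-1 is not really relevant to HTML character encoding. From Wikipedia:

    Numeric references always refer to Unicode code points, regardless of the page's encoding.

Only for undefined Unicode code points is ISO-8859-1 often used:

    Using numeric references that refer to permanently undefined characters and control characters is forbidden, with the exception of the linefeed, tab, and carriage return characters. That is, characters in the hexadecimal ranges 00–08, 0B–0C, 0E–1F, 7F, and 80–9F cannot be used in an HTML document, not even by reference, so "&#153;", for example, is not allowed. However, for backward compatibility with early HTML authors and browsers that ignored this restriction, raw characters and numeric character references in the 80–9F range are interpreted by some browsers as representing the characters mapped to bytes 80–9F in the Windows-1252 encoding.

Now to answer your question: for search to work best, you should really be searching the unencoded HTML (stripping the HTML tags first) using an unencoded search string.
Matching encoded strings will lead to unexpected results, such as hits based on HTML tags or comments, and missed hits because of differences in the HTML that are invisible in the text.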
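To make the "numeric references are Unicode code points" statement concrete, here is a small sketch. The helper name NumericReference.FromFirstCodePoint is invented for this example; it only uses char.ConvertToUtf32 from the framework.

```csharp
using System;
using System.Globalization;

public static class NumericReference
{
    // Hypothetical helper: builds the decimal numeric character reference
    // for the first Unicode code point of the given text.
    public static string FromFirstCodePoint(string text)
    {
        int codePoint = char.ConvertToUtf32(text, 0);
        return "&#" + codePoint.ToString(CultureInfo.InvariantCulture) + ";";
    }
}
```

For the trademark sign ™ (U+2122) this gives &#8482;, which is the Unicode code point, whereas 153 (hex 99) is only its byte value in Windows-1252. For é the result is &#233;, which happens to coincide with its ISO-8859-1 byte because the first 256 Unicode code points match that encoding.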


一紙繁鸢 2024-10-22 01:47:50

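I made this function; I think it will help:

        // Requires: using System; using System.Text;
        string BasHtmlEncode(string x)
        {
           // Writes every UTF-16 code unit as a decimal numeric character reference.
           StringBuilder sb = new StringBuilder();
           foreach (char c in x)
               // Cast to int: Convert.ToInt16 throws OverflowException for values above 32767.
               sb.Append(String.Format("&#{0};", (int)c));
           return sb.ToString();
        }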


老娘不死你永远是小三 2024-10-22 01:47:50

I developed the following code, which leaves a-z, A-Z and 0-9 unencoded and encodes everything else:

public static string Encode(string source)
{
    if (string.IsNullOrEmpty(source)) return string.Empty;

    var sb = new StringBuilder(source.Length);
    foreach (char c in source)
    {
        if (c >= 'a' && c <= 'z')
        {
            sb.Append(c);
        }
        else if (c >= 'A' && c <= 'Z')
        {
            sb.Append(c);
        }
        else if (c >= '0' && c <= '9')
        {
            sb.Append(c);
        }
        else
        {
            sb.AppendFormat("&#{0};",Convert.ToInt32(c));
        }
    }

    return sb.ToString();
}
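One thing the loop above does not handle is characters outside the Basic Multilingual Plane: they are stored as two UTF-16 surrogates, and encoding each half separately yields references that are not valid in HTML. A possible variant, offered only as a sketch (the class name CodePointEncoder and the choice to replace unpaired surrogates with U+FFFD are assumptions of this example, not part of the answer):

```csharp
using System;
using System.Globalization;
using System.Text;

public static class CodePointEncoder
{
    public static string Encode(string source)
    {
        if (string.IsNullOrEmpty(source)) return string.Empty;

        var sb = new StringBuilder(source.Length);
        for (int i = 0; i < source.Length; i++)
        {
            char c = source[i];
            bool isAsciiAlphanumeric =
                (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9');
            if (isAsciiAlphanumeric)
            {
                sb.Append(c);
                continue;
            }

            int codePoint = c;
            if (char.IsHighSurrogate(c) && i + 1 < source.Length
                && char.IsLowSurrogate(source[i + 1]))
            {
                // Valid surrogate pair: combine both halves into one code point.
                codePoint = char.ConvertToUtf32(c, source[i + 1]);
                i++;
            }
            else if (char.IsSurrogate(c))
            {
                // Unpaired surrogate: substitute the replacement character (assumption).
                codePoint = 0xFFFD;
            }

            sb.Append("&#")
              .Append(codePoint.ToString(CultureInfo.InvariantCulture))
              .Append(';');
        }

        return sb.ToString();
    }
}
```

For BMP-only input the result is the same as the original Encode; only surrogate handling differs.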
