C# HtmlEncode - ISO-8859-1 Entity Names vs Numbers

Posted on 2024-10-15 01:47:50

According to the following table for the ISO-8859-1 standard, there seems to be an entity name and an entity number associated with each reserved HTML character.

So for example, for the character é :

Entity Name : &eacute;

Entity Number : &#233;

Similarly, for the character > :

Entity Name : &gt;

Entity Number : &#62;

For a given string, the HttpUtility.HtmlEncode returns an HTML encoded String, but I can't figure out how it works. Here is what I mean :

Console.WriteLine(HtmlEncode("é>"));
// Outputs &#233;&gt;

It seems to be using the entity number for the é character but the entity name for the > character.

So does the HtmlEncode method really work with the ISO-8859-1 standard? If it does, is there a reason why it sometimes uses the entity name and other times the entity number? More importantly, can I force it to give me the entity name reliably?

EDIT :
Thanks for the answers guys. I cannot decode the string before I perform the search though. Without getting into too many details, the text is stored in a SharePoint List and the "search" is done by SharePoint itself (using a CAML query). So basically, I can't.

I'm trying to think of a way to convert the entity numbers into names, is there a function in .NET that does that? Or any other idea?

Comments (5)

无风消散 2024-10-22 01:47:50

That's how the method has been implemented. For a few known characters it uses the corresponding named entity, for characters from U+00A0 up to U+00FF it emits the decimal numeric character reference, and there is not much you can do to modify this behavior. Excerpt from the implementation of System.Net.WebUtility.HtmlEncode (as seen with Reflector):

...
if (ch <= '>')
{
    switch (ch)
    {
        case '&':
        {
            output.Write("&amp;");
            continue;
        }
        case '\'':
        {
            output.Write("&#39;");
            continue;
        }
        case '"':
        {
            output.Write("&quot;");
            continue;
        }
        case '<':
        {
            output.Write("&lt;");
            continue;
        }
        case '>':
        {
            output.Write("&gt;");
            continue;
        }
    }
    output.Write(ch);
    continue;
}
if ((ch >= '\x00a0') && (ch < 'Ā'))
{
    output.Write("&#");
    output.Write(((int) ch).ToString(NumberFormatInfo.InvariantInfo));
    output.Write(';');
}
...

That being said, you shouldn't need to care, as this method always produces valid, safe and correctly encoded HTML.
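If you really need names instead of numbers, one option is to post-process the encoder's output. The following is only a rough sketch, not something built into .NET: the class name NamedEntityEncoder and its lookup table are made up for illustration, and the table covers just a handful of Latin-1 characters, so you would have to extend it with the entities you care about.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper: encodes with WebUtility.HtmlEncode, then swaps known
// decimal references (e.g. &#233;) for their named entities (e.g. &eacute;).
public static class NamedEntityEncoder
{
    // Partial, hand-picked table (code point -> entity name); extend as needed.
    private static readonly Dictionary<int, string> Names = new Dictionary<int, string>
    {
        { 160, "nbsp" }, { 169, "copy" }, { 174, "reg" },
        { 224, "agrave" }, { 231, "ccedil" }, { 232, "egrave" },
        { 233, "eacute" }, { 241, "ntilde" }, { 246, "ouml" }, { 252, "uuml" },
    };

    private static readonly Regex DecimalReference = new Regex(@"&#(\d+);");

    public static string Encode(string text)
    {
        if (string.IsNullOrEmpty(text)) return string.Empty;

        string encoded = WebUtility.HtmlEncode(text);
        return DecimalReference.Replace(encoded, match =>
        {
            // Replace only references whose number is in the table;
            // anything else (for example &#39;) is left exactly as it was.
            if (int.TryParse(match.Groups[1].Value, out int codePoint)
                && Names.TryGetValue(codePoint, out var name))
            {
                return "&" + name + ";";
            }
            return match.Value;
        });
    }
}
```

With this, NamedEntityEncoder.Encode("é>") gives &eacute;&gt;. A literal "&#233;" typed by the user is not touched, because HtmlEncode has already turned its ampersand into &amp; before the regex runs.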

时间海 2024-10-22 01:47:50

HtmlEncode is following the spec. The ISO standard specifies both a name and a number for every entity, and the name and the number are equivalent. Therefore, a conforming implementation of HtmlEncode is free to encode all points as numbers, or all as names, or some mixture of the two.

I suggest that you approach your problem from the other direction: call HtmlDecode on the target text, then search through the decoded text using the raw string.
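A small sketch of that idea, assuming you can get at the stored text in your own code (which, per the question's edit, may not be the case with a CAML query). The class name DecodedSearch is invented for the example; WebUtility.HtmlDecode is the real framework call.

```csharp
using System;
using System.Net;

// Hypothetical helper: decode the stored text first, then search it with the raw term.
public static class DecodedSearch
{
    public static bool Contains(string encodedText, string rawSearchTerm)
    {
        // HtmlDecode maps both &#233; and &eacute; back to the same character,
        // so the search no longer depends on which form the encoder chose.
        string decoded = WebUtility.HtmlDecode(encodedText);
        return decoded.IndexOf(rawSearchTerm, StringComparison.OrdinalIgnoreCase) >= 0;
    }
}
```

For example, DecodedSearch.Contains("caf&#233; &gt; bar", "café >") and DecodedSearch.Contains("caf&eacute;", "café") both return true.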

丑丑阿 2024-10-22 01:47:50

ISO-8859-1 is not really relevant to HTML character encoding. From Wikipedia:

Numeric references always refer to
Unicode code points, regardless of the
page's encoding.

Only for undefined Unicode code points ISO-8859-1 is often used:

Using numeric
references that refer to permanently
undefined characters and control
characters is forbidden, with the
exception of the linefeed, tab, and
carriage return characters. That is,
characters in the hexadecimal ranges
00–08, 0B–0C, 0E–1F, 7F, and 80–9F
cannot be used in an HTML document,
not even by reference, so "&#153;",
for example, is not allowed. However,
for backward compatibility with early
HTML authors and browsers that ignored
this restriction, raw characters and
numeric character references in the
80–9F range are interpreted by some
browsers as representing the
characters mapped to bytes 80–9F in
the Windows-1252 encoding.

Now to answer your question: For search to work best, you should really search the unencoded HTML (stripping the HTML tags first) using an unencoded search string.
Matching encoded strings will lead to unexpected results, like hits based on HTML tags or comments, and hits missing because of differences in the HTML that are invisible in the text.
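To illustrate, here is a deliberately naive sketch. The class name VisibleText is made up, and the regex-based tag stripping is an assumption for the sake of a short example: it will get comments, scripts and attribute values containing ">" wrong, so use a real HTML parser for anything serious.

```csharp
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper: strip tags (naively), then decode entities,
// leaving roughly the text a reader would see.
public static class VisibleText
{
    // Naive tag matcher; not a substitute for an HTML parser.
    private static readonly Regex Tag = new Regex(@"<[^>]*>");

    public static string Extract(string html)
    {
        // Replace each tag with a space so adjacent words are not glued together.
        string withoutTags = Tag.Replace(html, " ");
        return WebUtility.HtmlDecode(withoutTags).Trim();
    }
}
```

Searching the result of VisibleText.Extract("<a href=\"cafe.html\">menu</a>") for "cafe" finds nothing, because only "menu" is visible text, whereas searching the raw markup would have produced a hit inside the attribute.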

一紙繁鸢 2024-10-22 01:47:50

I made this function, I think it will help

        string BasHtmlEncode(string x)
        {
           StringBuilder sb = new StringBuilder();
           foreach (char c in x)
               // Cast to int: Convert.ToInt16 throws OverflowException for chars above 0x7FFF.
               sb.Append(String.Format("&#{0};", (int)c));
           return sb.ToString();
        }
老娘不死你永远是小三 2024-10-22 01:47:50

I developed the following code, which leaves a-z, A-Z and 0-9 unencoded and encodes everything else as a numeric reference:

public static string Encode(string source)
{
    if (string.IsNullOrEmpty(source)) return string.Empty;

    var sb = new StringBuilder(source.Length);
    foreach (char c in source)
    {
        if (c >= 'a' && c <= 'z')
        {
            sb.Append(c);
        }
        else if (c >= 'A' && c <= 'Z')
        {
            sb.Append(c);
        }
        else if (c >= '0' && c <= '9')
        {
            sb.Append(c);
        }
        else
        {
            sb.AppendFormat("&#{0};",Convert.ToInt32(c));
        }
    }

    return sb.ToString();
}
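
One thing to watch with any char-by-char encoder like this: characters outside the Basic Multilingual Plane are stored as two surrogate chars, and encoding each half separately yields references that do not denote the original character. A possible variant that works on code points is sketched below; the class name NumericEncoder is made up, and the rest follows the same rules as the code above.

```csharp
using System.Globalization;
using System.Text;

// Hypothetical variant: same rules as above, but surrogate pairs are
// combined into a single code point before being written as &#N;.
public static class NumericEncoder
{
    public static string Encode(string source)
    {
        if (string.IsNullOrEmpty(source)) return string.Empty;

        var sb = new StringBuilder(source.Length);
        for (int i = 0; i < source.Length; i++)
        {
            char c = source[i];
            bool isAsciiAlphanumeric =
                (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9');
            if (isAsciiAlphanumeric)
            {
                sb.Append(c);
                continue;
            }

            int codePoint = c;
            // Combine a valid high/low surrogate pair into one code point.
            // A lone surrogate falls through and is written as its own number.
            if (char.IsHighSurrogate(c) && i + 1 < source.Length && char.IsLowSurrogate(source[i + 1]))
            {
                codePoint = char.ConvertToUtf32(c, source[i + 1]);
                i++; // skip the low surrogate that was just consumed
            }

            sb.Append("&#").Append(codePoint.ToString(CultureInfo.InvariantCulture)).Append(';');
        }
        return sb.ToString();
    }
}
```

For example, NumericEncoder.Encode("aé>") returns a&#233;&#62;, and an emoji such as U+1F600 comes out as the single reference &#128512; rather than two surrogate numbers.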