HtmlAgilityPack -- Does <form> close itself for some reason?

Posted 2024-10-03 01:52:49

I just wrote up this test to see if I was crazy...

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using HtmlAgilityPack;

namespace HtmlAgilityPackFormBug
{
    class Program
    {
        static void Main(string[] args)
        {
            var doc = new HtmlDocument();
            doc.LoadHtml(@"
<!DOCTYPE html>
<html>
    <head>
        <title>Form Test</title>
    </head>
    <body>
        <form>
            <input type=""text"" />
            <input type=""reset"" />
            <input type=""submit"" />
        </form>
    </body>
</html>
");
            var body = doc.DocumentNode.SelectSingleNode("//body");
            foreach (var node in body.ChildNodes.Where(n => n.NodeType == HtmlNodeType.Element))
                Console.WriteLine(node.XPath);
            Console.ReadLine();
        }
    }
}

And it outputs:

/html[1]/body[1]/form[1]
/html[1]/body[1]/input[1]
/html[1]/body[1]/input[2]
/html[1]/body[1]/input[3]

But, if I change <form> to <xxx> it gives me:

/html[1]/body[1]/xxx[1]

(As it should). So... it looks like those input elements are not contained within the form, but directly within the body, as if the <form> just closed itself off immediately. What's up with that? Is this a bug?


Digging through the source, I see:

ElementsFlags.Add("form", HtmlElementFlag.CanOverlap | HtmlElementFlag.Empty);

It has the "empty" flag, like META and IMG. Why?? Forms are most definitely not supposed to be empty.
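To see how the library currently classifies an element without reading the source, the static helpers on HtmlNode can be queried directly. This is only a small sketch: the class name InspectFormFlags is made up for illustration, and the values printed depend on the HtmlAgilityPack version in use, so no output is shown here.

```csharp
using System;
using HtmlAgilityPack;

class InspectFormFlags
{
    static void Main()
    {
        // IsEmptyElement reports whether the parser treats the tag as having
        // no content (the behaviour questioned above for <form>).
        Console.WriteLine("form empty:   " + HtmlNode.IsEmptyElement("form"));

        // <img> is a genuinely empty element, shown for comparison.
        Console.WriteLine("img empty:    " + HtmlNode.IsEmptyElement("img"));

        // CanOverlapElement reports the CanOverlap flag for the tag.
        Console.WriteLine("form overlap: " + HtmlNode.CanOverlapElement("form"));
    }
}
```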

Comments (2)

吐个泡泡 2024-10-10 01:52:49

This is also reported in this workitem. It contains a suggested workaround from DarthObiwan.

You can change this without recompiling. The ElementsFlags list is a
static property on the HtmlNode class. The form entry can be removed with

    HtmlNode.ElementsFlags.Remove("form");

before loading the document.
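A minimal sketch of that workaround in context, assuming a console program with the HtmlAgilityPack package referenced. The class name FormFlagWorkaround and the shortened HTML are illustrative only. Because ElementsFlags is static, the change applies to every document parsed afterwards in the same process.

```csharp
using System;
using HtmlAgilityPack;

class FormFlagWorkaround
{
    static void Main()
    {
        // Remove the special-case entry for <form> BEFORE parsing, so the
        // parser handles it like an ordinary container element.
        HtmlNode.ElementsFlags.Remove("form");

        var doc = new HtmlDocument();
        doc.LoadHtml(@"<html><body><form>
            <input type=""text"" />
            <input type=""submit"" />
        </form></body></html>");

        var form = doc.DocumentNode.SelectSingleNode("//form");

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var inputs = form.SelectNodes("./input");
        Console.WriteLine("inputs directly under form: " + (inputs == null ? 0 : inputs.Count));

        // Print each input's path to check that it sits under the form.
        if (inputs != null)
            foreach (var input in inputs)
                Console.WriteLine(input.XPath);
    }
}
```

If the workaround takes effect, the printed paths should run through form[1] rather than sitting directly under body[1].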

烟沫凡尘 2024-10-10 01:52:49

Since I'm the original HAP author, I can explain why it's marked as empty :)

This is because when HAP was designed, back in 2000, HTML 3.2 was the standard. You're probably aware that tags can overlap in HTML. For example, <b>bold<i>italic and bold</b>italic</i> is supported by all browsers (although it's not officially in the HTML specification). The FORM tag can overlap in the same way.

Since HAP was designed to handle any HTML content, rather than break most pages that you could find at that time, we decided to handle overlapping tags as EMPTY (using the ElementsFlags property) so that:

  • you can still load them
  • you can save them back without breaking the original HTML (if you don't need what's inside the form in any programmatic way).

The only thing you cannot do is work with their content programmatically, whether through the API and its tree model, with XSL, or by any other means.
Today, with XHTML/XML almost everywhere, this sounds strange, but that's why I created the ElementsFlags :)
