Problem parsing the children of a node with HtmlAgilityPack

Posted on 2024-09-06 12:47:34

I'm having a problem parsing the input tags that are children of a form in HTML. I can select them from the root using //input[@type], but not as children of a specific node.

Here's some code that illustrates the problem:

private const string HTML_CONTENT =
        "<html>" +
        "<head>" +
        "<title>Test Page</title>" +
        "<link href='site.css' rel='stylesheet' type='text/css' />" +
        "</head>" +
        "<body>" +
        "<form id='form1' method='post' action='http://www.someplace.com/input'>" +
        "<input type='hidden' name='id' value='test' />" +
        "<input type='text' name='something' value='something' />" +
        "</form>" +
        "<a href='http://www.someplace.com'>Someplace</a>" +
        "<a href='http://www.someplace.com/other'><img src='http://www.someplace.com/image.jpg' alt='Someplace Image'/></a>" +
        "<form id='form2' method='post' action='/something/to/do'>" +
        "<input type='text' name='secondForm' value='this should be in the second form' />" +
        "</form>" +
        "</body>" +
        "</html>";

public void Parser_Test()
{
    var htmlDoc = new HtmlDocument
    {
        OptionFixNestedTags = true,
        OptionUseIdAttribute = true,
        OptionAutoCloseOnEnd = true,
        OptionAddDebuggingAttributes = true
    };

    // Load the test markup from an in-memory UTF-8 stream.
    byte[] byteArray = Encoding.UTF8.GetBytes(HTML_CONTENT);
    var stream = new MemoryStream(byteArray);
    htmlDoc.Load(stream, Encoding.UTF8, true);

    var nodeCollection = htmlDoc.DocumentNode.SelectNodes("//form");
    if (nodeCollection != null && nodeCollection.Count > 0)
    {
        foreach (var form in nodeCollection)
        {
            var id = form.GetAttributeValue("id", string.Empty);
            if (!form.HasChildNodes)
                Debug.WriteLine(string.Format("Form {0} has no children", id));

            // Relative XPath: only direct <input> children of this form.
            var childCollection = form.SelectNodes("input[@type]");
            if (childCollection != null && childCollection.Count > 0)
            {
                Debug.WriteLine("Got some child nodes");
            }
            else
            {
                Debug.WriteLine("Unable to find input nodes as children of Form");
            }
        }

        // Absolute XPath: every <input> anywhere in the document.
        var inputNodes = htmlDoc.DocumentNode.SelectNodes("//input");
        if (inputNodes != null && inputNodes.Count > 0)
        {
            Debug.WriteLine(string.Format("Found {0} input nodes when parsed from root", inputNodes.Count));
        }
    }
    else
    {
        Debug.WriteLine("Found no forms");
    }
}

The output is:

Form form1 has no children
Unable to find input nodes as children of Form
Form form2 has no children
Unable to find input nodes as children of Form
Found 3 input nodes when parsed from root

What I would expect is that form1 and form2 would both have children, and that input[@type] would find 2 nodes for form1 and 1 node for form2.

Is there a specific configuration setting or method that I'm not using that I should be? Any ideas?

Thanks,

Steve

Comments (2)

瑶笙 2024-09-13 12:47:34

Check out this discussion thread on the HtmlAgilityPack site -
http://htmlagilitypack.codeplex.com/workitem/21782

This is what they say:

This is not a bug, but a feature and is configurable. FORM is treated like this because many HTML pages used to have overlapping forms, as this was actually a (powerful) feature of original HTML. Now that XML and XHTML exist, everybody assumes that overlapping is an error, but it's not (in HTML 3.2).
Check the HtmlNode.cs file, and modify the ElementsFlags collection (or do it at runtime if you prefer)

To modify the HtmlNode.cs file, comment out the following line:

ElementsFlags.Add("form", HtmlElementFlag.CanOverlap | HtmlElementFlag.Empty);
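The runtime alternative mentioned in the quote can be sketched roughly as follows. This is a minimal sketch, assuming a HtmlAgilityPack version in which HtmlNode.ElementsFlags is a public static collection keyed by tag name; the class name FormFlagDemo, the helper CountInputsPerForm, and the sample markup are made up for illustration. The entry has to be removed before the document is loaded, and because the collection is static, the change applies to every document parsed afterwards in the same process.

```csharp
using System;
using HtmlAgilityPack;

public static class FormFlagDemo
{
    // Hypothetical helper: returns the number of direct <input type=...> children per <form>.
    public static int[] CountInputsPerForm(string html)
    {
        // Drop the special-case flags for <form> so it is parsed as an ordinary container.
        // Must run before the document is loaded.
        HtmlNode.ElementsFlags.Remove("form");

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var forms = doc.DocumentNode.SelectNodes("//form");
        if (forms == null)
            return new int[0];

        var counts = new int[forms.Count];
        for (int i = 0; i < forms.Count; i++)
        {
            var inputs = forms[i].SelectNodes("input[@type]");
            counts[i] = inputs == null ? 0 : inputs.Count;
        }
        return counts;
    }

    public static void Main()
    {
        // Sample markup is an assumption, modelled on the question's test page.
        const string html =
            "<html><body>" +
            "<form id='form1'><input type='hidden' name='id' value='test' />" +
            "<input type='text' name='something' value='something' /></form>" +
            "<form id='form2'><input type='text' name='secondForm' value='x' /></form>" +
            "</body></html>";

        foreach (var count in CountInputsPerForm(html))
            Console.WriteLine(count);
    }
}
```

This avoids maintaining a patched copy of HtmlNode.cs, at the cost of changing global parser state, so it is probably best done once at application start-up.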

柠檬树下少年和吉他 2024-09-13 12:47:34

Well, I've given up on HtmlAgilityPack for now. Seems like there is still more work to do in that library to get everything working. To solve this problem I've moved the code over to use the SGMLReader library from here: http://developer.mindtouch.com/SgmlReader

Using this library all my unit tests pass properly and the sample code works as expected.
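For readers who want to try the same route, the SgmlReader approach might look roughly like the sketch below. It is not the code from this answer: it assumes the SgmlReader assembly exposes an Sgml.SgmlReader class with DocType, CaseFolding, and InputStream properties, and the names SgmlFormDemo and LoadHtml are hypothetical. The reader converts the HTML into well-formed XML, so the standard System.Xml XPath calls can be used on the result.

```csharp
using System;
using System.IO;
using System.Xml;
using Sgml;

public static class SgmlFormDemo
{
    // Hypothetical helper: parses HTML text into an XmlDocument via SgmlReader.
    public static XmlDocument LoadHtml(string html)
    {
        var sgml = new SgmlReader
        {
            DocType = "HTML",                  // use the built-in HTML DTD
            CaseFolding = CaseFolding.ToLower, // normalise tag names to lower case
            InputStream = new StringReader(html)
        };

        var doc = new XmlDocument();
        doc.Load(sgml); // SgmlReader is an XmlReader, so XmlDocument can consume it
        return doc;
    }

    public static void Main()
    {
        // Sample markup is an assumption, modelled on the question's test page.
        const string html =
            "<html><body>" +
            "<form id='form1'><input type='hidden' name='id' value='test' />" +
            "<input type='text' name='something' value='something' /></form>" +
            "<form id='form2'><input type='text' name='secondForm' value='x' /></form>" +
            "</body></html>";

        XmlDocument doc = LoadHtml(html);
        foreach (XmlNode form in doc.SelectNodes("//form"))
        {
            // Relative XPath: direct <input> children of this form only.
            XmlNodeList inputs = form.SelectNodes("input[@type]");
            Console.WriteLine("{0}: {1}", form.Attributes["id"].Value, inputs.Count);
        }
    }
}
```

Unlike HtmlAgilityPack's SelectNodes, XmlNode.SelectNodes returns an empty list rather than null when nothing matches, so the null checks from the question are not needed here.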
