Problem parsing the child nodes of a node with HtmlAgilityPack
I'm having trouble parsing the input tags that are children of a form in HTML. I can select them from the document root using //input[@type], but not as children of a specific form node.
Here's some code that illustrates the problem:
using System.Diagnostics;
using System.IO;
using System.Text;
using HtmlAgilityPack;

// Wrapper class added only so the snippet compiles; its name is arbitrary.
public class ParserTests
{
    private const string HTML_CONTENT =
        "<html>" +
        "<head>" +
        "<title>Test Page</title>" +
        "<link href='site.css' rel='stylesheet' type='text/css' />" +
        "</head>" +
        "<body>" +
        "<form id='form1' method='post' action='http://www.someplace.com/input'>" +
        "<input type='hidden' name='id' value='test' />" +
        "<input type='text' name='something' value='something' />" +
        "</form>" +
        "<a href='http://www.someplace.com'>Someplace</a>" +
        "<a href='http://www.someplace.com/other'><img src='http://www.someplace.com/image.jpg' alt='Someplace Image'/></a>" +
        "<form id='form2' method='post' action='/something/to/do'>" +
        "<input type='text' name='secondForm' value='this should be in the second form' />" +
        "</form>" +
        "</body>" +
        "</html>";

    public void Parser_Test()
    {
        var htmlDoc = new HtmlDocument
        {
            OptionFixNestedTags = true,
            OptionUseIdAttribute = true,
            OptionAutoCloseOnEnd = true,
            OptionAddDebuggingAttributes = true
        };
        byte[] byteArray = Encoding.UTF8.GetBytes(HTML_CONTENT);
        var stream = new MemoryStream(byteArray);
        htmlDoc.Load(stream, Encoding.UTF8, true);

        // Select every form in the document, then look for inputs under each one.
        var nodeCollection = htmlDoc.DocumentNode.SelectNodes("//form");
        if (nodeCollection != null && nodeCollection.Count > 0)
        {
            foreach (var form in nodeCollection)
            {
                var id = form.GetAttributeValue("id", string.Empty);
                if (!form.HasChildNodes)
                    Debug.WriteLine(string.Format("Form {0} has no children", id));

                // Relative XPath: direct input children of this form only.
                var childCollection = form.SelectNodes("input[@type]");
                if (childCollection != null && childCollection.Count > 0)
                {
                    Debug.WriteLine("Got some child nodes");
                }
                else
                {
                    Debug.WriteLine("Unable to find input nodes as children of Form");
                }
            }

            // Absolute XPath: every input anywhere in the document.
            var inputNodes = htmlDoc.DocumentNode.SelectNodes("//input");
            if (inputNodes != null && inputNodes.Count > 0)
            {
                Debug.WriteLine(string.Format("Found {0} input nodes when parsed from root", inputNodes.Count));
            }
        }
        else
        {
            Debug.WriteLine("Found no forms");
        }
    }
}
The output is:
Form form1 has no children
Unable to find input nodes as children of Form
Form form2 has no children
Unable to find input nodes as children of Form
Found 3 input nodes when parsed from root
I would expect form1 and form2 to both have children, and input[@type] to find 2 nodes for form1 and 1 for form2.
Is there a specific configuration setting or method that I should be using but am not? Any ideas?
Thanks,
Steve
Comments (2)
Check out this discussion thread on the HtmlAgilityPack site:
http://htmlagilitypack.codeplex.com/workitem/21782
This is what they say:
In the HtmlNode.cs file, comment out the following line -
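As a possible alternative to editing the library source, here is a minimal sketch of a runtime workaround. It rests on an assumption that the text quoted above does not confirm: that in the affected HtmlAgilityPack versions the static HtmlNode.ElementsFlags table registers "form" with a flag that makes the parser close the element immediately, so its inputs become siblings instead of children. Removing that entry before parsing may restore the expected nesting; newer releases may behave differently. The class name FormFlagWorkaround and the shortened HTML are illustrative only.

```csharp
using System;
using HtmlAgilityPack;

public static class FormFlagWorkaround
{
    public static void Main()
    {
        // Assumption: "form" is listed in ElementsFlags in a way that makes the
        // parser treat it as childless. Remove the entry BEFORE loading any HTML.
        // Remove() simply returns false if the key is not present.
        HtmlNode.ElementsFlags.Remove("form");

        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(
            "<html><body>" +
            "<form id='form1'>" +
            "<input type='hidden' name='id' value='test' />" +
            "<input type='text' name='something' value='something' />" +
            "</form>" +
            "<form id='form2'>" +
            "<input type='text' name='secondForm' value='x' />" +
            "</form>" +
            "</body></html>");

        var forms = htmlDoc.DocumentNode.SelectNodes("//form");
        if (forms == null)
        {
            Console.WriteLine("Found no forms");
            return;
        }

        foreach (var form in forms)
        {
            // Relative XPath: direct input children of this form.
            var inputs = form.SelectNodes("input[@type]");
            int count = inputs == null ? 0 : inputs.Count;
            Console.WriteLine("{0}: {1} input child node(s)",
                form.GetAttributeValue("id", string.Empty), count);
        }
    }
}
```

Because ElementsFlags is static, the change applies to every document parsed afterwards in the same process, so it is best done once at startup.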
Well, I've given up on HtmlAgilityPack for now. It seems there is still more work to do in that library to get everything working. To solve this problem, I've moved the code over to the SgmlReader library from here: http://developer.mindtouch.com/SgmlReader
Using this library, all my unit tests pass and the sample code works as expected.
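For readers wondering what the SgmlReader version might look like, here is a rough sketch, not the poster's actual code. It assumes the Sgml.SgmlReader class with its DocType, CaseFolding, and InputStream properties, and loads the result into a standard XmlDocument so that the same relative XPath can be run against each form. The class name SgmlFormExample and the shortened HTML are made up for illustration.

```csharp
using System;
using System.IO;
using System.Xml;
using Sgml;

public static class SgmlFormExample
{
    public static void Main()
    {
        const string html =
            "<html><body>" +
            "<form id='form1'>" +
            "<input type='hidden' name='id' value='test' />" +
            "<input type='text' name='something' value='something' />" +
            "</form>" +
            "<form id='form2'>" +
            "<input type='text' name='secondForm' value='x' />" +
            "</form>" +
            "</body></html>";

        using (var textReader = new StringReader(html))
        {
            // SgmlReader is an XmlReader that tolerates HTML and emits XML.
            var sgmlReader = new SgmlReader
            {
                DocType = "HTML",
                CaseFolding = CaseFolding.ToLower, // normalize tag names for XPath
                InputStream = textReader
            };

            var doc = new XmlDocument();
            doc.Load(sgmlReader);

            foreach (XmlNode form in doc.SelectNodes("//form"))
            {
                // Relative XPath: direct input children of this form.
                XmlNodeList inputs = form.SelectNodes("input[@type]");
                XmlAttribute idAttr = form.Attributes["id"];
                string id = idAttr == null ? string.Empty : idAttr.Value;
                Console.WriteLine("{0}: {1} input child node(s)", id, inputs.Count);
            }
        }
    }
}
```

Going through XmlDocument means the rest of the code uses the standard System.Xml XPath support rather than a library-specific node model.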