Getting text from a node using HtmlAgilityPack
I have the following HTML:
<div class="top">
<p>Blah.</p>
I want <em>this</em> text.
</div>
What is the XPath notation to extract the string "I want <em>this</em> text."?
EDIT: I don't necessarily want a single XPath expression to extract the string. Selecting multiple nodes, and iterating over them to produce the sentence, would be great as well.
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(myHtml);
doc.DocumentNode.SelectSingleNode("??????");
Comments (2)
or
What do you want to extract, nodes or a string?
If you want nodes, "I want <em>this</em> text." is an XML fragment consisting at the top level of two text nodes and an <em> element, which has a text node child. Since it has multiple nodes at the top level, you need to use SelectNodes("xpath expression a la @Alejandro") rather than SelectSingleNode() to extract them.
If you want a string, again you need to use SelectNodes(), and then iterate over the selected nodes and concatenate the OuterHtml of each one. See here for a good example of something similar.
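A minimal sketch of the string approach described above, using the HTML from the question. The XPath is an assumption for illustration, not necessarily the expression @Alejandro proposed: it selects every node that follows the <p> inside the div.

```csharp
using System;
using System.Text;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // The HTML from the question.
        string myHtml =
            "<div class=\"top\"><p>Blah.</p>I want <em>this</em> text.</div>";

        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(myHtml);

        // Assumed XPath: all sibling nodes (text and elements) after the <p>.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(
            "//div[@class='top']/p/following-sibling::node()");

        StringBuilder sb = new StringBuilder();
        if (nodes != null) // SelectNodes returns null when nothing matches
        {
            foreach (HtmlNode node in nodes)
            {
                // OuterHtml keeps the <em> markup; use InnerText to drop it.
                sb.Append(node.OuterHtml);
            }
        }

        Console.WriteLine(sb.ToString().Trim());
    }
}
```

Swapping OuterHtml for InnerText in the loop should give the sentence without the <em> tags, if plain text is what you are after.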
Also, it's a little unclear from your example what XPath expression would in general give you what you want. E.g. do you want everything after the initial <p>...</p> under <div class="top">? Or do you want all text under the <div> except all <p> elements? Or maybe something else? Of course, if @Alejandro's XPath expressions work for you, then it's already well-specified enough.
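To make the two readings concrete, here is a rough sketch with one candidate XPath for each. Both expressions and the JoinText helper are my own illustrations, so treat them as assumptions to check against your real markup.

```csharp
using System;
using System.Text;
using HtmlAgilityPack;

class Program
{
    // Hypothetical helper: concatenates the InnerText of every selected node.
    static string JoinText(HtmlDocument doc, string xpath)
    {
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        StringBuilder sb = new StringBuilder();
        if (nodes != null) // SelectNodes returns null when nothing matches
        {
            foreach (HtmlNode node in nodes)
            {
                sb.Append(node.InnerText);
            }
        }
        return sb.ToString().Trim();
    }

    static void Main()
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(
            "<div class=\"top\"><p>Blah.</p>I want <em>this</em> text.</div>");

        // Reading 1: everything after the initial <p> under div.top.
        string afterFirstP = JoinText(doc,
            "//div[@class='top']/p[1]/following-sibling::node()");

        // Reading 2: all text under div.top that is not inside any <p>.
        string notInsideP = JoinText(doc,
            "//div[@class='top']//text()[not(ancestor::p)]");

        Console.WriteLine(afterFirstP);
        Console.WriteLine(notInsideP);
    }
}
```

On the sample HTML the two readings should agree. They would diverge if the div contained text before the first <p>, or more than one <p>.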