Extracting text from HTML with XQuery
I am working on extracting text from HTML documents and storing it in a database. I am using the Web-Harvest tool to extract the content, but I am stuck at one point. Inside Web-Harvest I use an XQuery expression to extract the data. The HTML document that I am parsing is as follows:
<td><a name="hw">HELLOWORLD</a>Hello world</td>
I need to extract the "Hello world" text from the HTML snippet above.
I have tried extracting the text in this fashion:
$hw :=data($item//a[@name='hw']/text())
However, what I always get is "HELLOWORLD" instead of "Hello world".
Is there a way to extract "Hello world"? Please help.
What if I want to do it this way:
<td>
<a name="hw1">HELLOWORLD1</a>Hello world1
<a name="hw2">HELLOWORLD2</a>Hello world2
<a name="hw3">HELLOWORLD3</a>Hello world3
</td>
I would like to extract the text "Hello world2" that sits between hw2 and hw3. I would rather not use text()[3]; is there some way to extract the text between /a[@name='hw2'] and /a[@name='hw3']?
Comments (3)
Your XPath is selecting the text of the <a> nodes, not the text of the <td> node. Change it to this:
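A minimal sketch of such a change, assuming $item contains the <td> from the question. The inline let binding is only there to make the snippet self-contained; in Web-Harvest, $item would be supplied by the surrounding configuration. This is an illustration of the idea, not necessarily the exact expression the answer originally showed.

```xquery
(: Assumption: $item stands in for the parsed <td> element from the question. :)
let $item := <td><a name="hw">HELLOWORLD</a>Hello world</td>

(: Step from the <a> up to its parent and take the parent's own text children.
   The text inside <a> is a child of <a>, not of <td>, so it is skipped. :)
return data($item//a[@name='hw']/../text())
(: evaluates to "Hello world" :)
```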
Update (following comments and update to question):
This XPath selects the second text node from $item that has an <a> tag containing a name attribute set to hw:
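One expression that fits this description, written as a sketch under the assumption that $item is the <td> element itself. The inline let binding is for illustration only and may differ from what the answer originally contained.

```xquery
(: Assumption: $item is the <td> element from the question. :)
let $item := <td><a name="hw">HELLOWORLD</a>Hello world</td>

(: Keep $item only if it has an <a name="hw"> child, then take the second
   text node among all of its descendants, in document order.
   The first is "HELLOWORLD" (inside <a>), the second is "Hello world". :)
return data($item[a[@name='hw']]/descendant::text()[2])
(: evaluates to "Hello world" :)
```

Because this selects by position, it breaks as soon as extra text nodes appear before the one you want.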
If there is just one text node between the two <a> elements, then the following would be quite simple:
/a[@name='hw3']/preceding::text()[1]
If there is more than one text node between the two elements, then you need to express the intersection of all text nodes following the first element with all text nodes preceding the second element. The formula for the intersection of two node-sets (also known as the Kaysian method of intersection) is:
$ns1[count(.|$ns2) = count($ns2)]
So, just replace in the above expression $ns1 with:
/a[@name='hw2']/following-sibling::text()
and $ns2 with:
/a[@name='hw3']/preceding-sibling::text()
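Put together, the substitution could look like the sketch below. It assumes $item is the <td> element from the extended example; the let bindings and the normalize-space call are additions for readability, not part of the formula itself.

```xquery
(: Assumption: $item is the <td> element from the extended example. :)
let $item :=
  <td>
    <a name="hw1">HELLOWORLD1</a>Hello world1
    <a name="hw2">HELLOWORLD2</a>Hello world2
    <a name="hw3">HELLOWORLD3</a>Hello world3
  </td>

(: All text siblings after hw2, and all text siblings before hw3. :)
let $ns1 := $item/a[@name='hw2']/following-sibling::text()
let $ns2 := $item/a[@name='hw3']/preceding-sibling::text()

(: Kaysian intersection: a node from $ns1 is kept only if adding it to $ns2
   does not grow $ns2, which means it is already a member of $ns2. :)
return
  for $t in $ns1[count(. | $ns2) = count($ns2)]
  return normalize-space($t)
(: evaluates to "Hello world2" :)
```

The count-based test only relies on the union operator and count(), which is why it also works in plain XPath 1.0, where no intersect operator is available.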
Lastly, if you really have XQuery (or XPath 2), then this is simply:
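In XQuery and XPath 2.0 the intersect operator expresses this directly. The following is a sketch under the same assumption that $item is the <td> element; the inline let binding, string-join and normalize-space are illustrative additions.

```xquery
(: Assumption: $item is the <td> element from the extended example. :)
let $item :=
  <td>
    <a name="hw1">HELLOWORLD1</a>Hello world1
    <a name="hw2">HELLOWORLD2</a>Hello world2
    <a name="hw3">HELLOWORLD3</a>Hello world3
  </td>

(: Text nodes that both follow hw2 and precede hw3, as siblings. :)
let $between :=
  $item/a[@name='hw2']/following-sibling::text()
    intersect
  $item/a[@name='hw3']/preceding-sibling::text()

(: Join in case several text nodes qualify, then trim surrounding whitespace. :)
return normalize-space(string-join($between, ' '))
(: evaluates to "Hello world2" :)
```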
This handles your expanded case, while letting you select by attribute value rather than position:
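A sketch of an expression of this kind, assuming $item is the <td> element from the extended example. The inline let binding and normalize-space are illustrative, and the answer's original expression may have differed in detail.

```xquery
(: Assumption: $item is the <td> element from the extended example. :)
let $item :=
  <td>
    <a name="hw1">HELLOWORLD1</a>Hello world1
    <a name="hw2">HELLOWORLD2</a>Hello world2
    <a name="hw3">HELLOWORLD3</a>Hello world3
  </td>

(: Among the children of $item, keep those with a preceding sibling
   <a name="hw2">, then take the first one in document order. :)
return normalize-space($item/node()[preceding-sibling::a/@name = 'hw2'][1])
(: evaluates to "Hello world2" :)
```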
This gets the first node that has a preceding-sibling "a" element with a name attribute of "hw2".