XQuery: extracting text
I am working on extracting text from HTML documents and storing it in a database. I am using the webharvest tool to extract the content, but I am stuck at one point. Inside webharvest I use XQuery expressions to extract the data. The HTML document that I am parsing is as follows:
<td><a name="hw">HELLOWORLD</a>Hello world</td>
I need to extract the "Hello world" text from the HTML snippet above.
I have tried extracting the text like this:
$hw := data($item//a[@name='hw']/text())
However, what I always get is "HELLOWORLD" instead of "Hello world".
Is there a way to extract "Hello world"? Please help.
What if I want to do it this way:
<td>
<a name="hw1">HELLOWORLD1</a>Hello world1
<a name="hw2">HELLOWORLD2</a>Hello world2
<a name="hw3">HELLOWORLD3</a>Hello world3
</td>
I would like to extract the text "Hello world2" that sits between hw2 and hw3. I would rather not use text()[3]. Is there some way to extract the text between /a[@name='hw2'] and /a[@name='hw3']?
Comments (1)
First of all, you are looking for the <a> nodes whose name attribute starts with 'hw'. This can be achieved with a path expression that filters on the name attribute.
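The path from the original answer is not preserved in this copy. A minimal sketch of such a path, assuming $item is bound to the <td> element (or one of its ancestors) as in the question, and using the standard starts-with() function:

```xquery
(: Assumption: $item holds the <td> element or one of its ancestors. :)
(: Selects every <a> whose name attribute begins with 'hw'. :)
$item//a[starts-with(@name, 'hw')]
```

For the second snippet in the question this should select the hw1, hw2 and hw3 anchors.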
Once you have found your <a> nodes, you want to retrieve the first text node that follows each of them.
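One way to do this, sketched under the same assumption about $item, is the following-sibling axis. The variable name $a is only illustrative, and normalize-space() is added here to trim the line breaks around the text; it may not be what the original answer used:

```xquery
(: For each matching <a>, take the nearest text node that comes after it
   among its siblings, then trim the surrounding whitespace. :)
for $a in $item//a[starts-with(@name, 'hw')]
return normalize-space($a/following-sibling::text()[1])
```

To pick only the text between hw2 and hw3, the same step can be applied to a single anchor: data($item//a[@name='hw2']/following-sibling::text()[1]). For the second snippet in the question this should give "Hello world2", possibly with surrounding whitespace. The expression from the question fails because a[@name='hw']/text() selects the text inside the <a> element, whereas "Hello world" is a sibling of <a> inside the <td>.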