XQuery: extracting text

Posted on 2024-09-06 21:12:49

I am working on extracting text from HTML documents and storing it in a database. I am using the Web-Harvest tool to extract the content, but I am stuck at one point. Inside Web-Harvest I use XQuery expressions to extract the data. The HTML fragment I am parsing is as follows:

 <td><a name="hw">HELLOWORLD</a>Hello world</td>

I need to extract the text "Hello world" from the HTML above.

I have tried extracting the text like this:

  $hw := data($item//a[@name='hw']/text())

However, I always get "HELLOWORLD" instead of "Hello world".

Is there a way to extract "Hello world"? Please help.

A follow-up: what if the markup looks like this?

     <td>
       <a name="hw1">HELLOWORLD1</a>Hello world1
       <a name="hw2">HELLOWORLD2</a>Hello world2
       <a name="hw3">HELLOWORLD3</a>Hello world3
     </td>

I would like to extract the text "Hello world2" that sits between hw2 and hw3. I would rather not use text()[3]. Is there some way to extract the text between /a[@name='hw2'] and /a[@name='hw3']?


貪欢 2024-09-13 21:12:49

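Your expression returns "HELLOWORLD" because a/text() selects the text inside the a element. "Hello world" is not inside a; it is a separate text node that follows a as a sibling within td.

First, select the a elements whose name attribute starts with 'hw':

$item//a[starts-with(@name,'hw')]

Then, from each of those a elements, take the first text node that follows it as a sibling:

$item//a[starts-with(@name,'hw')]/following-sibling::text()[1]

This returns one text node per matching a element. To get only the text between hw2 and hw3, narrow the predicate to [@name='hw2'].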
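As a rough, self-contained sketch of the path above applied to the second snippet from the question: the inline td element is only a hypothetical stand-in for whatever $item holds in Web-Harvest, and the $start and $end variable names are made up for illustration. The second expression is an alternative not given in the answer; it uses the XQuery node-order operator << to keep only text nodes that come before hw3.

```xquery
(: Hypothetical stand-in for the $item variable from the question. :)
let $item :=
  <td>
    <a name="hw1">HELLOWORLD1</a>Hello world1
    <a name="hw2">HELLOWORLD2</a>Hello world2
    <a name="hw3">HELLOWORLD3</a>Hello world3
  </td>
(: Illustrative names for the two anchors that bound the wanted text. :)
let $start := $item//a[@name = 'hw2']
let $end   := $item//a[@name = 'hw3']
return (
  (: First text node that follows hw2 as a sibling. :)
  normalize-space($start/following-sibling::text()[1]),
  (: All sibling text nodes after hw2 that precede hw3 in document order. :)
  normalize-space(
    string-join($start/following-sibling::text()[. << $end], '')
  )
)
```

Both items in the result should be the string Hello world2; normalize-space is there only to trim the line break and indentation that trail the text in the source markup. The first form is simpler when the text always directly follows the anchor, while the second may be preferable if other elements can appear between the two anchors.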
