Using Pentaho to get the XPath for all td elements in a table
Is there any way to use Pentaho to parse a table's td elements from an HTML page?
Let's say I have this HTML content:
<html>
<body>
<table>
<tr>
<td>info1</td>
<td>info2</td>
</tr>
<tr>
<td>info3</td>
<td>info4</td>
</tr>
</table>
</body>
</html>
In Pentaho I am using the "Get data from XML" step with the following settings:
Content:
Loop XPath: /html/body/table/tr
Fields:
Name: tableData
XPath: td
The data I would like to get is
info1 info2 info3 info4
in any form.
Any help would be truly appreciated!
Comments (2)
I solved it by reading each line of my file as a row. Then I added a Pentaho "User Defined Java Class" step and had it transform my table content with XSLT into a new XML file. Using that XML I was able to get the data needed to complete the task.
Here is what I wrote in "User Defined Java Class":
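The poster's code does not appear in this thread. As a rough sketch of the approach described, and not the original code, the plain Java program below applies an XSLT stylesheet to the sample table and writes every td under a single flat root element. The class name TableCellsXslt, the method flattenCells, the cells root element and the stylesheet itself are all assumptions made for illustration. It also assumes the page is well-formed XML, which real-world HTML often is not. The wiring into the row handling of a "User Defined Java Class" step is left out.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class TableCellsXslt {

    // Hypothetical stylesheet: copies the text of every td in the table
    // into a flat <cells> root, one <td> element per cell.
    static final String XSLT =
        "<xsl:stylesheet version=\"1.0\""
      + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>"
      + "<xsl:template match=\"/\">"
      + "<cells>"
      + "<xsl:for-each select=\"/html/body/table/tr/td\">"
      + "<td><xsl:value-of select=\".\"/></td>"
      + "</xsl:for-each>"
      + "</cells>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Transforms a well-formed (X)HTML string and returns the flattened XML.
    static String flattenCells(String xhtml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
            StringWriter out = new StringWriter();
            transformer.transform(
                new StreamSource(new StringReader(xhtml)),
                new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            // Malformed input or stylesheet ends up here.
            throw new IllegalStateException("XSLT transform failed", e);
        }
    }

    public static void main(String[] args) {
        // The sample table from the question, written on one line.
        String html =
            "<html><body><table>"
          + "<tr><td>info1</td><td>info2</td></tr>"
          + "<tr><td>info3</td><td>info4</td></tr>"
          + "</table></body></html>";
        System.out.println(flattenCells(html));
    }
}
```

With output of this shape, a "Get data from XML" step should be able to use a Loop XPath of /cells/td so that each cell becomes its own row. That pairing is an assumption here and has not been checked against a specific Pentaho version.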
I came across this, so for anyone arriving here now: the parsing to XML can be done with jsoup using the appropriate path. It is a simple library to add, and it works inside the User Defined Java Class step alongside whatever other methods you have. It selects elements with CSS selectors.