How to extract data using JTidy and XPath
I have to extract the company name and face value from
http://money.rediff.com/companies/20-microns-ltd/15110088
I noticed that this task can be accomplished with the XPath API. Since this is an HTML page, I am using the JTidy parser.
This is the XPath for the face value that I have to extract:
/html/body/div[4]/div[6]/div[9]/div/table/tbody/tr[4]/td[2]
This is my code:
URL oracle = new URL("http://money.rediff.com/companies/20-microns-ltd/15110088");
URLConnection yc = oracle.openConnection();
InputStream is = yc.getInputStream();
is = oracle.openStream();
Tidy tidy = new Tidy();
tidy.setQuiet(true);
tidy.setShowWarnings(false);
Document tidyDOM = tidy.parseDOM(is, null);
XPathFactory xPathFactory = XPathFactory.newInstance();
XPath xPath = xPathFactory.newXPath();
String expression = "/html";
XPathExpression xPathExpression = xPath.compile(expression);
Object result = xPathExpression.evaluate(tidyDOM,XPathConstants.NODESET);
System.out.println(result.toString());
Please guide me further, because I cannot find a working solution for the above.
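One thing worth noting about the snippet above: evaluating with XPathConstants.NODESET returns an org.w3c.dom.NodeList, and printing its toString() shows the object rather than the matched text. A minimal sketch of reading the matched nodes follows. It is an illustration under assumptions, not the asker's code: it uses the JDK's built-in XML parser on a small, well-formed, made-up SAMPLE string instead of JTidy and the live page, and the class name NodeSetDemo, the method texts, and the sample values are hypothetical. JTidy's DOM implementation may not support every DOM method used here (getTextContent in particular), so treat that call as something to verify against a JTidy document.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class NodeSetDemo {
    // Made-up, well-formed stand-in for a parsed page (not the real rediff markup).
    static final String SAMPLE =
        "<html><body><h1>Example Co.</h1>"
        + "<table><tr><td>Face Value</td><td>10</td></tr></table>"
        + "</body></html>";

    // Evaluates expr against xml and returns the text of every matched node.
    static List<String> texts(String xml, String expr) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            XPathExpression compiled = XPathFactory.newInstance().newXPath().compile(expr);
            // NODESET results must be cast to NodeList and walked node by node.
            NodeList nodes = (NodeList) compiled.evaluate(doc, XPathConstants.NODESET);
            List<String> out = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                out.add(nodes.item(i).getTextContent().trim());
            }
            return out;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(texts(SAMPLE, "//h1"));
        System.out.println(texts(SAMPLE, "//td"));
    }
}
```

If only a single value is needed, evaluating with XPathConstants.STRING instead of NODESET avoids the loop.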
Comments (1)
Try not to use "full" xpaths.
is better than
Most HTML pages are not valid, or even well-formed, so the DOM structure may change when the page is processed by a "real-world" HTML parser. For example, a <tbody> may be inserted under <table> if there isn't one. Things get worse when different HTML parsers generate different DOM trees, so an XPath may be valid for one parser but not for another. I would rather use "wildcards" like table//tr[4] instead of table/tbody/tr[4] or table/tr[4], so that I can forget about <tbody>. Such expressions are more robust when used against messy real-world HTML pages.
You can use Firepath, a plugin for Firebug (which is itself a plugin for Firefox), to debug XPath expressions.
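The <tbody> point can be illustrated with a small sketch. It assumes two made-up, well-formed documents standing in for the trees that two different HTML parsers might build from the same page (one with <tbody> inserted, one without), and it uses the JDK's XML parser rather than JTidy so that it is self-contained. The class name RobustXPathDemo, the method eval, and the row values are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class RobustXPathDemo {
    // Four made-up rows; the value of interest sits in row 4, column 2.
    static final String ROWS =
        "<tr><td>A</td><td>1</td></tr>"
        + "<tr><td>B</td><td>2</td></tr>"
        + "<tr><td>C</td><td>3</td></tr>"
        + "<tr><td>Face Value</td><td>10</td></tr>";

    // The same table as two parsers might represent it.
    static final String WITH_TBODY =
        "<html><body><table><tbody>" + ROWS + "</tbody></table></body></html>";
    static final String WITHOUT_TBODY =
        "<html><body><table>" + ROWS + "</table></body></html>";

    // Returns the string value of the first match, or "" when nothing matches.
    static String eval(String xml, String expr) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return XPathFactory.newInstance().newXPath().evaluate(expr, doc);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String wildcard = "//table//tr[4]/td[2]";      // ignores whether <tbody> exists
        String strict = "//table/tbody/tr[4]/td[2]";   // requires <tbody>
        System.out.println("wildcard, with tbody:    '" + eval(WITH_TBODY, wildcard) + "'");
        System.out.println("wildcard, without tbody: '" + eval(WITHOUT_TBODY, wildcard) + "'");
        System.out.println("strict, with tbody:      '" + eval(WITH_TBODY, strict) + "'");
        System.out.println("strict, without tbody:   '" + eval(WITHOUT_TBODY, strict) + "'");
    }
}
```

The wildcard expression finds the cell in both trees, while the strict path returns an empty string for the tree without <tbody>.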
P.S. You can try my JHQL project (http://github.com/wks/jhql) for exactly this task. You will like it if you have more pages to extract data from.