How to extract data using JTidy and XPath

Posted on 2024-11-29 09:38:34

I have to extract the company name and face value from
http://money.rediff.com/companies/20-microns-ltd/15110088

I noticed that this task could be accomplished using the XPath API.
Since this is an HTML page, I am using the JTidy parser.

This is the XPath for the face value that I have to extract:

/html/body/div[4]/div[6]/div[9]/div/table/tbody/tr[4]/td[2]

This is my code:

URL oracle = new URL("http://money.rediff.com/companies/20-microns-ltd/15110088");
URLConnection yc = oracle.openConnection();
InputStream is = yc.getInputStream();
is = oracle.openStream();
Tidy tidy = new Tidy();
tidy.setQuiet(true);
tidy.setShowWarnings(false);
Document tidyDOM = tidy.parseDOM(is, null);
XPathFactory xPathFactory = XPathFactory.newInstance();
XPath xPath = xPathFactory.newXPath();
String expression = "/html";
XPathExpression xPathExpression = xPath.compile(expression);
Object result = xPathExpression.evaluate(tidyDOM,XPathConstants.NODESET);
System.out.println(result.toString());

Please guide me further, because I cannot find the right solution for the above.

-柠檬树下少年和吉他 2024-12-06 09:38:34

Try not to use "full" XPaths.

//div[@id='leftcontainer']//div[9]//table//tr[4]/td[2]

is better than

/html/body/.../.../.../.../.../...

Most HTML pages are not valid, or even well-formed, so the DOM structure may change when the page is processed by a "real-world" HTML parser. For example, a <tbody> may be inserted under <table> if there isn't one. Things get worse when different HTML parsers generate different DOM trees: an XPath that works with one parser may not work with another. I would rather use "wildcards" (the // descendant step), as in table//tr[4], instead of table/tbody/tr[4] or table/tr[4], so that I can forget about <tbody>. Such expressions are more robust against messy real-world HTML pages.
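To make the <tbody> point concrete, here is a minimal sketch that uses only the JDK's own XML parser and XPath API. The markup, the class name RobustXPathDemo and the helper extract are all made up for illustration; they are not the real rediff page, and a well-formed string stands in for what JTidy or a browser would produce. It evaluates one relative expression and one absolute expression against the same table, with and without a <tbody>.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

class RobustXPathDemo {

    // Hypothetical markup: the table as a page author might write it ...
    static final String WITHOUT_TBODY =
        "<html><body><div id='leftcontainer'><table>"
        + "<tr><td>Name</td><td>Example Ltd</td></tr>"
        + "<tr><td>Face value</td><td>5.0</td></tr>"
        + "</table></div></body></html>";

    // ... and the same table as a parser that inserts <tbody> might expose it.
    static final String WITH_TBODY = WITHOUT_TBODY
        .replace("<table>", "<table><tbody>")
        .replace("</table>", "</tbody></table>");

    // Parses the markup into a DOM and returns the string value of the first
    // node matched by the expression (an empty string if nothing matches).
    static String extract(String markup, String expression) {
        try {
            Document dom = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(markup.getBytes(StandardCharsets.UTF_8)));
            XPath xPath = XPathFactory.newInstance().newXPath();
            return (String) xPath.evaluate(expression, dom, XPathConstants.STRING);
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        String relative = "//div[@id='leftcontainer']//table//tr[2]/td[2]";
        String absolute = "/html/body/div/table/tbody/tr[2]/td[2]";

        // The relative expression matches in both trees.
        System.out.println("relative, no tbody:   [" + extract(WITHOUT_TBODY, relative) + "]");
        System.out.println("relative, with tbody: [" + extract(WITH_TBODY, relative) + "]");

        // The absolute expression only matches when <tbody> is present.
        System.out.println("absolute, no tbody:   [" + extract(WITHOUT_TBODY, absolute) + "]");
        System.out.println("absolute, with tbody: [" + extract(WITH_TBODY, absolute) + "]");
    }
}
```

Asking for XPathConstants.STRING returns the text of the first matching node directly, which is usually what you want for a single cell. With XPathConstants.NODESET you would get a NodeList and have to read each node's text content yourself.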

You can use Firepath, a plugin for Firebug (which is itself a plugin for Firefox), to debug XPath expressions.

P.S. You can try my JHQL (http://github.com/wks/jhql) project for exactly this task. You will like it if you have more pages to extract data from.
