What is the best way to parse poorly formed XHTML pages for a Java application?
I want to be able to grab content from web pages, especially the tags and the content within them. I have tried XQuery and XPath, but they don't seem to work for malformed XHTML, and regex is just a pain.
Is there a better solution? Ideally I would like to be able to ask for all the links and get back an array of URLs, or ask for the text of the links and get back an array of Strings with the link text, or ask for all the bold text, etc.
4 Answers
Run the XHTML through something like JTidy, which should give you back valid XML.
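A minimal sketch of that approach, assuming JTidy is on the classpath; the URL, XPath expression, and Tidy options are placeholders you would adapt. It has Tidy emit cleaned XML, re-parses it with a standard DOM parser, and then uses ordinary XPath to collect the link URLs:

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.w3c.tidy.Tidy;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

public class TidyExample {
    public static void main(String[] args) throws Exception {
        // Fetch the page (example URL is a placeholder)
        InputStream in = new URL("http://example.com").openStream();

        // Let JTidy repair the markup and write it out as well-formed XML
        Tidy tidy = new Tidy();
        tidy.setXmlOut(true);
        tidy.setDocType("omit");   // avoid a DOCTYPE so the re-parse doesn't fetch a DTD
        tidy.setQuiet(true);
        tidy.setShowWarnings(false);
        ByteArrayOutputStream cleaned = new ByteArrayOutputStream();
        tidy.parse(in, cleaned);

        // Re-parse the cleaned markup with a standard DOM parser
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(cleaned.toByteArray()));

        // Now normal XPath works: collect the href of every <a> tag
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList anchors = (NodeList) xpath.evaluate("//a/@href", doc, XPathConstants.NODESET);

        List<String> links = new ArrayList<String>();
        for (int i = 0; i < anchors.getLength(); i++) {
            links.add(anchors.item(i).getNodeValue());
        }
        System.out.println(links);
    }
}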
You may want to look at Watij. I have only used its Ruby cousin, Watir, but with it I was able to load a webpage and request all URLs of the page in exactly the manner you describe.
It was very easy to work with; it literally fires up a web browser and gives you back information in a nice form. IE support seemed best, but at least with Watir, Firefox was also supported.
I had some problems with JTidy back in the day. I think it was related to unclosed tags that made JTidy fail. I don't know if that's fixed now. I ended up using something that was a wrapper around TagSoup, although I don't remember the exact project name. There's also HTMLCleaner.
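For the TagSoup route, a hedged sketch of what that wrapper-style usage could look like (the URL is a placeholder): TagSoup exposes a SAX parser that tolerates broken markup, so an identity transform turns it into a regular DOM you can query with XPath:

import org.ccil.cowan.tagsoup.Parser;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXSource;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import java.net.URL;

public class TagSoupExample {
    public static void main(String[] args) throws Exception {
        // TagSoup's Parser is a SAX XMLReader that accepts messy HTML
        Parser parser = new Parser();
        SAXSource source = new SAXSource(parser,
                new InputSource(new URL("http://example.com").openStream()));

        // An identity transform turns the SAX events into a DOM tree
        DOMResult result = new DOMResult();
        TransformerFactory.newInstance().newTransformer().transform(source, result);
        Document doc = (Document) result.getNode();

        // TagSoup places elements in the XHTML namespace, so match by local name
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList anchors = (NodeList) xpath.evaluate(
                "//*[local-name()='a']/@href", doc, XPathConstants.NODESET);
        for (int i = 0; i < anchors.getLength(); i++) {
            System.out.println(anchors.item(i).getNodeValue());
        }
    }
}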
I've used http://htmlparser.sourceforge.net/. It can parse poorly formed HTML and makes data extraction quite easy.
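A rough sketch of link extraction with that library (HTML Parser); the URL is a placeholder, and this assumes the 1.x API with its filter classes:

import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeList;

import java.util.ArrayList;
import java.util.List;

public class HtmlParserExample {
    public static void main(String[] args) throws Exception {
        // Parser fetches and parses the page, tolerating malformed markup
        Parser parser = new Parser("http://example.com");

        // Keep only <a> tags
        NodeList links = parser.extractAllNodesThatMatch(new NodeClassFilter(LinkTag.class));

        List<String> urls = new ArrayList<String>();
        List<String> texts = new ArrayList<String>();
        for (int i = 0; i < links.size(); i++) {
            LinkTag link = (LinkTag) links.elementAt(i);
            urls.add(link.getLink());       // the href value
            texts.add(link.getLinkText());  // the anchor text
        }
        System.out.println(urls);
        System.out.println(texts);
    }
}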