I do a lot of HTML parsing in my line of work. Up until now, I was using the HtmlUnit headless browser for parsing and browser automation.
Now, I want to separate the two tasks.
I want to use a lightweight HTML parser, because HtmlUnit takes a long time to first load a page, then get the source, and then parse it.
I want to know which HTML parser can parse HTML efficiently. I need
- Speed
- An easy way to locate any HtmlElement by its "id", "name", or "tag type".
It is OK if the parser doesn't clean up dirty HTML; I don't need to clean any HTML source. I just need the easiest way to move across HtmlElements and harvest data from them.
Comments (3)
jsoup
Self plug: I have just released a new Java HTML parser: jsoup. I mention it here because I think it will do what you are after.
Its party trick is a CSS selector syntax to find elements, e.g. `doc.select("a")` to find all the links in a parsed document.
See the Selector javadoc for more info.
This is a new project, so any ideas for improvement are very welcome!
The best I've seen so far is HtmlCleaner.
With HtmlCleaner you can locate any element using XPath.
For other HTML parsers, see this SO question.
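
To give a feel for what XPath lookups by id, name, and tag look like, here is a minimal sketch. Note that it does not use HtmlCleaner: it swaps in the JDK's built-in `javax.xml.xpath` over a small, well-formed sample, so it only illustrates the shape of the expressions. The JDK XML parser will reject dirty real-world HTML, which is exactly where a tolerant parser such as HtmlCleaner comes in. The class name, the `select` helper, and the sample markup are all made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class; uses only the JDK, not HtmlCleaner.
class XPathLookupSketch {

    // Made-up, well-formed sample markup (an assumption, not from the post).
    static final String SAMPLE =
        "<html><body>"
        + "<div id=\"price\">42</div>"
        + "<input name=\"q\" value=\"jsoup\"/>"
        + "<p>first</p><p>second</p>"
        + "</body></html>";

    // Parses the markup as XML and returns the text content of every node
    // matched by the XPath expression.
    static List<String> select(String xhtml, String expression) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
            .evaluate(expression, doc, XPathConstants.NODESET);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            result.add(nodes.item(i).getTextContent());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(select(SAMPLE, "//*[@id='price']"));      // by id
        System.out.println(select(SAMPLE, "//*[@name='q']/@value")); // by name
        System.out.println(select(SAMPLE, "//p"));                   // by tag type
    }
}
```

The same three expression shapes (`//*[@id='...']`, `//*[@name='...']`, `//tagname`) cover the "id", "name", and "tag type" lookups asked about in the question.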
I suggest Validator.nu's parser, which is based on the HTML5 parsing algorithm. It has been the parser used in Mozilla since 2010-05-03.