How to extract meaningful text from HTML
I would like to parse an HTML page and extract the meaningful text from it. Does anyone know of good algorithms for doing this?
I develop my applications on Rails, but I think Ruby is a bit slow for this, so a good C library would be appropriate if one exists.
Thanks!
PS: Please do not recommend anything that uses Java.
UPDATE:
I found this link text
Sadly, it is in Python.
Comments (4)
Use Nokogiri for Ruby; it is fast and written in C.
(Using regular expressions to parse recursive structures like HTML is notoriously difficult and error-prone, and I would not go down that path. I only mention this because the issue seems to crop up again and again.)
With a real parser such as Nokogiri, you also get the added benefit that the structure and logic of the HTML document are preserved, and sometimes you really need those clues.
Solutions integrating with Ruby
External Solutions
Lynx can do this. It is open source, if you want to take a look at how it works.
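If shelling out from Ruby is acceptable, one way to try this is to pipe the HTML into Lynx and capture its rendered output. This sketch assumes a `lynx` binary is on the PATH; the helper name `html_to_text_with_lynx` is hypothetical, and the exact layout of the text depends on the Lynx version and configuration.

```ruby
require 'open3'

# Hypothetical helper: renders HTML to plain text by piping it through lynx.
def html_to_text_with_lynx(html)
  out, status = Open3.capture2(
    'lynx',
    '-dump',        # print the rendered page to stdout and exit
    '-nolist',      # omit the numbered list of links at the end
    '-force_html',  # treat the input as HTML
    '-stdin',       # read the document from standard input
    stdin_data: html
  )
  raise "lynx exited with status #{status.exitstatus}" unless status.success?
  out
end
```

Spawning a process per page has a cost, so this probably suits batch jobs better than per-request work in a Rails action.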
You should strip every angle-bracketed part from the text and then collapse the whitespace.
In theory, the characters < and > should not appear anywhere else, because pages use the entities &lt; and &gt; in their place.
Collapsing whitespace: convert all tabs, newlines, etc. to spaces, then replace every run of spaces with a single space.
UPDATE: You should also start only after finding the <body> tag.
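The steps in this answer can be sketched in plain Ruby as follows. The helper name `naive_html_to_text` is made up for illustration, and two steps are my own additions rather than part of the answer: removing `script` and `style` elements together with their contents, and decoding entities with `CGI.unescapeHTML`. It is a naive approach and will misbehave on input such as comments or attribute values that contain `>`.

```ruby
require 'cgi'

# Hypothetical helper following the answer's steps.
def naive_html_to_text(html)
  # Start after the opening <body> tag; use the whole input if there is none.
  body = html[/<body\b[^>]*>(.*)/im, 1] || html

  # Addition (not in the answer): drop script/style elements and their contents.
  body = body.gsub(%r{<(script|style)\b[^>]*>.*?</\1>}im, ' ')

  # Strip every remaining angle-bracketed part.
  text = body.gsub(/<[^>]*>/, ' ')

  # Addition (not in the answer): decode entities such as &lt; and &gt;.
  text = CGI.unescapeHTML(text)

  # Collapse tabs, newlines and runs of spaces into single spaces.
  text.gsub(/\s+/, ' ').strip
end

page = "<html><head><title>Ignored</title></head><body class=\"x\">\n" \
       "<h1>Title</h1>\n<p>a &lt; b\tand   c</p><script>var x = 1;</script></body></html>"
puts naive_html_to_text(page)  # prints "Title a < b and c"
```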