Python: Detecting actual text paragraphs in a string
The big mission: I am trying to get a few lines of summary of a webpage. i.e. I want to have a function that takes a URL and returns the most informative paragraph from that page. (Which would usually be the first paragraph of actual content text, in contrast to "junk text", like the navigation bar.)
So I managed to reduce an HTML page to a bunch of text by cutting out the tags, throwing out the <head> and all the scripts. But some of the text is still "junk text". I want to know where the actual paragraphs of text begin. (Ideally it should be human-language-agnostic, but if you have a solution only for English, that might help too.)
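Roughly, that reduction step looks like this (a simplified sketch using Beautiful Soup, which I mention in the update below; html_to_text is just an illustrative name):

```python
from bs4 import BeautifulSoup

def html_to_text(html):
    """Reduce an HTML page to a bunch of plain text."""
    soup = BeautifulSoup(html, "html.parser")
    # Throw out the <head> and all the scripts (and stylesheets).
    for tag in soup(["head", "script", "style"]):
        tag.decompose()
    # get_text() cuts out the remaining tags, keeping only text nodes.
    return soup.get_text(separator="\n")
```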
How can I figure out which of the text is "junk text" and which is actual content?
UPDATE: I see some people have pointed me to use an HTML parsing library. I am using Beautiful Soup. My problem isn't parsing HTML; I have already gotten rid of all the HTML tags. I just have a bunch of text, and I want to separate the content text from the junk text.
4 Answers
A general solution to this problem is non-trivial.
To put this in context, a large part of Google's success with search has come from their ability to automatically discern some semantic meaning from arbitrary Web pages, namely figuring out where the "content" is.
One idea that springs to mind is that if you can crawl many pages from the same site, you will be able to identify patterns. The menu markup will be largely the same between all pages. If you zero this out somehow (and it will need to be fairly "fuzzy"), what's left is the content.
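A crude sketch of that zeroing-out, assuming you already have plain-text dumps of several pages from the site (the exact-match comparison here is the naive version of the "fuzzy" matching you'd need in practice):

```python
from collections import Counter

def strip_repeated_lines(pages, threshold=0.5):
    """pages: plain-text dumps of several pages from the same site.
    Lines that show up on more than `threshold` of the pages are
    treated as menu/footer boilerplate and dropped."""
    counts = Counter(line for page in pages
                     for line in set(page.splitlines()))
    cutoff = threshold * len(pages)
    return ["\n".join(line for line in page.splitlines()
                      if counts[line] <= cutoff)
            for page in pages]
```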
The next step would be to identify the text and what constitutes a boundary. Ideally that would be some HTML paragraphs, but you won't get that lucky most of the time.
A better approach might be to find the RSS feeds for the site and get the content that way, because the feed content will already be stripped down. Ignore any AdSense (or similar) content, and you should be able to get the text.
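If the site does have a feed, the feedparser library makes this approach very short. A sketch, assuming you have already located the feed URL:

```python
import feedparser

def summary_from_feed(feed_url):
    """Return the first entry's summary from a site's RSS/Atom feed."""
    feed = feedparser.parse(feed_url)
    if feed.entries:
        # feedparser normalizes RSS <description> and Atom <summary>
        # into .summary, already free of the site's navigation chrome.
        return feed.entries[0].summary
    return None
```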
Oh, and absolutely throw out your regex code for this. This requires an HTML parser, without question.
You could use the approach outlined at the AI depot blog along with some python code:
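Roughly, that article scores each line by how text-dense it is and keeps the dense runs. A minimal sketch of the same idea applied to already-extracted plain text (the thresholds are arbitrary guesses, not the article's values):

```python
def looks_like_content(line, min_length=80, min_words=10):
    """Crude line classifier: real paragraphs tend to be long,
    word-dense, and end like sentences; nav junk is short."""
    line = line.strip()
    return (len(line) >= min_length
            and len(line.split()) >= min_words
            and line[-1] in '.!?"')

def first_content_paragraph(text):
    for line in text.splitlines():
        if looks_like_content(line):
            return line
    return None
```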
Probably a bit overkill, but you could try nltk, the Natural Language Toolkit. That library is used for parsing natural languages. It's quite a nice library and an interesting subject. If you want to just get sentences from a text, you would do something like:
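A minimal sketch of that call (the "punkt" resource name assumes a reasonably current NLTK):

```python
import nltk
nltk.download("punkt")  # one-time fetch of the sentence tokenizer model

from nltk.tokenize import sent_tokenize

text = "This is one sentence. Here is another one."
print(sent_tokenize(text))
# ['This is one sentence.', 'Here is another one.']
```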
Or you could use the sentences_from_text method from the PunktSentenceTokenizer class. You have to do nltk.download() before you get started.
I'd recommend having a look at what Readability does. Readability strips out all but the actual content of the page and restyles it for easy reading. From my experience, it seems to work very well at detecting the content.
Have a look at its source code (particularly the grabArticle function) and maybe you can get some ideas.
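For a flavor of the scoring grabArticle performs, here is a loose Python paraphrase (Readability itself is JavaScript; the constants below mirror its comma/length heuristic, but this is not its actual code):

```python
def readability_style_score(paragraph):
    """Loose paraphrase of grabArticle's paragraph scoring:
    commas and accumulated length count as evidence of prose."""
    score = 1
    score += paragraph.count(",")
    score += min(len(paragraph) // 100, 3)  # +1 per 100 chars, max 3
    return score

def pick_best_paragraph(paragraphs):
    return max(paragraphs, key=readability_style_score)
```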