Web page summarization with Ruby
Can anyone recommend a Ruby library for creating a summary of a given URL? What I have in mind is the sort of one- or two-sentence summary as seen in search engine results.
Comments (2)
You could just scrape the web page for the description meta tag or, if that's not available, the first few sentences from the first <p> element on the page. The description meta tag looks like this: <meta name="description" content="A short description of the page">
There are several Ruby libraries for parsing HTML. I hear that Nokogiri is good for this sort of thing, but I have no experience with it personally.
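As a rough illustration of that approach, here is a minimal sketch that uses only the Ruby standard library. The method name `page_description` and the `max_sentences` parameter are made up for this example, and the regular expressions are a crude stand-in for a real HTML parser such as Nokogiri, which would be the more robust choice on real pages.

```ruby
require "cgi"

# Crude, regex-based sketch (an assumption for illustration, not a robust parser).
# Returns the description meta tag if present, otherwise the first few
# sentences of the first <p> element, otherwise nil.
def page_description(html, max_sentences: 2)
  # Find a <meta ... name="description" ...> tag; attribute order does not matter.
  meta = html[/<meta\s+[^>]*name=["']description["'][^>]*>/i]
  if meta && (content = meta[/content=(["'])(.*?)\1/im, 2])
    text = CGI.unescapeHTML(content).strip
    return text unless text.empty?
  end

  # Fall back to the first <p> element on the page.
  paragraph = html[/<p\b[^>]*>(.*?)<\/p>/im, 1]
  return nil unless paragraph

  # Strip inline tags, decode entities, and collapse whitespace.
  text = CGI.unescapeHTML(paragraph.gsub(/<[^>]+>/, "")).gsub(/\s+/, " ").strip
  # Keep only the first few sentences.
  text.split(/(?<=[.!?])\s+/).first(max_sentences).join(" ")
end
```

Fetching the HTML itself is left out; the string could come from `Net::HTTP.get(URI(url))` or any HTTP client.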
Spidering a site and scraping pages is easy. Summarizing a page is difficult.
The meta tags can help a little, as there is supposed to be a direct correlation between the summary and the content.
Unfortunately, not all pages have them, and many that do are inaccurate. That leaves us having to scrape text and hope that it's pertinent to the content and context. Page layouts vary, and there is no standard saying where on a page the main content actually lies. Because of CSS and Ajax, it might not be where we'd expect it, in the first couple of lines of text. There might not even be <p> tags, as a <div> or <span> with the appropriate CSS can replace the look.
I've written many spiders that did contextual analysis of pages, trying to summarize them, and it's ugly and not bulletproof, especially when dealing with the English language, because homonyms, synonyms, and other "nyms" get in the way.
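To make the layout problem concrete, here is one possible heuristic for pages without <p> tags: treat the longest run of text between block-level tags as the main content. This is only a sketch under that assumption; the method name `longest_text_block` and the list of tags are invented for the example, and plenty of real pages will defeat it.

```ruby
require "cgi"

# Hypothetical heuristic: the longest text run between block-level tags is
# assumed to be the main content. Returns nil when there is no text at all.
def longest_text_block(html)
  # Drop scripts, styles, and comments, which never hold readable content.
  cleaned = html.gsub(/<(script|style)\b.*?<\/\1>/im, " ").gsub(/<!--.*?-->/m, " ")
  # Split on block-level tags; inline tags such as <span> or <a> stay inside a block.
  blocks = cleaned.split(/<\/?(?:p|div|section|article|li|td|h[1-6]|br|body|header|footer|nav)\b[^>]*>/i)
  # Strip the remaining inline tags, decode entities, and collapse whitespace.
  texts = blocks.map { |b| CGI.unescapeHTML(b.gsub(/<[^>]+>/, "")).gsub(/\s+/, " ").strip }
  texts.max_by(&:length)
end
```

Content injected later by Ajax will not be in the fetched HTML at all, so no amount of parsing helps there.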
If you can locate text to summarize, there are decent tools to reduce several paragraphs, or a paper, into a short sentence. Mac OS comes with a summarizer, and has for years. "Summarize Text Using Mac OSX Summarize Or Microsoft Word AutoSummarize" talks about enabling it if you want to experiment. "Mac 101: Shorten text using the Summarize Service" is about using it on the Mac. There's a driver or app for it that can be called from the CLI. See "How to use Mac OS X's Summary Service on the command line?" for more info.
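For experimenting without those tools, a naive extractive summarizer is easy to sketch in plain Ruby. This is not how the Mac OS service or Word's AutoSummarize work; it is a simple word-frequency heuristic, and the method name `summarize`, its `sentences` parameter, and the short stop-word list are all assumptions made for the example.

```ruby
# Naive extractive summarizer: score each sentence by the average frequency
# of its non-stop-words across the whole text, then keep the best ones.
def summarize(text, sentences: 1)
  # A tiny, hypothetical stop-word list; a real one would be much longer.
  stop_words = %w[a an and are as at be but by for from has have in is it its
                  of on or that the this to was we were with]

  parts = text.gsub(/\s+/, " ").strip.split(/(?<=[.!?])\s+/)
  return "" if parts.empty?

  tokenize = ->(s) { s.downcase.scan(/[a-z']+/) - stop_words }
  freq = tokenize.call(text).tally

  ranked = parts.each_with_index.sort_by do |sentence, index|
    words = tokenize.call(sentence)
    score = words.empty? ? 0.0 : words.sum { |w| freq[w] }.to_f / words.size
    [-score, index] # highest score first; earlier sentences win ties
  end

  # Restore the original order so the summary still reads naturally.
  ranked.first(sentences).sort_by { |_, index| index }.map(&:first).join(" ")
end
```

Averaging the scores keeps long sentences from winning on length alone, but it knows nothing about meaning, so the homonym and synonym problems mentioned earlier apply in full.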
And, as a demo, here's Lincoln's Gettysburg Address summarized to one line: