Web page summarization with Ruby

Posted on 2024-08-05 16:33:41

Can anyone recommend a Ruby library for creating a summary of a given URL? What I have in mind is the sort of one- or two-sentence summary as seen in search engine results.

Comments (2)

朮生 2024-08-12 16:33:41

You could just scrape the web page for the description meta tag or, if that's not available, the first few sentences from the first <p> element on the page. The description meta tag looks like this:

<meta name="description" content="Nokogiri (鋸) is an HTML, XML, SAX, and Reader parser with XPath and CSS selector support." />

There are several Ruby libraries for parsing HTML. I hear that Nokogiri is good for this sort of stuff, but I have no experience with it personally.

幸福还没到 2024-08-12 16:33:41

Spidering a site and scraping pages is easy. Summarizing a page is difficult.

The metatags can help a little, as there is supposed to be a direct correlation between the summary and the content.

Unfortunately, not all pages have them, and many that do are inaccurate. That leaves us having to scrape text, hoping that it's pertinent to the content and context. Page layouts vary, and there is no standard saying where on a page the main content actually lies; because of CSS and Ajax, it might not be where we'd expect it, in the first couple of lines of text. There might not be <p> tags, as a <div> or <span> with the appropriate CSS can replace the look.

I've written many spiders that did contextual analysis of the pages, trying to summarize, and it's ugly and not bullet-proof, especially when dealing with the English language because of homonyms, synonyms, and other "nyms" that get in the way.

If you can locate text to summarize, there are decent tools to reduce several paragraphs, or a paper, into a short sentence. Mac OS comes with a summarizer, and has for years. "Summarize Text Using Mac OSX Summarize Or Microsoft Word AutoSummarize" talks about enabling it if you want to experiment. "Mac 101: Shorten text using the Summarize Service" is about using it on the Mac. There's a driver or app for it that can be called from the CLI. See "How to use Mac OS X's Summary Service on the command line?" for more info.

And, as a demo, here's Lincoln's Gettysburg Address summarized to one line:

It is rather for us to be here dedicated to the great task remaining before us—that from these honored dead we take increased devotion to that cause for which they gave the last full measure of devotion—that we here highly resolve that these dead shall not have died in vain—that this nation, under God, shall have a new birth of freedom—and that government of the people, by the people, for the people, shall not perish from the earth.
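For experimenting without the Mac OS service, here is a naive extractive summarizer in plain Ruby. It is not how the Mac summarizer works; it is a simple word-frequency heuristic that returns the sentence whose words are, on average, the most common in the text. The name `top_sentence` and the tiny stop-word list are assumptions made for this sketch, and `tally` requires Ruby 2.7 or later:

```ruby
# A tiny, hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = %w[a an and are as at be by for from in is it of on or that the this to was with].freeze

# Hypothetical helper: returns the highest-scoring sentence, or nil for empty text.
def top_sentence(text)
  sentences = text.strip.split(/(?<=[.!?])\s+/)
  return nil if sentences.empty?

  # Lowercase words with stop words removed.
  words = ->(s) { s.downcase.scan(/[a-z']+/) - STOP_WORDS }

  # How often each remaining word occurs in the whole text.
  frequency = words.call(text).tally

  # Score each sentence by the average frequency of its words, so long
  # sentences do not win on length alone.
  sentences.max_by do |sentence|
    tokens = words.call(sentence)
    tokens.empty? ? 0 : tokens.sum { |w| frequency[w] }.to_f / tokens.length
  end
end

puts top_sentence("Ruby is a language. Ruby parsers read Ruby code. Cats sleep.")
# prints "Ruby is a language."
```

This ignores the homonym and synonym problems mentioned above, which is a large part of why real summarizers are harder to build than this.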
