Retrieving Facebook-style link summaries (title, summary, relevant images) with Python

Posted 2024-09-10 22:41:03

I would like to replicate the functionality that Facebook uses to parse a link. When you submit a link into your Facebook status, their system goes out and retrieves a suggested title, summary and often one or more relevant images from that page, from which you can choose a thumbnail.

My application needs to accomplish this using Python, but I am open to any kind of a guide, blog post or experience of other developers which relates to this and might help me figure out how to accomplish it.

I would really like to learn from other people's experience before just jumping in.

To be clear, when given the URL of a web page, I want to be able to retrieve:

  1. The title: Probably just the <title> tag but possibly the <h1>, not sure.
  2. A one-paragraph summary of the page.
  3. A bunch of relevant images that could be used as a thumbnail. (The tricky part is to filter out irrelevant images like banners or rounded corners)

I may have to implement it myself, but I would at least want to know about how other people have been doing these kinds of tasks.

Comments (2)

伴随着你 2024-09-17 22:41:03

BeautifulSoup is well-suited to accomplish most of this.

Basically, you simply initialize the soup object, then do something like the following to extract what you are interested in:

title = soup.findAll('title')    # list of <title> tags (usually just one)
images = soup.findAll('img')     # every <img> tag on the page

You could then download each image from its URL using urllib2.

The title is fairly simple, but the images could be a bit more difficult since you have to download each one to get the relevant stats on them. Perhaps you could filter out most of the images based on size and number of colors? Rounded corners, as an example, are going to be small and only have 1-2 colors, generally.
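To make that filtering idea concrete, here is a minimal sketch, assuming a Python 2 environment with urllib2 and PIL available (the function name, thresholds, and timeout are illustrative placeholders, not part of the original answer):

import urllib2
from StringIO import StringIO
from PIL import Image

def filter_thumbnail_candidates(image_urls, min_width=50, min_height=50, min_colors=16):
    """Download each image and keep only the ones that are big and colorful
    enough to plausibly be content rather than banners or rounded corners."""
    candidates = []
    for url in image_urls:
        try:
            data = urllib2.urlopen(url, timeout=10).read()
            img = Image.open(StringIO(data))
        except Exception:
            continue  # unreachable or undecodable image, skip it
        width, height = img.size
        # getcolors() returns None once the image has more than `maxcolors`
        # distinct colors, which is exactly the "rich" case we want to keep.
        colors = img.convert('RGB').getcolors(maxcolors=min_colors)
        if width >= min_width and height >= min_height and colors is None:
            candidates.append((url, img))
    return candidates

The thresholds are guesses; in practice you would tune them against real pages, and you might also rank the survivors by area so the largest image becomes the default thumbnail.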

As for the page summary, that may be a bit more difficult, but I've been doing something like this:

  1. I use BeautifulSoup to remove all style, script, form, and head blocks from the html by using: .findAll, then .extract.
  2. I grab the remaining text using: ' '.join(soup.findAll(text=True)) (both steps are sketched below)

In your application, perhaps you could use this "text" content as the page summary?
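
For reference, a minimal sketch of those two steps, written against the old BeautifulSoup 3 / urllib2 stack this answer assumes (the rough_page_text name is made up for illustration):

from urllib2 import urlopen
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 API, as used above

def rough_page_text(url):
    """Strip style/script/form/head blocks and return the remaining text,
    which can serve as raw material for a one-paragraph summary."""
    soup = BeautifulSoup(urlopen(url).read())
    for tag in soup.findAll(['style', 'script', 'form', 'head']):
        tag.extract()  # drop the tag and everything inside it
    # Collect the text nodes that survived and squeeze the whitespace.
    text = ' '.join(soup.findAll(text=True))
    return ' '.join(text.split())

In practice you would probably truncate this to the first few sentences or a few hundred characters before showing it as the summary.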

I hope this helps.

半城柳色半声笛 2024-09-17 22:41:03

Here's a complete solution: https://github.com/svven/summary

>>> import summary
>>> s = summary.Summary('http://stackoverflow.com/users/76701/ram-rachum')
>>> s.extract()
>>> s.title
u'User Ram Rachum - Stack Overflow'
>>> s.description
u'Israeli Python hacker.'
>>> s.image
https://www.gravatar.com/avatar/d24c45635a5171615a7cdb936f36daad?s=128&d=identicon&r=PG
>>>