Retrieving a Facebook-style link summary (title, summary, relevant images) with Python
I would like to replicate the functionality that Facebook uses to parse a link. When you submit a link into your Facebook status, their system goes out and retrieves a suggested title, a summary, and often one or more relevant images from that page, from which you can choose a thumbnail.
My application needs to accomplish this using Python, but I am open to any kind of guide, blog post, or other developers' experience that relates to this and might help me figure out how to accomplish it.
I would really like to learn from other people's experience before just jumping in.
To be clear, when given the URL of a web page, I want to be able to retrieve:
- The title: probably just the `<title>` tag, but possibly the `<h1>`; not sure.
- A one-paragraph summary of the page.
- A bunch of relevant images that could be used as a thumbnail. (The tricky part is filtering out irrelevant images such as banners or rounded corners.)
I may have to implement it myself, but I would at least want to know about how other people have been doing these kinds of tasks.
2 Answers
BeautifulSoup is well-suited to accomplish most of this. Basically, you simply initialize the `soup` object and then extract the elements you are interested in. You could then download each of the images based on its URL using `urllib2`. The title is fairly simple, but the images could be a bit more difficult, since you have to download each one to get the relevant stats on it. Perhaps you could filter out most of the images based on size and number of colors? Rounded corners, for example, are going to be small and generally have only 1-2 colors.
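A minimal sketch of that approach, assuming Python 3 (where `urllib2`'s role is played by `urllib.request`) plus the third-party `beautifulsoup4` and `Pillow` packages. The function names and the 50-pixel / 8-color thresholds are illustrative choices, not from the original answer:

```python
from io import BytesIO
from bs4 import BeautifulSoup
from PIL import Image

def extract_title_and_image_urls(html):
    """Parse the page and collect the <title> text and all <img> sources."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.string.strip() if soup.title and soup.title.string else None
    image_urls = [img["src"] for img in soup.find_all("img") if img.get("src")]
    return title, image_urls

def is_plausible_thumbnail(image_bytes, min_size=50, min_colors=8):
    """Heuristic filter: reject tiny images and near-flat-color ones
    (rounded-corner sprites, spacers), per the size/color idea above."""
    img = Image.open(BytesIO(image_bytes))
    width, height = img.size
    if width < min_size or height < min_size:
        return False
    # getcolors() returns None once more than `maxcolors` distinct colors
    # are found -- exactly the "rich enough to be a photo" case.
    colors = img.convert("RGB").getcolors(maxcolors=min_colors)
    return colors is None or len(colors) >= min_colors
```

Each candidate URL would still need to be fetched (e.g. with `urllib.request.urlopen`) before its bytes can be passed to the filter.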
As for the page summary, that may be a bit more difficult, but I've been doing something like this: remove all of the style, script, form, and head blocks from the `html` using `.findAll` followed by `.extract`, then join the remaining text nodes with `.join(soup.findAll(text = True))`. In your application, perhaps you could use this text content as the page summary? I hope this helps.
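A sketch of that strip-and-join idea, again assuming `beautifulsoup4`; the `extract_summary` name and the 300-character cap are illustrative, not from the original answer:

```python
from bs4 import BeautifulSoup

def extract_summary(html, max_chars=300):
    """Drop non-prose blocks, then join the visible text as a rough summary."""
    soup = BeautifulSoup(html, "html.parser")
    # Remove all of the style, script, form and head blocks first.
    for tag in soup.find_all(["style", "script", "form", "head"]):
        tag.extract()
    # Join whatever readable text is left, then truncate to one paragraph's worth.
    text = " ".join(chunk.strip() for chunk in soup.find_all(text=True) if chunk.strip())
    return text[:max_chars]
```

Truncating at a fixed character count is crude; picking the first `<p>` with enough text, or the `meta description` tag when present, would likely give a cleaner summary.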
Here's a complete solution: https://github.com/svven/summary