How to get the description of a URL

Posted 2024-07-08 23:06:09 · 417 words · 8 views · 0 comments

I have a list of URLs and am trying to collect their "descriptions." By description I mean the text that comes up, for example, when you Google the link. For example, Googling http://stackoverflow.com shows the description as

A language-independent collaboratively
edited question and answer site for
programmers. Questions and answers
displayed by user votes and tags.

This is the data I'm trying to accumulate for the URLs I have.

I tried parsing the pages' meta descriptions, but most of them lack one (yet Google and other search engines manage to get a description somehow).

Any ideas? Should I just "google" each link and scrape the data? I have a feeling Google wouldn't like this...

Thanks guys.

Comments (7)

む无字情书 2024-07-15 23:06:09

Different search engines have different algorithms for extracting a description from a page when it lacks the description meta tag. Some ignore the tag even if it's there.

If you want the description Google has, the most accurate way to get it would be to scrape it. Otherwise, you could write your own or look around on the web for code that does it.

榕城若虚 2024-07-15 23:06:09

These are called snippets.

Google use proprietary (and possibly patented) methods to garner this information, so there is no simple answer.

As you suggest, they will use meta-description information if it is there. (How to set the meta-information to help Google.)

They will also honour requests from the page authors to NOT include snippets. (How to prevent Google from displaying snippets) You should probably respect this too (as well as robots.txt, of course.)

You may have some luck with existing auto-summary packages, such as OTS.
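
If you do end up building your own snippets, here is a rough sketch of the "respect this too" part, using only the Python standard library. It assumes you have already downloaded the robots.txt text and the page HTML yourself; the function names and the "example-bot" user agent are made up for illustration, and this is not how Google does it.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt, url, user_agent="example-bot"):
    """Return True if the given robots.txt text permits fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


class NoSnippetParser(HTMLParser):
    """Sets nosnippet to True if a robots meta tag asks for no snippet."""

    def __init__(self):
        super().__init__()
        self.nosnippet = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() in ("robots", "googlebot"):
            content = attr_map.get("content", "")
            directives = [d.strip().lower() for d in content.split(",")]
            if "nosnippet" in directives:
                self.nosnippet = True


def forbids_snippet(html):
    """Return True if the page carries a nosnippet robots meta directive."""
    parser = NoSnippetParser()
    parser.feed(html)
    return parser.nosnippet
```

Call allowed_by_robots() before fetching a page and forbids_snippet() before storing any summary of it.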

欲拥i 2024-07-15 23:06:09

You may want to check AboutUs.org (e.g. http://www.aboutus.org/StackOverflow.com).
However, there's little chance that a site will have an AboutUs page and not have a meta description.

早乙女 2024-07-15 23:06:09

Some info that might explain how google does this:

看春风乍起 2024-07-15 23:06:09

I am not familiar with Google APIs, but perhaps there is an official way to get such information.

那请放手 2024-07-15 23:06:09

Interesting. Some sources are better than others.

For "audiotuts.com", Google has a worse description than AboutUs.com.

Google:

Nov 18th in General by Joel Falconer ·
1. Recently, an AUDIOTUTS reader asked me about creative process. While this
is a topic that can’t be made into a
...

AboutUs.com:

AUDIOTUTS is a blog/tutorial site for
musicians, producers and audio
junkies! It is the sister site of the
popular PSDTUTS, VECTORTUTS and
NETTUTS.

I hate problems like these... they should be trivial but they aren't!

旧伤慢歌 2024-07-15 23:06:09

If you can assume English content, you can first look for Meta Description, and if that doesn't work, you can look for the first two or three sentence-like word sequences.
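
The first step might look roughly like this with Python's standard html.parser. This is a minimal sketch, not production code: the class and function names are invented for the example, and also accepting Open Graph's og:description is my own assumption rather than something from the thread.

```python
from html.parser import HTMLParser


class MetaDescriptionParser(HTMLParser):
    """Collects the content of the first non-empty description meta tag."""

    def __init__(self):
        super().__init__()
        self.description = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.description is not None:
            return
        attr_map = {name: (value or "") for name, value in attrs}
        # Assumption: treat <meta property="og:description"> as equivalent.
        key = attr_map.get("name") or attr_map.get("property") or ""
        if key.lower() in ("description", "og:description"):
            content = attr_map.get("content", "").strip()
            if content:
                self.description = content


def meta_description(html):
    """Return the page's meta description, or None if it has none."""
    parser = MetaDescriptionParser()
    parser.feed(html)
    return parser.description
```

When meta_description() returns None, you fall through to the sentence heuristic.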

A product I worked on looked for the first P or DIV that contained more than one sequence of > n "words" delimited by periods. It would use the two or three sentence-like sequences, up to x total words, as a summary paragraph. It wasn't 100% accurate, but good enough for the average case. The number of words was adjusted a few times to eliminate things like navigation elements.
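
A rough reconstruction of that heuristic in Python, for anyone who wants a starting point. This is a sketch of the idea as described, not the original product's code: the names and the default values for n (min_words), the sentence count, and x (max_words) are guesses you would need to tune.

```python
import re
from html.parser import HTMLParser


class BlockTextParser(HTMLParser):
    """Gathers text per P/DIV, attributed to the innermost open block."""

    BLOCK_TAGS = ("p", "div")
    SKIP_TAGS = ("script", "style")

    def __init__(self):
        super().__init__()
        self.blocks = []      # one list of text chunks per block, in open order
        self.stack = []       # (tag, block index) for currently open blocks
        self.skip_depth = 0   # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self.skip_depth += 1
        elif tag in self.BLOCK_TAGS:
            self.blocks.append([])
            self.stack.append((tag, len(self.blocks) - 1))

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in self.BLOCK_TAGS:
            # Close the nearest open block of this type (tolerates unclosed <p>).
            for i in range(len(self.stack) - 1, -1, -1):
                if self.stack[i][0] == tag:
                    del self.stack[i:]
                    break

    def handle_data(self, data):
        if self.stack and self.skip_depth == 0:
            self.blocks[self.stack[-1][1]].append(data)


def summarize(html, min_words=5, max_sentences=3, max_words=40):
    """Summarize using the first block with 2+ sentence-like sequences."""
    parser = BlockTextParser()
    parser.feed(html)
    for chunks in parser.blocks:
        text = " ".join("".join(chunks).split())
        # Split on periods followed by whitespace or end of text.
        sequences = [s.strip() for s in re.split(r"\.(?:\s+|$)", text)]
        sentences = [s for s in sequences if len(s.split()) > min_words]
        if len(sentences) > 1:
            words = " ".join(s + "." for s in sentences[:max_sentences]).split()
            return " ".join(words[:max_words])
    return None
```

Navigation blocks such as "Home. About. Contact." are skipped because none of their period-delimited pieces has more than min_words words.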
