How to get the description of a URL
I have a list of URLs and am trying to collect their "descriptions." By description I mean what comes up if you Google the link. For example, Google: http://stackoverflow.com shows the description as
A language-independent collaboratively edited question and answer site for programmers. Questions and answers displayed by user votes and tags.
This is the data I'm trying to accumulate for the URLs I have.
I tried parsing the pages' meta descriptions, but most of them lack one (yet Google and other search engines manage to get a description somehow).
Any ideas? Should I just "google" each link and scrape the data? I have a feeling Google wouldn't like this...
Thanks guys.
Comments (7)
Different search engines have different algorithms for extracting a description from a page that lacks the description meta tag. Some ignore the tag even if it's there.
If you want the description Google has, the most accurate way to get it would be to scrape it. Otherwise, you could write your own or look around on the web for code that does it.
These are called snippets.
Google use proprietary (and possibly patented) methods to garner this information, so there is no simple answer.
As you suggest, they will use meta-description information if it is there. (How to set the meta-information to help Google.)
They will also honour requests from the page authors to NOT include snippets. (How to prevent Google from displaying snippets) You should probably respect this too (as well as robots.txt, of course.)
You may have some luck with existing auto-summary packages, such as OTS.
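As a rough illustration of the "respect this too" advice, here is a minimal Python sketch that checks a robots.txt body and a page's robots meta tag before a summary is generated. The helper names `allowed_by_robots` and `forbids_snippet` are hypothetical, and treating both `robots` and `googlebot` meta tags as binding for your own crawler is an assumption, not something Google requires of third parties.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


class _RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        # Assumption: honour tags addressed to "robots" or "googlebot".
        if attr.get("name", "").lower() in ("robots", "googlebot"):
            for directive in attr.get("content", "").split(","):
                self.directives.add(directive.strip().lower())


def forbids_snippet(html: str) -> bool:
    """Return True if the page asks crawlers not to show a snippet."""
    parser = _RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return "nosnippet" in parser.directives
```

A crawler could call `allowed_by_robots` before downloading a page and `forbids_snippet` before storing any summary of it.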
You may want to check AboutUs.org (e.g. http://www.aboutus.org/StackOverflow.com).
But, there's little chance that the site will have an aboutus page and not have a meta description.
Some info that might explain how Google does this:
I am not familiar with Google APIs, but perhaps there is an official way to get such information.
Interesting. Some sources are better than others.
For "audiotuts.com", Google has a worse description than AboutUs.com.
Google
AboutUs.com:
I hate problems like these... they should be trivial but they aren't!
If you can assume English content, you can first look for Meta Description, and if that doesn't work, you can look for the first two or three sentence-like word sequences.
A product I worked on looked for the first P or DIV that contained more than one sequence of > n "words" delimited by periods. It would use the two or three sentence-like sequences, up to x total words, as a summary paragraph. It wasn't 100% accurate, but good enough for the average case. The number of words was adjusted a few times to eliminate things like navigation elements.
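The heuristic described above could be sketched roughly as follows in Python, using only the standard library. This is not the original product's code: the function name `summarize`, the defaults for `min_words` (the "n"), `max_sentences`, and `max_total_words` (the "x"), and the choice to split sentences on a period followed by whitespace are all assumptions made for illustration.

```python
import re
from html.parser import HTMLParser


class _BlockCollector(HTMLParser):
    """Collects the meta description and the text of each P or DIV."""

    BLOCK_TAGS = {"p", "div"}
    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.meta_description = None
        self.blocks = []
        self._buffer = []
        self._block_depth = 0
        self._skip_depth = 0

    def _flush(self):
        # Close the current text run and normalise its whitespace.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.blocks.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "description":
            content = (attr.get("content") or "").strip()
            if content and self.meta_description is None:
                self.meta_description = content
        elif tag in self.SKIP_TAGS:
            self._skip_depth += 1
        elif tag in self.BLOCK_TAGS:
            self._flush()
            self._block_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.BLOCK_TAGS:
            self._flush()
            self._block_depth = max(0, self._block_depth - 1)

    def handle_data(self, data):
        # Only keep text that sits inside a P or DIV and outside scripts/styles.
        if self._block_depth and not self._skip_depth:
            self._buffer.append(data)


def summarize(html, min_words=4, max_sentences=3, max_total_words=50):
    """Return the meta description, else the first sentence-like block, else None."""
    collector = _BlockCollector()
    collector.feed(html)
    collector.close()
    if collector.meta_description:
        return collector.meta_description
    for block in collector.blocks:
        parts = re.split(r"\.(?:\s+|$)", block)
        sentences = [p.strip() for p in parts if len(p.split()) > min_words]
        if len(sentences) > 1:  # more than one sentence-like sequence
            words = []
            for sentence in sentences[:max_sentences]:
                words.extend((sentence + ".").split())
            return " ".join(words[:max_total_words])
    return None
```

Raising `min_words` is the knob that filters out short runs such as navigation links, which matches the tuning described above.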