How do some search sites hit every combination of Google searches without having real content?
I just ran a conversion search on Google for "15lbs in kg" and the first hit is http://www.trueknowledge.com/q/what_is_15_kg_in_lbs
I can then change 15 to ANY number, including decimals, and I always get trueknowledge as the first hit, with a direct link to their site for converting that number.
I can imagine that you could build something like this fairly easily by automatically linking to the next number on every page, and they also seem to do this by providing "questions like yours" links. For this example it's quite easy, but I've seen many other cases where you search for something arbitrary only to land on another search page that provides its own crappy search results for that exact search phrase.
Is this just based on generating links by guessing phrases to feed to Google's crawler, or how is it done?
I'm not interested in creating a clone of these sites; I truly hate them. I'm just curious about how it's done and whether Google is trying to prevent it in some way. For the conversions, where they provide a good result, I don't mind, but when I land on another search page it's really annoying.
Comments (1)
Actually, "I can then change 15 to ANY number" is not true. E.g. right now a search for "15lbs in kg" gives http://wiki.answers.com/Q/How_much_is_15_lbs_in_kg as one of the links. However, if you try "15.713lbs in kg", you don't get http://wiki.answers.com/Q/How_much_is_15_713_lbs_in_kg or anything similar in the list. If you search for "15.71349lbs in kg", you get nothing (except the Google converter's output). As you mentioned, it's not that it doesn't understand decimals - http://www.trueknowledge.com/q/15.1_kg_in_lbs is the first link when searching for "15.1lbs in kg".
Disclaimer: I don't know what these sites do or how they do it; this is just my opinion.
These pages must be generated from user queries somehow. Probably the most productive source is the search bar on http://www.trueknowledge.com/. When users search there, the site can automatically generate links that Google can then find. If you go to some pages on the site, such as http://www.trueknowledge.com/recent-activity, you can see that there are a lot of questions on the page, each with a link similar to the one you posted. This is one of the ways Google finds them. "15lbs in kg" is probably a very common query, so it has probably been asked a million times already and shows up among those questions.
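To make that guess concrete, here is a minimal sketch of how a site could turn whatever users type into crawlable links. It is only an illustration of the idea, not how trueknowledge.com actually works: `BASE_URL`, `query_to_slug` and `RecentActivity` are hypothetical names, and the underscore-joined slug format is simply inferred from the URLs quoted above.

```python
import re
from collections import deque
from typing import Optional

# Placeholder domain; the real site's URL scheme is only inferred from its links.
BASE_URL = "http://www.example.com/q/"


def query_to_slug(query: str) -> str:
    """Lowercase the query and join its word/number tokens with underscores."""
    tokens = re.findall(r"[a-z0-9.]+", query.lower())
    # Strip stray dots (e.g. a trailing full stop) but keep decimals like 15.1.
    cleaned = [token.strip(".") for token in tokens]
    return "_".join(token for token in cleaned if token)


class RecentActivity:
    """Hypothetical 'recent activity' page: remembers the latest unique queries."""

    def __init__(self, max_items: int = 100) -> None:
        # A bounded deque drops the oldest slug once max_items is exceeded.
        self._slugs = deque(maxlen=max_items)

    def record(self, query: str) -> Optional[str]:
        """Store the query and return the URL a crawler could later follow."""
        slug = query_to_slug(query)
        if not slug:
            return None
        if slug not in self._slugs:
            self._slugs.appendleft(slug)
        return BASE_URL + slug

    def render_links(self) -> list:
        """Return the anchor tags the listing page would expose, newest first."""
        return [
            f'<a href="{BASE_URL}{slug}">{slug.replace("_", " ")}</a>'
            for slug in self._slugs
        ]
```

Every search a user makes adds one more link to a page the crawler already visits, so the set of indexed URLs grows with real traffic rather than by guessing phrases in advance.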
Note also that there are question listing pages, such as http://www.trueknowledge.com/new-questions/100. If you crawl from there (and, believe me, Google has fast crawlers :)) you can get 100 questions per page. The last page as of now is http://www.trueknowledge.com/new-questions/94000. Note that this is 94000 links per crawl, and crawls probably happen very frequently for this type of site.
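The arithmetic behind that 94000 figure can be sketched as follows. This assumes the number in the URL is a running offset that grows by 100 per page and that every listing page is full; both are my assumptions, and `listing_page_urls` and `reachable_question_links` are made-up helper names.

```python
def listing_page_urls(base_url: str, last_offset: int, page_size: int = 100) -> list:
    """Build every listing URL from the first page up to last_offset."""
    return [
        f"{base_url}/new-questions/{offset}"
        for offset in range(page_size, last_offset + 1, page_size)
    ]


def reachable_question_links(last_offset: int, page_size: int = 100) -> int:
    """Count the question links exposed, assuming each listing page is full."""
    pages = len(range(page_size, last_offset + 1, page_size))
    return pages * page_size
```

Under those assumptions, an offset of 94000 means 940 listing pages of 100 questions each, so a crawler that walks the whole listing sees 94000 question links.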
There are many other possible techniques, of course:
The volume of information on the Internet today is so huge that it's arguably not very hard to generate links the way trueknowledge.com does. The hard parts these guys face are on the other side: searching and getting meaningful results fast.