Google AdSense bot's algorithm and behavior
I am interested in the Google AdSense bot's algorithm and how it behaves on a web site. I have not worked with AdSense and I do not have an account, so I need your help to understand:
1) From time to time, Gbot downloads all pages from a web site. Am I right?
2) Gbot does not understand dynamic content (loaded by AJAX). So I must generate static content and return it within the HTML page, and these pages must show identical content to all users and to Gbot?
3) Because of (1) and (2), I cannot use only the root path http://example.com with some "main" widget. I must generate unique pages, for example http://example.com/thread?id=101 ?
4) Gbot downloads pages (1) to grab (index) keywords from them and then stores this information on its servers, for example as key/value pairs (where the key is the page path and the value is a tag cloud). Am I right?
5) When a user opens the web site in a browser, the embedded AdSense HTML code loads some JavaScript. As I understand from googling, this JavaScript does not index the page. Instead it calls Google's server (with some parameter such as key==page_path), gets appropriate ad links, and shows them in its frame. Is that the right behavior? Or does the JavaScript do some local indexing of the page's content?
6) How do Gbot and AdSense's JavaScript work with cookies? As I understand it, AdSense can use cookies to show appropriate ad links. If that is right, please give me some use cases ;)
I know that the "true" algorithm is known only to engineers at Google. But some of you have experience with AdSense and the AdSense HTML/JavaScript. Please correct my picture of it ;)
Thank you very much for any advice!
P.S. This question is very important to me. It is not a question asked for fun, so please do not close it ;)
Comments (1)
1) Yes, if Googlebot can access the pages and if it knows about them through a link, XML Sitemaps, Google +1, etc.
2) Googlebot will now make AJAX / XHR requests to understand AJAX content (http://googlewebmastercentral.blogspot.com/2011/11/get-post-and-safely-surfacing-more-of.html).
Yes, you should show the same content to Googlebot as you would to users; otherwise this would be considered cloaking, which is against their guidelines.
3) This question isn't clear. But basically it's preferable to have the URL change, because Google will then know how to index each piece of content separately. If you're using AJAX, you might want to consider permalinks like the ones you suggested, or you can use the HTML5 History API (pushState / popstate).
4) Yes, Google will index the words on the page. I'm not certain they store them as key/value pairs. I'm not even sure if they're still using Bigtable (http://labs.google.com/papers/bigtable.html) ... but it's likely they use Bigtable or a similar system to store the inverted index.
5) The AdSense code is embedded JavaScript. For new web pages that Google hasn't seen before, it tries to deliver the most relevant ads based on the information it has found on the web about the site, or possibly through the anchor text of links pointing to that page. However, to get a more accurate understanding of the page's content, Google will send an AdSense-specific bot to crawl your page ... sometimes you'll see it come very fast, even as soon as you load the page for the first time. It uses a different user agent than the traditional Googlebot ... you can find all of Google's user agents here (http://www.google.com/support/webmasters/bin/answer.py?answer=1061943)
6) Google's crawlers don't accept cookies and won't pass cookies back to your server. This has to do with the massively distributed nature of Google's crawlers, which makes maintaining cookies or sessions extremely difficult.
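To illustrate the inverted index mentioned in point 4, here is a minimal Python sketch. It is only a toy and says nothing about how Google actually stores its data. The idea is that instead of mapping each page path to its keywords, an inverted index maps each word to the set of pages that contain it. The function name `build_inverted_index` and the sample pages are made up for illustration.

```python
import re
from collections import defaultdict
from typing import Dict, Set


def build_inverted_index(pages: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each lowercase word to the set of page paths that contain it."""
    index = defaultdict(set)
    for path, text in pages.items():
        # Very crude tokenizer: runs of ASCII letters and digits.
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(path)
    return dict(index)


# Hypothetical pages, keyed by path in the style of the question's URLs.
pages = {
    "/thread?id=101": "Cheap flights to Paris",
    "/thread?id=102": "Paris hotels and cheap hostels",
}
index = build_inverted_index(pages)
```

Looking up `index["paris"]` returns both paths, while `index["flights"]` returns only the first one, which is why this layout suits keyword lookups better than a per-page tag cloud.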
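For point 5, if you want to see in your access logs when the AdSense crawler visits, you can look at the User-Agent header. Below is a rough sketch, assuming the AdSense crawler identifies itself with the token `Mediapartners-Google` and the regular crawler with `Googlebot`; check Google's user agent list linked above for the current values. The function name `classify_google_crawler` is hypothetical. User-Agent strings can be spoofed, so treat this as a logging aid only, and keep serving the same content to everyone (see point 2).

```python
def classify_google_crawler(user_agent: str) -> str:
    """Return 'adsense', 'googlebot', or 'other' based on the User-Agent string."""
    ua = user_agent.lower()
    # Assumed token for the AdSense crawler.
    if "mediapartners-google" in ua:
        return "adsense"
    # Assumed token for the regular search crawler.
    if "googlebot" in ua:
        return "googlebot"
    return "other"


# Example strings for illustration only.
adsense_kind = classify_google_crawler("Mediapartners-Google")
search_kind = classify_google_crawler(
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
)
browser_kind = classify_google_crawler("Mozilla/5.0 (Windows NT 10.0; Win64; x64)")
```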