How does Copyscape use the Google API?
How does Copyscape use the Google API?
The AJAX API works only in browsers with JavaScript enabled, so that API cannot be what they use. The SOAP API is not used either, because it is not allowed for commercial use and permits no more than 100 queries per day.
Copyscape does not use the Google API. Instead it uses Google Search directly: it makes a simple curl request to http://www.google.com/search?q=Search Keywords here, then uses regular-expression patterns to find the titles, descriptions, and links, and shows them to the user. This strictly violates Google's terms of service and could get them banned, so they use proxies (or some other IP-hiding method) to hide their IP for each search.
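To make the "search URL plus regexp" idea concrete, here is a minimal Python sketch. It only builds the query URL and runs a pattern over a hard-coded HTML sample; it makes no network request. The markup in SAMPLE_HTML, the pattern, and the function names are all assumptions made up for illustration. Real result pages look different and change often, and nothing here is confirmed to be what Copyscape does.

```python
import re
from urllib.parse import urlencode


def build_search_url(keywords):
    """Build the query URL described above (the URL is only built, never fetched)."""
    return "http://www.google.com/search?" + urlencode({"q": keywords})


# Hypothetical result markup, invented for this sketch.
SAMPLE_HTML = """
<div class="result"><a href="http://example.com/a"><h3>First title</h3></a><span class="desc">First description</span></div>
<div class="result"><a href="http://example.org/b"><h3>Second title</h3></a><span class="desc">Second description</span></div>
"""

# One pattern with named groups for link, title, and description.
RESULT_RE = re.compile(
    r'<a href="(?P<link>[^"]+)"><h3>(?P<title>.*?)</h3></a>'
    r'<span class="desc">(?P<description>.*?)</span>',
    re.S,
)


def extract_results(html):
    """Return one dict per matched result block."""
    return [m.groupdict() for m in RESULT_RE.finditer(html)]


if __name__ == "__main__":
    print(build_search_url("Search Keywords here"))
    for result in extract_results(SAMPLE_HTML):
        print(result["title"], "|", result["link"], "|", result["description"])
```

A pattern like this breaks as soon as the markup changes, which is one more reason this approach is fragile in practice.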
They explain how they do it in their FAQ.
Analysis
CopyScape has made it 100% clear that it has special agreements with Google and Yahoo. I am 80% sure that CopyScape is using a search solution similar to Google Enterprise Search (probably undisclosed, but similar), provided by the search engines themselves.
CopyScape does not scrape results; it fetches API-based formats such as JSON and XML, which is good for the providers (Google and Yahoo) in terms of bandwidth and response time. I arrived at this conclusion from my own earlier attempts to scrape Google search results with Python using phrase searches ("phrase matching"). A scraping bot has no known way to bypass the 503 that Google starts returning after a couple of hundred results (at intervals of 100 or 50 searches).
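For contrast with scraping, this is roughly what consuming an API-style JSON response looks like in Python. The payload shape and the field names ("results", "title", "url", "snippet") are hypothetical, chosen only for this sketch; the actual format of whatever feed CopyScape receives is not public.

```python
import json

# Hypothetical JSON payload, invented for illustration.
SAMPLE_JSON = (
    '{"results": ['
    '{"title": "First title", "url": "http://example.com/a", "snippet": "First description"},'
    '{"title": "Second title", "url": "http://example.org/b", "snippet": "Second description"}'
    ']}'
)


def parse_results(payload):
    """Turn a JSON payload into (title, url, snippet) tuples."""
    data = json.loads(payload)
    return [(r["title"], r["url"], r["snippet"]) for r in data.get("results", [])]


if __name__ == "__main__":
    for title, url, snippet in parse_results(SAMPLE_JSON):
        print(title, "|", url, "|", snippet)
```

Structured data like this needs no pattern matching against page markup, which is why it is cheaper for both sides than serving and parsing full HTML pages.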
They are obviously not doing browser automation and passing data between web drivers and a programming language like Python. I have tried that, and it gave similar results, except that the automated searcher needs some manual intervention for the CAPTCHA, after which you can continue scraping. I also tried some of the latest bypasses, which were patched within minutes or seconds. Surely they are not doing any automated scraping of search engines, and if they ever are, it will not work long term.
How are they using their special privilege?
Since they have paid for access or have special terms, they can automate through the special APIs. They are either using Google Search Enterprise and Yahoo Search Marketing Enterprise, or they have some even more special solution.
Not Using List
Using List
Hoping
I hope someone from CopyScape can leak information so that people won't have to guess. CopyScape should also have more competition, since there are only a few plagiarism checkers out there that are highly reliable and well regarded (probably only 1-10).