How does Copyscape use the Google API?

Posted 2024-09-29 02:41:10

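How does Copyscape use the Google API? The AJAX API only works in JavaScript-enabled browsers, so it cannot be using that. It cannot be using the SOAP API either, because that API is not permitted for commercial use and allows no more than 100 queries per day.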

楠木可依 2024-10-06 02:41:10

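Copyscape does not use a Google API. It uses Google Search directly, requesting http://www.google.com/search?q= followed by the search keywords. It then uses regular-expression patterns to find the title, description, and link of each result and shows them to the user. This strictly violates Google's Terms of Service and could get them banned, so they use proxies (or some other IP-hiding method) to hide the IP of each search.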
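The approach this answer describes can be sketched roughly as follows. This is only an illustration under stated assumptions: the `<div class="result">` markup, `SAMPLE_HTML`, and the function names are invented for the example, real result pages look different and change often, and nothing here sends a request to Google.

```python
import re
from typing import Dict, List
from urllib.parse import urlencode


def build_search_url(keywords: str) -> str:
    """Build the search URL mentioned in the answer for the given keywords."""
    return "http://www.google.com/search?" + urlencode({"q": keywords})


# Hypothetical result markup, made up for this example only.
RESULT_PATTERN = re.compile(
    r'<div class="result">\s*'
    r'<a href="(?P<link>[^"]+)">(?P<title>[^<]+)</a>\s*'
    r'<p>(?P<description>[^<]*)</p>\s*'
    r'</div>'
)


def extract_results(html: str) -> List[Dict[str, str]]:
    """Return one dict (link, title, description) per matched result block."""
    return [match.groupdict() for match in RESULT_PATTERN.finditer(html)]


# Stand-in for a downloaded results page (assumption, not real Google output).
SAMPLE_HTML = """
<div class="result">
  <a href="http://example.com/a">First page</a>
  <p>Some matching text.</p>
</div>
<div class="result">
  <a href="http://example.com/b">Second page</a>
  <p>More matching text.</p>
</div>
"""

if __name__ == "__main__":
    print(build_search_url("some copied sentence"))
    for result in extract_results(SAMPLE_HTML):
        print(result["title"], result["link"], result["description"])
```

Regex extraction like this breaks as soon as the page markup changes, which is one reason the approach is fragile even before the Terms of Service issue.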

痴情 2024-10-06 02:41:10

Their FAQ explains how they do it.

Where does Copyscape get its results?
Copyscape uses Google and Yahoo! as search providers, under agreed
terms. These search providers send standard search results to
Copyscape, without any post-processing. Copyscape uses complex
proprietary algorithms to modify these search results in order to
provide a plagiarism checking service. Any charges are for
Copyscape's value-added services, not for the provision of search
results by the search providers.
http://www.copyscape.com/faqs.php#providers

Analysis

The FAQ makes it 100% certain that Copyscape has special agreements with Google and Yahoo. I am about 80% sure that Copyscape uses a search solution provided by the search engines themselves, something similar to Google Enterprise Search (probably undisclosed, but similar).

Copyscape does not scrape results; it fetches them in API-based formats such as JSON and XML, which also benefits the providers (Google and Yahoo) in bandwidth and response time. I concluded this from my own earlier attempts to scrape Google search results with Python using phrase searches ("phrase matching"). A scraping bot has no known way around the 503 responses that Google starts returning after a couple of hundred results (at 100-search or 50-search intervals).
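To illustrate why a structured format is easier to work with than scraped HTML, here is a minimal sketch of phrase matching against a JSON response. Everything in it is an assumption made for the example: the `results`, `url`, and `snippet` field names, the `SAMPLE_RESPONSE` payload, and both helper functions are hypothetical and do not describe any real Google, Yahoo, or Copyscape API.

```python
import json
from typing import List


def phrase_queries(text: str, words_per_phrase: int = 8) -> List[str]:
    """Split text into consecutive word windows, each quoted for exact-phrase search."""
    words = text.split()
    return [
        '"' + " ".join(words[i:i + words_per_phrase]) + '"'
        for i in range(0, len(words), words_per_phrase)
    ]


def matching_urls(response_body: str, phrase: str) -> List[str]:
    """Return the URLs whose snippet contains the phrase (case-insensitive)."""
    payload = json.loads(response_body)
    needle = phrase.strip('"').lower()
    return [
        item["url"]
        for item in payload.get("results", [])
        if needle in item.get("snippet", "").lower()
    ]


# Hypothetical API response, invented for this example.
SAMPLE_RESPONSE = json.dumps({
    "results": [
        {"url": "http://example.com/copy",
         "snippet": "... the quick brown fox jumps over the lazy dog ..."},
        {"url": "http://example.com/other",
         "snippet": "unrelated text"},
    ]
})

if __name__ == "__main__":
    for query in phrase_queries("the quick brown fox jumps over the lazy dog", 4):
        print(query, matching_urls(SAMPLE_RESPONSE, query))
```

With structured data there is no markup to parse, so the matching logic reduces to a few dictionary lookups.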

They are obviously not driving a browser through automation and passing data between a web driver and a programming language such as Python. I have tried that, and it gave similar results, except that the automated searcher needs manual intervention for the CAPTCHA before it lets you continue scraping. I also tried some of the latest bypasses, which were patched within minutes or even seconds. Surely they are not doing any automated scraping of search engines, and if they ever are, it will not work long term.

How are they using their special privilege?

Since they have paid for access or have special terms, they can automate through special APIs. They are either using Google Search Enterprise and Yahoo Search Marketing Enterprise, or they have some more specialized solution.

What they are not using

Regular / free APIs (not sure whether Google and Yahoo made them free for Copyscape)
Scrapers (Scrapy, Beautiful Soup, Selenium, etc.)

What they are using

Enterprise-level APIs
Server-side Bash / Python / Ruby / PHP scripts for scalability and the like

Hoping

I hope someone from Copyscape will leak some information so that people don't have to guess. Copyscape should also have more competition, since only a handful of plagiarism checkers out there (probably 1-10) are highly reliable and well regarded.
