Crawlers get stuck on a mandatory age-check page in Drupal

Posted 2024-08-03 03:41:01

We have a large community website built in Drupal, with a mandatory age check before you can access any of the site's content.

It checks for a cookie; if the cookie is not present, you get redirected to the age-check page.

We now believe crawlers get stuck at this step: they get redirected to the age check and never get to crawl the full website.

Has anyone run into this before? What would be the best way to deal with it?

Sander

EDIT

Sorry for only mentioning this now: another issue with crawlers is that when someone in the community posts something to their Facebook wall, Facebook crawls the page to fetch the images and description (which are specified in meta tags), but Facebook also gets redirected to the age-check page.
Would a user-agent check work if I add the Facebook crawler? If so, does anyone know the Facebook crawler's exact name?

The solution below is one that we also came across on the net. If adding the Facebook crawler to that list works, it would solve all the problems we are having with this age-check page.

Comments (2)

伊面 2024-08-10 03:41:01

You could check the user-agent, and if it's a crawler you do not check if the browser/user has the required cookie.

Here is a sample:

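function crawlerDetect($userAgent)
{
    // Pipe-separated substrings that identify known crawlers.
    $crawlerAgents = 'Google|msnbot|Rambler|Yahoo|AbachoBOT|accoona|AcioRobot|ASPSeek|CocoCrawler|Dumbot|FAST-WebCrawler|GeonaBot|Gigabot|Lycos|MSRBOT|Scooter|AltaVista|IDBot|eStyle|Scrubby';

    // Case-insensitive search of the user agent for any listed name.
    // Returns the matched crawler name, or false if none matched.
    if (preg_match('/' . $crawlerAgents . '/i', $userAgent, $matches)) {
        return $matches[0];
    }
    return false;
}

// Example

$crawler = crawlerDetect(isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '');

if ($crawler)
{
   // It is a crawler; its name is in the $crawler variable.
}
else
{
   // Usual visitor.
}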
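
On the Facebook part of the question: Facebook's link-preview crawler sends a user agent containing `facebookexternalhit`, so a user-agent check can cover it too. Below is a minimal sketch of how the cookie check and the crawler exemption might fit together. The function names `isKnownCrawler` and `needsAgeCheck` and the cookie name `age_verified` are assumptions made for illustration, not part of the site's actual code, and the crawler list is deliberately short.

```php
<?php
// Hypothetical helper names and cookie name; adapt to the site's real age-check code.

/**
 * Returns true if the user agent contains one of the listed crawler substrings.
 */
function isKnownCrawler(string $userAgent): bool
{
    // Illustrative list: 'facebookexternalhit' is sent by Facebook's
    // link-preview crawler; 'Googlebot' and 'bingbot' are search crawlers.
    $crawlerSubstrings = ['Googlebot', 'bingbot', 'facebookexternalhit'];

    foreach ($crawlerSubstrings as $needle) {
        // stripos() is case-insensitive and returns false when not found.
        if (stripos($userAgent, $needle) !== false) {
            return true;
        }
    }
    return false;
}

/**
 * Decides whether a request should be redirected to the age-check page.
 * $cookies is passed in (e.g. $_COOKIE) so the function is easy to test.
 */
function needsAgeCheck(string $userAgent, array $cookies): bool
{
    if (isKnownCrawler($userAgent)) {
        return false; // crawlers skip the age check
    }
    // 'age_verified' is an assumed cookie name.
    return !isset($cookies['age_verified']);
}
```

In the page handler this could be called as `needsAgeCheck($_SERVER['HTTP_USER_AGENT'] ?? '', $_COOKIE)`, issuing the existing redirect only when it returns true. Note that user agents are trivially spoofed, so any visitor who sends a crawler's user agent would also bypass the age check.
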
咿呀咿呀哟 2024-08-10 03:41:01

Gary Keith has a PHP class that you can use to check all the attributes of a visitor (e.g., browser or crawler), and the class also automatically updates an exhaustive ini file of browsers and crawlers on a regular basis. There's also a Drupal module, although I haven't tried it.
