Ruby Anemone: add a tag to every URL the spider visits

Asked 2024-12-03 16:34:32


I have a crawl set up:

require 'anemone'

Anemone.crawl("http://www.website.co.uk", :depth_limit => 1) do |anemone|
  anemone.on_every_page do |page|
    puts page.url
  end
end

However, I want the spider to use a Google Analytics anti-tracking tag on every URL it visits, without necessarily actually clicking the links.

I could use the spider once and store all of the URLs, then use Watir to run through them adding the tag, but I want to avoid this because it is slow and I like the skip_links_like and page-depth functions.

How could I implement this?


Answered by 缱绻入梦, 2024-12-10 16:34:32


You want to add something to the URL before you load it, correct? You can use focus_crawl for that.

Anemone.crawl("http://www.website.co.uk", :depth_limit => 1) do |anemone|
    anemone.focus_crawl do |page|
        page.links.map do |url|
            # url will be a URI (probably URI::HTTP) so adjust
            # url.query as needed here and then return url from
            # the block.
            url
        end
    end
    anemone.on_every_page do |page|
        puts page.url
    end
end

The focus_crawl method is intended to filter the URL list:

Specify a block which will select which links to follow on each page. The block should return an Array of URI objects.

but you can use it as a general purpose URL filter as well.
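
As a minimal sketch of that filter-only use (assuming, purely for illustration, that you wanted to skip links to PDF files):

require 'anemone'

Anemone.crawl("http://www.website.co.uk", :depth_limit => 1) do |anemone|
    anemone.focus_crawl do |page|
        # Return only the links that should be followed; anything
        # rejected here is never requested. The .pdf test is hypothetical.
        page.links.reject { |uri| uri.path =~ /\.pdf\z/i }
    end
end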

For example, if you wanted to add atm_source=SiteCon&atm_medium=Mycampaign to all the links, then your page.links.map block would look something like this:

page.links.map do |uri|
    # Grab the query string, break it into components, throw out
    # any existing atm_source or atm_medium components. The to_s
    # does nothing if there is a query string but turns a nil into
    # an empty string to avoid some conditional logic.
    q = uri.query.to_s.split('&').reject { |x| x =~ /^atm_(source|medium)=/ }

    # Add the atm_source and atm_medium that you want.
    q << 'atm_source=SiteCon' << 'atm_medium=Mycampaign'

    # Rebuild the query string 
    uri.query = q.join('&')

    # And return the updated URI from the block
    uri
end
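
With that map in place, a link such as http://www.website.co.uk/page?ref=home (an illustrative URL) would be fetched by the crawler as http://www.website.co.uk/page?ref=home&atm_source=SiteCon&atm_medium=Mycampaign, and links with no query string at all would simply gain ?atm_source=SiteCon&atm_medium=Mycampaign.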

If your atm_source or atm_medium values contain non-URL-safe characters, then URI-encode them.
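
A minimal sketch of that encoding step, using Ruby's standard-library URI.encode_www_form_component (the parameter values shown are hypothetical):

require 'uri'

# Hypothetical values containing characters that are not URL-safe.
source = 'Site Con'
medium = 'My campaign (2024)'

q = []
q << "atm_source=#{URI.encode_www_form_component(source)}"
q << "atm_medium=#{URI.encode_www_form_component(medium)}"

puts q.join('&')
# => atm_source=Site+Con&atm_medium=My+campaign+%282024%29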
