How to grep file names and extensions from a web page using nokogiri/hpricot and other gems?

Posted on 2024-12-25 02:49:46


I am working on an application where I have to

1) get all the links of a website,

2) and then get the list of all the files and file extensions on each
of those web pages/links.

I am done with the first part of it :)
I get all the links of the website with the code below.

require 'rubygems'
require 'spidr'
require 'uri'


Spidr.site('http://testasp.vulnweb.com/') do |spider|
  spider.every_url { |url| puts url }
end

Now I have to get all the files/file extensions on each
page, so I tried the code below.

require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'uri'
require 'spidr'

site = 'http://testasp.vulnweb.com'

urls = []

Spidr.site(site) do |spider|
  spider.every_url { |url| urls.push(url) }
end

urls.each do |input|
  input = input.to_s
  begin
    # On Ruby 3.0+ use URI.open; Kernel#open no longer handles URLs.
    doc = Nokogiri::HTML(URI.open(input))
    doc.traverse do |el|
      # Collect src/href attributes that end in one of the wanted extensions
      # and resolve them against the page URL.
      [el[:src], el[:href]].grep(/\.(txt|css|gif|jpg|png|pdf)$/i)
                           .map { |l| URI.join(input, l).to_s }
                           .each { |link| puts link }
    end
  rescue => e
    puts "error fetching #{input}: #{e.message}"
  end
end

Can anybody guide me on how to parse the links/web pages and get the
file extensions on each page?
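As an aside, one reason the anchored regex above can silently miss files is a trailing query string: `$` matches the end of the whole URL, not the end of the path. A minimal, stdlib-only illustration of the difference:

```ruby
require 'uri'

url = "http://testasp.vulnweb.com/Images/logo.gif?size=350x350"

# Anchored match against the raw URL fails: the string ends with
# the query string, not the extension.
raw_match = url.match?(/\.(txt|css|gif|jpg|png|pdf)$/i)

# Matching only the parsed path works, because URI.parse separates
# the query string from the path.
path_match = URI.parse(url).path.match?(/\.(txt|css|gif|jpg|png|pdf)$/i)

puts raw_match   # false
puts path_match  # true
```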


一曲琵琶半遮面シ 2025-01-01 02:49:46


You might want to take a look at URI.parse. The URI module is part of the Ruby standard library and is a dependency of the spidr gem. Here is an example implementation, with a spec for good measure.

require 'rspec'
require 'uri'

class ExtensionExtractor
  def extract(uri)
    # When a literal regexp with named captures is on the left of =~,
    # the captures bind the local variables `file` and `extension`.
    /\A.*\/(?<file>.*\.(?<extension>txt|css|gif|jpg|png|pdf))\z/i =~ URI.parse(uri).path
    { path: uri, file: file, extension: extension }
  end
end

describe ExtensionExtractor do
  before(:all) do
    @css_uri = "http://testasp.vulnweb.com/styles.css"
    @gif_uri = "http://testasp.vulnweb.com/Images/logo.gif"
    @gif_uri_with_param = "http://testasp.vulnweb.com/Images/logo.gif?size=350x350"
  end

  describe "Common Extensions" do
    it "extracts CSS files from URIs" do
      file = subject.extract(@css_uri)
      expect(file[:path]).to eq @css_uri
      expect(file[:file]).to eq "styles.css"
      expect(file[:extension]).to eq "css"
    end

    it "extracts GIF files from URIs" do
      file = subject.extract(@gif_uri)
      expect(file[:path]).to eq @gif_uri
      expect(file[:file]).to eq "logo.gif"
      expect(file[:extension]).to eq "gif"
    end

    it "extracts extensions even when URIs have query parameters" do
      file = subject.extract(@gif_uri_with_param)
      expect(file[:path]).to eq @gif_uri_with_param
      expect(file[:file]).to eq "logo.gif"
      expect(file[:extension]).to eq "gif"
    end
  end
end
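To tie the two halves together without touching the network, the crawl output can be filtered with the same path-based idea using only the standard library. The URL list below is a hypothetical stand-in for what Spidr would collect:

```ruby
require 'uri'

# Hypothetical stand-in for the URLs Spidr would collect from the site
urls = [
  "http://testasp.vulnweb.com/styles.css",
  "http://testasp.vulnweb.com/Images/logo.gif?size=350x350",
  "http://testasp.vulnweb.com/Default.asp"
]

WANTED = %w[.txt .css .gif .jpg .png .pdf]

files = urls.filter_map do |url|
  path = URI.parse(url).path        # drops any query string
  ext  = File.extname(path).downcase
  next unless WANTED.include?(ext)  # skip extensions we don't care about
  { url: url, file: File.basename(path), extension: ext.delete_prefix(".") }
end

files.each { |f| puts "#{f[:file]} (#{f[:extension]})" }
```

`File.extname`/`File.basename` do the splitting for free, so no hand-rolled regex is needed; `filter_map` requires Ruby 2.7+.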