从网络上抓取 URL

发布于 2024-11-16 22:56:10 字数 241 浏览 7 评论 0原文

<a href="http://www.utoronto.ca/gdrs/" title="Rehabilitation Science"> Rehabilitation Science</a>

对于上面的例子，我想同时获取部门名称“康复科学”及其主页网址“http://www.utoronto.ca/gdrs/”。

有人可以建议一些可以为我完成这项工作的智能正则表达式吗？

原文

<a href="http://www.utoronto.ca/gdrs/" title="Rehabilitation Science"> Rehabilitation Science</a>

For the example above, I want to get the department name "Rehabilitation Science" and its homepage url "http://www.utoronto.ca/gdrs/" at the same time.

Could someone please suggest some smart regular expressions that would do the job for me?

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

两个我 2024-11-23 22:56:10

根本没有理由使用正则表达式来执行此操作。这里有一个使用 Nokogiri 的解决方案，它是常见的 Ruby HTML/XML 解析器：

html = <<EOT
<p><a href="http://www.example.com/foo">foo</a></p>
<p><a href='http://www.example.com/foo1'>foo1</p></a>
<p><a href=http://www.example.com/foo2>foo2</a></p>
<p><a href = http://www.example.com/bar>bar</p>
<p><a 
  href="http://www.example.com/foobar"
  >foobar</a></p>
  <p><a 
    href="http://www.example.com/foobar2"
    >foobar2</p>
EOT

require 'nokogiri'

doc = Nokogiri::HTML(html)

links = Hash[
  *doc.search('a').map { |a| 
      [
        a['href'],
        a.content
      ]
    }.flatten
  ]

require 'pp'
pp links
# >> {"http://www.example.com/foo"=>"foo",
# >>  "http://www.example.com/foo1"=>"foo1",
# >>  "http://www.example.com/foo2"=>"foo2",
# >>  "http://www.example.com/bar"=>"bar",
# >>  "http://www.example.com/foobar"=>"foobar",
# >>  "http://www.example.com/foobar2"=>"foobar2"}

这会返回 URL 的哈希值作为键，并带有相关的标记的内容作为值。这意味着您将只捕获唯一的 URL，并丢弃重复的 URL。如果您希望所有 URL 使用：

links = doc.search('a').map { |a| 
    [
      a['href'],
      a.content
    ]
  }

这会导致：

# >> [["http://www.example.com/foo", "foo"],
# >>  ["http://www.example.com/foo1", "foo1"],
# >>  ["http://www.example.com/foo2", "foo2"],
# >>  ["http://www.example.com/bar", "bar"],
# >>  ["http://www.example.com/foobar", "foobar"],
# >>  ["http://www.example.com/foobar2", "foobar2"]]

我使用 CSS 访问器 'a' 来定位标签。如果我只想抓取链接，忽略锚点，我可以使用 'a[href]' 。

正则表达式在处理 HTML 和 XML 时非常脆弱，因为标记格式过于自由；它们的格式可能会有所不同，但仍保持有效，尤其是 HTML，其“正确性”可能会有很大差异。如果您不拥有正在解析的文件的生成，那么您的代码将受到使用正则表达式时生成它的人的支配；文件中的简单更改可能会严重破坏模式，从而导致持续的维护麻烦。

解析器因为实际上了解文件的内部结构，所以可以承受这些更改。请注意，我故意创建了一些格式错误的 HTML，但代码并不关心。比较解析器版本与正则表达式解决方案的简单性，并考虑长期可维护性。

There's no reason to use regex to do this at all. Here's a solution using Nokogiri, which is the usual Ruby HTML/XML parser:

html = <<EOT
<p><a href="http://www.example.com/foo">foo</a></p>
<p><a href='http://www.example.com/foo1'>foo1</p></a>
<p><a href=http://www.example.com/foo2>foo2</a></p>
<p><a href = http://www.example.com/bar>bar</p>
<p><a 
  href="http://www.example.com/foobar"
  >foobar</a></p>
  <p><a 
    href="http://www.example.com/foobar2"
    >foobar2</p>
EOT

require 'nokogiri'

doc = Nokogiri::HTML(html)

links = Hash[
  *doc.search('a').map { |a| 
      [
        a['href'],
        a.content
      ]
    }.flatten
  ]

require 'pp'
pp links
# >> {"http://www.example.com/foo"=>"foo",
# >>  "http://www.example.com/foo1"=>"foo1",
# >>  "http://www.example.com/foo2"=>"foo2",
# >>  "http://www.example.com/bar"=>"bar",
# >>  "http://www.example.com/foobar"=>"foobar",
# >>  "http://www.example.com/foobar2"=>"foobar2"}

This returns a hash of URLs as keys with the related content of the <a> tag as the value. That means you'll only capture unique URLs, throwing away duplicates. If you want all URLs use:

links = doc.search('a').map { |a| 
    [
      a['href'],
      a.content
    ]
  }

which results in:

# >> [["http://www.example.com/foo", "foo"],
# >>  ["http://www.example.com/foo1", "foo1"],
# >>  ["http://www.example.com/foo2", "foo2"],
# >>  ["http://www.example.com/bar", "bar"],
# >>  ["http://www.example.com/foobar", "foobar"],
# >>  ["http://www.example.com/foobar2", "foobar2"]]

I used a CSS accessor 'a' to locate the tags. I could use 'a[href]' if I wanted to grab only links, ignoring anchors.

Regex are very fragile when dealing with HTML and XML because the markup formats are too freeform; They can vary in their format while remaining valid, especially HTML, which can vary wildly in its "correctness". If you don't own the generation of the file being parsed, then your code is at the mercy of whoever does generate it when using regex; A simple change in the file can break the pattern badly, resulting in a continual maintenance headache.

A parser, because it actually understands the internal structure of the file, can withstand those changes. Notice that I deliberately created some malformed HTML but the code didn't care. Compare the simplicity of the parser version vs. a regex solution and think of long term maintainability.

回复收藏 0 原文

吐个泡泡 2024-11-23 22:56:10

我建议使用像 @mrk 建议的 HTML 解析器。然后获取返回的结果并将其通过正则表达式搜索器。我喜欢用红宝石。这将向您显示正则表达式正在捕获的内容，并且您可以避免得到不需要的结果。我发现在这种情况下使用正则表达式 /http[^"]+/ 会起作用，因为即使没有“www.”，它也会抓取整个 url，并且您可以避免捕获引号。

回复收藏 0 原文

烂人 2024-11-23 22:56:10

如果您正在构建蜘蛛，那么 Ruby 的 Mechanize 是一个不错的选择。要获取页面并提取链接：

require 'rubygems'
require 'mechanize'

agent = Mechanize.new
page = agent.get "http://google.com/"

page.links.each do |link|
  puts link.href
  puts link.text
end

文档和指南（我链接到的）列出了您可能想要做的很多事情。使用正则表达式来解析 HTML（或 XML）是出了名的棘手且容易出错。使用完整的解析器（正如其他人所建议的那样）将节省您的精力并使您的代码更加健壮。

If you're building a spider, then Ruby's Mechanize is a great choice. To fetch a page and extract the links:

require 'rubygems'
require 'mechanize'

agent = Mechanize.new
page = agent.get "http://google.com/"

page.links.each do |link|
  puts link.href
  puts link.text
end

The documentation and the guide (that I linked to) lay out a lot of what you'll probably want to do. Using regular expressions to parse HTML (or XML) is notoriously tricky and error prone. Using a full parser (as others have suggested) will save you effort and make you code more robust.

回复收藏 0 原文

晨曦慕雪 2024-11-23 22:56:10

尽量不要把这件事做得过于复杂：

#<a .*?href="([^"]*)".*>([^<]+)</a>#i

Trying to not do this overcomplicated:

#<a .*?href="([^"]*)".*>([^<]+)</a>#i

回复收藏 0 原文

绝對不後悔。 2024-11-23 22:56:10

这是我的 Ruby 方法：

require 'open-uri'

class HTMLScraper
    def initialize(page)
      @src = page
      open(@src) do |x|
          @html = x.read
      end
    end
    def parseLinks
      links = @html.scan(/<a\s+href\s*=\s*"([^"]+)"[^>]*>\s*([^<]+)\s*<\/a>/ui)
      puts "Link(s) Found:"
      i = 0
      while i < links.length
        puts "\t#{links[i]}"
        i += 1
      end
    end
  end

url = "http://stackoverflow.com/questions"
test = HTMLScraper.new(url)
test.parseLinks

这将为您提供一个数组数组，其中每个（内部）数组的第一项是 url，第二项是标题。希望这对您有所帮助，并注意正则表达式上的 u 开关，这是为了避免编码问题。

Here is my Ruby Approach:

require 'open-uri'

class HTMLScraper
    def initialize(page)
      @src = page
      open(@src) do |x|
          @html = x.read
      end
    end
    def parseLinks
      links = @html.scan(/<a\s+href\s*=\s*"([^"]+)"[^>]*>\s*([^<]+)\s*<\/a>/ui)
      puts "Link(s) Found:"
      i = 0
      while i < links.length
        puts "\t#{links[i]}"
        i += 1
      end
    end
  end

url = "http://stackoverflow.com/questions"
test = HTMLScraper.new(url)
test.parseLinks

This will give to you an array of arrays, in which the first item of each (inner) array is the url, and the second is the title. Hope this helps and note the u switch on the regex, it's to avoid encoding problems.

回复收藏 0 原文

~没有更多了~