Searching Hpricot with a regular expression

Posted 2024-08-16 03:04:03 · 676 words · 10 views · 0 comments

I'm trying to use Hpricot to get the value within a span with a class name I don't know. I know that it follows the pattern "foo_[several digits]_bar".

Right now, I'm getting the entire containing element as a string and using a regex to parse the string for the tag. That solution works, but it seems really ugly.

doc = Hpricot(open("http://scrape.example.com/search?q=#{ticker_symbol}"))
elements = doc.search("//span[@class='pr']").inner_html
string = ""
elements.each do |attr|
  if(attr =~ /foo_\d+_bar/)
    string = attr
  end
end
# get rid of the span tags, just get the value
string.sub!(/<\/span>/, "")
string.sub!(/<span.+>/, "")

return string

It seems like there should be a better way to do that. I'd like to do something like:

elements = doc.search("//span[@class='" + /foo_\d+_bar/ + "']").inner_html

But that doesn't run. Is there a way to search with a regular expression?

Comments (3)

红衣飘飘貌似仙 2024-08-23 03:04:04

This should do:

doc.search("span[@class^='foo'][@class$='bar']")
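
One caveat worth noting: the prefix and suffix selectors only check how the class name starts and ends, so they would also accept names such as foo_bar or foo_abc_bar. If the digits matter, the candidates can be narrowed afterwards with the regular expression from the question. The sketch below uses plain Ruby strings with made-up class names (not Hpricot elements) purely to illustrate the difference; with Hpricot the same check would presumably be applied to the class attribute of each matched element.

```ruby
# Hypothetical class names, standing in for the class attributes of
# the spans matched by the selector above.
CLASS_NAMES = ["foo_123_bar", "foo_bar", "foo_abc_bar", "pr", "foo_42_bar"]

# The pattern from the question, anchored so the whole name must match.
STRICT_PATTERN = /\Afoo_\d+_bar\z/

# Roughly what [@class^='foo'][@class$='bar'] tests.
def loose_match?(name)
  name.start_with?("foo") && name.end_with?("bar")
end

# The stricter regular-expression test.
def strict_match?(name)
  !(name =~ STRICT_PATTERN).nil?
end

loose  = CLASS_NAMES.select { |name| loose_match?(name) }
strict = CLASS_NAMES.select { |name| strict_match?(name) }

puts "loose:  #{loose.inspect}"
puts "strict: #{strict.inspect}"
```

Here the loose test keeps foo_bar and foo_abc_bar, while the anchored pattern keeps only the names with digits in the middle.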

In addition, here are some more examples of how other similar expressions work:

For a document containing a single element like <meta content="abcxy def ghi jklmn">:

We get the following output for each query:

doc.search("//meta[@content='abcxy def ghi jklmn']")
=> #<Hpricot::Elements[{emptyelem <meta content="abcxy def ghi jklmn">}]>

This is what we would expect.

doc.search("//meta[@content='def']")
=> #<Hpricot::Elements[]>

As you can see, = looks for an exact match.

doc.search("//meta[@content~='def']")
=> #<Hpricot::Elements[{emptyelem <meta content="abcxy def ghi jklmn">}]>

With ~= we can match a whole whitespace-separated word inside the value, which is not quite the substring matching you might expect.

For instance, see the following.

doc.search("//meta[@content~=' def ']")
=> #<Hpricot::Elements[]>

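It seems that spaces are treated specially. This is consistent with the CSS ~= operator, which matches whole whitespace-separated words, so an argument that itself contains spaces never matches.

With *= we can work around this problem. Now we are doing true substring matching.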

doc.search("//meta[@content*=' def ']")
=> #<Hpricot::Elements[{emptyelem <meta content="abcxy def ghi jklmn">}]>

We can also do string begin and string end matching as follows:

doc.search("//meta[@content^='def']")
=> #<Hpricot::Elements[]>

doc.search("//meta[@content^='ab']")
=> #<Hpricot::Elements[{emptyelem <meta content="abcxy def ghi jklmn">}]>

doc.search("//meta[@content$='mn']")
=> #<Hpricot::Elements[{emptyelem <meta content="abcxy def ghi jklmn">}]>

Note that space characters are not a problem for these operators.

doc.search("//meta[@content$=' jklmn']")
=> #<Hpricot::Elements[{emptyelem <meta content="abcxy def ghi jklmn">}]>
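
The behaviour observed above can be summarised in a few lines of plain Ruby. This is only an emulation of the operator semantics on a string, written to mirror the results shown in this answer; it is not Hpricot's implementation, and the method name attr_match? is made up for illustration.

```ruby
# Emulates the attribute-selector operators on a plain string value.
def attr_match?(value, op, arg)
  case op
  when "="  then value == arg                      # exact match
  when "~=" then value.split(/\s+/).include?(arg)  # whole-word match
  when "*=" then value.include?(arg)               # substring match
  when "^=" then value.start_with?(arg)            # prefix match
  when "$=" then value.end_with?(arg)              # suffix match
  else raise ArgumentError, "unknown operator: #{op}"
  end
end

content = "abcxy def ghi jklmn"

[["=", "def"], ["~=", "def"], ["~=", " def "], ["*=", " def "],
 ["^=", "def"], ["^=", "ab"], ["$=", " jklmn"]].each do |op, arg|
  puts "#{op} #{arg.inspect}: #{attr_match?(content, op, arg)}"
end
```

Each line printed corresponds to one of the queries above: true where the query returned the meta element, false where it returned an empty result.
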
我ぃ本無心為│何有愛 2024-08-23 03:04:04

This should do:

doc.search("span[@class^='foo'][@class$='bar']")
浅暮の光 2024-08-23 03:04:04

One could modify the incoming html before parsing.

# read works whether open-uri returns a StringIO (small responses) or a Tempfile (large ones)
html = open("http://scrape.example.com/search?q=#{ticker_symbol}").read
html.gsub!(/class="(foo_\d+_bar)"/){ |s| "class=\"foo_bar #{$1}\"" }
doc = Hpricot(html)

After that you can identify the elements using the foo_bar class. This is far from elegant or general but could prove to be more efficient.
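
To see what the substitution does, here is a minimal sketch run against a made-up HTML fragment instead of a downloaded page. The method name tag_dynamic_classes and the sample markup are assumptions for illustration. Note that the pattern only handles a double-quoted class attribute whose entire value is the generated name.

```ruby
# Adds a fixed "foo_bar" class next to every class of the form foo_<digits>_bar.
def tag_dynamic_classes(html)
  html.gsub(/class="(foo_\d+_bar)"/) { "class=\"foo_bar #{$1}\"" }
end

# Made-up markup standing in for the downloaded page.
sample = '<span class="foo_123_bar">42.10</span><span class="pr">n/a</span>'

puts tag_dynamic_classes(sample)
```

The first span gains the extra foo_bar class while keeping its original one; the second span is left untouched.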
