Searching Hpricot with a regular expression
I'm trying to use Hpricot to get the value within a span with a class name I don't know. I know that it follows the pattern "foo_[several digits]_bar".
Right now, I'm getting the entire containing element as a string and using a regex to parse the string for the tag. That solution works, but it seems really ugly.
doc = Hpricot(open("http://scrape.example.com/search?q=#{ticker_symbol}"))
elements = doc.search("//span[@class='pr']").inner_html
string = ""
elements.each do |attr|
  if(attr =~ /foo_\d+_bar/)
    string = attr
  end
end
# get rid of the span tags, just get the value
string.sub!(/<\/span>/, "")
string.sub!(/<span.+>/, "")
return string
It seems like there should be a better way to do that. I'd like to do something like:
elements = doc.search("//span[@class='" + /foo_\d+_bar/ + "']").inner_html
But that doesn't run. Is there a way to search with a regular expression?
Comments (3)
This should do:
In addition, here are some more examples of how other similar expressions work:
For a document like the following:
We get the following output for each query:
This is what we would expect.
As you can see, = looks for an exact match.
With ~ we can do substring matching, but not quite the kind you would expect.
For instance, see the following.
It seems that spaces are treated specially.
With star we can get around this problem. Now we are doing true substring matching.
We can also match the beginning and the end of the string as follows:
Note that space characters are not a problem for these.
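To make the differences between these matching modes concrete, here is a minimal plain-Ruby sketch of how such attribute comparisons are conventionally defined in CSS-style selectors. It assumes the operators are spelled =, ~=, *=, ^= and $=, which is the usual CSS convention and may not be exactly what the queries above used. The helper attr_match? is hypothetical; it does not call Hpricot and is not Hpricot's implementation.

```ruby
# Hypothetical helper: models the conventional CSS attribute-selector
# operators in plain Ruby. It is an illustration only, not Hpricot code.
def attr_match?(value, op, needle)
  case op
  when "="  then value == needle               # exact match on the whole value
  when "~=" then value.split.include?(needle)  # one whitespace-separated word
  when "*=" then value.include?(needle)        # substring anywhere in the value
  when "^=" then value.start_with?(needle)     # value begins with needle
  when "$=" then value.end_with?(needle)       # value ends with needle
  else raise ArgumentError, "unknown operator: #{op}"
  end
end

klass = "pr foo_123_bar"
%w[= ~= *= ^= $=].each do |op|
  puts "#{op.ljust(2)} 'foo' -> #{attr_match?(klass, op, 'foo')}"
end
```

Under this model, the word-based comparison explains why spaces appear to be treated specially: "foo" is not one of the whitespace-separated words in "pr foo_123_bar", so only the substring comparison finds it.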
This should do:
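One possible approach, not necessarily the one originally posted here, is to select spans broadly and then filter them in Ruby with the regular expression from the question. The sketch below uses a hypothetical Span struct as a stand-in for parsed elements so that it runs without Hpricot; first_matching_value and CLASS_PATTERN are likewise illustrative names.

```ruby
# Stand-in for a parsed element so the sketch runs without Hpricot.
# With Hpricot, the list would presumably come from something like
# doc.search("//span"), with the class read from each element (assumption).
Span = Struct.new(:class_name, :inner_html)

# Anchored so that the whole class token must have the form foo_<digits>_bar.
CLASS_PATTERN = /\Afoo_\d+_bar\z/

# Returns the inner HTML of the first span that has a class token matching
# the pattern, or nil when no span matches.
def first_matching_value(spans, pattern = CLASS_PATTERN)
  hit = spans.find do |span|
    span.class_name.to_s.split.any? { |token| token =~ pattern }
  end
  hit && hit.inner_html
end

spans = [
  Span.new("pr", "ignored"),
  Span.new("pr foo_12345_bar", "42.17"),
]
puts first_matching_value(spans)
```

Filtering on the parsed elements avoids the string surgery on the span tags, because the value is read from the element rather than cut out of the markup.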
One could modify the incoming HTML before parsing.
After that you can identify the elements using the foo_bar class. This is far from elegant or general, but it could prove to be more efficient.
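As a rough illustration of that preprocessing step, the sketch below rewrites every foo_[digits]_bar token to the fixed name foo_bar in the raw markup. The helper name normalize_foo_classes is hypothetical, and the substitution is deliberately naive: it rewrites matching text anywhere in the document, not only inside class attributes.

```ruby
# Hypothetical helper: collapses every token of the form foo_<digits>_bar
# into the fixed name foo_bar before the markup is handed to the parser.
# Caveat: this is a plain text substitution, so matching text outside of
# class attributes is rewritten as well.
def normalize_foo_classes(html)
  html.gsub(/\bfoo_\d+_bar\b/, "foo_bar")
end

raw = '<span class="pr foo_12345_bar">42.17</span>'
puts normalize_foo_classes(raw)
```

The normalized string can then be parsed as usual, and the span located by the fixed class name instead of a pattern.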