Segmentation fault in hpricot
I'm using hpricot to read HTML. I got a segmentation fault. I googled it, and some people say to upgrade to the latest version of Ruby. I am using Rails 2.3.2 and Ruby 1.8.7. How can I resolve this error?
Comments (7)
I was trying to parse HTML pages with many Unicode characters in them, and Hpricot kept crashing. Finally, I used the monkey patch from sanitize and put it in the environment.rb of my Rails application. There hasn't been a single crash since I added this patch:
http://github.com/rgrove/sanitize/blob/1e1dc9681de99e32dc166f591343dfa60fc1f648/lib/sanitize/monkeypatch/hpricot.rb
If you're free to choose your HTML parsing library, switch it.
Why, the creator of Hpricot, recently posted that you are better off using Nokogiri instead of Hpricot nowadays.
You may also have a look at HTTParty.
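For anyone making that switch, here is a minimal sketch of pulling links out of a page with Nokogiri. It assumes the nokogiri gem is installed; the link_targets helper and the sample markup are made up for illustration, and the helper returns nil when the gem is missing so the snippet still runs.

```ruby
require 'rubygems'

begin
  require 'nokogiri'
rescue LoadError
  # Nokogiri is not installed; link_targets will return nil in that case.
end

# Hypothetical helper: returns the href of every <a> element in the markup,
# or nil when Nokogiri could not be loaded.
def link_targets(html)
  return nil unless defined?(Nokogiri)
  Nokogiri::HTML(html).css('a').map { |a| a['href'] }
end

sample = '<html><body><a href="/first">one</a> <a href="/second">two</a></body></html>'
targets = link_targets(sample)
puts targets ? targets.inspect : 'nokogiri gem not available'
```

In most code the change is small: Hpricot(html) becomes Nokogiri::HTML(html), and searches go through css or xpath.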
I'm having the same segfault issue, but sadly I can't consult the issues Dave cited above, even via Google cache. From what I've been googling, the parse.rb segfaults have to do with encoded entities or alternate character sets (accented characters, perhaps).
The sanitize lib encountered the same issue and posted a monkeypatch here:
http://github.com/rgrove/sanitize/blob/1e1dc9681de99e32dc166f591343dfa60fc1f648/lib/sanitize/monkeypatch/hpricot.rb
This appears to be an outstanding issue on the bug list. I have experienced it too. My theory is that it has to do with the HTML structure or a bad/corrupt character in the file, but I have not found exactly where.
Here are the links to the issues:
From memory, since I last used it about a year ago:
Hpricot stores attributes in a fixed-size buffer, and some frameworks generate outrageously long hashes in document attributes. There's some static field you can set before parsing that lets you set the size of this buffer.
I remember it being fairly prominent in the docs on the webpage, though of course it's gone now.
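As far as I can tell, the setting being remembered here is Hpricot.buffer_size, which the Hpricot README mentions for "ran out of buffer space" errors. A rough sketch, assuming that setter is present in your Hpricot version; the parse_with_larger_buffer helper and the 262144 figure are only illustrative, and the helper returns nil when the gem is not installed.

```ruby
require 'rubygems'

begin
  require 'hpricot'
rescue LoadError
  # Hpricot is not installed; parse_with_larger_buffer will return nil.
end

# Hypothetical helper: raises Hpricot's buffer size, then parses.
# 262_144 is an example size, not a recommended value.
def parse_with_larger_buffer(html, size = 262_144)
  return nil unless defined?(Hpricot)
  Hpricot.buffer_size = size   # assumption: this setter exists in your version
  Hpricot(html)
end

# A long attribute value of the kind some frameworks emit in hidden fields.
long_value = 'a' * 100_000
doc = parse_with_larger_buffer(%(<input type="hidden" value="#{long_value}" />))
puts doc ? 'parsed with the larger buffer' : 'hpricot gem not available'
```

Whether this helps with a segfault, as opposed to a buffer-space error, is not something I can confirm.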
Well, based on your own question, I'd say "Upgrade to the latest version of Ruby". However, I've also had problems with hpricot segfaulting, which seemed to be related to my usage of threading.
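If threading is the suspect, one thing to try is letting only one thread into the parser at a time. The sketch below is a guess at a workaround, not a confirmed fix: SerializedParser is a made-up helper, and the regex-based title parser is a stand-in so the example runs without Hpricot installed.

```ruby
require 'thread'

# Hypothetical helper: wraps any parser block so that only one thread
# can be inside it at a time.
class SerializedParser
  def initialize(&parser)
    @parser = parser
    @lock = Mutex.new
  end

  def parse(html)
    @lock.synchronize { @parser.call(html) }
  end
end

# Stand-in parser so the sketch runs anywhere. With Hpricot you would
# pass something like: SerializedParser.new { |html| Hpricot(html) }
title_parser = SerializedParser.new { |html| html[/<title>(.*?)<\/title>/m, 1] }

pages = (1..4).map { |i| "<html><title>page #{i}</title></html>" }
threads = pages.map { |page| Thread.new { title_parser.parse(page) } }
titles = threads.map { |t| t.value }
puts titles.inspect
```

This gives up parallel parsing, but it is a quick way to check whether the crashes really come from concurrent use.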
On Ruby 1.8.5, try installing hpricot version 0.6.161 (gem install hpricot -v 0.6.161).
That worked for me.