How to stop Nokogiri from converting &nbsp; to a space

Posted 2024-10-07 21:15:00

I fetch an HTML fragment like

"<li>市&nbsp;场&nbsp;价"

which contains "&nbsp;", but after calling to_s on a Nokogiri NodeSet, it becomes

"<li>市 场 价"

I want to keep the original HTML fragment. I tried setting the :save_with option for the to_s method, but that failed.

Has anyone encountered the same problem who can help? Thank you in advance.

Comments (2)

九厘米的零° 2024-10-14 21:15:00

I encountered a similar situation, and what I came up with is a bit of a hack, but it seems to work well.

nbsp = Nokogiri::HTML("&nbsp;").text
text.gsub(nbsp, " ")

In my case, I wanted the nbsp to become a regular space. I think in your case you want them turned back into "&nbsp;", so you could do something like:

nbsp = Nokogiri::HTML("&nbsp;").text
html.gsub(nbsp, "&nbsp;")
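
A variation on the substitution above that names the character directly instead of parsing it out of a document. This is only a minimal sketch: it assumes the serialized string is UTF-8 and that every U+00A0 in it should come back as the &nbsp; entity. `restore_nbsp_entities` is a hypothetical helper name, not part of Nokogiri, and the sample string stands in for whatever to_s returned.

```ruby
# U+00A0 NO-BREAK SPACE is the character that the &nbsp; entity decodes to.
NBSP = "\u00A0"

# Hypothetical helper (not a Nokogiri API): re-encode every no-break space
# in an already serialized HTML string as the &nbsp; entity.
def restore_nbsp_entities(html)
  html.gsub(NBSP, "&nbsp;")
end

# Stand-in for the string that to_s returned, with real U+00A0 characters.
serialized = "<li>市#{NBSP}场#{NBSP}价</li>"

puts restore_nbsp_entities(serialized)
# prints: <li>市&nbsp;场&nbsp;价</li>
```

Regular spaces (U+0020) are left alone, so only the characters that started life as &nbsp; (or as literal no-break spaces in the source) are rewritten.
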
与君绝 2024-10-14 21:15:00

I think the problem is how you're looking at the string. It will look like a space, but it's not quite the same:

require 'nokogiri'

doc = Nokogiri::HTML('"<li>市&nbsp;场&nbsp;价"')
(doc % 'li').content.chars.to_a[1].ord # => 160
(doc % 'li').to_html # => "<li>市 场 价\"</li>"

A regular space is 32, 0x20 or ' '. 160 is the decimal value for a non-breaking space, which is what &nbsp; converts to after you use Nokogiri's various inner_text, content, text or to_s methods. It's no longer an HTML entity encoding, but it's still a non-breaking space. I think Nokogiri's conversion from the entity encoding is the appropriate behavior when asking for a stringification.
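
The difference between the two characters can be checked in plain Ruby, without Nokogiri. A small sketch, assuming a UTF-8 string; the constant names are only for illustration:

```ruby
REGULAR_SPACE  = " "       # U+0020
NO_BREAK_SPACE = "\u00A0"  # U+00A0, what &nbsp; decodes to

p REGULAR_SPACE.ord                    # => 32
p NO_BREAK_SPACE.ord                   # => 160
p REGULAR_SPACE == NO_BREAK_SPACE      # => false

# In UTF-8 the no-break space takes two bytes.
p NO_BREAK_SPACE.bytes                 # => [194, 160]

# \s only covers ASCII whitespace; the POSIX bracket class is Unicode-aware.
p NO_BREAK_SPACE.match?(/\s/)          # => false
p NO_BREAK_SPACE.match?(/[[:space:]]/) # => true
```

This is why the output looks unchanged on screen yet fails a comparison against a string typed with ordinary spaces.
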

There might be a flag to tell Nokogiri NOT to decode the value, but I'm not aware of one offhand. You can check Nokogiri's mailing list, which I mentioned in the comment above, to see if there is a flag. I can also see an advantage in Nokogiri not doing the decode, so if there isn't such a flag, it would occasionally be nice to have one.

Now, all that said, I think the to_html method SHOULD return the value to its entity-encoded form, since a non-breaking space is a nasty thing to encounter in an HTML stream. I think you should mention that on the mailing list, or maybe even file it as a bug. I think it's an inappropriate result.


http://groups.google.com/group/nokogiri-talk/msg/0b81ef0dc180dc74

Okay, I can explain the behavior now. Basically, the problem boils
down to encoding.

In Ruby 1.9, we examine the encoding of the string you're feeding to
Nokogiri. If the input string is "utf-8", the document is assumed to
be a UTF-8 document. When you output the document, since "&nbsp;" can
be represented as a UTF-8 character, it is output as that UTF-8
character.

In 1.8, since we cannot detect the encoding of the document, we assume
binary encoding and allow libxml2 to detect the encoding.
If you set the encoding of the input document to binary, it will give
you back the entities you want. Here is some code to demo:

 require 'nokogiri' 
 html = '<body>hello &nbsp; world</body>' 
 f    = Nokogiri.HTML(html) 
 node = f.css('body') 
 p node.inner_html 
 f    = Nokogiri.HTML(html.encode('ASCII-8BIT')) 
 node = f.css('body') 
 p node.inner_html 

I posted a youtube video too! :-)

http://www.youtube.com/watch?v=X2SzhXAt7V4

Aaron Patterson

Your sample text isn't ASCII-8BIT, so try changing that encoding string to the Unicode character set name and see whether inner_html returns an entity-encoded value.
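
To see what the `html.encode('ASCII-8BIT')` call in the demo does at the Ruby level, here is a plain-Ruby sketch that does not involve Nokogiri at all. It assumes UTF-8 source strings; `transcodable_to_binary?` is a hypothetical helper written for this illustration. It shows only how Ruby treats the two spellings of the input, not what Nokogiri will output for them.

```ruby
ENTITY_FORM  = "hello &nbsp; world"   # entity spelled out, ASCII characters only
LITERAL_FORM = "hello \u00A0 world"   # an actual U+00A0 character in the string

# Hypothetical helper: true if String#encode can transcode str to binary.
def transcodable_to_binary?(str)
  str.encode(Encoding::BINARY)        # Encoding::BINARY is ASCII-8BIT
  true
rescue EncodingError
  false
end

p ENTITY_FORM.ascii_only?                  # => true
p LITERAL_FORM.ascii_only?                 # => false

# An ASCII-only string transcodes cleanly, as in the demo code.
p transcodable_to_binary?(ENTITY_FORM)     # => true

# A literal no-break space has no binary mapping, so encode raises.
p transcodable_to_binary?(LITERAL_FORM)    # => false

# String#b relabels the bytes as binary without transcoding them.
p LITERAL_FORM.b.encoding == Encoding::BINARY  # => true
p LITERAL_FORM.b.bytesize                      # => 14
```

So the demo's `encode` call works because the entity form is pure ASCII; input that already contains the raw character would need relabelling (for example with `b` or `force_encoding`) rather than transcoding.
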
