How to make Nokogiri not convert &nbsp; to a space
I fetch an HTML fragment like
"<li>市&nbsp;场&nbsp;价"
which contains "&nbsp;", but after calling to_s on the Nokogiri NodeSet, it becomes
"<li>市 场 价"
I want to keep the original HTML fragment. I tried setting the :save_with option for the to_s method, but that failed.
Has anyone encountered the same problem who can help me? Thank you in advance.
Comments (2)
I encountered a similar situation, and what I came up with is a bit of a hack, but it seems to work well.
In my case, I wanted the nbsp to be a regular space. I think in your case you want them turned back into "&nbsp;", so you could do something like:
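The snippet that originally followed this answer is missing from the page, so the following is only a minimal sketch of the kind of hack described, not the answerer's actual code. It assumes the serialized string (for example, the result of to_s on the NodeSet) contains literal U+00A0 characters where the source had "&nbsp;". The helper names restore_nbsp_entities and nbsp_to_space are hypothetical, and the sample string is hard-coded so that the example runs without Nokogiri.

```ruby
# Assumption: after serialization, each "&nbsp;" has become a literal
# non-breaking space character (U+00A0) in the string.
NBSP_CHAR = "\u00A0"

# Hypothetical helper: turn literal non-breaking spaces back into the entity.
def restore_nbsp_entities(html)
  html.gsub(NBSP_CHAR, "&nbsp;")
end

# Hypothetical helper: turn literal non-breaking spaces into regular spaces.
def nbsp_to_space(html)
  html.gsub(NBSP_CHAR, " ")
end

# Stand-in for the string a NodeSet's to_s might return.
serialized = "<li>市\u00A0场\u00A0价</li>"

puts restore_nbsp_entities(serialized)
puts nbsp_to_space(serialized)
```

Note that this rewrites every U+00A0 in the string, including any that appeared as raw characters rather than as entities in the original document, which is part of why it is a hack.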
I think the problem is how you're looking at the string. It looks like a space, but it isn't quite the same:
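The code that accompanied this answer did not survive on the page. As a rough illustration of the point, here is a sketch in plain Ruby (no Nokogiri needed) that compares the code points of a regular space and a non-breaking space. The constant names and the helper space_codepoints are made up for this example, and the sample strings are assumed stand-ins for the decoded and plain text.

```ruby
REGULAR_SPACE  = " "       # U+0020, decimal 32
NO_BREAK_SPACE = "\u00A0"  # U+00A0, decimal 160

# Hypothetical helper: return the code point of every regular or
# non-breaking space found in the text, in order of appearance.
def space_codepoints(text)
  text.each_char
      .select { |c| c == REGULAR_SPACE || c == NO_BREAK_SPACE }
      .map(&:ord)
end

decoded = "市\u00A0场\u00A0价"  # what the text looks like after entity decoding
plain   = "市 场 价"            # the same text typed with regular spaces

p space_codepoints(decoded)  # => [160, 160]
p space_codepoints(plain)    # => [32, 32]
p decoded == plain           # => false
```

The two strings render almost identically in a terminal, which is why inspecting code points is more reliable than looking at the output.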
A regular space is 32, 0x20, or ' '. 160 is the decimal value of a non-breaking space, which is what &nbsp; converts to after you use Nokogiri's various inner_text, content, text, or to_s methods. It's no longer an HTML entity encoding, but it's still a non-breaking space. I think Nokogiri's conversion from the entity encoding is the appropriate behavior when asking for a stringification.
There might be a flag to tell Nokogiri not to decode the value, but I'm not aware of one offhand. You can check Nokogiri's mailing list, which I mentioned in the comment above, to see whether there is such a flag. I can also see an advantage in Nokogiri not doing the decode, so if there isn't such a flag, it would occasionally be nice to have.
Now, all that said, I think the to_html method SHOULD return the value in its entity-encoded form, since a non-breaking space is a nasty thing to encounter in an HTML stream. I think you should mention that on the mailing list, or maybe even report it as a bug. I think it's an inappropriate result.
http://groups.google.com/group/nokogiri-talk/msg/0b81ef0dc180dc74
Your sample text isn't ASCII-8BIT, so try changing that encoding string to the Unicode character set name and see if inner_html will return an entity-encoded value.