Outputting a list of unique Unicode characters in Ruby

Posted 2025-01-02 16:44:03

I am parsing some text in Ruby that contains Unicode characters, which I would like to transcribe to ASCII values in one output file and to HTML encoding in another. Is there a simple way of spitting out the non-ASCII characters found in a file? For example:

\u00A0 # should become a " " in the text file, but &nbsp; in the HTML output file

I'm going to manually transcribe them based upon my needs and would like to output a list of unique characters I'll need to transcribe from my initial input file.

Thanks,
Ben

最终幸福 2025-01-09 16:44:03

There's a method that helps to extract the characters found in your string:

"foo\u00A0bar".chars.to_a
# => ["f", "o", "o", " ", "b", "a", "r"]

Since some of these characters may be multi-byte Unicode characters, you might want to expand that into bytes as well, to be more thorough:

"foo\u00A0bar".chars.to_a.collect { |c| [ c, c.bytes.to_a ] }
# => [["f", [102]], ["o", [111]], ["o", [111]], [" ", [194, 160]], ["b", [98]], ["a", [97]], ["r", [114]]]

The array breaks down the specific bytes used to construct each character. In this case the non-breaking space prints as what looks like an ordinary " ", but it is actually the two UTF-8 bytes [194, 160] internally.
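
To get from there to the list of unique characters the question asks for, something like the following might work. This is only a sketch: `unique_non_ascii` and `code_point_label` are hypothetical helper names, not anything built into Ruby, and it assumes the input is valid UTF-8.

```ruby
# Hypothetical helper: returns the sorted, de-duplicated
# non-ASCII characters found in the given string.
def unique_non_ascii(text)
  text.each_char.reject(&:ascii_only?).uniq.sort
end

# Hypothetical helper: formats a character's code point
# in the usual "U+00A0" notation.
def code_point_label(char)
  format("U+%04X", char.ord)
end

# Sample input containing a non-breaking space (twice) and an e-acute.
sample = "foo\u00A0bar \u00E9t\u00E9 \u00A0"

unique_non_ascii(sample).each do |char|
  puts "#{code_point_label(char)} #{char.inspect}"
end
```

For a real file you could presumably pass in `File.read("input.txt", encoding: "UTF-8")` instead of the sample string, where `input.txt` stands in for your own input path. Printing the code point next to each character helps with invisible ones like the non-breaking space, which are hard to tell apart from a plain space on screen.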
