Outputting a list of unique Unicode characters in Ruby
I am parsing some text in Ruby that contains Unicode characters, which I would like to transcribe to ASCII values in one output file and to HTML encoding in another. Is there a simple way of listing the non-ASCII characters found in a file? For example:
\u00A0 # should become a " " in the text file, but &nbsp; in the HTML output file
I'm going to transcribe them manually based on my needs, so I would like to output a list of the unique characters in my initial input file that I'll need to transcribe.
Thanks,
Ben
Comments (1)
There's a method that helps to extract the characters found in your string:
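The snippet that originally followed this sentence did not survive on this page, so the following is only a minimal sketch of one way to do it, not necessarily the method the answer had in mind. It assumes a UTF-8 string; the sample text and the variable names are made up for illustration. `String#chars`, `String#ascii_only?` and `Array#uniq` are standard Ruby methods.

```ruby
# Sample input (an assumption for this sketch): contains "é" and two
# non-breaking spaces.
text = "Caf\u00E9\u00A0au\u00A0lait"

# Split into characters, drop everything in the 7-bit ASCII range,
# then de-duplicate (uniq keeps the first occurrence of each character).
non_ascii = text.chars.reject(&:ascii_only?).uniq

# Print each unique character with its code point so it is easy to
# identify even when it is invisible, like the non-breaking space.
non_ascii.each do |ch|
  puts format("U+%04X %s", ch.ord, ch)
end
```

For a whole file, the same chain could be applied to the result of `File.read(path, encoding: "UTF-8")`.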
Since some of these characters may be multi-byte Unicode characters, you might want to expand each one into its bytes as well, to be more thorough:
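Again, the original snippet is missing here, so this is a hedged reconstruction of the idea rather than the answer's own code. It pairs each unique character with the bytes of its UTF-8 encoding using `String#bytes`; the sample string is an assumption.

```ruby
# Sample input (an assumption): plain ASCII followed by a non-breaking space.
text = "Test\u00A0"

# Pair every unique character with the bytes of its UTF-8 encoding.
breakdown = text.chars.uniq.map { |ch| [ch, ch.bytes] }

breakdown.each do |ch, bytes|
  puts "#{ch.inspect} => #{bytes.inspect}"
end
```

ASCII characters come back as a single byte, for example `["T", [84]]`, while the non-breaking space comes back as two.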
The array breaks down the specific bytes used to construct that character. In this case the non-breaking space shows up as " " but is actually [194, 160] internally.
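Once the list of unique characters is known, the manual transcription described in the question could be kept in a lookup table. This is only a sketch under stated assumptions: the `TRANSCRIPTIONS` table, its two entries, and the `transcribe` helper are hypothetical names invented for illustration, and the table would need one entry per character found in the real input.

```ruby
# Hypothetical lookup table: each non-ASCII character maps to its
# plain-text replacement and its HTML replacement.
TRANSCRIPTIONS = {
  "\u00A0" => { text: " ", html: "&nbsp;" },
  "\u00E9" => { text: "e", html: "&eacute;" }
}.freeze

# Replace every non-ASCII character that has a table entry; characters
# without an entry are left untouched so they stay visible in the output.
def transcribe(input, target)
  input.gsub(/[^[:ascii:]]/) do |ch|
    entry = TRANSCRIPTIONS[ch]
    entry ? entry[target] : ch
  end
end

sample = "caf\u00E9\u00A0bar"
puts transcribe(sample, :text)
puts transcribe(sample, :html)
```

Leaving unmapped characters in place, rather than dropping them, makes it easy to spot anything the table still lacks.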