How do I enumerate the unique characters in a UTF-8 document? With sed?

Posted on 2024-11-26 12:26:31

I'm converting some Polish<->English dictionaries from RTF to HTML. The Polish special characters are coming out fine. But IPA (International Phonetic Alphabet) glyphs get changed to funny things, depending on what program I use for conversion. For example, /ˈbiːrɪ/ comes out as /ÈbiùrI/ or /∪βιρΙ/.

I'd like to correct these documents with a search & replace, but I want to make sure I don't miss any characters and don't want to manually pore over dictionary entries. I'd like to output a list of all unique, NON-ascii characters in a document.

I found this thread:
Find Unique Characters in a File

... and I tried the following two proposals:
sed -e "s/./\0\n/g" inputfile | sort -u
sed -e "s/\(.\)/\1\n/g" inputfile | sort -u

They both work nicely, and seem to both generate the same output. My problem is that they only output standard ASCII characters, and what I'm looking for is exactly the opposite.

The sed tool looks awesome, but I don't have time to learn it right now (though I intend to later). I'm hoping the solution will be clear to someone who's already mastered this tool, and they can save me a lot of time. [-:

Thanks in advance!

Comments (1)

2024-12-03 12:26:31

This is not a sed solution but a Python solution (written in Python 2 syntax). It reads the contents of a file, decodes it as UTF-8, turns it into a set (thus throwing away duplicates), removes the ASCII characters (0-127), sorts what is left, and then joins it back together with a newline between each character:

'\n'.join(sorted(set(unicode(open(inputfile).read(), 'utf-8')) - set(chr(i) for i in xrange(128))))

As something you'd run from the command line if you felt so inclined,

python -c "print '\n'.join(sorted(set(unicode(open('inputfile').read(), 'utf-8')) - set(chr(i) for i in xrange(128))))"

(You could also use ''.join instead of '\n'.join which would list the characters without a newline in between.)
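
The one-liners above rely on unicode, xrange and the print statement, which exist only in Python 2. For anyone on Python 3, here is a rough equivalent as a sketch. The function names unique_non_ascii, describe and print_report are made up for illustration, and it assumes the input file really is valid UTF-8. It also prints each character's code point and Unicode name, which may help when building a search & replace table.

```python
import unicodedata


def unique_non_ascii(text):
    """Return the distinct characters of `text` above code point 127, sorted."""
    return sorted({ch for ch in text if ord(ch) > 127})


def describe(ch):
    """Format one character as 'U+XXXX<TAB>char<TAB>Unicode name'."""
    # Some code points have no name; fall back to a placeholder for those.
    name = unicodedata.name(ch, "<unnamed>")
    return f"U+{ord(ch):04X}\t{ch}\t{name}"


def print_report(path):
    """Print one line per unique non-ASCII character found in the file."""
    # Assumption: the file is UTF-8; invalid bytes raise UnicodeDecodeError.
    with open(path, encoding="utf-8") as handle:
        for ch in unique_non_ascii(handle.read()):
            print(describe(ch))
```

Calling print_report('inputfile') lists every non-ASCII character once, in code point order. If the file contains bytes that are not valid UTF-8, open() can be given errors='replace' to substitute U+FFFD instead of raising an error.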
