Python: collecting ASCII and UTF-8 stuff
I have a text file, "words.txt", that contains English words. Let's assume it contains just three words: "one", "two" and "three".
I also have three files: one.dat, two.dat and three.dat. Each of these files contains binary data representing the transcription of the corresponding word. The format is UTF-8.
What I want: to combine "words.txt" and all these .dat files into a single document that I can print. So I need something like this (let's name it "final.dat"):
one [wan]
two [tu:]
three [?ri:]
but with the correct "th" sign instead of "?" :)
The most important thing is that I must be able to load "final.dat" into MS Word or Writer and print it out.
I'm going to do this in Python, but I'm really stuck on all these 'codecs', 'encodes', 'decodes' and so on...
Comments (1)
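In Python 2.x, a UTF-8 file can be read in two ways, for example with `codecs.open` (passing `'utf-8'` as the encoding) or by reading the file's raw contents and decoding them; both return a Python `unicode` object. If you want to turn a `str` (ASCII/binary string) `s` into a `unicode`, use `s.decode('utf-8')`. In Python 3.x, either pass `encoding='utf-8'` to `open`, or open the file in binary mode and decode the bytes you read.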
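The Python 3 variants of reading a UTF-8 file can be sketched as follows. This is a minimal illustration, not necessarily the exact snippets from the answer: the file name `three.dat`, its content, and the temporary directory are assumptions made so the example runs on its own.

```python
import os
import tempfile

# Illustrative data: a transcription of "three" using the theta sign (U+03B8).
transcription = "\u03b8ri:"

# Create a throwaway UTF-8 file standing in for three.dat.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "three.dat")
with open(path, "wb") as f:
    f.write(transcription.encode("utf-8"))

# Variant 1: text mode, open() decodes the bytes while reading.
with open(path, encoding="utf-8") as f:
    text_a = f.read()

# Variant 2: binary mode, read raw bytes and decode them explicitly.
with open(path, "rb") as f:
    raw = f.read()
text_b = raw.decode("utf-8")

# raw is a bytes object; text_a and text_b are str objects with equal content.
print(type(raw).__name__, type(text_b).__name__)
print(text_a == text_b == transcription)
```

Both variants end up with the same `str`; the only difference is whether `open` or your own code performs the decoding step.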
The idea is that a `str` (Py2.x) or `bytes` (Py3.x) object contains just the binary representation of a string in some encoding, without specifying which encoding that is; the `decode` method turns this into a proper Unicode string (`unicode` in 2.x, `str` in 3.x). (Btw., UTF-8 is not "binary data", it's just text in a non-ASCII encoding.)
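Applied to the question, the pieces might be put together roughly like this in Python 3. It is only a sketch under several assumptions: `build_document` is a hypothetical helper, each transcription is assumed to live in `<word>.dat` in a single directory, and the output is named `final.txt` rather than `final.dat`. The output is written with the `utf-8-sig` codec, which prepends a byte order mark; that is assumed here to help word processors recognise the file as UTF-8, so check it against your own setup.

```python
import os
import tempfile


def build_document(words_path, dat_dir, out_path):
    """Write one 'word [transcription]' line per word and return the lines."""
    with open(words_path, encoding="utf-8") as f:
        words = f.read().split()

    lines = []
    for word in words:
        # Assumption: the transcription for a word is stored in <word>.dat.
        dat_path = os.path.join(dat_dir, word + ".dat")
        with open(dat_path, "rb") as f:
            transcription = f.read().decode("utf-8").strip()
        lines.append("{} [{}]".format(word, transcription))

    # "utf-8-sig" writes a byte order mark before the UTF-8 encoded text.
    with open(out_path, "w", encoding="utf-8-sig") as f:
        f.write("\n".join(lines) + "\n")
    return lines


# Demo with made-up sample data in a temporary directory.
demo_dir = tempfile.mkdtemp()
samples = {"one": "wan", "two": "tu:", "three": "\u03b8ri:"}

with open(os.path.join(demo_dir, "words.txt"), "w", encoding="utf-8") as f:
    f.write("\n".join(samples) + "\n")

for word, ipa in samples.items():
    with open(os.path.join(demo_dir, word + ".dat"), "wb") as f:
        f.write(ipa.encode("utf-8"))

out_file = os.path.join(demo_dir, "final.txt")
result = build_document(os.path.join(demo_dir, "words.txt"), demo_dir, out_file)
print(result)
```

The key point is that every file is decoded to `str` on the way in and encoded exactly once on the way out, so the theta sign survives instead of turning into "?".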