Identify partial character encoding / compression in text content
I have a CSV (extracted from BZ2) where only some values are encoded:
hoxvh|c1x6nos c1x6e26|0 1
hqa1x|c1xiujs c1xj4e2|1 0
hpopn|c1xeuca c1xdepf|0 1
hpibh c1xcjy1|c1xe4yn c1xd1gh|1 0
hqdex|c1xls27 c1xjvjx|1 0
The |, 0 and 1 characters are definitely appearing as intended, but the other values are clearly encoded. In fact, they look like text-compression replacements, which could mean the CSV had its values compressed and then also compressed as a whole to BZ2.
I get the same results whether extracting the BZ2 with 7zip then opening the CSV in a text editor, or opening with the Python bz2 module, or with Pandas and read_csv:
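import bz2
# Decompress the archive and decode the bytes as text (UTF-8 by default)
with bz2.open("test-balanced.csv.bz2") as f:
    contents = f.read().decode()

import pandas as pd
# Let pandas handle the BZ2 decompression itself
contents = pd.read_csv("test-balanced.csv.bz2", compression="bz2", encoding="utf-8")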
How can I identify which encoding to decode these values with?
Source directory: https://nlp.cs.princeton.edu/SARC/2.0/main
Source file: test-balanced.csv.bz2
First 100 lines from extracted CSV: https://pastebin.com/mgW8hKdh
I asked the original authors of the CSV/dataset but they didn't respond, which is understandable.
From readme.txt:
Converting above to a Python code snippet:
Note that files were (manually) downloaded from the pol directory for their acceptable size (pol: contains subset of main dataset corresponding to comments in /r/politics).
Result:
D:\bat\SO\71596864.py
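As a rough illustration of the row layout visible in the question, here is a minimal sketch that splits each row on | and then on spaces. It assumes the opaque tokens are identifiers to be looked up elsewhere rather than compressed text; this interpretation is an assumption, supported only by the fact that every token uses nothing but lowercase letters and digits. The names parse_row, looks_like_base36 and comments are hypothetical and not part of the dataset or its readme.

```python
import string

# Characters allowed in a lowercase base-36 number
BASE36 = set(string.digits + string.ascii_lowercase)


def parse_row(line):
    """Split one row into (context_ids, response_ids, labels)."""
    context, responses, labels = line.strip().split("|")
    return context.split(), responses.split(), [int(x) for x in labels.split()]


def looks_like_base36(token):
    """True if the token contains only lowercase letters and digits."""
    return bool(token) and set(token) <= BASE36


# Sample rows copied from the question
rows = [
    "hoxvh|c1x6nos c1x6e26|0 1",
    "hpibh c1xcjy1|c1xe4yn c1xd1gh|1 0",
]

# Hypothetical ID -> text lookup table (assumption: such a mapping would
# have to come from another file in the dataset); left empty here.
comments = {}

for line in rows:
    context_ids, response_ids, labels = parse_row(line)
    # Every token in the sample is a plain base-36 string
    assert all(looks_like_base36(t) for t in context_ids + response_ids)
    # Pair each response ID with the label in the same position
    for response_id, label in zip(response_ids, labels):
        text = comments.get(response_id, "<text not loaded>")
        print(response_id, label, text)
```

If the tokens are identifiers, no decoding step applies to the CSV itself: the readable text would come from whatever mapping is loaded into comments.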