Identifying partial character encoding/compression in text content

Posted 2025-01-16 14:48:37


I have a CSV (extracted from BZ2) where only some values are encoded:

hoxvh|c1x6nos c1x6e26|0 1
hqa1x|c1xiujs c1xj4e2|1 0
hpopn|c1xeuca c1xdepf|0 1
hpibh c1xcjy1|c1xe4yn c1xd1gh|1 0
hqdex|c1xls27 c1xjvjx|1 0

The |, 0 and 1 characters are definitely appearing as intended but the other values are clearly encoded. In fact, they look like text-compression replacements which could mean the CSV had its values compressed and then also compressed as a whole to BZ2.

I get the same results whether I extract the BZ2 with 7zip and open the CSV in a text editor, open it with Python's bz2 module, or read it with pandas read_csv:

import bz2

# Reading via the bz2 module
with bz2.open("test-balanced.csv.bz2") as f:
    contents = f.read().decode()

# Reading via pandas
import pandas as pd

contents = pd.read_csv("test-balanced.csv.bz2", compression="bz2", encoding="utf-8")

How can I identify which encoding to decode with?
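As a first sanity check (a sketch over the sample rows shown above, no files required): if the extracted text survives an ASCII round-trip, the problem is not a character-set issue, which would suggest the short strings are opaque identifiers rather than mis-decoded text.

```python
# Sample rows copied verbatim from the extracted CSV above
rows = [
    "hoxvh|c1x6nos c1x6e26|0 1",
    "hqa1x|c1xiujs c1xj4e2|1 0",
]

for row in rows:
    # str.isascii() requires Python 3.7+
    assert row.isascii()
    # A lossless ASCII round-trip rules out charset mangling
    assert row.encode("ascii").decode("ascii") == row

print("rows are plain ASCII")
```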


Source directory: https://nlp.cs.princeton.edu/SARC/2.0/main

Source file: test-balanced.csv.bz2

First 100 lines from extracted CSV: https://pastebin.com/mgW8hKdh

I asked the original authors of the CSV/dataset, but they didn't respond, which is understandable.


北音执念 2025-01-23 14:48:37

From readme.txt:

File Guide:

  • raw/key.csv: column key for raw/sarc.csv
  • raw/sarc.csv: contains sarcastic and non-sarcastic comments of authors in authors.json
  • */comments.json: dictionary in JSON format containing text and metadata for each comment in {comment_id: data} format
  • */*.csv: CSV where each row contains a sequence of comments following a post, a set of responses to the last comment in that sequence, and sarcastic/non-sarcastic labels for those responses. The format is
    post_id comment_id … comment_id|response_id … response_id|label … label
    where *_id is a key to */comments.json and label 1 indicates the respective response_id maps to a sarcastic response.
    Thus each row has three entries (comment chain, responses, labels) delimited by '|', and each of these entries has elements delimited by spaces.
    The first entry always contains a post_id and 0 or more comment_ids. The second and third entries have the same number of elements, with the first response_id corresponding to the first label and so on.

Converting the above into a Python code snippet:

import pandas as pd
import json
from pprint import pprint

# Each row has three '|'-delimited entries: comment chain, responses, labels
file_csv = r"D:\bat\SO\71596864\test-balanced.csv"
data_csv = pd.read_csv(file_csv,
                       sep='|',
                       names=['posts', 'responses', 'labels'],
                       encoding='utf-8')

# comments.json maps every *_id to that comment's text and metadata
file_json = r"D:\bat\SO\71596864\comments.json"
with open(file_json, mode='r', encoding='utf-8') as f:
    data_json = json.load(f)

indent = ' ' * 30
print(f'{indent} First csv line decoded:')
for post_id in data_csv['posts'][0].split(' '):
    print(f'{indent} post_id: {post_id}')
    pprint(data_json[post_id])

for response_id in data_csv['responses'][0].split(' '):
    print(f'{indent} response_id: {response_id}')
    pprint(data_json[response_id])

Note that the files were (manually) downloaded from the pol directory because of their manageable size (pol: contains the subset of the main dataset corresponding to comments in /r/politics).

Result of D:\bat\SO\71596864.py:

                               First csv line decoded:
                               post_id: hqa1x
{'author': 'joshlamb619',
 'created_utc': 1307053256,
 'date': '2011-06',
 'downs': 359,
 'score': 274,
 'subreddit': 'politics',
 'text': 'Wisconsin GOP caught red handed, looking to run fake Democratic '
         'candidates during recall elections.',
 'ups': 633}
                               response_id: c1xiujs
{'author': 'Artisane',
 'created_utc': 1307077221,
 'date': '2011-06',
 'downs': 0,
 'score': -2,
 'subreddit': 'politics',
 'text': "And we're upset since the Democrats would *never* try something as "
         'sneaky as this, right?',
 'ups': -2}
                               response_id: c1xj4e2
{'author': 'stellarfury',
 'created_utc': 1307080843,
 'date': '2011-06',
 'downs': 0,
 'score': -2,
 'subreddit': 'politics',
 'text': "Oooh baby you caught me red handed Creepin' on the senate floor "
         "Picture this we were makin' up candidates Being huge election whores",
 'ups': -2}
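Given the readme's format, a row can also be unpacked with plain string operations, no pandas needed. A minimal sketch using the second sample row from the question (no files required):

```python
# Second sample row from the question; '|' separates the three entries
row = "hqa1x|c1xiujs c1xj4e2|1 0"
chain, responses, labels = row.split('|')

# First entry: a post_id followed by zero or more comment_ids
post_id, *comment_ids = chain.split(' ')

# Responses and labels line up positionally: first response_id <-> first label
pairs = dict(zip(responses.split(' '), labels.split(' ')))

print(post_id, comment_ids, pairs)
# hqa1x [] {'c1xiujs': '1', 'c1xj4e2': '0'}
```

Each *_id can then be looked up in */comments.json to recover the actual text, as the pandas snippet above does.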