Identifying partial character encoding/compression in text content

Posted 2025-01-16 14:48:37


I have a CSV (extracted from BZ2) where only some values are encoded:

hoxvh|c1x6nos c1x6e26|0 1
hqa1x|c1xiujs c1xj4e2|1 0
hpopn|c1xeuca c1xdepf|0 1
hpibh c1xcjy1|c1xe4yn c1xd1gh|1 0
hqdex|c1xls27 c1xjvjx|1 0

The |, 0 and 1 characters are definitely appearing as intended but the other values are clearly encoded. In fact, they look like text-compression replacements which could mean the CSV had its values compressed and then also compressed as a whole to BZ2.

I get the same results whether I extract the BZ2 with 7zip and open the CSV in a text editor, open it with Python's bz2 module, or read it with pandas read_csv:

import bz2

# Reading via the bz2 module
with bz2.open("test-balanced.csv.bz2") as f:
    contents = f.read().decode()

# Reading via pandas
import pandas as pd

contents = pd.read_csv("test-balanced.csv.bz2", compression="bz2", encoding="utf-8")

How can I identify which encoding to decode with?
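As a first sanity check (a sketch over the sample rows shown above, no files required): if the extracted text survives an ASCII round-trip, the problem is not a character-set issue, which would suggest the short strings are opaque identifiers rather than mis-decoded text.

```python
# Sample rows copied verbatim from the extracted CSV above
rows = [
    "hoxvh|c1x6nos c1x6e26|0 1",
    "hqa1x|c1xiujs c1xj4e2|1 0",
]

for row in rows:
    # str.isascii() requires Python 3.7+
    assert row.isascii()
    # A lossless ASCII round-trip rules out charset mangling
    assert row.encode("ascii").decode("ascii") == row

print("rows are plain ASCII")
```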


Source directory: https://nlp.cs.princeton.edu/SARC/2.0/main

Source file: test-balanced.csv.bz2

First 100 lines from extracted CSV: https://pastebin.com/mgW8hKdh

I asked the original authors of the CSV/dataset, but they didn't respond, which is understandable.


北音执念 2025-01-23 14:48:37

From readme.txt:

File Guide:

  • raw/key.csv: column key for raw/sarc.csv
  • raw/sarc.csv: contains sarcastic and non-sarcastic comments of authors in authors.json
  • */comments.json: dictionary in JSON format containing text and metadata for each comment in {comment_id: data} format
  • */*.csv: CSV where each row contains a sequence of comments following a post, a set of responses to the last comment in that sequence, and sarcastic/non-sarcastic labels for those responses. The format is
    post_id comment_id … comment_id|response_id … response_id|label … label
    where *_id is a key to */comments.json and label 1 indicates the respective response_id maps to a sarcastic response.
    Thus each row has three entries (comment chain, responses, labels) delimited by '|', and each of these entries has elements delimited by spaces.
    The first entry always contains a post_id and 0 or more comment_ids. The second and third entries have the same number of elements, with the first response_id corresponding to the first label and so on.

Converting the above into a Python code snippet:

import pandas as pd
import json
from pprint import pprint

# Each row has three '|'-delimited entries: comment chain, responses, labels
file_csv = r"D:\bat\SO\71596864\test-balanced.csv"
data_csv = pd.read_csv(file_csv,
                       sep='|',
                       names=['posts', 'responses', 'labels'],
                       encoding='utf-8')

# comments.json maps every *_id to that comment's text and metadata
file_json = r"D:\bat\SO\71596864\comments.json"
with open(file_json, mode='r', encoding='utf-8') as f:
    data_json = json.load(f)

indent = ' ' * 30
print(f'{indent} First csv line decoded:')
for post_id in data_csv['posts'][0].split(' '):
    print(f'{indent} post_id: {post_id}')
    pprint(data_json[post_id])

for response_id in data_csv['responses'][0].split(' '):
    print(f'{indent} response_id: {response_id}')
    pprint(data_json[response_id])

Note that the files were (manually) downloaded from the pol directory because of their manageable size (pol: contains the subset of the main dataset corresponding to comments in /r/politics).

Result of D:\bat\SO\71596864.py:

                               First csv line decoded:
                               post_id: hqa1x
{'author': 'joshlamb619',
 'created_utc': 1307053256,
 'date': '2011-06',
 'downs': 359,
 'score': 274,
 'subreddit': 'politics',
 'text': 'Wisconsin GOP caught red handed, looking to run fake Democratic '
         'candidates during recall elections.',
 'ups': 633}
                               response_id: c1xiujs
{'author': 'Artisane',
 'created_utc': 1307077221,
 'date': '2011-06',
 'downs': 0,
 'score': -2,
 'subreddit': 'politics',
 'text': "And we're upset since the Democrats would *never* try something as "
         'sneaky as this, right?',
 'ups': -2}
                               response_id: c1xj4e2
{'author': 'stellarfury',
 'created_utc': 1307080843,
 'date': '2011-06',
 'downs': 0,
 'score': -2,
 'subreddit': 'politics',
 'text': "Oooh baby you caught me red handed Creepin' on the senate floor "
         "Picture this we were makin' up candidates Being huge election whores",
 'ups': -2}
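Given the readme's format, a row can also be unpacked with plain string operations, no pandas needed. A minimal sketch using the second sample row from the question (no files required):

```python
# Second sample row from the question; '|' separates the three entries
row = "hqa1x|c1xiujs c1xj4e2|1 0"
chain, responses, labels = row.split('|')

# First entry: a post_id followed by zero or more comment_ids
post_id, *comment_ids = chain.split(' ')

# Responses and labels line up positionally: first response_id <-> first label
pairs = dict(zip(responses.split(' '), labels.split(' ')))

print(post_id, comment_ids, pairs)
# hqa1x [] {'c1xiujs': '1', 'c1xj4e2': '0'}
```

Each *_id can then be looked up in */comments.json to recover the actual text, as the pandas snippet above does.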