Python - Parsing key-value pairs into DataFrame columns

Posted on 2025-01-20 08:39:50

I have a dataset with key-value pairs in a CSV file that looks similar to this:

"1, {""key"": ""construction_year"", ""value"": 1900}, {""key"": ""available_date"", ""value"": ""Vereinbarung""}"
"2, {""key"": ""available_date"", ""value"": ""01.04.2022""}, {""key"": ""useful_area"", ""value"": 60.0}"
"3, {""key"": ""construction_year"", ""value"": 2020}, {""key"": ""available_date"", ""value"": ""sofort""}"
"4, {""key"": ""available_date"", ""value"": ""Vereinbarung""}, {""key"": ""wheelchair_accessible"", ""value"": true}"

My expected output is the following:

id      construction_year   available_date   useful_area   wheelchair_accessible
1       1900                Vereinbarung     nan           nan
2       nan                 01.04.2022       60.0          nan
3       2020                sofort           nan           nan 
4       nan                 Vereinbarung     nan           true

I have already tried converting this data to a dict with json.loads and then parsing it. This works on a small scale, when I can be sure that every row is valid JSON.

However, when I run json.loads over a list of all 200,000 observations, I always hit an error, because not every row is valid JSON. For example, sometimes a "key" has no "value", and sometimes a { is in the wrong place, so json.loads fails with: JSONDecodeError: Expecting property name enclosed in double quotes

I have tried for days to repair the whole dataset into a JSON-friendly format, but that seems impossible: I received the data in a terrible, very messy state.

So I would like to know whether someone can come up with a function that splits the key-value pairs into individual columns without using json.loads.

Thanks in advance.

小姐丶请自重 2025-01-27 08:39:50

It looks like someone scraped JavaScript code and saved it as a CSV string.

"1, {""key"": ""construction_year"", ""value"": 1900}, {""key"": ""available_date"", ""value"": ""Vereinbarung""}"

You need to convert the CSV string back to a normal string and then parse it.
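One way to undo the CSV quoting is Python's standard csv module. The following is only a minimal sketch: it assumes every row is a single, fully quoted CSV field, as in the sample above, and the name raw and its two shortened rows are made up for illustration.

```python
import csv
import io

# Two shortened sample rows in the CSV-escaped form from the question
# (illustrative data, not the real file).
raw = (
    '"1, {""key"": ""construction_year"", ""value"": 1900}"\n'
    '"2, {""key"": ""useful_area"", ""value"": 60.0}"\n'
)

# Each row is one quoted CSV field: csv.reader drops the outer quotes
# and turns every doubled quote "" back into a single quote.
unquoted = [fields[0] for fields in csv.reader(io.StringIO(raw)) if fields]

for line in unquoted:
    print(line)
```

Rows whose quoting is itself broken may come back split into several fields, so they would need separate handling.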

Or you need to change the text in each line into valid JSON data

[1, {"key": "construction_year", "value": 1900}, {"key": "available_date", "value": "Vereinbarung"}]

which can be converted to 3 columns.

Later you can merge the dictionaries into one dictionary

[1, {'construction_year': 1900, 'available_date': 'Vereinbarung'}]

which can be converted to columns using pandas and .apply(pd.Series).
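The merging step on its own might look like the sketch below. It assumes the rows are already parsed into lists of the form [id, {"key": ..., "value": ...}, ...]; the names parsed and records are hypothetical. As a variation on .apply(pd.Series), it puts the id into the merged dict and passes the list of dicts straight to pd.DataFrame.

```python
import pandas as pd

# Rows already parsed into [id, pair, pair, ...] lists (illustrative data).
parsed = [
    [1, {"key": "construction_year", "value": 1900},
        {"key": "available_date", "value": "Vereinbarung"}],
    [2, {"key": "available_date", "value": "01.04.2022"},
        {"key": "useful_area", "value": 60.0}],
]

records = []
for row_id, *pairs in parsed:
    record = {"id": row_id}
    for pair in pairs:
        # Skip pairs without "key"; a missing "value" is stored as None.
        if "key" in pair:
            record[pair["key"]] = pair.get("value")
    records.append(record)

# A list of dicts becomes one column per key; absent keys are filled with NaN.
df = pd.DataFrame(records)
print(df.to_string())
```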

I use text as a string here, but you could read it from a file:

text = '''"1, {""key"": ""construction_year"", ""value"": 1900}, {""key"": ""available_date"", ""value"": ""Vereinbarung""}"
"2, {""key"": ""available_date"", ""value"": ""01.04.2022""}, {""key"": ""useful_area"", ""value"": 60.0}"
"3, {""key"": ""construction_year"", ""value"": 2020}, {""key"": ""available_date"", ""value"": ""sofort""}"
"4, {""key"": ""available_date"", ""value"": ""Vereinbarung""}, {""key"": ""wheelchair_accessible"", ""value"": true}"
'''

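import json
import pandas as pd

# text = open('data.csv').read()  # alternatively, read the rows from a file

rows = []
for line in text.splitlines():
    # undo the CSV escaping: doubled quotes become single quotes
    line = line.replace('""', '"')
    # drop the outer quotes and wrap the row in [] to get a JSON list
    line = '[' + line[1:-1] + ']'
    line = json.loads(line)

    # merge the {"key": ..., "value": ...} dicts into one dict per row
    item = {}
    for d in line[1:]:
        key = d['key']
        val = d['value']
        item[key] = val

    rows.append([line[0], item])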
    
df = pd.DataFrame(rows, columns=['id', 'data'])

# convert dictionaries to columns
df = df.join(df['data'].apply(pd.Series))

# remove column with dictionaries
del df['data']

print(df.to_string())

Result:

    id  construction_year available_date  useful_area wheelchair_accessible
0   1             1900.0   Vereinbarung          NaN                   NaN
1   2                NaN     01.04.2022         60.0                   NaN
2   3             2020.0         sofort          NaN                   NaN
3   4                NaN   Vereinbarung          NaN                  True
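
The code above still relies on json.loads, so it stops at the first malformed row. For rows with a missing "value" or a stray {, a regex-based variant may be more forgiving. This is only a sketch under assumptions: the names PAIR, ROW_ID, convert and parse_line are made up, "key" is assumed to come before "value" in every pair, and string values are assumed not to contain quotes. Pairs that do not match the pattern are silently skipped.

```python
import re
import pandas as pd

# "key": "<name>", "value": <quoted string or bare token such as 1900 or true>
PAIR = re.compile(r'"key"\s*:\s*"([^"]*)"\s*,\s*"value"\s*:\s*("[^"]*"|[^,{}\s"]+)')
# Leading row id, optionally preceded by the outer CSV quote.
ROW_ID = re.compile(r'\s*"?\s*(\d+)')

def convert(token):
    """Turn the raw value text into str, bool, int or float where possible."""
    if token.startswith('"'):
        return token[1:-1]
    if token in ('true', 'false'):
        return token == 'true'
    for cast in (int, float):
        try:
            return cast(token)
        except ValueError:
            pass
    return token  # unknown bare token (e.g. null) is kept as text

def parse_line(line):
    """Return one dict per row, or None if the row has no leading id."""
    line = line.replace('""', '"')  # undo the CSV quote doubling
    m = ROW_ID.match(line)
    if m is None:
        return None
    record = {'id': int(m.group(1))}
    for key, raw in PAIR.findall(line):
        record[key] = convert(raw)
    return record

lines = [
    '"1, {""key"": ""construction_year"", ""value"": 1900}, {""key"": ""available_date"", ""value"": ""Vereinbarung""}"',
    # made-up malformed row: doubled brace and a key without a value
    '"2, {{""key"": ""useful_area"", ""value"": 60.0}, {""key"": ""available_date""}"',
    '"4, {""key"": ""wheelchair_accessible"", ""value"": true}"',
]

records = [r for r in map(parse_line, lines) if r is not None]
df = pd.DataFrame(records)
print(df.to_string())
```

Because nothing is validated, it is worth counting how many pairs each row yields and inspecting the rows that yield fewer than expected.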
