Python - Parsing key-value pairs into DataFrame columns
I have a dataset with key-value pairs in a CSV file that looks similar to this:
"1, {""key"": ""construction_year"", ""value"": 1900}, {""key"": ""available_date"", ""value"": ""Vereinbarung""}"
"2, {""key"": ""available_date"", ""value"": ""01.04.2022""}, {""key"": ""useful_area"", ""value"": 60.0}"
"3, {""key"": ""construction_year"", ""value"": 2020}, {""key"": ""available_date"", ""value"": ""sofort""}"
"4, {""key"": ""available_date"", ""value"": ""Vereinbarung""}, {""key"": ""wheelchair_accessible"", ""value"": true}"
My expected output is the following:
id  construction_year  available_date  useful_area  wheelchair_accessible
1   1900               Vereinbarung    nan          nan
2   nan                01.04.2022      60.0         nan
3   2020               sofort          nan          nan
4   nan                Vereinbarung    nan          true
I have already tried to convert this data to a dict using json.loads and then parse it. This works on a small scale, when I can be sure that every row is well-formed JSON.
However, when I run json.loads on a list of the 200'000 observations, I always get errors because not all rows are valid JSON. For example, sometimes a "value" is missing for a "key", and sometimes there is a { in the wrong place, so json.loads fails with the following error: JSONDecodeError: Expecting property name enclosed in double quotes
I have tried for days to fix the whole dataset into a JSON-friendly format, but this seems impossible: I received the data in a terrible, very messy format.
So I would like to know whether someone could come up with a function that splits the key-value pairs into individual columns without having to use json.loads.
Thanks in advance.
Comments (1)
It looks like someone scraped JavaScript code and saved it as a CSV string.
You need to convert the CSV string back to a normal string and then parse it.
Alternatively, you need to change the text in each line into correct JSON data, which can be converted to 3 columns.
Later you can merge the dictionaries of each row into one dictionary, which can be converted to columns using pandas and .apply(pd.Series).
I use text as a string, but you could read it from a file.
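The steps described above could be sketched roughly as follows. This is a minimal illustration, not the original answer's code: the helper name parse_row and the sample text are assumptions made for the example, and pd.DataFrame on a list of dicts is used in place of .apply(pd.Series). It still calls json.loads, but only on each small {...} chunk, so a malformed pair is skipped instead of breaking the whole row.

```python
import csv
import io
import json
import re

import pandas as pd

# Sample rows in the CSV-quoted form shown in the question. In practice the
# text could be read from a file instead of being hard-coded.
text = '''"1, {""key"": ""construction_year"", ""value"": 1900}, {""key"": ""available_date"", ""value"": ""Vereinbarung""}"
"2, {""key"": ""available_date"", ""value"": ""01.04.2022""}, {""key"": ""useful_area"", ""value"": 60.0}"
"3, {""key"": ""construction_year"", ""value"": 2020}, {""key"": ""available_date"", ""value"": ""sofort""}"
"4, {""key"": ""available_date"", ""value"": ""Vereinbarung""}, {""key"": ""wheelchair_accessible"", ""value"": true}"
'''


def parse_row(line):
    """Turn one unquoted line into a flat dict: {'id': ..., key: value, ...}.

    parse_row is a hypothetical helper written for this sketch.
    """
    row_id, _, rest = line.partition(",")
    record = {"id": int(row_id)}
    # Pick out every {...} chunk that contains no nested braces.
    for chunk in re.findall(r"\{[^{}]*\}", rest):
        try:
            pair = json.loads(chunk)
        except json.JSONDecodeError:
            continue  # skip a malformed pair, keep the rest of the row
        if "key" in pair:
            # A pair with a missing "value" is stored as None.
            record[pair["key"]] = pair.get("value")
    return record


# csv.reader undoes the CSV quoting: each line is a single quoted field.
lines = [row[0] for row in csv.reader(io.StringIO(text)) if row]

# One dict per row; pandas fills keys that are absent in a row with NaN.
df = pd.DataFrame([parse_row(line) for line in lines])
print(df)
```

Rows whose id is not an integer would raise a ValueError in this sketch, so real data may need an extra check at that point.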