Parsing flattened JSON data in R
I have a txt file with the following sample data:
id,001
v1,some_value
id,002
v2,some_value
v2,some_value
id,003
v2,some_value
id,004
v4,some_value
In fact, the original data is in xml/json format, but it has been flattened, so the order of the values is important.
The idea is to get the structured data as below:
I have R code that works for the following data:
txt <- "
id,001
v1,some_value
id,002
v2,some_value
id,003
v2,some_value
id,004
v4,some_value"
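# Template holding every expected column, filled with empty strings
existing_list <- c(id = "", v1 = "", v2 = "", v3 = "", v4 = "")
df <- read.csv(textConnection(txt), header = FALSE, stringsAsFactors = FALSE)
# Start a new group at every "id" row
id_list <- split(df, cumsum(df$V1 == "id"))
do.call(rbind, lapply(id_list, function(x) {
  vec <- setNames(x$V2, x$V1)
  # Copy each key's value into the matching template slot
  existing_list[match(names(vec), names(existing_list))] <- vec
  as.data.frame(as.list(existing_list))
}))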
The problem is that it does not work for the following data, in which id 002 has two v2 entries:
txt <- "
id,001
v1,some_value
id,002
v2,some_value
v2,some_value
id,003
v2,some_value
id,004
v4,some_value"
So my question is how to modify the R code to make it work for the second dataset.
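One likely reason the second dataset gives trouble can be sketched in a few lines: setNames() produces a vector with two v2 names, match() sends both of them to the same slot of the template, and the later assignment overwrites the earlier one. The values "first" and "second" below are made up for illustration so that the overwrite is visible; they are not part of the sample data.

```r
# Named vector with a duplicated key, shaped like the group for id 002
# ("first" and "second" are illustrative values, not from the sample data)
vec <- c(id = "002", v2 = "first", v2 = "second")
template <- c(id = "", v1 = "", v2 = "", v3 = "", v4 = "")

# match() maps both v2 entries to the same position in the template
pos <- match(names(vec), names(template))
print(pos)

# Assigning through a repeated index keeps only the last value
template[pos] <- vec
print(template)
```

So the code runs without an error, but one of the two v2 values is silently dropped.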
Another approach would be to convert the flattened txt data to json; then, with a package like rjson, it would be easy to parse the data. But I have no idea how to do it.
{
  "items": [
    {
      "id": "001",
      "attributes": [
        {
          "v1": "some_value"
        }
      ]
    },
    {
      "id": "002",
      "attributes": [
        {
          "v2": "some_value"
        },
        {
          "v2": "some_value"
        }
      ]
    },
    {
      "id": "003",
      "attributes": [
        {
          "v2": "some_value"
        }
      ]
    }
  ]
}
[update] akrun provided a very useful answer, but then I realized that the structure can be nested.
txt <- "id,001
v1,some_value
id,002
v1,some_value
subid,002001
v2,valuev2_1
subid,002002"
This is to be transformed into:
The red part is to be completed.
With the answer akrun provided, I think we would not be able to distinguish the previous data from this one:
txt <- "id,001
v1,some_value
id,002
v1,some_value
subid,002001
subid,002002
v2,valuev2_1"
Because when examining the columns of the tibble, we get the same result:
So the ideal solution would be to convert the csv to json, with the hierarchical structure of the keys provided, of course. But maybe I am wrong.
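To make the ambiguity concrete, here is a minimal base R sketch that walks the rows with a fixed key hierarchy (id above subid). It assumes that an attribute row belongs to the most recently opened id or subid, which the post does not confirm. The names read_flat, build_nested, subitems, and in_sub are hypothetical.

```r
# Hypothetical helper: read the flattened key,value text as character columns
read_flat <- function(txt) {
  read.csv(text = txt, header = FALSE, stringsAsFactors = FALSE,
           colClasses = "character")
}

# Hypothetical helper: walk the rows and nest subid blocks under their id.
# Assumption: an attribute belongs to the most recently opened id or subid.
build_nested <- function(df) {
  items <- list()
  in_sub <- FALSE
  for (i in seq_len(nrow(df))) {
    key <- df$V1[i]
    value <- df$V2[i]
    n <- length(items)
    if (n == 0 && key != "id") next  # skip rows that precede the first id
    if (key == "id") {
      items[[n + 1]] <- list(id = value, attributes = list(), subitems = list())
      in_sub <- FALSE
    } else if (key == "subid") {
      m <- length(items[[n]]$subitems)
      items[[n]]$subitems[[m + 1]] <- list(subid = value, attributes = list())
      in_sub <- TRUE
    } else if (in_sub) {
      m <- length(items[[n]]$subitems)
      k <- length(items[[n]]$subitems[[m]]$attributes)
      items[[n]]$subitems[[m]]$attributes[[k + 1]] <- setNames(list(value), key)
    } else {
      k <- length(items[[n]]$attributes)
      items[[n]]$attributes[[k + 1]] <- setNames(list(value), key)
    }
  }
  items
}

txt_a <- "id,001
v1,some_value
id,002
v1,some_value
subid,002001
v2,valuev2_1
subid,002002"

txt_b <- "id,001
v1,some_value
id,002
v1,some_value
subid,002001
subid,002002
v2,valuev2_1"

a <- build_nested(read_flat(txt_a))
b <- build_nested(read_flat(txt_b))
str(a[[2]]$subitems)
str(b[[2]]$subitems)
```

Under that assumption the two inputs give different nested lists: v2 attaches to subid 002001 in the first and to subid 002002 in the second.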
One step to be accomplished is to transform the tibble with list-cols into a tibble with normal columns.
Comments (2)
We may reshape to 'wide' format with pivot_wider.
For the second example
-output
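As a rough illustration of the wide reshape, here is a base R sketch. It is not the answer's original code and does not use pivot_wider; it is a swapped-in technique that assumes repeated keys within one id should become extra rows for that id. The helper name to_wide is hypothetical.

```r
txt <- "id,001
v1,some_value
id,002
v2,some_value
v2,some_value
id,003
v2,some_value
id,004
v4,some_value"

# Read both columns as character so that "001" keeps its leading zeros
df <- read.csv(text = txt, header = FALSE, stringsAsFactors = FALSE,
               colClasses = "character")

keys <- c("id", "v1", "v2", "v3", "v4")

# Hypothetical helper: turn one id group into one or more wide rows.
# Assumption: a key that repeats within a group spills into extra rows.
to_wide <- function(rec, keys) {
  # One character vector per key; keys absent from the group come out empty
  vals <- split(rec$V2, factor(rec$V1, levels = keys))
  n <- max(lengths(vals))
  # Pad every key to the same length with NA
  cols <- lapply(vals, function(v) c(v, rep(NA_character_, n - length(v))))
  # Repeat the id on every row of the group
  cols$id <- rep(vals$id[1], n)
  as.data.frame(cols, stringsAsFactors = FALSE)
}

records <- split(df, cumsum(df$V1 == "id"))
wide <- do.call(rbind, lapply(records, to_wide, keys = keys))
rownames(wide) <- NULL
print(wide)
```

With the second example this keeps both v2 values for id 002, one per row, instead of overwriting the first with the second.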
Going with the JSON build approach, consider reading the text data into a data frame and walking down the rows:
Input
Process
Output
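The input, process, and output steps could look roughly like the following base R sketch. It is an illustration under stated assumptions rather than the answer's original code: every non-id row is treated as an attribute of the most recent id, and the final conversion uses rjson::toJSON only if the rjson package happens to be installed.

```r
# Input: the flattened key,value text from the second example
txt <- "id,001
v1,some_value
id,002
v2,some_value
v2,some_value
id,003
v2,some_value
id,004
v4,some_value"

df <- read.csv(text = txt, header = FALSE, stringsAsFactors = FALSE,
               colClasses = "character")

# Process: walk down the rows, opening a new item at each id and
# appending every other row to the current item's attributes
items <- list()
for (i in seq_len(nrow(df))) {
  key <- df$V1[i]
  value <- df$V2[i]
  n <- length(items)
  if (key == "id") {
    items[[n + 1]] <- list(id = value, attributes = list())
  } else if (n > 0) {
    k <- length(items[[n]]$attributes)
    items[[n]]$attributes[[k + 1]] <- setNames(list(value), key)
  }
}
nested <- list(items = items)

# Output: inspect the nested list, then serialise it if rjson is available
str(nested, max.level = 3)
if (requireNamespace("rjson", quietly = TRUE)) {
  cat(rjson::toJSON(nested), "\n")
}
```

Because each attribute is stored as its own one-element list, repeated keys such as the two v2 rows under id 002 are preserved in order.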