Parsing flattened JSON data with R

Posted on 2025-01-17 03:23:24


I have a txt file with the following sample data:

id,001
v1,some_value
id,002
v2,some_value
v2,some_value
id,003
v2,some_value
id,004
v4,some_value

In fact, the original data is in XML/JSON format, but it has been flattened, so the order of the values is important.

The idea is to get the structured data as below:

I have R code that works as follows:

txt <- "
id,001
v1,some_value
id,002
v2,some_value
id,003
v2,some_value
id,004
v4,some_value"

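# Template with every expected column, so missing keys stay as ""
existing_list <- c(id = "", v1 = "", v2 = "", v3 = "", v4 = "")

df <- read.csv(text = txt, header = FALSE, stringsAsFactors = FALSE)

# Start a new group at every "id" row
id_list <- split(df, cumsum(df$V1 == "id"))

# Fill the template for each group and stack the one-row data frames
do.call(rbind, lapply(id_list, function(x) {
  vec <- setNames(x$V2, x$V1)
  existing_list[match(names(vec), names(existing_list))] <- vec
  as.data.frame(as.list(existing_list))
}))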

The problem is that it does not work for the following data, where a key (v2) is repeated within one id:

txt <- "
id,001
v1,some_value
id,002
v2,some_value
v2,some_value
id,003
v2,some_value
id,004
v4,some_value"

So my question is how to modify the R code to make it work for the second dataset.
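
To make the failure concrete, here is a minimal sketch of what appears to happen for the group `id,002`. As far as I can tell the code does not raise an error; it silently keeps only one of the two `v2` values. The values "first" and "second" are made up for illustration so that the overwrite is visible.

```r
# Named vector for the group "id,002" of the second dataset.
# "first" and "second" are illustrative values, not from the original data.
vec <- c(id = "002", v2 = "first", v2 = "second")
existing_list <- c(id = "", v1 = "", v2 = "", v3 = "", v4 = "")

# match() maps both v2 entries to the same position in the template
pos <- match(names(vec), names(existing_list))
pos

# Assigning to a repeated index keeps only the last value
existing_list[pos] <- vec
existing_list
```

So any fix has to give a repeated key more than one slot, for example several rows per id or a list of attributes.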

Another approach would be to convert the flattened txt data to JSON; with a package like rjson it would then be easy to parse the data. But I have no idea how to do it.

{
  "items": [
    {
      "id": "001",
      "attributes": [
        {
          "v1": "some_value"
        }
      ]
    },
    {
      "id": "002",
      "attributes": [
        {
          "v2": "some_value"
        },
        {
          "v2": "some_value"
        }
      ]
    },
    {
      "id": "003",
      "attributes": [
        {
          "v2": "some_value"
        }
      ]
    }
  ]
}

[update] akrun provided a very useful answer, but then I realized that the structure can be nested.

txt <- "id,001
v1,some_value
id,002
v1,some_value
subid,002001
v2,valuev2_1
subid,002002"

This is to be transformed into

the red part to be completed.

And with the answer akrun provided, I think that we would not be able to distinguish the previous data from this one:

txt <- "id,001
v1,some_value
id,002
v1,some_value
subid,002001
subid,002002
v2,valuev2_1"

Because when examining the columns of the tibble, we have the same:

So the ideal solution would be to convert the CSV to JSON, with the hierarchical structure of the keys provided, of course. But maybe I am wrong.

One step to be accomplished is to transform the tibble with list-cols into a tibble with normal columns.
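
For the nested case described above, one possible sketch is to walk the rows in order and attach each attribute to the most recent container. It assumes the key hierarchy is supplied by hand as `levels = c("id", "subid")`; the helper names `build_node` and `build_tree` and the `children` field are hypothetical and not part of any package. It uses base R only.

```r
# Hypothetical helpers, base R only.
build_node <- function(chunk, levels) {
  # The first row of `chunk` holds the container key levels[1]
  node <- setNames(list(chunk$V2[1]), levels[1])
  rest <- chunk[-1, , drop = FALSE]
  # Key of the next level down, if there is one
  child <- if (length(levels) > 1) levels[2] else NA_character_
  grp <- cumsum(rest$V1 %in% child)
  # Rows before the first child container are attributes of this node
  attrs <- rest[grp == 0, , drop = FALSE]
  node$attributes <- lapply(seq_len(nrow(attrs)), function(i) {
    setNames(list(attrs$V2[i]), attrs$V1[i])
  })
  # Every child container starts a new chunk, handled recursively
  kids <- rest[grp > 0, , drop = FALSE]
  node$children <- lapply(unname(split(kids, grp[grp > 0])),
                          build_node, levels = levels[-1])
  node
}

build_tree <- function(df, levels = c("id", "subid")) {
  chunks <- unname(split(df, cumsum(df$V1 == levels[1])))
  lapply(chunks, build_node, levels = levels)
}

read_flat <- function(txt) {
  read.csv(text = txt, header = FALSE, colClasses = "character")
}

txt_a <- "id,001
v1,some_value
id,002
v1,some_value
subid,002001
v2,valuev2_1
subid,002002"

txt_b <- "id,001
v1,some_value
id,002
v1,some_value
subid,002001
subid,002002
v2,valuev2_1"

tree_a <- build_tree(read_flat(txt_a))
tree_b <- build_tree(read_flat(txt_b))

# In txt_a the v2 value belongs to subid 002001, in txt_b to subid 002002
str(tree_a[[2]]$children)
str(tree_b[[2]]$children)
```

Because the position of `v2` relative to the `subid` rows decides where it is attached, the two inputs give different trees. The resulting list could then be serialized with a JSON package such as jsonlite.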


荒路情人 2025-01-24 03:23:24

We may reshape to 'wide' format with pivot_wider

library(dplyr)
library(tidyr)
out <- df %>%
    group_by(grp = cumsum(V1 == 'id')) %>%
    mutate(id = first(V2)) %>%
    ungroup %>%
    filter(V1 != 'id') %>% 
    pivot_wider(names_from = V1, values_from = V2)

For the second example

library(purrr)
split(df, cumsum(df$V1 == "id")) %>%
   map_dfr(~ {
       x1 <- split(.x$V2, .x$V1)
       mx <- max(lengths(x1))
     map_dfr(x1, `length<-`, mx)}) %>% 
  fill(id, v1, .direction = "downup")

-output

# A tibble: 3 × 4
  id    v1         subid  v2       
  <chr> <chr>      <chr>  <chr>    
1 001   some_value <NA>   <NA>     
2 002   some_value 002001 valuev2_1
3 002   some_value 002002 <NA>
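
The padding step in the second snippet can be looked at in isolation. The sketch below uses base R only and the values of the `id,002` group; it shows that `length<-` extends each shorter vector with NA up to the longest one, which seems to be why `fill()` is needed afterwards.

```r
# Values and keys of the "id,002" group, in their original order
values <- c("002", "some_value", "002001", "valuev2_1", "002002")
keys   <- c("id", "v1", "subid", "v2", "subid")

# One vector per key; subid ends up with two elements
x1 <- split(values, keys)
mx <- max(lengths(x1))

# Pad every vector with NA up to the longest length
padded <- lapply(x1, `length<-`, mx)
padded
```

Note that the padding pairs values purely by position within each key. The single `v2` value lands in the first row whether it followed `subid,002001` or `subid,002002` in the input, which is the ambiguity raised in the question's update.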


极度宠爱 2025-01-24 03:23:24

Taking the JSON-building approach, consider reading the text data into a data frame and walking down the rows:

Input

library(jsonlite)

txt <- "
id,001
v1,some_value
id,002
v2,some_value
v2,some_value
id,003
v2,some_value
id,004
v4,some_value"

Process

# BUILD DATA FRAME FROM TEXT
lines_df <- read.csv(text=txt, header=FALSE)

# BUILD NESTED LIST
lines_lst <- list(items = list())
for(row in 1:nrow(lines_df)) {   
   if(lines_df$V1[row] == "id"){
     lines_lst$items[[row]] <- list(id = lines_df$V2[row])
     lines_lst$items[[row]]$attributes <- list()
     curr <- row
     i <- 1
   }  else {
     lines_lst$items[[curr]]$attributes[[i]] <- setNames(
       list(lines_df$V2[row]), lines_df$V1[row]
     )
     i <- i + 1
   }
}

# REMOVE NULLs
lines_lst$items <- Filter(length, lines_lst$items)

# OUTPUT TO JSON
json_output <- toJSON(lines_lst, pretty=TRUE)

Output

json_output
{
  "items": [
    {
      "id": ["001"],
      "attributes": [
        {
          "v1": ["some_value"]
        }
      ]
    },
    {
      "id": ["002"],
      "attributes": [
        {
          "v2": ["some_value"]
        },
        {
          "v2": ["some_value"]
        }
      ]
    },
    {
      "id": ["003"],
      "attributes": [
        {
          "v2": ["some_value"]
        }
      ]
    },
    {
      "id": ["004"],
      "attributes": [
        {
          "v4": ["some_value"]
        }
      ]
    }
  ]
}
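
The output above wraps every value in a one-element array (`["001"]`), whereas the JSON in the question uses plain strings. A small sketch, assuming that scalar form is what is wanted: jsonlite's `toJSON()` has an `auto_unbox` argument that writes length-one vectors as scalars. The list below is a shortened, hand-built stand-in for `lines_lst`.

```r
library(jsonlite)

# Shortened stand-in for lines_lst with a single item
items <- list(items = list(
  list(id = "001", attributes = list(list(v1 = "some_value")))
))

# Default: length-one vectors become one-element arrays
boxed <- toJSON(items, pretty = TRUE)

# auto_unbox = TRUE: length-one vectors become JSON scalars
unboxed <- toJSON(items, pretty = TRUE, auto_unbox = TRUE)
unboxed
```

With the answer's own data this would be `toJSON(lines_lst, pretty = TRUE, auto_unbox = TRUE)`.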
