How do I find the most common words in a character vector in R?

Posted on 2025-01-10 06:02:28

I am analysing some fMRI data. In particular, I am looking at what sorts of cognitive functions are associated with coordinates from an fMRI scan (conducted while subjects were performing a task). My data can be obtained with the following function:

library(httr)

# Download the Neurosynth associations for a location and keep rows with z >= 3
scrape_and_sort = function(neurosynth_link){
  result = content(GET(neurosynth_link), "parsed")$data
  col_names = c("Name", "z_score", "post_prob", "func_con", "meta_analytic")
  df = do.call(rbind, lapply(result, function(x) setNames(as.data.frame(x), col_names)))
  df$z_score = as.numeric(df$z_score)
  df = na.omit(df)
  # Logical indexing stays correct when no row has z_score < 3;
  # df[-which(...), ] would return zero rows in that case
  df = df[df$z_score >= 3, ]
  df = df[order(-df$z_score), ]
  return(df)
}
RO4 = scrape_and_sort('https://neurosynth.org/api/locations/-58_-22_6_6/compare')

Now, I want to know which keywords come up most often and, ideally, construct a list of the most common words. I tried the following:

sort(table(RO4$Name),decreasing=TRUE)

But this clearly won't work. The problem is that the names (for example, "auditory cortex") are strings containing multiple words, so results such as 'auditory' and 'auditory cortex' come out as two separate entries, whereas I want them counted as two instances of 'auditory'.

But I am not sure how to search inside each string and record individual words like that. Any ideas?

无声静候 2025-01-17 06:02:28

Using the packages {jsonlite}, {dplyr} and {tidyr} (which provides separate_rows()), plus the pipe operator %>% for legibility:

store the response as a data frame df

library(dplyr)
library(tidyr)

url <- 'https://neurosynth.org/api/locations/-58_-22_6_6/compare/'
df <- jsonlite::fromJSON(url) %>% as.data.frame

reshape and aggregate

df %>%
    ## keep first column only and name it 'keywords':
    select('keywords' = 1) %>%
    ## multiple cell values (as separated by a blank)
    ## into separate rows:
    separate_rows(keywords, sep = " ") %>%
    group_by(keywords) %>%
    summarise(count = n()) %>%
    arrange(desc(count))

result:

# A tibble: 965 x 2
   keywords count
   <chr>    <int>
 1 cortex      53
 2 gyrus       26
 3 temporal    26
 4 parietal    23
 5 task        22
 6 anterior    19
 7 frontal     18
 8 visual      17
 9 memory      16
10 motor       16
# ... with 955 more rows
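
To check the reshaping step without calling the API, here is a minimal sketch on a small made-up data frame. The object `toy` and its values are invented for illustration and are not part of the Neurosynth response. It also uses `dplyr::count()` with `sort = TRUE` as a shorthand for the `group_by()`, `summarise()` and `arrange()` steps, which is a substitution on my part rather than the pipeline shown above.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data, not taken from the Neurosynth response
toy <- data.frame(
  keywords = c("auditory cortex", "auditory", "visual cortex", "auditory"),
  stringsAsFactors = FALSE
)

word_counts <- toy %>%
  # split multi-word cells on blanks, one word per row
  separate_rows(keywords, sep = " ") %>%
  # count each word and sort by descending frequency
  count(keywords, sort = TRUE, name = "count")

print(word_counts)
```

On this toy input, "auditory" is counted three times and "cortex" twice, which is the behaviour the question asks for.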

Edit: or, if you want to proceed from your data frame RO4:

RO4 %>%
    select(Name) %>%
    ## select(everything())
    ## select(Name:func_con)
    separate_rows(Name, sep = ' ') %>%
    ## remaining steps as above:
    group_by(Name) %>%
    summarise(count = n()) %>%
    arrange(desc(count))

You can of course select more columns in a number of convenient ways (see the commented lines above and ?dplyr::select). Mind that the values of the other variables will be repeated as many times as rows are needed to accommodate any multi-word value in column "Name", so that will introduce some redundancy.
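
The redundancy can be seen in a small sketch. The data frame `toy` below is hypothetical, with invented names and scores, and only serves to show how `separate_rows()` copies the other columns:

```r
library(tidyr)

# Hypothetical two-row data frame; names and scores are invented
toy <- data.frame(
  Name = c("auditory cortex", "speech"),
  z_score = c(9.1, 7.4),
  stringsAsFactors = FALSE
)

# One row per word; the z_score of the original row is copied to each word
long <- separate_rows(toy, Name, sep = " ")
print(long)
```

Here "auditory" and "cortex" each get their own row and both carry the z-score 9.1, so any later summary of `z_score` would count that value twice.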

If you want to adopt {dplyr} style, arranging by descending z-score and excluding unwanted z-scores would read like this:

RO4 %>%
    ## keep z-scores of at least 3; filter() also drops rows where z_score is NA
    filter(z_score >= 3) %>%
    arrange(desc(z_score))

耶耶耶 2025-01-17 06:02:28

I'm not sure I understand. Can't you proceed like this?

x <- c("auditory cortex", "auditory", "auditory", "hello friend")
unlist(strsplit(x, " "))
# "auditory" "cortex"   "auditory" "auditory" "hello"    "friend"
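
To get from the split words to counts, one option is to tabulate them with `table()`. This is a minimal base R sketch reusing the example vector above; the names `words` and `word_counts` are just illustrative.

```r
x <- c("auditory cortex", "auditory", "auditory", "hello friend")

# Split every string on single spaces and flatten into one vector of words
words <- unlist(strsplit(x, " ", fixed = TRUE))

# Tabulate the words and sort from most to least frequent
word_counts <- sort(table(words), decreasing = TRUE)
print(word_counts)

# The most common word(s); ties are all returned
most_common <- names(word_counts)[word_counts == max(word_counts)]
print(most_common)
```

Applied to the question's data, `x` would be `RO4$Name`. Splitting on a single space assumes the names contain no repeated or leading blanks.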
