How do I find the most common words in a character vector in R?
I am analysing some fMRI data. In particular, I am looking at what sorts of cognitive functions are associated with coordinates from an fMRI scan (conducted while subjects were performing a task). My data can be obtained with the following function:
library(httr)

# Download the Neurosynth term associations for one location and
# return those with a z-score of at least 3, strongest first.
scrape_and_sort = function(neurosynth_link){
  result = content(GET(neurosynth_link), "parsed")$data
  col_names = c("Name", "z_score", "post_prob", "func_con", "meta_analytic")
  df = do.call(rbind, lapply(result, function(x) setNames(as.data.frame(x), col_names)))
  df$z_score = as.numeric(df$z_score)
  df = df[order(-df$z_score), ]
  # Keep rows by condition; df[-which(...), ] drops every row when nothing matches
  df = df[which(df$z_score >= 3), ]
  df = na.omit(df)
  return(df)
}

RO4 = scrape_and_sort('https://neurosynth.org/api/locations/-58_-22_6_6/compare')
Now, I want to know which keywords come up most often, and ideally construct a list of the most common words. I tried the following:
sort(table(RO4$Name),decreasing=TRUE)
But this clearly won't work. The problem is that the names (for example, "auditory cortex") are strings containing multiple words, so results such as 'auditory' and 'auditory cortex' come out as two separate entries, whereas I want them counted as two instances of 'auditory'.
But I am not sure how to search inside each string and record individual words like that. Any ideas?
Using the packages {jsonlite} and {dplyr}, with the pipe operator %>% for legibility:
Result:
Edit: or, if you want to proceed from your data frame:
You can of course select more columns in a number of convenient ways (see the commented lines above and ?dplyr::select). Mind that the values of the other variables will be repeated as many times as rows are needed to accommodate any multi-word entry in the "Name" column, so that will introduce some redundancy.
If you want to adopt {dplyr} style, arranging by descending z-score and excluding unwanted z-scores would read like this:
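The answer's own code is not preserved in this copy of the thread, so here is only a minimal sketch of how such a pipeline might look. It assumes {tidyr} is installed alongside {dplyr}, and it uses a small made-up data frame (`df`, with invented z-scores) in place of the scraped `RO4`. Whether the original answer used `separate_rows()` is a guess based on its remark about repeated values.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the scraped data frame; the values are made up.
df <- data.frame(
  Name    = c("auditory cortex", "auditory", "speech perception", "speech", "noise"),
  z_score = c(12.4, 11.8, 9.1, 8.7, 2.5),
  stringsAsFactors = FALSE
)

word_counts <- df %>%
  filter(z_score >= 3) %>%              # exclude unwanted (low) z-scores
  arrange(desc(z_score)) %>%            # strongest associations first
  separate_rows(Name, sep = " ") %>%    # one row per word; z_score is repeated
  count(Name, sort = TRUE, name = "n")  # tally each word, most frequent first

print(word_counts)
```

Because `separate_rows()` duplicates the other columns for every word, it is probably best to count words directly afterwards, as here, rather than carry the expanded table around.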
Not sure I understand. Can't you proceed like this:
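The code that followed is missing from this copy, so the snippet below is just one plausible base R approach, not necessarily what was posted: split every name on whitespace, flatten the pieces into a single vector, and tabulate. The vector `term_names` is a hypothetical stand-in for `RO4$Name`.

```r
# Hypothetical example terms standing in for RO4$Name.
term_names <- c("auditory cortex", "auditory", "speech perception",
                "speech", "primary auditory")

# Split each string on whitespace and flatten into one vector of words.
words <- unlist(strsplit(term_names, "\\s+"))

# Tabulate the words and sort so the most frequent come first.
word_freq <- sort(table(words), decreasing = TRUE)

print(head(word_freq, 10))
```

With the real data you would replace `term_names` with `RO4$Name`. Wrapping it in `as.character()` may be needed if the column came back as a factor.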