
Extracting country names from author affiliations

Posted on 2024-10-22 06:50:05

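I am currently exploring whether country names can be extracted from author affiliations (PubMed articles). My sample data looks like this:

Mechanical and Production Engineering Department, National University of Singapore.

Cancer Research Campaign Mammalian Cell DNA Repair Group, Department of Zoology, Cambridge, U.K.

Cancer Research Campaign Mammalian Cell DNA Repair Group, Department of Zoology, Cambridge, UK.

Lilly Research Laboratories, Eli Lilly and Company, Indianapolis, IN 46285.

Initially, I tried removing the punctuation, splitting each string into words, and comparing the words with a list of country names from Wikipedia, but that did not work.

Can anyone suggest a better approach? I would prefer a solution in R, because I have to do further analysis and generate graphics in R.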


莳間冲淡了誓言ζ 2024-10-29 06:50:05

Here is a simple solution that might get you started. It uses the city and country data that ships with the maps package. If you can get hold of a better database, it should be simple to modify the code.

library(maps)
library(plyr)

# Load data from package maps
data(world.cities)

# Create test data
aa <- c(
    "Mechanical and Production Engineering Department, National University of Singapore.",
    "Cancer Research Campaign Mammalian Cell DNA Repair Group, Department of Zoology, Cambridge, U.K.",
    "Cancer Research Campaign Mammalian Cell DNA Repair Group, Department of Zoology, Cambridge, UK.",
    "Lilly Research Laboratories, Eli Lilly and Company, Indianapolis, IN 46285."
)

# Remove punctuation from data (gsub takes pattern, replacement, then the input vector)
caa <- gsub("[[:punct:]]", "", aa)

# Split data at word boundaries
saa <- strsplit(caa, " ")

# Match on cities in world.cities
# Assumes that if multiple matches, the last takes precedence, i.e. max()
llply(saa, function(x)x[max(which(x %in% world.cities$name))])

# Match on countries in world.cities$country.etc
llply(saa, function(x)x[which(x %in% world.cities$country.etc)])

This is the result for cities:

[[1]]
[1] "Singapore"

[[2]]
[1] "Cambridge"

[[3]]
[1] "Cambridge"

[[4]]
[1] "Indianapolis"

And the result for countries:

[[1]]
[1] "Singapore"

[[2]]
[1] "UK"

[[3]]
[1] "UK"

[[4]]
character(0)

With a bit of data cleanup you may be able to do something with this.
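
As one possible piece of that cleanup, the fourth affiliation returns nothing for the country because it ends in a US state abbreviation and ZIP code rather than a country name. The sketch below is not part of the original answer: `guess_usa` is a hypothetical helper, and treating "two capital letters followed by five digits" as a US address is an assumption that will not hold for every data set. It relies only on base R and the built-in `state.abb` vector.

```r
# Test data from the answer above
aa <- c(
    "Mechanical and Production Engineering Department, National University of Singapore.",
    "Cancer Research Campaign Mammalian Cell DNA Repair Group, Department of Zoology, Cambridge, U.K.",
    "Cancer Research Campaign Mammalian Cell DNA Repair Group, Department of Zoology, Cambridge, UK.",
    "Lilly Research Laboratories, Eli Lilly and Company, Indianapolis, IN 46285."
)

# Hypothetical helper: returns "USA" when a string contains a US state
# abbreviation followed by a five-digit ZIP code, otherwise NA.
guess_usa <- function(x) {
    # Capture the two capital letters that precede a five-digit number
    m <- regmatches(x, regexec("\\b([A-Z]{2}) [0-9]{5}\\b", x, perl = TRUE))
    # Each element of m is c(full match, captured group), or empty if no match
    abb <- vapply(m, function(g) if (length(g) == 2) g[2] else NA_character_,
                  character(1))
    # state.abb ships with base R (datasets package)
    ifelse(abb %in% state.abb, "USA", NA_character_)
}

usa_guess <- guess_usa(aa)
```

For the sample data, only the fourth entry is flagged as "USA"; the others stay NA and can still be handled by the country match above.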


灵芸 2024-10-29 06:50:05

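One approach could be to split the strings to isolate the geographic information (for example, by deleting everything before the first comma) and then submit the result to a geocoding service.

For example, the Google geocoding API lets you send an address and get back a location along with the corresponding geographic information, such as the country. I don't think there is a ready-to-use R package for this, but you can find some functions here, for example:

Geocoding in R with Google Maps

There are also extensions in other languages, such as Ruby:

http://geokit.rubyforge.org/

It also depends on how many observations you have: if I remember correctly, the free Google API is limited to around 200 addresses per IP per day.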
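
The string-splitting step described above can be sketched in base R as follows. This covers only the "delete everything before the first comma" part. The call to a geocoding service is deliberately left out, since its interface is not specified in this answer, and the variable name `geo` is just an illustrative choice.

```r
# Sample affiliations from the question
aa <- c(
    "Mechanical and Production Engineering Department, National University of Singapore.",
    "Cancer Research Campaign Mammalian Cell DNA Repair Group, Department of Zoology, Cambridge, U.K.",
    "Cancer Research Campaign Mammalian Cell DNA Repair Group, Department of Zoology, Cambridge, UK.",
    "Lilly Research Laboratories, Eli Lilly and Company, Indianapolis, IN 46285."
)

# Drop everything up to and including the first comma (and following spaces);
# strings without a comma are returned unchanged
geo <- sub("^[^,]*, *", "", aa)

print(geo)
```

Each remaining string (for instance "Eli Lilly and Company, Indianapolis, IN 46285.") is what would then be submitted to the geocoder.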


蓝眸 2024-10-29 06:50:05

@Andrie's answer is nice, but it misses cities and countries whose names are more than one word, e.g. New Zealand or New York. The second example is a concern because it would be labelled as a match to York, UK rather than New York, USA.

This alternative should capture those cases a bit better.

library(maps)
library(plyr)

# Load data from package maps
data(world.cities)

# Create test data
aa <- c(
    "Mechanical and Production Engineering Department, National University of Singapore.",
    "Cancer Research Campaign Mammalian Cell DNA Repair Group, Department of Zoology, Cambridge, U.K.",
    "Cancer Research Campaign Mammalian Cell DNA Repair Group, Department of Zoology, Cambridge, UK.",
    "Lilly Research Laboratories, Eli Lilly and Company, Indianapolis, IN 46285."
)

saa <- sapply(aa, strsplit, split = ", ", USE.NAMES = FALSE)
llply(saa, function(x)x[which(x %in% world.cities$name)])
llply(saa, function(x)x[which(x %in% world.cities$country.etc)])

The downside is that any entry without a separate country or city field will not return anything, e.g. the University of Singapore example.

Cities:

[[1]]
character(0)

[[2]]
[1] "Cambridge"

[[3]]
[1] "Cambridge"

[[4]]
[1] "Indianapolis"

That is less of an issue for me than the multi-word city/country problem. Choose whichever is a better fit for your data. Maybe there's a way of combining the two?
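
One way the two approaches might be combined is to match whole comma-separated fields first, which keeps multi-word names intact, and fall back to single words only when no field matches. The sketch below is an assumption-laden illustration, not code from either answer: `match_place` is a hypothetical helper, `countries` is a small stand-in list, and the fifth affiliation is an invented example added to exercise the multi-word case. With the maps package loaded, `unique(world.cities$country.etc)` could be passed as `places` instead.

```r
# Hypothetical helper: try comma-separated fields first, then single words
match_place <- function(x, places) {
    clean <- function(s) gsub("[[:punct:]]", "", s)
    # Whole fields, e.g. "New Zealand" or "UK" (after removing punctuation)
    fields <- clean(strsplit(x, ", ")[[1]])
    hit <- fields[fields %in% places]
    if (length(hit) == 0) {
        # Fallback: individual words, as in the first answer
        words <- strsplit(clean(x), " ")[[1]]
        hit <- words[words %in% places]
    }
    hit
}

# Stand-in lookup list (an assumption for illustration only)
countries <- c("Singapore", "UK", "USA", "New Zealand")

aa <- c(
    "Mechanical and Production Engineering Department, National University of Singapore.",
    "Cancer Research Campaign Mammalian Cell DNA Repair Group, Department of Zoology, Cambridge, U.K.",
    "Cancer Research Campaign Mammalian Cell DNA Repair Group, Department of Zoology, Cambridge, UK.",
    "Lilly Research Laboratories, Eli Lilly and Company, Indianapolis, IN 46285.",
    "Department of Physics, University of Auckland, Auckland, New Zealand."  # invented example
)

results <- lapply(aa, match_place, places = countries)
```

With this stand-in list, the Singapore entry is recovered through the word fallback and the invented entry matches "New Zealand" as a whole field, while the Indianapolis entry still returns nothing because it names no country.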
