Detecting the language of text in R

Posted on 2024-12-15 00:44:05

I have a list of tweets and I would like to keep only those that are in English.

How can I do this?

Comments (7)

沒落の蓅哖 2024-12-22 00:44:05

The textcat package does this. It can detect 74 'languages' (more properly, language/encoding combinations), more with other extensions. Details and examples are in this freely available article:

Hornik, K., Mair, P., Rauch, J., Geiger, W., Buchta, C., & Feinerer, I. (2013). The textcat Package for n-Gram Based Text Categorization in R. Journal of Statistical Software, 52(6), 1-17.

Here's the abstract:

Identifying the language used will typically be the first step in most natural language processing tasks. Among the wide variety of language identification methods discussed in the literature, the ones employing the Cavnar and Trenkle (1994) approach to text categorization based on character n-gram frequencies have been particularly successful. This paper presents the R extension package textcat for n-gram based text categorization which implements both the Cavnar and Trenkle approach as well as a reduced n-gram approach designed to remove redundancies of the original approach. A multi-lingual corpus obtained from the Wikipedia pages available on a selection of topics is used to illustrate the functionality of the package and the performance of the provided language identification methods.

And here's one of their examples:

library("textcat")
textcat(c(
  "This is an English sentence.",
  "Das ist ein deutscher Satz.",
  "Esta es una frase en espa~nol."))
[1] "english" "german" "spanish" 
寻找一个思念的角度 2024-12-22 00:44:05

The cldr package mentioned in a previous answer is no longer available on CRAN and may be difficult to install. However, Google's (Chromium's) cld libraries are now available in R through other dedicated packages, cld2 and cld3.

After testing with some thousands of tweets in multiple European languages, I can say that among available options, textcat is by far the least reliable. With textcat I also get quite frequently tweets wrongly detected as "middle_frisian", "rumantsch", "sanskrit", or other unusual languages. It may be relatively good with other types of texts, but I think textcat is pretty bad for tweets.

In general, cld2 still seems to be better than cld3. If you want a safe way to include only tweets in English, you can run both cld2 and cld3 and keep only tweets that are recognised as English by both.

Here's an example based on a Twitter search which usually returns results in many different languages, but always includes some tweets in English.

if (!require("pacman")) install.packages("pacman") # for package manangement
pacman::p_load("tidyverse") 
pacman::p_load("textcat")
pacman::p_load("cld2")
pacman::p_load("cld3")
pacman::p_load("rtweet")

punk <- rtweet::search_tweets(q = "punk") %>% mutate(textcat = textcat(x = text), cld2 = cld2::detect_language(text = text, plain_text = FALSE), cld3 = cld3::detect_language(text = text)) %>% select(text, textcat, cld2, cld3)
View(punk)

# Only English tweets
punk %>% filter(cld2 == "en" & cld3 == "en")

Finally, if this question is specifically about tweets, I should perhaps add the obvious: Twitter provides its own language detection for tweets via the API, and it seems to be pretty accurate (understandably less so for very short tweets). So if you run rtweet::search_tweets(q = "punk"), you will see that the resulting data.frame includes a "lang" column. If you get your tweets via the API, you can probably trust Twitter's own detection system more than the alternative solutions suggested above (which remain valid for other texts).
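
Building on that last point, here is a minimal sketch that keeps only tweets Twitter itself tagged as English; it assumes the "lang" column described above, and the object names are just illustrative:

library(dplyr)

# Assumed: search_tweets() returns a data frame with a "lang" column
punk_raw <- rtweet::search_tweets(q = "punk")
punk_en  <- filter(punk_raw, lang == "en")   # keep tweets Twitter tagged as English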

懷念過去 2024-12-22 00:44:05

Try http://cran.r-project.org/web/packages/cldr/ which brings Google Chrome's language detection to R.

#install from archive
url <- "http://cran.us.r-project.org/src/contrib/Archive/cldr/cldr_1.1.0.tar.gz"
pkgFile<-"cldr_1.1.0.tar.gz"
download.file(url = url, destfile = pkgFile)
install.packages(pkgs=pkgFile, type="source", repos=NULL)
unlink(pkgFile)
# or devtools::install_version("cldr",version="1.1.0")

#usage
library(cldr)
demo(cldr)
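
If I remember correctly, the archived package's main entry point is detectLanguage(); treat the function name and the column used below as assumptions and check demo(cldr) for the package's own examples:

#sketch (assumed API)
library(cldr)
res <- detectLanguage(c("This is an English sentence.",
                        "Das ist ein deutscher Satz."))
res$detectedLanguage
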
送舟行 2024-12-22 00:44:05

tl;dr: cld2 is the fastest by far (cld3 about 22x slower, textcat about 118x, handmade solution about 252x)

There's been a lot of discussion about accuracy here, which is understandable for tweets. But what about speed?

Here's a benchmark of cld2, cld3 and textcat.

I also threw in a naïve function I wrote that counts occurrences of stopwords in the text (it uses tm::stopwords).

I thought that for long texts I might not need a sophisticated algorithm, and that testing for many languages might be detrimental. In the end my approach turned out to be the slowest (most likely because the packaged approaches loop in C).

I leave it here to save time for anyone who has the same idea. I expect Tyler Rinker's Englishinator solution would be slow as well (it tests for only one language, but it has many more words to check, with similar code).

detect_from_sw <- function(text,candidates){
  sapply(strsplit(text,'[ [:punct:]]'),function(y)
    names(which.max(sapply(candidates,function(x) sum(tm::stopwords(x) %in% y))))
  )
}
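
A quick illustrative call of detect_from_sw() (hypothetical input; the candidate names must match stopword lists available through tm::stopwords(), and the exact output depends on those lists):

detect_from_sw(
  c("this is a short text and it is in english",
    "ceci est un petit texte et il est en francais"),
  candidates = c("english", "french", "german"))
# likely: "english" "french"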

The benchmark

data(reuters,package = "kernlab") # a corpus of articles in english
length(reuters)
# [1] 40
sapply(reuters,nchar)
# [1] 1311  800  511 2350  343  388 3705  604  254  239  632  607  867  240
# [15]  234  172  538  887 2500 1030  538 2681  338  402  563 2825 2800  947
# [29] 2156 2103 2283  604  632  602  642  892 1187  472 1829  367
text <- unlist(reuters)

microbenchmark::microbenchmark(
  textcat = textcat::textcat(text),
  cld2 = cld2::detect_language(text),
  cld3 = cld3::detect_language(text),
  detect_from_sw = detect_from_sw(text,c("english","french","german")),
  times=100)

# Unit: milliseconds
# expr                 min         lq      mean     median         uq         max neval
# textcat        212.37624 222.428824 230.73971 227.248649 232.488500  410.576901   100
# cld2             1.67860   1.824697   1.96115   1.955098   2.034787    2.715161   100
# cld3            42.76642  43.505048  44.07407  43.967939  44.579490   46.604164   100
# detect_from_sw 439.76812 444.873041 494.47524 450.551485 470.322047 2414.874973   100

Note on textcat's inaccuracy

I can't comment on the accuracy of cld2 vs cld3 (@giocomai claimed cld2 was better in his answer), but I can confirm that textcat seems very unreliable (as mentioned in several places on this page). All texts were classified correctly by all the methods above, except this one, which textcat classified as Spanish:

"Argentine crude oil production was \ndown 10.8 pct in January 1987 to
12.32 mln barrels, from 13.81 \nmln barrels in January 1986, Yacimientos Petroliferos Fiscales \nsaid. \n January 1987 natural
gas output totalled 1.15 billion cubic \nmetrers, 3.6 pct higher than
1.11 billion cubic metres produced \nin January 1986, Yacimientos Petroliferos Fiscales added. \n Reuter"

真心难拥有 2024-12-22 00:44:05

An approach in R would be to keep a text file of English words. I have several of these, including one from http://www.sil.org/linguistics/wordlists/english/. After sourcing the .txt file, you can use it to match against each tweet. Something like:

lapply(tweets, function(x) EnglishWordComparisonList %in% x)

You'd want some threshold percentage as a cutoff to determine whether it's English (I arbitrarily chose 0.06).

EnglishWordComparisonList<-as.vector(source(path to the list you downloaded above))

Englishinator<-function(tweet, threshold = .06) {
    TWTS <- which((EnglishWordComparisonList %in% tweet)/length(tweet) > threshold)
    tweet[TWTS]
    #or tweet[TWTS,] if the original tweets is a data frame
}

lapply(tweets, Englishinator)

I haven't actually used this because I use the English word list much differently in my research, but I think this would work.
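
As a rough sketch of the same idea (the file name is an assumed placeholder for whatever word list you downloaded, and the token-share cutoff is parameterized differently from the 0.06 above):

# Sketch only: score each tweet by the share of its tokens found in an
# English word list, and keep tweets above a cutoff.
english_words <- tolower(readLines("english_wordlist.txt"))   # assumed local path

is_probably_english <- function(tweet, wordlist, threshold = 0.5) {
  tokens <- tolower(unlist(strsplit(tweet, "[^[:alpha:]]+")))
  tokens <- tokens[nzchar(tokens)]
  if (length(tokens) == 0) return(FALSE)
  mean(tokens %in% wordlist) > threshold
}

english_tweets <- tweets[vapply(tweets, is_probably_english, logical(1),
                                wordlist = english_words)]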

如果没有你 2024-12-22 00:44:05

There is also a reasonably well-working R package called "franc". Though it is slower than the others, I had a better experience with it than with cld2 and especially cld3.
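
A minimal sketch with franc (it returns ISO 639-3 codes such as "eng"; the tweets vector is assumed for illustration):

library(franc)
franc("This is an English sentence.")   # an ISO 639-3 code, e.g. "eng"

# Keep only tweets detected as English (assumes a character vector tweets)
english_tweets <- tweets[vapply(tweets, franc, character(1)) == "eng"]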

孤蝉 2024-12-22 00:44:05

I'm not sure about R, but there are several libraries for other languages. You can find some of them collected here:

http://www.detectlanguage.com/

Also one recent interesting project:

http://blog.mikemccandless.com/2011/10/language-detection-with-googles-compact.html

This library was used to produce a map of Twitter languages:

http://www.flickr.com/photos/walkingsf/6277163176/in/photostream

If you can't find an R library, I suggest considering a remote language detector used through a web service.
