从文本中提取名词+名词或(形容词|名词)+名词

发布于 2024-10-10 02:23:50 字数 1586 浏览 2 评论 0原文

是否可以使用 R 包 openNLP 提取 noun+noun(adj|noun)+noun?也就是说,我想使用语言过滤来提取候选名词短语。你能指导我该怎么做吗? 非常感谢。


感谢您的回复。 代码如下:

library("openNLP")

acq <- "Gulf Applied Technologies Inc said it sold its subsidiaries engaged in
        pipeline and terminal operations for 12.2 mln dlrs. The company said 
        the sale is subject to certain post closing adjustments, 
        which it did not explain. Reuter." 

acqTag <- tagPOS(acq)    
acqTagSplit = strsplit(acqTag," ")
acqTagSplit

qq = 0
tag = 0

for (i in 1:length(acqTagSplit[[1]])){
    qq[i] <-strsplit(acqTagSplit[[1]][i],'/')
    tag[i] = qq[i][[1]][2]
}

index = 0

k = 0

for (i in 1:(length(acqTagSplit[[1]])-1)) {
    
    if ((tag[i] == "NN" && tag[i+1] == "NN") | 
        (tag[i] == "NNS" && tag[i+1] == "NNS") | 
        (tag[i] == "NNS" && tag[i+1] == "NN") | 
        (tag[i] == "NN" && tag[i+1] == "NNS") | 
        (tag[i] == "JJ" && tag[i+1] == "NN") | 
        (tag[i] == "JJ" && tag[i+1] == "NNS"))
    {      
            k = k +1
            index[k] = i
    }

}

index

读者可以参考 acqTagSplit 上的 index 来执行 noun+noun(adj|noun)+noun代码>提取。 (该代码不是最佳的,但它可以工作。如果您有任何想法,请告诉我。)

我还有一个额外的问题:

Justeson 和 Katz (1995) 提出了另一种语言过滤来提取候选名词短语:

((Adj|Noun)+|((Adj|Noun)*(Noun-Prep)?)(Adj|Noun)*)Noun

我无法理解其含义出色地。你能帮我解释一下吗?或者展示如何用R语言编写过滤规则? 非常感谢。

Is it possible to extract noun+noun or (adj|noun)+noun using the R package openNLP? That is, I would like to use linguistic filtering to extract candidate noun phrases. Could you direct me how to do?
Many thanks.


Thanks for the responses.
here is the code:

library("openNLP")

acq <- "Gulf Applied Technologies Inc said it sold its subsidiaries engaged in
        pipeline and terminal operations for 12.2 mln dlrs. The company said 
        the sale is subject to certain post closing adjustments, 
        which it did not explain. Reuter." 

acqTag <- tagPOS(acq)    
acqTagSplit = strsplit(acqTag," ")
acqTagSplit

qq = 0
tag = 0

for (i in 1:length(acqTagSplit[[1]])){
    qq[i] <-strsplit(acqTagSplit[[1]][i],'/')
    tag[i] = qq[i][[1]][2]
}

index = 0

k = 0

for (i in 1:(length(acqTagSplit[[1]])-1)) {
    
    if ((tag[i] == "NN" && tag[i+1] == "NN") | 
        (tag[i] == "NNS" && tag[i+1] == "NNS") | 
        (tag[i] == "NNS" && tag[i+1] == "NN") | 
        (tag[i] == "NN" && tag[i+1] == "NNS") | 
        (tag[i] == "JJ" && tag[i+1] == "NN") | 
        (tag[i] == "JJ" && tag[i+1] == "NNS"))
    {      
            k = k +1
            index[k] = i
    }

}

index

Reader can refer index on acqTagSplit to do noun+noun or (adj|noun)+noun extraction. (The code is not optimal, but it works. If you have any idea, please let me know.)

I have an additional problem:

Justeson and Katz (1995) proposed another linguistic filtering to extract candidate noun phrases:

((Adj|Noun)+|((Adj|Noun)*(Noun-Prep)?)(Adj|Noun)*)Noun

I cannot understand its meaning well. Could you do me a favor and explain it? Or show how to code the filtering rule in the R language?
Many thanks.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

虚拟世界 2024-10-17 02:23:50

我没有一个开放的控制台来测试这个,但是您尝试使用 tagPOS 进行标记,然后 grep 查找“名词”,“名词”或者粘贴(tagPOS(acq),崩溃=“。”)并搜索对于“名词.名词”。然后可以使用 gregexpr 来提取位置。

编辑:标记输出的格式与我记忆中的有点不同。我认为用“\n”替换空格后的 read.table()-ing 方法比我上面看到的要高效得多:

 acqdf <- read.table(textConnection(gsub(" ", "\n", acqTag)), sep="/", stringsAsFactors=FALSE)
 acqdf$nnadj <- grepl("NN|JJ", acqdf$V2)
 acqdf$nnadj 
# [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE
#[16] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE
#[31]  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE
 acqdf$nnadj[1:(nrow(acqdf)-1)] & acqdf$nnadj[2:nrow(acqdf)]
# [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
#[16] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE
#[31] FALSE FALSE FALSE FALSE FALSE FALSE
 acqdf$pair <- c(NA, acqdf$nnadj[1:(nrow(acqdf)-1)] & acqdf$nnadj[2:nrow(acqdf)])
 acqdf[1:7, ]

            V1  V2 nnadj  pair
1         Gulf NNP  TRUE    NA
2      Applied NNP  TRUE  TRUE
3 Technologies NNP  TRUE  TRUE
4          Inc NNP  TRUE  TRUE
5         said VBD FALSE FALSE
6           it PRP FALSE FALSE
7         sold VBD FALSE FALSE

I don't have an open console on which to test this, but have your tried to tokenize with tagPOS and then grep for "noun", "noun" or perhaps paste(tagPOS(acq), collapse=".") and search for "noun.noun". Then gregexpr could be used to extract positions.

EDIT: The format of the tagged output was a bit different than I remembered. I think this method of read.table()-ing after substituting "\n"s for spaces is much more efficient than what I see above:

 acqdf <- read.table(textConnection(gsub(" ", "\n", acqTag)), sep="/", stringsAsFactors=FALSE)
 acqdf$nnadj <- grepl("NN|JJ", acqdf$V2)
 acqdf$nnadj 
# [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE
#[16] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE
#[31]  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE
 acqdf$nnadj[1:(nrow(acqdf)-1)] & acqdf$nnadj[2:nrow(acqdf)]
# [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
#[16] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE
#[31] FALSE FALSE FALSE FALSE FALSE FALSE
 acqdf$pair <- c(NA, acqdf$nnadj[1:(nrow(acqdf)-1)] & acqdf$nnadj[2:nrow(acqdf)])
 acqdf[1:7, ]

            V1  V2 nnadj  pair
1         Gulf NNP  TRUE    NA
2      Applied NNP  TRUE  TRUE
3 Technologies NNP  TRUE  TRUE
4          Inc NNP  TRUE  TRUE
5         said VBD FALSE FALSE
6           it PRP FALSE FALSE
7         sold VBD FALSE FALSE
懒猫 2024-10-17 02:23:50

这是可能的。

编辑:

你明白了。使用词性标注器并按空格分割:ll <- strsplit(acqTag,' ')。从那里迭代输入列表的长度(ll 的长度),如下所示:
for (i in 1:37){qq <-strsplit(ll[[1]][i],'/')} 并获取您要查找的词性序列。

在按空格分割之后,它只是 R 中的列表处理。

It is possible.

EDIT:

You got it. Use the POS tagger and split on spaces: ll <- strsplit(acqTag,' '). From there iterate on the length of the input list (length of ll) like:
for (i in 1:37){qq <-strsplit(ll[[1]][i],'/')} and get the part of speech sequence you're looking for.

After splitting on spaces it is just list processing in R.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文