Extracting "((Adj|Noun)+|((Adj|Noun)(Noun-Prep)?)(Adj|Noun))Noun" from text (Justeson and Katz, 1995)

Posted 2024-10-10 17:38:49

Is it possible to extract ((Adj|Noun)+|((Adj|Noun)(Noun-Prep)?)(Adj|Noun))Noun proposed by Justeson and Katz (1995) using the R package openNLP?

That is, I would like to use this linguistic filtering to extract candidate noun phrases.

I do not fully understand what the pattern means.

Could you explain it, or show how to code this filtering rule in R?

Many thanks.

We could start from this sample code:

library("openNLP")  

acq <- "This paper describes a novel optical thread plug
gauge (OTPG) for internal thread inspection using machine
vision. The OTPG is composed of a rigid industrial
endoscope, a charge-coupled device camera, and a two
degree-of-freedom motion control unit. A sequence of
partial wall images of an internal thread are retrieved and
reconstructed into a 2D unwrapped image. Then, a digital
image processing and classification procedure is used to
normalize, segment, and determine the quality of the
internal thread." 

acqTag <- tagPOS(acq)     

acqTagSplit = strsplit(acqTag," ")

I was told to open a new question for this. The original question is here.

Comments (2)

简单气质女生网名 2024-10-17 17:38:49

Install the packages with:

install.packages("openNLP")
install.packages("openNLPmodels.en")

After that, you can run the code above. It POS-tags every word in the text and returns the original text with each word labelled as noun, verb and so on. In this example the result is:

> acqTag
[1] "This/DT paper/NN describes/VBZ a/DT novel/NN optical/JJ thread/NN plug/NN gauge/NN (OTPG)/NN for/IN internal/JJ thread/NN inspection/NN using/VBG machine/NN vision./NN The/DT OTPG/NNP is/VBZ composed/VBN of/IN a/DT rigid/JJ industrial/JJ endoscope,/NNS a/DT charge-coupled/JJ device/NN camera,/VBD and/CC a/DT two/CD degree-of-freedom/NN motion/NN control/NN unit./NN A/DT sequence/NN of/IN partial/JJ wall/NN images/NNS of/IN an/DT internal/JJ thread/NN are/VBP retrieved/VBN and/CC reconstructed/VBN into/IN a/DT 2D/JJ unwrapped/JJ image./NN Then,/IN a/DT digital/JJ image/NN processing/NN and/CC classification/NN procedure/NN is/VBZ used/VBN to/TO normalize,/JJ segment,/NN and/CC determine/VB the/DT quality/NN of/IN the/DT internal/JJ thread./NN"

Each word is followed by its POS tag, separated by a slash. To separate the tags from the words, first split the string into tokens, as you did in your example:

acqTagSplit = strsplit(acqTag," ")
acqTagSplit
    [[1]]
     [1] "This/DT"              "paper/NN"             "describes/VBZ"       
     [4] "a/DT"                 "novel/NN"             "optical/JJ"          
     [7] "thread/NN"            "plug/NN"              "gauge/NN"            
    [10] "(OTPG)/NN"            "for/IN"               "internal/JJ"         
    [13] "thread/NN"            "inspection/NN"        "using/VBG"           
    [16] "machine/NN"           "vision./NN"           "The/DT"              
    [19] "OTPG/NNP"             "is/VBZ"               "composed/VBN"        
    [22] "of/IN"                "a/DT"                 "rigid/JJ"            
    [25] "industrial/JJ"        "endoscope,/NNS"       "a/DT"                
    [28] "charge-coupled/JJ"    "device/NN"            "camera,/VBD"         
    [31] "and/CC"               "a/DT"                 "two/CD"              
    [34] "degree-of-freedom/NN" "motion/NN"            "control/NN"          
    [37] "unit./NN"             "A/DT"                 "sequence/NN"         
    [40] "of/IN"                "partial/JJ"           "wall/NN"             
    [43] "images/NNS"           "of/IN"                "an/DT"               
    [46] "internal/JJ"          "thread/NN"            "are/VBP"             
    [49] "retrieved/VBN"        "and/CC"               "reconstructed/VBN"   
    [52] "into/IN"              "a/DT"                 "2D/JJ"               
    [55] "unwrapped/JJ"         "image./NN"            "Then,/IN"            
    [58] "a/DT"                 "digital/JJ"           "image/NN"            
    [61] "processing/NN"        "and/CC"               "classification/NN"   
    [64] "procedure/NN"         "is/VBZ"               "used/VBN"            
    [67] "to/TO"                "normalize,/JJ"        "segment,/NN"         
    [70] "and/CC"               "determine/VB"         "the/DT"              
    [73] "quality/NN"           "of/IN"                "the/DT"              
    [76] "internal/JJ"          "thread./NN"          

Then split each word from its POS tag:

strsplit(acqTagSplit[[1]], "/")

You get a list with one element per token; each element holds the word first and its tag second. See:

str(strsplit(acqTagSplit[[1]], "/"))
List of 77
 $ : chr [1:2] "This" "DT"
 $ : chr [1:2] "paper" "NN"
 $ : chr [1:2] "describes" "VBZ"
 $ : chr [1:2] "a" "DT"
 $ : chr [1:2] "novel" "NN"
 $ : chr [1:2] "optical" "JJ"
 $ : chr [1:2] "thread" "NN"
 $ : chr [1:2] "plug" "NN"
 $ : chr [1:2] "gauge" "NN"
 $ : chr [1:2] "(OTPG)" "NN"
 $ : chr [1:2] "for" "IN"
 $ : chr [1:2] "internal" "JJ"
 $ : chr [1:2] "thread" "NN"
 $ : chr [1:2] "inspection" "NN"
 $ : chr [1:2] "using" "VBG"
 $ : chr [1:2] "machine" "NN"
 $ : chr [1:2] "vision." "NN"
 $ : chr [1:2] "The" "DT"
 $ : chr [1:2] "OTPG" "NNP"
 $ : chr [1:2] "is" "VBZ"
 $ : chr [1:2] "composed" "VBN"
 $ : chr [1:2] "of" "IN"
 $ : chr [1:2] "a" "DT"
 $ : chr [1:2] "rigid" "JJ"
 $ : chr [1:2] "industrial" "JJ"
 $ : chr [1:2] "endoscope," "NNS"
 $ : chr [1:2] "a" "DT"
 $ : chr [1:2] "charge-coupled" "JJ"
 $ : chr [1:2] "device" "NN"
 $ : chr [1:2] "camera," "VBD"
 $ : chr [1:2] "and" "CC"
 $ : chr [1:2] "a" "DT"
 $ : chr [1:2] "two" "CD"
 $ : chr [1:2] "degree-of-freedom" "NN"
 $ : chr [1:2] "motion" "NN"
 $ : chr [1:2] "control" "NN"
 $ : chr [1:2] "unit." "NN"
 $ : chr [1:2] "A" "DT"
 $ : chr [1:2] "sequence" "NN"
 $ : chr [1:2] "of" "IN"
 $ : chr [1:2] "partial" "JJ"
 $ : chr [1:2] "wall" "NN"
 $ : chr [1:2] "images" "NNS"
 $ : chr [1:2] "of" "IN"
 $ : chr [1:2] "an" "DT"
 $ : chr [1:2] "internal" "JJ"
 $ : chr [1:2] "thread" "NN"
 $ : chr [1:2] "are" "VBP"
 $ : chr [1:2] "retrieved" "VBN"
 $ : chr [1:2] "and" "CC"
 $ : chr [1:2] "reconstructed" "VBN"
 $ : chr [1:2] "into" "IN"
 $ : chr [1:2] "a" "DT"
 $ : chr [1:2] "2D" "JJ"
 $ : chr [1:2] "unwrapped" "JJ"
 $ : chr [1:2] "image." "NN"
 $ : chr [1:2] "Then," "IN"
 $ : chr [1:2] "a" "DT"
 $ : chr [1:2] "digital" "JJ"
 $ : chr [1:2] "image" "NN"
 $ : chr [1:2] "processing" "NN"
 $ : chr [1:2] "and" "CC"
 $ : chr [1:2] "classification" "NN"
 $ : chr [1:2] "procedure" "NN"
 $ : chr [1:2] "is" "VBZ"
 $ : chr [1:2] "used" "VBN"
 $ : chr [1:2] "to" "TO"
 $ : chr [1:2] "normalize," "JJ"
 $ : chr [1:2] "segment," "NN"
 $ : chr [1:2] "and" "CC"
 $ : chr [1:2] "determine" "VB"
 $ : chr [1:2] "the" "DT"
 $ : chr [1:2] "quality" "NN"
 $ : chr [1:2] "of" "IN"
 $ : chr [1:2] "the" "DT"
 $ : chr [1:2] "internal" "JJ"
 $ : chr [1:2] "thread." "NN"
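
The word/tag split above can also be collected into a data frame, which is easier to filter than a list. This is a minimal base-R sketch and not part of the original answer: the `tagged` vector is a hand-copied sample of the output above, and the names `tokens`, `word` and `tag` are only illustrative.

```r
# A few tagged tokens copied by hand from the output above (not the full text).
tagged <- c("This/DT", "paper/NN", "describes/VBZ", "a/DT", "novel/NN",
            "optical/JJ", "thread/NN", "plug/NN", "gauge/NN", "(OTPG)/NN")

# Split on the LAST slash only, so a word that itself contains "/" stays whole.
word <- sub("/[^/]*$", "", tagged)
tag  <- sub("^.*/", "", tagged)

tokens <- data.frame(word = word, tag = tag, stringsAsFactors = FALSE)

# Strip punctuation glued to the start or end of a word, e.g. "(OTPG)" -> "OTPG".
# Inner punctuation such as the hyphen in "charge-coupled" is left alone.
tokens$word <- gsub("^[[:punct:]]+|[[:punct:]]+$", "", tokens$word)

print(tokens)
```

With the tags in a column, candidate filtering becomes ordinary subsetting, for example `tokens[grepl("^NN", tokens$tag), ]` for all nouns.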

不如归去 2024-10-17 17:38:49

It seems you need to understand the regular expression ((Adj|Noun)+|((Adj|Noun)(Noun-Prep)?)(Adj|Noun))Noun, convert it to a DFA (deterministic finite automaton), and run that DFA in R.

Here a regular language is described by a regular expression. Unlike the common use of regular expressions in text processing, the "symbols" are not single characters but adjectives, nouns and noun prepositions. Once you understand the theory (automata theory), you will be able to implement the DFA easily in R (or whatever programming language you choose).
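
As a rough illustration of the DFA idea, here is a minimal base-R sketch. It rests on an assumption: the pattern as published appears to carry Kleene stars that are missing from the formula as quoted here, i.e. ((A|N)+|((A|N)*(NP)?)(A|N)*)N with A = adjective, N = noun, P = preposition. Under that reading the first alternative is already covered by the second, so the automaton below recognises [AN]*(NP)?[AN]*N. The state names and the `accepts` helper are made up for this sketch.

```r
# Transition table: rows are states, columns are input symbols.
# q0: no preposition yet, last symbol was not a noun (start state)
# q1: no preposition yet, last symbol was a noun        (accepting)
# q2: one "noun + preposition" seen, last symbol not a noun
# q3: one "noun + preposition" seen, last symbol a noun (accepting)
# dead: trap state, the input can no longer match
delta <- matrix(
  c("q0",   "q1",   "dead",
    "q0",   "q1",   "q2",
    "q2",   "q3",   "dead",
    "q2",   "q3",   "dead",
    "dead", "dead", "dead"),
  ncol = 3, byrow = TRUE,
  dimnames = list(c("q0", "q1", "q2", "q3", "dead"), c("A", "N", "P"))
)
accepting <- c("q1", "q3")

# Run the DFA over a character vector of symbols, one symbol per token.
accepts <- function(symbols) {
  state <- "q0"
  for (s in symbols) {
    # Anything other than A, N or P (e.g. a verb) leads to the trap state.
    state <- if (s %in% colnames(delta)) delta[state, s] else "dead"
  }
  state %in% accepting
}

accepts(c("A", "N"))            # adjective + noun
accepts(c("N", "P", "A", "N"))  # noun + preposition + adjective + noun
accepts(c("N", "P"))            # ends in a preposition, so rejected
```

Note that this automaton also accepts a single noun; if only multi-word candidates are wanted, a length check on the input can be added outside the DFA.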

The problem is not R; the problem is that you don't understand the theory.
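
For the extraction itself, one shortcut is to let R's own regex engine play the role of the automaton: encode each POS tag as one character and match the pattern against that string. The sketch below is only an illustration under stated assumptions. It assumes the published pattern has Kleene stars, ((A|N)+|((A|N)*(NP)?)(A|N)*)N, which collapses to [AN]*(NP)?[AN]*N; it assumes Penn Treebank tags, with JJ* mapped to A, NN* to N and IN to P; and the helper `to_symbol` plus the hand-copied `words` and `tags` (first sentence of the tagged output, punctuation stripped) are hypothetical names for this example.

```r
words <- c("This", "paper", "describes", "a", "novel", "optical", "thread",
           "plug", "gauge", "OTPG", "for", "internal", "thread", "inspection",
           "using", "machine", "vision")
tags  <- c("DT", "NN", "VBZ", "DT", "NN", "JJ", "NN",
           "NN", "NN", "NN", "IN", "JJ", "NN", "NN",
           "VBG", "NN", "NN")

# Map each tag to one character so that string position == token index.
to_symbol <- function(tag) {
  ifelse(grepl("^JJ", tag), "A",
  ifelse(grepl("^NN", tag), "N",
  ifelse(tag == "IN", "P", "X")))
}
symbols <- paste(to_symbol(tags), collapse = "")

# Assumed star form of the pattern, written over the one-letter symbols.
pattern <- "[AN]*(?:NP)?[AN]*N"
m      <- gregexpr(pattern, symbols, perl = TRUE)[[1]]
starts <- as.integer(m)
lens   <- attr(m, "match.length")

# Keep multi-word candidates only (a lone noun also matches the pattern).
keep <- which(starts > 0 & lens >= 2)
phrases <- vapply(keep, function(i) {
  paste(words[starts[i]:(starts[i] + lens[i] - 1)], collapse = " ")
}, character(1))

print(phrases)
# [1] "novel optical thread plug gauge OTPG for internal thread inspection"
# [2] "machine vision"
```

Two caveats: the result is only as good as the tagger (here "novel" was tagged NN rather than JJ), and running this over a whole text without splitting sentences first would let a match span a sentence boundary.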
