Perl and NLP: parsing names out of biographies
I'm pretty new to NLP in general, but getting really good at Perl, and I was wondering what kind of powerful NLP modules are out there. Basically, I have a file with a bunch of paragraphs, and some of them are people's biographies. So, first I need to look for a person's name, and that helps with the rest of the process later.
So I was roughly starting with something like this:
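    # @PP holds the paragraphs; @pieces holds patterns that disqualify a candidate.
    foreach my $PPid (0 .. $PPscalar) {
        my $paragraph = $PP[$PPid];    # single element, so $PP[...] rather than @PP[...]
        # Candidate name: "First Last" or "First M. Last" at the start of the
        # paragraph, followed by a verb or a comma.
        if ($paragraph =~ /^(\w+ \w\. \w+|\w+ \w+)( also|)( has served| served| worked| joined| currently serves| has| was| is|, )/) {
            my $possibleName = $1;
            my $badName = 0;
            foreach my $piece (@pieces) {
                if ($possibleName =~ /$piece/) {
                    $badName = 1;
                    last;              # one match is enough to reject the candidate
                }
            }
            push @namePile, $possibleName if $badName == 0;
        }
    }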
I anchor the match at the start because most of the names appear at the beginning of a paragraph, and then I look for keywords that denote action or possession. Right now, though, that picks up extra junk that is not a name. There has to be a module to do this, right?
Comments (3)
Extracting names from data is hard, and there are a variety of solutions for named entity extraction.
Net::Calais is by far the best bet for speed and accuracy. Go with the Stanford library if you need the underlying implementation to be open source.
Have you tried searching CPAN?
http://search.cpan.org/search?query=NLP&mode=all
I also tried searching for "Natural Language" and found the following that you might be interested in:
Lingua::EN::Tagger
Also, if you must roll your own NLP, check out Regexp::Grammars. It is the successor to Parse::RecDescent.
I don't know of any Perl modules which do processing of English in order to break it into parts of speech. I expect there are libraries out there which do that, in C or C++ or something, so if you don't find a good answer, maybe you can broaden your search.
One easy hack is to check for two words which are both capitalized:
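The snippet that originally followed is not preserved here. A minimal sketch of the idea, assuming a name is two adjacent capitalized ASCII words at the start of a paragraph, might look like the following; the helper name `leading_capitalized_pair` and the sample sentences are hypothetical, and the check will still accept non-names such as "New York".

```perl
use strict;
use warnings;

# Hypothetical helper: return the first two words of a paragraph if both
# start with an uppercase ASCII letter, otherwise return undef.
sub leading_capitalized_pair {
    my ($paragraph) = @_;
    return $paragraph =~ /^([A-Z][a-z]+ [A-Z][a-z]+)\b/ ? $1 : undef;
}

for my $text ('Jane Smith has served on the board since 2001.',
              'The company was founded in 1990.') {
    my $name = leading_capitalized_pair($text);
    print defined $name ? "candidate: $name\n" : "no candidate\n";
}
```

Requiring an uppercase first letter is what the `\w+ \w+` pattern in the question lacks, so a paragraph opening with "The company" no longer produces a candidate.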
or check for titles:
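Again the original snippet is missing. One way the title check might be sketched, assuming a short hand-picked list of English honorifics (the list, the helper name `names_after_titles`, and the sample sentence are all illustrative):

```perl
use strict;
use warnings;

# Hypothetical helper: return the one or two capitalized words that follow
# an honorific, for every honorific found in the text.
sub names_after_titles {
    my ($text) = @_;
    # Assumed list of honorifics; extend it for the data at hand.
    my $title = qr/(?:Mr|Mrs|Ms|Dr|Prof)\./;
    my @names;
    while ($text =~ /\b$title ([A-Z][a-z]+(?: [A-Z][a-z]+)?)/g) {
        push @names, $1;
    }
    return @names;
}

print "$_\n" for names_after_titles('Dr. Alan Grant joined the team led by Ms. Sattler.');
```

Unlike the start-of-paragraph check, this one scans the whole text, so it can also pick up names mentioned mid-sentence, at the cost of missing anyone introduced without a title.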