Extracting nouns, noun phrases, adjectives, and verbs from a corpus of text files using Visual C#
I am working on a project in which I have to extract nouns, adjectives, noun phrases, and verbs from text files (.doc format).
I have a corpus of around 75 such files. I searched the web and came across POS tagging in Python using nltk.
Since my project is in C# (using Visual Studio 2008), I need code that does this in C#.
I have tried the WordNet API and even SharpNLP, but as I am a newbie I found them tough to integrate with my project.
Can anybody please suggest simpler code to do this, using something like a vocabulary? Please help me.
Thanks.
Comments (2)
I worked in NLP (Natural Language Processing) for an industry leader for a while, and what you want to do is no trivial task. I know one of the creators of nltk and have used it myself; it's a high-quality open source tool and I'd recommend you use it (do you have a particularly compelling reason to use C#?).
POS tagging is typically implemented by training a model of language on hand-annotated data, then applying that model to new text, predicting the parts of speech and giving a confidence for each prediction.
nltk has tools that do this, and it also has some models (if I'm not mistaken). You'll find that most tools are written in C++, Java, and Python. If you don't know any of these languages, look at this as an excellent opportunity to learn something!
See Wikipedia, especially the links at the bottom, for more information and other software available for such tagging.
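To make the "train on hand-annotated data, then apply to new text" idea from this answer concrete, here is a minimal sketch in plain Python. It is not how nltk or any production tagger works internally; it is the simplest possible baseline (pick each word's most frequent tag), and the training sentences, the function names `train` and `tag`, and the `default_tag` parameter are all made up for illustration. The tags follow the Penn Treebank style (NN = noun, JJ = adjective, VB* = verb).

```python
from collections import Counter, defaultdict

# Tiny hand-annotated sample (hypothetical data, Penn Treebank-style tags).
TRAINING_DATA = [
    [("the", "DT"), ("quick", "JJ"), ("dog", "NN"), ("runs", "VBZ")],
    [("a", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("run", "NN"), ("was", "VBD"), ("long", "JJ")],
    [("dogs", "NNS"), ("run", "VBP"), ("fast", "RB")],
    [("they", "PRP"), ("run", "VBP"), ("home", "NN")],
]


def train(sentences):
    """Count how often each word carries each tag in the annotated data."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for word, pos in sentence:
            counts[word.lower()][pos] += 1
    return counts


def tag(model, words, default_tag="NN"):
    """Tag each word with its most frequent training tag plus a confidence.

    The confidence is the relative frequency of the chosen tag for that word.
    Words never seen in training get default_tag with confidence 0.0.
    """
    result = []
    for word in words:
        tag_counts = model.get(word.lower())
        if not tag_counts:
            result.append((word, default_tag, 0.0))
            continue
        best_tag, best_count = tag_counts.most_common(1)[0]
        confidence = best_count / sum(tag_counts.values())
        result.append((word, best_tag, confidence))
    return result


model = train(TRAINING_DATA)
tagged = tag(model, ["The", "dog", "run", "quickly"])
for word, pos, confidence in tagged:
    print(f"{word}\t{pos}\t{confidence:.2f}")
```

In the sample data, "run" is annotated once as a noun and twice as a verb, so the sketch tags it as a verb with a confidence of about 0.67. Real taggers also look at the surrounding words, which is why a baseline like this is only a starting point.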
Christopher is correct in his statement that NLP implementations are no picnic. However, I've recently looked into a viable solution using OpenNLP in a .NET project with a rudimentary PoS parser. In my example I am looking for noun phrases, but it shouldn't be too difficult to find other fragments in a text as well. I find the OpenNLP Tools Models for 1.5 to be sufficient for my purposes.
I realize this answer is woefully late for the questioner, but hopefully it will give others some inspiration for getting into this difficult field.
Extracting noun phrases with contextual relevance in .NET using OpenNLP
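Once a tagger has labeled each token, pulling out nouns, adjectives, verbs, and simple noun phrases is mostly bookkeeping. The sketch below is not the approach from the linked article and does not use OpenNLP; it is an assumed, simplified illustration in plain Python that takes already-tagged tokens (Penn Treebank-style tags) and treats a noun phrase as an optional determiner, any adjectives, and one or more nouns. The function names and the sample sentence are hypothetical.

```python
def words_by_category(tagged_tokens):
    """Group words into nouns, adjectives, and verbs by tag prefix."""
    prefixes = {"nouns": "NN", "adjectives": "JJ", "verbs": "VB"}
    groups = {name: [] for name in prefixes}
    for word, pos in tagged_tokens:
        for name, prefix in prefixes.items():
            if pos.startswith(prefix):
                groups[name].append(word)
    return groups


def extract_noun_phrases(tagged_tokens):
    """Collect runs shaped like: optional determiner, adjectives, nouns."""
    phrases = []
    current = []      # words of the candidate phrase so far
    has_noun = False  # whether the candidate already contains a noun

    def flush():
        # Keep the candidate only if it actually reached a noun.
        nonlocal current, has_noun
        if has_noun:
            phrases.append(" ".join(current))
        current, has_noun = [], False

    for word, pos in tagged_tokens:
        if pos.startswith("NN"):
            current.append(word)
            has_noun = True
        elif pos in ("DT", "JJ"):
            if has_noun:
                flush()       # previous phrase ended, a new one begins
            if pos == "DT":
                current = []  # a determiner always starts a fresh phrase
            current.append(word)
        else:
            flush()           # any other tag ends the current candidate
    flush()
    return phrases


# Hypothetical tagger output for one sentence.
sentence = [
    ("The", "DT"), ("quick", "JJ"), ("brown", "JJ"), ("fox", "NN"),
    ("jumps", "VBZ"), ("over", "IN"),
    ("the", "DT"), ("lazy", "JJ"), ("dog", "NN"),
]
groups = words_by_category(sentence)
noun_phrases = extract_noun_phrases(sentence)
print(groups)
print(noun_phrases)
```

The same two steps (tag, then scan the tags for a pattern) carry over to C# with whichever tagger you manage to integrate; only the pattern-matching part shown here is library-independent.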