Finding dictionary words in a text string
How would you go about parsing a string of free-form text to detect things like locations and names, based on a dictionary of locations and names? In my particular application there will be tens of thousands of entries in my dictionaries, if not more, so I'm pretty sure that simply running through them all is out of the question. Also, is there any way to add "fuzzy" matching, so that you can also detect substrings that are within x edits of a dictionary word? If I'm not mistaken, this falls within the field of natural language processing, and more specifically named entity recognition (NER); however, my attempt to find information about the algorithms and processes behind NER has come up empty. I'd prefer to use Python for this, as I'm most familiar with it, although I'm open to looking at other solutions.
Comments (1)
You might try downloading the Stanford Named Entity Recognizer:
http://nlp.stanford.edu/software/CRF-NER.shtml
If you don't want to use someone else's code and you want to do it yourself, I'd suggest taking a look at the algorithm in their associated paper, because the Conditional Random Field model that they use for this has become a fairly common approach to NER.
Without more details, I'm not sure exactly how to answer the second part of your question, about looking for substrings. You could modify the Stanford program, or you could use a part-of-speech tagger to mark proper nouns in the text. That wouldn't distinguish locations from names, but it would make it very simple to find words that are x words away from each proper noun.
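To make the part-of-speech idea a bit more concrete, here is a minimal sketch of the "words within x words of each proper noun" step. It assumes the text has already been tagged by some tagger that emits Penn Treebank tags (`NNP`/`NNPS` for proper nouns); the function name `proper_noun_contexts` and the hand-tagged sample sentence are made up for illustration and are not part of the Stanford tools.

```python
def proper_noun_contexts(tagged_tokens, x):
    """Return (proper_noun, neighbours) pairs for tokens tagged NNP or NNPS.

    tagged_tokens: list of (word, tag) pairs, assumed to use Penn Treebank tags.
    x: how many words to collect on each side of the proper noun.
    """
    contexts = []
    for i, (word, tag) in enumerate(tagged_tokens):
        if tag in ("NNP", "NNPS"):
            # Up to x words before and x words after the proper noun.
            left = [w for w, _ in tagged_tokens[max(0, i - x):i]]
            right = [w for w, _ in tagged_tokens[i + 1:i + 1 + x]]
            contexts.append((word, left + right))
    return contexts


# Hand-tagged sample (an assumption); a real tagger would produce this list.
tagged = [
    ("Alice", "NNP"),
    ("flew", "VBD"),
    ("to", "TO"),
    ("Paris", "NNP"),
    ("yesterday", "NN"),
]

result = proper_noun_contexts(tagged, 1)
print(result)
```

Each proper noun found this way could then be looked up in the location and name dictionaries, which keeps the number of dictionary comparisons proportional to the number of candidates rather than to the size of the text.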