Can I use Named Entity Recognition to identify intranet page content?
I am new to Natural Language Processing and I want to learn more by building a simple project. NLTK was recommended to me as a popular NLP library, so I will use it in my project.
Here is what I would like to do:
- I want to scan our company's intranet pages; approximately 3K pages
- I would like to parse the content of these pages and categorize it by criteria such as HR, Engineering, Corporate Pages, etc.
From what I have read so far, I can do this with Named Entity Recognition. I can describe entities for each category of pages, train the NLTK solution and run each page through to determine the category.
Is this the right approach? I appreciate any direction and ideas...
Thanks
Comments (1)
It looks like you want to do text/document classification, which is not quite the same as Named Entity Recognition, where the goal is to recognize any named entities (proper names, places, institutions, etc.) in text. However, proper names can be very good features when doing text classification in a limited domain. For example, a page containing the name of the head engineer is likely to belong in Engineering.
The NLTK book has a chapter on basic text classification.
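As a rough illustration of that kind of classifier, here is a minimal sketch using NLTK's `NaiveBayesClassifier` with bag-of-words features. The sample pages, the category labels, and the helper names `page_features` and `train_classifier` are made up for the example; real training data would be a hand-labeled subset of the intranet pages, and the tokenization here is deliberately crude.

```python
import re

import nltk

# Hypothetical hand-labeled pages; in practice these would come from the crawl.
TRAIN_PAGES = [
    ("Vacation policy, benefits enrollment and payroll dates", "HR"),
    ("Send benefits and payroll questions to the recruiting team", "HR"),
    ("Build server status, code review guidelines and deployment checklist", "Engineering"),
    ("API design notes, deployment pipeline and code review rotation", "Engineering"),
    ("Quarterly earnings, press releases and company mission statement", "Corporate"),
    ("Press contact, board members and company mission", "Corporate"),
]


def page_features(text):
    """Bag-of-words features: one boolean entry per lowercase word on the page."""
    words = re.findall(r"[a-z]+", text.lower())
    return {f"contains({word})": True for word in words}


def train_classifier(labeled_pages):
    """Train a Naive Bayes classifier from (text, label) pairs."""
    train_set = [(page_features(text), label) for text, label in labeled_pages]
    return nltk.NaiveBayesClassifier.train(train_set)


classifier = train_classifier(TRAIN_PAGES)

# Classify a page that was not part of the training data.
new_page = "Deployment checklist and code review notes"
print(classifier.classify(page_features(new_page)))
```

With word features like these, a proper name such as the head engineer's simply becomes one more feature, which is how named entities can help without running a separate NER step. With only a handful of training pages the predictions are not reliable; the point is the shape of the pipeline.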