Filtering information from large bodies of text
Is there a best practice, algorithm or software (open source with a permissive license required...) which can find information from bodies of text? I'm referring to:
- find all email addresses in a text
- find all mentions of cities
- find all mentions of states
- find all URLs
- find all mentions of telephone numbers
- find all mentions of ZIP codes
... with the ability to add more ...
I heard RapidMiner should be able to do text mining like this, but AGPL is not an acceptable license for my purpose.
Is there anything 'standard' to do this kind of analysis?
Comments (2)
Read about Named Entity Recognition (NER). You can try Apache OpenNLP or Apache UIMA, both of which are released under the, well, Apache License.
For entity types like these, you can use a rule-based NER tool such as gexp.
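
To give a rough idea of what the rule-based approach looks like, here is a minimal sketch using only Python's standard `re` module. It is not gexp's API. The pattern names, the `register_terms` helper, and the regular expressions themselves are assumptions made for illustration; they are deliberately simplified (US-style phone numbers and ZIP codes only) and will miss or over-match many real-world formats.

```python
import re

# Hypothetical, simplified patterns; real-world formats vary far more than this.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "url": re.compile(r"https?://[^\s<>\"']+"),
    "us_phone": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
    "us_zip": re.compile(r"\b\d{5}(?:-\d{4})?\b"),
}


def register_terms(name, terms):
    """Add a word-list (gazetteer) rule, e.g. for cities or states."""
    # Try longer terms first so "New York City" wins over "New York".
    ordered = sorted(terms, key=len, reverse=True)
    alternatives = "|".join(re.escape(term) for term in ordered)
    PATTERNS[name] = re.compile(r"\b(?:" + alternatives + r")\b")


def extract(text):
    """Return every match of every registered rule, keyed by rule name."""
    return {name: pattern.findall(text) for name, pattern in PATTERNS.items()}


if __name__ == "__main__":
    register_terms("city", ["Springfield"])
    register_terms("us_state", ["IL", "Illinois"])
    sample = (
        "Contact jane@example.com or call 555-123-4567. "
        "Office: Springfield, IL 62704. See https://example.com/about for details."
    )
    for name, matches in extract(sample).items():
        print(name, matches)
```

New entity types are added by putting another compiled pattern into `PATTERNS` or by calling `register_terms` with a word list. Plain word lists cannot tell the city "Springfield" from a surname, which is where the statistical NER models in OpenNLP tend to do better.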