Extracting country names from a text string
I'm looking at writing a mashup app that will take submission titles from a subreddit and attempt to plot them on a map based on where they are likely to be relevant. I'd also like to add on things like Twitter later on.
What I'm having difficulty planning is how to detect the country a title is most likely to be about. My first guess is to keep a list of countries along with their matching variants (e.g. "English" matches "England") and check the text for occurrences of those items. However, this is probably going to be quite slow and will require me to list the possessive* name for each country.
I'm planning on doing this in Python (so as to learn to use it), so I'm wondering whether there is a) a library that does this (and that I can learn from) or b) a more obvious way to do it.
To give an idea of the types of input I'm working with, here are some samples and what I'm trying to get out of them:
- "Well they can't arrest all of us - Giving the middle finger to the British legal system (pic)"
  - Keyword: British (Great Britain)
- "Poll: Wikileaks Assange leading Time 'Person of the Year' - Assange, an Australian who has become a thorn in the side of the Pentagon with his releases of secret US military documents about the wars in Iraq and Afghanistan, had received 21,736 votes as of Friday."
  - Keywords: Afghanistan, Iraq, [Australian] (Afghanistan, Iraq, [Australia]) - Australia would be difficult to catch out as mainly irrelevant but this is acceptable for my purposes
- "Cyber attack on Nobel peace prize website launched. Stay classy, China."
  - Keyword: China (China)
- "A Jewish surgeon refuses to operate on a patient and walks out of the operating room after discovering a nazi tattoo on the patient's arm."
  - Keywords: none - acceptable for my purposes
* This is probably the wrong word to use
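For reference, the lookup-table idea described in the question can be sketched in a few lines of Python. This is only a minimal illustration, not an existing library: the `ALIASES` table and the `find_countries` function are hypothetical names, and the table holds just enough entries to cover the sample titles. Compiling all variants into a single regular expression means each title is scanned once, which may ease the speed concern, though that has not been measured here.

```python
import re

# Hypothetical lookup table (an assumption for illustration): each surface
# form, whether a country name or its adjective, maps to a canonical name.
# A real table would need far more entries than this.
ALIASES = {
    "afghanistan": "Afghanistan",
    "afghan": "Afghanistan",
    "australia": "Australia",
    "australian": "Australia",
    "britain": "Great Britain",
    "british": "Great Britain",
    "china": "China",
    "chinese": "China",
    "england": "England",
    "english": "England",
    "iraq": "Iraq",
    "iraqi": "Iraq",
}

# One compiled alternation of all variants, longest first, wrapped in word
# boundaries so that a variant never matches inside a longer word.
PATTERN = re.compile(
    r"\b("
    + "|".join(sorted(map(re.escape, ALIASES), key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def find_countries(title):
    """Return canonical country names in order of first appearance, no repeats."""
    found = []
    for match in PATTERN.finditer(title):
        country = ALIASES[match.group(1).lower()]
        if country not in found:
            found.append(country)
    return found


if __name__ == "__main__":
    print(find_countries("Cyber attack on Nobel peace prize website launched. Stay classy, China."))
```

A sketch like this only finds mentions; it makes no attempt to judge relevance, so the "Australian" in the Wikileaks title is reported alongside Iraq and Afghanistan.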
Comments (3)
You can look into the Yahoo! Placemaker API.
Use a FullText search index in MySQL. Then use AJAX calls to query against your database.
Please see if this answer may help: