Extracting country names from a text string

Posted 2024-10-01 17:08:49

I'm looking at writing a mashup app that will take submission titles from a subreddit and attempt to plot them on a map based on where they are likely to be relevant. I'd also like to add on things like Twitter later on.

What I'm having difficulty planning is how to detect the country most likely to be relevant from the title. My first guess is to keep a list of countries along with their matching variants (e.g. "English" matches "England") and check for occurrences of those items in the text. However, this is probably going to be quite slow and will require me to list the possessive* name for each country.

I'm planning on doing this in Python (so as to learn to use it), so I'm wondering whether there is a) a library that does this (and that I can learn from) or b) a more obvious way to do this?

To give an idea of the types of input I'm working with, here are some samples and what I'm trying to get out of them:

  • "Well they can't arrest all of us - Giving the middle finger to the British legal system (pic)"
    • Keyword: British (Great Britain)
  • "Poll: Wikileaks Assange leading Time 'Person of the Year' - Assange, an Australian who has become a thorn in the side of the Pentagon with his releases of secret US military documents about the wars in Iraq and Afghanistan, had received 21,736 votes as of Friday."
    • Keywords: Afghanistan, Iraq, [Australian] (Afghanistan, Iraq, [Australia]) - Australia would be difficult to rule out as mostly irrelevant, but this is acceptable for my purposes
  • "Cyber attack on Nobel peace prize website launched. Stay classy, China."
    • Keyword: China (China)
  • "A Jewish surgeon refuses to operate on a patient and walks out of the operating room after discovering a nazi tattoo on the patient's arm."
    • Keywords: none - acceptable for my purposes

* This is probably the wrong word to use
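
The list-based approach described in the question can be sketched roughly as below. This is only a minimal illustration, not a library: the alias table `COUNTRY_ALIASES` and the function name `find_countries` are made up for this example, and the table is hand-written and covers only the sample titles. Compiling all aliases into a single regular expression means each title is scanned once, which should ease the speed concern for a list of a few hundred countries.

```python
import re

# Hypothetical, hand-written alias table: each country maps to the names and
# adjectival forms that should count as a mention. A real table would be much
# longer and would need curating (e.g. "English" is also a language).
COUNTRY_ALIASES = {
    "Afghanistan": ["Afghanistan", "Afghan"],
    "Australia": ["Australia", "Australian"],
    "China": ["China", "Chinese"],
    "England": ["England", "English"],
    "Great Britain": ["Great Britain", "Britain", "British"],
    "Iraq": ["Iraq", "Iraqi"],
}

# Reverse lookup from lower-cased alias to canonical country name.
ALIAS_TO_COUNTRY = {
    alias.lower(): country
    for country, aliases in COUNTRY_ALIASES.items()
    for alias in aliases
}

# One alternation pattern for all aliases, longest first so that
# "Great Britain" is tried before "Britain". \b keeps matches on word
# boundaries, so "Iraq" is not found inside an unrelated longer word.
_PATTERN = re.compile(
    r"\b("
    + "|".join(
        re.escape(alias)
        for alias in sorted(ALIAS_TO_COUNTRY, key=len, reverse=True)
    )
    + r")\b",
    re.IGNORECASE,
)


def find_countries(title):
    """Return the sorted canonical names of the countries mentioned in title."""
    return sorted(
        {ALIAS_TO_COUNTRY[m.group(1).lower()] for m in _PATTERN.finditer(title)}
    )


if __name__ == "__main__":
    print(find_countries("Cyber attack on Nobel peace prize website launched. Stay classy, China."))
```

Plain matching like this cannot tell a relevant mention from an incidental one (the "Australian" case above), and it will produce false positives for ambiguous words, so it is probably best treated as a baseline to compare against a proper place-name extraction service or library.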


Comments (3)

像你 2024-10-08 17:08:49

You can look into the Yahoo! Placemaker API:

Placemaker provides geo-enrichment for the hugely significant proportion of Web content that is geographically relevant but not geographically discoverable. Provided with free-form text, the service identifies places mentioned in text, disambiguates those places, and returns unique identifiers (WOEIDs) for each, as well as information about how many times the place was found in the text, and where in the text it was found. The WOEIDs returned by the service can be passed to Yahoo!'s GeoPlanet™ API for further geographic enrichment and discovery.

初见 2024-10-08 17:08:49

Use a FullText search index in MySQL. Then use AJAX calls to query against your database.

灼痛 2024-10-08 17:08:49

Please see if this answer may help:

[The package geograpy3] allows you to extract place names from a URL or text, and add context to those names -- for example distinguishing between a country, region or city.
