What is the best method for location disambiguation of GeoNames data?

Posted on 2025-01-07 20:44:52

What is the best method to do location disambiguation for geonames data?

There are some scoring algorithms for GeoNames search, but they are not open source, and I'm not sure they are very sophisticated. (For example, for "soma, ca" the search returns Soma Lake in Canada, which doesn't even have a Wikipedia article, instead of the very popular SoMa neighborhood in San Francisco.)

There are also some works I found on Google Scholar, but they seem very shallow and similar to my own heuristics, such as scoring by something like log(population) + 1000*hasWikipedia(article) + isCity*100 + isCapital*10.
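A minimal sketch of that kind of additive heuristic, for illustration only. The `Place` record and its field names are hypothetical (not the GeoNames schema), the candidate values are made up, and reading the city and capital terms as weights of 100 and 10 is an assumption about the expression above.

```python
import math
from dataclasses import dataclass


@dataclass
class Place:
    # Hypothetical record; field names are illustrative, not the GeoNames schema.
    name: str
    population: int
    has_wikipedia: bool
    is_city: bool
    is_capital: bool


def heuristic_score(place: Place) -> float:
    """Additive score: log(population) + 1000*hasWikipedia + 100*isCity + 10*isCapital."""
    # max(..., 1) avoids log(0) for features with no recorded population.
    score = math.log(max(place.population, 1))
    score += 1000 * place.has_wikipedia  # assumed weight, taken from the post
    score += 100 * place.is_city         # assumed reading of the city term
    score += 10 * place.is_capital       # assumed reading of the capital term
    return score


def rank(candidates: list[Place]) -> list[Place]:
    """Return candidates ordered from most to least likely under the heuristic."""
    return sorted(candidates, key=heuristic_score, reverse=True)


# Made-up values for illustration; populations are left at 0 (unknown).
candidates = [
    Place("Soma Lake", population=0, has_wikipedia=False, is_city=False, is_capital=False),
    Place("SoMa, San Francisco", population=0, has_wikipedia=True, is_city=False, is_capital=False),
]
best = rank(candidates)[0]
print(best.name)
```

With these toy inputs the Wikipedia term dominates, which shows how sensitive such a score is to hand-picked weights.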

My domain is travel articles, so my scoring function should favor the most probable tourist places: cities and places of interest (Disneyland, the Colosseum, Big Ben).

Do you know of any important articles in this field, or of the algorithms used in production by Google Maps, Yahoo, Bing, or even GeoNames?

Comments (1)

我很坚强 2025-01-14 20:44:52

@yura, this isn't what you're looking for, but I don't think any clever algorithm will be able to consistently disambiguate whether queries like "soma ca" refer to Soma in San Fran or Soma Lake in Canada. The problem is not that your algorithm is not sophisticated enough; the problem is that there is simply not enough information in the query "soma ca".

I don't know how to express it clearly, but there is an information-theoretic issue here. It's like the way that random data can't be compressed losslessly: there's not enough information in the input to compute the desired output.

Even if a human were to interpret your queries manually, they would not necessarily understand that "soma ca" is supposed to mean Soma in SF. Maybe to you a 2-letter abbreviation like "ca" "naturally" refers to a US state rather than a foreign country, but there is nothing fundamentally "correct" about that choice, and it cannot be derived using pure logic. It's an arbitrary, domain-specific, ad-hoc rule, just like the ad-hoc log(population) heuristic which you referred to.
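To make the ambiguity concrete, here is a small sketch under stated assumptions. The two code tables hold real codes ("CA" is both the ISO 3166-1 code for Canada and the postal abbreviation for California), but the gazetteer rows and the helper names are hypothetical and do not reflect real GeoNames output.

```python
# Real codes: "CA" is the ISO 3166-1 alpha-2 code for Canada
# and the USPS abbreviation for California.
COUNTRY_CODES = {"ca": "Canada"}
US_STATE_CODES = {"ca": "California, United States"}

# Hypothetical gazetteer rows as (name, region); not real GeoNames output.
GAZETTEER = [
    ("Soma Lake", "Canada"),
    ("SoMa", "California, United States"),
]


def interpret_qualifier(token: str) -> list[str]:
    """Return every region the qualifier token could stand for."""
    token = token.lower()
    regions = []
    if token in COUNTRY_CODES:
        regions.append(COUNTRY_CODES[token])
    if token in US_STATE_CODES:
        regions.append(US_STATE_CODES[token])
    return regions


def candidates_for(query: str) -> list[tuple[str, str]]:
    """Match two-token 'name qualifier' queries against the toy gazetteer."""
    name, qualifier = query.lower().replace(",", " ").split()
    regions = interpret_qualifier(qualifier)
    return [
        (place, region)
        for place, region in GAZETTEER
        if name in place.lower() and region in regions
    ]


matches = candidates_for("soma ca")
print(matches)
```

Both rows survive the filter, so the query text alone cannot pick a winner; any tie-break has to come from outside the query.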

Some possible "solutions" (aside from designing a telepathic computer which can read users' minds):

  1. Provide users a list of possible matches for each query. Keep track of the ones they choose, and when other users later type the same query, order the results by popularity.
  2. Or, once you gather lots of data on the popularity of query results, you may even be able to mine the data with machine-learning algorithms, and derive better heuristics from it.
  3. Or, before putting the application into production use, you could first compile a body of fake queries, along with the results which you think your algorithm should yield for each such query. Then use your machine-learning algorithms on that.
  4. Compile a body of fake queries and desired responses, or get the data from the choices of real users, and use that data to benchmark the accuracy of your manually designed and coded ranking heuristics. Keep inventing new heuristics until you find one which achieves high accuracy on your test data set.
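
Options 1 and 4 could be sketched roughly as follows. This is only an illustration: `ClickLog`, `top1_accuracy`, the in-memory storage, and the sample queries are all hypothetical, and a real system would persist the counts and use a much larger labeled set.

```python
from collections import Counter, defaultdict


class ClickLog:
    """Counts which result users picked for each query (option 1)."""

    def __init__(self) -> None:
        self._choices: dict[str, Counter] = defaultdict(Counter)

    def record(self, query: str, chosen: str) -> None:
        """Remember that a user picked `chosen` for `query`."""
        self._choices[query.strip().lower()][chosen] += 1

    def rank(self, query: str, candidates: list[str]) -> list[str]:
        """Order candidates by how often earlier users chose them."""
        counts = self._choices.get(query.strip().lower(), Counter())
        # sorted() is stable, so candidates with equal counts keep their input order.
        return sorted(candidates, key=lambda c: counts[c], reverse=True)


def top1_accuracy(ranker, labeled_queries) -> float:
    """Share of labeled queries whose top-ranked candidate is the expected one (option 4)."""
    hits = sum(
        1
        for query, candidates, expected in labeled_queries
        if ranker(query, candidates)[0] == expected
    )
    return hits / len(labeled_queries)


# Made-up click data: three users chose SoMa, one chose the lake.
log = ClickLog()
for _ in range(3):
    log.record("soma ca", "SoMa, San Francisco")
log.record("soma ca", "Soma Lake, Canada")

candidates = ["Soma Lake, Canada", "SoMa, San Francisco"]
# Hand-labeled test set of (query, candidates, expected top result); made-up data.
test_set = [("soma ca", candidates, "SoMa, San Francisco")]

print(log.rank("soma ca", candidates))
print(top1_accuracy(log.rank, test_set))
```

Because `top1_accuracy` accepts any callable with the same `(query, candidates)` signature, a hand-written heuristic ranker can be scored on the same test set and compared against the click-based one.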