Using "fuzzy search" when cross-referencing data
My department handles the collection and display of data from a wide range of intra-company sources for use in data-mining/company dashboards.
One large challenge we have is cross-referencing location names across various departments. We are a rather large organization, and departments with different interests do their own reporting for any one location. In general there is a lot of discrepancy in the EXACT name that a location is given in the reporting across those departments. For instance, a location may be referred to as:
- The Fabulous Restaurant
- fabulous restaurant
- Fabulous F&B
- When the location goes through some renovation... Fabulous Cafe'
- or even Profit Center 12345ABC
So my question is: what best practices exist for reconciling these names in our own database and code? Let's assume for the moment that my department does not have the ability to unite the organization under a common hierarchy standard (which would be the optimal solution). At the moment our practice is to maintain ever-growing reference tables of location names, which are then mapped back to our own naming standard. This allows us to maintain historical consistency in our data.
Is it feasible/advisable to implement some kind of "fuzzy search" when cross-referencing locations? Something, for instance, that might ignore instances of words like "the", or treat "cafe'" and "restaurant" equally (based on some predefined logic).
I certainly don't think we would ever be able to algorithmically account for ALL of the random naming conventions we encounter, but is it enough to be able to account for some/most of them?
Comments (1)
The keyword is data integration (retagged). Fuzzy search is common in information retrieval and definitely useful here. But the examples you gave might be a bit too hard for automatic integration; you'll need user intervention and proper data cleaning.
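To make the "user intervention" part concrete, here is a minimal sketch of one way to combine automatic matching with a manual review queue. It uses `difflib.SequenceMatcher` from the Python standard library. The canonical names, the two thresholds, and the `triage` function are all assumptions made up for illustration; they would need tuning against real data.

```python
from difflib import SequenceMatcher
from typing import Optional, Tuple

# Hypothetical canonical names; in practice these would come from the reference table.
CANONICAL = ["Fabulous Restaurant", "Grand Ballroom", "Poolside Bar"]

# Assumed thresholds, not recommendations: tune them against real data.
AUTO_ACCEPT = 0.90
NEEDS_REVIEW = 0.55


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1], ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def triage(raw_name: str) -> Tuple[str, Optional[str], float]:
    """Return (decision, best canonical candidate, score) for one incoming name."""
    best = max(CANONICAL, key=lambda c: similarity(raw_name, c))
    score = similarity(raw_name, best)
    if score >= AUTO_ACCEPT:
        return ("auto", best, score)      # safe enough to map automatically
    if score >= NEEDS_REVIEW:
        return ("review", best, score)    # suggest the candidate to a human
    return ("unmatched", None, score)     # needs a manual reference-table entry


if __name__ == "__main__":
    for name in ["The Fabulous Restaurant", "Fabulous Cafe'", "Profit Center 12345ABC"]:
        print(name, "->", triage(name))
```

Names that land in the "review" or "unmatched" buckets can be confirmed by a person once and then added to the existing reference table, so the fuzzy step only has to deal with names that have not been seen before.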
I've successfully used fuzzy matching to re-import music playlists, even ones from the internet. Title and artist usually provide enough data for fairly reliable fuzzy matching against my music collection (at least if I have the song).
However, fuzzy matching will not be reliable if you essentially have just a single word, as in your "fabulous restaurant" example.
Good fuzzy matching will use stemming and have a notion of common words and synonyms, so "restaurant" and "cafe" will probably not be considered significant. The key point then is to have enough data: a single word will probably not be enough to identify a location.
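As a rough illustration of the common-word and synonym idea, the sketch below normalizes a name into its "significant" tokens and compares token sets. The word lists, the crude `stem` helper (a stand-in for a real stemmer), and the function names are assumptions for the example, not part of any particular library.

```python
import re
from typing import FrozenSet

# Assumed word lists for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "and"}
# Venue-type words treated as interchangeable, hence not distinguishing.
GENERIC_WORDS = {"restaurant", "cafe", "f&b", "bar", "grill"}


def stem(token: str) -> str:
    """Very crude plural stripping; a stand-in for a real stemmer."""
    if len(token) > 3 and token.endswith("s") and not token.endswith(("ss", "us")):
        return token[:-1]
    return token


def significant_tokens(name: str) -> FrozenSet[str]:
    """Lowercase, drop punctuation, stop words and generic venue words."""
    tokens = re.findall(r"[a-z0-9&]+", name.lower())
    stemmed = (stem(t) for t in tokens if t not in STOP_WORDS)
    return frozenset(t for t in stemmed if t not in GENERIC_WORDS)


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Share of tokens the two names have in common (0.0 if either is empty)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    reference = significant_tokens("The Fabulous Restaurant")
    for name in ["fabulous restaurant", "Fabulous F&B", "Fabulous Cafe'",
                 "Profit Center 12345ABC"]:
        print(name, "->", jaccard(reference, significant_tokens(name)))
```

With these lists, the first three variants all reduce to the single token "fabulous" and match each other, while "Profit Center 12345ABC" shares nothing with them. That is exactly the caveat above: once the insignificant words are gone, only one word is left to identify the location, so any other venue with "Fabulous" in its name would match too, and codes like the profit center still need an explicit reference-table entry.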