Matching approximate strings in a Core Data store

Posted 2024-07-20 06:19:29 · 841 words · 12 views · 0 comments

I have a small problem with the Core Data application I'm currently writing. I have two different models, contexts, and persistent stores. One is for my app data; the other is for a website with information relevant to me.

Most of the time, I match exactly one record from my app to another record from the other source. Sometimes, however, I have to fall back to fuzzy string matching to link the two records.
I'm trying to match song titles. My local title could be the (made-up) "The French Idealist is in your pensée" and the remote song title could be "01 - 10 - French idealist in in you're pensee, The (dub remix, feat. DJ Objective-C)".

I searched Stack Overflow, Google, and the Cocoa documentation, and I can't find any clear answer on how to do fuzzy matching in these cases. My strings can start with anything, contain a bunch of special characters, and usually end with random characters or characters that should be ignored.

Regular expressions won't do, nor will NSPredicates; Soundex doesn't work well with foreign names; and maybe the Levenshtein distance won't be enough (or will it?).

I'm looking for a title in a set of about a dozen potential matches, but I have to do this operation quite a lot. 100% accuracy is not the goal.

I was thinking of removing the ignored words, extracting the keywords (in this example, "french, idealist, pensée"), concatenating them, and then using the Levenshtein distance (the words in a song title should be in the same order).
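
The plan above could be sketched roughly as follows. This is only an illustration under assumptions: the helper names (`FoldedString`, `KeywordKey`, `LevenshteinDistance`) and the ignore list are made up for the example, and the distance is computed over UTF-16 code units, which is fine for folded Latin text but not for every script.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: strip diacritics and lowercase, so "Pensée" becomes "pensee".
static NSString *FoldedString(NSString *s) {
    return [[s stringByFoldingWithOptions:NSDiacriticInsensitiveSearch locale:nil] lowercaseString];
}

// Hypothetical helper: split on anything that is not a letter, drop ignored
// words, and join the remaining keywords in their original order.
static NSString *KeywordKey(NSString *title, NSSet *ignoredWords) {
    NSCharacterSet *separators = [[NSCharacterSet letterCharacterSet] invertedSet];
    NSMutableArray *kept = [NSMutableArray array];
    for (NSString *word in [FoldedString(title) componentsSeparatedByCharactersInSet:separators]) {
        if ([word length] > 0 && ![ignoredWords containsObject:word]) {
            [kept addObject:word];
        }
    }
    return [kept componentsJoinedByString:@" "];
}

// Classic two-row dynamic-programming Levenshtein distance.
static NSUInteger LevenshteinDistance(NSString *a, NSString *b) {
    NSUInteger n = [a length], m = [b length];
    if (n == 0) return m;
    if (m == 0) return n;
    NSUInteger *prev = calloc(m + 1, sizeof(NSUInteger));
    NSUInteger *curr = calloc(m + 1, sizeof(NSUInteger));
    for (NSUInteger j = 0; j <= m; j++) prev[j] = j;
    for (NSUInteger i = 1; i <= n; i++) {
        curr[0] = i;
        unichar ca = [a characterAtIndex:i - 1];
        for (NSUInteger j = 1; j <= m; j++) {
            NSUInteger cost = (ca == [b characterAtIndex:j - 1]) ? 0 : 1;
            NSUInteger best = prev[j - 1] + cost;                 // substitution or match
            if (prev[j] + 1 < best) best = prev[j] + 1;           // delete a character of a
            if (curr[j - 1] + 1 < best) best = curr[j - 1] + 1;   // insert a character of b
            curr[j] = best;
        }
        NSUInteger *tmp = prev; prev = curr; curr = tmp;
    }
    NSUInteger result = prev[m];
    free(prev);
    free(curr);
    return result;
}

int main(void) {
    @autoreleasepool {
        // The ignore list is an assumption; a real one would come from the data.
        NSSet *ignored = [NSSet setWithArray:@[@"the", @"is", @"in", @"your", @"you", @"re",
                                               @"dub", @"remix", @"feat"]];
        NSString *local = KeywordKey(@"The French Idealist is in your pensée", ignored);
        NSString *remote = KeywordKey(@"01 - 10 - French idealist in in you're pensee, The (dub remix, feat. DJ Objective-C)", ignored);
        NSLog(@"local key: %@, remote key: %@, distance: %lu",
              local, remote, (unsigned long)LevenshteinDistance(local, remote));
    }
    return 0;
}
```

Words the ignore list does not cover (a featured artist, for instance) still add to the distance, so any acceptance threshold would need tuning against real titles.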

In my special case, would it work? What is the industry standard for this problem? (I can't be the only one in the world who wants to match slightly different song names.) Can Core Data, Cocoa, or Objective-C help me?

Thanks a lot.


Comments (3)

千寻… 2024-07-27 06:19:29

You want your search to be diacritic-insensitive, so that the 'é' in pensée matches the 'e' in pensee. You get this by adding [d] after the comparison operator. Like so:

    // contains matches a substring anywhere in the title;
    // like would only do so if the pattern carried * wildcards.
    NSPredicate *predicate = [NSPredicate predicateWithFormat:@"(songTitle contains[cd] %@)", yourSongSubstring];

The 'c' in [cd] is for case insensitivity.

Since your words could appear in any order in the string you are searching, you could tokenize your search string ([... componentsSeparatedByString:@" "]) and then create a predicate like

    NSPredicate *predicate = [NSPredicate predicateWithFormat:@"(songTitle contains[cd] %@) and (songTitle contains[cd] %@)", songToken1, songToken2];

That syntax to combine predicates above may be off, going from memory.
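
For an arbitrary number of tokens, the same idea can be sketched with `NSCompoundPredicate` instead of a hand-written format string. The function name `PredicateForTitleTokens` is made up for this example, and `songTitle` is assumed to be the attribute name, as in the snippets above. `CONTAINS[cd]` is used so that each token can match anywhere in the title; `LIKE` would need `*` wildcards around the token to do the same.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: one CONTAINS[cd] clause per whitespace-separated token,
// all joined with AND, so the tokens may appear in any order in songTitle.
static NSPredicate *PredicateForTitleTokens(NSString *searchString) {
    NSMutableArray *subpredicates = [NSMutableArray array];
    for (NSString *token in [searchString componentsSeparatedByString:@" "]) {
        if ([token length] == 0) continue; // skip empty tokens produced by repeated spaces
        [subpredicates addObject:
            [NSPredicate predicateWithFormat:@"songTitle CONTAINS[cd] %@", token]];
    }
    return [NSCompoundPredicate andPredicateWithSubpredicates:subpredicates];
}

int main(void) {
    @autoreleasepool {
        NSPredicate *predicate = PredicateForTitleTokens(@"french pensee");
        // A dictionary stands in for a managed object here; both answer to key-value coding.
        NSDictionary *track = @{@"songTitle": @"01 - 10 - French Idealist in your pensée (dub remix)"};
        NSLog(@"matches: %d", [predicate evaluateWithObject:track]);
    }
    return 0;
}
```

The same predicate can be set on an `NSFetchRequest`; this only narrows the candidates, it does not rank them.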

故笙诉离歌 2024-07-27 06:19:29

I believe the tool you want to use here is SearchKit. I say that as if I've just made your job easy... I haven't, but it should have the tools you need to be successful here. LNC is still offering their SearchKit Podcast for free (very nice).

Each track would be a document in this case, and you'd need to come up with a good way to index them with an identifier that can be used to find them. You can then load them up with metadata, and search them. Perhaps putting the title "in" the document would be helpful here to facilitate the use of Similarity Searching (kSKSearchOptionFindSimilar). That may or may not work really well.

The question you've asked is a good one, but there is certainly no industry standard for it because anyone who solves this problem well (i.e. every major search engine) keeps their algorithms very secret. This is a hard problem; no one is quite ready to give away their answer.

小伙你站住 2024-07-27 06:19:29

Consider q-grams, which are substrings of length q (Gravano et al., 2001).

You could, for two strings s1 and s2, determine for each q-gram of s1 the corresponding q-gram of s2 with smallest edit distance. Then add all those distances and you end up with a metric which is very robust to permutation of words and extra characters.

Generally, q should be adapted to your problem domain (experiment with q = 3, 4, 5...).
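
A rough sketch of this measure, under some assumptions: the function names are made up for the example, q-grams are taken over UTF-16 code units (so it is best applied to text that has already been case- and diacritic-folded), and a string shorter than q is treated as a single gram. Note that the sum as described is not symmetric: it asks how well each q-gram of s1 is covered by s2.

```objc
#import <Foundation/Foundation.h>

// Two-row Levenshtein distance, used to compare individual q-grams.
static NSUInteger QGramEditDistance(NSString *a, NSString *b) {
    NSUInteger n = [a length], m = [b length];
    if (n == 0) return m;
    if (m == 0) return n;
    NSUInteger *prev = calloc(m + 1, sizeof(NSUInteger));
    NSUInteger *curr = calloc(m + 1, sizeof(NSUInteger));
    for (NSUInteger j = 0; j <= m; j++) prev[j] = j;
    for (NSUInteger i = 1; i <= n; i++) {
        curr[0] = i;
        for (NSUInteger j = 1; j <= m; j++) {
            NSUInteger cost = ([a characterAtIndex:i - 1] == [b characterAtIndex:j - 1]) ? 0 : 1;
            NSUInteger best = prev[j - 1] + cost;
            if (prev[j] + 1 < best) best = prev[j] + 1;
            if (curr[j - 1] + 1 < best) best = curr[j - 1] + 1;
            curr[j] = best;
        }
        NSUInteger *tmp = prev; prev = curr; curr = tmp;
    }
    NSUInteger result = prev[m];
    free(prev);
    free(curr);
    return result;
}

// All overlapping substrings of length q; a shorter string yields itself as one gram.
static NSArray *QGrams(NSString *s, NSUInteger q) {
    NSMutableArray *grams = [NSMutableArray array];
    if (q == 0 || [s length] < q) {
        if ([s length] > 0) [grams addObject:s];
        return grams;
    }
    for (NSUInteger i = 0; i + q <= [s length]; i++) {
        [grams addObject:[s substringWithRange:NSMakeRange(i, q)]];
    }
    return grams;
}

// For each q-gram of s1, find the closest q-gram of s2 and sum those distances.
static NSUInteger QGramDistance(NSString *s1, NSString *s2, NSUInteger q) {
    NSArray *grams2 = QGrams(s2, q);
    NSUInteger total = 0;
    for (NSString *g1 in QGrams(s1, q)) {
        NSUInteger best = [g1 length]; // cost when s2 offers no q-gram at all
        for (NSString *g2 in grams2) {
            NSUInteger d = QGramEditDistance(g1, g2);
            if (d < best) best = d;
            if (best == 0) break; // cannot do better than an exact match
        }
        total += best;
    }
    return total;
}

int main(void) {
    @autoreleasepool {
        NSLog(@"%lu", (unsigned long)QGramDistance(@"french idealist pensee",
                                                   @"01 - 10 - french idealist in in you're pensee, the", 3));
    }
    return 0;
}
```

Because extra characters in s2 cost nothing here, the local title is the natural choice for s1 and the noisier remote title for s2; dividing the sum by the number of q-grams in s1 would make scores comparable across titles of different lengths.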
