When indexing with the Python module recordlinkage, is there a blocking index for approximately equal numeric values?
I've got a SQLite database of music tracks, and I want to remove duplicates. I'd like to compare tracks based on title and duration. (I'll probably try to throw artists in later, but those live in a separate table, with multiple artists per track. For now, I've got a text field for the title and an integer field for the duration in seconds.) Duplicate tracks in this database tend to have similar titles (or at least similar prefixes) and durations within 5-10 seconds of each other.
I'm trying to learn recordlinkage to detect the duplicates. My first attempt was to make a full index, then use Smith-Waterman to compare titles and a simple linear numeric comparison for the duration. No big surprise: the database was way too big for a full index. I could do a block index on the duration to limit the pairs to tracks with identical durations, but the durations are often off by a few seconds. I could do sorted neighborhood, but if I'm understanding this correctly (a big "if"), setting the window to (for example) 10 means each track will only be paired with the 10 closest tracks in terms of duration. Those will pretty much always have identical durations, so I'd completely miss the ones that are close but not identical. It seems to me that an "approximate blocking index" or something like it would be a natural step, but I can't find any simple way to do that.
Can anyone help me out here?
Comments (1)
Okay, answering my own question here because I believe I've figured out the misunderstanding in my original question.
I was misunderstanding how sorted neighborhood indexing works. I was thinking that if you set the window to (for example) 3, it would sort all the records by the key and then pair each record with exactly 3 neighbor records (the record itself, the one above it, and the one below it). So if there were more than 5 records with the same key value, this would actually result in fewer pairs than a block index. But I'm now pretty sure that it actually groups the records by key value first, so a window of 3 pairs each record with all records that have the exact same key value, all records with the next-highest key value, and all records with the next-lowest key value.
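To make that reading concrete, here is a small pure-Python sketch of the behaviour as I understand it. It is not recordlinkage's code: `sorted_neighbourhood_pairs` is a hypothetical helper, the durations are made up, and it brute-forces every pair, so it only works as an illustration on tiny inputs.

```python
from itertools import combinations


def sorted_neighbourhood_pairs(keys, window=3):
    """Return candidate pairs (i, j) with i < j.

    Two records are paired when their key values sit at most
    (window - 1) // 2 positions apart in the sorted list of DISTINCT
    key values, so records sharing a key value are always paired.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    distinct = sorted(set(keys))
    rank = {value: position for position, value in enumerate(distinct)}
    reach = (window - 1) // 2
    return {
        (i, j)
        for i, j in combinations(range(len(keys)), 2)
        if abs(rank[keys[i]] - rank[keys[j]]) <= reach
    }


# Hypothetical durations in seconds; the list position is the record id.
durations = [200, 200, 200, 201, 203, 240]
pairs = sorted_neighbourhood_pairs(durations, window=3)
```

With these values, all three 200-second records are paired with each other and with the 201-second record, but not with the 203-second one, which is two distinct values away. The 203- and 240-second records are still paired, because 240 is simply the next distinct value.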
Now this doesn't get me exactly what I asked for, but it gets me close enough. If I set a window size of 11 (or 21), then I'm guaranteed to get all pairs whose durations are within 5 seconds (or 10 seconds) of each other: the window reaches 5 (or 10) distinct values on each side, and with integer data those values span at least that many seconds. If the data is sparse with respect to duration, the window reaches further and I'll get a few more pairs than that. (And this only works because it's integer data. If it were floating point numbers of arbitrary precision, then that would be a different matter.)
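For anyone who wants to try this, below is a rough sketch of how I'd wire it up with recordlinkage. The column names `title` and `duration`, the helper names, and the idea of passing the tracks in as a pandas DataFrame are my own assumptions, so adjust them to your schema.

```python
def window_for_tolerance(max_diff):
    """Odd window size that reaches max_diff distinct key values on each side."""
    return 2 * max_diff + 1


def candidate_pairs(tracks, max_diff=5):
    """Candidate duplicate pairs for a DataFrame with a 'duration' column (assumed name)."""
    # Third-party import kept local so the helper above has no dependencies.
    import recordlinkage

    indexer = recordlinkage.Index()
    indexer.sortedneighbourhood(
        left_on="duration", window=window_for_tolerance(max_diff)
    )
    # Passing a single DataFrame asks for deduplication pairs within it.
    return indexer.index(tracks)


def compare_tracks(tracks, pairs):
    """Score each candidate pair on title and duration similarity."""
    import recordlinkage

    compare = recordlinkage.Compare()
    # 'title' and 'duration' are assumed column names.
    compare.string("title", "title", method="smith_waterman", label="title")
    compare.numeric("duration", "duration", method="linear", label="duration")
    return compare.compute(pairs, tracks)
```

Sparse durations can still produce pairs that are further apart than the tolerance, so the numeric comparison (or a filter on the duration difference afterwards) still has to weed those out.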