How to quickly add many records with duplicates to the Extensible Storage Engine (ESE)
I need to add a few million data records to an ESE database. Among other values, each record has a unique string value. This value can be thought of as a key.
The catch is that the input set may contain multiple identical instances of the same record. Once everything is entered, I want only one record for each unique string.
My question is how to do this - how can I quickly filter out duplicates?
Right now I add each record only after searching for its key. If the entry already exists I skip it; if it is not in the database I add the record and move on. The big cost here is doing the search for each entry.
Any ideas on making this very fast? Is there any way to key the value such that adding a duplicate would fail?
You can create a unique index on the string column by passing JET_bitIndexUnique to JetCreateIndex.
Inserting a duplicate value will then fail with JET_errKeyDuplicate.
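As a rough sketch of what that could look like in C: the helper names (build_index_key, create_unique_index, insert_unless_duplicate) are made up for illustration, and the code assumes a session, table and text column already exist and that JET_UNICODE is not defined, so JetCreateIndex takes char strings. The ESE part only compiles on Windows (link against esent.lib) and has not been tested against a real database, so treat it as a starting point rather than a finished implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Build an ESE index key description for one ascending column.
 * The format is "+<column>" followed by two NUL bytes.
 * Returns the number of bytes written (both terminators included),
 * or 0 if the buffer is too small. Hypothetical helper. */
size_t build_index_key(const char *column, char *buf, size_t buf_size)
{
    assert(column != NULL && buf != NULL);
    size_t n = strlen(column);
    size_t needed = 1 + n + 2;          /* '+', name, double NUL */
    if (needed > buf_size)
        return 0;
    buf[0] = '+';
    memcpy(buf + 1, column, n);
    buf[1 + n] = '\0';
    buf[2 + n] = '\0';
    return needed;
}

#ifdef _WIN32
#include <windows.h>
#include <esent.h>

/* Create a unique index over `column` on an already opened table. */
JET_ERR create_unique_index(JET_SESID sesid, JET_TABLEID tableid,
                            const char *index_name, const char *column)
{
    char key[256];
    size_t cb = build_index_key(column, key, sizeof key);
    if (cb == 0)
        return JET_errInvalidParameter;
    return JetCreateIndex(sesid, tableid, index_name, JET_bitIndexUnique,
                          key, (unsigned long)cb, 100 /* density, percent */);
}

/* Try to insert one record without searching first.
 * Returns 1 if inserted, 0 if skipped as a duplicate, -1 on other errors. */
int insert_unless_duplicate(JET_SESID sesid, JET_TABLEID tableid,
                            JET_COLUMNID colid, const char *value)
{
    JET_ERR err = JetPrepareUpdate(sesid, tableid, JET_prepInsert);
    if (err < JET_errSuccess)
        return -1;
    err = JetSetColumn(sesid, tableid, colid, value,
                       (unsigned long)strlen(value), 0, NULL);
    if (err >= JET_errSuccess)
        err = JetUpdate(sesid, tableid, NULL, 0, NULL);
    if (err < JET_errSuccess) {
        /* Abandon the pending insert, then report why it failed. */
        JetPrepareUpdate(sesid, tableid, JET_prepCancel);
        return err == JET_errKeyDuplicate ? 0 : -1;
    }
    return 1;
}
#endif
```

With this in place the per-record search goes away: every record is inserted blindly, and a return value of 0 just means the string was already there.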
This approach is best if your strings are short. If your strings are long you should use a hash of the string to test for uniqueness.
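For the long-string case, one possible way to get a short, fixed-size value to index is a 64-bit hash. The sketch below uses FNV-1a, which is my choice for illustration and not something the answer prescribes; the function name fnv1a64 is likewise an assumption. Keep in mind that two different strings can share a hash, so if you put a unique index on the hash alone, a rejected insert is only a probable duplicate and you may want to compare the full strings before discarding a record.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 64-bit FNV-1a hash of a byte buffer.
 * The result can be stored in a fixed-width column and indexed
 * instead of the long string itself. */
uint64_t fnv1a64(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint64_t h = 0xcbf29ce484222325ULL;     /* FNV-1a 64-bit offset basis */

    assert(data != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];                          /* mix in the next byte */
        h *= 0x100000001b3ULL;              /* multiply by the FNV prime */
    }
    return h;
}
```

Equal strings always produce equal hashes, so duplicates in the input are still caught; only the rare collision between different strings needs the extra comparison.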