快速、可扩展的字符串查找
我有一套 500 万根琴弦。这些当前存储在单列 MySQL 表中。我的应用程序必须执行查找并检查给定的字符串是否在集合中。这当然可以通过 HashSet(在 Java 中)来完成。但我不是构建自定义解决方案,而是想知道是否有任何现有的、广泛使用的、经过验证的解决方案可以做到这一点?这似乎是一个常见的场景。该解决方案应该是可扩展的(集合可能会增加到超过 500 万)、具有故障转移(因此可能是分布式的)并且在大量请求下表现良好。有什么建议吗?
更新:我的应用程序还可以查询以检查全局(500 万个)集中是否存在给定的字符串集。
I have a set of 5 million strings. These are currently stored in a single column MySQL table. My application has to perform lookups and check if a given string is in the set. This can of course be done with a HashSet (in Java). But instead of building a custom solution, I was wondering if there are any existing, widely used, proven solutions that do this? It seems like a common scenario. The solution should be scalable (the set might increase beyond 5 million), have failover (so probably distributed) and perform well under a huge number of requests. Any suggestions?
Update: My app can also query to check if a given set of strings is present in the global (the 5 million one) set.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
您可以尝试 Trie 或 Patricia-trie。第二个内存效率更高。另外在这里您可以找到两种数据结构[Trie、TreeSet]、内存数据库及其性能的比较。
You can try Trie or Patricia-trie.The second is more memory efficient.Also here you can find a comparison of 2 data structures [Trie,TreeSet],In-memory database and their performance.
尝试 memcached,一个高性能的分布式内存对象缓存系统。您使用键/值哈希进行查找。 Facebook 与许多其他高度可扩展的网站一样使用 memcached。需要存储更多字符串吗?只需向集群添加更多 memcached 实例即可。另外,您可以在 2 层缓存设置中使用,首先查询 memcached,如果缓存未命中,则查询完整数据库。
您是否考虑过在 MySQL 数据库中添加列索引?支持哈希、b 树和 r 树。
MySQL 还可以复制和集群以获得高可扩展性。
Try memcached, a high-performance, distributed memory object caching system. You lookup using key/value hashes. Facebook uses memcached as do many other highly scalable sites. Need to store more strings? Just add more memcached instances to the cluster. Plus you can use in a 2-tier caching setup where you first query memcached, if cache miss then query the full database.
Have you considered adding column indexing to your MySQL database? Hash, b-tree and r-tree are supported.
MySQL can also be replicated and clustered for high scalability.
虽然 Trie 可能是最好的解决方案,但对已排序的字符串列表进行二分搜索也应该在运行时表现良好。
While a Trie might be the best solution, binary search on the sorted list of strings should also perform well run time wise.