Which NoSQL solution should I choose for a domain-name database?
I have a project that stores several million domain names in a database and runs lookups to check whether a given domain is present. That is the only operation I need: check whether a given value exists. No range queries, no additional information, nothing else.
The number of queries I make against the database is rather large, for example 100,000 per user session.
I get a new database once a day. It would be possible to work out which records were deleted and which were added, but I don't think that's worth it, so I import the database into a new table and point the script to the new name.
I'm looking for a solution that makes the whole thing faster, since I don't use any SQL features. Lookup speed and import time both matter to me.
My server can't hold this database in memory, not even half of it, so I think a NoSQL solution that works from the hard drive could help.
Can you suggest something?
Comments (3)
There are many options here. Berkeley DB certainly does the job and is probably one of the simplest solutions. Just as simple: store everything in memcached, which gives you the option of splitting the cached values across several machines if the query load or data size grows.
A much smaller and faster solution would be to use Berkeley DB with its key-value pair API. Berkeley DB is a database library that links into your application, so there is no client/server overhead and no separate server to install and manage. It is very straightforward: among its several APIs is a simple key-value (NoSQL) API with all of the basic data management routines you would expect to find in a much larger, more complex RDBMS (indexing, secondary indexes, foreign keys), but without the overhead of a SQL engine.
Disclaimer: I am the Product Manager for Berkeley DB, so I am a little biased. That said, it was designed to do exactly what you're asking for -- straightforward, fast, scalable key-value data management without unnecessary overhead.
In fact, there are many "database domain" type application services that use Berkeley DB as their primary data store. Most of the open source and/or commercial LDAP implementations use Berkeley DB (including OpenLDAP, Red Hat's LDAP, Sun Directory Server, etc.). Cisco, Juniper, AT&T, Alcatel, Mitel, Motorola and many others use Berkeley DB for their gateway, authentication, and configuration management systems. They use BDB because it does exactly what they need: it's very fast, scalable and reliable.
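To make the embedded key-value idea above concrete, here is a minimal sketch. It uses Python's standard-library `dbm` module as a stand-in for an on-disk key-value store; it is not the Berkeley DB API itself. The helper names `build_store` and `domain_exists`, the sample domains, and the dated file name are all hypothetical.

```python
import dbm
import os
import tempfile


def build_store(path, domains):
    """Create a fresh on-disk key-value file holding one key per domain."""
    # Flag "n" always creates a new, empty database, which mirrors the
    # daily full re-import described in the question.
    with dbm.open(path, "n") as db:
        for name in domains:
            # Only the key matters; the value is a one-byte placeholder.
            db[name.strip().lower().encode("utf-8")] = b"1"


def domain_exists(db, name):
    """Return True if the normalized domain is a key in the open store."""
    return name.strip().lower().encode("utf-8") in db


# Hypothetical daily file name; a real import would read the day's dump.
workdir = tempfile.mkdtemp()
store_path = os.path.join(workdir, "domains-2024-01-01")
build_store(store_path, ["example.com", "Example.org", "example.net"])

# Open read-only and answer membership queries straight from disk.
with dbm.open(store_path, "r") as db:
    print(domain_exists(db, "example.org"))   # True
    print(domain_exists(db, "missing.test"))  # False
```

Building each day's data into a new file and then pointing the lookup code at the new path matches the "import into a new table and switch names" workflow from the question, without any SQL layer in between.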
You could get by quite nicely with just a Bloom filter if you can accept a very small false positive rate (assuming you use a large enough filter).
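As a rough illustration of the Bloom filter idea, here is a small pure-Python sketch. The class name `BloomFilter`, its parameters, and the sample domains are assumptions made for this example. The filter is sized with the standard formulas m = -n ln(p) / (ln 2)^2 bits and k = (m/n) ln 2 hash functions.

```python
import hashlib
import math


class BloomFilter:
    """Membership sketch: may give false positives, never false negatives."""

    def __init__(self, expected_items, false_positive_rate):
        n, p = expected_items, false_positive_rate
        # Standard sizing: m bits and k hash functions for n items at rate p.
        self.num_bits = max(8, math.ceil(-n * math.log(p) / (math.log(2) ** 2)))
        self.num_hashes = max(1, round(self.num_bits / n * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive k positions from two 64-bit halves of a
        # single SHA-256 digest instead of computing k separate hashes.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd, never zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


bloom = BloomFilter(expected_items=1000, false_positive_rate=0.001)
for i in range(1000):
    bloom.add(f"site{i}.example")

print("site42.example" in bloom)  # True: added items are always found
```

By the same formulas, 10 million names at a 0.1% false-positive rate need about 14.4 bits per name, roughly 18 MB in total, so a filter like this can fit in memory even when the full data set cannot.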
On the other hand, you could certainly use Cassandra. It makes heavy use of Bloom filters, so asking for something that doesn't exist is quick, and you don't have to worry about false positives, because a filter hit is still checked against the stored data. It's designed to handle data sets that do not fit into memory, so performance degradation there is quite smooth.
Importing any amount of data should be quick: on a normal machine, Cassandra can handle about 15k writes per second. At that rate, loading three million names, for example, would take a little over three minutes.