我希望将一些信息分发到不同的机器,以便高效且极快地访问,而无需任何网络开销。数据存在于关系模式中,需要在实体之间的关系上“加入”,但根本不需要写入数据库(它将离线生成)。
我对 SQLite 能够提供性能很有信心,但 RDMBS 似乎在根本上不适合:由于索引查找的成本,连接非常昂贵,并且在我的只读上下文中,这是不必要的开销,实体可以在其中使用以文件偏移量的形式存储彼此的直接引用。这样,索引查找就切换为文件查找。
我在这里有什么选择?数据库似乎并没有真正描述我正在寻找的内容。我知道 Neo4j,但我无法将 Java 嵌入到我的应用程序中。
蒂亚!
编辑,回答评论:
- 数据大小将达到 1GB,并且我使用 PHP,因此将数据保留在内存中并不是真正的选择。我将依靠操作系统缓冲区缓存来避免不断访问磁盘。
- 示例是一个具有 15 个混合类型字段的产品表,以及一个列出具有特定品牌的产品的查询,并连接到类别表。
- 解决方案必须是某种平面文件。我想知道是否已经有一些软件可以满足我的需求。
@Mark Wilkins:
性能问题是经过衡量的。本质上,在我的情况下,用对 SQLite 的 5 毫秒 CPU 绑定调用替换对 Memcache 的 2 毫秒 IO 绑定查询是不可接受的...例如,类别表有 500 条记录,包含父类别和子类别。以下查询大约需要 8 毫秒,没有磁盘 IO:在 b.id = a.parent_id 上选择 1 FROM 类别 a INNER JOIN 类别 B。一些更简单的无连接查询非常快。
I'm looking to distribute some information to different machines for efficient and extremely fast access without any network overhead. The data exists in a relational schema, and it is a requirement to "join" on relations between entities, but it is not a requirement to write to the database at all (it will be generated offline).
I had alot of confidence that SQLite would deliver on performance, but RDMBS seems to be unsuitable at a fundamental level: joins are very expensive due to cost of index lookups, and in my read-only context, are an unnecessary overhead, where entities could store direct references to each other in the form of file offsets. In this way, an index lookup is switched for a file seek.
What are my options here? Database doesn't really seem to describe what I'm looking for. I'm aware of Neo4j, but I can't embed Java in my app.
TIA!
Edit, to answer the comments:
- The data will be up to 1gb in size, and I'm using PHP so keeping the data in memory is not really an option. I will rely on the OS buffer cache to avoid continually going to disk.
- Example would be a Product table with 15 fields of mix type, and a query to list products with a certain make, joining on a Category table.
- The solution will have to be some kind of flat file. I'm wondering if there already exists some software that meets my needs.
@Mark Wilkins:
The performance problem is measured. Essentially, it is unacceptable in my situation to replace a 2ms IO bound query to Memcache with an 5ms CPU bound call to SQLite... For example, the categories table has 500 records, containing parent and child categories. The following query takes ~8ms, with no disk IO: SELECT 1 FROM categories a INNER JOIN categories B on b.id = a.parent_id. Some simpler, join-less queries are very fast.
发布评论
评论(2)
我可能不完全清楚您的目标以及您需要的查询类型。但是关于存储文件偏移量到其他数据的部分似乎是一个非常脆弱的解决方案,很难维护和调试。可能有一些工具可以帮助解决这个问题,但我怀疑你最终会自己编写大部分内容。如果其他人稍后必须来调试并找出本地文件格式,那么工作量将会更大。
然而,我的第一个想法是想知道所描述的性能问题是此时估计的还是实际测量的。您是否使用关系格式的数据运行测试来看看它的实际速度有多快?确实,联接几乎总是涉及更多文件读取(按照您提到的方式进行二分搜索,然后获取关联的记录信息,然后查找该记录)。起初,这可能需要 4、5 或更多磁盘操作。但在类别表(来自OP)中,如果经常被命中,它最终可能会被缓存。这完全是我的猜测,但在很多情况下类别的数量相对较少。如果是这种情况,整个类别表及其索引可能会被操作系统缓存在内存中,从而导致非常快速的连接。
如果性能确实是一个真正的问题,另一种可能性可能是对数据进行非规范化< /a>.在类别示例中,只需复制类别值/名称并将其与每个产品记录一起存储。数据库大小将因此增加,但您仍然可以使用嵌入式数据库(有多种可能性< /a>)。如果明智地完成,它仍然可以得到相当好的维护,并提供通过一次查找/查找和一次读取来读取完整对象的能力。
I may not be completely clear on your goals as to the types of queries you are needing. But the part about storing file offsets to other data seems like it would be a very brittle solution that is hard to maintain and debug. There might be some tool that would help with it, but my suspicion is that you would end up writing most of it yourself. If someone else had to come along later and debug and figure out a homegrown file format, it would be more work.
However, my first thought is to wonder if the described performance problem is estimated at this point or actually measured. Have you run the tests with the data in a relational format to see how fast it actually is? It is true that a join will almost always involve more file reads (do the binary search as you mentioned and then get the associated record information and then lookup that record). This could take 4 or 5 or more disk operations ... at first. But in the categories table (from the OP), it could end up cached if it is commonly hit. This is a complete guess on my part, but in many situations the number of categories is relatively small. If that is the case here, the entire category table and its index may stay cached in memory by the OS and thus result in very fast joins.
If the performance is indeed a real problem, another possibility might be to denormalize the data. In the categories example, just duplicate the category value/name and store it with each product record. The database size will grow as a result, but you could still use an embedded database (there are a number of possibilities). If done judiciously, it could still be maintained reasonably well and provide the ability to read full object with one lookup/seek and one read.
一般来说,您首先可以做的最快的事情可能是对数据进行非规范化,从而避免 JOIN 和其他多表查找。
使用 SQLite,您当然可以自定义各种内容并根据您的需求进行定制。例如,如果您仅通过一个线程访问,则禁用所有互斥、增加内存缓存大小、自定义索引(包括删除许多索引)、自定义构建以禁用不必要的元数据、调试等。
请看一下以下内容:
这当然是假设您需要一个数据库。
In general probably the fastest thing you can do at first is to denormalize your data thus avoiding JOINs and other mutli-table lookups.
Using SQLite you can certainly customize all sorts of things and tailor them to your needs. For example, disable all mutexing if you're only accessing via one thread, up the memory cache size, customize indexes (including getting rid of many), custom build to disable unnecessary meta data, debugging, etc.
Take a look at the following:
This is all of course assuming a database is what you need.