Indexing a schema-less database with user-defined schemas?

Posted on 2024-10-18 14:19:56

One of the most essential features of any database is query speed. We store data away and want quick access to data that matches our criteria. However, of late, schema-less databases have become popular. It's one thing if we have a schema-less database but there actually is an inferred (in-the-head/in-the-app) schema; it just hasn't been declared formally by the database.

On the other hand, let's say we truly need an open database where several users have their own schemas for their own individual problem areas. A user would define his own "domain". That domain (a database on an RDBMS server) would have its types (tables in an RDBMS) and those types would have their own properties (columns in an RDBMS). How do I create compound indexes to pull specific objects/documents/records (what have you) from a given domain? My query should select one or more domains (an IN clause), just one topic type (e.g. a CalendarEvent), against certain columns (start_date >= today, start_date <= today + 1 week, open_for_registration = true, calendar_name = 'Public'). In a database with a fixed schema (implied even if not declared), this is simple: you create a compound index against the columns.
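For reference, the fixed-schema case described above can be sketched as follows. This is only an illustration: it uses Python's built-in sqlite3 module as a stand-in RDBMS, and the table name `calendar_event`, the index name `idx_event_lookup`, the helper `upcoming_public_events` and the sample rows are all assumptions, not part of the original question.

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE calendar_event (
        id INTEGER PRIMARY KEY,
        domain_id INTEGER NOT NULL,
        calendar_name TEXT NOT NULL,
        open_for_registration INTEGER NOT NULL,  -- 0 or 1
        start_date TEXT NOT NULL                 -- ISO 8601, sorts chronologically
    )
""")
# Compound index: equality columns first, the range column (start_date) last.
conn.execute("""
    CREATE INDEX idx_event_lookup ON calendar_event
        (domain_id, calendar_name, open_for_registration, start_date)
""")

today = date(2024, 10, 18)  # fixed date so the example is reproducible
sample = [
    (1, "Public", 1, today + timedelta(days=2)),    # matches
    (1, "Public", 0, today + timedelta(days=3)),    # registration closed
    (1, "Private", 1, today + timedelta(days=1)),   # wrong calendar
    (2, "Public", 1, today + timedelta(days=30)),   # outside the week
    (3, "Public", 1, today + timedelta(days=5)),    # domain not selected
]
conn.executemany(
    "INSERT INTO calendar_event "
    "(domain_id, calendar_name, open_for_registration, start_date) "
    "VALUES (?, ?, ?, ?)",
    [(d, c, o, s.isoformat()) for d, c, o, s in sample],
)

def upcoming_public_events(domain_ids, start, end):
    """Return ids of open, public events in the given domains and date range."""
    placeholders = ",".join("?" for _ in domain_ids)
    sql = f"""
        SELECT id FROM calendar_event
        WHERE domain_id IN ({placeholders})
          AND calendar_name = 'Public'
          AND open_for_registration = 1
          AND start_date >= ? AND start_date <= ?
        ORDER BY id
    """
    params = [*domain_ids, start.isoformat(), end.isoformat()]
    return [row[0] for row in conn.execute(sql, params)]

matches = upcoming_public_events([1, 2], today, today + timedelta(days=7))
```

The difficulty in the rest of the question is that, with user-defined schemas, neither the column list nor the index definition is known ahead of time.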

The complexity is that we have essentially made a single instance of, let's say, MongoDB act like an RDBMS server with many databases, where each database and its related schema is our "domain".

After busting my brain on this problem for a week and looking at various databases (MongoDB, Neo4j, MySQL, PostgreSQL) I have only found a few possible solutions:

  • Index all properties. A property could be represented in a Properties table or as an embedded document in MongoDB. In an RDBMS the property values would have to be serialized to strings. CONS: a) Can only search against one property at a time (no compound indexes), b) everything gets an index so we're incurring needless overhead.
  • Index select properties. In PostgreSQL this could be done with a partial index (what SQL Server calls a filtered index). Basically, the property record would have a bit called "indexed" that I would have to maintain. This bit would drive whether or not the partial index covers that particular property. CONS: we can still only search against one property at a time, which eliminates "compound indexes" from use. The only way I can imagine to mimic a compound index would be to search against each individual indexed property and return the intersection of the PKs.
  • Create/maintain database constructs to reflect working indexes. In MongoDB, I could create an "indexables" collection. A document in this collection might look like this: {domain_id: ObjectId(..), type_id: ObjectId(..), fields: {field1: "some int value", field2: "some date value", field3: "some bit value"}}. Then I index the "indexables" collection on {domain_id: 1, type_id: 1, "fields.field1": 1, "fields.field2": 1, "fields.field3": 1}. Then every time I create/update a document in my "things" collection I would have to plug its values into the field1, field2, field3 slots of indexables. (This works nicely with MongoDB because I can plug values of any datatype into those placeholders. In MySQL, using the same pattern, I would have to serialize values to strings.) I would also have to maintain the domain_id and type_id. Basically, it's an index layer (that I manage myself) built on top of indexes handled by the database. CONS: There's additional overhead. Whereas the database would normally manage indexes on my behalf, I now have to take care to do this myself. As MongoDB has no concept of multi-document transactions (at least before version 4.0), I couldn't guarantee that the document and its various index entries were committed in a single step. PROS: I have my compound indexes back. Indexes are maintained at the domain level.
  • I have considered allowing users to have their own instances of database X. Or in MongoDB their own collections. But I wondered if this wouldn't create more issues especially as we run up against practical limitations (number of databases or collections allowed). I tossed this idea out after not too much thought.
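The "indexables" option from the list above could look roughly like the sketch below. It is not MongoDB code: SQLite (via Python's standard sqlite3 module) stands in because it needs no server and, like MongoDB, lets a slot hold a value of any type. The names `things` and `indexables` come from the question; `SLOT_MAP`, `save_thing` and `find` are hypothetical helpers, and the mapping of properties to slots is an assumed piece of per-domain metadata.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE things (
        id INTEGER PRIMARY KEY,
        domain_id INTEGER NOT NULL,
        type_id INTEGER NOT NULL,
        body TEXT NOT NULL            -- the full document, serialized as JSON
    );
    CREATE TABLE indexables (
        thing_id INTEGER PRIMARY KEY REFERENCES things(id),
        domain_id INTEGER NOT NULL,
        type_id INTEGER NOT NULL,
        field1, field2, field3        -- untyped slots, any value type allowed
    );
    CREATE INDEX idx_indexables ON indexables
        (domain_id, type_id, field1, field2, field3);
""")

# Hypothetical metadata: which property of each (domain, type) fills which slot.
SLOT_MAP = {
    (1, 10): {
        "calendar_name": "field1",
        "open_for_registration": "field2",
        "start_date": "field3",
    },
}
ALLOWED_OPS = {"=", "<", "<=", ">", ">="}

def save_thing(domain_id, type_id, doc):
    """Store the document and its index row together in one transaction."""
    slots = SLOT_MAP[(domain_id, type_id)]
    values = {slot: doc.get(prop) for prop, slot in slots.items()}
    with conn:  # commits both inserts, or rolls both back on error
        cur = conn.execute(
            "INSERT INTO things (domain_id, type_id, body) VALUES (?, ?, ?)",
            (domain_id, type_id, json.dumps(doc)),
        )
        conn.execute(
            "INSERT INTO indexables VALUES (?, ?, ?, ?, ?, ?)",
            (cur.lastrowid, domain_id, type_id,
             values.get("field1"), values.get("field2"), values.get("field3")),
        )
    return cur.lastrowid

def find(domain_id, type_id, conditions):
    """conditions: list of (property, operator, value) tuples."""
    slots = SLOT_MAP[(domain_id, type_id)]
    clauses, params = ["domain_id = ?", "type_id = ?"], [domain_id, type_id]
    for prop, op, value in conditions:
        if op not in ALLOWED_OPS:
            raise ValueError(f"unsupported operator: {op}")
        clauses.append(f"{slots[prop]} {op} ?")  # slot names come from SLOT_MAP only
        params.append(value)
    sql = ("SELECT thing_id FROM indexables WHERE "
           + " AND ".join(clauses) + " ORDER BY thing_id")
    return [row[0] for row in conn.execute(sql, params)]

a = save_thing(1, 10, {"calendar_name": "Public", "open_for_registration": 1,
                       "start_date": "2024-10-20", "title": "Meetup"})
b = save_thing(1, 10, {"calendar_name": "Public", "open_for_registration": 0,
                       "start_date": "2024-10-21"})
c = save_thing(1, 10, {"calendar_name": "Public", "open_for_registration": 1,
                       "start_date": "2024-12-01"})

found = find(1, 10, [
    ("calendar_name", "=", "Public"),
    ("open_for_registration", "=", 1),
    ("start_date", ">=", "2024-10-18"),
    ("start_date", "<=", "2024-10-25"),
])
```

Because the slot order is fixed by the index, the mapping should put equality-tested properties in the leading slots and the range-tested property last. Here the two writes share one transaction, which is exactly the guarantee the question says was unavailable in MongoDB.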

Other ideas? Other kinds of databases that might better handle this problem?

Again, the idea is this: different users manage their own domains. Within a domain can be items of any "type". For each typed item we have properties. I want to allow users to run queries against their domains to get items of a type having properties that match their conditions. (thus compound indexes)
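For comparison, the workaround mentioned earlier for the "index select properties" option (search each indexed property separately and intersect the PKs) might be sketched like this. SQLite again stands in for PostgreSQL since both support partial indexes; the `properties` table layout, the helper names `add_item` and `find_items`, and the sample data are assumptions, and domain and type columns are left out to keep the sketch short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE properties (
        item_id INTEGER NOT NULL,
        name    TEXT NOT NULL,
        value   TEXT NOT NULL,              -- every value serialized to a string
        indexed INTEGER NOT NULL DEFAULT 0  -- the maintained "indexed" bit
    );
    -- Partial index: only rows flagged as indexed are covered.
    CREATE INDEX idx_props ON properties (name, value) WHERE indexed = 1;
""")

ALLOWED_OPS = {"=", "<", "<=", ">", ">="}

def add_item(item_id, props, indexed_names):
    """Insert one property row per key, flagging the ones to be indexed."""
    conn.executemany(
        "INSERT INTO properties VALUES (?, ?, ?, ?)",
        [(item_id, name, str(value), int(name in indexed_names))
         for name, value in props.items()],
    )

def find_items(conditions):
    """One single-property lookup per condition, combined with INTERSECT."""
    parts, params = [], []
    for name, op, value in conditions:
        if op not in ALLOWED_OPS:
            raise ValueError(f"unsupported operator: {op}")
        parts.append(
            "SELECT item_id FROM properties "
            f"WHERE indexed = 1 AND name = ? AND value {op} ?"
        )
        params += [name, str(value)]
    sql = " INTERSECT ".join(parts) + " ORDER BY item_id"
    return [row[0] for row in conn.execute(sql, params)]

indexed = {"calendar_name", "open_for_registration", "start_date"}
add_item(1, {"calendar_name": "Public", "open_for_registration": "true",
             "start_date": "2024-10-20"}, indexed)
add_item(2, {"calendar_name": "Public", "open_for_registration": "false",
             "start_date": "2024-10-21"}, indexed)
add_item(3, {"calendar_name": "Private", "open_for_registration": "true",
             "start_date": "2024-10-22"}, indexed)

result = find_items([
    ("calendar_name", "=", "Public"),
    ("open_for_registration", "=", "true"),
    ("start_date", ">=", "2024-10-18"),
    ("start_date", "<=", "2024-10-25"),
])
```

Each branch of the INTERSECT can use the partial index on its own, but the database still has to materialize and intersect the per-property id sets, which is the cost a real compound index avoids. String comparison only orders correctly here because the dates are ISO 8601; numeric values would need padding or a typed column.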

One last thought. A domain in itself is not intended to be humongous. It might have 10-20 "types". Within each type there might be as many as 5000 records (in most cases) and say 20000 in extreme cases.

Unfortunately, this is one of those cases where, despite Joel Spolsky's advice, I attempted astronaut architecture.

Comments (1)

走走停停 2024-10-25 14:19:56

"Other kinds of databases that might better handle this problem?"

Have you considered Excel? Maybe just indexed flat files :)

Look, the basic problem you're going to have here is that there is no silver bullet. Your idea is fine, but at some point you have to accept some set of trade-offs.

You can't index everything. At some point you'll have to identify "commonly-used" queries and build some indexes for those things. Unless you're planning to keep everything in memory, you'll end up creating indexes at some point.

"Within each type there might be as many as 5000 records (in most cases) and say 20000 in extreme cases."

Hey there's a true limitation. How much hardware can you throw at 5k records? How about 200k records? Is it going to be enough to keep it all in RAM? Keep part of it in RAM? Keep just the indexes in RAM?
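To make the "keep it all in RAM" option concrete, here is a minimal sketch of answering the question's example query with a plain in-memory scan and no index at all. The `Record` class, the `scan` helper and the generated data are assumptions made up for illustration; 20000 records matches the extreme per-type case mentioned in the question.

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class Record:
    domain_id: int
    type_name: str
    props: dict  # user-defined properties, no fixed schema

def scan(records, domain_ids, type_name, predicates):
    """Linear scan: keep records in the domains/type that pass every predicate."""
    return [
        r for r in records
        if r.domain_id in domain_ids
        and r.type_name == type_name
        and all(pred(r.props) for pred in predicates)
    ]

today = date(2024, 10, 18)  # fixed date so the example is reproducible
records = [
    Record(1, "CalendarEvent", {
        "start_date": today + timedelta(days=i % 100),
        "open_for_registration": i % 2 == 0,
        "calendar_name": "Public" if i % 4 == 0 else "Private",
    })
    for i in range(20000)
]

week_out = today + timedelta(days=7)
hits = scan(records, {1}, "CalendarEvent", [
    lambda p: today <= p["start_date"] <= week_out,
    lambda p: p["open_for_registration"],
    lambda p: p["calendar_name"] == "Public",
])
```

Whether a scan like this is fast enough depends on record size, query rate and how many domains share the machine, so it is worth measuring before building a custom index layer.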

If you want to let users just stuff in their own "dynamic" schemas, I personally feel that MongoDB is a natural fit. Especially for these small data sets you're indicating.

But it's not a silver bullet by any means. Each of these solutions will have its own set of problems. If there were an actual DB that could handle all of the requirements you put forth, let's face it, we'd all be using that DB :)
