Normalizing common ID types shared across tables
This is a simplified version of the problem.
We have customers who send us lots of data and then query it. We are required by them to have several "public" ids they can query our data by. (Most want to query our system via the id they send along with the data, but not always). For simplicity, we'll call them "pid", "crid" and "musicbrainzid". We have an "entity" table which stores this information. It looks something like this (the "authority" is who sent the data):
entity
--
entity_id
authority // who sent the data
type // 'pid', 'crid', 'musicbrainz', etc.
value // the actual id value
Then we have separate entities such as "episode", "series" and "broadcast" (actually, there's a lot more, but I'm keeping it simple here). Each of these has an entity_id pointing to the entity table.
How can external customers search via pid or crid and get the appropriate episode or series, along with proper identification of what it is? Given a pid, we can fetch the entity id, but then we need to search the episode, series and broadcast tables for this value. Further, not all ids will necessarily be related to all of the other tables, but any entity (e.g., an "episode") might have several ids (pid, crid, etc.).
Strategies:
- Find the entity id for a pid and search every other table for the pid.
- Put an "entity_type" column on entity, but what if it's a pid in the episode table but we accidentally set episode.type as series? We don't want to duplicate data and I don't want to put database metadata into column values.
Option 1 is slow and seems wrong (further, the various tables have different structures, which makes this problematic).
Option 2 means duplicate data, and this data can get out of sync. We could use triggers to enforce consistency, but this seems really nasty and, in any event, bugs in the implementation of MySQL triggers have hit us several times. We're using this strategy right now, but without triggers.
What's option 3?
Side note: we know we need to break "authority" out into a separate table because not all authority/type combinations are valid.
Comments (1)
If I've understood your question correctly, I'd go with Option 1.
The query to identify the row based on the entity_id shouldn't be that slow, as all of that data should be in an index.
If your indexes are configured correctly this shouldn't even access the actual data.
(At least in SQL Server it wouldn't.)
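To make the indexed version of Option 1 concrete, here is a minimal sketch using Python's built-in sqlite3 module rather than MySQL or SQL Server. The column names on episode, series and broadcast, the index names, the sample rows, and the assumption that several entity rows can share one entity_id (one row per public id) are all mine, not taken from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Assumption: one row per public id, so entity_id may repeat in this table.
CREATE TABLE entity (
    entity_id INTEGER NOT NULL,
    authority TEXT NOT NULL,
    type      TEXT NOT NULL,   -- 'pid', 'crid', 'musicbrainz', ...
    value     TEXT NOT NULL
);
-- Index on the lookup columns; entity_id is included so the id can be
-- read from the index itself.
CREATE INDEX entity_lookup ON entity (type, value, entity_id);

-- Hypothetical per-kind tables, each pointing back at entity_id.
CREATE TABLE episode   (entity_id INTEGER NOT NULL, title   TEXT);
CREATE TABLE series    (entity_id INTEGER NOT NULL, title   TEXT);
CREATE TABLE broadcast (entity_id INTEGER NOT NULL, channel TEXT);
CREATE INDEX episode_entity   ON episode (entity_id);
CREATE INDEX series_entity    ON series (entity_id);
CREATE INDEX broadcast_entity ON broadcast (entity_id);

-- Made-up sample data.
INSERT INTO entity VALUES
    (1, 'auth_a', 'pid',  'p001'),
    (1, 'auth_a', 'crid', 'crid://example/1'),
    (2, 'auth_a', 'pid',  'p002');
INSERT INTO episode VALUES (1, 'Pilot');
INSERT INTO series  VALUES (2, 'Example Series');
""")

# One branch per table; the literal in the first column tells the caller
# which kind of thing matched.
LOOKUP_SQL = """
SELECT 'episode' AS kind, e.entity_id, ep.title AS label
  FROM entity e JOIN episode ep ON ep.entity_id = e.entity_id
 WHERE e.type = :type AND e.value = :value
UNION ALL
SELECT 'series', e.entity_id, s.title
  FROM entity e JOIN series s ON s.entity_id = e.entity_id
 WHERE e.type = :type AND e.value = :value
UNION ALL
SELECT 'broadcast', e.entity_id, b.channel
  FROM entity e JOIN broadcast b ON b.entity_id = e.entity_id
 WHERE e.type = :type AND e.value = :value
"""

def find_by_public_id(conn, id_type, value):
    """Return (kind, entity_id, label) rows for one public id."""
    return conn.execute(LOOKUP_SQL, {"type": id_type, "value": value}).fetchall()
```

Because the kind comes from which branch matched, nothing about the type has to be stored twice. The cost is that every new per-kind table needs another branch in the query.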
One small change I'd make would be to create a small set of tables to identify which ids are valid for which tables.
You would then use this to narrow down which tables you need to search through.
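One way this narrowing could look, again as a sqlite3 sketch under assumptions: the mapping table `id_type_target`, its contents, and the per-kind column names are hypothetical. Note that it does store table names as column values, which the question wanted to avoid, so the names are checked against a fixed whitelist before being placed into SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entity (entity_id INTEGER, authority TEXT, type TEXT, value TEXT);
CREATE INDEX entity_lookup ON entity (type, value);
CREATE TABLE episode   (entity_id INTEGER, title   TEXT);
CREATE TABLE series    (entity_id INTEGER, title   TEXT);
CREATE TABLE broadcast (entity_id INTEGER, channel TEXT);

-- Hypothetical mapping: which id types can refer to which tables.
CREATE TABLE id_type_target (
    type         TEXT NOT NULL,
    target_table TEXT NOT NULL,
    PRIMARY KEY (type, target_table)
);
INSERT INTO id_type_target VALUES
    ('pid', 'episode'), ('pid', 'series'), ('pid', 'broadcast'),
    ('crid', 'episode'), ('crid', 'series'),
    ('musicbrainz', 'series');

-- Made-up sample data.
INSERT INTO entity VALUES
    (1, 'auth_a', 'pid',  'p001'),
    (1, 'auth_a', 'crid', 'crid://example/1'),
    (2, 'auth_a', 'pid',  'p002');
INSERT INTO episode VALUES (1, 'Pilot');
INSERT INTO series  VALUES (2, 'Example Series');
""")

# Table names cannot be bound as parameters, so only names from this
# fixed set are ever interpolated into a query.
KNOWN_TABLES = {"episode", "series", "broadcast"}

def tables_for(conn, id_type):
    """List the tables worth searching for a given id type."""
    rows = conn.execute(
        "SELECT target_table FROM id_type_target "
        "WHERE type = ? ORDER BY target_table",
        (id_type,),
    )
    return [t for (t,) in rows if t in KNOWN_TABLES]

def find_narrowed(conn, id_type, value):
    """Search only the tables the mapping allows; return (table, entity_id)."""
    hits = []
    for table in tables_for(conn, id_type):
        sql = (
            f"SELECT e.entity_id FROM entity e "
            f"JOIN {table} t ON t.entity_id = e.entity_id "
            f"WHERE e.type = ? AND e.value = ?"
        )
        hits += [(table, eid) for (eid,) in conn.execute(sql, (id_type, value))]
    return hits
```

With this data, a musicbrainz id only ever touches the series table, while a pid still checks all three.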
An alternative to Option 1 or 2 might be to change your database structure completely: store the different kinds of data in the same table, using entity_id as the primary key, with generic columns containing the data.
This would certainly be more radical, but I've seen it work well for a system like yours where the data and its structure are quite dynamic.
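A rough sketch of what that single-table layout might look like, using sqlite3 for illustration. The table name `item`, the `kind` column, the generic `attr1`/`attr2` columns and the sample rows are all assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical single table holding episodes, series and broadcasts.
CREATE TABLE item (
    entity_id INTEGER PRIMARY KEY,
    kind      TEXT NOT NULL,   -- 'episode', 'series', 'broadcast'
    name      TEXT,
    attr1     TEXT,            -- generic columns; meaning depends on kind
    attr2     TEXT
);

-- Public ids: several rows may point at the same item.
CREATE TABLE entity (
    entity_id INTEGER NOT NULL REFERENCES item (entity_id),
    authority TEXT NOT NULL,
    type      TEXT NOT NULL,
    value     TEXT NOT NULL,
    PRIMARY KEY (authority, type, value)
);
CREATE INDEX entity_lookup ON entity (type, value);

-- Made-up sample data.
INSERT INTO item VALUES
    (1, 'episode', 'Pilot', NULL, NULL),
    (2, 'series',  'Example Series', NULL, NULL);
INSERT INTO entity VALUES
    (1, 'auth_a', 'pid',  'p001'),
    (1, 'auth_a', 'crid', 'crid://example/1'),
    (2, 'auth_a', 'pid',  'p002');
""")

def find_item(conn, id_type, value):
    """One join answers both 'which row?' and 'what kind is it?'."""
    return conn.execute(
        "SELECT i.kind, i.entity_id, i.name "
        "FROM entity e JOIN item i ON i.entity_id = e.entity_id "
        "WHERE e.type = ? AND e.value = ?",
        (id_type, value),
    ).fetchall()
```

Here the kind lives in exactly one place, on the row it describes, so there is no second copy to drift out of sync. The trade-off is that the generic columns lose the per-kind structure and constraints the separate tables had.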