How to design a database for User Defined Fields?
My requirements are:
- Need to be able to dynamically add user-defined fields of any data type
- Need to be able to query UDFs quickly
- Need to be able to do calculations on UDFs based on their data type
- Need to be able to sort UDFs based on their data type
Other information:
- I am looking primarily for performance
- There are a few million Master records which can have UDF data attached
- When I last checked, there were over 50 million UDF records in our current database
- Most of the time, a UDF is only attached to a few thousand of the Master records, not all of them
- UDFs are not joined or used as keys; they are just data used for queries or reports
Options:
1. Create a big table with StringValue1, StringValue2... IntValue1, IntValue2... etc. I hate the idea, but will consider it if someone can tell me it is better than the other ideas, and why.
2. Create a dynamic table which adds new columns on demand as needed. I also don't like this idea, since I feel performance would be slow unless you indexed every column.
3. Create a single table containing UDFName, UDFDataType, and Value. When a new UDF gets added, generate a View which pulls just that data and parses it into whatever type is specified. Items which don't meet the parsing criteria return NULL.
4. Create multiple UDF tables, one per data type. So we'd have tables for UDFStrings, UDFDates, etc. Probably would do the same as #2 and auto-generate a View anytime a new field gets added.
5. XML data types? I haven't worked with these before but have seen them mentioned. Not sure whether they'd give me the results I want, especially for performance.
6. Something else?
14 Answers
If performance is the primary concern, I would go with #6... a table per UDF (really, this is a variant of #2). This answer is specifically tailored to this situation and to the data distribution and access patterns described.
Pros:
- Because you indicate that some UDFs have values for only a small portion of the overall data set, a separate table would give you the best performance: that table will be only as large as it needs to be to support the UDF. The same holds true for the related indices.
- You also get a speed boost by limiting the amount of data that has to be processed for aggregations or other transformations. Splitting the data out into multiple tables lets you perform some of the aggregation and other statistical analysis on the UDF data, then join that result to the master table via foreign key to get the non-aggregated attributes.
- You can use table/column names that reflect what the data actually is.
- You have complete control over the use of data types, check constraints, default values, etc. to define the data domains. Don't underestimate the performance hit resulting from on-the-fly data type conversion. Such constraints also help RDBMS query optimizers develop more effective plans.
- Should you ever need to use foreign keys, built-in declarative referential integrity is rarely out-performed by trigger-based or application-level constraint enforcement.
Cons:
- This could create a lot of tables. Enforcing schema separation and/or a naming convention would alleviate this.
- There is more application code needed to operate the UDF definition and management. I expect this is still less code than is needed for the original options 1, 3, & 4.
Other Considerations:
- If there is anything about the nature of the data that would make sense for the UDFs to be grouped, that should be encouraged. That way, those data elements can be combined into a single table. For example, let's say you have UDFs for color, size, and cost. The tendency in the data is that most instances of this data look like

    'red', 'large', 45.03

  rather than

    NULL, 'medium', NULL

  In such a case, you won't incur a noticeable speed penalty by combining the 3 columns in 1 table, because few values would be NULL, and you avoid making 2 more tables, which is 2 fewer joins needed when you need to access all 3 columns.
- If you hit a performance wall from a UDF that is heavily populated and frequently used, then it should be considered for inclusion in the master table.
- Logical table design can take you to a certain point, but when the record counts get truly massive, you should also start looking at what table partitioning options are provided by your RDBMS of choice.
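A rough sketch of what one such per-UDF table might look like (the Master table and the UDF name here are hypothetical, for illustration only):

    -- Hypothetical table for a single UDF ("warranty expiry" on Master records).
    -- It holds rows only for the master records that actually have this field.
    CREATE TABLE udf_warranty_expires (
        master_id INT  NOT NULL PRIMARY KEY REFERENCES Master(id),
        value     DATE NOT NULL
    );

    -- Index the value column to support the sorting/filtering requirements.
    CREATE INDEX ix_udf_warranty_expires_value ON udf_warranty_expires (value);

    -- Querying: join only when this UDF is needed.
    SELECT m.*
    FROM   Master m
    JOIN   udf_warranty_expires u ON u.master_id = m.id
    WHERE  u.value < '2025-01-01';

The table and its index stay only as large as the number of records that actually carry the field, which is what drives the performance benefit described above.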
I have written about this problem a lot. The most common solution is the Entity-Attribute-Value antipattern, which is similar to what you describe in your option #3. Avoid this design like the plague.
What I use for this solution when I need truly dynamic custom fields is to store them in a blob of XML, so I can add new fields at any time. But to make it speedy, also create additional tables for each field you need to search or sort on (you don't need a table per field--just a table per searchable field). This is sometimes called an inverted index design.
You can read an interesting article from 2009 about this solution here: http://backchannel.org/blog/friendfeed-schemaless-mysql
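A minimal sketch of that inverted-index idea, with hypothetical names (an entities table holding the blob, plus one narrow table per searchable field):

    -- Each entity keeps all its custom fields serialized in one blob.
    CREATE TABLE entities (
        id   INT PRIMARY KEY,
        body TEXT NOT NULL               -- serialized custom fields (e.g. XML)
    );

    -- One narrow index table per *searchable* field, not per field.
    CREATE TABLE index_user_email (
        email     VARCHAR(255) NOT NULL,
        entity_id INT NOT NULL REFERENCES entities(id),
        PRIMARY KEY (email, entity_id)
    );

    -- Search through the index table, then load the blob for display.
    SELECT e.*
    FROM   index_user_email i
    JOIN   entities e ON e.id = i.entity_id
    WHERE  i.email = 'someone@example.com';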
Or you can use a document-oriented database, where it's expected that you have custom fields per document. I'd choose Solr.
This sounds like a problem that might be better solved by a non-relational solution, like MongoDB or CouchDB.
They both allow for dynamic schema expansion while allowing you to maintain the tuple integrity you seek.
I agree with Bill Karwin, the EAV model is not a performant approach for you. Using name-value pairs in a relational system is not intrinsically bad, but it only works well when the name-value pairs form a complete tuple of information. When using them forces you to dynamically reconstruct a table at run-time, all kinds of things start to get hard. Querying becomes an exercise in pivot maintenance, or forces you to push the tuple reconstruction up into the object layer.
You can't determine whether a null or missing value is a valid entry or lack of entry without embedding schema rules in your object layer.
You lose the ability to efficiently manage your schema. Is a 100-character varchar the right type for the "value" field? 200 characters? Should it be nvarchar instead? It can be a hard trade-off, and one that ends with you having to place artificial limits on the dynamic nature of your set. Something like "you can only have x user-defined fields and each can only be y characters long."
With a document-oriented solution, like MongoDB or CouchDB, you maintain all attributes associated with a user within a single tuple. Since joins are not an issue, life is happy, as neither of these two does well with joins, despite the hype. Your users can define as many attributes as they want (or you will allow) at lengths that don't get hard to manage until you reach about 4MB.
If you have data that requires ACID-level integrity, you might consider splitting the solution, with the high-integrity data living in your relational database and the dynamic data living in a non-relational store.
I would most probably create a table of the following structure:
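A minimal sketch of such a table, assuming one typed value column per type family (all names here are illustrative):

    -- One row per (master record, field); only the column matching the
    -- field's type is populated.
    CREATE TABLE MasterUdfValues (
        MasterId     INT NOT NULL,
        FieldId      INT NOT NULL,
        StringValue  VARCHAR(400)  NULL,
        NumberValue  DECIMAL(18,4) NULL,   -- also usable for ints/booleans
        DateValue    DATETIME      NULL,
        PRIMARY KEY (MasterId, FieldId)
    );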
The exact types of course depend on your needs (and of course on the DBMS you are using). You could also use the NumberValue (decimal) field for ints and booleans. You may need other types as well.
You need some link to the Master records which own the value. It's probably easiest and fastest to create a user fields table for each master table and add a simple foreign key. This way you can filter master records by user fields easily and quickly.
You may want to have some kind of metadata information as well. So you end up with the following two tables: UdfMetaData and MasterUdfValues.
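A hedged sketch of the metadata table to go with the values table above (column names and types are assumptions):

    -- One row per user-defined field.
    CREATE TABLE UdfMetaData (
        FieldId  INT PRIMARY KEY,
        Name     VARCHAR(100) NOT NULL,
        DataType VARCHAR(20)  NOT NULL   -- e.g. 'string' | 'number' | 'date'
    );

    -- MasterUdfValues (sketched above) then references UdfMetaData(FieldId),
    -- and each master table gets its own values table with a simple foreign key.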
Whatever you do, I would not change the table structure dynamically. It is a maintenance nightmare. I would also not use XML structures; they are much too slow.
Even if you provide for a user adding custom columns, it will not necessarily be the case that querying on those columns will perform well. There are many aspects that go into query design that allow them to perform well, the most important of which is the proper specification of what should be stored in the first place. Thus, fundamentally, do you want to allow users to create schema without thought as to specifications, and still be able to quickly derive information from that schema? If so, then it is unlikely that any such solution will scale well, especially if you want to allow the user to do numerical analysis on the data.
Option 1
IMO this approach gives you schema with no knowledge as to what the schema means, which is a recipe for disaster and a nightmare for report designers. I.e., you must have the metadata to know what column stores what data. If that metadata gets messed up, it has the potential to hose your data. Plus, it makes it easy to put the wrong data in the wrong column. ("What? String1 contains the name of convents? I thought it was Charlie Sheen's favorite drugs.")
Options 3, 4, 5
IMO, requirements 2, 3, and 4 eliminate any variation of an EAV. If you need to query, sort or do calculations on this data, then an EAV is Cthulhu's dream and your development team's and DBA's nightmare. EAVs will create a bottleneck in terms of performance and will not give you the data integrity you need to quickly get to the information you want. Queries will quickly turn into Gordian knots of crosstabs.
Options 2, 6
That really leaves one choice: gather specifications and then build out the schema.
If the client wants the best performance on data they wish to store, then they need to go through the process of working with a developer to understand their needs so that it is stored as efficiently as possible. It could still be stored in a table separate from the rest of the tables with code that dynamically builds a form based on the schema of the table. If you have a database that allows for extended properties on columns, you could even use those to help the form builder use nice labels, tooltips etc. so that all that was necessary is to add the schema. Either way, to build and run reports efficiently, the data needs to be stored properly. If the data in question will have lots of nulls, some databases have the ability to store that type of information. For example, SQL Server 2008 has a feature called Sparse Columns specifically for data with lots of nulls.
If this were only a bag of data on which no analysis, filtering, or sorting was to be done, I'd say some variation of an EAV might do the trick. However, given your requirements, the most efficient solution will be to get the proper specifications even if you store these new columns in separate tables and build forms dynamically off those tables.
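For instance, a sketch of how the Sparse Columns feature mentioned above might be applied (table and column names are made up):

    -- SQL Server: SPARSE columns store NULLs at almost no cost, which suits
    -- wide tables where most user-defined columns are empty for most rows.
    CREATE TABLE MasterWithUdfs (
        Id           INT IDENTITY PRIMARY KEY,
        Name         NVARCHAR(100) NOT NULL,
        Color        NVARCHAR(30)  SPARSE NULL,  -- user-defined, rarely populated
        WarrantyDate DATE          SPARSE NULL,
        Weight       DECIMAL(10,2) SPARSE NULL
    );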
According to my research, multiple tables based on the data type are not going to help you with performance. Especially if you have bulk data, like 20K or 25K records with 50+ UDFs; in that case performance was the worst.
You should go with a single table with multiple columns, like:
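Presumably something along these lines (the column list below is an assumption, with one typed value column per type family):

    CREATE TABLE UdfValues (
        MasterId    INT NOT NULL,
        FieldName   VARCHAR(100)  NOT NULL,
        FieldType   VARCHAR(20)   NOT NULL,   -- 'string' | 'number' | 'date'
        StringValue VARCHAR(400)  NULL,
        NumberValue DECIMAL(18,4) NULL,
        DateValue   DATETIME      NULL,
        PRIMARY KEY (MasterId, FieldName)
    );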
This is a problematic situation, and none of the solutions appears "right". However, option 1 is probably the best, both in terms of simplicity and in terms of performance.
This is also the solution used in some commercial enterprise applications.
EDIT
another option that is available now, but didn't exist (or at least wasn't mature) when the question was originally asked, is to use JSON fields in the DB.
many relational DBs now support JSON-based fields (which can include a dynamic list of sub-fields) and allow querying on them
postgres
mysql
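For example, a sketch using Postgres's jsonb type (table and field names are assumptions):

    -- Store UDFs in a jsonb column and index it with GIN.
    CREATE TABLE master (
        id   SERIAL PRIMARY KEY,
        name TEXT NOT NULL,
        udfs JSONB NOT NULL DEFAULT '{}'
    );

    CREATE INDEX ix_master_udfs ON master USING GIN (udfs);

    -- Find records whose UDF "color" equals "red" (uses the GIN index).
    SELECT * FROM master WHERE udfs @> '{"color": "red"}';

    -- Sort by a numeric UDF by casting the extracted text value.
    SELECT * FROM master ORDER BY (udfs->>'cost')::numeric;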
Our database powers a SaaS app (helpdesk software) where users have over 7k "custom fields". We use a combined approach:
1. An (EntityID, FieldID, Value) table for searching the data
2. A JSON field in the entities table that holds all entity values, used for displaying the data. (This way you don't need a million JOINs to get the values.)
You could further split #1 to have a "table per datatype" like this answer suggests; this way you can even index your UDFs.
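A rough sketch of that combined layout (names are illustrative):

    -- 1. Narrow (EntityID, FieldID, Value) table, used only for searching.
    CREATE TABLE entity_field_values (
        EntityID INT NOT NULL,
        FieldID  INT NOT NULL,
        Value    VARCHAR(400) NOT NULL,
        PRIMARY KEY (EntityID, FieldID)
    );
    CREATE INDEX ix_efv_field_value ON entity_field_values (FieldID, Value);

    -- 2. JSON column on the entity itself, used for display without joins,
    --    e.g. entities.custom_fields = '{"color": "red", "size": "large"}'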
P.S. A couple of words to defend the "Entity-Attribute-Value" approach everyone keeps bashing. We have used #1 without #2 for decades and it worked just fine. Sometimes it's a business decision: do you have time to rewrite your app and redesign the DB, or can you throw a couple of bucks at cloud servers, which are really cheap these days? By the way, when we were using the #1 approach, our DB was holding millions of entities, accessed by hundreds of thousands of users, and a 16GB dual-core DB server was doing just fine.
I've had experience of 1, 3 and 4, and they all end up either messy, with it not being clear what the data is, or really complicated, with some sort of soft categorisation to break the data down into dynamic types of record.
I'd be tempted to try XML: you should be able to enforce schemas against the contents of the XML to check data typing etc., which will help with holding different sets of UDF data. In newer versions of SQL Server you can index on XML fields, which should help with performance (see http://blogs.technet.com/b/josebda/archive/2009/03/23/sql-server-2008-xml-indexing.aspx for example).
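As a sketch of the XML-column approach on SQL Server (table, column and element names are assumptions):

    -- XML column plus a primary XML index (requires a clustered primary key).
    CREATE TABLE MasterUdfXml (
        MasterId INT PRIMARY KEY,
        Udfs     XML NOT NULL
    );

    CREATE PRIMARY XML INDEX pix_master_udfs ON MasterUdfXml (Udfs);

    -- Pull a typed UDF value out of the XML.
    SELECT MasterId,
           Udfs.value('(/udfs/cost)[1]', 'decimal(18,4)') AS Cost
    FROM   MasterUdfXml;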
If you're using SQL Server, don't overlook the sql_variant type. It's pretty fast and should do your job. Other databases might have something similar.
XML datatypes are not so good for performance reasons: if you're doing calculations on the server, then you're constantly having to deserialize them.
Option 1 sounds bad and looks cruddy, but performance-wise it can be your best bet. I have created tables with columns named Field00-Field99 before, because you just can't beat the performance. You might need to consider your INSERT performance too, in which case this is also the one to go for. You can always create views on this table if you want it to look neat!
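A quick sketch of the sql_variant idea (table and column names are hypothetical):

    -- SQL Server: one value column that can hold almost any scalar type.
    CREATE TABLE MasterUdfVariant (
        MasterId  INT NOT NULL,
        FieldName VARCHAR(100) NOT NULL,
        Value     SQL_VARIANT NULL,
        PRIMARY KEY (MasterId, FieldName)
    );

    -- SQL_VARIANT_PROPERTY exposes the underlying type at query time.
    SELECT FieldName,
           Value,
           SQL_VARIANT_PROPERTY(Value, 'BaseType') AS BaseType
    FROM   MasterUdfVariant;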
SharePoint 使用选项 1 并且具有合理的性能。
SharePoint uses option 1 and has reasonable performance.
I've managed this very successfully in the past using none of these options (option 6? :) ).
I create a model for the users to play with (stored as XML and exposed via a custom modelling tool), and from the model I generate tables and views to join the base tables with the user-defined data tables. So each type has a base table with core data and a user table with user-defined fields.
Take a document as an example: typical fields would be name, type, date, author, etc. These would go in the core table. Users would then define their own special document types with their own fields, such as contract_end_date, renewal_clause, blah blah blah. For that user-defined document there would be the core document table and the xcontract table, joined on a common primary key (so the xcontract primary key is also foreign on the primary key of the core table). Then I would generate a view to wrap these two tables. Performance when querying was fast. Additional business rules can also be embedded into the views. This worked really well for me.
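A sketch of that arrangement for the contract example (the column lists are assumed):

    -- Core table shared by all documents.
    CREATE TABLE document (
        id      INT PRIMARY KEY,
        name    VARCHAR(200),
        type    VARCHAR(50),
        created DATE,
        author  VARCHAR(100)
    );

    -- User-defined extension table; its PK is also an FK to the core table.
    CREATE TABLE xcontract (
        id                INT PRIMARY KEY REFERENCES document(id),
        contract_end_date DATE,
        renewal_clause    VARCHAR(400)
    );

    -- Generated view wrapping both tables.
    CREATE VIEW v_contract AS
    SELECT d.id, d.name, d.type, d.created, d.author,
           x.contract_end_date, x.renewal_clause
    FROM   document d
    JOIN   xcontract x ON x.id = d.id;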
In the comments I saw you saying that the UDF fields are to dump imported data that is not properly mapped by the user.
Perhaps another option is to track the number of UDF's made by each user and force them to reuse fields by saying they can use 6 (or some other equally random limit) custom fields tops.
When you are faced with a database structuring problem like this it is often best to go back to the basic design of the application (import system in your case) and put a few more restraints on it.
Now what I would do is option 4 (EDIT) with the addition of a link to users:
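Presumably something like this per-data-type layout with the user link added (all names here are assumptions):

    -- One table per data type (option 4), each row tied to the defining user.
    CREATE TABLE UDFStrings (
        MasterId  INT NOT NULL,
        UserId    INT NOT NULL,            -- link to users
        FieldName VARCHAR(100) NOT NULL,
        Value     VARCHAR(400) NULL,
        PRIMARY KEY (MasterId, UserId, FieldName)
    );

    CREATE TABLE UDFDates (
        MasterId  INT NOT NULL,
        UserId    INT NOT NULL,
        FieldName VARCHAR(100) NOT NULL,
        Value     DATETIME NULL,
        PRIMARY KEY (MasterId, UserId, FieldName)
    );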
Now make sure to make views to optimize performance and get your indexes right. This level of normalization makes the DB footprint smaller, but your application more complex.
I would recommend #4, since this type of system was used in Magento, which is a highly accredited e-commerce CMS platform. Use a single table to define your custom fields using fieldId & label columns. Then, have separate tables for each data type, and within each of those tables have an index that indexes by fieldId and the data type value columns. Then, in your queries, use something like:
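As a hedged illustration (the table and column names below are assumptions, not Magento's actual schema):

    -- Look up the field by label, then join the matching data-type table;
    -- each data-type table is indexed on (fieldId, value).
    SELECT e.*, v.value AS color
    FROM   entity e
    JOIN   custom_field f         ON f.label = 'color'
    JOIN   entity_value_varchar v ON v.entityId = e.id
                                 AND v.fieldId  = f.id
    WHERE  v.value = 'red';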
This will ensure the best possible performance for user-defined types in my opinion.
In my experience, I've worked on several Magento websites that serve millions of users per month and host thousands of products with custom product attributes, and the database handles the workload easily, even for reporting.
For reporting, you can use PIVOT to convert your Fields table label values into column names, then pivot your query results from each data type table into those pivoted columns.
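A sketch of that reporting step using SQL Server PIVOT syntax (same hypothetical tables as above):

    -- Turn field labels into columns for reporting.
    SELECT entityId, [color], [size]
    FROM (
        SELECT v.entityId, f.label, v.value
        FROM   entity_value_varchar v
        JOIN   custom_field f ON f.id = v.fieldId
    ) AS src
    PIVOT (
        MAX(value) FOR label IN ([color], [size])
    ) AS p;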