EAV database scheme
I have a DB with more than 100K records: a lot of categories and many items, with different properties per category. Everything is stored in an EAV scheme.
If I try to break this scheme and create a unique table for each category, is there anything I will have to avoid?
Yes, I know that I'll probably have a lot of tables and I'll need to ALTER them if I want to add an extra field, BUT is this so wrong?
I have also read that the more tables I have, the more files the DB is spread across, and that this isn't good for any filesystem.
Any suggestions?
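To make this concrete, here is roughly what I mean; the table and column names below are simplified illustrations, not my real schema.

-- What I have now: one generic EAV table shared by every category
CREATE TABLE item_attributes (
    item_id    INT          NOT NULL,
    attr_name  VARCHAR(64)  NOT NULL,
    attr_value VARCHAR(255),
    PRIMARY KEY (item_id, attr_name)
);

-- What I'm considering: one typed table per category, e.g.
CREATE TABLE items_books (
    item_id    INT PRIMARY KEY,
    title      VARCHAR(255) NOT NULL,
    author     VARCHAR(255),
    isbn       CHAR(13),
    page_count INT
);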
4 Answers
As the primary structure in a database design, an EAV structure will fail as the data grows. The way you know a database schema does not fit the business model is when you need to query against it for reporting. EAVs require many workarounds and non-native database functionality in order to get reasonable reports; i.e., you are constantly creating crosstab/pivot queries for even the smallest query. All that processing to take the EAV and put it in a queryable format chews through CPU cycles and is highly prone to error. In addition, the size of the data grows geometrically: if you have 10 attributes, 10 rows in a standard design generate 100 EAV rows, 100 standard rows equate to 1,000 EAV rows, and so on.
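To make that concrete, here is a minimal sketch of the kind of crosstab/pivot an EAV forces on you for even a trivial read (the table and attribute names are illustrative, not from the question):

-- Rebuild one row per entity out of its attribute rows
SELECT
    v.entity_id,
    MAX(CASE WHEN v.attr_name = 'color'  THEN v.attr_value END) AS color,
    MAX(CASE WHEN v.attr_name = 'weight' THEN v.attr_value END) AS weight,
    MAX(CASE WHEN v.attr_name = 'width'  THEN v.attr_value END) AS width
FROM eav_values v
GROUP BY v.entity_id;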
Database management systems are designed to handle lots of tables and this should not be a worry.
It is possible to create a hybrid solution where an EAV structure is part of the solution. However, the rule must be that you can never include a query of the form
[AttributeCol] = 'Attribute'
I.e., you can never filter on, sort on, or restrict the range of any attribute. You cannot place a specific attribute anywhere in a report or on-screen. It is just a blob of data. Combined with a good schema for the rest of the system, having an EAV that stores a blob of data can be useful. The key to making this work is enforcement among yourself and the developers never to cross the line of filtering or sorting on an attribute. Once you go down the dark path, forever will it dominate your destiny.
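A minimal sketch of such a hybrid, with illustrative names: everything you filter, sort, or report on lives in real typed columns, and the EAV side is only ever fetched whole, by key.

CREATE TABLE products (
    product_id  INT PRIMARY KEY,
    category_id INT NOT NULL,
    name        VARCHAR(255) NOT NULL,    -- queryable
    price       DECIMAL(10,2) NOT NULL    -- queryable
);

-- The EAV part: a blob of data hanging off each row
CREATE TABLE product_extra (
    product_id INT NOT NULL REFERENCES products (product_id),
    attr_name  VARCHAR(64) NOT NULL,
    attr_value VARCHAR(255),
    PRIMARY KEY (product_id, attr_name)
);

-- Allowed:   SELECT attr_name, attr_value FROM product_extra WHERE product_id = 42;
-- Forbidden: anything with attr_name or attr_value in a WHERE or ORDER BY clause.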
There are database engines purpose-built to run EAV models. I don't know them, so I can't recommend one. But shoving an EAV model into a relational engine is a recipe for disaster. Disaster will occur; it's really just a matter of time.
It's possible that your data will stay small enough, and your queries simple enough, for this to work, but that's rarely the case.
An EAV DB schema is very flexible for adding more relational-database "columns", but at the cost of deteriorating query performance and losing the business logic that would otherwise be kept in the relational database schema.
This is because you have to create multiple views to actually pivot the result, which causes performance issues if the table contains billions of rows. Another characteristic of EAV schemas is that queries always join the data table with the metadata table, and there may be multiple joins on the same data table, as sketched below.
This is based on my experience.
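For illustration only (made-up names), this is what those joins tend to look like when you want just two attributes back as columns; note the data table appears twice, and the metadata table twice as well:

SELECT i.item_id,
       v1.attr_value AS color,
       v2.attr_value AS weight
FROM items i
JOIN eav_values v1 ON v1.item_id = i.item_id
JOIN attributes m1 ON m1.attr_id = v1.attr_id AND m1.attr_name = 'color'
JOIN eav_values v2 ON v2.item_id = i.item_id
JOIN attributes m2 ON m2.attr_id = v2.attr_id AND m2.attr_name = 'weight';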
I took this approach on an authoring system I built for e-learning about 4 years ago. I didn't know I was doing EAV at the time, but I thought I was being all sly just using name/value type pairs. I figured I'd have more records but less re-design, as I had gotten highly tired of adjusting columns every time we had a change request.
I did my first test constructing a hierarchy for the system in one table. That performed great with about 4 projects, 25 products and 4 to 5 tools each, all assigned through tier integers that link back to their primary keys.
I've been recording assets that pass through the system, meaning FLV files, SWF, JPG, PNG, GIF, PDF, MP3, etc., and all the mime-type specifics about them. This ranges from just 4 to 10 attributes on each file. It has totaled up to 8 million "asset data" records, whereas we have about 800K assets (est.).
I had a request to put all that information into columns for a report. The SQL statement would have to do a number of table joins on itself, let alone the fact that if they want to know the content, product, or project it was used in, it's just a slew of JOINs.
From a granular perspective it works great. From an Excel-report perspective, put your seat belt on. I've mitigated it by taking snapshots out to tables that reflect the data the way someone wants it in a report, but it takes a while to compile that information, which required me to offload it (SQL dump) to another server.
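Roughly, the snapshot idea looks like this (a sketch with made-up table and attribute names, not my actual schema): periodically materialize the pivoted view into a plain table that the report can scan directly.

-- Rebuilt on a schedule, then shipped off to the reporting server
CREATE TABLE asset_report_snapshot AS
SELECT d.asset_id,
       MAX(CASE WHEN d.attr_name = 'mime_type' THEN d.attr_value END) AS mime_type,
       MAX(CASE WHEN d.attr_name = 'width'     THEN d.attr_value END) AS width,
       MAX(CASE WHEN d.attr_name = 'duration'  THEN d.attr_value END) AS duration
FROM asset_data d
GROUP BY d.asset_id;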
I've found myself asking if this was the right thing to do, and for this project, up to this request for a report on a grand scale, I could say "yes". But it makes the server sweat pretty badly correlating it all. It really depends on how deep the queries they make go.
I've dabbled with SQL since 2002 and used it in supporting tools, nothing on a huge scale, and it has survived. If it were a larger, million-person, terabyte+ database, I'd probably be pulling my hair out.
Special note: I found out this system was on RedHat, and it was 32-bit. Many of the PHP processing threads were unable to run on more than 1 CPU core, and the server had 7 more cores sitting idle! Queries that were taking up to 45 minutes to run on this machine could actually run in 14-25 seconds on a properly configured 64-bit system. Also food for thought when considering performance.