Bad real-world database schemas
Our master's thesis project is creating a database schema analyzer. As a foundation for this, we are working on quantifying bad database design.
Our supervisor has tasked us with analyzing a real world schema, of our choosing, such that we can identify some/several design issues. These issues are to be used as a starting point in the schema analyzer.
Finding a good schema is a bit difficult because we do not want a schema which is well designed in all aspects, but a schema that is more "rare to medium".
We have already scheduled the following schemas for analysis: Wikimedia, Moodle and Drupal. We are not sure which category each one falls into. The schema does not have to be open source.
The database engine used is not important, though we would like to focus on SQL Server, PostgreSQL and Oracle.
For now, literature will be deferred, as this task is supposed to give us real-world examples that can be used in the thesis, i.e. "Design X is perceived by us as bad design, which our analyzer identifies and suggests improvements to", instead of coming up with contrived examples.
I will update this post when we have some kind of a tool ready.
Comments (6)
Check the Dell-dvd-store; you can use it for free.
Bill Karwin has written a great book about bad designs: SQL Antipatterns.
I'm working on a project that includes a geographical information system, and in my opinion these designs are often "medium" to "rare".
Here are some examples:
1) Geonames.org
You can find the data and the schema here: http://download.geonames.org/export/dump/ (scroll down to the bottom of the page for the schema; it's in plain text on the site!)
It would be interesting to see how this DB design performs with such a HUGE amount of data!
2) OpenGeoDB
This one is very popular in German-speaking countries (Germany, Austria, Switzerland) because it's a database containing nearly every city/town/village in the German-speaking region with zip code, name, hierarchy and coordinates.
This one comes with a .sql schema and the table fields are in English, so this shouldn't be a problem.
http://fa-technik.adfc.de/code/opengeodb/
The interesting thing in both examples is how they managed the hierarchy of entities like Country -> State -> County -> City -> Village etc.
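One common way to model such a hierarchy is an adjacency list, where each row points at its parent. The sketch below is only an illustration of that general technique using Python's built-in sqlite3 module; the `place` table, its columns and the sample rows are assumptions made up for this example and are not the actual Geonames or OpenGeoDB schema.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical adjacency-list table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE place (
            id        INTEGER PRIMARY KEY,
            name      TEXT NOT NULL,
            level     TEXT NOT NULL,                 -- illustrative labels only
            parent_id INTEGER REFERENCES place(id)   -- NULL for the root
        )
        """
    )
    conn.executemany(
        "INSERT INTO place VALUES (?, ?, ?, ?)",
        [
            (1, "Germany", "country", None),
            (2, "Bavaria", "state", 1),
            (3, "Upper Bavaria", "district", 2),
            (4, "Munich", "city", 3),
        ],
    )
    return conn


def ancestors(conn, place_id):
    """Return the names from the root down to the given place."""
    rows = conn.execute(
        """
        WITH RECURSIVE chain(id, name, parent_id, depth) AS (
            SELECT id, name, parent_id, 0 FROM place WHERE id = ?
            UNION ALL
            SELECT p.id, p.name, p.parent_id, chain.depth + 1
            FROM place AS p JOIN chain ON p.id = chain.parent_id
        )
        SELECT name FROM chain ORDER BY depth DESC
        """,
        (place_id,),
    ).fetchall()
    return [row[0] for row in rows]
```

Calling `ancestors(build_demo_db(), 4)` walks from the city up to the country with a recursive query. A schema analyzer might want to recognize a self-referencing foreign key like `parent_id` as a deliberate hierarchy rather than flag it.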
PS: Maybe you could judge my DB design too ;) DB Schema of a Role Based Access Control
vBulletin has a really bad database schema.
"we are working on quantifying bad database design."
It seems to me like you are developing a model, or process, or apparatus, that takes a relational schema as input and scores it for quality.
I invite you to ponder the following:
Can a physical schema be "bad" while the logical schema is nonetheless "extremely good"? Do you intend to distinguish properly between "logical schema" and "physical schema"? How do you plan to achieve that?
How do you decide that a certain aspect of physical design is "bad"? Take for example the absence of some index. If the relvar that the "supposedly desirable index" would be on is itself constrained to be a singleton, then what detrimental effects would the absence of that index cause for the system? If there are no such detrimental effects, then what grounds are there for qualifying the absence of such an index as "bad"?
How do you decide that a certain aspect of logical design is "bad"? Choices in logical design are made as a consequence of what the actual requirements are. How can you make any judgment whatsoever about a logical design without a formalized and machine-readable way to specify what the actual requirements are?
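To make the index question above concrete, here is a minimal sketch of a check that only reports a missing index when the table is large enough for it to matter. It uses Python's sqlite3 module and SQLite's `PRAGMA index_list`; the function name `unindexed_tables` and the `min_rows` threshold are assumptions invented for this illustration. Note that it looks at the observed row count, which is weaker than a declared singleton constraint.

```python
import sqlite3


def unindexed_tables(conn, min_rows=100):
    """List (table, row_count) for tables with no index and at least min_rows rows.

    Hypothetical rule: a table that holds only a handful of rows gains
    nothing from an index, so its absence is not reported.
    """
    findings = []
    names = [
        row[0]
        for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
        if not row[0].startswith("sqlite_")  # skip SQLite's internal tables
    ]
    for name in names:
        quoted = '"' + name.replace('"', '""') + '"'
        indexes = conn.execute(f"PRAGMA index_list({quoted})").fetchall()
        row_count = conn.execute(f"SELECT COUNT(*) FROM {quoted}").fetchone()[0]
        if not indexes and row_count >= min_rows:
            findings.append((name, row_count))
    return findings


def build_demo_db():
    """Small in-memory database with made-up tables for trying the check."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE app_settings (theme TEXT)")  # effectively a singleton
    conn.execute("INSERT INTO app_settings VALUES ('dark')")
    conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)", [(i, 1.0) for i in range(500)]
    )
    return conn
```

With the default threshold, the one-row `app_settings` table is ignored and only `orders` is reported. This does not answer the logical-design question; it only shows that a physical-design rule can be conditioned on context instead of applied blindly.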
Wow, you have an ambitious project ahead of you. Determining what counts as good database design may be impossible beyond broadly understood principles and guidelines.
Here are a few ideas that come to mind:
I work for a company that does database management for several large retail companies. We have custom databases designed for each of these companies, according to how they intend for us to use the data (for direct mail, email campaigns, etc.), and what kind of analysis and selection parameters they like to use. For example, a company that sells musical equipment in stores and online will want to distinguish between walk-in and online customers, categorize the customers according to the type of items they buy (drums, guitars, microphones, keyboards, recording equipment, amplifiers, etc.), and keep track of how much they spent, and what they bought, over the past 6 months or the past year. They use this information to decide who will receive catalogs in the mail. These mailings are very expensive; maybe one or two dollars per customer, so the company wants to mail the catalogs only to those most likely to buy something. They may have 15 million customers in their database, but only 3 million buy drums, and only 750,000 have purchased anything in the past year.
If you were to analyze the database we created, you would find many "work" tables that are used for specific selection purposes and that may not be properly designed according to database design principles. While the "main" tables are efficiently designed and have proper relationships and indexes, these "work" tables would make it appear that the entire database is poorly designed. In reality, the work tables may be used only a few times, or even just once, and we haven't gone in yet to clear them out or drop them. The work tables far outnumber the main tables in this particular database.
One also has to take into account the volume of the data being managed. A customer base of 10 million may have transaction data numbering 10 to 20 million transactions per week. Or per day. Sometimes, for manageability, this data has to be partitioned into tables by date range, and then a view would be used to select data from the proper sub-table. This is efficient for this huge volume, but it may appear repetitive to an automated analyzer.
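As a rough illustration of that layout, the sketch below creates two per-month tables and a view over them in SQLite, then groups tables by their column signature so that an analyzer could treat them as partitions instead of accidental duplicates. All table, view and function names here are assumptions made up for the example; real partitioning features in SQL Server, PostgreSQL or Oracle work differently.

```python
import sqlite3
from collections import defaultdict


def build_partitioned_db():
    """Create hypothetical per-month sales tables plus a view that unions them."""
    conn = sqlite3.connect(":memory:")
    for month in ("2024_01", "2024_02"):
        conn.execute(
            f"CREATE TABLE sales_{month} "
            "(customer_id INTEGER, sold_on TEXT, amount REAL)"
        )
    conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute(
        """
        CREATE VIEW sales_all AS
        SELECT * FROM sales_2024_01
        UNION ALL
        SELECT * FROM sales_2024_02
        """
    )
    return conn


def tables_sharing_columns(conn):
    """Return groups of table names whose column names and types are identical."""
    groups = defaultdict(list)
    names = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    for name in names:
        columns = conn.execute(f'PRAGMA table_info("{name}")').fetchall()
        # Each table_info row is (cid, name, type, notnull, dflt_value, pk).
        signature = tuple((col[1], col[2]) for col in columns)
        groups[signature].append(name)
    return [group for group in groups.values() if len(group) > 1]
```

An analyzer could report such a group once, as a candidate partition set, and leave it to the user to confirm whether the repetition is intentional.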
Your analyzer would need to be user configurable before the analysis began. Some items must be skipped, while others may be absolutely critical.
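One possible shape for that configuration, sketched in Python: a list of table-name patterns to skip (for example throwaway work tables) and a set of rules the user considers critical. The class `AnalyzerConfig`, the function `filter_findings`, the rule names and the `work_*` naming convention are all hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase


@dataclass
class AnalyzerConfig:
    """User-supplied settings, decided before the analysis runs."""
    skip_tables: list = field(default_factory=list)    # glob patterns, e.g. "work_*"
    critical_rules: set = field(default_factory=set)   # rule names to escalate


def filter_findings(findings, config):
    """Drop findings on skipped tables and attach a severity to the rest.

    `findings` is an iterable of (table_name, rule_name) pairs.
    """
    kept = []
    for table, rule in findings:
        if any(fnmatchcase(table, pattern) for pattern in config.skip_tables):
            continue  # the user asked to ignore this table entirely
        severity = "critical" if rule in config.critical_rules else "warning"
        kept.append((table, rule, severity))
    return kept
```

For instance, with `skip_tables=["work_*"]` a finding on a table named `work_drum_buyers` would be dropped, while the same rule on `customer` would still be reported.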
Also, how does one analyze stored procedures, user-defined functions and the like? I have seen some really ugly code that works quite efficiently. And some of the ugliest, most inefficient code was written for one-time use only.
OK, I am out of ideas for the moment. Good luck with your project.
If you can get ahold of it, the project management system Clarity has a horrible database design. I don't know if they have a trial version you can download.