Is there ever a reason not to use a primary key column?
I have a routine that will be creating individual tables (Sql Server 2008) to store the results of reports generated by my application (Asp.net 3.5). Each report will need its own table, as the columns for the table would vary based on the report settings. A table will contain somewhere between 10 and 5,000 rows, rarely more than 10,000.
The following usage rules will apply:
- Once stored, the data will never be updated.
- Whenever results for the table are accessed, all data will be retrieved.
- No other table will need to perform a join with this table.
Knowing this, is there any reason to create a PK index column on the table? Will doing so aid the performance of retrieving the data in any way, and if it would, would this outweigh the extra load of updating the index when inserting data? (I know that 10K records is a relatively small amount, but this solution needs to be able to scale.)
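For context, here is a rough sketch of the kind of table the routine generates (the table and column names are hypothetical; the real columns are chosen at runtime from the report settings), shown both as a plain heap and with a surrogate PK:

    -- Hypothetical dynamically generated report table, as a plain heap:
    -- written once, always read in full, never joined.
    CREATE TABLE dbo.Report_12345_Results (
        RegionName  varchar(100)  NOT NULL,
        TotalSales  decimal(18,2) NOT NULL,
        UnitsSold   int           NOT NULL
    );

    -- The same shape of table with a surrogate primary key: every insert now
    -- also maintains the clustered index, in exchange for a unique row identifier.
    CREATE TABLE dbo.Report_12346_Results (
        RowID       int IDENTITY(1,1) NOT NULL
            CONSTRAINT PK_Report_12346_Results PRIMARY KEY CLUSTERED,
        RegionName  varchar(100)  NOT NULL,
        TotalSales  decimal(18,2) NOT NULL,
        UnitsSold   int           NOT NULL
    );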
Update: Here are some more details on the data being processed, which go into the current design decision of one table per report:
- Tables will record a set of numeric values (set at runtime based on the report settings) that correspond to a different set of reference varchar values (also set at runtime based on the report settings).
- Whenever data is retrieved, some post-processing on the server will be required before the output can be displayed to the user (thus I will always be retrieving all values).
I would also be suspicious of someone claiming that they had to create a new table each time the report was run. However, given that a different set of columns (in number, name, and datatype) could conceivably be needed each time the report is run, I don't see a great alternative.
The only other thing I can think of is to have an ID column (identifying the ReportVersionID, corresponding to another table), a ReferenceValues column (a varchar field containing all reference values, in a specified order, separated by some delimiter), and a NumericValues column (same as ReferenceValues, but for the numbers). When I retrieve the results, I would put everything into specialized objects in the system, separating the values based on the defined delimiter. Does this seem preferable?
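For reference, a minimal sketch of that alternative layout (table and column names are only illustrative):

    -- Hypothetical single-table alternative: one row per report version,
    -- with the reference and numeric values packed into delimited strings
    -- that the application splits apart after retrieval.
    CREATE TABLE dbo.ReportResults (
        ReportVersionID int          NOT NULL,  -- points at the report-version table
        ReferenceValues varchar(max) NOT NULL,  -- e.g. 'North|South|East|West'
        NumericValues   varchar(max) NOT NULL   -- e.g. '100.5|98.2|120.0|87.3'
    );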
6 Answers
Primary keys are not a MUST for any and all data tables. True, they are usually quite useful and abandoning them is unwise. However, in addition to the primary mission of speed (which I agree would doubtfully be positively affected here), there is also that of uniqueness. To that end, and valuing the consideration you've already obviously taken, I would suggest that the only need for a primary key would be to govern the expected uniqueness of the table.
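If uniqueness is what matters, a natural key on the column(s) that must not repeat expresses the rule directly, without a surrogate; a hypothetical sketch:

    -- Hypothetical: if each row is identified by its reference value, a primary
    -- key on that column enforces uniqueness without an Identity column.
    CREATE TABLE dbo.Report_Results_Example (
        ReferenceValue varchar(100)  NOT NULL
            CONSTRAINT PK_Report_Results_Example PRIMARY KEY,
        NumericValue   decimal(18,2) NOT NULL
    );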
Update:
You mentioned in a comment that if you did add a PK, it would include an Identity column that presently does not exist and is not needed. In this case, I would advise against the PK altogether. As @RedFilter pointed out, surrogate keys never add any value.
I would keep it simple: just store the report results, converted to JSON or XML, in a VARCHAR(MAX) column.
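Something along these lines, presumably; the schema below is only an illustration, and since SQL Server 2008 has no native JSON support the application would do the serializing:

    -- Hypothetical single shared table: one row per report run, with the
    -- entire result set serialized by the application into VARCHAR(MAX).
    CREATE TABLE dbo.ReportRuns (
        ReportRunID int IDENTITY(1,1) NOT NULL PRIMARY KEY,
        ReportID    int               NOT NULL,
        RunDate     datetime          NOT NULL DEFAULT (GETDATE()),
        ResultData  varchar(max)      NOT NULL  -- JSON or XML produced by the app
    );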
One of the most useful and least (explicitly) emphasized benefits of data integrity (primary keys and foreign key references, to start with) is that it forces a 'design by contract' between your data and your application(s), which stops quite a lot of types of bugs from doing any damage to your data. This is such a huge win, and a thing that is implicitly taken for granted (it is not 'the database' that protects it, but the integrity rules you specify; by forsaking the rules you expose your data to various levels of degradation).
This seems unimportant to you (from the fact that you did not even discuss what a possible primary key would be), and your data seems quite unrelated to other parts of the system (from the fact that you will not do joins to any other tables); but still, if all things are equal, I would model the data properly, and then, if the primary keys (or other data integrity rules) are not being used and you are chasing every last bit of performance, I would consider dropping them in production (and test for any actual gains).
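For example, if measurement later shows the index maintenance cost to be significant, the constraint can be dropped without touching anything else (table and constraint names here are hypothetical):

    -- Hypothetical: once testing shows an actual gain, drop the primary key in
    -- production while keeping the rest of the model intact.
    ALTER TABLE dbo.SomeReportTable
        DROP CONSTRAINT PK_SomeReportTable;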
As for the comments that creating tables is a performance hit - that is true, but you did not tell us how temporary these tables are. Once created, will they be heavily used before being scrapped? Or do you plan to create tables for just a dozen read operations?
If you will use these tables heavily, and if you will provide a clean mechanism for managing them (removing them when not used, selecting from them, etc.), I think that dynamically creating the tables would be perfectly fine (you could have shared more details on the tables themselves; a use case would be nice).
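As a sketch of what such a management mechanism might look like (the naming convention and the log table below are assumptions, not anything from the question):

    -- Hypothetical cleanup job: build DROP TABLE statements for report tables
    -- that have not been read for 30 days, according to an assumed usage log.
    DECLARE @sql nvarchar(max);
    SET @sql = N'';

    SELECT @sql = @sql
        + N'DROP TABLE '
        + QUOTENAME(SCHEMA_NAME(t.schema_id)) + N'.' + QUOTENAME(t.name)
        + N';' + CHAR(13)
    FROM sys.tables AS t
    JOIN dbo.ReportTableLog AS l ON l.TableName = t.name  -- hypothetical log table
    WHERE t.name LIKE 'Report[_]%[_]Results'
      AND l.LastAccessed < DATEADD(DAY, -30, GETDATE());

    EXEC sys.sp_executesql @sql;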
Notes on other solutions:
- EAV model: this is horrible unless very specific conditions are met (for example, flexibility is paramount and automating DDL is too much of a hassle). Keep away from it (or be very, very good at anticipating what kinds of queries you will have to deal with, and rigorous in validating data on the front end).
- XML/BLOB approach: this might be the right thing for you if you will consume the data as XML/BLOBs at the presentation layer (you always read all of the rows, you always write the whole 'object', and your presentation layer likes XML/BLOBs).
EDIT:
Also, depending on the usage patterns, having a primary key can indeed increase the speed of retrieval, and if I can read the fact that the data will not be updated as 'it will be written once and read many times', then there is a good chance that this would indeed outweigh the cost of updating the index on inserts.
Will it be one table for every run of a given report, or one table for all runs of a given report? In other words, if you have Report #1 and you run it 5 times, each over a different range of data, will you produce 5 tables, or will all 5 runs of the report be stored in the same table?
If you are storing all 5 runs of the report in the same table, then you'll need to filter the data so that it is appropriate to the run in question. In this case, having a primary key will let you run the WHERE clause for that filter much faster.
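Under that shared-table assumption (names are hypothetical), a key that leads with the run identifier turns the per-run filter into an index seek instead of a full scan:

    -- Hypothetical table holding every run of one report; the clustered primary
    -- key leads with RunID, so filtering on a run only touches that run's pages.
    CREATE TABLE dbo.Report1_AllRuns (
        RunID   int           NOT NULL,
        RowNum  int           NOT NULL,
        Region  varchar(100)  NOT NULL,
        Total   decimal(18,2) NOT NULL,
        CONSTRAINT PK_Report1_AllRuns PRIMARY KEY CLUSTERED (RunID, RowNum)
    );

    -- Retrieving a single run becomes a seek on the clustered index.
    SELECT Region, Total
    FROM dbo.Report1_AllRuns
    WHERE RunID = 5;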
If you are creating a new table for every run of the report, then you don't need a primary key. However, you are going to run into other performance problems as the number of tables in your system grows... assuming you don't have something in place to drop old data / tables.
If you are really not using the tables for anything other than as a chunk of read-only data, you could just as well store all the reports in a single table, as XML values.
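A minimal sketch of that idea, using SQL Server's xml type (names are illustrative):

    -- Hypothetical: one row per report run, with the whole read-only
    -- result set stored as a single XML value.
    CREATE TABLE dbo.ReportResultsXml (
        ReportRunID int NOT NULL PRIMARY KEY,
        ResultXml   xml NOT NULL
    );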
What column or columns would the PK index be built on? If just a surrogate identity column, you'll have no performance hit when inserting rows, as they'd be inserted "in order". If it is not a surrogate key, then you have the admittedly minor but still useful assurance that you don't have duplicate entries.
Is the primary key used to control the order in which report rows are to be printed? If not, then how do you ensure proper ordering of the information? (Or is this just a data table that gets summed one way and another whenever a report is generated?)
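If ordering does matter, it has to come from an explicit column, since without an ORDER BY nothing guarantees the rows come back in insert order; a hypothetical example:

    -- Hypothetical: SortOrder records the intended row order at insert time,
    -- because SQL Server guarantees no ordering without an ORDER BY clause.
    SELECT ReferenceValue, NumericValue
    FROM dbo.SomeReportTable
    ORDER BY SortOrder;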
If you use a clustered primary key, you wouldn't use as much storage space as you would with a non-clustered index.
By and large, I find that while not every table requires a primary key, it does not hurt to have one present, and since proper relational database design requires primary keys on all tables, it's good practice to always include them.