使用 SQL/关系数据库存储和检索历史数据
鉴于此表:
CREATE TABLE DeptPeopleHistory (
DEPT_ID INTEGER,
PERSON_ID INTEGER,
START_DATE INTEGER,
END_DATE INTEGER,
UNIQUE(DEPT_ID, START_DATE, PERSON_ID), -- works as sorted index.
UNIQUE(PERSON_ID, START_DATE),
UNIQUE(PERSON_ID, END_DATE),
CONSTRAINT (START_DATE < END_DATE)
);
我有两个需求。第一个是获取在给定日期在给定部门工作的所有人员。目前我使用这个(语义上正确的)查询:
SELECT PERSON_ID FROM DeptPeopleHistory
WHERE
DEPT_IT = :given_dept AND
START_DATE <= :given_date AND :given_date < END_DATE
这对于小型历史表或查询最近的数据来说很快,但对于大型历史表和旧数据来说很慢,因为优化器仅使用第一个索引,并且没有好的方法来处理 END_DATE 。我尝试将 END_DATE 添加到第一个索引,但查询性能是相同的。我猜这是因为子过滤器 (DEPT_IT=:given_dept AND START_DATE <= :given_date) 应用于排序索引 (DEPT_ID、START_DATE、END_DATE、PERSON_ID) 时会产生未排序的 END_DATE 数据,因此 (:given_date < END_DATE )仍然需要对结果进行顺序扫描。
我的另一个需求是强制执行以下约束:一个人不能同时在两个部门工作,也不能在同一部门工作两次。这意味着:
-- This must work for previously empty data:
INSERT INTO DeptPeopleHistory(DEPT_ID, PERSON_ID, START_DATE, END_DATE)
VALUES (1, 1, 20100501, 20100520);
-- This should cause constraint violation because the person already
-- works at dept 1 on days from 20100517 to 20100519:
INSERT INTO DeptPeopleHistory(DEPT_ID, PERSON_ID, START_DATE, END_DATE)
VALUES (:any_dept, 1, 20100517, 20100523);
指定此约束的另一种方法是,对于给定的 PERSON_ID,START_DATE 必须是另一条记录中的最小值或等于 END_DATE。
考虑到这两个需求,我们实际上需要一种有效的方法来处理非相交范围。您是否知道通用 SQL 或某些特定数据库中的某些功能或结构可以满足这些需求?也许有一些“空间数据库”功能?
这些示例是在 MySQL 中进行的,但我需要适用于 Oracle、SQL Server 和 FireBird 的解决方案。这些解决方案不需要在所有此类数据库之间移植。
Given this table:
CREATE TABLE DeptPeopleHistory (
DEPT_ID INTEGER,
PERSON_ID INTEGER,
START_DATE INTEGER,
END_DATE INTEGER,
UNIQUE(DEPT_ID, START_DATE, PERSON_ID), -- works as sorted index.
UNIQUE(PERSON_ID, START_DATE),
UNIQUE(PERSON_ID, END_DATE),
CONSTRAINT (START_DATE < END_DATE)
);
I have two needs. The first is to get all people that works on a given department at a given date. Currently I use this (semantically correct) query:
SELECT PERSON_ID FROM DeptPeopleHistory
WHERE
DEPT_IT = :given_dept AND
START_DATE <= :given_date AND :given_date < END_DATE
This is fast for small history table or querying recent data, but is slow for big history tables and old data, because the optimizer uses only the first index and there's no good way to deal with END_DATE. I've tried to add END_DATE to the first index, but query performance is the same. I guess it's because the sub-filter (DEPT_IT=:given_dept AND START_DATE <= :given_date) when applied to a sorted index (DEPT_ID, START_DATE, END_DATE, PERSON_ID) results in data with unsorted END_DATE, so (:given_date < END_DATE) still requires a sequential scan on the result.
My other need is to enforce the following constraint: a person cannot work at two departments at same time, nor twice at the same department. This means the following:
-- This must work for previously empty data:
INSERT INTO DeptPeopleHistory(DEPT_ID, PERSON_ID, START_DATE, END_DATE)
VALUES (1, 1, 20100501, 20100520);
-- This should cause constraint violation because the person already
-- works at dept 1 on days from 20100517 to 20100519:
INSERT INTO DeptPeopleHistory(DEPT_ID, PERSON_ID, START_DATE, END_DATE)
VALUES (:any_dept, 1, 20100517, 20100523);
Another way to specify this constraint, is that for a given PERSON_ID, START_DATE must be the minimum or equals to END_DATE from another record.
Looking at those two needs, we actually need an efficient way for dealing with non-intersected ranges. Do you know some feature or construct in generic SQL or some specific database than can deal with these needs? Perhaps some "spatial database" feature?
The examples are in MySQL, but I need solutions that work on Oracle, SQL Server and FireBird. The solutions don't need to be portable across all such databases.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
作为起点,我推荐 Rick Snodgrass 编写的《用 SQL 开发面向时间的数据库应用程序》一书,可通过以下网址获取: 免费 PDF 下载。看起来您可以直接跳到第 5 章并通读第 6 章和第 7 章(但不要忽略后面章节中的替代方法)。
至于实现,postgreSQL 目前总体上具有良好的时间支持以及对可延迟约束的支持(这对于 SQL 中至关重要!对于诸如排序键之类的概念)。
请注意,还有其他时态数据库模型,例如 Date Darwen Lorentzos。
As a starting point, I recommend the book Developing Time-Oriented Database Applications in SQL by Rick Snodgrass, available as a free PDF download. Looks like you can jump right in a chapter 5 and read through chapters 6 and 7 (but don't dismiss the alternative approaches in later chapters).
As regards implementation, postgreSQL currently has good temporal support generally and support for deferrable constraints (which is vital -- in SQL! -- for concepts such as sequenced keys).
Note there are other models for temporal databases e.g. Date Darwen Lorentzos.
您是否尝试过在 DEPT_ID 和 END_DATE 上添加另一个索引?如果您使用 MySQL 5+,它可能能够进行索引合并并使用该索引和 DEPT_ID、START_DATE、PERSON_ID 索引。
至于你的第二个问题,我认为强制执行这种类型的约束的唯一方法是通过应用程序逻辑或插入/更新触发器。
Have you tried adding another index on DEPT_ID and END_DATE? If you are using MySQL 5+, it may be able to do an index merge and use both that index and the DEPT_ID, START_DATE, PERSON_ID index.
As for your second question, I think the only way to enforce that type of constraint would be either via application logic or an insert/update trigger.
是否可以将表
DeptPeopleHistory
的结构更改为?:优点:
UNIQUE
约束,也不需要强制执行START_DATE < END_DATE
一。缺点:
(1, 1, 20100501, 20100520)
现在分为 20 行。我想说,这不是一个真正的问题。关系数据库旨在处理许多行。START_DATE
或END_DATE
,必须运行查询。 (如果这太慢,我怀疑,可以使用额外的表)哦,你的慢查询将被写成:
使用你当前的
DeptPeopleHistory
设计,你能尝试以下的性能询问?Would it be possible to change the structure of table
DeptPeopleHistory
to?:Pros:
UNIQUE
constraints, nor theSTART_DATE < END_DATE
one.Cons:
(1, 1, 20100501, 20100520)
from the previous example is now split into 20 rows. Not a real problem, I'd say. Relational databases are designed to handle many rows.START_DATE
orEND_DATE
for a person in a department, a query has to be run. (if that is too slow, which I doubt, an additional table can be used)Oh, and your slow query would be written as:
With your current
DeptPeopleHistory
design, can you try the performance of the following query?