Snapshot design in a transactional database and versioning of reference data
Disclaimer: I have read everything that I can find on the topic of snapshots and versioning, both on Stack Overflow and elsewhere on the internet. My requirement is not version tracking for an audit trail, nor database-level snapshots. I have spent more than a week researching on my own and thinking through the possible options. I may have missed some links - if the solution to my problem is already discussed in some other thread, please point me there.
It is a bit long; please bear with me.
Here is the situation: We are trying to create a generic design to store snapshots of the transactional data in our transactional database and also to keep a revision history of reference data.
As part of the business process, a user can press a button to publish a certain object. For the purpose of illustration, let us say that the user can publish a vendor's proposal before negotiation starts. Then, at different points in time during the negotiation process, the user can publish the proposal data again. The proposal contains a budget, sales targets and many other items. When a proposal is snapshotted, all the linked entities have to be snapshotted as well. Finally, after the negotiation, a contract is signed. At this point, a complete snapshot of the contract has to be created. Not all the entities in the contract exist in the proposal – there are a lot of overlapping entities, but the proposal and the contract each have unique entities attached to them.
We have to keep both these published versions and the latest active versions available. Published versions are made available on a website to be referenced by both the vendors and the management team. Not all published versions are made available on the website, but the last published proposal and the latest published contract are always available there. This website also has to be populated from the same database.
Also, a finance user can decide to snapshot just the budget, and a sales manager can snapshot only the sales targets. So, snapshots are taken at multiple granularities.
We also have a requirement to track versions of the master data. It is a business requirement to track all the changes to key master data columns over time. For example, we have region information associated with the sales targets. The name of a region can change and we want to track these changes. Let us assume that at the time of the proposal, the region's name is R1 and a snapshot is created. Then the region's name changes to R2, and 2 more snapshots are created. We want to be able to link the sales targets to the correct region name at those points in time, not necessarily to the latest region name.
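To make this requirement concrete, here is a minimal sketch of one common way to version such reference data - a validity-dated table with a point-in-time lookup. All names (RegionVersion, RegionCode, ValidFrom/ValidTo) are illustrative assumptions, not our actual schema.

```sql
-- Illustrative only: one common way to keep a history of a reference attribute.
CREATE TABLE dbo.RegionVersion (
    RegionVersionID int IDENTITY(1,1) PRIMARY KEY,
    RegionCode      varchar(20)   NOT NULL,  -- stable business key of the region
    RegionName      nvarchar(100) NOT NULL,  -- the column whose history we track (R1, R2, ...)
    ValidFrom       datetime2     NOT NULL,
    ValidTo         datetime2     NULL       -- NULL = current version
);

-- Point-in-time lookup: the region name that was in effect on a given snapshot date.
DECLARE @AsOf datetime2 = '2014-06-01';      -- example date
SELECT rv.RegionCode, rv.RegionName
FROM dbo.RegionVersion AS rv
WHERE rv.ValidFrom <= @AsOf
  AND (rv.ValidTo IS NULL OR rv.ValidTo > @AsOf);
```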
We have some flexibility in modelling as we have both a transaction DB and a data warehouse DB and we can decide to store some of this information either in the transaction DB or in the data warehouse DB.
Here is our design. We have a Publication table which captures basic information about the published data – who published and the date, the reason, and the type of object published (proposal or budget or sales targets).
We store the snapshots in the same table as the original data. So, proposal snapshots would be stored with live proposals in the proposals table. We have a column called Publication ID in every table which has to be published. This column is an FK to the Publication table. If the Publication ID is null, that record is the active version.
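A rough DDL sketch of this design, with illustrative names and most proposal columns omitted; the final query shows how the website could pick up the most recently published version of a proposal.

```sql
-- Illustrative DDL for the design described above; names are assumptions.
CREATE TABLE dbo.Publication (
    PublicationID int IDENTITY(1,1) PRIMARY KEY,
    PublishedBy   nvarchar(100) NOT NULL,
    PublishedOn   datetime2     NOT NULL DEFAULT SYSUTCDATETIME(),
    Reason        nvarchar(400) NULL,
    ObjectType    varchar(30)   NOT NULL    -- 'Proposal', 'Budget', 'SalesTarget', ...
);

CREATE TABLE dbo.Proposal (
    ProposalID     int IDENTITY(1,1) PRIMARY KEY,
    ProposalNumber varchar(30) NOT NULL,    -- business key shared by the live row and its snapshots
    -- ... other proposal columns ...
    PublicationID  int NULL                 -- NULL = active version, otherwise a snapshot
        REFERENCES dbo.Publication (PublicationID)
);

-- How the website could fetch the most recently published snapshot of a proposal.
SELECT TOP (1) p.*
FROM dbo.Proposal AS p
JOIN dbo.Publication AS pub ON pub.PublicationID = p.PublicationID
WHERE p.ProposalNumber = 'P-001'            -- example business key
ORDER BY pub.PublishedOn DESC;
```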
I realize that the post is very lengthy. Hence, rather than listing all the scenario details, I have quickly summarized the design considerations in a mind map.
Now there are 2 solutions we are leaning towards - both would store a snapshot of all the data, whether it has changed or not. Maintaining only the delta while keeping the table structures intact would require a very complex stored procedure that has to run on every insert/update of any snapshotted object. I do not want to go down this route, as it would take longer to build and the volumes are not that huge anyway.
Solution 1: Each time an object (like a proposal or budget) is published, we would populate an XML tree and persist it in the database. Only the latest version needs to be available on the website, and old versions are rarely needed. Given this, would I run into big performance issues due to using XML? We use SQL Server. The data volumes are not that huge.
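A sketch of what Solution 1 might look like, reusing the illustrative Publication and Proposal tables above; BudgetLine and SalesTarget are assumed child tables, not our real ones.

```sql
-- Sketch of Solution 1: persist each published object as a single XML document.
CREATE TABLE dbo.PublicationSnapshot (
    PublicationID int PRIMARY KEY
        REFERENCES dbo.Publication (PublicationID),
    SnapshotXml   xml NOT NULL
);

DECLARE @ProposalID int = 42, @PublicationID int = 7;   -- example values

INSERT INTO dbo.PublicationSnapshot (PublicationID, SnapshotXml)
SELECT @PublicationID,
       (SELECT p.ProposalID,
               p.ProposalNumber,
               (SELECT b.BudgetLineID, b.Amount
                FROM dbo.BudgetLine AS b
                WHERE b.ProposalID = p.ProposalID
                FOR XML PATH('BudgetLine'), TYPE) AS Budget,
               (SELECT t.SalesTargetID, t.RegionCode, t.TargetAmount
                FROM dbo.SalesTarget AS t
                WHERE t.ProposalID = p.ProposalID
                FOR XML PATH('SalesTarget'), TYPE) AS SalesTargets
        FROM dbo.Proposal AS p
        WHERE p.ProposalID = @ProposalID
        FOR XML PATH('Proposal'), TYPE);
```

Reading the latest version back is then a single-row lookup, which is what makes this workable given that old versions are rarely needed.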
Solution 2: All the transaction tables would have a publication ID, and the reference data would have start and end dates. Whenever an object is published, we would make a copy of all the transaction records and stamp the publication ID on them, and we would copy all of the reference data records and set the snapshot date as the end date. This would allow us to have normal versioning for the reference data outside of the publication process.
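A sketch of the publish step under Solution 2, again using the illustrative tables above (in practice the whole thing would run inside one transaction, and the column lists would be spelled out).

```sql
-- Sketch of the publish step in Solution 2 (illustrative names).
DECLARE @ProposalID int = 42;                -- the live proposal being published
DECLARE @PublicationID int;

INSERT INTO dbo.Publication (PublishedBy, Reason, ObjectType)
VALUES (N'jdoe', N'Pre-negotiation publish', 'Proposal');
SET @PublicationID = SCOPE_IDENTITY();

-- Copy the active proposal row as a snapshot row stamped with the publication ID.
INSERT INTO dbo.Proposal (ProposalNumber, /* ...other columns... */ PublicationID)
SELECT p.ProposalNumber, /* ...other columns... */ @PublicationID
FROM dbo.Proposal AS p
WHERE p.ProposalID = @ProposalID
  AND p.PublicationID IS NULL;               -- only the active version is copied

-- Repeat the copy for every linked transactional table (budget lines, sales targets, ...).
-- Reference data keeps its own start/end-date versioning as described above; the
-- publication date ties a snapshot to the reference versions in effect at that time.
```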
I would need opinions from experienced minds here as to the drawbacks of these 2 approaches and whether there is any other, better approach.
1 Answer
My approach would be to opt for solution 2. Taking your design considerations in order:
I would store a copy of everything in the snapshot. If you store only the changes, you give yourself the problem of also snapshotting the details of the process needed to reconstruct the desired snapshot from those changes. Initially this is not an issue, but as schemas, programs, and processes change, you will have to maintain details of how to retrieve your desired snapshot from a process that has itself changed. Doable, but potentially fragile.
I would go for an option not mentioned in your diagram, though sketched out in your description of solution 2. This uses a schema very similar to that of the transaction DB, but extended to include the information specific to the snapshots. You mention a publication ID as a foreign key, and dates for the reference data. You might find you need additional information, such as dates related to the transaction data.
The same schema will not do - you have already pointed out (with the Publication ID) that the unchanged schema is not adequate. Nothing in your post suggests you need to adopt a different schema optimised for reading. Even if this proves to be required, it is something that can be incorporated at a later stage, with the current, extended schema as a starting point. I do not have much experience with XML trees, but would ask "why introduce another technology when you have alternatives that can utilise your existing infrastructure?" Any advantage you perceive from this approach would have to be very significant to warrant throwing away the leverage you get from your existing architecture. Similar considerations apply to a denormalised DB. Why go there until there is a demonstrated need to do so?
Again, I would adopt the approach of tracking versioning and snapshots separately. You give a primary benefit of this approach in your solution 2. I would make the snapshotting of the reference data part of the snapshotting process rather than the versioning process (i.e. when a snapshot is taken, ensure the appropriate reference table rows form part of the snapshot). It seems from your description that you have two different requirements that happen to utilise the same data - the snapshotting and the versioning. There seems to be little dependency between them, so you should keep them as independent as possible - low coupling.
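As a rough illustration of that last point, reusing the hypothetical Publication and RegionVersion tables sketched in the question, the snapshot process could simply record which reference versions were current when it ran, leaving the versioning process untouched. PublicationRegion is an invented name for this sketch.

```sql
-- Hypothetical sketch: capture the reference-data versions current at publish time
-- as part of the snapshot itself, keeping snapshotting and versioning decoupled.
CREATE TABLE dbo.PublicationRegion (
    PublicationID   int NOT NULL REFERENCES dbo.Publication (PublicationID),
    RegionVersionID int NOT NULL REFERENCES dbo.RegionVersion (RegionVersionID),
    PRIMARY KEY (PublicationID, RegionVersionID)
);

DECLARE @PublicationID int = 7;              -- example value

INSERT INTO dbo.PublicationRegion (PublicationID, RegionVersionID)
SELECT @PublicationID, rv.RegionVersionID
FROM dbo.RegionVersion AS rv
WHERE rv.ValidTo IS NULL;                    -- the versions current at snapshot time
```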
You mention potentially using the data warehouse as storage, though it is not specifically mentioned in your solutions. If your volumes are, as you suggest, low, then I would have thought that a separate database was adequate. You do give the impression that the volumes of both data and users for the snapshots are low, so there would not seem to be a prima facie case for using the data warehouse. At the same time, the warehouse does have mechanisms for storing exactly this type of historic data to be used for reading and analysis.
I am sorry that I have not directly answered your questions here - but I hope this provides some pointers and another view on your stated situation.