How do I archive related data (in different tables) using SSIS?

Posted on 2024-10-20 03:49:42 · 508 characters · 4 views · 0 comments

There is a lot of (older) related data spread over many tables in our database that needs archiving to a separate database then deleting from the original database (probably daily). What is archived (and deleted) is determined by business rules. We don't want newer data in this archive (it changes and is accessed frequently). Some of the archived data may need transferring back if required and possibly reported on.

I need to come up with an efficient and easy to maintain solution in SSIS. I can think of at least three ways of doing this:

  1. write one big select statement with lots of left joins to get all the data I want, then process that data either in memory or in a staging table
  2. work on a table-by-table basis, selecting the jobs first, then the related data from the other tables
  3. similar to the first method, except just dump the data into a de-normalised table (seems like a lazy method?)

I considered staging tables, though I don't see a benefit in this case - I can copy the data directly to the archive table with a timestamp (timestamp = time the process started), then when finished, go back and delete the IDs that were stamped with that process-start time. Once I've identified a job ID, I know the rest of the data that needs to be archived too.
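The timestamp-batch idea above can be sketched as follows. This is a minimal, hypothetical example using Python's `sqlite3` with a single in-memory database standing in for the live and archive databases, and made-up `Jobs`/`JobsArchive` tables and a `Status = 'closed'` business rule; in SSIS this would typically be two Execute SQL tasks sharing the same batch timestamp variable.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical schema: a live Jobs table, plus an archive copy that carries
# an ArchivedAt batch timestamp, as described in the question.
live = sqlite3.connect(":memory:")
live.execute("CREATE TABLE Jobs (JobId INTEGER PRIMARY KEY, Status TEXT)")
live.execute("CREATE TABLE JobsArchive (JobId INTEGER PRIMARY KEY, Status TEXT, ArchivedAt TEXT)")
live.executemany("INSERT INTO Jobs VALUES (?, ?)",
                 [(1, "closed"), (2, "open"), (3, "closed")])

# One timestamp for the whole batch = "time the process started".
batch_started = datetime.now(timezone.utc).isoformat()

# 1. Copy qualifying rows straight into the archive, stamped with the batch start.
live.execute("""
    INSERT INTO JobsArchive (JobId, Status, ArchivedAt)
    SELECT JobId, Status, ? FROM Jobs WHERE Status = 'closed'
""", (batch_started,))

# 2. Once the copy has committed, delete exactly the IDs stamped in this batch.
live.execute("""
    DELETE FROM Jobs WHERE JobId IN
        (SELECT JobId FROM JobsArchive WHERE ArchivedAt = ?)
""", (batch_started,))
live.commit()

print([r[0] for r in live.execute("SELECT JobId FROM Jobs ORDER BY JobId")])         # remaining live jobs
print([r[0] for r in live.execute("SELECT JobId FROM JobsArchive ORDER BY JobId")])  # archived jobs
```

Because the delete is keyed on the batch timestamp rather than re-running the business rule, rows that become eligible between the copy and the delete are left alone until the next run.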

I'm looking for the optimal approach. Does anyone have another one?
How have other people achieved this?

Comments (3)

夜灵血窟げ 2024-10-27 03:49:42

Sounds like you need two processes, one to archive old data, and one to reload archived data. I’d tackle both with the following methodology.

For Archiving data:

  • First, go through and, using your business rules, identify the data that needs to be archived. By and large, this means building a temporary table (or data set) containing only the keys of the data to be archived. Using invoices as an example, I’d collect just the IDs of the invoices to be archived, and that should be sufficient to identify all the related child data (invoice line items, shipping and payment information, etc.). You might need multiple such sets, for example when archiving invoices, customers, and salespeople, since none is “fully contained” within another.
  • Be certain these sets are whole and complete (that is, they contain no broken parent/child relationships). Basing all subsequent work on these data sets will ensure that nothing extra inadvertently “slips in”.
  • Next, go through and copy the data identified within these sets from the source database to the archive database.
  • When all the data has been properly copied over (and only then), go back and delete this data from the source tables.
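The four steps above can be sketched as a single driving key set. This is an illustrative example only, using Python's `sqlite3` with hypothetical `Invoices`/`InvoiceLines` tables and an invented "archive invoices dated before 2023" business rule; every copy and every delete is driven from the same `ArchiveKeys` set so nothing extra slips in and no child rows are orphaned.

```python
import sqlite3

# Hypothetical parent/child schema plus matching archive tables.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE Invoices (InvoiceId INTEGER PRIMARY KEY, InvoiceDate TEXT);
    CREATE TABLE InvoiceLines (LineId INTEGER PRIMARY KEY, InvoiceId INTEGER, Amount REAL);
    CREATE TABLE InvoicesArchive (InvoiceId INTEGER PRIMARY KEY, InvoiceDate TEXT);
    CREATE TABLE InvoiceLinesArchive (LineId INTEGER PRIMARY KEY, InvoiceId INTEGER, Amount REAL);
""")
db.executemany("INSERT INTO Invoices VALUES (?, ?)",
               [(1, "2020-01-05"), (2, "2024-06-01")])
db.executemany("INSERT INTO InvoiceLines VALUES (?, ?, ?)",
               [(10, 1, 5.0), (11, 1, 7.5), (12, 2, 3.0)])

# 1. Build the key set from the business rule (here: invoices before 2023).
db.execute("""
    CREATE TEMP TABLE ArchiveKeys AS
    SELECT InvoiceId FROM Invoices WHERE InvoiceDate < '2023-01-01'
""")

# 2. Copy parents and children, both driven by the same key set.
db.execute("""INSERT INTO InvoicesArchive
              SELECT * FROM Invoices
              WHERE InvoiceId IN (SELECT InvoiceId FROM ArchiveKeys)""")
db.execute("""INSERT INTO InvoiceLinesArchive
              SELECT * FROM InvoiceLines
              WHERE InvoiceId IN (SELECT InvoiceId FROM ArchiveKeys)""")

# 3. Only after the copy succeeds: delete children first, then parents.
db.execute("DELETE FROM InvoiceLines WHERE InvoiceId IN (SELECT InvoiceId FROM ArchiveKeys)")
db.execute("DELETE FROM Invoices WHERE InvoiceId IN (SELECT InvoiceId FROM ArchiveKeys)")
db.commit()
```

Deleting children before parents keeps foreign-key constraints satisfied throughout, and because the deletes reuse `ArchiveKeys`, nothing is removed that was not first copied.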

For Reloading data, it’d be pretty much the same process, but working from the archive set to the “live” set. Special considerations would include:

  • If reloaded, can the data be modified? If so, then you’d presumably need to delete it from the archive since, when it eventually is re-archived, it may have been modified. That, or you’ll have to allow for archiving the same set of data twice!
  • You’d presumably want to flag reloaded data somehow, so that it does not get immediately re-archived by the next archive run.
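One hypothetical way to implement the flagging in the second bullet: give the live table a `ReloadedAt` column, and have the archive run's selection predicate exclude rows reloaded within a grace period. Column name, grace period, and business rule here are all invented for illustration.

```python
import sqlite3

# Live table with a ReloadedAt flag set whenever a row is brought back
# from the archive (NULL means the row was never reloaded).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE Jobs (JobId INTEGER PRIMARY KEY, Status TEXT, ReloadedAt TEXT)")
db.executemany("INSERT INTO Jobs VALUES (?, ?, ?)",
               [(1, "closed", None),          # eligible: never reloaded
                (2, "closed", "2024-10-26"),  # recently reloaded: skip this run
                (3, "open",   None)])         # fails the business rule

# The archive run's predicate: business rule AND not reloaded inside the
# (assumed) 30-day grace period ending 2024-09-27.
eligible = [r[0] for r in db.execute("""
    SELECT JobId FROM Jobs
    WHERE Status = 'closed'
      AND (ReloadedAt IS NULL OR ReloadedAt < '2024-09-27')
""")]
print(eligible)
```

The reload process would set `ReloadedAt` as it inserts rows back into the live tables, so the very next archive run automatically passes over them.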

This methodology would work regardless of tool—SSIS or otherwise.

甜味超标? 2024-10-27 03:49:42

Could you please be more specific?

Do you need the 'old' data still present in your current database? If yes - then you can simply:

  1. Backup and restore
  2. Write a one-off 'efficient' delete
  3. Create a new SSIS package to handle future data population/maintenance

If you don't need the 'old' data in the existing database, inserting the required archive data into the new database will be the way to go. It depends on what you need in the archive database: if you don't need the non-key data, you can go for the de-normalised approach; otherwise a table-by-table approach will be good.


If this helps, please mark as answer.

梦幻之岛 2024-10-27 03:49:42

Use CQRS.
The problem is in the term "related data".
If you segregate your reads (all possible reads in their own database/tables), you will not need "related data" and can apply the rules to each "Aggregate" separately if necessary.
Let's say for list views you can get the count from two different sources (live and archived data) when rendering the pager. For the detail view you can still have a UUID for each resource, but in that case the application will read from a different data store.
The key point here is that you get rid of all kinds of joins.
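The pager/detail idea above can be sketched without any joins at all. This is a deliberately minimal, hypothetical read-side sketch: the two dicts stand in for the live and archive stores, and the function names are invented for illustration.

```python
# Two independent read stores; counts and rows come from each separately,
# so the application never joins across live and archived data.
live_counts = {"invoices": 120}
archive_counts = {"invoices": 4880}

def total_count(resource: str) -> int:
    # Pager total = live count + archived count, composed in the application.
    return live_counts.get(resource, 0) + archive_counts.get(resource, 0)

def fetch_detail(uuid: str, live_store: dict, archive_store: dict):
    # Detail view: the same UUID works against either store; try live first,
    # fall back to the archive.
    return live_store.get(uuid) or archive_store.get(uuid)

print(total_count("invoices"))  # 5000
```

Each "Aggregate" can then apply its own archiving rule without worrying about breaking a join somewhere else.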
