I have a relatively large table (~100m records) which is basically an XML store. There can be multiple XML documents with different timestamps (with the logic that the latest timestamp = the most recent version). We're expecting monthly batches of updated data, probably with new versions of ~70% of the data.
We're planning on only keeping the most recent 2-3 versions in the store, so I'm guessing our current b-tree index on (record ID, timestamp) is not necessarily the fastest? A straight-forward "select * from table where timestamp >= yyyy-mm-dd order by record id, timestamp" query took 15 hours to complete last night - pretty high-spec kit and I don't think anyone else was using the DB at the time.
(re: the query itself, ideally I only want to select the most recent document with timestamp >= yyyy-mm-dd, but that's less of an issue for now).
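For what it's worth, the sort of thing I mean by "only the most recent document" is roughly this (table and column names here are just placeholders for illustration):

    -- Placeholder names; "latest version per record, newer than a cut-off"
    SELECT record_id, ts, xml_doc
    FROM (
        SELECT record_id, ts, xml_doc,
               ROW_NUMBER() OVER (PARTITION BY record_id ORDER BY ts DESC) AS rn
        FROM   xml_store
        WHERE  ts >= DATE '2011-10-01'
    )
    WHERE rn = 1;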
Is there any way I can create an auto-decrement column, as follows:
Record ID Timestamp Version XML
1 2011-10-18 1 <...>
1 2011-10-11 2 <...>
1 2011-10-04 3 <...>
2 2011-10-18 1 <...>
2 2011-10-11 2 <...>
etc etc - i.e. as a new version comes along, the most recent timestamp = version 1, and all the older records get version = version + 1. This way my house-keeping scripts can be a simple "delete where version > 3" (or whatever we decide to keep), and I can have a b-tree index on record ID, and a binary index on version?
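To make that concrete, the maintenance I have in mind would be something like this (just a sketch, names invented):

    -- Before loading the new version of record :id, push its existing versions down...
    UPDATE xml_store SET version = version + 1 WHERE record_id = :id;
    INSERT INTO xml_store (record_id, ts, version, xml_doc)
    VALUES (:id, SYSDATE, 1, :new_xml);

    -- ...so that house-keeping stays trivial:
    DELETE FROM xml_store WHERE version > 3;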
Hope I'm not barking completely up the wrong tree - have been "creatively Googling" all morning and this is the theory I've come up with...
I'm not sure decrementing the version would be a good idea. The only way to do it would be with triggers looking up matching record IDs and updating them accordingly, which wouldn't be great for performance.
This is how I do something similar in our database environment (which is of a similar size). Hopefully it's useful:
Create a separate archive table that will hold all versions of your records. This will be populated by a trigger on insert to your main table. The trigger will insert the current version of the record into your archive, and update the record on the master table, incrementing the version number and updating the timestamp and data. Then, when you only need to select the latest version of all records, you simply do:
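(Something along these lines; the table and column names are only illustrative.)

    -- The main table only ever holds the current version of each record,
    -- so "the latest version of all records" is just:
    SELECT record_id, ts, version, xml_doc
    FROM   xml_store;
    -- (add WHERE ts >= DATE '2011-10-01' if you still want the cut-off)

And a rough shape for the archiving trigger, again just a sketch, assuming the monthly data arrives as updates to existing rows:

    -- Copy the outgoing version to the archive before it's overwritten,
    -- and bump the version number on the row being updated.
    CREATE OR REPLACE TRIGGER trg_xml_store_archive
    BEFORE UPDATE ON xml_store
    FOR EACH ROW
    BEGIN
        INSERT INTO xml_store_archive (record_id, ts, version, xml_doc)
        VALUES (:OLD.record_id, :OLD.ts, :OLD.version, :OLD.xml_doc);
        :NEW.version := :OLD.version + 1;
    END;
    /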
If you need the ability to view 'snapshots' of how the data looked at a given point in time, you will also need valid_from and valid_to columns on the table to record the times at which each version of the record was the latest version. You can populate these using the triggers when you write to the archive table. valid_to on the latest version of a record can be set to the maximum date available. When a newer version of a record is inserted, you'd update the valid_to of the previous version to be just before the valid_from of the new record (not exactly the same, to avoid dupes). Then, when you want to see how your data looked at a given time, you query the archive table using SQL like:
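(Again, only a sketch; the archive table and column names are assumed.)

    -- Point-in-time view: the version of each record that was current at :as_of
    SELECT record_id, ts, version, xml_doc
    FROM   xml_store_archive
    WHERE  :as_of BETWEEN valid_from AND valid_to;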
Batch work is definitely different from the typical insert/update approach (especially if triggers or many indexes are involved). Even with decent disks/hardware, you'll find the traditional DML approach is very slow at this volume. For a 100mm+ table where you're updating 70mm in a batch each month, I would suggest looking into an approach similar to the following (a rough SQL sketch appears after the steps):
Load new batch file (70mm) into separate table (NEW_XML), same format as existing table (EXISTING_XML). Use nologging to avoid undo.
Append (nologging) into NEW_XML the records from EXISTING_XML that don't exist in NEW_XML (the remaining ~30mm recs, based on whatever key(s) you already use).
Rename EXISTING_XML to HISTORY_XML and NEW_XML to EXISTING_XML. Here you'll need some downtime, off hours over a weekend perhaps. This won't take any time really, but you'll need time for next step (and due to object invalidations). If you already have a HISTORY_XML from previous month, truncate and drop it first (keep 1 month of old data).
Build indexes, stats, constraints, etc on EXISTING_XML (which now contains the new data as well). Recompile any invalidated objects, use logging, etc.
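A rough sketch of those steps in Oracle-style SQL (the join key, index name, and anything beyond EXISTING_XML/NEW_XML/HISTORY_XML are assumptions):

    -- 1. Stage the monthly batch with minimal logging.
    CREATE TABLE new_xml NOLOGGING AS
    SELECT * FROM existing_xml WHERE 1 = 0;     -- same structure, no rows
    -- load the ~70mm new/updated records into NEW_XML here
    -- (e.g. SQL*Loader direct path, or INSERT /*+ APPEND */ from an external table)

    -- 2. Carry over the ~30mm records that didn't get a new version.
    INSERT /*+ APPEND */ INTO new_xml
    SELECT e.*
    FROM   existing_xml e
    WHERE  NOT EXISTS (SELECT 1 FROM new_xml n WHERE n.record_id = e.record_id);
    COMMIT;

    -- 3. Swap the tables during the outage window (drop last month's history first, if present).
    DROP TABLE history_xml PURGE;
    ALTER TABLE existing_xml RENAME TO history_xml;
    ALTER TABLE new_xml RENAME TO existing_xml;

    -- 4. Rebuild indexes/stats/constraints and switch logging back on.
    CREATE INDEX ix_existing_xml ON existing_xml (record_id, ts);
    ALTER TABLE existing_xml LOGGING;
    EXEC DBMS_STATS.GATHER_TABLE_STATS(USER, 'EXISTING_XML');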
So in a nutshell, you'll have a table (EXISTING_XML) that not only has the new data, but was built relatively quickly (many times faster than the DML/trigger approach). Also, you may try using parallel for step 2 if needed.
Hope that helps.