Batch committing large INSERT operations in native SQL?

I have a couple of large tables (188m and 144m rows) I need to populate from views, but each view contains a few hundred million rows (pulling together pseudo-dimensionally modelled data into a flat form). The key on each table is a composite of columns totalling over 50 bytes. If the data were already in tables, I could always think about using sp_rename to swap in a newly built table, but that isn't really an option.

If I do a single INSERT operation, the process uses a huge amount of transaction log space, typically filling it up and prompting a bunch of hassle with the DBAs. (And yes, this is probably a job the DBAs should handle/design/architect.)

I can use SSIS and stream the data into the destination table with batch commits (but this does require the data to be transmitted over the network, since we are not allowed to run SSIS packages on the server).

Is there any approach other than dividing the process into multiple INSERT operations, using some kind of key to distribute the rows into different batches, and looping?

Answers (6)

还在原地等你 2024-08-15 16:41:47

Does the view have ANY kind of unique identifier / candidate key? If so, you could select those rows into a working table using:

SELECT key_columns INTO dbo.temp FROM dbo.HugeView;

(If it makes sense, maybe put this table into a different database, perhaps with SIMPLE recovery model, to prevent the log activity from interfering with your primary database. This should generate much less log anyway, and you can free up the space in the other database before you resume, in case the problem is that you have inadequate disk space all around.)
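To make that concrete, here is a minimal sketch of the scratch-database idea (the Scratch name and the cross-database SELECT INTO target are my assumptions, not part of the original answer):

-- Hypothetical staging database in SIMPLE recovery, so dumping the keys
-- generates minimal log and stays out of the primary database's log:
CREATE DATABASE Scratch;
ALTER DATABASE Scratch SET RECOVERY SIMPLE;

SELECT key_columns INTO Scratch.dbo.temp FROM dbo.HugeView;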

Then you can do something like this, inserting 10,000 rows at a time, and backing up the log in between:

SET NOCOUNT ON;

DECLARE
    @batchsize INT,
    @ctr INT;

SELECT
    @batchsize = 10000,
    @ctr = 0;

WHILE 1 = 1
BEGIN
    -- Number the staged keys, then take the next @batchsize-row slice.
    WITH x AS
    (
        SELECT key_column, rn = ROW_NUMBER() OVER (ORDER BY key_column)
        FROM dbo.temp
    )
    INSERT dbo.PrimaryTable(a, b, c /* , etc. */)
        SELECT v.a, v.b, v.c /* , etc. */
        FROM x
        INNER JOIN dbo.HugeView AS v
        ON v.key_column = x.key_column
        WHERE x.rn > @batchsize * @ctr
        AND x.rn <= @batchsize * (@ctr + 1);

    -- No rows left to copy; we're done.
    IF @@ROWCOUNT = 0
        BREAK;

    -- Back up the log between batches so the space can be reused instead
    -- of growing; WITH INIT overwrites the backup file each time.
    BACKUP LOG PrimaryDB TO DISK = 'C:\db.bak' WITH INIT;

    SET @ctr = @ctr + 1;
END

That's all off the top of my head, so don't cut/paste/run, but I think the general idea is there. For more details (and why I back up the log / checkpoint inside the loop), see this post on sqlperformance.com.

Note that if you are taking regular database and log backups, you will probably want to take a full backup afterward to start your log chain over again.
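One hedged footnote (mine, not the answerer's): BACKUP LOG only applies under the FULL or BULK_LOGGED recovery models. If the target database is in SIMPLE recovery, the equivalent step inside the loop is a checkpoint, which marks the inactive portion of the log as reusable:

-- Under the SIMPLE recovery model, replace the BACKUP LOG line with:
CHECKPOINT;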

ゃ人海孤独症 2024-08-15 16:41:47

You could partition your data and insert it in a cursor loop. That would be nearly the same as SSIS batch inserting, but it runs on your server.

DECLARE @year INT, @month INT;
DECLARE BatchCursor CURSOR FOR
    SELECT DISTINCT YEAR(DateCol), MONTH(DateCol) FROM whatever;
OPEN BatchCursor;
FETCH NEXT FROM BatchCursor INTO @year, @month;
WHILE @@FETCH_STATUS = 0
BEGIN
    INSERT INTO yourtable (...)
    SELECT * FROM whatever
    WHERE YEAR(DateCol) = @year AND MONTH(DateCol) = @month;
    FETCH NEXT FROM BatchCursor INTO @year, @month;
END;
CLOSE BatchCursor;
DEALLOCATE BatchCursor;

岁吢 2024-08-15 16:41:47

I know this is an old thread, but I made a generic version of Arthur's cursor solution:

--Split a batch up into chunks using a cursor.
--This method can be used for most any large table with some modifications
--It could also be refined further with an @Day variable (for example)

DECLARE @Year INT
DECLARE @Month INT

DECLARE BatchingCursor CURSOR FOR
SELECT DISTINCT YEAR(<SomeDateField>),MONTH(<SomeDateField>)
FROM <Sometable>;


OPEN BatchingCursor;
FETCH NEXT FROM BatchingCursor INTO @Year, @Month;
WHILE @@FETCH_STATUS = 0
BEGIN

--All logic goes in here
--Any select statements from <Sometable> need to be suffixed with:
--WHERE Year(<SomeDateField>)=@Year AND Month(<SomeDateField>)=@Month   


  FETCH NEXT FROM BatchingCursor INTO @Year, @Month;
END;
CLOSE BatchingCursor;
DEALLOCATE BatchingCursor;
GO

This solved the problem on loads of our large tables.

如梦初醒的夏天 2024-08-15 16:41:47

There is no pixie dust, you know that.

Without knowing specifics about the actual schema being transferred, a generic solution would be exactly as you describe it: divide processing into multiple inserts and keep track of the key(s). This is sort-of pseudo-code T-SQL:

create table currentKeys ([table] sysname not null primary key, [key] sql_variant not null);
go

declare @keysInserted table ([key] sql_variant);
declare @key sql_variant;
begin transaction
while (1 = 1)
begin
    -- fetch the last key committed for this target, then insert the next batch above it
    select @key = [key] from currentKeys where [table] = '<target>';
    insert into <target> (...)
    output inserted.key into @keysInserted ([key])
    select top (<batchsize>) ... from <source>
    where key > @key
    order by key;

    if (0 = @@rowcount)
       break;

    -- record progress and commit the batch atomically
    update currentKeys
    set [key] = (select max([key]) from @keysInserted)
    where [table] = '<target>';
    commit;
    delete from @keysInserted;
    set @key = null;
    begin transaction;
end
commit
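
One hedged caveat (mine): the loop reads the current key from currentKeys before the first batch, so the table needs a seed row per target, or @key stays NULL and the WHERE clause matches nothing. A hypothetical seeding step, assuming the keys are positive:

insert into currentKeys ([table], [key]) values ('<target>', 0);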

It would get more complicated if you want to allow for parallel batches and partition the keys.
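As a hedged sketch of that parallel idea (all names hypothetical, not from the answer): give each of @n workers its own currentKeys row and let it claim only its hash bucket of the source keys:

-- Worker @i of @n claims only its own bucket of the key space.
select top (<batchsize>) ... from <source>
where abs(checksum(key)) % @n = @i
  and key > @key
order by key;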

安人多梦 2024-08-15 16:41:47

You could use the BCP command to load the data and use the batch size parameter.

http://msdn.microsoft.com/en-us/library/ms162802.aspx

Two-step process:

  • BCP OUT data from the views into text files
  • BCP IN data from the text files into the tables, with the batch size parameter (rough example below)
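
A hedged sketch of the two commands (server, database, path, and object names are all placeholders; -n uses native format, -T a trusted connection, and -b sets how many rows are committed per batch):

rem Export from the view, then import in 10,000-row committed batches.
bcp MyDb.dbo.HugeView out C:\data\rows.dat -n -S MyServer -T
bcp MyDb.dbo.PrimaryTable in C:\data\rows.dat -n -b 10000 -S MyServer -T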
吹泡泡o 2024-08-15 16:41:47

This looks like a job for good ol' BCP.
