使用多个连接的单个事务。 (MYSQL/JDBC)
我正在开发的应用程序是一个基于 Java 的 ETL 流程,它将数据加载到多个表中。 DBMS 是 Infobright(一个基于 MYSQL 的用于数据仓库的 DBMS)。
数据加载应该以原子方式完成;但是,出于性能原因,我想同时将数据加载到多个表中(使用 LOAD DATA INFILE 命令)。这意味着我需要打开多个连接。
有没有任何解决方案可以让我以原子方式并行加载? (我猜答案可能取决于我加载到的表的引擎;其中大多数是 Brighthouse,它允许事务,但没有 XA 也没有保存点)。
为了进一步澄清,我想避免出现这样的情况:
- 我将数据加载到 5 个表中
- 我提交了前 4 个表的加载
- 第 5 个表的提交失败
在这种情况下,我无法回滚前 4 个加载,因为他们已经承诺了。
The application I'm working on is a Java-based ETL process that loads data into multiple tables. The DBMS is Infobright (a MYSQL-based DBMS geared for data warehousing).
The data loading should be done atomically; however, for performance reasons, I want to load data into multiple tables at the same time (using a LOAD DATA INFILE
command). This means I need to open multiple connections.
Is there any solution that allows me to do the loads atomically and in parallel?
(I'm guessing the answer might depend on the engine for the tables I load into; most of them are Brighthouse, which allows Transactions, but no XA and no Savepoints).
To further clarify, I want to avoid a situation where let's say:
- I load data into 5 tables
- I commit the loads for the first 4 tables
- The commit for the 5th table fails
In this situation, I can't rollback the first 4 loads, because they are already commited.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
正如我所承诺的那样
,我已经编写了一个完整的示例。我使用 MySQL 并创建了三个表,如下所示:
test2
最初包含一行。(我已将完整代码发布到 http://pastebin.com。)
以下示例做了几件事。
threads
设置为3
,这决定了要并行运行的作业数量。threads
个连接。a
)。线程
个要运行的作业并向它们加载数据。threads
个线程 中运行作业并等待其完成(成功与否)。(注意,我在
SQLTask.call()
中使用了 Java 7 的自动资源管理功能。)逻辑
数据
任务
运行
结果
给定
返回的默认数据getTableData(...)
并且
test2
已经包含a
(并且data
列是唯一的< /em>) 第二个作业将失败并抛出异常,因此每个连接都将被回滚。如果您返回
b
而不是a
,那么连接将被安全提交。这可以通过
LOAD DATA
类似地完成。在OP对我的答案做出回应后,我意识到她/他想做的事情不可能以简单明了的方式完成。
基本上问题是,成功提交后,已提交的数据无法回滚,因为该操作是原子的。鉴于在给定的情况下需要多次提交,除非跟踪所有数据(在所有事务中)并且如果发生某些情况则删除所有已成功提交的内容,否则不可能回滚所有内容。
有一个很好的答案相关提交和回滚的问题。
Intro
As I've promised I've hacked up a complete example. I've used MySQL and created three tables like the following:
test2
contains a single row initially.(I've posted the full code to http://pastebin.com.)
The following example does several things.
threads
to3
which determines how many jobs are going to be run in parallel.threads
number of connections.a
for every table).threads
number of jobs to be run and loads them with data.threads
number of threads and waits for their completion (successful or not).(Note, that I've used Java 7's automatic resource management feature in
SQLTask.call()
.)Logic
Data
Tasks
Run
Result
Given the default data returned by
getTableData(...)
and the fact that
test2
already containsa
(and thedata
column is unique) the second job will fail and throw an exception, thus every connection will be rolled back.If instead of
a
s you returnb
s, then the connections will be committed safely.This can be done similarly with
LOAD DATA
.After OP's response on my answer I realized that what she/he wants to do isn't possible to do in a simple and clear manner.
Basically the problem is that after a successful commit the data that was committed can't be rolled-back, because the operation is atomic. Given multiple commits are needed in the case given, rolling-back everything isn't possible unless one tracks all data (in all of the transactions) and if somethings happens deletes everything that was successfully committed.
There is a nice answer relating to the issue of commits and rollbacks.
实际上,在较新版本的 IEE(而不是 ICE)中,有一个附加功能称为 DLP(分布式负载处理)。网站上有一个 PDF 文件,链接如下:
http://www.infobright.com/Products /功能/
Actually in the newer version of IEE, not ICE, there is an additional feature called DLP (Distributed Load Processing). There is a PDF file on the site, linked from here:
http://www.infobright.com/Products/Features/