Cannot load a large dataset into an H2 database

Posted on 2024-10-16 04:13:15

Here is the problem: at my company we have a large database on which we want to perform some automated operations. To test this, we took a small sample of that data: about six CSV files of roughly 10 MB each. We want to use H2 to test the results of our program. H2 seemed to work fine with our previous CSVs, though they were at most 1000 entries long. With any of the 10 MB files, the command

insert into myschema.mytable (select * from csvread('mycsvfile.csv'));

fails, reporting that one of the records is supposedly duplicated and violates our primary key constraint.

Unique index or primary key violation: "PRIMARY_KEY_6 ON MYSCHEMA.MYTABLE(DATETIME, LARGENUMBER, KIND)"; SQL statement:
insert into myschema.mytable (select * from csvread('src/test/resources/h2/data/mycsvfile.csv')) [23001-148] 23001/23001

Breaking mycsvfile.csv into smaller pieces, I was able to see that the problem starts to appear after about 10000 inserted rows (though the number varies depending on the data used). I could, however, insert more than 10000 rows if I broke the file into pieces and then ran the command on each piece individually. But even if I manage to insert all the data by hand, I need an automated method to fill the database.
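
Splitting the file on disk can also be emulated in SQL, since csvread can be wrapped in an ordinary query with LIMIT and OFFSET (a sketch, assuming the H2 version in use supports this; the row counts and file name are illustrative):

-- load the file in slices of 10000 rows, advancing the offset each run
insert into myschema.mytable
select * from csvread('mycsvfile.csv') limit 10000 offset 0;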

Since running the command does not tell me which row causes the problem, I guessed that the culprit could be some cache in the csvread routine.
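
To rule out real duplicates in the file itself, a grouping query over csvread can list every key that occurs more than once (a sketch, using the same illustrative file name as above):

-- any row returned here is a genuinely duplicated key in the CSV
select datetime, largenumber, kind, count(*)
from csvread('mycsvfile.csv')
group by datetime, largenumber, kind
having count(*) > 1;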

I then created a small Java program that inserts the data into the H2 database manually. No matter whether I batched the commands or closed and reopened the connection every 1000 rows, H2 reported that I was trying to insert a duplicate entry into the database.

org.h2.jdbc.JdbcSQLException: Unique index or primary key violation: "PRIMARY_KEY_6 ON MYSCHEMA.MYTABLE(DATETIME, LARGENUMBER, KIND)"; SQL statement:
INSERT INTO myschema.mytable VALUES ( '1997-10-06 01:00:00.0',25485116,1.600,0,18 )  [23001-148]

Searching for that record with a plain emacs search, I can see that it is not duplicated: the datetime column is unique across the whole dataset.
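
A database-side version of the same check (a sketch, using the key values from the error message above) is to query the partially loaded table for the exact key just before the failing insert:

select * from myschema.mytable
where datetime = timestamp '1997-10-06 01:00:00.0'
  and largenumber = 25485116
  and kind = 18;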

I cannot share the data for you to test, since the company sells that information, but here is what my table definition looks like.

create table myschema.mytable (
   datetime timestamp,
   largenumber numeric(8,0) references myschema.largenumber(largecode),
   value numeric(8,3) not null,
   flag numeric(1,0) references myschema.flag(flagcode),
   kind smallint references myschema.kind(kindcode),
   primary key (datetime, largenumber, kind)
);

And this is what our CSV looks like:

datetime,largenumber,value,flag,kind
1997-06-11 16:45:00.0,25485116,0.710,0,18
1997-06-11 17:00:00.0,25485116,0.000,0,18
1997-06-11 17:15:00.0,25485116,0.000,0,18
1997-06-11 17:30:00.0,25485116,0.000,0,18
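
Note that csvread exposes every column as VARCHAR unless column types are passed in, so the insert above relies on implicit casts. An explicit variant (a sketch, same illustrative file name) can surface any conversion problem directly:

insert into myschema.mytable
select cast(datetime as timestamp),
       cast(largenumber as numeric(8,0)),
       cast(value as numeric(8,3)),
       cast(flag as numeric(1,0)),
       cast(kind as smallint)
from csvread('mycsvfile.csv');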

And here is the Java code that fills our test database (forgive the ugly code, I got desperate):

// Uses java.sql.* and org.h2.tool.Csv.
private static void insertFile(MyFile file) throws SQLException {
    int updateCount = 0;
    // Csv.read consumes the header line for column names, so the
    // result set contains data rows only; no row must be skipped.
    ResultSet rs = Csv.getInstance().read(file.toString(), null, null);
    ResultSetMetaData meta = rs.getMetaData();
    Connection conn = DriverManager.getConnection(
            "jdbc:h2:tcp://localhost/mytestdatabase", "sa", "pass");
    while (rs.next()) {
        Statement stmt = conn.createStatement();
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < meta.getColumnCount(); i++) {
            // Quote the first column (the timestamp); the rest are numeric.
            if (i == 0)
                sb.append("'" + rs.getString(i + 1) + "'");
            else
                sb.append(rs.getString(i + 1));
            sb.append(',');
        }
        updateCount++;
        // Drop the trailing comma.
        if (sb.length() > 0)
            sb.deleteCharAt(sb.length() - 1);

        stmt.execute(String.format(
                "INSERT INTO myschema.mytable VALUES ( %s ) ",
                sb.toString()));
        stmt.close();
        // Close and reopen the connection every 1000 rows.
        if (updateCount == 1000) {
            conn.close();
            conn = DriverManager.getConnection(
                    "jdbc:h2:tcp://localhost/mytestdatabase", "sa", "pass");
            updateCount = 0;
        }
    }
    if (!conn.isClosed()) {
        conn.close();
    }
    rs.close();
}

I'll be glad to provide more information if requested.

EDIT

@Randy I always check that the database is clean before running the command, and my Java program has a routine that deletes all data from a file that fails to insert:

 select * from myschema.mytable where largenumber  = 25485116;
 DATETIME   LARGENUMBER     VALUE   FLAG    KIND  
 (no rows, 8 ms)

Comments (1)

守不住的情 2024-10-23 04:13:15


The only thing that I can think of is that there is a trigger on the table that sets the timestamp to "now". Although that would not explain why you are successful with a few rows, it would explain why the primary key is being violated.
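
If such a trigger exists, it should be visible in H2's INFORMATION_SCHEMA (a sketch; the exact set of available columns can differ between H2 versions):

select trigger_name, table_name
from information_schema.triggers
where table_schema = 'MYSCHEMA'
  and table_name = 'MYTABLE';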
