在 MySQL 表中存储数组的 BLOB 与 VARCHAR

发布于 2024-09-06 23:28:48 字数 409 浏览 5 评论 0原文

我需要做出设计决策,并正在寻找一些最佳实践建议。我有一个java程序,需要在MySQL数据库中存储大量(每天几百个)浮点数组。数据是长度为 300 的固定长度 Double 数组。我可以看到三个合理的选项:

  1. 将数据存储为 BLOB。
  2. 序列化数据并将其存储为 VARCHAR。
  3. 将数据作为二进制文件写入磁盘并存储对其的引用。

我还应该提到的是,这些数据将被频繁读取和更新。

我想使用 BLOB,因为这是我过去所做的,而且它似乎是最有效的方法(例如,保持固定宽度且无需转换为逗号分隔的字符串)。然而,我的同事坚持认为我们应该序列化并使用 varchar,其原因似乎很教条。

如果这些方法中的一种比另一种更好,那么原因是 Java 或 MySQL 特定的吗?

I've got a design decision to make and am looking for some best practice advice. I have a java program which needs to store a large number (few hundred a day) of floating point arrays in a MySQL database. The data is a fixed length Double array of length 300. I can see three reasonable options:

  1. Store the data as a BLOB.
  2. Serialize the data and store it as a VARCHAR.
  3. Write the data to disk as a binary file and store a reference to it instead.

I should also mention that this data will be read from and updated frequently.

I want to use a BLOB since that is what I have done in the past and it seems like the most efficient method (e.g., maintains fixed width & no need to convert to a comma separated string). However my coworker is insisting that we should serialize and use varchar for reasons which seem mostly dogmatic.

If one of these methods is better than the other, are the reasons Java or MySQL specific?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(4

谁把谁当真 2024-09-13 23:28:48

您是否有理由不创建子表以便每行存储一个浮点值而不是数组?

假设您每天存储 1000 个数组,每个数组包含 300 个元素。即每天 300,000 行,或每年 1.095 亿行。没有什么可以轻视的,但在 MySQL 或任何其他 RDBMS 的能力范围内。


回复您的评论:

当然,如果订单很重要,您可以为订单添加另一列。以下是我设计表格的方式:

CREATE TABLE VectorData (
  trial_id INT NOT NULL,
  vector_no SMALLINT UNSIGNED NOT NULL,
  order_no SMALLINT UNSIGNED NOT NULL,
  element FLOAT NOT NULL,
  PRIMARY KEY (trial_id, vector_no),
  FOREIGN KEY (trial_id) REFERENCES Trials (trial_id)
);
  • 一行向量数据的总空间:300x(4+2+2+4) = 3600 字节。再加上 16 字节的 InnoDB 记录目录(内部内容)。

  • 序列化 300 个浮点的 Java 数组时的总空间 = 1227 字节?

因此,通过存储数组可以节省大约 2400 字节,即 67% 的空间。但假设您有 100GB 的空间来存储数据库。存储序列化数组允许您存储 8750 万个向量,而标准化设计仅允许您存储 2980 万个向量。

你说你每天存储几百个向量,所以你只需 81 年而不是 239 年就能填满那个 100GB 分区。


回复您的评论: INSERT 的性能 是一个重要的问题,但您每天只存储几百个向量。

大多数 MySQL 应用程序无需过多的魔法即可实现每秒数百或数千次插入。

如果您需要最佳性能,请考虑以下事项:

  • 显式事务
  • 多行 INSERT 语法
  • INSERT DELAYED(如果您仍然使用 MyISAM)
  • LOAD DATA INFILE
  • ALTER TABLE DISABLE KEYS,执行插入,ALTER TABLE ENABLE KEYS

搜索短语在您最喜欢的搜索引擎上“mysql inserts per Second”可以阅读许多讨论此问题的文章和博客。

Is there a reason you don't create a child table so you can store one floating point value per row, instead of an array?

Say you store a thousand arrays of 300 elements each per day. That's 300,000 rows per day, or 109.5 million per year. Nothing to sneeze at, but within the capabilities of MySQL or any other RDBMS.


Re your comments:

Sure, if the order is significant you add another column for the order. Here's how I'd design the table:

CREATE TABLE VectorData (
  trial_id INT NOT NULL,
  vector_no SMALLINT UNSIGNED NOT NULL,
  order_no SMALLINT UNSIGNED NOT NULL,
  element FLOAT NOT NULL,
  PRIMARY KEY (trial_id, vector_no),
  FOREIGN KEY (trial_id) REFERENCES Trials (trial_id)
);
  • Total space for a row of vector data: 300x(4+2+2+4) = 3600 bytes. Plus InnoDB record directory (internals stuff) of 16 bytes.

  • Total space if you serialize a Java array of 300 floats = 1227 bytes?

So you save about 2400 bytes, or 67% of the space by storing the array. But suppose you have 100GB of space to store the database. Storing a serialized array allows you to store 87.5 million vectors, whereas the normalized design only allows you to store 29.8 million vectors.

You said you store a few hundred vectors per day, so you'll fill up that 100GB partition in only 81 years instead of 239 years.


Re your comment: Performance of INSERT is an important issue, but you're only storing a few hundred vectors per day.

Most MySQL applications can achieve hundreds or thousands of inserts per second without excessive wizardry.

If you need optimal performance, here are some things to look into:

  • Explicit transactions
  • Multi-row INSERT syntax
  • INSERT DELAYED (if you still use MyISAM)
  • LOAD DATA INFILE
  • ALTER TABLE DISABLE KEYS, do the inserts, ALTER TABLE ENABLE KEYS

Search for the phrase "mysql inserts per second" on your favorite search engine to read many articles and blogs talking about this.

天暗了我发光 2024-09-13 23:28:48

像这样存储为 BLOB(请参阅下面的代码示例)。我认为这可能比使用java序列化更好,因为java的内置序列化将需要2427字节,并且非java应用程序将更难处理数据。也就是说,将来是否应该有任何非 Java 应用程序查询数据库......如果没有,那么内置序列化会少几行。

public static void storeInDB() throws IOException, SQLException {

    double[] dubs = new double[300];

    ByteArrayOutputStream bout = new ByteArrayOutputStream();
    DataOutputStream dout = new DataOutputStream(bout);
    for (double d : dubs) {
        dout.writeDouble(d);
    }
    dout.close();
    byte[] asBytes = bout.toByteArray();

    PreparedStatement stmt = null;  // however we normally get this...
    stmt.setBytes(1, asBytes);

}

public static double[] readFromDB() throws IOException, SQLException {

    ResultSet rs = null;  // however we normally get this...
    while (rs.next()) {
        double[] dubs = new double[300];
        byte[] asBytes = rs.getBytes("myDoubles");
        ByteArrayInputStream bin = new ByteArrayInputStream(asBytes);
        DataInputStream din = new DataInputStream(bin);
        for (int i = 0; i < dubs.length; i++) {
            dubs[i] = din.readDouble();
        }
        return dubs;
    }

}

编辑:我希望使用 BINARY(2400),但 MySQL 说:

mysql> create table t (a binary(2400)) ;
ERROR 1074 (42000): Column length too big for column 'a' (max = 255);
use BLOB or TEXT instead

Store as a BLOB like so (see code example below). I think this is probably better than using java serialization since java's builtin serialization will need 2427 bytes, and non-java applications will have a harder time dealing with the data. That is, should there ever be any non-java applications querying the database in the future.... if not then the builtin serialization is a few less lines.

public static void storeInDB() throws IOException, SQLException {

    double[] dubs = new double[300];

    ByteArrayOutputStream bout = new ByteArrayOutputStream();
    DataOutputStream dout = new DataOutputStream(bout);
    for (double d : dubs) {
        dout.writeDouble(d);
    }
    dout.close();
    byte[] asBytes = bout.toByteArray();

    PreparedStatement stmt = null;  // however we normally get this...
    stmt.setBytes(1, asBytes);

}

public static double[] readFromDB() throws IOException, SQLException {

    ResultSet rs = null;  // however we normally get this...
    while (rs.next()) {
        double[] dubs = new double[300];
        byte[] asBytes = rs.getBytes("myDoubles");
        ByteArrayInputStream bin = new ByteArrayInputStream(asBytes);
        DataInputStream din = new DataInputStream(bin);
        for (int i = 0; i < dubs.length; i++) {
            dubs[i] = din.readDouble();
        }
        return dubs;
    }

}

Edit: I'd hoped to use BINARY(2400), but MySQL says:

mysql> create table t (a binary(2400)) ;
ERROR 1074 (42000): Column length too big for column 'a' (max = 255);
use BLOB or TEXT instead
澜川若宁 2024-09-13 23:28:48

如果您只想将数据存储为 Java 数组的二进制转储,那么请务必使用 BLOB。您的朋友可能会建议不要这样做,因为您可能希望某些非 Java 程序在以后使用该信息,因此二进制转储可能很难解释。

通过序列化为 VARCHAR,您了解数据格式,并且可以使用任何应用程序轻松读取它。

当然,如果您有哪怕一丁点想要操作或报告单个浮点数的可能性,它们都应该以数据库友好的格式存储。换句话说,不是二进制转储,不是序列化,不是 CSV 列。

按照 Codd 的意图,以第三范式存储它们。

顺便说一下,每天几百个300个元素的浮点数组并不是一个大数据库。从使用 DB2 在大型机上工作的人那里得知,大多数 DBMS 都可以轻松处理这种容量。我们每天都会将数千万行数据收集到我们的应用程序中,而且它甚至不费吹灰之力。

If you just want to store the data as a binary dump of the Java array then, by all means, use a BLOB. Your friend may well be advising against this since you may want some non-Java program to use the information at some later date so binary dumps are probably a pain to interpret.

With serialization to a VARCHAR, you know the data format and can easily read it with any application.

Of course, if there's even the slightest chance that you'll want to manipulate or report on the individual floats, they should be stored in a database-friendly format. In other words, not a binary dump, not serialized, not a CSV column.

Store them as Codd intended, in third normal form.

By the way, a few hundred 300-element floating point arrays each day is not a big database. Take it from someone who works on the mainframe with DB2, most DBMS' will easily handle that sort of volume. We collect tens of millions of rows every day into our application and it doesn't even break into a sweat.

夏末染殇 2024-09-13 23:28:48

使用数据库来存储一维数组真是太痛苦了!
甚至更多地使用 rdm,其中存储的数据之间没有关系。
抱歉,但恕我直言,最好的解决方案是使用文件并按照您喜欢的方式写入数据。
二进制或 txt 形式。
因此,300xsize 的长文本或 300x1 行的 txt 是一个数组。

Using a database to store an one dimensional array is pain in the ass!
Even more using a rdm where is no relation between the data stored.
sorry but the best solution imho is use a file and just write the data the way u like.
binary or as txt.
Thus 300xsize of long or 300x1 line of txt is one array.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文