如何将数据持久保存到磁盘,并随机更新它,并将其有效地流回 RAM?

发布于 2024-08-02 00:54:43 字数 584 浏览 8 评论 0原文

我需要在磁盘上存储多达数千万甚至数亿条数据。 每条数据都包含以下信息:

id=23425
browser=firefox
ip-address=10.1.1.1
outcome=1.0

新数据的添加速度最多可达每毫秒 1 条。

因此它是一组相对简单的键值对,其中值可以是字符串、整数或浮点数。 有时候我可能需要更新某个特定id的数据,将flag字段从0更改为1。换句话说,我需要能够通过id进行随机键查找,并修改数据(实际上只是浮动的)点“结果”字段 - 所以我永远不需要修改值的大小)。

另一个要求是我需要能够有效地从磁盘流式传输这些数据(顺序不是特别重要)。 这意味着硬盘磁头不应该需要在磁盘上跳跃来读取数据,而是应该在连续的磁盘块中读取。

我正在用 Java 写这个。

我考虑过使用嵌入式数据库,但 DB4O 不是一个选项,因为它是 GPL,而我的其余代码不是。 考虑到 SQL 查询之间的转换开销,我还担心使用嵌入式 SQL 数据库的效率。

有人有什么想法吗? 我是否必须为此构建一个自定义解决方案(我直接处理 ByteBuffers 并处理 id 查找)?

I need to store up to tens or even hundreds of millions of pieces of data on-disk. Each piece of data contains information like:

id=23425
browser=firefox
ip-address=10.1.1.1
outcome=1.0

New pieces of data may be added at the rate of up-to 1 per millisecond.

So its a relatively simple set of key-value pairs, where the values can be strings, integers, or floats. Occasionally I may need to update the piece of data with a particular id, changing the flag field from 0 to 1. In other words, I need to be able to do random key lookups by id, and modify the data (actually only the floating point "outcome" field - so I'll never need to modify the size of the value).

The other requirement is that I need to be able to stream this data off disk (the order isn't particularly important) efficiently. This means that the hard disk head should not need to jump around the disk to read the data, rather it should be read in consecutive disk blocks.

I'm writing this in Java.

I've thought about using an embedded database, but DB4O is not an option as it is GPL and the rest of my code is not. I also worry about the efficiency of using an embedded SQL database, given the overhead of translating to and from SQL queries.

Does anyone have any ideas? Might I have to build a custom solution to this (where I'm dealing directly with ByteBuffers, and handling the id lookup)?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(9

偏爱自由 2024-08-09 00:54:43

H2 怎么样? 许可证应该适合您。

  • 您可以免费使用H2。 你可以
    将其集成到您的应用程序中
    (包括商业应用),
    你可以分发它。
  • 文件
    仅包含您的代码不是
    该许可证涵盖(它是
    “商业友好”)。
  • 修改
    H2源代码必须是
    发表。
  • 您不需要提供
    H2 的源代码(如果没有)
    修改任何东西。

我得到

1000000 insert in 22492ms (44460.252534234394 row/sec)

100000 update in 9565ms (10454.783063251438 row/sec)

from

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Random;


/**
 * @author clint
 *
 */
public class H2Test {

  static int testrounds = 1000000;

  public static void main(String[] args) {
    try {
      Class.forName("org.h2.Driver");

    Connection conn = DriverManager.
        getConnection("jdbc:h2:/tmp/test.h2", "sa", "");
    // add application code here
    conn.createStatement().execute("DROP TABLE IF EXISTS TEST");
    conn.createStatement().execute("CREATE TABLE IF NOT EXISTS TEST(id INT PRIMARY KEY, browser VARCHAR(64),ip varchar(16), outcome real)"); 
    //conn.createStatement().execute("CREATE INDEX IDXall ON TEST(id,browser,ip,outcome");


    PreparedStatement ps = conn.prepareStatement("insert into TEST (id, browser, ip, outcome) values (?,?,?,?)");
    long time = System.currentTimeMillis();
    for ( int i = 0; i < testrounds; i++ ) {
      ps.setInt(1,i);
      ps.setString(2,"firefox");
      ps.setString(3,"000.000.000.000");
      ps.setFloat(4,0);
      ps.execute();
    }
    long last = System.currentTimeMillis() ;
    System.out.println( testrounds + " insert in " + (last - time) + "ms (" + ((testrounds)/((last - time)/1000d)) + " row/sec)" );

    ps.close();
    ps = conn.prepareStatement("update TEST set outcome = 1 where id=?");
    Random random = new Random();
    time = System.currentTimeMillis();

    /// randomly updadte 10% of the entries
    for ( int i = 0; i < testrounds/10; i++ ) {
      ps.setInt(1,random.nextInt(testrounds));
      ps.execute();
    }

    last = System.currentTimeMillis();
    System.out.println( (testrounds/10) + " updates in " + (last - time) + "ms (" + ((testrounds/10)/((last - time)/1000d)) + " row/sec)" );

    conn.close();

    } catch (ClassNotFoundException e) {
      // TODO Auto-generated catch block
      e.printStackTrace();
    } catch (SQLException e) {
      // TODO Auto-generated catch block
      e.printStackTrace();
    }
  }

}

How about H2? The License should work for you.

  • You can use H2 for free. You can
    integrate it into your application
    (including commercial applications),
    and you can distribute it.
  • Files
    containing only your code are not
    covered by this license (it is
    'commercial friendly').
  • Modifications
    to the H2 source code must be
    published.
  • You don't need to provide
    the source code of H2 if you did not
    modify anything.

I get

1000000 insert in 22492ms (44460.252534234394 row/sec)

100000 updates in 9565ms (10454.783063251438 row/sec)

from

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Random;


/**
 * @author clint
 *
 */
public class H2Test {

  static int testrounds = 1000000;

  public static void main(String[] args) {
    try {
      Class.forName("org.h2.Driver");

    Connection conn = DriverManager.
        getConnection("jdbc:h2:/tmp/test.h2", "sa", "");
    // add application code here
    conn.createStatement().execute("DROP TABLE IF EXISTS TEST");
    conn.createStatement().execute("CREATE TABLE IF NOT EXISTS TEST(id INT PRIMARY KEY, browser VARCHAR(64),ip varchar(16), outcome real)"); 
    //conn.createStatement().execute("CREATE INDEX IDXall ON TEST(id,browser,ip,outcome");


    PreparedStatement ps = conn.prepareStatement("insert into TEST (id, browser, ip, outcome) values (?,?,?,?)");
    long time = System.currentTimeMillis();
    for ( int i = 0; i < testrounds; i++ ) {
      ps.setInt(1,i);
      ps.setString(2,"firefox");
      ps.setString(3,"000.000.000.000");
      ps.setFloat(4,0);
      ps.execute();
    }
    long last = System.currentTimeMillis() ;
    System.out.println( testrounds + " insert in " + (last - time) + "ms (" + ((testrounds)/((last - time)/1000d)) + " row/sec)" );

    ps.close();
    ps = conn.prepareStatement("update TEST set outcome = 1 where id=?");
    Random random = new Random();
    time = System.currentTimeMillis();

    /// randomly updadte 10% of the entries
    for ( int i = 0; i < testrounds/10; i++ ) {
      ps.setInt(1,random.nextInt(testrounds));
      ps.execute();
    }

    last = System.currentTimeMillis();
    System.out.println( (testrounds/10) + " updates in " + (last - time) + "ms (" + ((testrounds/10)/((last - time)/1000d)) + " row/sec)" );

    conn.close();

    } catch (ClassNotFoundException e) {
      // TODO Auto-generated catch block
      e.printStackTrace();
    } catch (SQLException e) {
      // TODO Auto-generated catch block
      e.printStackTrace();
    }
  }

}
維他命╮ 2024-08-09 00:54:43

JDBM 是一个很棒的 Java 嵌入式数据库(并且不像 Berkley 的 Java 版本那样受到许可的困扰)。 这值得尝试。 如果您不需要 ACID 保证(即您可以接受数据库在崩溃时损坏),请关闭事务管理器(显着提高速度)。

JDBM is a great embedded database for Java (and not as encumbered with licensing as the Java version of Berkley). It would be worth trying. If you don't need ACID guarantees (i.e. you are OK with the database getting corrupted in the event of a crash), turn off the transaction manager (significantly increases speed).

好菇凉咱不稀罕他 2024-08-09 00:54:43

我认为你会更成功地编写一些东西来缓存内存中最活跃的记录并将数据更改作为低优先级插入到数据库中。

我知道使用此方法会稍微增加 IO,但如果您谈论的是数百万条记录,我认为它仍然会更快,因为您创建的任何搜索算法都将大大优于成熟的数据库引擎。

I think you'd have a lot more success writing something that caches the most active records in memory and queues data changes as a low priority insert into the DB.

I understand there's a slight increase in IO using this method but if you're talking about millions of records I think it would still be faster because any search algorithm you create is going to be greatly outperformed by a a full fledged database engine.

寄人书 2024-08-09 00:54:43

您可以尝试 Berkley DB,它现在属于 Oracle。 他们拥有开源和商业许可证。 它使用键/值模型(如果需要其他形式的查询,可以选择创建索引)。 有纯 Java 版本和带有 Java 绑定的本机版本。

You could try Berkley DB which is now owned by Oracle. They have Open Source and Commercial licenses. It uses a Key/Value model (with an option to create indexes if other forms of queries are required). There is a pure Java version and a native version with Java bindings.

尽揽少女心 2024-08-09 00:54:43

http://www.zentus.com/sqlitejdbc/

SQLite 数据库(公共域)、JDBC 连接器带有 BSD 许可证,原生适用于一大堆平台(OSX、Linux、Windows),其余平台均进行仿真。

http://www.zentus.com/sqlitejdbc/

SQLite database (public domain), JDBC connector with BSD license, native for a whole bunch of platforms (OSX, Linux, Windows), emulation for the rest.

顾北清歌寒 2024-08-09 00:54:43

您可以使用与 JDK 捆绑在一起的 Apache Derby(或 JavaDB)。 但是,如果 DBMS 不能提供所需的速度,您可以自己实现特定的文件结构。 如果只需要精确的键查找,您可以使用哈希文件来实现它。 哈希文件是满足此类要求的最快文件结构(比数据库中使用的 B 树和网格等通用文件结构快得多)。 它还提供了可接受的流传输效率。

You can use Apache Derby (or JavaDB) which is bundled with JDK. However, if a DBMS doesn't provide the required speed you may implement a specific file structure yourself. If just exact key lookup is required, you may use a hash-file to implement it. The hash-file is the fastest file structure for such requirements (much faster than general purpose file structures such as B-Trees and grids which are used in DBs). It also provides acceptable streaming efficiency.

断桥再见 2024-08-09 00:54:43

最后,我决定将数据记录到磁盘,并将其保存在内存中,以便我可以更新它。 一段时间后,我将数据写入磁盘并删除日志。

In the end I decided to log the data to disk as it comes in, and also keep it in memory where I can update it. After a period of time I write the data out to disk and delete the log.

总以为 2024-08-09 00:54:43

您看过 Oracle 的“TimesTen”数据库吗? 它是一个内存数据库,应该具有非常高性能。 不知道成本/许可证等,但请查看 Oracle 网站并搜索它。 应该可以下载评估版。

Have you taken a look at Oracle's 'TimesTen' database? Its an in-memory db that is supposed to be very high-performance. Don't know about costs/licenses, etc, but take a look at Oracles site and search for it. Eval download should be available.

骑趴 2024-08-09 00:54:43

我还会查看是否有基于 EHCache 或 JCS 的现有内容可能有所帮助。

I'd also take a look to see if there's anything existing based on either EHCache or JCS that might help.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文