What is the fastest way to access a dataset in Java?

Posted 2024-11-01 21:07:09


I have a large file, with 1.8 million rows of data, that I need to be able to read for a machine learning program I'm writing. The data is currently in a CSV file but clearly I can put it in a database or other structure as required - it won't need to be updated regularly.

The code I'm using at the moment is below. I'm first importing the data to an array list and then I'm passing it to a table model. This is very slow, currently taking six minutes to execute just the first 10,000 rows which is not acceptable as I need to be able to test different algorithms against the data fairly often.

My program will only need to access each row of the data once, so there's no need to hold the whole dataset in RAM. Am I better off reading from a database, or is there a better way to read the CSV file line by line but do it much faster?

import java.io.File;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.Scanner;
import javax.swing.table.DefaultTableModel;
import javax.swing.table.TableModel;

public class CSVpaser {

public static TableModel parse(File f) throws FileNotFoundException {
    ArrayList<String> headers = new ArrayList<String>();
    ArrayList<String> oneDdata = new ArrayList<String>();
    //Get the headers of the table.
    Scanner lineScan = new Scanner(f);
    Scanner s = new Scanner(lineScan.nextLine());
    s.useDelimiter(",");
    while (s.hasNext()) {
        headers.add(s.next());
    }

    //Now go through each line of the table and add each cell to the array list
    while (lineScan.hasNextLine()) {
       s =  new Scanner(lineScan.nextLine());
       s.useDelimiter(", *");
       while (s.hasNext()) {
           oneDdata.add(s.next());
       }
    }
    String[][] data = new String[oneDdata.size()/headers.size()][headers.size()];
    int numberRows = oneDdata.size()/headers.size();

    // Move the data into a vanilla array so it can be put in a table.
    for (int x = 0; x < numberRows; x++) {
        for (int y = 0; y < headers.size(); y++) {
            data[x][y] = oneDdata.remove(0);
        }
    }

    // Create a table and return it
    return new DefaultTableModel(data, headers.toArray());


    }
}

Update:
Based on feedback I received in the answers I've rewritten the code; it's now running in 3 seconds rather than 6 minutes (for 10,000 rows), which means only ten minutes for the whole file... but any further suggestions for how to speed it up would be appreciated:

       //load data file
    File f = new File("data/primary_training_short.csv");
    Scanner lineScan = new Scanner(f);
    Scanner s = new Scanner(lineScan.nextLine());
    s.useDelimiter(",");

    //now go through each line of the results
    while (lineScan.hasNextLine()) {
       s =  new Scanner(lineScan.nextLine());
       s.useDelimiter(", *");
       String[] data = new String[NUM_COLUMNS];

       //get the data out of the CSV file so I can access it
       int x = 0;
       while (s.hasNext()) {
           data[x] = (s.next());
           x++;
       }
       //insert code here which is executed for each line
   }


Comments (4)

无言温柔 2024-11-08 21:07:09

data[x][y] = oneDdata.remove(0);

That would be very inefficient. Every time you remove the first entry from the ArrayList, all the other entries need to be shifted down.
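
For comparison, reading by index instead of removing leaves the list untouched and keeps every lookup O(1); the copy loop from the question would become something like:

    // Same copy loop, but indexed access avoids shifting the whole list on every cell
    for (int x = 0; x < numberRows; x++) {
        for (int y = 0; y < headers.size(); y++) {
            data[x][y] = oneDdata.get(x * headers.size() + y);
        }
    }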

At a minimum you would want to create a custom TableModel so you don't have to copy the data twice.

If you want to keep the data in a database then search the net for a ResultSet TableModel.

If you want to keep it in CSV format then you can use the ArrayList as the data store for the TableModel. So your Scanner code would read the data directly into the ArrayList. See List Table Model for one such solution. Or you might want to use the Bean Table Model.
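
As a rough illustration of that idea (this is not the linked List Table Model, just a minimal sketch with hypothetical names): back an AbstractTableModel directly with the rows the Scanner produced, so the data is never copied into a second structure.

import java.util.List;
import javax.swing.table.AbstractTableModel;

// Minimal read-only TableModel backed directly by the parsed rows.
public class CsvTableModel extends AbstractTableModel {
    private final List<String> headers;
    private final List<String[]> rows;

    public CsvTableModel(List<String> headers, List<String[]> rows) {
        this.headers = headers;
        this.rows = rows;
    }

    @Override
    public int getRowCount() { return rows.size(); }

    @Override
    public int getColumnCount() { return headers.size(); }

    @Override
    public String getColumnName(int column) { return headers.get(column); }

    @Override
    public Object getValueAt(int rowIndex, int columnIndex) {
        return rows.get(rowIndex)[columnIndex];
    }
}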

Of course the real question is who is going to have time to browse through all 1.8M records? So you really should use a database and have query logic to filter the rows that are returned from the database.

My program will only need to access each row of the data once, so there's no need to hold the whole dataset in RAM

So why are you displaying it in a JTable? That implies the entire dataset will be held in memory.

浸婚纱 2024-11-08 21:07:09


SQLite is a very lightweight, file-based DB and, in my opinion, the best solution for your problem.

Check out this very good driver for Java. I use it for one of my NLP projects and it works really well.
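
For example, assuming the widely used sqlite-jdbc driver is on the classpath and the CSV has already been imported into a table (the database file and table names below are placeholders), streaming the rows back out looks roughly like this:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SqliteScan {
    public static void main(String[] args) throws Exception {
        // Assumes an SQLite file "training.db" with a table "training"
        // already populated from the CSV; recent sqlite-jdbc versions
        // register the driver automatically.
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:training.db");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT * FROM training")) {
            while (rs.next()) {
                String firstColumn = rs.getString(1);   // JDBC columns are 1-based
                // run the learning algorithm on this row here
            }
        }
    }
}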

深白境迁sunset 2024-11-08 21:07:09


This is what I understood: your requirement is to run some algorithm over the loaded data, and to do so at runtime, i.e.

  • Load a set of data
  • Perform some calculation
  • Load another set of data
  • Perform more calculation, and so on until we reach the end of the CSV

Since there is no correlation between the two sets of data, and the algorithm/calculation you're running on the data is custom logic (for which there is no built-in function in SQL), you can do this in Java without using any database at all, and this should be the fastest approach.
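
A minimal sketch of that pure-Java approach, reading the file line by line with BufferedReader and String.split so only one row is held in memory at a time (the file name is the one from the question's update):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class StreamingCsv {
    public static void main(String[] args) throws IOException {
        try (BufferedReader reader = new BufferedReader(
                new FileReader("data/primary_training_short.csv"))) {
            String header = reader.readLine();   // skip the header row
            String line;
            while ((line = reader.readLine()) != null) {
                String[] data = line.split(", *");   // same delimiter as the question's Scanner
                // perform the calculation for this row here
            }
        }
    }
}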

However, if the logic/calculation you're performing on the two sets of data has an equivalent function in SQL, and there is a separate database running on good hardware (i.e., more memory/CPU), executing the whole logic through a procedure/function in SQL could perform better.

一身仙ぐ女味 2024-11-08 21:07:09


You can use the opencsv package; its CSVReader can iterate over large CSV files. You should also use online learning methods such as NaiveBayes or LinearRegression for such large data.
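
A rough sketch of that iteration with opencsv (class and method names as of opencsv 5.x; the file path is the one from the question):

import java.io.FileReader;
import com.opencsv.CSVReader;

public class OpenCsvScan {
    public static void main(String[] args) throws Exception {
        // Streams the file row by row; only one record is in memory at a time.
        try (CSVReader reader = new CSVReader(new FileReader("data/primary_training_short.csv"))) {
            String[] row;
            while ((row = reader.readNext()) != null) {
                // feed this row to an online learner (e.g. incremental Naive Bayes) here
            }
        }
    }
}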
