What is the fastest way to access a dataset in Java?

Posted 2024-11-01 21:07:09


I have a large file, with 1.8 million rows of data, that I need to be able to read for a machine learning program I'm writing. The data is currently in a CSV file but clearly I can put it in a database or other structure as required - it won't need to be updated regularly.

The code I'm using at the moment is below. I'm first importing the data to an array list and then I'm passing it to a table model. This is very slow, currently taking six minutes to execute just the first 10,000 rows which is not acceptable as I need to be able to test different algorithms against the data fairly often.

My program will only need to access each row of the data once, so there's no need to hold the whole dataset in RAM. Am I better off reading from a database, or is there a better way to read the CSV file line by line but do it much faster?

import java.io.File;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.Scanner;
import javax.swing.table.DefaultTableModel;
import javax.swing.table.TableModel;

public class CSVpaser {

public static TableModel parse(File f) throws FileNotFoundException {
    ArrayList<String> headers = new ArrayList<String>();
    ArrayList<String> oneDdata = new ArrayList<String>();
    //Get the headers of the table.
    Scanner lineScan = new Scanner(f);
    Scanner s = new Scanner(lineScan.nextLine());
    s.useDelimiter(",");
    while (s.hasNext()) {
        headers.add(s.next());
    }

    //Now go through each line of the table and add each cell to the array list
    while (lineScan.hasNextLine()) {
       s =  new Scanner(lineScan.nextLine());
       s.useDelimiter(", *");
       while (s.hasNext()) {
           oneDdata.add(s.next());
       }
    }
    String[][] data = new String[oneDdata.size()/headers.size()][headers.size()];
    int numberRows = oneDdata.size()/headers.size();

    // Move the data into a vanilla array so it can be put in a table.
    for (int x = 0; x < numberRows; x++) {
        for (int y = 0; y < headers.size(); y++) {
            data[x][y] = oneDdata.remove(0);
        }
    }

    // Create a table and return it
    return new DefaultTableModel(data, headers.toArray());


    }
}

Update:
Based on feedback I received in the answers I've rewritten the code; it's now running in 3 seconds rather than 6 minutes (for 10,000 rows), which means only ten minutes for the whole file... but any further suggestions for how to speed it up would be appreciated:

       //load data file
    File f = new File("data/primary_training_short.csv");
    Scanner lineScan = new Scanner(f);
    Scanner s = new Scanner(lineScan.nextLine());
    s.useDelimiter(",");

    //now go through each line of the results
    while (lineScan.hasNextLine()) {
       s =  new Scanner(lineScan.nextLine());
       s.useDelimiter(", *");
       String[] data = new String[NUM_COLUMNS];

       //get the data out of the CSV file so I can access it
       int x = 0;
       while (s.hasNext()) {
           data[x] = (s.next());
           x++;
       }
       //insert code here which is executed for each line
   }


Comments (4)

无言温柔 2024-11-08 21:07:09

data[x][y] = oneDdata.remove(0);

That would be very inefficient. Every time you remove the first entry from the ArrayList, all the other entries need to be shifted down.
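
For comparison, reading by index instead of removing leaves the list untouched and keeps every lookup O(1); the copy loop from the question would become something like:

    // Same copy loop, but indexed access avoids shifting the whole list on every cell
    for (int x = 0; x < numberRows; x++) {
        for (int y = 0; y < headers.size(); y++) {
            data[x][y] = oneDdata.get(x * headers.size() + y);
        }
    }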

At a minimum you would want to create a custom TableModel so you don't have to copy the data twice.

If you want to keep the data in a database then search the net for a ResultSet TableModel.

If you want to keep it in CSV format then you can use the ArrayList as the data store for the TableModel. So your Scanner code would read the data directly into the ArrayList. See List Table Model for one such solution. Or you might want to use the Bean Table Model.
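
As a rough illustration of that idea (this is not the linked List Table Model, just a minimal sketch with hypothetical names): back an AbstractTableModel directly with the rows the Scanner produced, so the data is never copied into a second structure.

import java.util.List;
import javax.swing.table.AbstractTableModel;

// Minimal read-only TableModel backed directly by the parsed rows.
public class CsvTableModel extends AbstractTableModel {
    private final List<String> headers;
    private final List<String[]> rows;

    public CsvTableModel(List<String> headers, List<String[]> rows) {
        this.headers = headers;
        this.rows = rows;
    }

    @Override
    public int getRowCount() { return rows.size(); }

    @Override
    public int getColumnCount() { return headers.size(); }

    @Override
    public String getColumnName(int column) { return headers.get(column); }

    @Override
    public Object getValueAt(int rowIndex, int columnIndex) {
        return rows.get(rowIndex)[columnIndex];
    }
}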

Of course the real question is who is going to have time to browse through all 1.8M records? So you really should use a database and have query logic to filter the rows that are returned from the database.

My program will only need to access each row of the data once, so there's no need to hold the whole dataset in RAM

So why are you displaying it in a JTable? That implies the entire dataset will be held in memory.

浸婚纱 2024-11-08 21:07:09


SQLite is a very lightweight, file-based DB and, in my opinion, the best solution for your problem.

Check out this very good driver for Java. I use it for one of my NLP projects and it works really well.
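
For example, assuming the widely used sqlite-jdbc driver is on the classpath and the CSV has already been imported into a table (the database file and table names below are placeholders), streaming the rows back out looks roughly like this:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SqliteScan {
    public static void main(String[] args) throws Exception {
        // Assumes an SQLite file "training.db" with a table "training"
        // already populated from the CSV; recent sqlite-jdbc versions
        // register the driver automatically.
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:training.db");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT * FROM training")) {
            while (rs.next()) {
                String firstColumn = rs.getString(1);   // JDBC columns are 1-based
                // run the learning algorithm on this row here
            }
        }
    }
}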

深白境迁sunset 2024-11-08 21:07:09


This is what I understood: your requirement is to run some algorithm over the loaded data, and to do so at runtime, i.e.

  • Load a set of data
  • Perform some calculation
  • Load another set of data
  • Perform more calculation, and so on until we reach the end of the CSV

Since there is no correlation between the two sets of data, and the algorithm/calculation you're running on the data is custom logic (for which there is no built-in function in SQL), you can do this in Java without using any database at all, and this should be the fastest approach.
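
A minimal sketch of that pure-Java approach, reading the file line by line with BufferedReader and String.split so only one row is held in memory at a time (the file name is the one from the question's update):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class StreamingCsv {
    public static void main(String[] args) throws IOException {
        try (BufferedReader reader = new BufferedReader(
                new FileReader("data/primary_training_short.csv"))) {
            String header = reader.readLine();   // skip the header row
            String line;
            while ((line = reader.readLine()) != null) {
                String[] data = line.split(", *");   // same delimiter as the question's Scanner
                // perform the calculation for this row here
            }
        }
    }
}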

However, if the logic/calculation you're performing on the two sets of data has an equivalent function in SQL, and there is a separate database running on good hardware (i.e., more memory/CPU), executing the whole logic through a procedure/function in SQL could perform better.

一身仙ぐ女味 2024-11-08 21:07:09


You can use the opencsv package; its CSVReader can iterate over large CSV files. You should also use online learning methods such as NaiveBayes or LinearRegression for such large data.
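
A rough sketch of that iteration with opencsv (class and method names as of opencsv 5.x; the file path is the one from the question):

import java.io.FileReader;
import com.opencsv.CSVReader;

public class OpenCsvScan {
    public static void main(String[] args) throws Exception {
        // Streams the file row by row; only one record is in memory at a time.
        try (CSVReader reader = new CSVReader(new FileReader("data/primary_training_short.csv"))) {
            String[] row;
            while ((row = reader.readNext()) != null) {
                // feed this row to an online learner (e.g. incremental Naive Bayes) here
            }
        }
    }
}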
