Loading a large matrix from a text file into Java arrays
My data is stored in large matrices stored in text files with millions of rows and 4 columns of comma-separated values. (Each column stores a different variable, and each row stores a different millisecond's data for all four variables.) There is also some irrelevant header data in the first dozen or so lines. I need to write Java code to load this data into four arrays, with one array for each column in the text matrix.
The Java code also needs to be able to tell when the header is done, so that the first data row can be split into entries for the 4 arrays. Finally, the Java code needs to iterate through the millions of data rows, repeating the process of decomposing each row into four numbers which are each entered into the appropriate array for the column in which the number was located.
How can I alter the code below in order to accomplish this? I want to find the fastest way to accomplish this processing of millions of rows.
Here is my code:
MainClass2.java
package packages;

public class MainClass2{
    public static void main(String[] args){
        readfile2 r = new readfile2();
        r.openFile();
        int x1Count = r.readFile();
        r.populateArray(x1Count);
        r.closeFile();
    }
}

readfile2.java
package packages;

import java.io.*;
import java.util.*;

public class readfile2 {
    private Scanner scan1;
    private Scanner scan2;

    public void openFile(){
        try{
            scan1 = new Scanner(new File("C:\\test\\samedatafile.txt"));
            scan1 = new Scanner(new File("C:\\test\\samedatafile.txt"));
        }
        catch(Exception e){
            System.out.println("could not find file");
        }
    }

    public int readFile(){
        int scan1Count = 0;
        while(scan1.hasNext()){
            scan1.next();
            scan1Count += 1;
        }
        return scan1Count;
    }

    public double[] populateArray(int scan1Count){
        double[] outputArray1 = new double[scan1Count];
        double[] outputArray2 = new double[scan1Count];
        double[] outputArray3 = new double[scan1Count];
        double[] outputArray4 = new double[scan1Count];
        int i = 0;
        while(scan2.hasNext()){
            //what code do I write here to:
            // 1.) identify the start of my time series rows after the end of the header rows (e.g. row starts with a number AT LEAST 4 digits in length.)
            // 2.) split each time series row's data into a separate new entry for each of the 4 output arrays
            i++;
        }
        return outputArray1, outputArray2, outputArray3, outputArray4;
    }

    public void closeFile(){
        scan1.close();
        scan2.close();
    }
}
Here are the first 19 lines of a typical data file:
text and numbers on first line
1 msec/sample
3 channels
ECG
Volts
Z_Hamming_0_05_LPF
Ohms
dz/dt
Volts
min,CH2,CH4,CH41,
,3087747,3087747,3087747,
0,-0.0518799,17.0624,0,
1.66667E-05,-0.0509644,17.0624,-0.00288295,
3.33333E-05,-0.0497437,17.0624,-0.00983428,
5E-05,-0.0482178,17.0624,-0.0161573,
6.66667E-05,-0.0466919,17.0624,-0.0204402,
8.33333E-05,-0.0448608,17.0624,-0.0213986,
0.0001,-0.0427246,17.0624,-0.0207532,
0.000116667,-0.0405884,17.0624,-0.0229672,
Edit
I tested Shilaghae's code suggestion. It seems to work. However, the length of all the resulting arrays is the same as x1Count, so that zeros remain in the places where Shilaghae's pattern matching code is not able to place a number. (This is a result of how I wrote the code originally.)
I was having trouble finding the indices where zeros remain, but there seemed to be a lot more zeros besides the ones expected where the header was. When I graphed the derivative of the temp[1] output, I saw a number of sharp spikes where false zeros in temp[1] might be. If I can tell where the zeros in temp[1], temp[2], and temp[3] are, I might be able to modify the pattern matching to better retain all the data.
Also, it would be nice to simply shorten the output array to no longer include the rows where the header was in the input file. However, the tutorials I have found regarding variable length arrays only show oversimplified examples like:
int[] anArray = {100, 200, 300, 400};
The code might run faster if it no longer uses scan1 to produce scan1Count. I do not want to slow the code down by using an inefficient method to produce a variable-length array. And I also do not want to skip data in my time series in the cases where the pattern matching is not able to split the input row into 4 numbers. I would rather keep the in-time-series zeros so that I can find them and use them to debug the pattern matching.
Can these things be done in fast-running code?
Second edit
So
"-{0,1}\\d+.\\d+,"
repeats four times in the expression:
"-{0,1}\\d+.\\d+,-{0,1}\\d+.\\d+,-{0,1}\\d+.\\d+,-{0,1}\\d+.\\d+,"
Does
"-{0,1}\\d+.\\d+,"
decompose into the following three statements:
"-{0,1}" means that a minus sign occurs zero or one times, while
"\\d+." means that the minus sign (or lack of minus sign) is followed by several digits of any value followed by a decimal point, so that finally
"\\d+," means that the decimal point is followed by several digits of any value?
If so, what about numbers in my data like "1.66667E-05," or "-8.06131E-05,"? I just scanned one of the input files, and (out of 3+ million 4-column rows) it contains 638 numbers that contain E, of which 5 were in the first column, and 633 were in the last column.
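A small check of that reading, as a sketch rather than a definitive diagnosis: the class and method names below (RegexCheck, matchesOriginal) are made up for illustration, and the sample rows are taken from the data listing above. Two things are worth noticing. The "." in the pattern is unescaped, so it matches any character, not only a decimal point. And every field is required to have digits on both sides of that character, so fields such as "0" or "5E-05" cannot match, and neither can "1.66667E-05". Double.parseDouble, by contrast, accepts scientific notation directly.

```java
import java.util.regex.Pattern;

public class RegexCheck {
    // The pattern quoted in the question: one field pattern repeated four times.
    static final Pattern ORIGINAL = Pattern.compile(
        "-{0,1}\\d+.\\d+,-{0,1}\\d+.\\d+,-{0,1}\\d+.\\d+,-{0,1}\\d+.\\d+,");

    /** True if the whole line matches the original four-field pattern. */
    public static boolean matchesOriginal(String line) {
        return ORIGINAL.matcher(line).matches();
    }

    public static void main(String[] args) {
        String[] rows = {
            "0.0001,-0.0427246,17.0624,-0.0207532,",        // plain decimals
            "1.66667E-05,-0.0509644,17.0624,-0.00288295,",  // scientific notation
            "0,-0.0518799,17.0624,0,"                       // fields without a decimal point
        };
        for (String row : rows) {
            System.out.println(matchesOriginal(row) + "  " + row);
        }
        // Double.parseDouble understands the E notation without any regex.
        System.out.println(Double.parseDouble("1.66667E-05"));
    }
}
```

Only the first sample row matches, which may account for some of the unexpected zeros described in the first edit.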
Comments (4)
You could read the file line by line and, for each line, use a regular expression (http://www.vogella.de/articles/JavaRegularExpressions/article.html) to check whether the line contains exactly 4 commas.
If the line contains exactly 4 commas, you can split it with String.split and fill the 4 arrays; otherwise, move on to the next line.
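A minimal sketch of that check, with one substitution stated plainly: it counts commas with a character loop instead of a regular expression, which is simply the shortest way to express "exactly 4 commas". The names CommaFilter, countCommas and splitIfFourCommas are hypothetical. Note that in the sample file the header rows "min,CH2,CH4,CH41," and ",3087747,3087747,3087747," also contain four commas, so the comma count alone would let them through and a numeric check is still needed afterwards.

```java
public class CommaFilter {
    /** Counts the commas in a line with a plain character loop. */
    public static int countCommas(String line) {
        int count = 0;
        for (int i = 0; i < line.length(); i++) {
            if (line.charAt(i) == ',') {
                count++;
            }
        }
        return count;
    }

    /** Splits a line that has exactly four commas; returns null for any other line. */
    public static String[] splitIfFourCommas(String line) {
        // String.split drops trailing empty strings, so "a,b,c,d," yields four fields.
        return countCommas(line) == 4 ? line.split(",") : null;
    }

    public static void main(String[] args) {
        String[] samples = {
            "1 msec/sample",
            "min,CH2,CH4,CH41,",
            "0.0001,-0.0427246,17.0624,-0.0207532,"
        };
        for (String s : samples) {
            String[] fields = splitIfFourCommas(s);
            System.out.println(s + " -> "
                + (fields == null ? "skipped" : fields.length + " fields"));
        }
    }
}
```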
You can split up each line using String.split().
To skip the headers, you can either read the first N lines and discard them (if you know how many there are) or you will need to look for a specific marker - difficult to advise without seeing your data.
You may also need to change your approach a little because you currently seem to be sizing the arrays according to the total number of lines (assuming your Scanner returns lines?) rather than omitting the count of header lines.
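On the array-sizing point, one option is to drop the counting pass entirely and let the arrays grow as rows arrive. The sketch below is an assumption about how that could look, not something taken from the question: DoubleColumn is a made-up helper that doubles its backing double[] when it fills up and trims to the exact length at the end. It stores primitives, so it avoids the boxing overhead of an ArrayList<Double>.

```java
import java.util.Arrays;

public class DoubleColumn {
    private double[] data = new double[1024]; // initial capacity is an arbitrary choice
    private int size = 0;

    /** Appends a value, doubling the backing array when it is full. */
    public void add(double value) {
        if (size == data.length) {
            data = Arrays.copyOf(data, data.length * 2);
        }
        data[size++] = value;
    }

    /** Number of values stored so far. */
    public int size() {
        return size;
    }

    /** Returns a copy trimmed to exactly the number of values added. */
    public double[] toArray() {
        return Arrays.copyOf(data, size);
    }

    public static void main(String[] args) {
        DoubleColumn col = new DoubleColumn();
        for (int i = 0; i < 3000; i++) {
            col.add(i * 0.5);
        }
        double[] out = col.toArray();
        System.out.println(out.length + " values, last = " + out[out.length - 1]);
    }
}
```

With one such column per variable, the file only has to be read once, and the header lines never take up slots in the output.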
I'd deal with the problem of the headers by simply attempting to parse every line as four numbers, and throwing away any lines where the parsing doesn't work. If there is a possibility of unparseable lines after the header lines, then you can set a flag the first time you get a "good" line, and then report any subsequent "bad" lines.
Split the lines with String.split(...). It is not the absolute fastest way to do it, but the CPU time of your program will be spent elsewhere, so it probably doesn't matter.
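The parse-and-discard idea could be sketched roughly as follows. This is an illustration under assumptions, not the code anyone in this thread actually ran: the names MatrixLoader, parseRow and load are invented, a BufferedReader is used in place of Scanner, and the arrays are grown on demand so no separate counting pass is needed. Lines that fail before the first good row are treated as header and dropped silently; failures after it are reported with their line number.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;

public class MatrixLoader {
    static final int COLS = 4;

    /** Returns the four parsed values, or null if the line is not a data row. */
    public static double[] parseRow(String line) {
        String[] parts = line.split(","); // a trailing comma produces no extra field
        if (parts.length != COLS) {
            return null;
        }
        double[] row = new double[COLS];
        try {
            for (int c = 0; c < COLS; c++) {
                row[c] = Double.parseDouble(parts[c].trim()); // accepts 5E-05 as well
            }
        } catch (NumberFormatException e) {
            return null; // e.g. "min,CH2,CH4,CH41," or an empty first field
        }
        return row;
    }

    /** Reads all lines and returns one array per column: result[column][row]. */
    public static double[][] load(BufferedReader in) {
        double[][] cols = new double[COLS][1024];
        int n = 0;
        int lineNo = 0;
        boolean seenGood = false; // set once the first data row has been parsed
        try {
            String line;
            while ((line = in.readLine()) != null) {
                lineNo++;
                double[] row = parseRow(line);
                if (row == null) {
                    if (seenGood) {
                        System.err.println("Unparseable line " + lineNo + ": " + line);
                    }
                    continue;
                }
                seenGood = true;
                if (n == cols[0].length) { // grow all four columns together
                    for (int c = 0; c < COLS; c++) {
                        cols[c] = Arrays.copyOf(cols[c], n * 2);
                    }
                }
                for (int c = 0; c < COLS; c++) {
                    cols[c][n] = row[c];
                }
                n++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        for (int c = 0; c < COLS; c++) {
            cols[c] = Arrays.copyOf(cols[c], n); // trim to the number of data rows
        }
        return cols;
    }

    public static void main(String[] args) {
        String sample = "3 channels\nmin,CH2,CH4,CH41,\n,3087747,3087747,3087747,\n"
            + "0,-0.0518799,17.0624,0,\n5E-05,-0.0482178,17.0624,-0.0161573,\n";
        double[][] m = load(new BufferedReader(new StringReader(sample)));
        System.out.println(m[0].length + " data rows loaded");
    }
}
```

For a real file, the reader passed to load would wrap a FileReader (or Files.newBufferedReader) instead of the StringReader used in this demo.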
(Solution moved to the answer space on behalf of the question author.)
The final code was very simple, and simply involved using string.split() with "," as the regular expression. To do that, I had to manually delete the headers from the input file so that the data only contained rows with 4 comma-separated numbers.
In case anyone is curious, the final working code for this is: