Performance of mathematical calculations on a text file in Java
I'm taking in a text file with around 60,000 lines of point coordinates (I expect to scale up soon), computing the Mahalanobis distance from each point to every other point, and writing the results to a text file. That means my output will be nearly 3,600,000,000 lines long. My program writes around 60,000 lines every one or two seconds.
Am I correct in thinking that my code cannot be multithreaded? Is there a better way to code this algorithm? How do people deal with processes like these?
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class Coord {
    public int a, b, c, d, e, f;

    public static void main(String[] args) throws IOException {
        PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter("/Users/evanlivingston/2a.txt", true)));
        Scanner sc = new Scanner(new File("/Users/evanlivingston/1.txt"));
        List<Coord> coords = new ArrayList<Coord>();

        // for each line in the file
        while (sc.hasNextLine()) {
            String[] numstrs = sc.nextLine().split("\\s+");
            Coord c = new Coord();
            c.a = Integer.parseInt(numstrs[1]);
            c.b = Integer.parseInt(numstrs[2]);
            c.c = Integer.parseInt(numstrs[3]);
            c.d = Integer.parseInt(numstrs[4]);
            c.e = Integer.parseInt(numstrs[5]);
            c.f = Integer.parseInt(numstrs[6]);
            coords.add(c);
        }

        // now you have all coords in memory
        int counter = 0;
        for (int i = 0; i < coords.size(); i++)
            for (int j = 0; j < coords.size(); j++, counter++) {
                Coord c1 = coords.get(i);
                Coord c2 = coords.get(j);
                double foo = ((c1.a - c2.a) * (c1.a - c2.a)) * 1;
                double goo = ((c1.b - c2.b) * (c1.b - c2.b)) * 1;
                double hoo = ((c1.c - c2.c) * (c1.c - c2.c)) * 2;
                double joo = ((c1.d - c2.d) * (c1.d - c2.d)) * 2;
                double koo = ((c1.e - c2.e) * (c1.e - c2.e)) * 4;
                double loo = ((c1.f - c2.f) * (c1.f - c2.f)) * 4;
                double zoo = Math.sqrt(foo + goo + hoo + joo + koo + loo);
                out.println(counter + "; " + i + " " + j + " " + zoo);
                System.out.println(counter + "; " + i + " " + j + " " + zoo);
            }
        out.flush();
        out.close();
    }
}
My input file looks like
0 0 0 0 0 0 0
1 0 0 0 0 0 1
....
59318 12 2 12 2 12 2
The first number is a placeholder. The file lists all combinations with replacement, with each column limited to the maximum value shown in the last line.
It now seems that the calculations will take about 16 hours, which still seems too long. Not to mention that I estimate the final text output at around 120 GB.
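As a rough check on these estimates, the number of output lines can be worked out directly. This is only a sketch: it assumes the input has 59,319 rows (indices 0 to 59318, as in the sample above), and the class and method names are illustrative.

```java
class OutputSizeEstimate {
    // Number of ordered (i, j) pairs, including i == j, for n points.
    static long orderedPairs(long n) {
        return n * n;
    }

    // Number of unordered pairs with i < j; enough when d(i, j) == d(j, i).
    static long unorderedPairs(long n) {
        return n * (n - 1) / 2;
    }

    public static void main(String[] args) {
        long n = 59_319; // assumption: rows 0..59318 as in the sample input
        System.out.println("ordered pairs:   " + orderedPairs(n));
        System.out.println("unordered pairs: " + unorderedPairs(n));
        System.out.println("fits in an int:  " + (orderedPairs(n) <= Integer.MAX_VALUE));
    }
}
```

Under that assumption the ordered-pair count is larger than Integer.MAX_VALUE, so an int line counter would wrap around before the run finishes. Because this distance is symmetric, writing only the pairs with i < j would cut the output roughly in half.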
Comments (3)
Your code is very inefficient. You re-read the file a second time for every line (!!!) in the file. Disk I/O is very slow.
What you should do is load the file into a parsed in-memory structure (an array of doubles) and then run a nested loop over it.
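A minimal sketch of that approach, not the answerer's actual code: the file is read once, parsed into an array, and the nested loop then touches only memory. It uses int[] rows because the sample input is integral (a double[][] would work the same way). The class name, the method names, and the use of args[0] and args[1] as input and output paths are assumptions made for illustration.

```java
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class InMemoryDistances {
    // Weights taken from the question's code: 1, 1, 2, 2, 4, 4.
    static final int[] WEIGHTS = {1, 1, 2, 2, 4, 4};

    // Parses "<placeholder> a b c d e f" lines into one int[6] per point.
    static int[][] parse(List<String> lines) {
        List<int[]> rows = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) continue; // tolerate blank lines
            String[] parts = trimmed.split("\\s+");
            int[] p = new int[6];
            for (int k = 0; k < 6; k++) {
                p[k] = Integer.parseInt(parts[k + 1]); // skip the placeholder column
            }
            rows.add(p);
        }
        return rows.toArray(new int[0][]);
    }

    // Weighted Euclidean distance, same formula as in the question.
    static double distance(int[] p, int[] q) {
        long sum = 0;
        for (int k = 0; k < WEIGHTS.length; k++) {
            long d = p[k] - q[k];
            sum += WEIGHTS[k] * d * d;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) throws IOException {
        // args[0]: hypothetical input path, args[1]: hypothetical output path.
        int[][] pts = parse(Files.readAllLines(Paths.get(args[0])));
        try (PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter(args[1])))) {
            long counter = 0; // long, since n * n can exceed Integer.MAX_VALUE
            for (int i = 0; i < pts.length; i++) {
                for (int j = 0; j < pts.length; j++, counter++) {
                    out.println(counter + "; " + i + " " + j + " " + distance(pts[i], pts[j]));
                }
            }
        }
    }
}
```

The sketch writes only to the file; echoing every line to the console as well would add a second slow output path to the inner loop.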
You're incorrect: this task would benefit a lot from threading. But your first priority is to get rid of the repetitive I/O. I'd guess the performance would be good enough then.
UPDATE to UPDATE
I rewrote your class to use multiple threads (4 by default). Downside: lines in the output file are not written in order, though you can sort the file after the computation with the Unix sort utility if needed. Both A->B and B->A are still calculated, as I couldn't come up with a simple way to store the result of A->B short of using 64-bit Java and installing some 64 GB of RAM.
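The rewritten class itself is not reproduced here. The following is a hedged sketch of one way such a multithreaded version could look, not the answerer's code: each worker takes every threads-th row, formats the whole row into one string, and appends it to a shared writer under a lock, which is why the output order is not deterministic. The class name ParallelDistances and its methods are illustrative assumptions.

```java
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelDistances {
    // Weights taken from the question's code: 1, 1, 2, 2, 4, 4.
    static final int[] WEIGHTS = {1, 1, 2, 2, 4, 4};

    static double distance(int[] p, int[] q) {
        long sum = 0;
        for (int k = 0; k < WEIGHTS.length; k++) {
            long d = p[k] - q[k];
            sum += WEIGHTS[k] * d * d;
        }
        return Math.sqrt(sum);
    }

    // Formats all lines of row i at once so the shared writer is locked once per row.
    static String formatRow(int[][] pts, int i) {
        StringBuilder sb = new StringBuilder();
        for (int j = 0; j < pts.length; j++) {
            sb.append(i).append(' ').append(j).append(' ')
              .append(distance(pts[i], pts[j])).append('\n');
        }
        return sb.toString();
    }

    static void writeAll(int[][] pts, int threads, Writer out) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                final int first = t;
                pending.add(pool.submit(() -> {
                    // Worker t handles rows t, t + threads, t + 2 * threads, ...
                    for (int i = first; i < pts.length; i += threads) {
                        String block = formatRow(pts, i);
                        synchronized (out) {
                            out.write(block);
                        }
                    }
                    return null; // makes the lambda a Callable, so write() may throw
                }));
            }
            for (Future<?> f : pending) {
                f.get(); // waits for the worker and rethrows any failure
            }
        } finally {
            pool.shutdown();
        }
    }
}
```

A call such as ParallelDistances.writeAll(pts, 4, writer) matches the default of four threads mentioned above. Like the answer, this sketch still computes both A->B and B->A.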
You are doing lots of repetitive I/O, which is very expensive: orders of magnitude more expensive than any of the calculations you are doing.
Also, your problem domain fits a map/reduce scenario very nicely. That makes it not only easy to multithread; you should be able to distribute the calculations over multiple machines as well.
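To illustrate the map/reduce shape on a single machine, here is a small sketch using Java parallel streams. The reduce step is a hypothetical aggregate chosen only for the example (counting pairs whose distance is at most a threshold); the question does not say what the output is used for. The class and method names are assumptions.

```java
import java.util.stream.IntStream;

class MapReduceSketch {
    // Weights taken from the question's code: 1, 1, 2, 2, 4, 4.
    static final int[] WEIGHTS = {1, 1, 2, 2, 4, 4};

    static double distance(int[] p, int[] q) {
        long sum = 0;
        for (int k = 0; k < WEIGHTS.length; k++) {
            long d = p[k] - q[k];
            sum += WEIGHTS[k] * d * d;
        }
        return Math.sqrt(sum);
    }

    // "Map": row i -> number of partners j > i within the threshold.
    static long countForRow(int[][] pts, int i, double threshold) {
        long count = 0;
        for (int j = i + 1; j < pts.length; j++) {
            if (distance(pts[i], pts[j]) <= threshold) {
                count++;
            }
        }
        return count;
    }

    // "Reduce": sum the per-row counts; parallel() spreads the rows across cores.
    static long countClosePairs(int[][] pts, double threshold) {
        return IntStream.range(0, pts.length)
                .parallel()
                .mapToLong(i -> countForRow(pts, i, threshold))
                .sum();
    }
}
```

Each row is independent of the others, which is the property that would also let the rows be split across several machines. If an aggregate like this is all that is needed, the full pair list never has to be written to disk.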
You are reading the file 1.txt too many times. Read it once and store it in an array of type int[][].
Also, try to increase the buffer size of the BufferedWriter instance.
Also, let the Scanner instance work on a BufferedInputStream with a proper character set.
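The two I/O suggestions could be set up roughly as follows. This is a sketch under assumptions: the 1 MiB buffer size and the UTF-8 charset are placeholder choices to be adjusted for the actual files, and the class and method names are made up for the example.

```java
import java.io.BufferedInputStream;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Scanner;

class BufferedIo {
    // Assumption: 1 MiB; the best size has to be found by measurement.
    static final int BUFFER_SIZE = 1 << 20;

    // Scanner on top of a buffered stream with an explicit charset name.
    static Scanner openScanner(File in) throws IOException {
        return new Scanner(new BufferedInputStream(new FileInputStream(in)), "UTF-8");
    }

    // Appending writer with a larger buffer than the BufferedWriter default.
    static PrintWriter openWriter(File out) throws IOException {
        return new PrintWriter(new BufferedWriter(new FileWriter(out, true), BUFFER_SIZE));
    }
}
```

Both objects should be closed when the work is done, for example with try-with-resources, so that the last buffered lines are flushed to disk.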