Java: performance of math calculations on a text file

Posted 2024-11-02 05:46:46

I'm reading a text file with around 60,000 lines of point coordinates (I expect to scale up soon), computing the Mahalanobis distance from each point to every other point, and writing the results to a text file. That means my output will be nearly 3,600,000,000 lines long. My program produces around 60,000 lines every one or two seconds.

Am I correct in thinking that my code cannot be multithreaded? Is there a better way to code this algorithm? How do people deal with processes like these?

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class Coord {
    public int a, b, c, d, e, f;

    public static void main(String[] args) throws IOException {
        PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter("/Users/evanlivingston/2a.txt", true)));
        Scanner sc = new Scanner(new File("/Users/evanlivingston/1.txt"));
        List<Coord> coords = new ArrayList<Coord>();

        // for each line in the file
        while (sc.hasNextLine()) {
            String[] numstrs = sc.nextLine().split("\\s+");

            Coord c = new Coord();

            c.a = Integer.parseInt(numstrs[1]);
            c.b = Integer.parseInt(numstrs[2]);
            c.c = Integer.parseInt(numstrs[3]);
            c.d = Integer.parseInt(numstrs[4]);
            c.e = Integer.parseInt(numstrs[5]);
            c.f = Integer.parseInt(numstrs[6]);

            coords.add(c);
        }

        // now you have all coords in memory
        int counter = 0;
        for (int i = 0; i < coords.size(); i++)
            for (int j = 0; j < coords.size(); j++, counter++) {
                Coord c1 = coords.get(i);
                Coord c2 = coords.get(j);
                double foo = ((c1.a - c2.a) * (c1.a - c2.a)) * 1;
                double goo = ((c1.b - c2.b) * (c1.b - c2.b)) * 1;
                double hoo = ((c1.c - c2.c) * (c1.c - c2.c)) * 2;
                double joo = ((c1.d - c2.d) * (c1.d - c2.d)) * 2;
                double koo = ((c1.e - c2.e) * (c1.e - c2.e)) * 4;
                double loo = ((c1.f - c2.f) * (c1.f - c2.f)) * 4;
                double zoo = Math.sqrt(foo + goo + hoo + joo + koo + loo);

                out.println(counter + "; " + i + " " + j + " " + zoo);
                System.out.println(counter + "; " + i + " " + j + " " + zoo);
            }
        out.flush();
        out.close();
    }
}

My input file looks like this:

0 0 0 0 0 0 0
1 0 0 0 0 0 1
....
59318 12 2 12 2 12 2

The first number is a placeholder. The file is a list of all combinations with replacement, with each column limited to the value you see in the last line.

At the current rate the calculations will take about 16 hours, which still seems too long. Not to mention that I estimate the final text output at around 120 GB.
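As a rough check of the figures in the question, the sketch below counts the ordered pairs for the 59,319 points implied by the sample input (indices 0 to 59318) and estimates the output size. The class name `OutputEstimate` and the average of 34 bytes per output line are assumptions made only for illustration. It also shows that the pair count no longer fits in an `int`, which matters for any `int` line counter.

```java
// Back-of-the-envelope estimate; names and the bytes-per-line figure are hypothetical.
class OutputEstimate {
    /** Number of ordered pairs (i, j), including i == j, for n points. */
    static long pairCount(int n) {
        return (long) n * n; // widen to long before multiplying to avoid int overflow
    }

    /** Estimated output size in bytes for an assumed average line length. */
    static long estimatedBytes(int n, int avgBytesPerLine) {
        return pairCount(n) * avgBytesPerLine;
    }

    public static void main(String[] args) {
        int n = 59_319; // indices 0..59318 in the sample input
        long pairs = pairCount(n);
        System.out.println("pairs = " + pairs);
        System.out.println("fits in int? " + (pairs <= Integer.MAX_VALUE));
        System.out.println("approx GB = " + estimatedBytes(n, 34) / 1e9);
    }
}
```

With these assumptions the estimate lands close to the 120 GB mentioned above, so the output volume is likely to be as much of a concern as the arithmetic itself.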

Comments (3)

硬不硬你别怂 2024-11-09 05:46:46

Your code is very inefficient. You re-read the file a second time for every line(!!!) in the file. Disk IO is very slow.

What you should do is load the file into a parsed in-memory structure (an array of doubles) and then run a nested loop over it.

"Am I correct in thinking my code is not able to be multithreaded?"

You're incorrect. This task would benefit a lot from threading. But your first priority is to get rid of the repetitive IO. I'd guess the performance would be good enough then.

UPDATE to UPDATE

I rewrote your class to use multiple threads (4 by default). Downside: lines in the output file are not written in order, though you can sort the file after the computation with the unix sort utility if needed. Both A->B and B->A are still calculated, as I couldn't come up with a simple way to store the result of A->B short of using 64-bit Java and installing some 64 GB of RAM.

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class Coord {
    public int a, b, c, d, e, f;

    // Worker that computes the distances for rows [start, end).
    private static class CoordsThread extends Thread {
        private final int start;
        private final int end;
        private final List<Coord> coords;
        private final PrintWriter out;

        public CoordsThread(int start, int end, List<Coord> list, PrintWriter out) {
            this.start = start;
            // last block can be shorter
            this.end = Math.min(end, list.size());
            this.coords = list;
            this.out = out;
        }

        public void run() {
            System.out.println("started thread " + getName() + " for [" + start + ";" + end + ")");
            for (int i = start; i < end; i++) {
                Coord c1 = coords.get(i);
                for (int j = 0; j < coords.size(); j++) {
                    Coord c2 = coords.get(j);
                    double foo = ((c1.a - c2.a) * (c1.a - c2.a)) * 1;
                    double goo = ((c1.b - c2.b) * (c1.b - c2.b)) * 1;
                    double hoo = ((c1.c - c2.c) * (c1.c - c2.c)) * 2;
                    double joo = ((c1.d - c2.d) * (c1.d - c2.d)) * 2;
                    double koo = ((c1.e - c2.e) * (c1.e - c2.e)) * 4;
                    double loo = ((c1.f - c2.f) * (c1.f - c2.f)) * 4;
                    double zoo = Math.sqrt(foo + goo + hoo + joo + koo + loo);

                    // long arithmetic: i * size overflows int for ~60,000 points
                    long index = (long) i * coords.size() + j;
                    // the writer is shared by all threads, so serialize access to it
                    synchronized (out) {
                        out.println(index + "; " + i + " " + j + " " + zoo);
                    }
                }
            }
            System.out.println("completed thread " + getName());
        }
    }

    public static void main(String[] args) throws Exception {
        PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter("2.txt")));
        Scanner sc = new Scanner(new File("1.txt"));
        List<Coord> coords = new ArrayList<Coord>();

        // read the whole file once; column 0 is a placeholder and is skipped
        while (sc.hasNextLine()) {
            String[] numstrs = sc.nextLine().split("\\s+");

            Coord c = new Coord();

            c.a = Integer.parseInt(numstrs[1]);
            c.b = Integer.parseInt(numstrs[2]);
            c.c = Integer.parseInt(numstrs[3]);
            c.d = Integer.parseInt(numstrs[4]);
            c.e = Integer.parseInt(numstrs[5]);
            c.f = Integer.parseInt(numstrs[6]);

            coords.add(c);
        }
        sc.close();

        System.out.println("total lines read: " + coords.size());

        int threadsCount = 4;
        List<Thread> ths = new ArrayList<Thread>();

        // split the rows into contiguous blocks, one per thread
        int blockSize = coords.size() / threadsCount + 1;
        for (int i = 0; i < threadsCount; ++i) {
            CoordsThread ct = new CoordsThread(i * blockSize, (i + 1) * blockSize, coords, out);
            ct.setName("Block" + i);
            ths.add(ct);
        }

        for (Thread th : ths) {
            th.start();
        }

        // wait for all workers before closing the output
        for (Thread th : ths) {
            th.join();
        }

        out.flush();
        out.close();
    }
}
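
On the A->B / B->A remark: because this distance is symmetric and the distance from a point to itself is 0, one possible way to avoid the duplicate work without storing anything is to emit each unordered pair once (j > i). The sketch below is an illustration under that assumption, not the answerer's method; the class and method names (`HalfMatrix`, `writeUpperTriangle`) are hypothetical, and the weights are copied from the code above.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

// Hypothetical sketch: compute each unordered pair (i < j) exactly once.
class HalfMatrix {
    // per-dimension weights taken from the code above
    static final int[] W = {1, 1, 2, 2, 4, 4};

    /** Weighted distance between two 6-dimensional points. */
    static double distance(int[] p, int[] q) {
        long sum = 0;
        for (int k = 0; k < W.length; k++) {
            long d = p[k] - q[k];
            sum += W[k] * d * d;
        }
        return Math.sqrt(sum);
    }

    /** Writes one line per pair i < j and returns the number of lines written. */
    static long writeUpperTriangle(int[][] pts, PrintWriter out) {
        long lines = 0;
        for (int i = 0; i < pts.length; i++) {
            for (int j = i + 1; j < pts.length; j++) {
                out.println(i + " " + j + " " + distance(pts[i], pts[j]));
                lines++;
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        int[][] pts = {{0, 0, 0, 0, 0, 0}, {0, 0, 0, 0, 0, 1}, {12, 2, 12, 2, 12, 2}};
        StringWriter sw = new StringWriter();
        PrintWriter out = new PrintWriter(sw);
        long lines = writeUpperTriangle(pts, out);
        out.flush();
        System.out.print(sw);
        System.out.println(lines + " lines");
    }
}
```

This writes n(n-1)/2 lines instead of n², roughly halving both the computation and the output. Whether a consumer of the file can work with only one line per pair is something only the asker can decide.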
夜清冷一曲。 2024-11-09 05:46:46

You are doing lots of repetitive IO, which is very expensive: orders of magnitude more expensive than any of the calculations you are doing.

Also, your problem fits a map/reduce scenario very nicely. That makes it not only easy to multithread; you should also be able to distribute the calculations over multiple machines.
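
To make the map/reduce shape concrete on a single machine, here is a minimal sketch using Java parallel streams. The "map" step computes a per-row result and the "reduce" step combines them. What the asker actually wants to reduce to is not stated, so the choice of "largest pairwise distance" is purely an assumption for illustration, as are the names `MapReduceSketch`, `rowMax` and `maxDistance`.

```java
import java.util.stream.IntStream;

// Hypothetical sketch of a map/reduce-style computation over all pairs.
class MapReduceSketch {
    // per-dimension weights as used in the question's code
    static final int[] W = {1, 1, 2, 2, 4, 4};

    /** Weighted distance between two 6-dimensional points. */
    static double distance(int[] p, int[] q) {
        long sum = 0;
        for (int k = 0; k < W.length; k++) {
            long d = p[k] - q[k];
            sum += W[k] * d * d;
        }
        return Math.sqrt(sum);
    }

    /** Map step: largest distance from point i to any later point. */
    static double rowMax(int[][] pts, int i) {
        double best = 0.0;
        for (int j = i + 1; j < pts.length; j++) {
            best = Math.max(best, distance(pts[i], pts[j]));
        }
        return best;
    }

    /** Reduce step: combine the per-row results computed on worker threads. */
    static double maxDistance(int[][] pts) {
        return IntStream.range(0, pts.length)
                .parallel()
                .mapToDouble(i -> rowMax(pts, i))
                .max()
                .orElse(0.0);
    }

    public static void main(String[] args) {
        int[][] pts = {{0, 0, 0, 0, 0, 0}, {0, 0, 0, 0, 0, 1}, {12, 2, 12, 2, 12, 2}};
        System.out.println("max distance = " + maxDistance(pts));
    }
}
```

Each row is independent of the others, which is what makes the work easy to split across threads or machines. If the downstream use only needs a summary like this, reducing in memory also avoids writing the full pair list to disk.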

猛虎独行 2024-11-09 05:46:46

You are reading the file 1.txt too many times. Read it once and store it in an array of type int[][].

Also, try increasing the buffer size of the BufferedWriter instance.

Also, let the Scanner instance work on a BufferedInputStream with a proper character set.
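
The three suggestions above could be combined roughly as follows. This is a sketch under assumptions: the class and method names (`FastIo`, `readPoints`, `openOutput`), the 1 MiB buffer size and the UTF-8 charset are illustrative choices, not values given in the thread.

```java
import java.io.BufferedInputStream;
import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical helper: read the input once, write through a large buffer.
class FastIo {
    /** Reads every line once into an int[][]; column 0 (the placeholder) is skipped. */
    static int[][] readPoints(InputStream in) {
        List<int[]> rows = new ArrayList<>();
        // charset is an assumption; use whatever the file is actually encoded in
        try (Scanner sc = new Scanner(new BufferedInputStream(in), "UTF-8")) {
            while (sc.hasNextLine()) {
                String line = sc.nextLine().trim();
                if (line.isEmpty()) continue; // tolerate blank lines
                String[] t = line.split("\\s+");
                int[] p = new int[6];
                for (int k = 0; k < 6; k++) {
                    p[k] = Integer.parseInt(t[k + 1]);
                }
                rows.add(p);
            }
        }
        return rows.toArray(new int[0][]);
    }

    /** Opens the output with a 1 MiB buffer instead of the default size. */
    static PrintWriter openOutput(String path) throws IOException {
        return new PrintWriter(new BufferedWriter(new FileWriter(path), 1 << 20));
    }

    public static void main(String[] args) throws IOException {
        int[][] pts;
        try (InputStream in = new FileInputStream("1.txt")) {
            pts = readPoints(in);
        }
        try (PrintWriter out = openOutput("2.txt")) {
            out.println(pts.length + " points loaded");
        }
    }
}
```

A plain `int[][]` keeps the coordinates in contiguous rows and avoids the per-element `List.get` calls in the inner loop. Whether the larger write buffer helps measurably depends on the disk and the OS, so it is worth timing both settings.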
