Pivoting large data files

Posted on 2024-12-21 17:36:14

I have some large tab-delimited data files. These files have a few orders of magnitude more rows than columns. I would like to pivot these files, but in this case "large" means too big to do this in memory.

I am hoping for suggestions on the fastest way of doing this. I work primarily in Java on UNIX, although I would also be open to a faster language-specific solution (or something using awk, etc.).

Currently we do this in memory, but the files have grown over time and now exceed our memory capacity. Obviously "buy a larger machine" is a solution, but that is not in the cards at the moment.


Comments (2)

唔猫 2024-12-28 17:36:14

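Something like the code below may work for you. It first opens the source file with a BufferedReader, then reads the first line and splits it on \t.

The length of the resulting array is the number of lines in the destination file. An array of FileHolder objects is then created, where a FileHolder basically holds a file descriptor and a ByteBuffer used as a cache (so as not to issue a write for each and every word). Once all holders are created, the first line is written to them.

The rest of the source file is then read line by line. Each line is split the same way and each element is appended to the corresponding file holder, until the input is exhausted.

When done, the destination file is created (at last) and all FileHolder instances are written to it in array order, and therefore in line order.

Here is the sample code (long; the original post also linked to a copy). It can certainly be improved (resources are not really closed in the right place, etc.), but it works. It transposes a 275 MB file in around 25 seconds here (quad-core Q6600, 4 GB RAM, x86_64 Linux 3.1.2-rc5), and runs with the "flimsy" 64 MB default of Sun's 64-bit JDK: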

package net.sf.jpam;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.Reader;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.util.regex.Pattern;

public final class Test
{
    private static final Pattern TAB = Pattern.compile("\t");

    private static class FileHolder
    {
        private static final byte[] TABCHAR = "\t".getBytes();
        // Size of the write buffer, in bytes
        private static final int BUFSZ = 32768;

        // Format string for a file
        private static final String FORMAT = "/home/fge/t2.txt.%d";

        // The ByteBuffer
        private final ByteBuffer buf = ByteBuffer.allocate(BUFSZ);

        // The File object
        private final File fd;

        // RandomAccessFile
        private final RandomAccessFile file;

        FileHolder(final int index)
            throws FileNotFoundException
        {
            final String name = String.format(FORMAT, index);
            fd = new File(name);
            file = new RandomAccessFile(fd, "rw");
        }

        public void write(final String s)
            throws IOException
        {
            final byte[] b = s.getBytes();
            if (buf.remaining() < b.length + TABCHAR.length)
                flush();
            buf.put(b).put(TABCHAR);
        }

        private void flush()
            throws IOException
        {
            file.write(buf.array(), 0, buf.position());
            buf.position(0);
        }

        public void copyTo(final RandomAccessFile dst)
            throws IOException
        {
            flush();
            final FileChannel source = file.getChannel();
            final FileChannel destination = dst.getChannel();
            source.force(false);
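            // Drop the trailing tab left by the last write()
            final long len = source.size() - TABCHAR.length;

            // transferTo() may copy fewer bytes than requested, so loop
            long pos = 0;
            while (pos < len)
                pos += source.transferTo(pos, len - pos, destination);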
            dst.writeBytes("\n");

        }

        public void tearDown()
            throws IOException
        {
            file.close();
            if (!fd.delete())
                System.err.println("Failed to remove file " + fd);
        }

        @Override
        public String toString()
        {
            return fd.toString();
        }
    }

    public static void main(final String... args)
        throws IOException
    {
        long before, after;

        before = System.currentTimeMillis();
        final Reader r = new FileReader("/home/fge/t.txt");
        final BufferedReader reader = new BufferedReader(r);

        /*
         * Read first line, count the number of elements. All elements are
         * separated by a single tab. The -1 limit keeps trailing empty
         * elements, so columns stay aligned when a line ends with tabs.
         */
        String line = reader.readLine();
        String[] elements = TAB.split(line, -1);

        final int nrLines = elements.length;
        final FileHolder[] files = new FileHolder[nrLines];

        /*
         * Initialize file descriptors
         */
        for (int i = 0; i < nrLines; i++)
            files[i] = new FileHolder(i);


        /*
         * Write first lines, then all others
         */
        writeOneLine(elements, files);

        while ((line = reader.readLine()) != null) {
            elements = TAB.split(line, -1);
            writeOneLine(elements, files);
        }

        reader.close();
        r.close();
        after = System.currentTimeMillis();

        System.out.println("Read time: " + (after - before));

        before = System.currentTimeMillis();
        final RandomAccessFile out = new RandomAccessFile("/home/fge/t2.txt",
            "rw");

        for (final FileHolder file: files) {
            file.copyTo(out);
            file.tearDown();
        }

        out.getChannel().force(false);
        out.close();

        after = System.currentTimeMillis();

        System.out.println("Write time: " + (after - before));
        System.exit(0);
    }

    private static void writeOneLine(final String[] elements,
        final FileHolder[] fdArray)
        throws IOException
    {  
        final int len = elements.length;
        String element;
        FileHolder file;

        for (int index = 0; index < len; index++) {
            element = elements[index];
            file = fdArray[index];
            file.write(element);
        }
    }
}
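
The answer notes that resources are not closed in the right place. As a rough sketch of the same one-temporary-file-per-column idea with tidier cleanup, something like the following could work. It is not the answer author's code: the class name ColumnTransposer, the method transpose, the use of Files.createTempFile for the column files, and the UTF-8 encoding are all assumptions made for illustration. It also keeps one open writer per source column, which only suits inputs with few columns, as described in the question.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original answer.
final class ColumnTransposer {

    /** Transposes a tab-separated file using one temporary file per source column. */
    static void transpose(Path src, Path dst) throws IOException {
        List<Path> parts = new ArrayList<>();
        List<BufferedWriter> writers = new ArrayList<>();
        try {
            try (BufferedReader in = Files.newBufferedReader(src, StandardCharsets.UTF_8)) {
                boolean firstRow = true;
                String line;
                while ((line = in.readLine()) != null) {
                    // Limit -1 keeps trailing empty fields so columns stay aligned.
                    String[] cells = line.split("\t", -1);
                    if (firstRow) {
                        // The first row tells us how many column files are needed.
                        for (int i = 0; i < cells.length; i++) {
                            Path part = Files.createTempFile("transpose-", ".part");
                            parts.add(part);
                            writers.add(Files.newBufferedWriter(part, StandardCharsets.UTF_8));
                        }
                    } else if (cells.length != writers.size()) {
                        throw new IOException("Ragged row: expected " + writers.size()
                                + " fields, got " + cells.length);
                    }
                    for (int i = 0; i < cells.length; i++) {
                        BufferedWriter w = writers.get(i);
                        if (!firstRow) {
                            w.write('\t'); // separator goes before every value but the first
                        }
                        w.write(cells[i]);
                    }
                    firstRow = false;
                }
            }
            // Flush and close the column files before reading them back.
            for (BufferedWriter w : writers) {
                w.close();
            }
            // Concatenate the column files: each one becomes a line of the output.
            try (OutputStream out = Files.newOutputStream(dst)) {
                for (Path part : parts) {
                    Files.copy(part, out);
                    out.write('\n');
                }
            }
        } finally {
            // Closing an already closed writer has no effect, so this is safe on success too.
            for (BufferedWriter w : writers) {
                try {
                    w.close();
                } catch (IOException ignored) {
                    // best effort: the temporary file is deleted below anyway
                }
            }
            for (Path part : parts) {
                Files.deleteIfExists(part);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java ColumnTransposer <source.tsv> <destination.tsv>
        transpose(Paths.get(args[0]), Paths.get(args[1]));
    }
}
```

Writing the separator before each value (rather than after it) means no trailing tab has to be trimmed when the column files are concatenated.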

猥︴琐丶欲为 2024-12-28 17:36:14

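@fge:
1) It is better to use a CharBuffer than to instantiate a lot of Strings.

2) It is better to reuse a single Matcher for pattern matching, like this:

Initially:

private Matcher matcher;
Pattern regexPattern = Pattern.compile(pattern);
matcher = regexPattern.matcher("");

Then, to match against each new input:

matcher.reset(charBuffer).find()

The reason is that, when you look inside Pattern, every call to matcher() allocates a new Matcher:

Pattern.matcher(CharSequence input) {
    Matcher m = new Matcher(this, input);
    return m;
}

Try to avoid writing code that instantiates a lot of objects or Strings. It uses a lot of memory, which in turn drains performance.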
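
To make the suggestion concrete, here is a minimal sketch of reusing one Matcher over a CharBuffer to route tab-separated fields to per-column sinks without building a String[] for each line. The class name ReusableTabScanner and the method scatter are hypothetical, introduced only for this example. Whether allocation is really avoided depends on the sink: StringBuilder copies the characters directly, while some Writer implementations convert the subsequence to a String internally.

```java
import java.io.IOException;
import java.nio.CharBuffer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper illustrating Matcher reuse; not from the original comment.
final class ReusableTabScanner {
    private static final Pattern TAB = Pattern.compile("\t");

    // One Matcher, created once and re-pointed at each new line via reset().
    private final Matcher matcher = TAB.matcher("");

    /**
     * Appends field i of the line to columns[i] and returns the number of fields.
     * Assumes columns.length is at least the number of fields in the line.
     */
    int scatter(CharBuffer line, Appendable[] columns) throws IOException {
        matcher.reset(line);
        int field = 0;
        int start = 0;
        while (matcher.find()) {
            // Offsets are relative to the buffer's current position.
            columns[field++].append(line, start, matcher.start());
            start = matcher.end();
        }
        // The last field runs from the final tab to the end of the line.
        columns[field++].append(line, start, line.length());
        return field;
    }

    public static void main(String[] args) throws IOException {
        StringBuilder[] columns = { new StringBuilder(), new StringBuilder() };
        ReusableTabScanner scanner = new ReusableTabScanner();
        scanner.scatter(CharBuffer.wrap("id\tname"), columns);
        scanner.scatter(CharBuffer.wrap("1\tfoo"), columns);
        for (StringBuilder column : columns) {
            System.out.println(column);
        }
    }
}
```

A Matcher is not thread-safe, so a scanner like this would need to be confined to a single thread.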

