Java 中排序(内存映射?)文件中的二分搜索

发布于 2024-07-17 01:15:17 字数 545 浏览 10 评论 0原文

我正在努力将 Perl 程序移植到 Java,并一边学习 Java。 原始程序的核心组件是 Perl 模块< /a> 使用二分搜索在 +500 GB 排序文本文件中进行字符串前缀查找 (本质上,“查找”到文件中间的字节偏移量,回溯到最近的换行符,将行前缀与搜索字符串进行比较,“查找”到该字节偏移量的一半/两倍,重复直到找到......)

我有尝试了几种数据库解决方案,但发现对于这种大小的数据集,在纯粹的查找速度上没有什么比这更好的了。 您知道现有的 Java 库可以实现此类功能吗? 如果做不到这一点,你能给我指出一些随机访问读取文本文件的惯用示例代码吗?

或者,我不熟悉新的(?)Java I/O 库,但是可以选择内存映射 500 GB 文本文件(我在一台有空闲内存的 64 位机器上)并执行二进制操作搜索内存映射字节数组? 我非常有兴趣听到您分享有关此问题和类似问题的任何经验。

I am struggling to port a Perl program to Java, and learning Java as I go. A central component of the original program is a Perl module that does string prefix lookups in a +500 GB sorted text file using binary search
(essentially, "seek" to a byte offset in the middle of the file, backtrack to nearest newline, compare line prefix with the search string, "seek" to half/double that byte offset, repeat until found...)

I have experimented with several database solutions but found that nothing beats this in sheer lookup speed with data sets of this size. Do you know of any existing Java library that implements such functionality? Failing that, could you point me to some idiomatic example code that does random access reads in text files?

Alternatively, I am not familiar with the new (?) Java I/O libraries but would it be an option to memory-map the 500 GB text file (I'm on a 64-bit machine with memory to spare) and do binary search on the memory-mapped byte array? I would be very interested to hear any experiences you have to share about this and similar problems.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(8

阪姬 2024-07-24 01:15:17

我是 Java MappedByteBuffers 对于这种情况。 它的速度非常快。 下面是我为您整理的一个片段,它将缓冲区映射到文件,查找中间,然后向后搜索到换行符。 这应该足以让你继续下去吧?

我在自己的应用程序中有类似的代码(查找、读取、重复直到完成),并进行了基准测试
java.io 在生产环境中针对 MappedByteBuffer 进行流式传输,并将结果发布在我的博客上 (Geekomatic 帖子标记为“java.nio”),包含原始数据、图表等。

两秒总结? 我基于 MappedByteBuffer 的实现速度提高了约 275%。 YMMV。

为了处理大于 ~2GB 的文件(由于强制转换和 .position(int pos),这是一个问题),我精心设计了由 MappedByteBuffer 数组支持的分页算法>s。 您需要在 64 位系统上工作才能处理大于 2-4GB 的文件,因为 MBB 使用操作系统的虚拟内存系统来发挥其魔力。

public class StusMagicLargeFileReader  {
    private static final long PAGE_SIZE = Integer.MAX_VALUE;
    private List<MappedByteBuffer> buffers = new ArrayList<MappedByteBuffer>();
    private final byte raw[] = new byte[1];

    public static void main(String[] args) throws IOException {
        File file = new File("/Users/stu/test.txt");
        FileChannel fc = (new FileInputStream(file)).getChannel(); 
        StusMagicLargeFileReader buffer = new StusMagicLargeFileReader(fc);
        long position = file.length() / 2;
        String candidate = buffer.getString(position--);
        while (position >=0 && !candidate.equals('\n')) 
            candidate = buffer.getString(position--);
        //have newline position or start of file...do other stuff    
    }
    StusMagicLargeFileReader(FileChannel channel) throws IOException {
        long start = 0, length = 0;
        for (long index = 0; start + length < channel.size(); index++) {
            if ((channel.size() / PAGE_SIZE) == index)
                length = (channel.size() - index *  PAGE_SIZE) ;
            else
                length = PAGE_SIZE;
            start = index * PAGE_SIZE;
            buffers.add(index, channel.map(READ_ONLY, start, length));
        }    
    }
    public String getString(long bytePosition) {
        int page  = (int) (bytePosition / PAGE_SIZE);
        int index = (int) (bytePosition % PAGE_SIZE);
        raw[0] = buffers.get(page).get(index);
        return new String(raw);
    }
}

I am a big fan of Java's MappedByteBuffers for situations like this. It is blazing fast. Below is a snippet I put together for you that maps a buffer to the file, seeks to the middle, and then searches backwards to a newline character. This should be enough to get you going?

I have similar code (seek, read, repeat until done) in my own application, benchmarked
java.io streams against MappedByteBuffer in a production environment and posted the results on my blog (Geekomatic posts tagged 'java.nio' ) with raw data, graphs and all.

Two second summary? My MappedByteBuffer-based implementation was about 275% faster. YMMV.

To work for files larger than ~2GB, which is a problem because of the cast and .position(int pos), I've crafted paging algorithm backed by an array of MappedByteBuffers. You'll need to be working on a 64-bit system for this to work with files larger than 2-4GB because MBB's use the OS's virtual memory system to work their magic.

public class StusMagicLargeFileReader  {
    private static final long PAGE_SIZE = Integer.MAX_VALUE;
    private List<MappedByteBuffer> buffers = new ArrayList<MappedByteBuffer>();
    private final byte raw[] = new byte[1];

    public static void main(String[] args) throws IOException {
        File file = new File("/Users/stu/test.txt");
        FileChannel fc = (new FileInputStream(file)).getChannel(); 
        StusMagicLargeFileReader buffer = new StusMagicLargeFileReader(fc);
        long position = file.length() / 2;
        String candidate = buffer.getString(position--);
        while (position >=0 && !candidate.equals('\n')) 
            candidate = buffer.getString(position--);
        //have newline position or start of file...do other stuff    
    }
    StusMagicLargeFileReader(FileChannel channel) throws IOException {
        long start = 0, length = 0;
        for (long index = 0; start + length < channel.size(); index++) {
            if ((channel.size() / PAGE_SIZE) == index)
                length = (channel.size() - index *  PAGE_SIZE) ;
            else
                length = PAGE_SIZE;
            start = index * PAGE_SIZE;
            buffers.add(index, channel.map(READ_ONLY, start, length));
        }    
    }
    public String getString(long bytePosition) {
        int page  = (int) (bytePosition / PAGE_SIZE);
        int index = (int) (bytePosition % PAGE_SIZE);
        raw[0] = buffers.get(page).get(index);
        return new String(raw);
    }
}
淡淡绿茶香 2024-07-24 01:15:17

我也有同样的问题。 我试图在排序的文件中查找以某个前缀开头的所有行。

这是我编写的一个方法,它主要是在这里找到的Python代码的一个端口: http:// www.logarithmic.net/pfh/blog/01186620415

我已经测试过它,但还没有彻底测试。 但它不使用内存映射。

public static List<String> binarySearch(String filename, String string) {
    List<String> result = new ArrayList<String>();
    try {
        File file = new File(filename);
        RandomAccessFile raf = new RandomAccessFile(file, "r");

        long low = 0;
        long high = file.length();

        long p = -1;
        while (low < high) {
            long mid = (low + high) / 2;
            p = mid;
            while (p >= 0) {
                raf.seek(p);

                char c = (char) raf.readByte();
                //System.out.println(p + "\t" + c);
                if (c == '\n')
                    break;
                p--;
            }
            if (p < 0)
                raf.seek(0);
            String line = raf.readLine();
            //System.out.println("-- " + mid + " " + line);
            if (line.compareTo(string) < 0)
                low = mid + 1;
            else
                high = mid;
        }

        p = low;
        while (p >= 0) {
            raf.seek(p);
            if (((char) raf.readByte()) == '\n')
                break;
            p--;
        }

        if (p < 0)
            raf.seek(0);

        while (true) {
            String line = raf.readLine();
            if (line == null || !line.startsWith(string))
                break;
            result.add(line);
        }

        raf.close();
    } catch (IOException e) {
        System.out.println("IOException:");
        e.printStackTrace();
    }
    return result;
}

I have the same problem. I am trying to find all lines that start with some prefix in a sorted file.

Here is a method I cooked up which is largely a port of Python code found here: http://www.logarithmic.net/pfh/blog/01186620415

I have tested it but not thoroughly just yet. It does not use memory mapping, though.

public static List<String> binarySearch(String filename, String string) {
    List<String> result = new ArrayList<String>();
    try {
        File file = new File(filename);
        RandomAccessFile raf = new RandomAccessFile(file, "r");

        long low = 0;
        long high = file.length();

        long p = -1;
        while (low < high) {
            long mid = (low + high) / 2;
            p = mid;
            while (p >= 0) {
                raf.seek(p);

                char c = (char) raf.readByte();
                //System.out.println(p + "\t" + c);
                if (c == '\n')
                    break;
                p--;
            }
            if (p < 0)
                raf.seek(0);
            String line = raf.readLine();
            //System.out.println("-- " + mid + " " + line);
            if (line.compareTo(string) < 0)
                low = mid + 1;
            else
                high = mid;
        }

        p = low;
        while (p >= 0) {
            raf.seek(p);
            if (((char) raf.readByte()) == '\n')
                break;
            p--;
        }

        if (p < 0)
            raf.seek(0);

        while (true) {
            String line = raf.readLine();
            if (line == null || !line.startsWith(string))
                break;
            result.add(line);
        }

        raf.close();
    } catch (IOException e) {
        System.out.println("IOException:");
        e.printStackTrace();
    }
    return result;
}
万劫不复 2024-07-24 01:15:17

我不知道有哪个库具有该功能。 然而,Java 中外部二分搜索的正确代码应该与此类似:

class ExternalBinarySearch {
final RandomAccessFile file;
final Comparator<String> test; // tests the element given as search parameter with the line. Insert a PrefixComparator here
public ExternalBinarySearch(File f, Comparator<String> test) throws FileNotFoundException {
    this.file = new RandomAccessFile(f, "r");
    this.test = test;
}
public String search(String element) throws IOException {
    long l = file.length();
    return search(element, -1, l-1);
}
/**
 * Searches the given element in the range [low,high]. The low value of -1 is a special case to denote the beginning of a file.
 * In contrast to every other line, a line at the beginning of a file doesn't need a \n directly before the line
 */
private String search(String element, long low, long high) throws IOException {
    if(high - low < 1024) {
        // search directly
        long p = low;
        while(p < high) {
            String line = nextLine(p);
            int r = test.compare(line,element);
            if(r > 0) {
                return null;
            } else if (r < 0) {
                p += line.length();
            } else {
                return line;
            }
        }
        return null;
    } else {
        long m  = low + ((high - low) / 2);
        String line = nextLine(m);
        int r = test.compare(line, element);
        if(r > 0) {
            return search(element, low, m);
        } else if (r < 0) {
            return search(element, m, high);
        } else {
            return line;
        }
    }
}
private String nextLine(long low) throws IOException {
    if(low == -1) { // Beginning of file
        file.seek(0);           
    } else {
        file.seek(low);
    }
    int bufferLength = 65 * 1024;
    byte[] buffer = new byte[bufferLength];
    int r = file.read(buffer);
    int lineBeginIndex = -1;

    // search beginning of line
    if(low == -1) { //beginning of file
        lineBeginIndex = 0;
    } else {
        //normal mode
        for(int i = 0; i < 1024; i++) {
        if(buffer[i] == '\n') {
            lineBeginIndex = i + 1;
            break;
        }
        }
    }
    if(lineBeginIndex == -1) {
        // no line begins within next 1024 bytes
        return null;
    }
    int start = lineBeginIndex;
        for(int i = start; i < r; i++) {
            if(buffer[i] == '\n') {
                // Found end of line
                return new String(buffer, lineBeginIndex, i - lineBeginIndex + 1);
                return line.toString();
            }
        }
        throw new IllegalArgumentException("Line to long");
}
}

请注意:我临时编写了此代码:极端情况测试得不够好,该代码假设没有单行大于 64K等等。

我还认为建立行开始处的偏移量索引可能是一个好主意。 对于 500 GB 的文件,该索引应存储在索引文件中。 您应该通过该索引获得一个不那么小的常数因子,因为不需要在每个步骤中搜索下一行。

我知道这不是问题,但构建一个前缀树数据结构(如(Patrica)Tries(在磁盘/SSD 上))可能是进行前缀搜索的好主意。

I am not aware of any library that has that functionality. However, a correct code for a external binary search in Java should be similar to this:

class ExternalBinarySearch {
final RandomAccessFile file;
final Comparator<String> test; // tests the element given as search parameter with the line. Insert a PrefixComparator here
public ExternalBinarySearch(File f, Comparator<String> test) throws FileNotFoundException {
    this.file = new RandomAccessFile(f, "r");
    this.test = test;
}
public String search(String element) throws IOException {
    long l = file.length();
    return search(element, -1, l-1);
}
/**
 * Searches the given element in the range [low,high]. The low value of -1 is a special case to denote the beginning of a file.
 * In contrast to every other line, a line at the beginning of a file doesn't need a \n directly before the line
 */
private String search(String element, long low, long high) throws IOException {
    if(high - low < 1024) {
        // search directly
        long p = low;
        while(p < high) {
            String line = nextLine(p);
            int r = test.compare(line,element);
            if(r > 0) {
                return null;
            } else if (r < 0) {
                p += line.length();
            } else {
                return line;
            }
        }
        return null;
    } else {
        long m  = low + ((high - low) / 2);
        String line = nextLine(m);
        int r = test.compare(line, element);
        if(r > 0) {
            return search(element, low, m);
        } else if (r < 0) {
            return search(element, m, high);
        } else {
            return line;
        }
    }
}
private String nextLine(long low) throws IOException {
    if(low == -1) { // Beginning of file
        file.seek(0);           
    } else {
        file.seek(low);
    }
    int bufferLength = 65 * 1024;
    byte[] buffer = new byte[bufferLength];
    int r = file.read(buffer);
    int lineBeginIndex = -1;

    // search beginning of line
    if(low == -1) { //beginning of file
        lineBeginIndex = 0;
    } else {
        //normal mode
        for(int i = 0; i < 1024; i++) {
        if(buffer[i] == '\n') {
            lineBeginIndex = i + 1;
            break;
        }
        }
    }
    if(lineBeginIndex == -1) {
        // no line begins within next 1024 bytes
        return null;
    }
    int start = lineBeginIndex;
        for(int i = start; i < r; i++) {
            if(buffer[i] == '\n') {
                // Found end of line
                return new String(buffer, lineBeginIndex, i - lineBeginIndex + 1);
                return line.toString();
            }
        }
        throw new IllegalArgumentException("Line to long");
}
}

Please note: I made up this code ad-hoc: Corner cases are not tested nearly good enough, the code assumes that no single line is larger than 64K, etc.

I also think that building an index of the offsets where lines start might be a good idea. For a 500 GB file, that index should be stored in an index file. You should gain a not-so-small constant factor with that index because than there is no need to search for the next line in each step.

I know that was not the question, but building a prefix tree data structure like (Patrica) Tries (on disk/SSD) might be a good idea to do the prefix search.

岁月静好 2024-07-24 01:15:17

这是您想要实现的目标的一个简单示例。 我可能会首先索引文件,跟踪每个字符串的文件位置。 我假设字符串由换行符(或回车符)分隔:

    RandomAccessFile file = new RandomAccessFile("filename.txt", "r");
    List<Long> indexList = new ArrayList();
    long pos = 0;
    while (file.readLine() != null)
    {
        Long linePos = new Long(pos);
        indexList.add(linePos);
        pos = file.getFilePointer();
    }
    int indexSize = indexList.size();
    Long[] indexArray = new Long[indexSize];
    indexList.toArray(indexArray);

最后一步是转换为数组,以便在进行大量查找时稍微提高速度。 我可能也会将 Long[] 转换为 long[] ,但我在上面没有展示这一点。 最后是从给定索引位置读取字符串的代码:

    int i; // Initialize this appropriately for your algorithm.
    file.seek(indexArray[i]);
    String line = file.readLine();
            // At this point, line contains the string #i.

This is a simple example of what you want to achieve. I would probably first index the file, keeping track of the file position for each string. I'm assuming the strings are separated by newlines (or carriage returns):

    RandomAccessFile file = new RandomAccessFile("filename.txt", "r");
    List<Long> indexList = new ArrayList();
    long pos = 0;
    while (file.readLine() != null)
    {
        Long linePos = new Long(pos);
        indexList.add(linePos);
        pos = file.getFilePointer();
    }
    int indexSize = indexList.size();
    Long[] indexArray = new Long[indexSize];
    indexList.toArray(indexArray);

The last step is to convert to an array for a slight speed improvement when doing lots of lookups. I would probably convert the Long[] to a long[] also, but I did not show that above. Finally the code to read the string from a given indexed position:

    int i; // Initialize this appropriately for your algorithm.
    file.seek(indexArray[i]);
    String line = file.readLine();
            // At this point, line contains the string #i.
╄→承喏 2024-07-24 01:15:17

如果您正在处理 500GB 的文件,那么您可能需要使用比二分搜索更快的查找方法 - 即基数排序,它本质上是散列的一种变体。 执行此操作的最佳方法实际上取决于您的数据分布和查找类型,但如果您正在查找字符串前缀,那么应该有一个好方法来执行此操作。

我发布了一个整数基数排序解决方案的示例,但您可以使用相同的想法 - 基本上通过将数据划分为存储桶来减少排序时间,然后使用 O(1) 查找来检索相关数据存储桶。

Option Strict On
Option Explicit On

Module Module1

Private Const MAX_SIZE As Integer = 100000
Private m_input(MAX_SIZE) As Integer
Private m_table(MAX_SIZE) As List(Of Integer)
Private m_randomGen As New Random()
Private m_operations As Integer = 0

Private Sub generateData()
    ' fill with random numbers between 0 and MAX_SIZE - 1
    For i = 0 To MAX_SIZE - 1
        m_input(i) = m_randomGen.Next(0, MAX_SIZE - 1)
    Next

End Sub

Private Sub sortData()
    For i As Integer = 0 To MAX_SIZE - 1
        Dim x = m_input(i)
        If m_table(x) Is Nothing Then
            m_table(x) = New List(Of Integer)
        End If
        m_table(x).Add(x)
        ' clearly this is simply going to be MAX_SIZE -1
        m_operations = m_operations + 1
    Next
End Sub

 Private Sub printData(ByVal start As Integer, ByVal finish As Integer)
    If start < 0 Or start > MAX_SIZE - 1 Then
        Throw New Exception("printData - start out of range")
    End If
    If finish < 0 Or finish > MAX_SIZE - 1 Then
        Throw New Exception("printData - finish out of range")
    End If
    For i As Integer = start To finish
        If m_table(i) IsNot Nothing Then
            For Each x In m_table(i)
                Console.WriteLine(x)
            Next
        End If
    Next
End Sub

' run the entire sort, but just print out the first 100 for verification purposes
Private Sub test()
    m_operations = 0
    generateData()
    Console.WriteLine("Time started = " & Now.ToString())
    sortData()
    Console.WriteLine("Time finished = " & Now.ToString & " Number of operations = " & m_operations.ToString())
    ' print out a random 100 segment from the sorted array
    Dim start As Integer = m_randomGen.Next(0, MAX_SIZE - 101)
    printData(start, start + 100)
End Sub

Sub Main()
    test()
    Console.ReadLine()
End Sub

End Module

If you are dealing with a 500GB file, then you might want to use a faster lookup method than binary search - namely a radix sort which is essentially a variant of hashing. The best method for doing this really depends on your data distributions and types of lookup, but if you are looking for string prefixes there should be a good way to do this.

I posted an example of a radix sort solution for integers, but you can use the same idea - basically to cut down the sort time by dividing the data into buckets, then using O(1) lookup to retrieve the bucket of data that is relevant.

Option Strict On
Option Explicit On

Module Module1

Private Const MAX_SIZE As Integer = 100000
Private m_input(MAX_SIZE) As Integer
Private m_table(MAX_SIZE) As List(Of Integer)
Private m_randomGen As New Random()
Private m_operations As Integer = 0

Private Sub generateData()
    ' fill with random numbers between 0 and MAX_SIZE - 1
    For i = 0 To MAX_SIZE - 1
        m_input(i) = m_randomGen.Next(0, MAX_SIZE - 1)
    Next

End Sub

Private Sub sortData()
    For i As Integer = 0 To MAX_SIZE - 1
        Dim x = m_input(i)
        If m_table(x) Is Nothing Then
            m_table(x) = New List(Of Integer)
        End If
        m_table(x).Add(x)
        ' clearly this is simply going to be MAX_SIZE -1
        m_operations = m_operations + 1
    Next
End Sub

 Private Sub printData(ByVal start As Integer, ByVal finish As Integer)
    If start < 0 Or start > MAX_SIZE - 1 Then
        Throw New Exception("printData - start out of range")
    End If
    If finish < 0 Or finish > MAX_SIZE - 1 Then
        Throw New Exception("printData - finish out of range")
    End If
    For i As Integer = start To finish
        If m_table(i) IsNot Nothing Then
            For Each x In m_table(i)
                Console.WriteLine(x)
            Next
        End If
    Next
End Sub

' run the entire sort, but just print out the first 100 for verification purposes
Private Sub test()
    m_operations = 0
    generateData()
    Console.WriteLine("Time started = " & Now.ToString())
    sortData()
    Console.WriteLine("Time finished = " & Now.ToString & " Number of operations = " & m_operations.ToString())
    ' print out a random 100 segment from the sorted array
    Dim start As Integer = m_randomGen.Next(0, MAX_SIZE - 101)
    printData(start, start + 100)
End Sub

Sub Main()
    test()
    Console.ReadLine()
End Sub

End Module
情何以堪。 2024-07-24 01:15:17

我发布了一个要点 https://gist.github.com/mikee805/c6c2e6a35032a3ab74f643a1d0f8249c

这是相当基于我在堆栈溢出和一些博客上发现的完整示例,希望其他人可以使用它

import static java.nio.file.Files.isWritable;
import static java.nio.file.StandardOpenOption.READ;
import static org.apache.commons.io.FileUtils.forceMkdir;
import static org.apache.commons.io.IOUtils.closeQuietly;
import static org.apache.commons.lang3.StringUtils.isBlank;
import static org.apache.commons.lang3.StringUtils.trimToNull;

import java.io.File;
import java.io.IOException;
import java.nio.Buffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;

public class FileUtils {

    private FileUtils() {
    }

    private static boolean found(final String candidate, final String prefix) {
        return isBlank(candidate) || candidate.startsWith(prefix);
    }

    private static boolean before(final String candidate, final String prefix) {
        return prefix.compareTo(candidate.substring(0, prefix.length())) < 0;
    }

    public static MappedByteBuffer getMappedByteBuffer(final Path path) {
        FileChannel fileChannel = null;
        try {
            fileChannel = FileChannel.open(path, READ);
            return fileChannel.map(FileChannel.MapMode.READ_ONLY, 0, fileChannel.size()).load();
        } 
        catch (Exception e) {
            throw new RuntimeException(e);
        }
        finally {
            closeQuietly(fileChannel);
        }
    }

    public static String binarySearch(final String prefix, final MappedByteBuffer buffer) {
        if (buffer == null) {
            return null;
        }
        try {
            long low = 0;
            long high = buffer.limit();
            while (low < high) {
                int mid = (int) ((low + high) / 2);
                final String candidate = getLine(mid, buffer);
                if (found(candidate, prefix)) {
                    return trimToNull(candidate);
                } 
                else if (before(candidate, prefix)) {
                    high = mid;
                } 
                else {
                    low = mid + 1;
                }
            }
        } 
        catch (Exception e) {
            throw new RuntimeException(e);
        } 
        return null;
    }

    private static String getLine(int position, final MappedByteBuffer buffer) {
        // search backwards to the find the proceeding new line
        // then search forwards again until the next new line
        // return the string in between
        final StringBuilder stringBuilder = new StringBuilder();
        // walk it back
        char candidate = (char)buffer.get(position);
        while (position > 0 && candidate != '\n') {
            candidate = (char)buffer.get(--position);
        }
        // we either are at the beginning of the file or a new line
        if (position == 0) {
            // we are at the beginning at the first char
            candidate = (char)buffer.get(position);
            stringBuilder.append(candidate);
        }
        // there is/are char(s) after new line / first char
        if (isInBuffer(buffer, position)) {
            //first char after new line
            candidate = (char)buffer.get(++position);
            stringBuilder.append(candidate);
            //walk it forward
            while (isInBuffer(buffer, position) && candidate != ('\n')) {
                candidate = (char)buffer.get(++position);
                stringBuilder.append(candidate);
            }
        }
        return stringBuilder.toString();
    }

    private static boolean isInBuffer(final Buffer buffer, int position) {
        return position + 1 < buffer.limit();
    }

    public static File getOrCreateDirectory(final String dirName) { 
        final File directory = new File(dirName);
        try {
            forceMkdir(directory);
            isWritable(directory.toPath());
        } 
        catch (IOException e) {
            throw new RuntimeException(e);
        }
        return directory;
    }
}

I post a gist https://gist.github.com/mikee805/c6c2e6a35032a3ab74f643a1d0f8249c

that is rather complete example based on what I found on stack overflow and some blogs hopefully someone else can use it

import static java.nio.file.Files.isWritable;
import static java.nio.file.StandardOpenOption.READ;
import static org.apache.commons.io.FileUtils.forceMkdir;
import static org.apache.commons.io.IOUtils.closeQuietly;
import static org.apache.commons.lang3.StringUtils.isBlank;
import static org.apache.commons.lang3.StringUtils.trimToNull;

import java.io.File;
import java.io.IOException;
import java.nio.Buffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;

public class FileUtils {

    private FileUtils() {
    }

    private static boolean found(final String candidate, final String prefix) {
        return isBlank(candidate) || candidate.startsWith(prefix);
    }

    private static boolean before(final String candidate, final String prefix) {
        return prefix.compareTo(candidate.substring(0, prefix.length())) < 0;
    }

    public static MappedByteBuffer getMappedByteBuffer(final Path path) {
        FileChannel fileChannel = null;
        try {
            fileChannel = FileChannel.open(path, READ);
            return fileChannel.map(FileChannel.MapMode.READ_ONLY, 0, fileChannel.size()).load();
        } 
        catch (Exception e) {
            throw new RuntimeException(e);
        }
        finally {
            closeQuietly(fileChannel);
        }
    }

    public static String binarySearch(final String prefix, final MappedByteBuffer buffer) {
        if (buffer == null) {
            return null;
        }
        try {
            long low = 0;
            long high = buffer.limit();
            while (low < high) {
                int mid = (int) ((low + high) / 2);
                final String candidate = getLine(mid, buffer);
                if (found(candidate, prefix)) {
                    return trimToNull(candidate);
                } 
                else if (before(candidate, prefix)) {
                    high = mid;
                } 
                else {
                    low = mid + 1;
                }
            }
        } 
        catch (Exception e) {
            throw new RuntimeException(e);
        } 
        return null;
    }

    private static String getLine(int position, final MappedByteBuffer buffer) {
        // search backwards to the find the proceeding new line
        // then search forwards again until the next new line
        // return the string in between
        final StringBuilder stringBuilder = new StringBuilder();
        // walk it back
        char candidate = (char)buffer.get(position);
        while (position > 0 && candidate != '\n') {
            candidate = (char)buffer.get(--position);
        }
        // we either are at the beginning of the file or a new line
        if (position == 0) {
            // we are at the beginning at the first char
            candidate = (char)buffer.get(position);
            stringBuilder.append(candidate);
        }
        // there is/are char(s) after new line / first char
        if (isInBuffer(buffer, position)) {
            //first char after new line
            candidate = (char)buffer.get(++position);
            stringBuilder.append(candidate);
            //walk it forward
            while (isInBuffer(buffer, position) && candidate != ('\n')) {
                candidate = (char)buffer.get(++position);
                stringBuilder.append(candidate);
            }
        }
        return stringBuilder.toString();
    }

    private static boolean isInBuffer(final Buffer buffer, int position) {
        return position + 1 < buffer.limit();
    }

    public static File getOrCreateDirectory(final String dirName) { 
        final File directory = new File(dirName);
        try {
            forceMkdir(directory);
            isWritable(directory.toPath());
        } 
        catch (IOException e) {
            throw new RuntimeException(e);
        }
        return directory;
    }
}
倾城°AllureLove 2024-07-24 01:15:17

我遇到了类似的问题,因此我从此线程中提供的解决方案创建了(Scala)库:

https://github.com/ avast/BigMap

它包含用于对大文件进行排序以及在此排序文件中进行二进制搜索的实用程序...

I had similar problem, so I created (Scala) library from solutions provided in this thread:

https://github.com/avast/BigMap

It contains utility for sorting huge file and binary search in this sorted file...

乱世争霸 2024-07-24 01:15:17

如果您确实想尝试内存映射文件,我找到了一个 关于如何在Java nio中使用内存映射

If you truly want to try memory mapping the file, I found a tutorial on how to use memory mapping in Java nio.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文