Burrows Wheeler 变换的优化

发布于 2024-11-07 20:28:26 字数 2808 浏览 4 评论 0原文

您好，我在优化洞穴惠勒变换时遇到一些困难。我正在尝试转换文本文件，但是转换诸如圣经之类的大型文本文件需要很长时间。

关于如何继续的任何想法？

public BurrowsWheelerTransformEncoder()
{

}

private String originalSuffix(int index, String string)
{
    String temp = (string.substring(index,string.length()) + string.substring(0,index));

    //this bit just 'compresses' each transformation of text by producing
    //a prefix, so 'abracadabra' just becomes 'abrac'
    //this is so minimal amount of memory is used when it is stored in an array

    return temp.substring(0,5)+
    //the last character of the transformation is kept
           temp.charAt(temp.length()-1);
}

private String compressedSuffix(String string)
{
    //this method just 'compresses' original piece of text by producing
    //a prefix, so 'abracadabra' just becomes 'abrac'
    //this is so comprisons won't take so long
    return string.substring(0,5)+string.charAt(string.length()-1);
}

public static void main(String args[]) throws Exception
{
    BurrowsWheelerTransformEncoder encoder = new BurrowsWheelerTransformEncoder();
    BufferedReader input = new BufferedReader(new FileReader("src/compressionalgorithm/texts/manifesto.txt"));

    String text = "";
    //the row in the sorted array where the original text can be found
    int originalRow = 0;
    //system time when program began
    long startTime = System.nanoTime();

    //get text from file
    while(input.ready())
    {
        text += input.readLine();
    }
    //create a new array to hold all transformations
    String[] textArray = new String[text.length()];
    int length = text.length();

    //get individual transformations and put in array
    for(int i = 0; i < text.length(); i++)
    {
        textArray[i] = encoder.originalSuffix(i,text);
        //for debugging large text files, prints progress after every 10k'th 
        //transformation
        if(i%10000==0)
        System.out.println(i+"/"+length);
    }
    //uses java's internal methods to sort the array, presumably 
    //the most efficient way to do the sort (for now)
    Arrays.sort(textArray);

    String compressedOriginalText = encoder.compressedSuffix(text);

    //print the results
    for(int i = 0; i < textArray.length; i++)
    {
        if(textArray[i].equals(compressedOriginalText))
        {
            originalRow = i;
        }
        if(i%100==0)
        {
            System.out.println();
        }
        System.out.print(textArray[i].charAt(textArray[i].length()-1));
    }
    System.out.println("\nThe original transformation of the text was found at row " + originalRow + " of the sorted array.");
    System.out.println("Time elapsed: " + (System.nanoTime() - startTime));
 }

原文

Hello I am having some difficulty optimizing the burrows wheeler transform. I'm trying to transform text files, however transforming large text files like the bible take way too long.

Any idea on how to proceed?

public BurrowsWheelerTransformEncoder()
{

}

private String originalSuffix(int index, String string)
{
    String temp = (string.substring(index,string.length()) + string.substring(0,index));

    //this bit just 'compresses' each transformation of text by producing
    //a prefix, so 'abracadabra' just becomes 'abrac'
    //this is so minimal amount of memory is used when it is stored in an array

    return temp.substring(0,5)+
    //the last character of the transformation is kept
           temp.charAt(temp.length()-1);
}

private String compressedSuffix(String string)
{
    //this method just 'compresses' original piece of text by producing
    //a prefix, so 'abracadabra' just becomes 'abrac'
    //this is so comprisons won't take so long
    return string.substring(0,5)+string.charAt(string.length()-1);
}

public static void main(String args[]) throws Exception
{
    BurrowsWheelerTransformEncoder encoder = new BurrowsWheelerTransformEncoder();
    BufferedReader input = new BufferedReader(new FileReader("src/compressionalgorithm/texts/manifesto.txt"));

    String text = "";
    //the row in the sorted array where the original text can be found
    int originalRow = 0;
    //system time when program began
    long startTime = System.nanoTime();

    //get text from file
    while(input.ready())
    {
        text += input.readLine();
    }
    //create a new array to hold all transformations
    String[] textArray = new String[text.length()];
    int length = text.length();

    //get individual transformations and put in array
    for(int i = 0; i < text.length(); i++)
    {
        textArray[i] = encoder.originalSuffix(i,text);
        //for debugging large text files, prints progress after every 10k'th 
        //transformation
        if(i%10000==0)
        System.out.println(i+"/"+length);
    }
    //uses java's internal methods to sort the array, presumably 
    //the most efficient way to do the sort (for now)
    Arrays.sort(textArray);

    String compressedOriginalText = encoder.compressedSuffix(text);

    //print the results
    for(int i = 0; i < textArray.length; i++)
    {
        if(textArray[i].equals(compressedOriginalText))
        {
            originalRow = i;
        }
        if(i%100==0)
        {
            System.out.println();
        }
        System.out.print(textArray[i].charAt(textArray[i].length()-1));
    }
    System.out.println("\nThe original transformation of the text was found at row " + originalRow + " of the sorted array.");
    System.out.println("Time elapsed: " + (System.nanoTime() - startTime));
 }

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

離人涙 2024-11-14 20:28:26

对于编码情况，您不需要实际构建字符串数组 - 使用 int （或 long，具体取决于您的文件大小）数组来存储旋转字符串开始的索引。

创建一个初始化为 [0 1 2 3 ... n] 的数组

使用以下compareTo对数组进行排序（假设compareTo()可以访问原始字符串，original):

int CompareTo(int a, int b){
    int 比较，len =original.length();
    做{
        char _a = 原始.charAt(a), _b = 原始.charAt(b);
        比较=_a-_b；
        一个++； b++；
        if(a>=len)a-=len;
        if(b>=len)b-=len;
    } while(比较==0);
    返回比较；
}

记下索引数组中“0”的索引，并将其作为“开始”值添加到输出中

对于反转，再次进行我们希望避免为文本构建整个表格像圣经一样大。我们可以利用第一行和最后一行中相同标记始终处于相同顺序的事实来做到这一点。这是正确的，因为第一行已排序并且标记是循环排列的：对于最后一行中的三个连续的 b，它们后面的标记已排序，因此 b 已排序。因此，要反转：

对输出标记进行排序。除了存储已排序的标记之外，还存储每个标记开始的索引。因此，对于未排序的标记“nbnaaa”，您将存储 [3 4 5 2 0 1] 和“aaabnn”。重要：此步骤必须使用稳定排序。

使用前面提到的“start”值来重建字符串：

字符串解码（字符串已排序，int[]index，int start）{
    字符串答案 = ""+sorted.charAt(start);
    int next = 索引[开始];
    while(下一个!=开始){
        答案=已排序.charAt(下一个)+答案;
        下一个=索引[下一个]；
    }
    返回答案；
}

For the coding case, you don't need to actually build a string array - use an int (or long depending on your file size) array instead to store the index that your rotating string starts at.

Create an array initialized to [0 1 2 3 ... n]

sort the array with the following compareTo (assume compareTo() has access to the original string, original):

int compareTo(int a, int b){
    int compare, len = original.length();
    do{
        char _a = original.charAt(a), _b = original.charAt(b);
        compare = _a-_b;
        a++; b++;
        if(a>=len)a-=len;
        if(b>=len)b-=len;
    }while(compare==0);
    return compare;
}

note the index of "0" in the array and add that to your output as the "start" value

For the reversal, again we would want to avoid building the entire table for a text as large as the bible. We can do this by using the fact that identical tokens in the first row and last row are always in the same order. This is true because the first row is sorted and the tokens are arranged cyclically: for three consecutive b's in the last row, the tokens after them are sorted, so the b's are sorted. So to reverse:

sort the output tokens. Along with storing the sorted tokens, store the index each token started at. So for the unsorted tokens "nbnaaa", you would store [3 4 5 2 0 1] and "aaabnn". Important: you MUST use a stable sort for this step.

use the "start" value mentioned earlier to rebuild the string:

string decode(string sorted, int[]index, int start){
    string answer = ""+sorted.charAt(start);
    int next = index[start];
    while(next!=start){
        answer = sorted.charAt(next) + answer;
        next = index[next];
    }
    return answer;
}

回复收藏 0 原文

浅暮の光 2024-11-14 20:28:26

此行：

    String temp = (string.substring(index,string.length()) + string.substring(0,index));

每次调用它时都会创建整个输入文本的副本。由于您对 N 个字符的输入文本调用它 N 次，因此您的算法将是 O(N^2)。

看看是否可以优化 originalSuffix 方法来避免复制。

This line:

    String temp = (string.substring(index,string.length()) + string.substring(0,index));

is going to create a copy of the entire input text each time you call it. Since you call it N times for an input text of N characters, your algorithm will be O(N^2).

See if you can optimize the originalSuffix method to avoid that copying.

回复收藏 0 原文

~没有更多了~