Burrows Wheeler 变换的优化
您好,我在优化洞穴惠勒变换时遇到一些困难。我正在尝试转换文本文件,但是转换诸如圣经之类的大型文本文件需要很长时间。
关于如何继续的任何想法?
public BurrowsWheelerTransformEncoder()
{
}
private String originalSuffix(int index, String string)
{
String temp = (string.substring(index,string.length()) + string.substring(0,index));
//this bit just 'compresses' each transformation of text by producing
//a prefix, so 'abracadabra' just becomes 'abrac'
//this is so minimal amount of memory is used when it is stored in an array
return temp.substring(0,5)+
//the last character of the transformation is kept
temp.charAt(temp.length()-1);
}
private String compressedSuffix(String string)
{
//this method just 'compresses' original piece of text by producing
//a prefix, so 'abracadabra' just becomes 'abrac'
//this is so comprisons won't take so long
return string.substring(0,5)+string.charAt(string.length()-1);
}
public static void main(String args[]) throws Exception
{
BurrowsWheelerTransformEncoder encoder = new BurrowsWheelerTransformEncoder();
BufferedReader input = new BufferedReader(new FileReader("src/compressionalgorithm/texts/manifesto.txt"));
String text = "";
//the row in the sorted array where the original text can be found
int originalRow = 0;
//system time when program began
long startTime = System.nanoTime();
//get text from file
while(input.ready())
{
text += input.readLine();
}
//create a new array to hold all transformations
String[] textArray = new String[text.length()];
int length = text.length();
//get individual transformations and put in array
for(int i = 0; i < text.length(); i++)
{
textArray[i] = encoder.originalSuffix(i,text);
//for debugging large text files, prints progress after every 10k'th
//transformation
if(i%10000==0)
System.out.println(i+"/"+length);
}
//uses java's internal methods to sort the array, presumably
//the most efficient way to do the sort (for now)
Arrays.sort(textArray);
String compressedOriginalText = encoder.compressedSuffix(text);
//print the results
for(int i = 0; i < textArray.length; i++)
{
if(textArray[i].equals(compressedOriginalText))
{
originalRow = i;
}
if(i%100==0)
{
System.out.println();
}
System.out.print(textArray[i].charAt(textArray[i].length()-1));
}
System.out.println("\nThe original transformation of the text was found at row " + originalRow + " of the sorted array.");
System.out.println("Time elapsed: " + (System.nanoTime() - startTime));
}
Hello I am having some difficulty optimizing the burrows wheeler transform. I'm trying to transform text files, however transforming large text files like the bible take way too long.
Any idea on how to proceed?
public BurrowsWheelerTransformEncoder()
{
}
private String originalSuffix(int index, String string)
{
String temp = (string.substring(index,string.length()) + string.substring(0,index));
//this bit just 'compresses' each transformation of text by producing
//a prefix, so 'abracadabra' just becomes 'abrac'
//this is so minimal amount of memory is used when it is stored in an array
return temp.substring(0,5)+
//the last character of the transformation is kept
temp.charAt(temp.length()-1);
}
private String compressedSuffix(String string)
{
//this method just 'compresses' original piece of text by producing
//a prefix, so 'abracadabra' just becomes 'abrac'
//this is so comprisons won't take so long
return string.substring(0,5)+string.charAt(string.length()-1);
}
public static void main(String args[]) throws Exception
{
BurrowsWheelerTransformEncoder encoder = new BurrowsWheelerTransformEncoder();
BufferedReader input = new BufferedReader(new FileReader("src/compressionalgorithm/texts/manifesto.txt"));
String text = "";
//the row in the sorted array where the original text can be found
int originalRow = 0;
//system time when program began
long startTime = System.nanoTime();
//get text from file
while(input.ready())
{
text += input.readLine();
}
//create a new array to hold all transformations
String[] textArray = new String[text.length()];
int length = text.length();
//get individual transformations and put in array
for(int i = 0; i < text.length(); i++)
{
textArray[i] = encoder.originalSuffix(i,text);
//for debugging large text files, prints progress after every 10k'th
//transformation
if(i%10000==0)
System.out.println(i+"/"+length);
}
//uses java's internal methods to sort the array, presumably
//the most efficient way to do the sort (for now)
Arrays.sort(textArray);
String compressedOriginalText = encoder.compressedSuffix(text);
//print the results
for(int i = 0; i < textArray.length; i++)
{
if(textArray[i].equals(compressedOriginalText))
{
originalRow = i;
}
if(i%100==0)
{
System.out.println();
}
System.out.print(textArray[i].charAt(textArray[i].length()-1));
}
System.out.println("\nThe original transformation of the text was found at row " + originalRow + " of the sorted array.");
System.out.println("Time elapsed: " + (System.nanoTime() - startTime));
}
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
对于编码情况,您不需要实际构建字符串数组 - 使用 int (或 long,具体取决于您的文件大小)数组来存储旋转字符串开始的索引。
使用以下compareTo对数组进行排序(假设
compareTo()
可以访问原始字符串,original):
记下索引数组中“0”的索引,并将其作为“开始”值添加到输出中
对于反转,再次进行我们希望避免为文本构建整个表格像圣经一样大。我们可以利用第一行和最后一行中相同标记始终处于相同顺序的事实来做到这一点。这是正确的,因为第一行已排序并且标记是循环排列的:对于最后一行中的三个连续的 b,它们后面的标记已排序,因此 b 已排序。因此,要反转:
使用前面提到的“start”值来重建字符串:
For the coding case, you don't need to actually build a string array - use an int (or long depending on your file size) array instead to store the index that your rotating string starts at.
sort the array with the following compareTo (assume
compareTo()
has access to the original string,original
):note the index of "0" in the array and add that to your output as the "start" value
For the reversal, again we would want to avoid building the entire table for a text as large as the bible. We can do this by using the fact that identical tokens in the first row and last row are always in the same order. This is true because the first row is sorted and the tokens are arranged cyclically: for three consecutive b's in the last row, the tokens after them are sorted, so the b's are sorted. So to reverse:
use the "start" value mentioned earlier to rebuild the string:
此行:
每次调用它时都会创建整个输入文本的副本。由于您对 N 个字符的输入文本调用它 N 次,因此您的算法将是
O(N^2)
。看看是否可以优化
originalSuffix
方法来避免复制。This line:
is going to create a copy of the entire input text each time you call it. Since you call it N times for an input text of N characters, your algorithm will be
O(N^2)
.See if you can optimize the
originalSuffix
method to avoid that copying.