优化字符串和嵌套搜索
我一直在研究一个搜索按层次结构组织的文本的函数。由于我正在处理的文件大小和数量都很大,因此优化问题(速度和内存)变得越来越重要,我正在研究如何改进算法
public NestLink<String> searchsubs(String input, NestLink<String> searchTags) {
// parameter tests
int startIndex = 0; // start of the resultant subsection
int endIndex = 0; // end of the resultant subsection
String openTag = searchTags.elem1; // start of the string I need
String closeTag = searchTags.elem2; // end of the string I need
String subString = null; // stores the result in between
NestLink<String> node = null; // temp variable
NestLink<String> out = null; // output variable
while (true) {
// find the opening string
startIndex = input.indexOf(openTag, endIndex);
if (startIndex == -1) {
break; // if no more are found, break from the loop
} else {
}
endIndex = input.indexOf(closeTag, startIndex);
if (endIndex == -1) {
break; // if tag isn't closed, break from the loop
} else {
}
// we now have a pair of tags with a content between
subString = input.substring(startIndex + openTag.length(), endIndex);
// store what we found, method unimportant
// search this content for each subsearch in the heirarchy
for (NestLink<String> subSearch : searchTags.subBranches) {
// recurse
node = subBlockParser(subString, subSearch);
// do stuff with results
}
}
//final do stuff
return out;
}
注释:NestLink 是一种定制的树结构,然而,格式并不重要。
结果是,对于每个搜索级别,都会创建子字符串的副本(有时大小高达 1 MB),这显然效率不高。
为了尝试解决这个问题,我考虑了以下内容:
public NestLink<String> searchsubs(String input, int substringStart, int substringEnd,
NestLink<String> searchTags) {
// parameter tests
int startIndex = substringStart; // start of the resultant subsection
int endIndex = substringStart; // end of the resultant subsection
String openTag = searchTags.elem1; // start of the string I need
String closeTag = searchTags.elem2; // end of the string I need
String subString = null; // stores the result in between
NestLink<String> node = null; // temp variable
NestLink<String> out = null; // output variable
while (true) {
// find the opening string
startIndex = input.indexOf(openTag, endIndex);
if (startIndex == -1 || startIndex >= substringEnd) {
break; // if no more are found, break from the loop
} else {
}
endIndex = input.indexOf(closeTag, startIndex);
if (endIndex == -1 || endIndex >= substringEnd) {
break; // if tag isn't closed, break from the loop
} else {
}
// we now have a pair of tags with a content between
// store what we found, method unimportant
// search this content for each subsearch in the heirarchy
for (NestLink<String> subSearch : searchTags.subBranches) {
// recurse, this time sharing input, but with a new substring start and end to serve as bounds
node =
subBlockParser(input, startIndex + openTag.length(), endIndex, subSearch);
// do stuff with results
}
}
// final do stuff
return out;
}
这一次我没有创建子字符串,而是发送输入和一组边界。这就提出了一个问题,JRE 将如何处理这个问题?它会复制输入字符串(导致性能下降,因为现在正在复制更大的字符串),还是会像其他对象一样简单地传递指针对象,并且所有递归共享相同的字符串对象(性能显着提高)因为没有复制)
或者,是否有任何其他概念可能适用于层次搜索?和继承结果?
巴拉德
I've been working on a function that is searching heirarchally organised text. Since the files I'm working with are large in both size and number, the matter of optimisation (both speed and memory) is becoming increasingly important and I'm looking on how to improve an algorithm
public NestLink<String> searchsubs(String input, NestLink<String> searchTags) {
// parameter tests
int startIndex = 0; // start of the resultant subsection
int endIndex = 0; // end of the resultant subsection
String openTag = searchTags.elem1; // start of the string I need
String closeTag = searchTags.elem2; // end of the string I need
String subString = null; // stores the result in between
NestLink<String> node = null; // temp variable
NestLink<String> out = null; // output variable
while (true) {
// find the opening string
startIndex = input.indexOf(openTag, endIndex);
if (startIndex == -1) {
break; // if no more are found, break from the loop
} else {
}
endIndex = input.indexOf(closeTag, startIndex);
if (endIndex == -1) {
break; // if tag isn't closed, break from the loop
} else {
}
// we now have a pair of tags with a content between
subString = input.substring(startIndex + openTag.length(), endIndex);
// store what we found, method unimportant
// search this content for each subsearch in the heirarchy
for (NestLink<String> subSearch : searchTags.subBranches) {
// recurse
node = subBlockParser(subString, subSearch);
// do stuff with results
}
}
//final do stuff
return out;
}
note: NestLink is a customised tree structure, the format is unimportant, however.
The result is that for each level of the search a copy of the substring, sometimes up to 1mbyte in size, is being created which is obviously far from efficient.
To try and resolve this I considered the following:
public NestLink<String> searchsubs(String input, int substringStart, int substringEnd,
NestLink<String> searchTags) {
// parameter tests
int startIndex = substringStart; // start of the resultant subsection
int endIndex = substringStart; // end of the resultant subsection
String openTag = searchTags.elem1; // start of the string I need
String closeTag = searchTags.elem2; // end of the string I need
String subString = null; // stores the result in between
NestLink<String> node = null; // temp variable
NestLink<String> out = null; // output variable
while (true) {
// find the opening string
startIndex = input.indexOf(openTag, endIndex);
if (startIndex == -1 || startIndex >= substringEnd) {
break; // if no more are found, break from the loop
} else {
}
endIndex = input.indexOf(closeTag, startIndex);
if (endIndex == -1 || endIndex >= substringEnd) {
break; // if tag isn't closed, break from the loop
} else {
}
// we now have a pair of tags with a content between
// store what we found, method unimportant
// search this content for each subsearch in the heirarchy
for (NestLink<String> subSearch : searchTags.subBranches) {
// recurse, this time sharing input, but with a new substring start and end to serve as bounds
node =
subBlockParser(input, startIndex + openTag.length(), endIndex, subSearch);
// do stuff with results
}
}
// final do stuff
return out;
}
This time rather than creating a substring I send the input and a set of bounds. This raises the question, how would the JRE handle this? Would it copy the input string (resulting in a reduced performance as an even bigger string is being copied now), or would it simply pass a pointer object as it would with othre objects and all recursions share the same string object (a significant performance increase as there is no copying)
Alternatively, are there any other concepts that may be applicable to a heirarchal search? and heirarchal result?
K.Barad
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
substring
不会创建包含原始字符串中字符的副本的新字符串。它只是返回一个与原始字符串共享相同字符数组的 String 对象,但具有不同的偏移量和长度。因此,您的第二个实现与第一个实现类似,但更复杂(因为它执行 String 内部的操作)。
substring
doesn't create a new String containing a copy of the chars in the original string. It just returns a String object sharing the same char array as the original string, but with a different offset and length.So your second implementation is similar to the first one, but more complex (because it does what String does internally).
恐怕Java标准API已经实现了这种优化,所以从中没有什么好处。
java.lang.String
由底层char[]
、偏移量和长度组成。substring()
重用char[]
并仅调整偏移量和长度。我建议在您考虑任何优化之前,在您的代码上使用分析器来找出它实际花费大部分时间的地方。这将防止你像这样浪费你的努力,我几乎可以保证你会对结果感到惊讶。
I'm afraid the Java standard API already implements this optimization, so there is nothing to be gained from it.
A
java.lang.String
consists of an underlyingchar[]
, an offset and a length.substring()
reuses thechar[]
and merely adjusts the offset and length.I suggest using a profiler on your code to find out where it actually spends most of the time, before you think of any optimizations. That will prevent you from wasting your efforts like this, and I can almost guarantee that you will be surprised by the result.