优化字符串和嵌套搜索

发布于 2024-10-18 14:05:24 字数 3272 浏览 0 评论 0原文

我一直在研究一个搜索按层次结构组织的文本的函数。由于我正在处理的文件大小和数量都很大，因此优化问题（速度和内存）变得越来越重要，我正在研究如何改进算法

public NestLink<String> searchsubs(String input, NestLink<String> searchTags) {

// parameter tests

int startIndex = 0; // start of the resultant subsection
int endIndex = 0; // end of the resultant subsection
String openTag = searchTags.elem1; // start of the string I need
String closeTag = searchTags.elem2; // end of the string I need
String subString = null; // stores the result in between
NestLink<String> node = null; // temp variable
NestLink<String> out = null; // output variable

while (true) {
    // find the opening string
    startIndex = input.indexOf(openTag, endIndex);
    if (startIndex == -1) {
    break; // if no more are found, break from the loop
    } else {
    }
    endIndex = input.indexOf(closeTag, startIndex);
    if (endIndex == -1) {
    break; // if tag isn't closed, break from the loop
    } else {
    }

    // we now have a pair of tags with a content between
    subString = input.substring(startIndex + openTag.length(), endIndex);

    // store what we found, method unimportant

    // search this content for each subsearch in the heirarchy
    for (NestLink<String> subSearch : searchTags.subBranches) {
    // recurse
    node = subBlockParser(subString, subSearch);
    // do stuff with results
    }       
}

//final do stuff
return out;
}

注释：NestLink 是一种定制的树结构，然而，格式并不重要。

结果是，对于每个搜索级别，都会创建子字符串的副本（有时大小高达 1 MB），这显然效率不高。

为了尝试解决这个问题，我考虑了以下内容：

public NestLink<String> searchsubs(String input, int substringStart, int substringEnd,
NestLink<String> searchTags) {

// parameter tests

int startIndex = substringStart; // start of the resultant subsection
int endIndex = substringStart; // end of the resultant subsection
String openTag = searchTags.elem1; // start of the string I need
String closeTag = searchTags.elem2; // end of the string I need
String subString = null; // stores the result in between
NestLink<String> node = null; // temp variable
NestLink<String> out = null; // output variable

while (true) {
    // find the opening string
    startIndex = input.indexOf(openTag, endIndex);
    if (startIndex == -1 || startIndex >= substringEnd) {
    break; // if no more are found, break from the loop
    } else {
    }
    endIndex = input.indexOf(closeTag, startIndex);
    if (endIndex == -1 || endIndex >= substringEnd) {
    break; // if tag isn't closed, break from the loop
    } else {
    }

    // we now have a pair of tags with a content between
    // store what we found, method unimportant

    // search this content for each subsearch in the heirarchy
    for (NestLink<String> subSearch : searchTags.subBranches) {
    // recurse, this time sharing input, but with a new substring start and end to serve as bounds
    node =
        subBlockParser(input, startIndex + openTag.length(), endIndex, subSearch);
    // do stuff with results
    }
}

// final do stuff
return out;
}

这一次我没有创建子字符串，而是发送输入和一组边界。这就提出了一个问题，JRE 将如何处理这个问题？它会复制输入字符串（导致性能下降，因为现在正在复制更大的字符串），还是会像其他对象一样简单地传递指针对象，并且所有递归共享相同的字符串对象（性能显着提高）因为没有复制）

或者，是否有任何其他概念可能适用于层次搜索？和继承结果？

巴拉德

原文

I've been working on a function that is searching heirarchally organised text. Since the files I'm working with are large in both size and number, the matter of optimisation (both speed and memory) is becoming increasingly important and I'm looking on how to improve an algorithm

public NestLink<String> searchsubs(String input, NestLink<String> searchTags) {

// parameter tests

int startIndex = 0; // start of the resultant subsection
int endIndex = 0; // end of the resultant subsection
String openTag = searchTags.elem1; // start of the string I need
String closeTag = searchTags.elem2; // end of the string I need
String subString = null; // stores the result in between
NestLink<String> node = null; // temp variable
NestLink<String> out = null; // output variable

while (true) {
    // find the opening string
    startIndex = input.indexOf(openTag, endIndex);
    if (startIndex == -1) {
    break; // if no more are found, break from the loop
    } else {
    }
    endIndex = input.indexOf(closeTag, startIndex);
    if (endIndex == -1) {
    break; // if tag isn't closed, break from the loop
    } else {
    }

    // we now have a pair of tags with a content between
    subString = input.substring(startIndex + openTag.length(), endIndex);

    // store what we found, method unimportant

    // search this content for each subsearch in the heirarchy
    for (NestLink<String> subSearch : searchTags.subBranches) {
    // recurse
    node = subBlockParser(subString, subSearch);
    // do stuff with results
    }       
}

//final do stuff
return out;
}

note: NestLink is a customised tree structure, the format is unimportant, however.

The result is that for each level of the search a copy of the substring, sometimes up to 1mbyte in size, is being created which is obviously far from efficient.

To try and resolve this I considered the following:

public NestLink<String> searchsubs(String input, int substringStart, int substringEnd,
NestLink<String> searchTags) {

// parameter tests

int startIndex = substringStart; // start of the resultant subsection
int endIndex = substringStart; // end of the resultant subsection
String openTag = searchTags.elem1; // start of the string I need
String closeTag = searchTags.elem2; // end of the string I need
String subString = null; // stores the result in between
NestLink<String> node = null; // temp variable
NestLink<String> out = null; // output variable

while (true) {
    // find the opening string
    startIndex = input.indexOf(openTag, endIndex);
    if (startIndex == -1 || startIndex >= substringEnd) {
    break; // if no more are found, break from the loop
    } else {
    }
    endIndex = input.indexOf(closeTag, startIndex);
    if (endIndex == -1 || endIndex >= substringEnd) {
    break; // if tag isn't closed, break from the loop
    } else {
    }

    // we now have a pair of tags with a content between
    // store what we found, method unimportant

    // search this content for each subsearch in the heirarchy
    for (NestLink<String> subSearch : searchTags.subBranches) {
    // recurse, this time sharing input, but with a new substring start and end to serve as bounds
    node =
        subBlockParser(input, startIndex + openTag.length(), endIndex, subSearch);
    // do stuff with results
    }
}

// final do stuff
return out;
}

This time rather than creating a substring I send the input and a set of bounds. This raises the question, how would the JRE handle this? Would it copy the input string (resulting in a reduced performance as an even bigger string is being copied now), or would it simply pass a pointer object as it would with othre objects and all recursions share the same string object (a significant performance increase as there is no copying)

Alternatively, are there any other concepts that may be applicable to a heirarchal search? and heirarchal result?

K.Barad

分享到QQ

分享到微博