Why does my finite state machine take so long to execute?

Posted 2024-08-25 12:51:14

I'm working on a state machine which is supposed to extract function calls of the form

/* I am a comment */
//I am a comment
pref("this.is.a.string.which\"can have QUOTES\"", 123456);

where the data to extract from the file would be pref("this.is.a.string.which\"can have QUOTES\"", 123456);
Currently, processing a 41 KB file takes close to a minute and a half. Am I seriously misunderstanding something about this finite state machine?

#include <algorithm>
#include <string>
#include <vector>
#include <boost/algorithm/string.hpp>
std::vector<std::string> Foo()
{
    std::string fileData;
    //Fill filedata with the contents of a file
    std::vector<std::string> results;
    std::string::iterator begin = fileData.begin();
    std::string::iterator end = fileData.end();
    std::string::iterator stateZeroFoundLocation = fileData.begin();
    std::size_t state = 0;
    for(; begin < end; begin++)
    {
        switch (state)
        {
        case 0:
            if (boost::starts_with(boost::make_iterator_range(begin, end), "pref(")) {
                stateZeroFoundLocation = begin;
                begin += 4;
                state = 2;
            } else if (*begin == '/')
                state = 1;
            break;
        case 1:
            state = 0;
            switch (*begin)
            {
            case '*':
                begin = boost::find_first(boost::make_iterator_range(begin, end), "*/").end();
                break;
            case '/':
                begin = std::find(begin, end, '\n');
            }
            break;
        case 2:
            if (*begin == '"')
                state = 3;
            break;
        case 3:
            switch(*begin)
            {
            case '\\':
                state = 4;
                break;
            case '"':
                state = 5;
            }
            break;
        case 4:
            state = 3;
            break;
        case 5:
            if (*begin == ',')
                state = 6;
            break;
        case 6:
            if (*begin != ' ')
                state = 7;
            break;
        case 7:
            switch(*begin)
            {
            case '"':
                state = 8;
                break;
            default:
                state = 10;
                break;
            }
            break;
        case 8:
            switch(*begin)
            {
            case '\\':
                state = 9;
                break;
            case '"':
                state = 10;
            }
            break;
        case 9:
            state = 8;
            break;
        case 10:
            if (*begin == ')')
                state = 11;
            break;
        case 11:
            if (*begin == ';')
                state = 12;
            break;
        case 12:
            state = 0;
            results.push_back(std::string(stateZeroFoundLocation, begin));
        };
    }
    return results;
}

Billy3

EDIT: Well this is one of the strangest things I've ever seen. I just rebuilt the project and it's running reasonably again. Odd.


秋日私语 2024-09-01 12:51:14

Unless your 41 kb file is mostly comments or prefs, it's going to spend most of its time in state 0. And for each character in state 0, you make a minimum of two function calls.

if (boost::starts_with(boost::make_iterator_range(begin, end), "pref(")) {

You can speed this up by first checking whether the current character is 'p':

if (*begin == 'p' && boost::starts_with(boost::make_iterator_range(begin, end), "pref(")) {

If the character isn't 'p', there is no need to make any function calls at all. In particular, no iterator range gets created, which is probably where the time is being spent.
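
The same idea can be sketched without Boost. This is only an illustration: startsWithAt is a hypothetical helper, not part of the original code, and it assumes the file contents are held in a std::string. It rejects on the first character before doing any wider comparison.

```cpp
#include <string>

// Hypothetical helper (not from the original post): returns true when
// `prefix` occurs in `text` starting at index `pos`.
bool startsWithAt(const std::string& text, std::string::size_type pos,
                  const std::string& prefix)
{
    if (prefix.empty())
        return true;
    // Cheap rejection: most positions fail on the very first character,
    // so the full comparison below is rarely reached.
    if (pos >= text.size() || text[pos] != prefix[0])
        return false;
    // compare() clamps the length to the end of `text`, so a truncated
    // match near the end of the input simply compares unequal.
    return text.compare(pos, prefix.size(), prefix) == 0;
}
```

In state 0 of the loop above, this could be called with the offset begin - fileData.begin() and a prefix string constructed once outside the loop, so that nothing is built per character.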


Hello爱情风 2024-09-01 12:51:14

I don't know if this is part of the problem, but you have a typo in case 0: "perf" is misspelled as "pref".


吹泡泡o 2024-09-01 12:51:14

Well, it's hard to say just by looking at it, but I'm guessing the find algorithms are responsible. Why are you searching inside an FSM? By definition, you're supposed to feed it one character at a time. Add more states. Also try making results a list rather than a vector; there's a lot of copying going on with vector<string>.

But mostly: profile it!
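
As a rough sketch of what "add more states" could look like for the comment handling, here is a scanner that consumes exactly one character per step and never searches ahead. The function name stripComments and the state names are made up for this illustration; it also assumes that comment markers inside double-quoted strings should be left alone.

```cpp
#include <string>

// Hypothetical helper: removes // and /* */ comments, one character per step.
std::string stripComments(const std::string& input)
{
    enum State { Code, Slash, LineComment, BlockComment, BlockStar,
                 String, StringEscape };
    State state = Code;
    std::string out;
    out.reserve(input.size());

    for (std::string::size_type i = 0; i < input.size(); ++i) {
        const char c = input[i];
        switch (state) {
        case Code:
            if (c == '/') { state = Slash; break; }   // hold the slash back for now
            if (c == '"') state = String;
            out += c;
            break;
        case Slash:
            if (c == '/') { state = LineComment; break; }
            if (c == '*') { state = BlockComment; break; }
            out += '/';                               // it was an ordinary slash
            out += c;
            state = (c == '"') ? String : Code;
            break;
        case LineComment:
            if (c == '\n') { out += c; state = Code; } // keep the line break
            break;
        case BlockComment:
            if (c == '*') state = BlockStar;
            break;
        case BlockStar:
            if (c == '/') state = Code;
            else if (c != '*') state = BlockComment;
            break;
        case String:
            out += c;
            if (c == '\\') state = StringEscape;
            else if (c == '"') state = Code;
            break;
        case StringEscape:
            out += c;                                 // escaped character, e.g. \"
            state = String;
            break;
        }
    }
    if (state == Slash)
        out += '/';                                   // input ended on a lone slash
    return out;
}
```

Every state does constant work per character, so the whole pass is linear in the file size, and the pref(...) states from the question can then run over the cleaned text without any comment logic.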


请持续率性 2024-09-01 12:51:14

Finite state machines are a workable solution, but for text processing it's best to use a highly optimized finite state machine generator: in this case, a regular expression. Here it is as a Perl regex:

# first clean the comments
$source =~ s|//.*$||mg;     # replace "// till end of line" with nothing (/m: per line, /g: every match)
$source =~ s|/\*.*?\*/||sg; # replace "/* any text until */" with nothing (/s: . also matches newlines)
                            # depending on your data, you may need a few other
                            # rules here to avoid blanking data, you could replace
                            # the comments with a unique identifier, and then
                            # expand any identifiers that the regex below returns

# then find your data
while ($source =~ /pref\(\s*"(.+?)",\s*(\d+)\s*\);/g) {
   # matches your function signature and moves along source
   # do something with the captured groups, in this case $1 and $2
}

Since most regex libraries are Perl-compatible, it shouldn't be hard to translate the syntax. If your search becomes more complicated, then a parser is in order.
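
For the C++ in the question, one possible translation uses std::regex from C++11. This is a sketch under assumptions: findPrefs is a hypothetical name, comments are assumed to have been removed already, and, like the Perl version, it only accepts a numeric second argument. The quoted-string part is written so that escaped quotes such as \" stay inside the capture.

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: returns (key, value) pairs for every
// pref("key", 123); call found in `source`.
std::vector<std::pair<std::string, std::string> >
findPrefs(const std::string& source)
{
    // Group 1: string contents, allowing backslash escapes such as \"
    // Group 2: the numeric second argument
    static const std::regex prefCall(
        R"re(pref\(\s*"((?:[^"\\]|\\.)*)"\s*,\s*(\d+)\s*\)\s*;)re");

    std::vector<std::pair<std::string, std::string> > found;
    const std::sregex_iterator last;
    for (std::sregex_iterator it(source.begin(), source.end(), prefCall);
         it != last; ++it) {
        found.push_back(std::make_pair((*it)[1].str(), (*it)[2].str()));
    }
    return found;
}
```

The regex object is built once and reused, since constructing a std::regex is comparatively expensive. Whether this ends up faster than the hand-written state machine depends on the standard library implementation, so it is worth measuring both.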
