为什么我的有限状态机需要这么长时间才能执行?

发布于 2024-08-25 12:51:14 字数 2895 浏览 11 评论 0原文

我正在开发一个状态机,它应该提取形式的函数调用

/* I am a comment */
//I am a comment
pref("this.is.a.string.which\"can have QUOTES\"", 123456);

,其中提取的数据将是 pref("this.is.a.string.which\"can have QUOTES\"", 123456) ; 从一个文件。目前,处理一个 41kb 的文件需要接近一分半钟的时间。我对这个有限状态机有严重误解吗?

#include <boost/algorithm/string.hpp>
std::vector<std::string> Foo()
{
    std::string fileData;
    //Fill filedata with the contents of a file
    std::vector<std::string> results;
    std::string::iterator begin = fileData.begin();
    std::string::iterator end = fileData.end();
    std::string::iterator stateZeroFoundLocation = fileData.begin();
    std::size_t state = 0;
    for(; begin < end; begin++)
    {
        switch (state)
        {
        case 0:
            if (boost::starts_with(boost::make_iterator_range(begin, end), "pref(")) {
                stateZeroFoundLocation = begin;
                begin += 4;
                state = 2;
            } else if (*begin == '/')
                state = 1;
            break;
        case 1:
            state = 0;
            switch (*begin)
            {
            case '*':
                begin = boost::find_first(boost::make_iterator_range(begin, end), "*/").end();
                break;
            case '/':
                begin = std::find(begin, end, L'\n');
            }
            break;
        case 2:
            if (*begin == '"')
                state = 3;
            break;
        case 3:
            switch(*begin)
            {
            case '\\':
                state = 4;
                break;
            case '"':
                state = 5;
            }
            break;
        case 4:
            state = 3;
            break;
        case 5:
            if (*begin == ',')
                state = 6;
            break;
        case 6:
            if (*begin != ' ')
                state = 7;
            break;
        case 7:
            switch(*begin)
            {
            case '"':
                state = 8;
                break;
            default:
                state = 10;
                break;
            }
            break;
        case 8:
            switch(*begin)
            {
            case '\\':
                state = 9;
                break;
            case '"':
                state = 10;
            }
            break;
        case 9:
            state = 8;
            break;
        case 10:
            if (*begin == ')')
                state = 11;
            break;
        case 11:
            if (*begin == ';')
                state = 12;
            break;
        case 12:
            state = 0;
            results.push_back(std::string(stateZeroFoundLocation, begin));
        };
    }
    return results;
}

Billy3

编辑:嗯,这是我见过的最奇怪的事情之一。我刚刚重建了该项目,它又正常运行了。奇怪。

I'm working on a state machine which is supposed to extract function calls of the form

/* I am a comment */
//I am a comment
pref("this.is.a.string.which\"can have QUOTES\"", 123456);

where the extracted data would be pref("this.is.a.string.which\"can have QUOTES\"", 123456);
from a file. Currently, to process a 41kb file, this process is taking close to a minute and a half. Is there something I'm seriously misunderstanding here about this finite state machine?

#include <boost/algorithm/string.hpp>
std::vector<std::string> Foo()
{
    std::string fileData;
    //Fill filedata with the contents of a file
    std::vector<std::string> results;
    std::string::iterator begin = fileData.begin();
    std::string::iterator end = fileData.end();
    std::string::iterator stateZeroFoundLocation = fileData.begin();
    std::size_t state = 0;
    for(; begin < end; begin++)
    {
        switch (state)
        {
        case 0:
            if (boost::starts_with(boost::make_iterator_range(begin, end), "pref(")) {
                stateZeroFoundLocation = begin;
                begin += 4;
                state = 2;
            } else if (*begin == '/')
                state = 1;
            break;
        case 1:
            state = 0;
            switch (*begin)
            {
            case '*':
                begin = boost::find_first(boost::make_iterator_range(begin, end), "*/").end();
                break;
            case '/':
                begin = std::find(begin, end, L'\n');
            }
            break;
        case 2:
            if (*begin == '"')
                state = 3;
            break;
        case 3:
            switch(*begin)
            {
            case '\\':
                state = 4;
                break;
            case '"':
                state = 5;
            }
            break;
        case 4:
            state = 3;
            break;
        case 5:
            if (*begin == ',')
                state = 6;
            break;
        case 6:
            if (*begin != ' ')
                state = 7;
            break;
        case 7:
            switch(*begin)
            {
            case '"':
                state = 8;
                break;
            default:
                state = 10;
                break;
            }
            break;
        case 8:
            switch(*begin)
            {
            case '\\':
                state = 9;
                break;
            case '"':
                state = 10;
            }
            break;
        case 9:
            state = 8;
            break;
        case 10:
            if (*begin == ')')
                state = 11;
            break;
        case 11:
            if (*begin == ';')
                state = 12;
            break;
        case 12:
            state = 0;
            results.push_back(std::string(stateZeroFoundLocation, begin));
        };
    }
    return results;
}

Billy3

EDIT: Well this is one of the strangest things I've ever seen. I just rebuilt the project and it's running reasonably again. Odd.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(5

秋日私语 2024-09-01 12:51:14

除非您的 41 kb 文件主要是注释或首选项,否则它将大部分时间都处于状态 0。对于状态 0 中的每个字符,您至少进行两次函数调用。

if (boost::starts_with(boost::make_iterator_range(begin, end), "pref(")) {

您可以通过预先测试来加快速度,看看当前字符是否为“p”。

if (*begin == 'p' && boost::starts_with(boost::make_iterator_range(begin, end), "pref(")) {

如果字符不是“p”,则无需进行任何函数调用。特别是不创建迭代器,这可能是花费时间的地方。

Unless your 41 kb file is mostly comments or prefs, it's going to spend most of its time in state 0. And for each character in state 0, you make a minimum of two function calls.

if (boost::starts_with(boost::make_iterator_range(begin, end), "pref(")) {

You can speed this up by pre-testing to see if the current character is 'p'

if (*begin == 'p' && boost::starts_with(boost::make_iterator_range(begin, end), "pref(")) {

If the character isn't 'p' then there is no need to make any function calls. In particular not creating a iterator, which is probably where the time is being spent.

Hello爱情风 2024-09-01 12:51:14

我不知道这是否是问题的一部分,但在 case 0 中你有一个拼写错误,“perf”被错误地拼写为“pref”。

I don't know if this is part of the problem, but you have a typo in case 0, "perf" is misspelled as "pref".

吹泡泡o 2024-09-01 12:51:14

好吧,仅通过观察很难说……但我猜查找算法正在这样做。为什么要在 FSM 内搜索?根据定义,您应该一次给他们一个字符......添加更多状态。还可以尝试将结果设为列表,而不是向量。大量的复制正在进行

vector<string>

,但主要是:
简介它!

Well it's hard to say just by looking at it...but I'm guessing the find algorithms are doing it. Why are you searching within a FSM? By definition you're supposed to giving them one character at a time....Add more states. Also try making results a list, not a vector. A lot of copying going on with

vector<string>

But mostly:
Profile it!

请持续率性 2024-09-01 12:51:14

有限状态机是一种可行的解决方案,但对于文本处理,最好使用高度优化的有限状态机生成器。在本例中,是一个正则表达式。这里是 Perl 正则表达式:

# first clean the comments
$source =~ s|//.*$||;      # replace "// till end of line" with nothing
$source =~ s|/\*.*?\*/||s; # replace "/* any text until */" with nothing
                           # depending on your data, you may need a few other
                           # rules here to avoid blanking data, you could replace
                           # the comments with a unique identifier, and then
                           # expand any identifiers that the regex below returns

# then find your data
while ($source =~ /perf\(\s*"(.+?)",\s*(\d+)\s*\);/g) { 
   # matches your function signature and moves along source
   # do something with the captured groups, in this case $1 and $2
}

由于大多数正则表达式库都与 Perl 兼容,因此翻译语法应该不难。如果您的搜索变得更加复杂,那么解析器就合适了。

Finite state machines are a workable solution, but for text processing, its best to use a highly optimized finite state machine generator. In this case, a regular expression. Here it is as Perl regex:

# first clean the comments
$source =~ s|//.*$||;      # replace "// till end of line" with nothing
$source =~ s|/\*.*?\*/||s; # replace "/* any text until */" with nothing
                           # depending on your data, you may need a few other
                           # rules here to avoid blanking data, you could replace
                           # the comments with a unique identifier, and then
                           # expand any identifiers that the regex below returns

# then find your data
while ($source =~ /perf\(\s*"(.+?)",\s*(\d+)\s*\);/g) { 
   # matches your function signature and moves along source
   # do something with the captured groups, in this case $1 and $2
}

Since most regex libraries are Perl compatible, it shouldn't be hard to translate the syntax. If your search becomes more complicated, then a parser is in order.

梨涡 2024-09-01 12:51:14

如果您正在进行解析,为什么不使用解析器库。

我通常有 Boost.Spirit.Qi 记住。

  • 您可以使用类似 EBNF 的表达式来表达您的语法,这无疑使维护变得更加容易。
  • 它是一个仅包含头文件的库,因此您在混合中引入整个二进制文件不会有任何问题。

虽然我可以欣赏极简主义方法,但恐怕您自己编写有限状态机的想法并不明智。它与玩具示例配合得很好,但随着需求的增加,您将拥有一个巨大的开关,并且理解正在发生的事情将变得越来越复杂。

请不要告诉我你知道它不会进化:我不相信神谕;)

If you are doing parsing, why not use a parser library.

I typically have Boost.Spirit.Qi in mind.

  • You express your grammar with EBNF-like expressions which certainly makes it easier for maintenance.
  • It's a header only library, so you don't have any problem of bringing a whole binary in the mix.

While I can appreciate the minimalist approach, I am afraid that your idea of coding a finite state machine on your own is not that wise. It works well with a toy example, but as the requirements add up you'll have one monstrous switch and it's going to become more and more complicated to understand what's going on.

And please, don't tell me you know it's not going to evolve: I don't believe in oracles ;)

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文