Reading tokens from a file (complicated)

Posted on 2024-10-01 18:07:16

I have a basic tokenization structure/algorithm in place. It's pretty complicated, and I hope I can clarify it simply enough to enlighten you about the "flaw" in my design.

class ParserState

// bool functions return false if getline() or stream extraction '>>' fails
static bool nextLine(); // reads and tokenizes next line from file and puts it in m_buffer
static bool nextToken(); // gets next token from m_buffer, via fetchToken(), and puts it in m_token
static bool fetchToken( std::string &token ); // procures next token from file/buffer

static size_t m_lineNumber;
static std::ifstream m_fstream;
static std::string m_buffer;
static std::string m_token;

The reason for this setup is to be able to report the line number if a syntax error occurs. Depending on the phase/state of the parser, different things happen in my program, and subclasses of this ParserState use m_token and nextToken to continue. fetchToken calls nextLine if m_buffer is empty, and puts the next token in its argument:

istringstream stream;

do // read new line until valid token can be extracted
{
    Debug(5) << "m_buffer contains: " << m_buffer << "\n";
    stream.str( m_buffer );

    if( stream >> token )
    {
        Debug(5) << "Token extracted: " << token << "\n";
        m_token = token;
        return true; // return when token found
    }
    stream.clear();
} while( nextLine() );
// if no tokens can be extracted from the whole file, return false
return false;

The problem is that the extracted token isn't removed from m_buffer, so the same token gets read with every call to nextToken(). m_buffer can be modified between calls, hence the call to istringstream::str in the loop. But that call is the cause of my issue, and as far as I can see, it can't be worked around. Hence my question: how can I let stream >> token remove something from the string held internally by the stringstream? Or perhaps I should not use a stringstream at all in this situation, but something more elementary (like finding the first whitespace and cutting the first token from the string)?

Thanks a billion!

PS: Any suggestions that alter my function/class structure are OK, as long as they still allow line numbers to be tracked (so not reading the full file into m_buffer with a class-member istringstream, which is what I had before I wanted line numbers in error reports).
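
For reference, the behaviour described in the question can be reproduced in isolation. The sketch below is an illustration only, not the asker's code: the function names extractWithReset and extractWithoutReset are made up for this example. It shows that calling str() before every extraction rewinds the read position, so the first token is returned again and again, whereas a stream that is filled once keeps advancing.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Mimics the loop in the question: str() is called before every extraction,
// which rewinds the stream to the start of the buffer each time.
std::vector<std::string> extractWithReset(const std::string& buffer, int calls)
{
    std::vector<std::string> tokens;
    std::istringstream stream;
    for (int i = 0; i < calls; ++i)
    {
        stream.clear();      // drop any eof/fail state from the previous round
        stream.str(buffer);  // replaces the contents AND resets the read position
        std::string token;
        if (stream >> token)
            tokens.push_back(token);
    }
    return tokens;
}

// Same buffer, but the stream is filled once and keeps its read position.
std::vector<std::string> extractWithoutReset(const std::string& buffer, int calls)
{
    std::vector<std::string> tokens;
    std::istringstream stream(buffer);
    for (int i = 0; i < calls; ++i)
    {
        std::string token;
        if (stream >> token)
            tokens.push_back(token);
    }
    return tokens;
}
```

With the buffer "alpha beta gamma" and three calls, the first function yields "alpha" three times, while the second yields "alpha", "beta", "gamma".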


Comments (3)

晨与橙与城 2024-10-08 18:07:16

Why not simply make m_buffer a std::istringstream instead of a std::string? That removes the temporary stream and gives you the desired effect, because the read position then persists between calls. Whenever you currently change m_buffer in statements such as

m_buffer = ...

write this instead:

m_buffer.str(...);
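
A minimal sketch of this idea, under some assumptions: the class name LineTokenizer, the non-static members, and the std::istream& constructor argument are all invented for illustration and differ from the asker's static ParserState. The point it tries to show is that str() is called only when a new line is loaded, so extraction continues where it left off within a line.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical, non-static variant of the asker's ParserState.
class LineTokenizer
{
public:
    explicit LineTokenizer(std::istream& input) : m_input(input) {}

    // Puts the next token in m_token; returns false at end of input.
    bool nextToken()
    {
        while (!(m_buffer >> m_token))  // current line exhausted, or nothing read yet
        {
            if (!nextLine())
                return false;
        }
        return true;
    }

    const std::string& token() const { return m_token; }
    std::size_t lineNumber() const { return m_lineNumber; }

private:
    // Loads the next line into m_buffer; returns false if getline() fails.
    bool nextLine()
    {
        std::string line;
        if (!std::getline(m_input, line))
            return false;
        ++m_lineNumber;
        m_buffer.clear();    // reset eof/fail flags left by the failed extraction
        m_buffer.str(line);  // the only place str() is called: once per line
        return true;
    }

    std::istream& m_input;
    std::istringstream m_buffer;
    std::string m_token;
    std::size_t m_lineNumber = 0;
};
```

Because blank lines simply make the extraction fail and trigger another nextLine(), lineNumber() always refers to the line the most recent token came from.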

清引 2024-10-08 18:07:16

To avoid reading the same token multiple times, I think you have to save the position in the stream using tellg and then restore it using seekg (both are member functions of std::istream). However, using std::istringstream looks like overkill to me here. I would rather work with m_buffer directly.
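
One way to work with the string directly, sketched under assumptions: cutToken is a hypothetical helper name, and the whitespace set is chosen here to roughly match what operator>> skips in the default "C" locale. It finds the first non-whitespace run, copies it out, and erases everything up to the end of that run, so the next call sees only the remainder.

```cpp
#include <string>

// Hypothetical helper: removes the first whitespace-delimited token from
// `buffer` and stores it in `token`. Returns false if only whitespace is
// left, in which case `buffer` is cleared and `token` is left untouched.
bool cutToken(std::string& buffer, std::string& token)
{
    static const char* const whitespace = " \t\r\n\v\f";

    const std::string::size_type begin = buffer.find_first_not_of(whitespace);
    if (begin == std::string::npos)
    {
        buffer.clear();
        return false;
    }

    const std::string::size_type end = buffer.find_first_of(whitespace, begin);
    token = buffer.substr(begin, end - begin);  // end == npos takes the rest
    buffer.erase(0, end);                       // erase(0, npos) empties the string
    return true;
}
```

Since the consumed text is physically removed, m_buffer can still be modified elsewhere between calls without any stream state getting out of sync.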

几度春秋 2024-10-08 18:07:16

The usual scheme for handling line-number reporting is to read lines one at a time, as you do, incrementing the line count. Then, as your tokenizer starts to build a token, it takes a snapshot of the line number and stores it in the token data structure (which typically contains the line number, the token type, and the token value, if any).

This decouples line reading from token building without losing the line number. It also means you can have lots of tokens, each with its own line number (including different ones), a token can start on one line and finish on another, and so on.
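
A rough sketch of that scheme, with assumptions stated up front: the Token struct and the tokenize function are hypothetical names, tokens are plain whitespace-separated words, and the token type field mentioned above is omitted for brevity.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical token record; a real one would usually also carry a token type.
struct Token
{
    std::size_t line;   // line on which the token starts
    std::string value;
};

// Reads the input line by line and stamps each token with its line number.
std::vector<Token> tokenize(std::istream& input)
{
    std::vector<Token> tokens;
    std::string line;
    std::size_t lineNumber = 0;

    while (std::getline(input, line))
    {
        ++lineNumber;
        std::istringstream words(line);  // fresh stream per line, filled once
        std::string word;
        while (words >> word)
            tokens.push_back(Token{lineNumber, word});
    }
    return tokens;
}
```

This version collects all tokens up front for simplicity. The same idea works if tokens are produced one at a time, since the line number travels with each token rather than living in the reader's state.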
