Reading tokens from a file (complicated)
I have a basic tokenization structure/algorithm in place. It's pretty complicated, and I hope I can clarify it simply enough to enlighten you about the "flaw" in my design.
    class ParserState
    {
        // bool functions return false if getline() or stream extraction '>>' fails
        static bool nextLine();  // reads and tokenizes next line from file and puts it in m_buffer
        static bool nextToken(); // gets next token from m_buffer, via fetchToken(), and puts it in m_token
        static bool fetchToken( std::string &token ); // procures next token from file/buffer
        static size_t m_lineNumber;
        static std::ifstream m_fstream;
        static std::string m_buffer;
        static std::string m_token;
    };
The reason for this setup is to be able to report the line number if a syntax error occurs. Depending on the phase/state of the parser, different things happen in my program, and subclasses of this ParserState use m_token and nextToken to continue. fetchToken calls nextLine if m_buffer is empty, and puts the next token in its argument:
    istringstream stream;
    do // read new line until valid token can be extracted
    {
        Debug(5) << "m_buffer contains: " << m_buffer << "\n";
        stream.str( m_buffer );
        if( stream >> token )
        {
            Debug(5) << "Token extracted: " << token << "\n";
            m_token = token;
            return true; // return when token found
        }
        stream.clear();
    } while( nextLine() );
    // if no tokens can be extracted from the whole file, return false
    return false;
The problem is that the extracted token is not removed from m_buffer, so the same token gets read on every call to nextToken(). m_buffer can be modified between calls, hence the call to istringstream::str in the loop. But that call is the cause of my issue, and as far as I can see, it can't be worked around. Hence my question: how can I make stream >> token remove something from the string that the stringstream holds internally? Perhaps I should not use a stringstream at all, but something more elementary in this situation (like finding the first whitespace and cutting the first token from the string)?
Thanks a billion!
PS: any suggestions altering my function/class structure are OK as long as they allow line numbers to be kept track of (so no reading the full file into m_buffer with a class member istringstream, which is what I had before I wanted line-number error reporting).
Comments (3)
Why not simply make m_buffer a std::istringstream instead of a std::string? You would remove a temporary variable as well as get the desired effect. Whenever you change m_buffer, set the stream's contents rather than assigning a string.
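As a minimal sketch of this suggestion, the class below keeps the line buffer as a std::istringstream member, so each extraction advances the read position and no token is returned twice. The name LineTokenizer, the non-static members, and the generic std::istream source are assumptions made to keep the example short; they are not the asker's actual ParserState.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical stand-in for ParserState (names and layout are assumptions).
class LineTokenizer
{
public:
    explicit LineTokenizer(std::istream& in) : m_in(in), m_lineNumber(0) {}

    // Puts the next token in `token`; returns false once the input is exhausted.
    bool fetchToken(std::string& token)
    {
        // Extraction moves m_buffer's read position forward, so a token that
        // has been read is not seen again.
        while (!(m_buffer >> token))
        {
            if (!nextLine())
                return false;
        }
        return true;
    }

    // Number of the line most recently read (1-based; 0 before any read).
    std::size_t lineNumber() const { return m_lineNumber; }

private:
    // Reads the next line into m_buffer; returns false if getline() fails.
    bool nextLine()
    {
        std::string line;
        if (!std::getline(m_in, line))
            return false;
        m_buffer.clear();   // reset the eofbit/failbit left by the failed extraction
        m_buffer.str(line); // replace the contents; reading restarts at the beginning
        ++m_lineNumber;
        return true;
    }

    std::istream& m_in;
    std::istringstream m_buffer;
    std::size_t m_lineNumber;
};
```

The point of the design is that str() is called only when a new line is loaded, not before every extraction, which is what reset the read position in the original loop.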
To avoid reading the same token multiple times, I think you have to get the position in stream using tellg and then restore it using seekg (both are std::istream member functions). However, using std::istringstream looks like overkill to me here. I would rather work with m_buffer directly.
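Working with the string directly could look roughly like the sketch below, which follows the idea from the question of finding the first whitespace and cutting the first token off the buffer. The helper name cutToken and the choice of whitespace characters are assumptions for illustration.

```cpp
#include <string>

// Hypothetical helper: moves the first whitespace-delimited token out of
// `buffer` into `token`. Returns false (and empties the buffer) when only
// whitespace is left.
bool cutToken(std::string& buffer, std::string& token)
{
    const char* const whitespace = " \t\r\n\v\f";

    const std::string::size_type begin = buffer.find_first_not_of(whitespace);
    if (begin == std::string::npos)
    {
        buffer.clear();
        return false;
    }

    // `end` is npos when the token runs to the end of the buffer; substr()
    // and erase() both clamp the count to the string's length in that case.
    const std::string::size_type end = buffer.find_first_of(whitespace, begin);
    token = buffer.substr(begin, end - begin);
    buffer.erase(0, end); // drop the leading whitespace and the token itself
    return true;
}
```

With a helper like this, fetchToken would call it on m_buffer and fall back to nextLine() whenever it returns false, so the line counter stays exactly where it is today.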
The usual scheme for handling line-number reporting is to read lines one at a time, as you have, incrementing the line count. Then, as your tokenizer starts to build a token, it takes a snapshot of the line number and stores it in the token data structure (typically containing the line number, token type, and token value, if any).
This decouples line reading from token building without losing the line number. It also means you can have lots of tokens, all with their own line numbers (including different ones); a token can start on one line and finish on another, and so on.
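A rough sketch of that scheme follows. The Token layout, the two-value TokenType enum, and the tokenize function are illustrative assumptions, and the classification rule (a leading digit means Number) is deliberately simplistic.

```cpp
#include <cctype>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical token kinds; a real parser would have more.
enum class TokenType { Number, Word };

// Each token carries a snapshot of the line on which it started.
struct Token
{
    std::size_t line;
    TokenType type;
    std::string value;
};

// Reads the input one line at a time and stamps every token with the
// line number that was current when the token was built.
std::vector<Token> tokenize(std::istream& in)
{
    std::vector<Token> tokens;
    std::string line;
    std::size_t lineNumber = 0;

    while (std::getline(in, line))
    {
        ++lineNumber;
        std::istringstream words(line);
        std::string word;
        while (words >> word)
        {
            const bool startsWithDigit =
                std::isdigit(static_cast<unsigned char>(word[0])) != 0;
            const TokenType type =
                startsWithDigit ? TokenType::Number : TokenType::Word;
            tokens.push_back(Token{lineNumber, type, word});
        }
    }
    return tokens;
}
```

Because the line number travels with each token, later parser phases can report errors from the token alone and no longer need access to the reader's current state.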