Access violation when using _tcstok
I am trying to tokenize lines in a file using _tcstok. I am able to tokenize the line once, but when I try to tokenize it a second time, I get an access violation. I feel like it has something to do with not actually accessing the values, but locations instead. I'm not sure how else to do this though.
Thanks,
Dave
p.s. I'm using TCHAR and _tcstok because the file is UTF-8.
This is the error I'm getting:
First-chance exception at 0x63e866b4 (msvcr90d.dll) in Testing.exe: 0xC0000005: Access violation reading location 0x0000006c.
vector<TCHAR> TabDelimitedSource::getNext() {
// Returns the next document (a given cell) from the file(s)
TCHAR row[256]; // Return NULL if no more documents/rows
vector<TCHAR> document;
try{
//Read each line in the file, corresponding to an individual document
buff_reader->getline(row,10000);
}
catch (ifstream::failure e){
; // Ignore and fall through
}
if (_tcslen(row)>0){
this->current_row += 1;
vector<TCHAR> cells;
//Separate the line on tabs (id 'tab' document title 'tab' document body)
TCHAR * pch;
pch = _tcstok(row,"\t");
while (pch != NULL){
cells.push_back(*pch);
pch = _tcstok(NULL, "\t");
}
// Split the cell into individual words using the lucene analyzer
try{
//Separate the body by spaces
TCHAR original_document ;
original_document = (cells[column_holding_doc]);
try{
TCHAR * pc;
pc = _tcstok((char*)original_document," ");
while (pch != NULL){
document.push_back(*pc);
pc = _tcstok(NULL, "\t");
}
Comments (1)
First up, your code is a mongrel mixture of C string manipulation and C++ containers. This will just dig you into a hole. Ideally you should tokenize the line into a std::vector<std::wstring>.

Also, you're very confused about TCHAR and UTF-8. TCHAR is a character type that 'floats' between 8 and 16 bits depending on compile-time flags. UTF-8 files use between one and four bytes to represent each character. So, you probably want to hold the text as std::wstring objects, but you're going to need to explicitly convert the UTF-8 into wstrings.

But, if you just want to get anything working, focus on your tokenization. You need to store the address of the start of each token (as a TCHAR*), but your vector is a vector of TCHARs instead. When you try to use the token data, you're casting TCHARs to TCHAR* pointers, with the unsurprising result of access violations. The AV address you give is 0x0000006c, which is the ASCII code for the character 'l'.

... and then...
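
For reference, the "ideally" route (tokenizing into a std::vector<std::wstring>) might look roughly like the sketch below. This is not the answerer's code: splitOn and wordsInColumn are hypothetical names made up for illustration, and the sketch assumes the line has already been read and converted from UTF-8 into a std::wstring, which is a separate step not shown here.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not from the original post): split `text` on any
// character found in `delims`, skipping empty tokens the way strtok-style
// functions do.
std::vector<std::wstring> splitOn(const std::wstring& text,
                                  const std::wstring& delims) {
    std::vector<std::wstring> tokens;
    std::wstring::size_type start = text.find_first_not_of(delims);
    while (start != std::wstring::npos) {
        std::wstring::size_type end = text.find_first_of(delims, start);
        // substr copies the whole token, so the vector owns complete strings
        // instead of single characters or pointers into a scratch buffer.
        tokens.push_back(text.substr(start, end - start));
        start = text.find_first_not_of(delims, end);
    }
    return tokens;
}

// Hypothetical wrapper mirroring the two passes in the question:
// tabs separate the cells, spaces separate the words inside one cell.
std::vector<std::wstring> wordsInColumn(const std::wstring& row,
                                        std::size_t column) {
    std::vector<std::wstring> cells = splitOn(row, L"\t");
    if (column >= cells.size()) {
        return std::vector<std::wstring>();  // no such cell: return no words
    }
    return splitOn(cells[column], L" ");
}
```

Because each token is stored as a full string, there is nothing left to cast back into a pointer, so the pattern that produced the access violation in the question cannot occur here. The function would need to return std::vector<std::wstring> rather than vector<TCHAR> for this to fit.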