
Access violation when using _tcstok

Posted on 2024-11-23 16:23:53


I am trying to tokenize lines in a file using _tcstok. I can tokenize the line once, but when I try to tokenize it a second time, I get an access violation. I feel like it has something to do with not actually accessing the values, but locations instead. I'm not sure how else to do this, though.

Thanks,

Dave

P.S. I'm using TCHAR and _tcstok because the file is UTF-8.

This is the error I'm getting:

First-chance exception at 0x63e866b4 (msvcr90d.dll) in Testing.exe: 0xC0000005: Access violation reading location 0x0000006c.

vector<TCHAR> TabDelimitedSource::getNext() {
// Returns the next document (a given cell) from the file(s)
TCHAR row[256]; // Return NULL if no more documents/rows
vector<TCHAR> document;

try{
    //Read each line in the file, corresponding to an individual document
    buff_reader->getline(row,10000);
    }
catch (ifstream::failure e){
        ; // Ignore and fall through
    }

if (_tcslen(row)>0){
    this->current_row += 1;
    vector<TCHAR> cells;
      //Separate the line on tabs (id 'tab' document title 'tab' document body)
     TCHAR *  pch;
     pch = _tcstok(row,"\t");
     while (pch != NULL){
         cells.push_back(*pch);
         pch = _tcstok(NULL, "\t");
     }

    // Split the cell into individual words using the lucene analyzer
    try{
      //Separate the body by spaces
        TCHAR original_document ;
        original_document = (cells[column_holding_doc]);
        try{
            TCHAR * pc;
            pc = _tcstok((char*)original_document," ");
             while (pch != NULL){
                 document.push_back(*pc);
                pc = _tcstok(NULL, "\t");
             }


七堇年 2024-11-30 16:23:53

First up, your code is a mongrel mixture of C string manipulation and C++ containers, which will just dig you into a hole. Ideally, you should tokenize the line into a std::vector<std::wstring>.
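As a rough illustration of that suggestion, here is a minimal sketch that splits one line into a std::vector<std::wstring> without any C string functions. The helper name split_fields is made up for this example and does not appear in the original code; the sketch also assumes the line has already been read into a std::wstring.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper (not from the original post): split `line` on a single
// delimiter character and return every field as its own std::wstring.
// Unlike strtok-style tokenizers, empty fields are kept.
std::vector<std::wstring> split_fields(const std::wstring& line, wchar_t delim) {
    std::vector<std::wstring> fields;
    std::wstring::size_type start = 0;
    for (;;) {
        const std::wstring::size_type end = line.find(delim, start);
        if (end == std::wstring::npos) {
            fields.push_back(line.substr(start));  // last field on the line
            break;
        }
        fields.push_back(line.substr(start, end - start));
        start = end + 1;  // continue after the delimiter
    }
    return fields;
}
```

Calling split_fields(row, L'\t') would give the cells, and calling it again on one cell with L' ' would give the words. Because each field is a separate string, nothing points back into a temporary buffer. Note that two tabs in a row produce an empty field here, whereas _tcstok would skip over them.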

Also, you're confusing TCHAR with UTF-8. TCHAR is a character type that 'floats' between 8 bits (char) and 16 bits (wchar_t) depending on compile-time flags, whereas a UTF-8 file uses between one and four bytes to represent each character. So you probably want to hold the text as std::wstring objects, but you will need to convert the UTF-8 into wstrings explicitly.
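To make "explicitly convert" a little more concrete, here is a hand-rolled sketch of a UTF-8 decoder that fills a std::wstring. The function name utf8_to_wstring is an assumption for this example, and the validation is deliberately minimal (overlong encodings are not rejected). On Windows, the usual route is the Win32 function MultiByteToWideChar with CP_UTF8 rather than code like this.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (not from the original post): decode UTF-8 bytes into a
// std::wstring. Emits UTF-16 surrogate pairs when wchar_t is 16 bits wide
// (as on Windows) and one wchar_t per code point otherwise.
std::wstring utf8_to_wstring(const std::string& bytes) {
    std::wstring out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        unsigned long cp;    // code point being assembled
        std::size_t extra;   // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (extra > bytes.size() - i - 1)
            throw std::runtime_error("truncated UTF-8 sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char cont = static_cast<unsigned char>(bytes[i + k]);
            if ((cont & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (cont & 0x3F);
        }
        if (cp > 0x10FFFF) throw std::runtime_error("code point out of range");
        i += extra + 1;

        if (sizeof(wchar_t) == 2 && cp > 0xFFFF) {
            cp -= 0x10000;  // split into a UTF-16 surrogate pair
            out.push_back(static_cast<wchar_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<wchar_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            out.push_back(static_cast<wchar_t>(cp));
        }
    }
    return out;
}
```

The point of the sketch is that the number of bytes read and the number of wide characters produced differ, which is why reading a UTF-8 file straight into a TCHAR buffer does not give one element per character.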

But if you just want to get something working, focus on your tokenization. You need to store the address of the start of each token (as a TCHAR*), but your vector is a vector of TCHARs instead. When you later try to use the token data, you cast a TCHAR to a TCHAR* pointer, with the unsurprising result of an access violation. The AV address you give is 0x0000006c, which is the ASCII code for the character l.

  vector<TCHAR*> cells;
  ...
  cells.push_back(pch);

... and then...

    TCHAR *original_document = cells[column_holding_doc];
    // _T() keeps the delimiter literal the same width as TCHAR
    TCHAR *pc = _tcstok(original_document, _T(" "));
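Putting the two fragments together, here is a self-contained sketch of the pointer-storing approach. It assumes a non-Unicode build, where TCHAR is char and _tcstok maps to strtok, so it uses std::strtok directly to stay portable. The helper name tokenize_in_place is hypothetical.

```cpp
#include <cassert>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper (not from the original post): tokenize `buffer` in place
// and return a pointer to the start of each token. The pointers refer into
// `buffer`, so the buffer must outlive the returned vector.
std::vector<char*> tokenize_in_place(char* buffer, const char* delims) {
    std::vector<char*> tokens;
    for (char* tok = std::strtok(buffer, delims); tok != NULL;
         tok = std::strtok(NULL, delims)) {
        tokens.push_back(tok);  // store the pointer itself, not *tok
    }
    return tokens;
}
```

With a row such as "7\tA title\thello wide world" held in a writable char array, tokenize_in_place(row, "\t") yields three cells, and tokenize_in_place(cells[2], " ") then splits the body into words. The second pass only starts after the first has finished, which matters because strtok keeps its position in hidden static state and cannot run two tokenizations at once.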


About the author

爱格式化
