strtok() only prints the first word, and (null) for the rest

Posted on 2025-01-11 10:29:49

I am trying to parse a large text file and split it into single words using strtok. The delimiters cover all special characters, whitespace, and newlines. For some reason, when I printf() each word, it prints only the first word followed by a bunch of (null).

    ifstream textstream(textFile);
    string textLine;
    while (getline(textstream, textLine))
    {
        struct_ptr->numOfCharsProcessedFromFile[TESTFILEINDEX] += textLine.length() + 1;
        char *line_c = new char[textLine.length() + 1]; // creates a character array the length of the line
        strcpy(line_c, textLine.c_str());               // copies the line string into the character array
        char *word = strtok(line_c, delimiters);        // removes all unwanted characters
        while (word != nullptr && wordCount(struct_ptr->dictRootNode, word) > struct_ptr->minNumOfWordsWithAPrefixForPrinting)
        {
            MyFile << word << ' ' << wordCount(struct_ptr->dictRootNode, word) << '\n'; // writes each word and number of times it appears as a prefix in the tree
            word = strtok(NULL, delimiters);                                            // move to next word
            printf("%s", word);
        }
    }

Comments (3)

沫离伤花 2025-01-18 10:29:49

Rather than jumping through the hoops necessary to use strtok, I'd write a little replacement that works directly with strings, without modifying its input, something on this general order:

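#include <string>
#include <vector>

// Splits input into tokens separated by any character in delims.
std::vector<std::string> tokenize(std::string const &input, std::string const &delims = " ") {
    std::vector<std::string> ret;
    std::string::size_type start = 0;   // unsigned, so the comparison with npos is exact

    // find the first character of the next token, if any
    while ((start = input.find_first_not_of(delims, start)) != std::string::npos) {
        auto stop = input.find_first_of(delims, start+1);
        // when stop is npos, substr takes everything up to the end of input
        ret.push_back(input.substr(start, stop-start));
        start = stop;
    }
    return ret;
}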

At least to me, this seems to simplify the rest of the code quite a bit:

std::string textLine;
while (std::getline(textstream, textLine)) {
    struct_ptr->numOfCharsProcessedFromFile[TESTFILEINDEX] += textLine.length() + 1;
    auto words = tokenize(textLine, delimiters);
    for (auto const &word : words) {
        MyFile << word << ' ' << wordCount(struct_ptr->dictRootNode, word) << '\n';
        std::cout << word << '\n';
    }
}

This also avoids (among other things) the massive memory leak you had, allocating memory every iteration of your loop, but never freeing any of it.
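If strtok has to stay, the leak can also be avoided by letting a container own the scratch buffer. The following is only a sketch under assumptions: `split_with_strtok` is a hypothetical helper name, not part of the original code, and it returns copies of the tokens rather than writing them to a file.

```cpp
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: splits `line` with strtok, using a std::vector<char>
// as the scratch buffer so the memory is released automatically.
std::vector<std::string> split_with_strtok(const std::string& line,
                                           const char* delimiters) {
    // strtok writes '\0' into its input, so work on a mutable copy
    // that includes the terminating null character.
    std::vector<char> buffer(line.c_str(), line.c_str() + line.size() + 1);

    std::vector<std::string> words;
    for (char* word = std::strtok(buffer.data(), delimiters);
         word != nullptr;
         word = std::strtok(nullptr, delimiters)) {
        words.emplace_back(word);  // copy the token out of the buffer
    }
    return words;  // buffer is destroyed here, so nothing leaks
}
```

Because the vector is destroyed when the function returns, there is no `delete[]` to forget. Note that strtok keeps hidden global state, so this helper should not be called from several threads at once.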

罪歌 2025-01-18 10:29:49

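Move the printf two lines up, so that it prints word before strtok advances it. As written, the printf runs right after strtok, so once no tokens remain it is handed the null pointer that strtok returns.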

while (word != nullptr && wordCount(struct_ptr->dictRootNode, word) > struct_ptr->minNumOfWordsWithAPrefixForPrinting)
{
    printf("%s", word);
    MyFile << word << ' ' << wordCount(struct_ptr->dictRootNode, word) << '\n'; // writes each word and number of times it appears as a prefix in the tree
    word = strtok(NULL, delimiters);                                            // move to next word
}

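The ordering can be seen in isolation in a stripped-down loop. This is a minimal sketch, assuming a hypothetical helper `print_tokens` and leaving out the `wordCount` condition from the question. Passing a null pointer to `%s` is undefined behavior; some C libraries happen to print `(null)` for it, which matches the output in the question.

```cpp
#include <cstdio>
#include <cstring>

// Hypothetical helper: prints every token in `line` and returns how many
// were printed. `line` is modified in place, because strtok overwrites
// delimiters with '\0'.
int print_tokens(char* line, const char* delimiters) {
    int count = 0;
    char* word = std::strtok(line, delimiters);
    while (word != nullptr) {
        std::printf("%s\n", word);                // use the token first...
        ++count;
        word = std::strtok(nullptr, delimiters);  // ...then advance; may be nullptr
    }
    return count;
}
```

The loop condition is the only place that checks for the null pointer, so every use of `word` has to sit between that check and the next call to strtok.
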
拥有 2025-01-18 10:29:49

As @j23 pointed out, your printf is in the wrong location.

As @Jerry-Coffin points out, there are more modern, idiomatic C++ ways to accomplish what you are trying to do. Besides avoiding mutation, you can also avoid copying the words out of the text string. (In my code below, we read line by line, but if you know your whole text fits into memory, you could just as well read the whole content into a std::string.)

Using std::string_view avoids those extra copies, since a view is essentially just a pointer into your string plus a length.
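To make the "pointer plus length" point concrete, here is a small sketch. `first_chars` is a hypothetical helper introduced only for illustration; it assumes the caller keeps the original string alive and unmodified for as long as the view is used.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: returns a view of the first `count` characters of
// `text` without copying them. The view is only valid while `text` is
// alive and unmodified, so do not pass a temporary string.
std::string_view first_chars(const std::string& text, std::size_t count) {
    std::string_view view{text};   // points at text's buffer, stores its length
    return view.substr(0, count);  // a shorter view; count is clamped to the size
}
```

The returned view's `data()` is the same address as the string's own buffer, which is why no allocation or copy takes place.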

Here is how it looks for a use case where you do not need to store the words in another data structure, that is, some kind of one-pass processing of words:

#include <iostream>
#include <fstream>
#include <string>
#include <string_view>
#include <cctype>

template <class F>
void with_lines(std::istream& stream, F body) {
  for (std::string line; std::getline(stream,line);) {
    body(line);
  }
}

template <class F>
void with_words(std::istream& stream, F body) {
  with_lines(stream,[&body](std::string& line) {
    std::string_view line_view{line};  // a view over the whole line, no copy
    while (!line_view.empty()) {
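      // skip leading whitespace (the cast keeps isspace defined for negative char values)
      for (; !line_view.empty() &&
             std::isspace(static_cast<unsigned char>(line_view[0]));
           line_view.remove_prefix(1));
      size_t position = 0;
      for (; position < line_view.size() &&
             !std::isspace(static_cast<unsigned char>(line_view[position]));
           position++);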
      if (position > 0) {
        body(line_view.substr(0,position));
        line_view.remove_prefix(position);
      }
    }
  });
}

int main (int argc, const char* argv[]) {
  size_t word_count = 0;
  std::ifstream stream{"input.txt"};
  if(!stream) {
    std::cerr
      << "could not open file input.txt" << std::endl;
    return -1;
  }
  with_words(stream, [&word_count] (std::string_view word) {
    std::cout << word_count << " " << word << std::endl;
    word_count++;
  });
  std::cout
    << "input.txt contains "
    << word_count << " words."
    << std::endl;
  return 0;
}