Reading a file in C

Posted on 2024-12-21 11:24:32

I have an input file I need to extract words from. The words can only contain letters and numbers, so anything else is treated as a delimiter. I tried fscanf, fgets+sscanf, and strtok, but nothing seems to work.

while(!feof(file))
{
    fscanf(file,"%s",string);
    printf("%s\n",string);
}

The above clearly doesn't work because it doesn't use any delimiters, so I replaced the line with this:

 fscanf(file,"%[A-z]",string);

It reads the first word fine but the file pointer keeps rewinding so it reads the first word over and over.

So I used fgets to read the first line and then used sscanf:

sscanf(line,"%[A-z]%n,word,len);
line+=len;

This one doesn't work either because, whatever I try, I can't move the pointer to the right place. I tried strtok, but I can't find how to set the delimiters:

while(p != NULL) {
printf("%s\n", p);
p = strtok(NULL, " ");

This one obviously takes the blank character as a delimiter, but I have literally hundreds of delimiters.

Am I missing something here? Extracting words from a file seemed like a simple concept at first, but nothing I try really works.


瞎闹 2024-12-28 11:24:33

What are your delimiters? The second argument to strtok should be a string containing your delimiters, and the first should be a pointer to your string the first time round then NULL afterwards:

char *p = strtok(line, ","); // assuming a , delimiter

while (p != NULL)
{
    printf("%s\n", p);      // print the current token before fetching the next
    p = strtok(NULL, ",");  // NULL tells strtok to continue with the same string
}
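
Since the question treats everything except letters and digits as a delimiter, the delimiter string does not have to be typed out by hand. The following is a minimal sketch, assuming single-byte characters and the default "C" locale; build_delims and split_words are hypothetical helper names, not part of any library.

```c
#include <assert.h>
#include <ctype.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Fill delims with every non-zero byte value that is not a letter or digit.
   delims must have room for UCHAR_MAX + 1 bytes. (Hypothetical helper.) */
static void build_delims(char *delims)
{
    size_t n = 0;
    for (int ch = 1; ch <= UCHAR_MAX; ch++) {
        if (!isalnum(ch))
            delims[n++] = (char)ch;
    }
    delims[n] = '\0';
}

/* Split line in place with strtok and store up to max_words token pointers
   in words. Returns the number of tokens stored. (Hypothetical helper.) */
static size_t split_words(char *line, char **words, size_t max_words)
{
    char delims[UCHAR_MAX + 1];
    size_t count = 0;

    assert(line != NULL && words != NULL);
    build_delims(delims);

    for (char *p = strtok(line, delims);
         p != NULL && count < max_words;
         p = strtok(NULL, delims)) {
        words[count++] = p;
    }
    return count;
}
```

Called on a writable buffer holding "foo,bar(42)  baz_9!", split_words would store pointers to foo, bar, 42, baz and 9. Note that strtok modifies the buffer and keeps hidden state between calls, so this sketch is not suitable for string literals or for nested tokenizing.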

指尖上的星空 2024-12-28 11:24:32

Consider building a minimal lexer. While in the word state, it stays there as long as it sees letters and digits, and it switches to the delimiter state when it encounters anything else. In the delimiter state it does the exact opposite.

Here's an example of a simple state machine which might be helpful. For the sake of brevity it works only with digits. echo "2341,452(42 555" | ./main will print each number on a separate line. It's not a lexer, but the idea of switching between states is quite similar.

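#include <stdio.h>
#include <string.h>

/* enum constants are true compile-time constants, so BUFLEN can size an array */
enum { WORD = 1, DELIM = 2, BUFLEN = 1024 };

int main(void) {
  int c, state = DELIM, ptr = 0;
  char buffer[BUFLEN];
  const char *digits = "1234567890";
  while ((c = getchar()) != EOF) {
    /* strchr also matches the terminating '\0', so exclude it explicitly */
    if (c != '\0' && strchr(digits, c)) {
      if (DELIM == state) {
        ptr = 0;              /* a new word starts here */
      }
      if (ptr < BUFLEN - 1) { /* leave room for the terminator */
        buffer[ptr++] = (char)c;
      }
      state = WORD;
    } else {
      if (WORD == state) {
        buffer[ptr] = '\0';
        printf("%s\n", buffer);
      }
      state = DELIM;
    }
  }
  if (WORD == state) {        /* flush a word that ends at EOF */
    buffer[ptr] = '\0';
    printf("%s\n", buffer);
  }
  return 0;
}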

If the number of states grows, consider replacing the if statements that check the current state with switch blocks. Performance can be improved by replacing getchar with reading a whole block of input into a temporary buffer and iterating over it.

If you have to deal with a more complex input file format, you can use a lexical analyser generator such as flex. It can take care of defining the state transitions and the other parts of the lexer for you.
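
To illustrate the two suggestions about switch blocks and block reads, here is a rough sketch that reads the input with fread and dispatches on the state with switch. It assumes words are runs of letters and digits as in the question; split_stream, the 4096-byte block size, and the state names are illustrative choices, not anything prescribed by the answer.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>

enum state { IN_DELIM, IN_WORD };

/* Copy each run of letters/digits from in to out, one word per line.
   Returns the number of words written. (Hypothetical helper.) */
static size_t split_stream(FILE *in, FILE *out)
{
    unsigned char block[4096];
    enum state state = IN_DELIM;
    size_t n, words = 0;

    assert(in != NULL && out != NULL);

    /* The state survives across blocks, so a word may span two reads. */
    while ((n = fread(block, 1, sizeof block, in)) > 0) {
        for (size_t i = 0; i < n; i++) {
            int is_word_char = isalnum(block[i]) != 0;

            switch (state) {
            case IN_DELIM:
                if (is_word_char) {      /* a new word starts */
                    fputc(block[i], out);
                    words++;
                    state = IN_WORD;
                }
                break;
            case IN_WORD:
                if (is_word_char) {
                    fputc(block[i], out);
                } else {                 /* the word just ended */
                    fputc('\n', out);
                    state = IN_DELIM;
                }
                break;
            }
        }
    }
    if (state == IN_WORD)                /* input ended inside a word */
        fputc('\n', out);
    return words;
}
```

Calling split_stream(stdin, stdout) from main gives the same command-line behaviour as the program above, extended to letters. Because each character is written as soon as it is seen, no word buffer is needed and there is no length limit on words.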

吃不饱 2024-12-28 11:24:32

Several points:

First of all, do not use feof(file) as your loop condition; feof won't return true until after you attempt to read past the end of the file, so your loop will execute once too often.

Second, you mentioned this:

fscanf(file,"%[A-z]",string);
It reads the first word fine but the file pointer keeps rewinding so it reads the first word over and over.

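That's not quite what's happening. If the next character in the stream doesn't match the format specifier, fscanf returns 0 without consuming anything, and string is left unmodified. Every later call stops at that same delimiter, so the loop keeps printing the old contents of string.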
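
The same behaviour explains the fgets + sscanf attempt in the question: the return value has to be checked, %n needs the address of an int, and the delimiter has to be stepped over by hand. Below is a minimal sketch under those assumptions. next_word and the 64-byte word buffer are hypothetical choices, and the scanset is written as A-Za-z0-9 because in ASCII the range A-z also covers the punctuation characters between Z and a.

```c
#include <assert.h>
#include <stdio.h>

/* Copy the next run of letters/digits found at or after *cursor into word
   and advance *cursor past it. Returns 1 on success, 0 when the string is
   exhausted. Words longer than 63 characters are returned in pieces.
   (Hypothetical helper.) */
static int next_word(const char **cursor, char word[64])
{
    assert(cursor != NULL && *cursor != NULL && word != NULL);

    const char *p = *cursor;
    int len = 0;

    while (*p != '\0') {
        /* sscanf returns 1 only if the scanset matched at least one
           character; %n then reports how many characters were consumed. */
        if (sscanf(p, "%63[A-Za-z0-9]%n", word, &len) == 1) {
            *cursor = p + len;
            return 1;
        }
        p++;  /* no match: nothing was consumed, so skip the delimiter here */
    }
    *cursor = p;
    return 0;
}
```

A caller would set a const char * to the start of the line returned by fgets and call next_word in a loop until it returns 0. Some sscanf implementations scan the whole remaining string on every call, so for very long lines the character-by-character approach may be faster.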

Here's a simple, if inelegant, method: it reads one character at a time from the input file, checks to see if it's either an alpha or a digit, and if it is, adds it to a string.

#include <stdio.h>
#include <ctype.h>

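/**
 * Read the next alphanumeric word from file into word.
 * wordSize must be at least 2.
 * Returns 1 if a word was read, 0 at end of input.
 */
int get_next_word(FILE *file, char *word, size_t wordSize)
{
  size_t i = 0;
  int c;

  /**
   * Skip over any non-alphanumeric characters
   */
  while ((c = fgetc(file)) != EOF && !isalnum(c))
    ; // empty loop

  /**
   * Store characters up to the next non-alphanumeric character,
   * leaving room for the terminator
   */
  while (c != EOF && isalnum(c) && i < wordSize - 1)
  {
    word[i++] = (char)c;
    c = fgetc(file);
  }

  /**
   * If the buffer filled up mid-word, push the extra character back
   * so the next call continues from it
   */
  if (c != EOF && isalnum(c))
    ungetc(c, file);

  word[i] = 0;
  return i > 0; // also reports a final word that ends exactly at EOF
}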

int main(void)
{
   char word[SIZE]; // where SIZE is large enough to handle expected inputs
   FILE *file;
   ...
   while (get_next_word(file, word, sizeof word))
     // do something with word
   ...
}

梦初启 2024-12-28 11:24:32

I would use:

FILE *file;
char string[200];

while(fscanf(file, "%*[^A-Za-z]"), fscanf(file, "%199[a-zA-Z]", string) > 0) {
    /* do something with string... */
}

This skips over non-letters and then reads a string of up to 199 letters. The only oddity is that any 'word' longer than 199 letters will be split into multiple words, but you need the limit to avoid a buffer overflow.
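
Because the question counts digits as word characters too, the scansets can be widened accordingly. The sketch below wraps the two fscanf calls in a function so the return value is easy to test; read_word is a hypothetical name, and the 200-byte buffer simply mirrors the snippet above.

```c
#include <assert.h>
#include <stdio.h>

/* Read the next run of ASCII letters/digits from file into word, which must
   hold at least 200 bytes. Returns 1 if a word was read, 0 at end of input.
   (Hypothetical helper.) */
static int read_word(FILE *file, char word[200])
{
    assert(file != NULL && word != NULL);

    /* Discard any delimiters. The result is deliberately ignored: matching
       zero delimiters is not an error. */
    (void)fscanf(file, "%*[^A-Za-z0-9]");

    /* fscanf returns 1 on a match and EOF once the input is exhausted. */
    return fscanf(file, "%199[A-Za-z0-9]", word) == 1;
}
```

A loop such as while (read_word(file, string)) then processes one word per iteration, with file opened beforehand by fopen.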
