Reading a file in C
I have an input file that I need to extract words from. The words can contain only letters and digits, so anything else is treated as a delimiter. I tried fscanf, fgets + sscanf, and strtok, but nothing seems to work.
while(!feof(file))
{
fscanf(file,"%s",string);
printf("%s\n",string);
}
The loop above clearly doesn't work because it doesn't use any of my delimiters, so I replaced the fscanf line with this:
fscanf(file,"%[A-z]",string);
It reads the first word fine but the file pointer keeps rewinding so it reads the first word over and over.
So I used fgets to read the first line and then used sscanf:
sscanf(line,"%[A-z]%n,word,len);
line+=len;
This one doesn't work either because, whatever I try, I can't move the pointer to the right place. I also tried strtok, but I can't find how to set the delimiters:
while(p != NULL) {
printf("%s\n", p);
p = strtok(NULL, " ");
This one obviously takes the blank character as the delimiter, but I have literally hundreds of delimiters.
Am I missing something here? Extracting words from a file seemed like a simple concept at first, but nothing I try really works.
What are your delimiters? The second argument to strtok should be a string containing your delimiters, and the first should be a pointer to your string the first time round, then NULL afterwards.
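The code that originally accompanied this answer did not survive on this page, so the following is only a sketch of the call pattern it describes, not the answerer's own code. The delimiter set DELIMS and the helper name print_tokens are assumptions made for illustration.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical delimiter set: extend it with every character that
   should separate words in your input. */
static const char DELIMS[] = " \t\r\n.,;:!?()[]{}\"'-";

/* Prints each token of `line` on its own line to `out` and returns the
   number of tokens found. strtok writes '\0' bytes into `line`, so the
   buffer must be writable (not a string literal). */
int print_tokens(char *line, FILE *out)
{
    int count = 0;
    /* First call: pass the string itself. */
    char *p = strtok(line, DELIMS);
    while (p != NULL) {
        fprintf(out, "%s\n", p);
        count++;
        /* Later calls: pass NULL to continue in the same string. */
        p = strtok(NULL, DELIMS);
    }
    return count;
}
```

With hundreds of delimiters the string becomes unwieldy, which is why testing each character with isalnum, as the other answers do, is likely the simpler route.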
Consider building a minimal lexer. When in state word it would remain in it as long as it sees letters and numbers. It would switch to state delimiter when encountering something else. Then it could do an exact opposite in the state delimiter.
Here's an example of a simple state machine which might be helpful. For the sake of brevity it works only with digits.
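The state machine from this answer is missing from this page. The sketch below is a reconstruction of the idea under stated assumptions, not the original code: the names print_numbers, IN_DELIMITER and IN_NUMBER are invented here, and it reads with fgetc from a FILE pointer where the original apparently used getchar.

```c
#include <ctype.h>
#include <stdio.h>

enum state { IN_DELIMITER, IN_NUMBER };

/* Copies every run of digits from `in` to `out`, one run per line.
   Returns how many runs were printed. */
int print_numbers(FILE *in, FILE *out)
{
    enum state st = IN_DELIMITER;
    int count = 0;
    int c;  /* int, not char, so EOF can be told apart from data */

    while ((c = fgetc(in)) != EOF) {
        if (st == IN_DELIMITER) {
            if (isdigit(c)) {          /* a number starts here */
                fputc(c, out);
                st = IN_NUMBER;
                count++;
            }                          /* otherwise keep skipping */
        } else {                       /* st == IN_NUMBER */
            if (isdigit(c)) {
                fputc(c, out);         /* still inside the number */
            } else {
                fputc('\n', out);      /* the number just ended */
                st = IN_DELIMITER;
            }
        }
    }
    if (st == IN_NUMBER)               /* input ended inside a number */
        fputc('\n', out);
    return count;
}
```

Calling print_numbers(stdin, stdout) from main turns this into a filter over standard input. Swapping isdigit for isalnum would make the same two states extract words made of letters and digits.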
echo "2341,452(42 555" | ./main
will print each number on a separate line. It's not a lexer, but the idea of switching between states is quite similar.
If the number of states increases, you can consider replacing the if statements that check the current state with switch blocks. Performance can be improved by replacing getchar with reading a whole block of the input into a temporary buffer and iterating through it.
If you have to deal with a more complex input file format, you can use a lexical analyser generator such as flex. It can do the job of defining state transitions and other parts of lexer generation for you.
Several points:
First of all, do not use feof(file) as your loop condition; feof won't return true until after you attempt to read past the end of the file, so your loop will execute once too often.
Second, you mentioned this: "It reads the first word fine but the file pointer keeps rewinding so it reads the first word over and over."
That's not quite what's happening; if the next character in the stream doesn't match the format specifier, scanf returns without having read anything, and string is unmodified. The stream stays stuck at the first non-matching character, so every later call fails the same way and the old contents of string get printed again.
Here's a simple, if inelegant, method: it reads one character at a time from the input file, checks to see if it's either an alpha or a digit, and if it is, adds it to a string.
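The code for this method is missing from this page, so what follows is a sketch of the approach as described rather than the answerer's original. The function name next_word and the MAX_WORD limit are assumptions introduced for illustration.

```c
#include <ctype.h>
#include <stdio.h>

#define MAX_WORD 199   /* assumed limit; longer words are split */

/* Reads the next run of letters and digits from `in` into `word`,
   which must have room for MAX_WORD + 1 bytes. Returns 1 if a word
   was stored, 0 if the end of the input was reached first. */
int next_word(FILE *in, char *word)
{
    size_t len = 0;
    int c;

    /* Skip delimiters: anything that is not a letter or a digit. */
    while ((c = fgetc(in)) != EOF && !isalnum(c))
        ;
    /* Collect characters until a delimiter, EOF, or a full buffer. */
    while (c != EOF && isalnum(c)) {
        word[len++] = (char)c;
        if (len == MAX_WORD)
            break;
        c = fgetc(in);
    }
    word[len] = '\0';
    return len > 0;
}
```

A caller would loop with something like while (next_word(file, word)) printf("%s\n", word); so the loop is driven by the result of the read itself and feof is never needed.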
I would use:
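The snippet that belonged here was lost from this page. The sketch below is one way to get the behaviour described in the next paragraph with two fscanf calls; the wrapper function print_letter_words is an assumption added so the loop can stand alone, and the scanset is written A-Za-z because A-z in ASCII would also match the characters [ \ ] ^ _ and the backtick.

```c
#include <stdio.h>

/* Prints every run of letters in `file` on its own line to `out`
   and returns the number of runs found. */
int print_letter_words(FILE *file, FILE *out)
{
    char string[200];
    int count = 0;

    for (;;) {
        /* Discard non-letters; '*' suppresses assignment. A return of
           EOF means the input ended before anything could be read. */
        if (fscanf(file, "%*[^A-Za-z]") == EOF)
            break;
        /* Store at most 199 letters plus the terminating '\0'. */
        if (fscanf(file, "%199[A-Za-z]", string) != 1)
            break;
        fprintf(out, "%s\n", string);
        count++;
    }
    return count;
}
```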
This skips over non-letters and then reads a string of up to 199 letters. The only oddness is that if you have any 'words' that are longer than 199 letters they'll be split up into multiple words, but you need the limit to avoid a buffer overflow...