Reading a text file step by step
I have a file which has text like this:
#1#14#ADEADE#CAH0F#0#0.....
I need to write code that finds the text following each # symbol, stores it in a variable, and then writes it to a file WITHOUT the # symbol, but with a space before it. So from the text above I would get:
1 14 ADEADE CAH0F 0 0......
I first tried to do it in Python, but the files are really big and it takes a huge amount of time to process one, so I decided to write this part in C++. However, I know nothing about C++ regex, and I'm looking for help. Could you please recommend an easy regex library (I don't know C++ very well) or a well-documented one? It would be even better if you could provide a small example (I know how to write to a file using fstream, but I need help with how to read the file, as I said before).
This looks like a job for std::locale and his trusty sidekick imbue:
IMO, C++ is not the best choice for your task. But if you have to do it in C++ I would suggest you have a look at Boost.Regex, part of the Boost library.
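As a rough illustration of the regex route, here is a sketch that uses the standard library's <regex> header (C++11) rather than Boost.Regex itself; the standard facility was modelled on Boost.Regex, so the Boost version should look similar. The helper name replace_hash_regex is hypothetical.

```cpp
#include <istream>
#include <ostream>
#include <regex>
#include <string>

// Hypothetical helper: copies `in` to `out` line by line,
// replacing every '#' with a space.
void replace_hash_regex(std::istream& in, std::ostream& out) {
    static const std::regex hash("#");
    std::string line;
    while (std::getline(in, line)) {
        out << std::regex_replace(line, hash, " ") << '\n';
    }
}
```

Each output line ends with a newline, so a final line without one in the input gains one in the output.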
If you are on Unix, a simple
sed 's/#/ /g' <infile >outfile
would suffice. Sed stands for 'stream editor' (and supports regexes! whoo!), so it would be well-suited for the performance that you are looking for.
Alright, I'm just going to make this an answer instead of a comment. Don't use regex. It's almost certainly overkill for this task. I'm a little rusty with C++, so I'll not post any ugly code, but essentially what you could do is parse the file one character at a time, putting anything that wasn't a # into a buffer, then writing it out to the output file along with a space when you do hit a #.
In C# at least two really easy methods for solving this come to mind:
Alternatively, you could replace
With
I'm not saying you should do it either of these ways or my suggested method for C++, nor that any of these methods are ideal - I'm just pointing out here that there are many many ways to parse strings. Regex is awesome and powerful and may even save the day in extreme circumstances, but it's not the only way to parse text, and may even destroy the world if used for the wrong thing. Really.
If you insist on using regex (or are forced to, as in for a homework assignment), then I suggest you listen to Chris and use Boost.Regex. Alternatively, I understand Boost has a good string library as well if you'd like to try something else. Just look out for Cthulhu if you do use regex.
You've left out one crucial point: if you have two (or more) consecutive #s in the input, should they turn into one space, or into the same number of spaces as there are #s?
If you want to turn the entire string into a single space, then @Rob's solution should work quite nicely.
If you want each # turned into a space, then it's probably easiest to just write C-style code:
So, you want to replace each ONE character '#' with ONE character ' ', right?
Then it's easy to do, since you can replace any portion of the file with a string of exactly the same length without disturbing the organisation of the file.
Repeating such a replacement lets you transform the file chunk by chunk, so you avoid reading the whole file into memory, which is problematic when the file is very big.
Here's the code in Python 2.7.
Maybe replacing chunk by chunk will be insufficient to make it faster, and you'll have a hard time writing the same in C++. But in general, when I proposed such code, it has improved the execution time satisfactorily.
Comments:
it's absolutely obligatory to open the file in binary mode 'b' to control precisely the positions and movements of the file's pointer;
mode '+' is needed to be able to read AND write in the file;
the file descriptor is an integer;
the read call reads a chunk of size chunk_size. It would be clever to give it the size of the reading buffer, but I don't know how to find this buffer's size. Hence a good idea is to give it a power-of-2 value;
the file's pointer is moved back to the position from which the chunk has just been read. The offset must be len(x), not chunk_size, because the last chunk read is in general shorter than chunk_size;
the modified chunk is written over the same length;
these two instructions force the writing; otherwise the modified chunk could remain in the write buffer and be written at an uncontrolled moment;
after the last portion of the file has been read, whatever its length (less than or equal to chunk_size), the file's pointer is at the maximum position of the file, that is to say file_size, and the program must stop.
If you would like to replace several consecutive '###...' with a single character, the code can easily be modified to do so, since writing a shortened chunk doesn't erase the still-unread characters further on in the file. It only needs two file pointers.
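Since the question asks for C++, here is a sketch of the same chunk-by-chunk, in-place idea using std::fstream. It is not a translation of the Python listing, only an illustration of the one-for-one replacement under the assumption that the file can be opened for both reading and writing. The helper name replace_hash_in_place and the default chunk size are hypothetical choices.

```cpp
#include <algorithm>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: replaces every '#' in the file at `path` with a space,
// in place, one chunk at a time. Returns false if the file cannot be opened.
bool replace_hash_in_place(const std::string& path,
                           std::size_t chunk_size = 1 << 16) {
    // Binary mode keeps byte offsets exact; in|out allows reading and writing.
    std::fstream file(path, std::ios::in | std::ios::out | std::ios::binary);
    if (!file) return false;

    std::vector<char> chunk(chunk_size);
    std::streamoff pos = 0;  // start of the chunk currently being processed
    for (;;) {
        file.seekg(pos);
        file.read(chunk.data(), static_cast<std::streamsize>(chunk.size()));
        const std::streamsize got = file.gcount();  // the last chunk may be short
        if (got <= 0) break;
        file.clear();  // a short read sets eofbit/failbit; reset before seeking

        std::replace(chunk.begin(), chunk.begin() + got, '#', ' ');

        file.seekp(pos);                // back to where this chunk started
        file.write(chunk.data(), got);  // same length, so nothing shifts
        file.flush();                   // push the chunk out before the next read
        pos += got;
    }
    return true;
}
```

Because the file is modified in place, it is worth trying this on a copy first.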