Java:读取文件输入内容并在找到某些线条图案序列时对其进行过滤
我需要处理输入文件,并将其内容(按 ilne 行)复制到输出文件。但是,此输入文件中有一些不重要的数据(杂散),我需要跳过。 我试图解决的主要问题实际上比这更复杂,但我只是要简化问题:
所以,我有一个包含数十万行的输入文件。 如果输入文件中出现以下 3 行序列:
一个
B
C
然后我需要跳过这 3 行并继续输入文件中的下一行。如果这 3 行作为连续行的序列出现,我只能跳过这 3 行。
例如:
输入文件:
A
一个
B
C
B
P
一个
B
C
一个
B
一个
一个
B
C
输出文件
:
一个
B
P
一个
B
一个
澄清
:
一个
A(跳过)
B(跳过)
C(跳过)
B
P
A(跳过)
B(跳过)
C(跳过)
一个
B
一个
A(跳过)
B(跳过)
C(跳过)
请
注意,只有当行序列(A、B、C)按顺序出现时,我才能跳过它们。所有其他未跳过的行都必须复制到输出文件中。 如果我使用 BufferedReader.nextLine(),如果下一行与输入模式不匹配,我将无法回溯到前一行。例如,如果我已经遇到一个A,并且下一行是另一个A(不是B),那么我必须将第一个A复制到输出文件,并从我没有处理的第二个A开始再次过滤,并检查下一行,依此类推。
我能想到的一种方法是首先保存输入文本文件的内容,这样如果它与我正在寻找的模式不匹配,我可以在遍历输入文件内容时轻松回溯。然而,这不是一个内存明智的解决方案。有没有什么巧妙的算法来解决这个问题,最好是一次性遍历,即O(N)复杂度?或者,如果这是不可能的,那么仍然是内存方面的最佳解决方案是什么?一些示例 C/Java 代码将会非常有帮助。
I need to process an input file, and copy its content (line by ilne) to an output file. However, there are some unimportant data (stray) inside this input file that I need to skip.
The main problem that I am trying to solve is actually more complicated than this, but I am just gonna simplify the problem:
So, I have an input file containing hundreds of thousands of lines.
If the following sequence of 3-lines occur inside the input file:
A
B
C
then I need to skip these 3 lines and proceed with the next line in the input file. I can only skip these 3 lines if these 3 lines occur as a sequence of consecutive lines.
For example:
Input File:
A
A
B
C
B
P
A
B
C
A
B
A
A
B
C
A
Output file:
A
B
P
A
B
A
A
Clarification:
A
A (skipped)
B (skipped)
C (skipped)
B
P
A (skipped)
B (skipped)
C (skipped)
A
B
A
A (skipped)
B (skipped)
C (skipped)
A
Notice that I can skip the sequence of lines (A, B, C) only if they occur sequentially. All the other lines which are not skipped have to be copied to the output file.
If I use BufferedReader.nextLine(), I cannot backtrack to the previous lines if the next line doesn't match the input pattern. For example, if I already encounter an A, and the next line is another A (not B), I then have to copy the first A to the output file, and start the filtering again from the second A which I have not processed, and check the next following line, and so on.
One way that I can think of is to firstly save the contents of the input text file, so I can easily backtrack when traversing the input file contents if it doesn't match the pattern I am looking for. However this is not a memory-wise solution. Is there any clever algorithm to solve this, preferably in one time traversal, i.e. O (N) complexity? Or if this is not possible, what would be the most optimal solution which is still memory-wise? Some example C / Java codes will be really helpful.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
您可以使用三元素数组来完成此操作。
每当遇到 A 时,检查数组的第一个元素是否为空——如果不是,则将数组刷新到输出文件——然后将新的 A 存储到数组的第一个元素。
每当遇到 B 时,检查数组的第二个元素是否为空但第一个元素已满 - 如果不是,则将数组与新的 B 一起刷新到输出文件。否则(即,如果第一个元素是已满,但第二个为空)您将把新的 B 存储为数组的第二个元素。
对于 C,重复 B 的逻辑,递增 1:每当遇到 C 时,检查数组的第三个元素是否为空,但第二个元素是否已满 - 如果不是,则将数组与new C。否则(即,如果第二个元素已满,但第三个元素为空),您将把新 C 存储为数组的第三个元素。
当您既没有遇到 A、B 也没有遇到 C 时,请将所有现有数组元素刷新到输出文件,然后将新行直接写入输出文件。
这里的主要技巧是,您定义显式规则来填充缓冲区数组的每个槽,并使用它来避免重新检查任何行匹配,同时将缓冲区刷新到输出并在破坏模式时重置序列。
当然,您承认您的实际规则集有些复杂,但相同类型的方法应该有效。
You could do this with a 3-element array.
Whenever you encounter an A, check if the first element of the array is empty -- if not, flush the array to the output file -- then store the new A to the first element of the array.
Whenever you encounter a B, check if the second element of the array is empty but the first element is full -- if not, flush the array to the output file along with the new B. Otherwise (that is, if the first element is full but the 2nd is empty) you'll store the new B as the 2nd element of the array.
For C, repeat the logic for B, incremented by one: Whenever you encounter a C, check if the third element of the array is empty but the 2nd element is full -- if not, flush the array to the output file along with the new C. Otherwise (that is, if the 2nd element is full but the 3rd is empty) you'll store the new C as the 3rd element of the array.
When you encounter neither A nor B nor C, flush any existing array elements to the output file, then write the new line directly to the output file.
The main trick here is that you're defining explicit rules for filling each slot of the buffer array, and using this to avoid re-checking any line matches, while flushing the buffer to the output and resetting the sequence whenever you break pattern.
Granted, you admit that your actual ruleset is somewhat more complicated, but the same type of approach should work.
我假设你的线条比“A”、“B”和“C”更复杂,但是有一些方法可以从“B”和“C”中挑选“A”。
(如果它们确实是 A、B 和 C,那么您不需要存储任何内容)
我会制作一个小型状态机类型程序。
I'm assuming your lines are more complex than just "A", "B" and "C", but there is some way to pick an "A" from a "B" from a "C".
(If they really are A, B anc C then you don't need to store anything)
I'd make a little state machine type program.
您可以使用 mark< /a>
并重置您的流以“倒带”
You could use mark
and reset on your stream to "rewind"