Deleting several lines from a large file using a Ruby script
File 1: 1356775 lines
File 2: 9516 lines
File 2 contains lines of numbers. Any line in File 1 whose leading digits match one of these numbers should be deleted from File 1.
Example:
File 1
34234323432 some useless stuff
23423432342 more useless stuff
98989898329 foo bar blah
65367389473 one two three
File 2
234234323
653673894
New File
34234323432 some useless stuff
98989898329 foo bar blah
My approach right now is to
- Read the entire content of File 2 into an array
- Get the first line of File 1 and extract its first 8 digits
- Loop through the entire array from step 1 to see if the 8 digits from step 2 match any entry
- If nothing matches, write the line from step 2 to a new file
- If an entry matches, break out of the loop and don't write the line to the new file
- Continue until there are no more lines to read from File 1
However, since File 1 is so big, this takes an enormous amount of time, because for each line in File 1 we loop through the entire array (9516 elements). Is there a simpler way to do this type of file manipulation without putting the records from the files into a DB table?
Comments (2)
Read File 2 into a Hash with the number as key and 'true' as value. Hashes are designed to be fast at lookups, much faster than arrays.
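A minimal sketch of the Hash idea, under a few assumptions that are not in the question: the helper name `filter_lines` is made up, it works on already opened IO objects, and every number in File 2 has the same length, passed in as `prefix_len` (the example data uses 9 digits, the description says 8, so adjust as needed).

```ruby
# Hypothetical helper: streams data_io to out_io, skipping every line whose
# first prefix_len characters match a number listed in keys_io.
# Returns how many lines were skipped.
def filter_lines(data_io, keys_io, out_io, prefix_len)
  # Build the lookup table once: number => true.
  keys = {}
  keys_io.each_line do |line|
    key = line.strip
    keys[key] = true unless key.empty?
  end

  removed = 0
  data_io.each_line do |line|
    # One hash lookup per line instead of a scan over the whole key list.
    if keys.key?(line[0, prefix_len])
      removed += 1
    else
      out_io.write(line)
    end
  end
  removed
end
```

To run it against real files, open File 1 and File 2 for reading and the new file for writing with `File.open`, then pass the three handles plus the prefix length. File 1 is still read line by line, so it never has to fit in memory; only the 9516 keys are held in the Hash.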
You could read chunks of File1 into memory, avoiding a lot of blocking IO.
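One way the chunked read could look, as a rough sketch: the helper name `each_line_chunked` and the 1 MiB default chunk size are assumptions, not something from the answer. Ruby's IO already buffers reads internally, so whether this beats plain `each_line` is worth measuring rather than assuming.

```ruby
# Hypothetical helper: yields complete lines from io while reading it in
# fixed-size chunks rather than one line at a time.
def each_line_chunked(io, chunk_size = 1 << 20)
  return enum_for(:each_line_chunked, io, chunk_size) unless block_given?

  leftover = +""
  # IO#read with a positive length returns nil at end of file.
  while (chunk = io.read(chunk_size))
    buffer = leftover + chunk
    lines = buffer.lines
    # The last piece may be a partial line; hold it back for the next chunk.
    leftover = buffer.end_with?("\n") ? +"" : lines.pop
    lines.each { |line| yield line }
  end
  # Emit a final line that has no trailing newline.
  yield leftover unless leftover.empty?
end
```

Note that `read` with a length returns binary-encoded strings, which is fine for digit prefixes like the ones in the example but matters if the lines need to be treated as UTF-8 text.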