How do I find words from a file in the files listed in another file?
Searching a single file for a word is easy:
grep stuff file.txt
But I have many files, each listed on its own line in files.txt, and many words I want to find, each on its own line in words.txt. The output should be a file in which each line has the form a => b, where a is the line number in words.txt and b is the line number in files.txt.
I need to run it on OS X, so preferably something simple in shell, but any other language would be fine. I haven't had much experience with shell scripts, and I'm more used to languages that aren't well suited to string searching (namely C; I'm guessing Perl or Python may be helpful, but I've not used them).
You might find this to be faster, more Pythonic, and easier to understand:
This is a two-parter with awk:
1. Scan each file in files.txt, and map the word number to the name of the file.
2. Map the file name to the line number in files.txt.
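
The two parts above can also be sketched in Python rather than awk. This is only an illustration of the idea, not the answerer's script: the function names read_numbered and word_file_pairs are hypothetical, matching is assumed to be plain substring matching, and line numbers are assumed to start at 1.

```python
def read_numbered(path):
    """Return (line_number, text) pairs for the non-blank lines of path, numbered from 1."""
    with open(path) as f:
        return [(n, line.strip()) for n, line in enumerate(f, start=1) if line.strip()]


def word_file_pairs(words_path, files_path):
    """Return sorted 'a => b' strings (a: line in words_path, b: line in files_path)."""
    words = read_numbered(words_path)
    files = read_numbered(files_path)

    # Part 1: scan each listed file and record (word number, file name) hits.
    hits = []
    for _, filename in files:
        with open(filename) as f:
            text = f.read()
        for word_no, word in words:
            if word in text:  # assumption: substring match, not whole-word match
                hits.append((word_no, filename))

    # Part 2: map each file name back to its line number in the file list.
    # If a name is listed twice, the last line number wins.
    file_no = {name: n for n, name in files}
    pairs = sorted({(word_no, file_no[name]) for word_no, name in hits})
    return ["%d => %d" % pair for pair in pairs]
```

Calling word_file_pairs("words.txt", "files.txt") and writing the returned lines to a file would give the output format asked for in the question.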
First, learn to specify the files of interest. Are they in one directory or in more than one? The Unix find utility will do that. At the Bash prompt:
You did not say, but presumably the files can be described as "star dot something", in which case find will find them.
Next, pipe the file names to what you want to do with them:
That will run egrep on each file with the search pattern stuff.
Google find plus xargs for literally thousands of examples. Once you are comfortable finding the files, rephrase your question so that it is more obvious what you want to do with them. Then I can help you do it in Perl.
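
For readers who would rather stay in one language, the same "find the files, then search each one" idea can be sketched in Python. This is a rough equivalent under stated assumptions, not the commands from the answer: find_files and search_files are hypothetical names, and the "*.txt" pattern in the usage note is only an example of "star dot something".

```python
import fnmatch
import os
import re


def find_files(root, pattern):
    """Yield paths under root whose base name matches a shell-style pattern."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in fnmatch.filter(filenames, pattern):
            yield os.path.join(dirpath, name)


def search_files(paths, regex):
    """Yield (path, line_number, line) for every line that matches regex."""
    compiled = re.compile(regex)
    for path in paths:
        # errors="replace" keeps the scan going if a file is not valid text.
        with open(path, errors="replace") as f:
            for line_no, line in enumerate(f, start=1):
                if compiled.search(line):
                    yield path, line_no, line.rstrip("\n")
```

For example, search_files(find_files(".", "*.txt"), "stuff") walks the current directory tree and reports each matching line together with its file and line number.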
The following Python script does it. This is my first attempt at Python, so I'd appreciate any comments:
Here's something that will do what you want. The only catch is that it does not print the matched word; it prints just the matched line, the file name, and the line number. However, if you use --color=auto on grep, it will highlight the matched words using whatever you have set in ${GREP_COLOR}; the default is red.
This command will dump the contents of files.txt line by line and pipe the file names to grep, which will search each file for every word in words.txt. Like files.txt, words.txt should contain all the search terms you want, delimited by newlines.
If your grep was built with the Perl regular expression engine, you can use Perl regular expressions by passing the -P option to grep, like so:
Hope this helps.
Update: At first I wasn't really sure what @Zeophlite was asking, but after he posted his example I see what he wanted. Here's a Python implementation of what he wants to do:
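
A minimal sketch of such an implementation might look like the following. It is not the answerer's own code: the names find_word_file_pairs and write_pairs are hypothetical, whole-word matching (similar in spirit to grep -w) is an assumption about what "word" means, and line numbers are assumed to start at 1.

```python
import re


def find_word_file_pairs(words_path, files_path):
    """Return 'a => b' strings: the word on line a of words_path occurs in the file on line b of files_path."""
    with open(words_path) as f:
        # One compiled whole-word pattern per line; blank lines keep their slot as None
        # so that the numbering still matches the line numbers in the words file.
        patterns = [re.compile(r"\b%s\b" % re.escape(w)) if w else None
                    for w in (line.strip() for line in f)]

    results = []
    with open(files_path) as flist:
        for file_no, name in enumerate(flist, start=1):
            name = name.strip()
            if not name:
                continue
            with open(name) as f:
                text = f.read()  # assumption: each file fits comfortably in memory
            for word_no, pattern in enumerate(patterns, start=1):
                if pattern is not None and pattern.search(text):
                    results.append("%d => %d" % (word_no, file_no))
    return results


def write_pairs(pairs, out_path):
    """Write one 'a => b' pair per line to out_path."""
    with open(out_path, "w") as out:
        out.writelines(pair + "\n" for pair in pairs)
```

Compiling the patterns once, before the loop over files, avoids rebuilding them for every file. Note that \b only behaves as a word boundary next to letters, digits, and underscores, so search terms that begin or end with punctuation would need different handling.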
Doing it in pure shell, I'm close:
(Tried to figure out how to remove the $(cat files.txt), but couldn't.)
This prints out the words in each file and the lines where they occur, but it doesn't print the line in words.txt where each word is located.
There's probably some really ugly stuff I could do (if you didn't think this was ugly enough already), but your real answer is to use a higher-level language. The awk solution is shellish, since most people now consider awk just part of the Unix environment. However, if you're using awk, you might as well use perl, python, or ruby.
The only advantage awk has is that it is automatically included in a Linux/Unix distro even if whoever built the distro didn't include any of the development packages. It's rare, but it happens.
To answer your request for comments on your code:
You open 'files.txt' but don't close it. with open('files.txt') as flist: is preferable because it is textually cleaner and it closes the file by itself.
Instead of filenum = filenum + 1, use enumerate(). From now on, you must never forget enumerate(), because it is an extremely useful function. It is also very fast.
fline isn't a good name for an iterator over lines, IMO; isn't line a good one?
The instruction wlist = open('words.txt') isn't in a good place: it is executed not only for each file opened, but even each time a line is analysed. Moreover, the treatment of the names listed in wlist is performed each time wlist is iterated, that is to say at each line. You must move this treatment outside all the iterations.
wordnum is nothing more than the index of word in wlist. You can use enumerate() again, or simply loop with an index i and use wlist[i] instead of word.
Each time a word of wlist is in the line, you output a separate result. It would be better to do print repr(filenum) + ' => ' + repr(all_wordnum), in which all_wordnum would be the list of all the words found in one line.
You keep your list of words in a file. You'd do better to serialise this list of words. See the pickle module.
There is also something to improve in the recording of results: executing the output instruction each time is not good practice. It's the same if you want to record to a file: you shouldn't call write() repeatedly. It is better to collect all the results in a list and print or record them when the process is over, using "\n".join(list) or something like that.
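
Put together, the suggestions above could look roughly like this. The reviewed script itself is not shown here, so this is a sketch under assumptions: the function name scan is hypothetical, the variable names follow the review, matching is plain substring matching, and numbering starts at 1.

```python
def scan(files_list="files.txt", words_file="words.txt"):
    """Return one 'filenum => [word numbers]' entry per matching line, joined by newlines."""
    # Read the word list once, outside every loop over files and lines.
    with open(words_file) as wfile:
        wlist = [w.strip() for w in wfile]

    results = []
    with open(files_list) as flist:  # the with block closes the file by itself
        for filenum, filename in enumerate(flist, start=1):
            filename = filename.strip()
            if not filename:
                continue
            with open(filename) as f:
                for line in f:
                    # All words of wlist found in this line, by their 1-based index.
                    all_wordnum = [i for i, word in enumerate(wlist, start=1)
                                   if word and word in line]
                    if all_wordnum:
                        results.append("%d => %r" % (filenum, all_wordnum))

    # Join once at the end instead of printing or writing on every match.
    return "\n".join(results)
```

The returned string can then be printed or written to a file in a single call.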
A pure sh answer, assuming that words and file names do not contain any shell metacharacters such as blanks:
But I prefer Perl for this kind of thing.
And the Perl script won't be quite as short or readable as carrrot-top's Python code, unless you use IO::All.