How do I find words from a file in the files listed in a file?

Posted on 2024-10-20 12:53:02

Searching a single file for a word is easy:

grep stuff file.txt

But I have many files, each listed on its own line in files.txt, and many words I want to find, each on its own line in words.txt. The output should be a file in which each line has the form a => b, where a is the line number in words.txt and b is the line number in files.txt.

I need to run it on OSX, so preferably something simple in shell, but any other language would be fine. I haven't had much experience with shell scripts myself, and I'm more used to languages that aren't useful for string searching (namely C - I'm guessing Perl or Python may be helpful, but I've not used them).


Comments (8)

萌能量女王 2024-10-27 12:53:02

You might find this to be faster, more Pythonic, and easier to understand:

# Read the search words once, remembering each word's line number.
with open("words.txt") as words:
    wlist=[(ln,word.strip()) for ln,word in enumerate(words,1)]

# Read the file names once, remembering each name's line number.
with open("files.txt") as files:
    flist=[(ln,file.strip()) for ln,file in enumerate(files,1)]

for filenum, filename in flist:
    with open(filename) as fdata:
        for line in fdata:
            for wln, word in wlist:
                if word in line:
                    # a => b: line number in words.txt, line number in files.txt
                    print("%d => %d" % (wln, filenum))
时光是把杀猪刀 2024-10-27 12:53:02

This is a two-parter with awk:
1. scan each file in files.txt, and map the word number to the name of the file
2. map the filename to the line number in files.txt

awk '
  NR == FNR {word[$1] = NR; next}
  {for (i=1; i<=NF; i++) {if ($i in word) {print word[$i] " => " FILENAME}}}
' words.txt $(<files.txt) | 
sort -u |
awk '
  NR == FNR {filenum[$1] = NR; next}
  {$3 = filenum[$3]; print}
' files.txt -
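
For comparison, the two awk passes above can be sketched in Python. This is only an illustration under stated assumptions: the function name match_pairs is hypothetical, words are matched as whole whitespace-separated fields (as awk's $i does), and a set stands in for sort -u.

```python
def match_pairs(words_path, files_path):
    """Return sorted, de-duplicated (word_line, file_line) pairs."""
    # Pass 1: map each word to its line number in words.txt.
    with open(words_path) as fh:
        word_num = {line.strip(): n for n, line in enumerate(fh, 1) if line.strip()}

    pairs = set()  # plays the role of `sort -u`
    with open(files_path) as fh:
        # Pass 2: the line number in files.txt identifies each file.
        for file_num, name in enumerate(fh, 1):
            name = name.strip()
            if not name:
                continue
            with open(name) as data:
                for line in data:
                    # Whole whitespace-separated fields only, like awk's $i.
                    for field in line.split():
                        if field in word_num:
                            pairs.add((word_num[field], file_num))
    return sorted(pairs)
```

Printing "%d => %d" % pair for each pair returned by match_pairs("words.txt", "files.txt") gives the a => b lines the question asks for.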
萌无敌 2024-10-27 12:53:02

First, learn to specify the files of interest. In one directory or more than one directory? The Unix find utility will do that.

At the Bash prompt:

$ cd [the root directory where your files are]
$ find . -name "*.txt"

You did not say, but presumably the files can be described by a pattern such as "star dot something"; if so, find will find them.

Next, pipe the file names to whatever you want to do to them:

$ find . -name "*.txt" -print0 | xargs -0 egrep 'stuff'

That will run egrep on each file with the search pattern stuff.

Google find plus xargs for literally thousands of examples. Once you are comfortable finding the files -- rephrase your question so that it is a bit more obvious what you want to do to them. Then I can help you with Perl to do it.
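
If you would rather stay in one language, the find-then-search pipeline above can be approximated in Python. This is a rough sketch, not a replacement for find and egrep: find_files and grep_files are hypothetical names, and the search is a plain substring test rather than a regular expression.

```python
import fnmatch
import os


def find_files(root, pattern="*.txt"):
    """Yield paths under root whose base name matches pattern (like find -name)."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in fnmatch.filter(filenames, pattern):
            yield os.path.join(dirpath, name)


def grep_files(paths, needle):
    """Yield (path, line_number, line) for every line that contains needle."""
    for path in paths:
        # errors="replace" keeps going if a file is not valid text.
        with open(path, errors="replace") as fh:
            for n, line in enumerate(fh, 1):
                if needle in line:
                    yield path, n, line.rstrip("\n")
```

For example, grep_files(find_files("."), "stuff") walks the current directory and reports each matching line together with its file and line number.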

濫情▎り 2024-10-27 12:53:02

The following Python script does it. This is my first attempt at Python, so I'd appreciate any comments.

flist = open('files.txt')

filenum = 0
for filename in flist:
    filenum = filenum + 1
    filenamey = filename.strip()
    filedata = open(filenamey)
    for fline in filedata:
        wordnum = 0
        wlist = open('words.txt')
        for word in wlist:
            wordnum = wordnum + 1
            sword = word.strip()
            if sword in fline:
                s = repr(filenum) + ' => ' + repr(wordnum)
                print s
—━☆沉默づ 2024-10-27 12:53:02

Here's something that does roughly what you want. The one limitation is that it doesn't print the matched word by itself; it prints only the matching line, the file name, and the line number. However, if you use --color=auto with grep, it highlights the matched words using whatever you have set in ${GREP_COLOR}; the default is red.

cat files.txt | xargs grep -nf words.txt --color=auto

This command will dump all contents of files.txt, line by line, and it will pipe the file names to grep, which will search the file for every word that matches in words.txt. Similar to files.txt, words.txt should be all the search terms you want delimited by new-lines.

If your grep was built with the perl regular expression engine, then, you can use Perl regular expressions if you pass the -P option to grep like so:

grep -Pnf words.txt --color=auto

Hope this helps.

Update: At first, I wasn't really sure what @Zeophlite was asking, but after he posted his example, I see what he wanted. Here's a Python implementation of what he wants to do:

def load_words(path='words.txt'):
    # Read the words once: a file object can only be iterated a single time,
    # so re-using it for every line would only ever check the first line.
    with open(path) as word_file:
        return [word.strip() for word in word_file]


def search_file(line_num, filename, words):
    with open(filename) as open_filename:
        for line in open_filename:
            for wordfile_line_number, word in enumerate(words, 1):
                if word in line:
                    print("%s => %s" % (line_num, wordfile_line_number))


words = load_words()
with open('files.txt') as filenames_file:
    for filenames_line_number, fname in enumerate(filenames_file, 1):
        search_file(filenames_line_number, fname.strip(), words)
明月夜 2024-10-27 12:53:02

Doing it in pure shell, I'm close:

$ grep -En "$(tr '\n' '|' < words.txt | sed 's/|$//')" $(cat files.txt)

(Tried to figure out how to remove the $(cat files.txt), but couldn't)

This prints out the words in each file, and prints out the lines where they occur, but it doesn't print out the line in words.txt where that word was located.

There's probably some really ugly (if you didn't think this was ugly enough) stuff I could do, but your real answer is to use a higher level language. The awk solution is shellish since most people now consider awk as just part of the Unix environment. However, if you're using awk, you might as well use perl, python, or ruby.

The only advantage awk has is that it is automatically included in a Linux/Unix distro even if the user who created the distro didn't include any of the development packages. It's rare, but it happens.
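
To illustrate the higher-level route, here is a hedged Python sketch of the same alternation idea that also recovers the line in words.txt, which is the part the grep one-liner cannot do. The names build_matcher and words_in_line are hypothetical. One caveat: an alternation only reports non-overlapping matches, so a word that overlaps an earlier match on the same line is missed.

```python
import re


def build_matcher(words):
    """Compile one alternation regex plus a lookup from word to its 1-based line number."""
    line_of = {}
    for n, word in enumerate(words, 1):
        if word:
            line_of.setdefault(word, n)  # first occurrence wins
    if not line_of:
        raise ValueError("no search words given")
    # Longest first, so a longer word is preferred over its own prefix.
    ordered = sorted(line_of, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(w) for w in ordered))
    return pattern, line_of


def words_in_line(pattern, line_of, line):
    """Return the words.txt line numbers of the non-overlapping matches in line."""
    return sorted({line_of[m.group(0)] for m in pattern.finditer(line)})
```

re.escape plays the role that fgrep plays in the shell: each word is treated literally, so a word containing . or * does not turn into a pattern.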

情痴 2024-10-27 12:53:02

To answer your request:

Your code:

flist = open('files.txt') 

filenum = 0 
for filename in flist: 
    filenum = filenum + 1 
    filenamey = filename.strip() 
    filedata = open(filenamey) 
    for fline in filedata: 
        wordnum = 0 
        wlist = open('words.txt') 
        for word in wlist: 
            wordnum = wordnum + 1 
            sword = word.strip() 
            if sword in fline: 
                s = repr(filenum) + ' => ' + repr(wordnum) 
                print s 

You open 'files.txt' but never close it.
with open('files.txt') as flist: is preferable because it reads more cleanly and closes the file by itself.

Instead of filenum = filenum + 1, use enumerate().
From now on, never forget enumerate(): it is an extremely useful function, and it is very fast, too.

fline isn't a good name for a variable that iterates over lines, IMO; wouldn't line be a better one?

The instruction wlist = open('words.txt') is in a bad place: it is executed not just once for each file opened, but every time a line is analysed.
Moreover, the processing of the words listed in wlist is repeated each time wlist is iterated, that is to say, for every line. Move this processing outside all the loops.

wordnum is nothing other than the index of word in wlist. You can use enumerate() again, or simply loop over an index i and use wlist[i] instead of word.

Each time a sword from wlist is in the line, you do

print repr(filenum) + ' => ' + repr(wordnum) 

It would be better to do print repr(filenum) + ' => ' + repr(all_wordnum), where all_wordnum is the list of all the sword values found in one line.

You keep your list of words in a file. You would do better to serialise this list of words; see the pickle module.

There is also something to improve in how the results are recorded. Executing the instruction

print repr(filenum) + ' => ' + repr(wordnum)

each time is not good practice. The same applies if you want to record the results in a file: you shouldn't call write() repeatedly. It is better to collect all the results in a list and print or write them when processing is over, using "\n".join(list) or something like that.
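
Put together, the suggestions above might look like the following sketch. It is one possible reading of the advice, not the original poster's code: load_lines and search are hypothetical names, and the output keeps the filenum => wordnum order of the script being reviewed.

```python
def load_lines(path):
    """Read a file once and return its lines with surrounding whitespace stripped."""
    with open(path) as fh:  # closed automatically when the block ends
        return [line.strip() for line in fh]


def search(files_path="files.txt", words_path="words.txt"):
    """Return result strings 'filenum => [wordnums]', one per matching line."""
    words = load_lines(words_path)  # read and stripped once, outside every loop
    results = []
    for filenum, filename in enumerate(load_lines(files_path), 1):
        if not filename:
            continue
        with open(filename) as filedata:
            for line in filedata:
                # Collect every word found on this line before reporting it.
                all_wordnum = [n for n, w in enumerate(words, 1) if w and w in line]
                if all_wordnum:
                    results.append("%d => %r" % (filenum, all_wordnum))
    return results
```

A single print("\n".join(search())) at the end then replaces the repeated print calls.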

挽袖吟 2024-10-27 12:53:02

A pure sh answer, assuming that words or filenames do not contain any shell metacharacters such as blanks:

nw=0; while read w; do nw=`expr $nw + 1`; nf=0; { while read f; do nf=`expr $nf + 1`; fgrep -n $w $f | sed 's/:.*//' | while read n; do echo $nw =\> $nf; done; done < /tmp/files.txt;}; done < /tmp/words.txt

But I prefer Perl for this kind of thing.
And the Perl script won't be quite as short or readable as carrrot-top's Python code, unless you use IO::All.
