How to search contents of multiple pdf files?

Posted on 2024-10-11 01:26:48

How could I search the contents of PDF files in a directory/subdirectory? I am looking for some command line tools. It seems that grep can't search PDF files.

Comments (15)

独﹏钓一江月 2024-10-18 01:26:48

There is pdfgrep, which does exactly what its name suggests.

pdfgrep -R 'a pattern to search recursively from path' /some/path

I've used it for simple searches and it worked fine.

(There are packages in Debian, Ubuntu and Fedora.)

Since version 1.3.0 pdfgrep supports recursive search. This version is available in Ubuntu since Ubuntu 12.10 (Quantal).

熟人话多 2024-10-18 01:26:48

Your distribution should provide a utility called pdftotext:

find /path -name '*.pdf' -exec sh -c 'pdftotext "$1" - | grep --with-filename --label="$1" --color "your pattern"' sh {} \;

The "-" is necessary to have pdftotext output to stdout, not to files.
The --with-filename and --label= options will put the file name in the output of grep.
The optional --color flag is nice and tells grep to output using colors on the terminal.

(In Ubuntu, pdftotext is provided by the package xpdf-utils or poppler-utils.)

This method, using pdftotext and grep, has an advantage over pdfgrep if you want to use features of GNU grep that pdfgrep doesn't support. Note: pdfgrep 1.3.x supports the -C option for printing lines of context.
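The same idea can be wrapped in a small shell function. This is only a sketch: the name pdfgrep_lite is made up for illustration, it assumes pdftotext is installed and that grep supports --label (GNU grep does), and it leaves out --color so the output can be piped.

```shell
#!/bin/sh
# pdfgrep_lite PATTERN [DIR]
# Hypothetical helper: search the text of every PDF below DIR
# (default: current directory) and prefix each match with its file name.
pdfgrep_lite() {
    find "${2:-.}" -type f -iname '*.pdf' -exec sh -c '
        pattern=$1; shift
        for file do
            # "-" sends the extracted text to stdout; --label names the
            # source file, because grep itself only sees standard input.
            pdftotext -q "$file" - 2>/dev/null |
                grep --with-filename --label="$file" -e "$pattern"
        done
        exit 0  # a file without matches is not treated as an error
    ' sh "$1" {} +
}
```

Calling pdfgrep_lite 'your pattern' /path then behaves like the one-liner, but the file name reaches the inner shell as an argument instead of being pasted into the command string.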

小嗷兮 2024-10-18 01:26:48

Recoll is a fantastic full-text GUI search application for Unix/Linux that supports dozens of different formats, including PDF. It can even pass the exact page number and search term of a query to the document viewer and thus allows you to jump to the result right from its GUI.

Recoll also comes with a viable command-line interface and a web-browser interface.

ま柒月 2024-10-18 01:26:48

My current version of pdfgrep (1.3.0) allows the following:

pdfgrep -HiR 'pattern' /path

From pdfgrep --help:

  • -H: Print the file name for each match.
  • -i: Ignore case distinctions.
  • -R: Search directories recursively.

It works well on my Ubuntu.

自我难过 2024-10-18 01:26:48

There is another utility called ripgrep-all, which is based on ripgrep.

It can handle more than just PDF documents, like Office documents and movies, and the author claims it is faster than pdfgrep.

Command syntax for recursively searching the current directory, and the second one limits to PDF files only:

rga 'pattern' .
rga --type pdf 'pattern' .

你如我软肋 2024-10-18 01:26:48

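I made this small, destructive script: it writes a text file named "<name>.pdf." next to every PDF and leaves it there unless you enable the rm line. Have fun with it.

function pdfsearch()
{
    # Usage: pdfsearch PATTERN (searches PDFs below the current directory)
    find . -iname '*.pdf' | while IFS= read -r filename
    do
        #echo -e "\033[34;1m// === PDF Document:\033[33;1m $filename\033[0m"
        # Convert to "<name>.pdf." and grep that file, so -H prints its name.
        pdftotext -q -enc ASCII7 "$filename" "$filename."; grep -s -H --color=always -i -- "$1" "$filename."
        # remove it!  rm -f "$filename."
    done
}

剩一世无双 2024-10-18 01:26:48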

I like @sjr's answer, but I prefer xargs to -exec. I find xargs more versatile: for example, with -P we can take advantage of multiple CPUs when it makes sense to do so.

find . -name '*.pdf' | xargs -P 5 -I % pdftotext % - | grep --with-filename --label="{}" --color "pattern"
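Note that in the command above grep runs once, after the pipe, on the merged text of all PDFs, so every match is labelled with the literal string {}. A possible variant that keeps the real file name is sketched below. It assumes an xargs with -0 and -P (GNU and BSD xargs have both); the function name pdf_search_parallel and the default of 4 jobs are made up for illustration.

```shell
#!/bin/sh
# pdf_search_parallel PATTERN [DIR] [JOBS]
# Hypothetical helper: run one pdftotext | grep pipeline per PDF,
# up to JOBS pipelines at a time.
pdf_search_parallel() {
    find "${2:-.}" -type f -iname '*.pdf' -print0 |
        xargs -0 -P "${3:-4}" -n 1 sh -c '
            # $1 = pattern (passed in below), $2 = the file xargs appended
            pdftotext -q "$2" - 2>/dev/null |
                grep --with-filename --label="$2" -e "$1"
            exit 0  # "no match" should not make xargs report a failure
        ' sh "$1"
}
```

Because each PDF gets its own grep, --label can carry the file name. The order of the output lines depends on which job finishes first, so pipe the result through sort if a stable order matters.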

离线来电— 2024-10-18 01:26:48

I had the same problem, so I wrote a script that searches all PDF files in the specified folder for a string and prints the PDF files which matched the query string.

Maybe this will be helpful to you.

You can download it here
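The linked script is not reproduced here. As a rough idea of what such a tool could look like (this is not the author's script, only a sketch built on pdftotext and grep, with the made-up name pdf_files_matching), the following prints just the names of the PDFs that contain a fixed string:

```shell
#!/bin/sh
# pdf_files_matching STRING [DIR]
# Hypothetical helper: list the PDFs below DIR (default: current
# directory) whose extracted text contains STRING.
pdf_files_matching() {
    find "${2:-.}" -type f -iname '*.pdf' -exec sh -c '
        needle=$1; shift
        for file do
            # -F treats the needle as a plain string, -q suppresses the
            # matching lines; only the exit status is used.
            if pdftotext -q "$file" - 2>/dev/null | grep -q -F -e "$needle"; then
                printf "%s\n" "$file"
            fi
        done
    ' sh "$1" {} +
}
```

For example, pdf_files_matching 'query string' ~/Documents would print one matching path per line.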

财迷小姐 2024-10-18 01:26:48

If you want to see file names with pdftotext, use the following command:

find . -name '*.pdf' -exec echo {} \; -exec pdftotext {} - \; | grep "pattern\|pdf"

日暮斜阳 2024-10-18 01:26:48

First convert all your pdf files to text files:

for file in *.pdf;do pdftotext "$file"; done

Then use grep as normal. This is especially good as it is fast when you have multiple queries and a lot of PDF files.
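For a directory tree, the conversion step could be made recursive and incremental, so that repeated runs only convert PDFs that are new or have changed. A minimal sketch, assuming pdftotext is installed; the function name pdf_text_cache is made up for illustration:

```shell
#!/bin/sh
# pdf_text_cache [DIR]
# Hypothetical helper: write "<name>.txt" next to every PDF below DIR
# (default: current directory), skipping PDFs whose text file is up to date.
pdf_text_cache() {
    find "${1:-.}" -type f -iname '*.pdf' -exec sh -c '
        for pdf do
            txt="${pdf%.*}.txt"
            # Convert when the text file is missing or older than the PDF.
            if [ ! -e "$txt" ] || [ "$pdf" -nt "$txt" ]; then
                pdftotext -q "$pdf" "$txt"
            fi
        done
    ' sh {} +
}
```

After pdf_text_cache /path, a plain grep -r --include='*.txt' 'pattern' /path searches the cached text; the reported names end in .txt rather than .pdf.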

◇流星雨 2024-10-18 01:26:48

There is an open source common resource grep tool crgrep which searches within PDF files but also other resources like content nested in archives, database tables, image meta-data, POM file dependencies and web resources - and combinations of these including recursive search.

The full description under the Files tab pretty much covers what the tool supports.

I developed crgrep as an open-source tool.

蒲公英的约定 2024-10-18 01:26:48

You need some tools like pdf2text to first convert your pdf to a text file and then search inside the text. (You will probably miss some information or symbols).

If you are using a programming language there are probably pdf libraries written for this purpose. e.g. http://search.cpan.org/dist/CAM-PDF/ for Perl

木緿 2024-10-18 01:26:48

Try using 'acroread' in a simple script like the one above.

情深已缘浅 2024-10-18 01:26:48

Thanks for all the good ideas here!

I tried the xargs method, but as pointed out here, xargs will make it impossible (or very hard) to include printing the actual file name...

So I tried the whole thing with GNU parallel.

parallel "pdftotext -q {} - | grep --with-filename --label='['{}']' --color=always --context=5 'pattern'" ::: *.pdf
  • This prints not only the matching line but, with --context=5, also the 5 lines above and below it for context.
  • With -q pdftotext won't print any error messages or warnings (quiet).
  • I use brackets [] as labels instead of braces {}. If you wanted braces --label='{'{}'}' will make that happen. Note that {} is replaced by the actual filename by GNU parallel, e.g. 'Example portable document file name with spaces.pdf' ({} is already using single quotes ').
  • By using --label={} only the filename will be printed, which may be the favored way of displaying the filename.
  • I also noticed that the output was without color when I tried it, except when forcing it by adding --color=always with grep.
  • It may be useful to add --ignore-case to the grep command for a case-insensitive keyword search.

If all PDF files should be processed recursively, including all sub-directories in the current directory (.), this can be accomplished through find:

find . -type f -iname '*.pdf' -print0 | parallel -0 "pdftotext -q {} - | grep --with-filename --label='['{}']' --color=always --context=5 'pattern'"
  • With find, -iname '*.pdf' matches case-insensitively. With -name '*.pdf' only lower-case .pdf files will be included (the normal case). Since I sometimes also encounter Windows PDF files with an upper-case .PDF file extension, I tend to prefer -iname...
  • The above command also works with the -print find option (instead of -print0), so it will be line-based (one file name per line), then -0 (NUL delimiter) must be omitted from the parallel command.
  • Again, including --ignore-case in the grep command will make the search case-insensitive.

As a general recommendation when playing with the whole command line, parallel --dry-run will print which commands would be executed.

$ find . -type f -iname '*.pdf' -print0 | parallel --dry-run -0 "pdftotext -q {} - | grep --with-filename --label='['{}']' --color=always --ignore-case --context=5 'pattern'"
pdftotext -q ./test PDF file 1.pdf - | grep --with-filename --label='['./test PDF file 1.pdf']' --color=always --ignore-case --context=5 'pattern'
pdftotext -q ./subdir1/test PDF file 2.pdf - | grep --with-filename --label='['./subdir1/test PDF file 2.pdf']' --color=always --ignore-case --context=5 'pattern'
pdftotext -q ./subdir2/test PDF file 3.pdf - | grep --with-filename --label='['./subdir2/test PDF file 3.pdf']' --color=always --ignore-case --context=5 'pattern'

一枫情书 2024-10-18 01:26:48

Use pdfgrep:

pdfgrep -HinR 'FWCOSP' DatenModel/

In this command I'm searching for the word FWCOSP inside the folder DatenModel/.

The output shows the file name together with the page number of each match.

The options I'm using are:

-i : ignore case when matching
-H : print the file name for each match
-n : prefix each match with the number of the page where it is found
-R : same as -r, but it also follows all symlinks.