Extract lines matching a result from a text file

Posted on 2024-12-20 22:52:48 | Word count: 1217 | Views: 0 | Comments: 0

I need to extract filenames from a text file, but only for the entries whose output lists no fonts.

As you can see in the output file below, each entry has two header lines, then any font rows, then the PDF path. I need to print the paths of the entries that have no font rows. In this sample, only the last entry has fonts.

Does this make sense? Would grep, sed, or awk be the answer?

In short, I need output that shows which PDFs have no fonts listed between the START and END markers in the text file below.

******************START***********************
name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
/home/user1/Documents/temp1.pdf
******************END***********************
******************START***********************
name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
/home/user1/Documents/temp2.pdf
******************END***********************
******************START***********************
name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
BAAAAA+TimesNewRomanPS-BoldMT        TrueType          yes yes yes     14  0
CAAAAA+TimesNewRomanPSMT             TrueType          yes yes yes      9  0
/home/user3/Documents/temp file.pdf
******************END***********************

Comments (2)

沧桑㈠ 2024-12-27 22:52:48

This prints any line containing ".pdf" if the previous line starts with -.

[me@home]$ awk '{if (st && match($0, /\.pdf/)) {print $0}; st = match($0, /^-/)}' in.txt
/home/user1/Documents/temp1.pdf
/home/user1/Documents/temp2.pdf

It is not a generic solution, but will work with the input data you've given. I can imagine several edge cases where this might fail but it's all down to the specifications of your input file.
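
One way to make this less dependent on the previous line is to count the lines inside each block. The following is only a sketch, assuming every block consists of exactly two header lines, zero or more font rows, and then the PDF path as the last line before the END marker. The function name list_fontless is hypothetical.

```shell
# list_fontless FILE (hypothetical helper)
# Assumption: each START/END block holds 2 header lines, 0+ font rows,
# and the PDF path as its last line. A block with exactly 3 lines
# therefore has no font rows.
list_fontless() {
    awk '
        /^\*+START\*+/ { n = 0; in_block = 1; next }   # new block: reset counter
        /^\*+END\*+/   { if (in_block && n == 3) print last
                         in_block = 0; next }
        in_block       { n++; last = $0 }              # remember the latest line
    ' "$1"
}

# Usage: list_fontless in.txt
```

On the sample input from the question this should print the temp1.pdf and temp2.pdf paths and skip "temp file.pdf", because that block has five lines.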


Update

(Based on the script you've posted in the comments below.) If all you're trying to do is identify PDF files that have no fonts, this might work:

# Inside double quotes the spaces need no backslash escapes
MAGNUM="/mnt/network/User 1 PDF 06.12.11/"
has_no_fonts() {
    COUNT=$(pdffonts "$1" 2> /dev/null | wc -l)
    exit $(( $COUNT - 2 ))  # 2 header lines only => exit 0
}
export -f has_no_fonts
find "$MAGNUM" -type f -name "*.pdf" -exec bash -c 'has_no_fonts "{}"' \; -print

Here's a breakdown of the script:

  • Detecting the font count. It would have been simple if pdffonts returned a specific value when a PDF has no fonts, but it does not. We therefore count the number of output lines and subtract 2 (the header lines) to get the number of fonts listed.

    COUNT=$(pdffonts "$1" 2> /dev/null | wc -l) # number of output lines
                                                # exactly 2 if no fonts
                                                # exactly 0 if there are errors
    exit $(( $COUNT - 2 ))  # exit 0 (success) if and only if PDF has no fonts
    
  • The bash function is exported so that it is available in the bash process started by find.

    export -f has_no_fonts
    
  • Locate the PDF files and print a name only if the PDF is valid and has no fonts.

    find .....  -exec bash -c 'has_no_fonts "{}"' \; -print
                      -------                        -------
                          |                             |
              -exec cannot run bash functions,    Only prints the filename if the
              so run it through bash -c           previous command exits with 0
    
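The line-count test can be tried without pdffonts by feeding it saved output. The sketch below uses a hypothetical helper, has_no_fonts_output, which reads pdffonts-style text on stdin. It compares the count to 2 directly instead of exiting with the difference, because exit statuses are truncated to the range 0 to 255, so a difference of exactly 256 would wrap around to 0.

```shell
# has_no_fonts_output (hypothetical helper)
# Reads pdffonts-style output on stdin and succeeds (status 0) only when
# the input is exactly the two header lines, i.e. no font rows.
# Empty input (pdffonts failed) gives a non-zero status.
has_no_fonts_output() {
    local lines
    lines=$(wc -l)            # some wc implementations pad with spaces
    [ "$((lines))" -eq 2 ]    # arithmetic expansion strips the padding
}

# Usage (assumes pdffonts is installed):
#   pdffonts some.pdf 2> /dev/null | has_no_fonts_output && echo "no fonts"
```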

If you prefer a one-liner, the whole script can be written as:

find "$MAGNUM" -name "*.pdf" \
    -exec bash -c 'exit $(($(pdffonts "{}" 2> /dev/null |wc -l) - 2))' \; -print
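
Embedding {} inside the bash -c string means a filename containing a double quote or $( ) is interpreted as shell code. A common alternative is to pass the filename as a positional argument. The sketch below shows the same -exec ... -print gating with a stand-in predicate, [ ! -s FILE ] ("file is empty"), so that it runs without pdffonts. The function name list_empty_pdfs is hypothetical, and the predicate is not the font check from the answer.

```shell
# list_empty_pdfs DIR (hypothetical helper)
# Prints each *.pdf under DIR for which the predicate exits with 0.
# The predicate here is a stand-in: "the file is empty".
list_empty_pdfs() {
    find "$1" -type f -name '*.pdf' \
        -exec bash -c '[ ! -s "$1" ]' _ {} \; -print
    # "_" fills $0 of the inline script; {} arrives as "$1", so spaces
    # and quotes in the filename are passed through untouched.
}

# Usage: list_empty_pdfs /some/dir
```

To apply this to the font check, the predicate would be replaced by the pdffonts line count, still reading the filename from "$1".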

末が日狂欢 2024-12-27 22:52:48

This might work for you:

sed -n '/^\*/,//{H;/\*END\*/{x;s/\n/&/6;t;s|[^/]*\([^\n]*\).*|\1|p}}' in.txt
/home/user1/Documents/temp1.pdf
/home/user1/Documents/temp2.pdf

Explanation:

  1. Focus on the lines between lines beginning with *.
  2. Append each such line to the hold space (HS).
  3. When we reach the closing delimiter, swap to the HS.
  4. Check for 6 or more newlines, which means the entry must have fonts, and if so bail out.
  5. Delete all non-essential text and print what is left.
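
Steps 2 to 4 rely on a small idiom: collect lines in the hold space, then use s/\n/&/N followed by t to bail out when there are N or more newlines. Here is a minimal sketch of that idiom in isolation, with a hypothetical helper join_if_short and a threshold of 3 chosen only for illustration.

```shell
# join_if_short (hypothetical helper)
# Prints stdin joined onto one line, but only if it has fewer than 3 lines.
join_if_short() {
    sed -n '
        H
        ${
            x
            s/\n/&/3
            t
            s/\n/ /g
            s/^ //
            p
        }
    '
    # H         append every line to the hold space (adds a leading newline)
    # x         on the last line, swap the collected text into the pattern space
    # s/\n/&/3  succeeds only if a 3rd newline exists (3 or more lines)
    # t         if that substitution succeeded, skip the rest: print nothing
    # the remaining commands join the lines with spaces and print them
}

# Usage: printf 'a\nb\n' | join_if_short
```

With two input lines a and b this prints "a b". With three or more lines it prints nothing, which is the same bail-out the answer uses to drop entries that have font rows.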

Or at a pinch:

sed -n '/^\*/,//{H;/\*END\*/{x;s|[^/]*-\n\(/[^\n]*\).*|\1|p}}' in.txt