Extract lines from a text file matching results
I need to extract filenames from a text file, but only for the entries where the output lists no fonts.
As you can see from the output file below, each result starts with two header lines. I need to print the results that have no fonts listed after those header lines. In this output, only the last result has fonts.
Does this make sense? Would grep, sed or awk be the answer?
So I need output from the text file below that shows which PDFs have no fonts listed between the START and END lines.
******************START***********************
name type emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
/home/user1/Documents/temp1.pdf
******************END***********************
******************START***********************
name type emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
/home/user1/Documents/temp2.pdf
******************END***********************
******************START***********************
name type emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
BAAAAA+TimesNewRomanPS-BoldMT TrueType yes yes yes 14 0
CAAAAA+TimesNewRomanPSMT TrueType yes yes yes 9 0
/home/user3/Documents/temp file.pdf
******************END***********************
This prints any line containing ".pdf" if the previous line starts with "-". It is not a generic solution, but it will work with the input data you've given. I can imagine several edge cases where this might fail, but it all comes down to the specification of your input file.
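One way to get the behaviour described above is a short awk program that remembers the previous line. This is a minimal sketch, not necessarily the command originally posted; the function name print_fontless and the file name fonts.txt are placeholders chosen for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: print_fontless is a hypothetical helper name.
# Prints each line containing ".pdf" whose immediately preceding
# line starts with "-" (i.e. the dashed header rule).
print_fontless() {
    awk 'prev ~ /^-/ && /\.pdf/ { print } { prev = $0 }' "$@"
}
```

Called as print_fontless fonts.txt, it reads the listing from that file. A path that follows a font line is skipped because the previous line then starts with a font name rather than a dash.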
Update
(Based on the script you've posted in the comments below) If what you're trying to do is simply to identify PDF files that have no embedded fonts, this might work:
Here's a breakdown of the script:
Detecting the embedded font count. This would have been simple if pdffonts returned a specific value when no fonts were embedded, but it does not. We therefore count the number of output lines and deduct 2 (the header lines) to determine the number of embedded fonts.
The bash function is exported so it can be used in a subshell.
Locate the PDF files and print a name only if the PDF is valid and has no fonts.
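The three steps in the breakdown could be put together roughly as follows. This is a sketch under assumptions rather than the script from the answer: it assumes pdffonts (from poppler-utils) is on the PATH and exits with a non-zero status for a file it cannot open, and the function names font_count and list_fontless_pdfs are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: assumes pdffonts (poppler-utils) is installed.

# Number of fonts pdffonts lists for one file:
# output lines minus the 2 header lines.
# Returns non-zero (and prints nothing) if pdffonts fails, e.g. on an invalid PDF.
font_count() {
    local out
    out=$(pdffonts "$1" 2>/dev/null) || return 1
    echo $(( $(printf '%s\n' "$out" | wc -l) - 2 ))
}
# Export the function so the bash started by find can call it.
export -f font_count

# Print every PDF under the given directory (default: current directory)
# that pdffonts accepts and that has no fonts listed.
list_fontless_pdfs() {
    find "${1:-.}" -type f -name '*.pdf' -exec bash -c '
        for f; do
            n=$(font_count "$f") && [ "$n" -eq 0 ] && printf "%s\n" "$f"
        done' _ {} +
}
```

Running list_fontless_pdfs /home/user1/Documents would then print one path per line. Passing the file names as arguments to bash -c, instead of splicing them into the command string, keeps names with spaces such as "temp file.pdf" intact.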
If you prefer a one-liner, the whole script can be written as:
This might work for you:
Explanation: the command works on the lines that lie between the lines starting with "*".
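As an illustration of working block by block between the starred lines, the awk sketch below counts the lines that appear between the dashed rule and the file path inside each START/END block, and prints the path when that count is zero. It is one possible reading of the idea, not the answerer's command; the function name fontless_blocks is hypothetical, and it assumes each path line ends in ".pdf".

```shell
#!/usr/bin/env bash
# Sketch only: fontless_blocks is a hypothetical helper name.
# Inside each START...END block, count the lines between the dashed
# rule and the .pdf path; print the path when that count is zero.
fontless_blocks() {
    awk '
        /^\*+START\*+$/ { inblock = 1; seen_rule = 0; fonts = 0; path = ""; next }
        /^\*+END\*+$/   { if (inblock && path != "" && fonts == 0) print path
                          inblock = 0; next }
        !inblock        { next }                 # ignore anything outside a block
        /^-+( +-+)*$/   { seen_rule = 1; next }  # the dashed header rule
        /\.pdf$/        { path = $0; next }      # the file name line
        seen_rule       { fonts++ }              # anything else after the rule is a font
    ' "$@"
}
```

On the sample input from the question this prints the temp1.pdf and temp2.pdf paths and skips "temp file.pdf", which has two font lines.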
Or at a pinch: