Split a PDF by bookmarks?

Posted on 2024-08-28 21:31:21

I need to process single PDFs that have each been created by merging multiple PDFs. In each merged PDF, a bookmark marks the place where each of the original parts starts.

Is there any way to automatically split this up by bookmarks with a script?

We only have the bookmarks to indicate the parts, not the page numbers, so we would need to infer the page numbers from the bookmarks. A Linux tool would be best.

Comments (6)

逆流佳人身旁 2024-09-04 21:31:21

pdftk can be used to split the PDF file and extract the page numbers of the bookmarks.

To get the page numbers of the bookmarks, run

pdftk in.pdf dump_data

and make your script read the page numbers from the output.

Then use

pdftk in.pdf cat A-B output out_A-B.pdf

to get the pages from A to B into out_A-B.pdf.
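
The two steps above can also be sketched in Python. This is a minimal sketch, not a definitive implementation: it assumes the dump_data output contains lines of the form "BookmarkPageNumber: N", and the function names bookmark_start_pages, page_ranges and split_by_bookmarks are made up for illustration.

```python
import re
import subprocess


def bookmark_start_pages(dump_text):
    """Return the sorted, de-duplicated pages that bookmarks point at."""
    pages = {
        int(m.group(1))
        for m in re.finditer(r"^BookmarkPageNumber: (\d+)\s*$", dump_text, re.MULTILINE)
    }
    # A page number below 1 cannot be a valid start page, so drop it.
    return sorted(p for p in pages if p >= 1)


def page_ranges(start_pages):
    """Turn start pages into pdftk ranges such as '1-4', '5-9', '10-end'."""
    ranges = []
    for start, next_start in zip(start_pages, start_pages[1:]):
        # Each part ends one page before the next part starts.
        ranges.append(f"{start}-{next_start - 1}")
    if start_pages:
        ranges.append(f"{start_pages[-1]}-end")
    return ranges


def split_by_bookmarks(infile, prefix):
    """Run pdftk once to read the bookmarks and once per part to extract it."""
    dump = subprocess.run(
        ["pdftk", infile, "dump_data"],
        check=True, capture_output=True, text=True,
    ).stdout
    for page_range in page_ranges(bookmark_start_pages(dump)):
        subprocess.run(
            ["pdftk", infile, "cat", page_range,
             "output", f"{prefix}_{page_range}.pdf"],
            check=True,
        )
```

Calling split_by_bookmarks("in.pdf", "out") would then write files such as out_1-4.pdf. Collecting the pages in a set before sorting matters: two bookmarks on the same page would otherwise produce a range whose end lies before its start.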

The script could be something like this:

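#!/bin/bash

infile=$1        # input PDF
outputprefix=$2  # prefix for the output file names

# Require an existing input file and a non-empty output prefix.
[[ -e "$infile" && -n "$outputprefix" ]] || exit 1

# Start page of every bookmark, sorted numerically with duplicates removed,
# followed by the keyword "end" that pdftk accepts for the last page.
pagenumbers=( $(pdftk "$infile" dump_data |
                grep '^BookmarkPageNumber: ' | cut -f2 -d' ' | sort -nu)
              end )

for ((i = 0; i < ${#pagenumbers[@]} - 1; ++i)); do
  a=${pagenumbers[i]}    # start page of this part
  b=${pagenumbers[i+1]}  # start page of the next part
  [[ "$b" == "end" ]] || b=$((b - 1))  # this part ends one page earlier
  pdftk "$infile" cat "$a-$b" output "${outputprefix}_$a-$b.pdf"
done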
楠木可依 2024-09-04 21:31:21

There's a command-line tool written in Java called Sejda; its splitbybookmarks command does exactly what you asked. Because it's Java, it runs on Linux, and because it's a command-line tool, you can call it from a script.

Disclaimer: I'm one of the authors.

复古式 2024-09-04 21:31:21

There are ready-made programs like pdf-split that can do this for you:

A-PDF Split is a very simple, lightning-quick desktop utility program that lets you split any Acrobat pdf file into smaller pdf files. It provides complete flexibility and user control in terms of how files are split and how the split output files are uniquely named. A-PDF Split provides numerous alternatives for how your large files are split: by pages, by bookmarks and by odd/even page. You can even extract or remove part of a PDF file. A-PDF Split also offers advanced defined splits that can be saved and later imported for use with repetitive file-splitting tasks. A-PDF Split represents the ultimate in file splitting flexibility to suit every need.

A-PDF Split works with password-protected pdf files, and can apply various pdf security features to the split output files. If needed, you can recombine the generated split files with other pdf files using a utility such as A-PDF Merger to form new composite pdf files.

A-PDF Split does NOT require Adobe Acrobat, and produces documents compatible with Adobe Acrobat Reader Version 5 and above.

Edit: I also found a free, open-source program here, if you do not want to pay.

傲性难收 2024-09-04 21:31:21

Here's a little Perl program I use for the task. Perl isn't special; it's just a wrapper around pdftk to interpret its dump_data output to turn it into page numbers to extract:

#!perl
use v5.24;
use warnings;

use File::Path qw(make_path);
use File::Spec::Functions qw(catfile);

my $pdftk = '/usr/local/bin/pdftk';
my $file = $ARGV[0];
my $split_dir = $ENV{PDF_SPLIT_DIR} // 'pdf_splits';

die "Can't find $ARGV[0]\n" unless -e $file;

# Read the data that pdftk spits out.
open my $pdftk_fh, '-|', $pdftk, $file, 'dump_data'
    or die "Can't run $pdftk: $!\n";

my @chapters;
while( <$pdftk_fh> ) {
    state $chapter = 0;
    next unless /\ABookmark/;

    if( /\ABookmarkBegin/ ) {
        my( $title ) = <$pdftk_fh> =~ /\ABookmarkTitle:\s+(.+)/;
        my( $level ) = <$pdftk_fh> =~ /\ABookmarkLevel:\s+(.+)/;

        my( $page_number ) = <$pdftk_fh> =~ /\ABookmarkPageNumber:\s+(.+)/;

        # I only want to split on chapters, so I skip higher
        # level numbers (higher means more nesting, 1 is lowest).
        next unless $level == 1;

        # If you have front matter (preface, etc) then this numbering
        # will be off. Chapter 1 might be called Chapter 3.
        push @chapters, {
            title         => $title,
            start_page    => $page_number,
            chapter       => $chapter++,
            };
        }
    }

# The end page for one chapter is one before the start page for
# the next chapter. There might be some blank pages at the end
# of the split for PDFs where the next chapter needs to start on
# an odd page.
foreach my $i ( 0 .. $#chapters - 1 ) {
    my $last_page = $chapters[$i+1]->{start_page} - 1;
    $chapters[$i]->{last_page} = $last_page;
    }
die "No level 1 bookmarks found in $file\n" unless @chapters;
$chapters[$#chapters]->{last_page} = 'end';

make_path $split_dir;
foreach my $chapter ( @chapters ) {
    my( $start, $end ) = $chapter->@{qw(start_page last_page)};

    # slugify the title so it can be used as a filename
    my $title = lc( $chapter->{title} =~ s/[^a-z]+/-/gri );

    my $path = catfile( $split_dir, "$title.pdf" );
    say "Outputting $path";

    # Use pdftk to extract that part of the PDF
    system $pdftk, $file, 'cat', "$start-$end", 'output', $path;
    }
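
The Perl program above reads the three fields of each bookmark in a fixed order and names each file after the slugified title alone. As a hedged alternative, here is a minimal Python sketch of the same parsing step that does not depend on field order and prefixes each filename with a running number, so that two chapters with the same title do not overwrite each other. The names Bookmark, parse_bookmarks and output_names are invented for this example, and it assumes each record starts with a "BookmarkBegin" line.

```python
import re
from dataclasses import dataclass


@dataclass
class Bookmark:
    title: str = ""
    level: int = 0
    page: int = 0


def parse_bookmarks(dump_text):
    """Collect one Bookmark per 'BookmarkBegin' record, in document order."""
    bookmarks = []
    current = None
    for line in dump_text.splitlines():
        if line.startswith("BookmarkBegin"):
            current = Bookmark()
            bookmarks.append(current)
        elif current is not None and line.startswith("Bookmark"):
            # Split on the first colon only, so titles may contain colons.
            key, _, value = line.partition(":")
            value = value.strip()
            if key == "BookmarkTitle":
                current.title = value
            elif key == "BookmarkLevel":
                current.level = int(value)
            elif key == "BookmarkPageNumber":
                current.page = int(value)
    return bookmarks


def output_names(bookmarks, level=1):
    """Build numbered, slugified filenames for bookmarks at the given level."""
    chapters = [b for b in bookmarks if b.level == level]
    names = []
    for index, bookmark in enumerate(chapters, start=1):
        slug = re.sub(r"[^a-z0-9]+", "-", bookmark.title.lower()).strip("-")
        names.append(f"{index:02d}-{slug or 'untitled'}.pdf")
    return names
```

The page ranges themselves would be computed exactly as in the Perl program: each chapter ends one page before the next level 1 bookmark, and the last one runs to "end".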
看春风乍起 2024-09-04 21:31:21

I wrote a Python script to split a PDF in two at a bookmark with a specific name, using pdftk. This script preserves the bookmarks in the two output PDFs.
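
That script is not reproduced here. The following is only a rough sketch of how such a split could be computed, assuming pdftk's dump_data output lists "BookmarkTitle" before "BookmarkPageNumber" within each record. The function names find_bookmark_page and split_in_two are hypothetical, and the sketch does not attempt to preserve bookmarks in the output files.

```python
def find_bookmark_page(dump_text, title):
    """Return the page of the first bookmark whose title matches, or None."""
    current_title = None
    for line in dump_text.splitlines():
        key, _, value = line.partition(":")
        value = value.strip()
        if key == "BookmarkBegin":
            current_title = None  # a new record starts
        elif key == "BookmarkTitle":
            current_title = value
        elif key == "BookmarkPageNumber" and current_title == title:
            return int(value)
    return None


def split_in_two(dump_text, title):
    """Return the two pdftk page ranges for a split at the named bookmark."""
    page = find_bookmark_page(dump_text, title)
    if page is None:
        raise ValueError(f"no bookmark titled {title!r}")
    if page <= 1:
        raise ValueError("bookmark is on the first page; nothing to split off")
    return f"1-{page - 1}", f"{page}-end"
```

Each returned range can be passed to "pdftk in.pdf cat RANGE output part.pdf". Titles containing non-ASCII characters may appear escaped in dump_data output; pdftk also offers dump_data_utf8, which may be easier to match against.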

音栖息无 2024-09-04 21:31:21

You can use pdf_extbook to extract bookmarked PDFs on Linux.

It's libre software.

It uses pdftk to read the bookmarks from the file, fzf to allow the user to select which bookmark to extract, and pdftk again to extract bookmarked pages.
