
Split a PDF by bookmarks?

Posted 2024-08-28 21:31:21

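I need to process a single PDF that was created by merging multiple PDFs. Each of the merged PDFs has a bookmark marking the page where that part begins.

Is there any way to split it up automatically by bookmarks with a script?

We only have the bookmarks to indicate the parts, not the page numbers, so we would need to infer the page numbers from the bookmarks. A Linux tool would be best.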


逆流佳人身旁 2024-09-04 21:31:21

pdftk can be used both to split the PDF file and to extract the page numbers of the bookmarks.

To get the page numbers of the bookmarks, run

pdftk in.pdf dump_data

and have your script read the page numbers from the output.
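As a rough illustration of what the script has to read, the sketch below runs the extraction against a made-up sample of dump_data output. The sample text and the function names sample_dump and bookmark_pages are assumptions for demonstration only; real pdftk output contains more fields, and the record layout may differ between pdftk versions.

```shell
#!/bin/bash

# Hypothetical excerpt of `pdftk in.pdf dump_data` output (not from a real file).
sample_dump() {
  cat <<'EOF'
NumberOfPages: 20
BookmarkBegin
BookmarkTitle: Part One
BookmarkLevel: 1
BookmarkPageNumber: 1
BookmarkBegin
BookmarkTitle: Part Two
BookmarkLevel: 1
BookmarkPageNumber: 8
BookmarkBegin
BookmarkTitle: Part Two, Section A
BookmarkLevel: 2
BookmarkPageNumber: 8
BookmarkBegin
BookmarkTitle: Part Three
BookmarkLevel: 1
BookmarkPageNumber: 15
EOF
}

# Read dump_data text on stdin and print the distinct bookmark
# page numbers, one per line, in ascending numeric order.
bookmark_pages() {
  grep '^BookmarkPageNumber: ' | cut -d' ' -f2 | sort -nu
}

sample_dump | bookmark_pages   # prints 1, 8 and 15 on separate lines
```

With a real file, the same function would be fed by `pdftk in.pdf dump_data | bookmark_pages`.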

Then use

pdftk in.pdf cat A-B output out_A-B.pdf

to put pages A through B into out_A-B.pdf.

The script could look something like this:

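#!/bin/bash

infile=$1        # input PDF
outputprefix=$2  # prefix for the output file names

# Both arguments are required and the input file must exist.
[ -e "$infile" ] && [ -n "$outputprefix" ] || exit 1

# Collect the distinct bookmark page numbers in ascending order, followed by
# the literal "end" so that the last part runs to the final page.
# (sort -nu sorts before removing duplicates; uniq alone only drops adjacent ones.)
pagenumbers=( $(pdftk "$infile" dump_data |
                grep '^BookmarkPageNumber: ' | cut -f2 -d' ' | sort -nu)
              end )

for ((i=0; i < ${#pagenumbers[@]} - 1; ++i)); do
  a=${pagenumbers[i]}   # start page number
  b=${pagenumbers[i+1]} # start page of the next part, or "end"
  [ "$b" = "end" ] || b=$((b-1))
  pdftk "$infile" cat "$a-$b" output "${outputprefix}_$a-$b.pdf"
done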


楠木可依 2024-09-04 21:31:21

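There is a command-line tool written in Java called Sejda, which has a splitbybookmarks command that does exactly what you asked for. It is Java, so it runs on Linux, and because it is a command-line tool you can script it.

Disclaimer: I am one of the authors.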


复古式 2024-09-04 21:31:21

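There are programs built for this, such as pdf-split, that can do it for you:

A-PDF Split is a very simple, fast desktop utility that lets you split any Acrobat pdf file into smaller pdf files. It provides complete flexibility and user control over how files are split and how the split output files are uniquely named. A-PDF Split offers several alternatives for splitting large files: by pages, by bookmarks, and by odd/even pages. You can even extract or remove part of a PDF file. A-PDF Split also offers advanced defined splits that can be saved and later imported for repetitive file-splitting tasks. A-PDF Split represents the ultimate flexibility in file splitting to suit every need.
A-PDF Split works with password-protected pdf files and can apply various pdf security features to the split output files. If needed, you can recombine the generated split files with other pdf files using a utility such as A-PDF Merger to form new composite pdf files.
A-PDF Split does not require Adobe Acrobat, and it produces documents compatible with Adobe Acrobat Reader version 5 and above.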

Edit: I also found a free, open-source program here, if you do not want to pay.


傲性难收 2024-09-04 21:31:21

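Here's a little Perl program I use for the task. Perl isn't special; it's just a wrapper around pdftk that interprets its dump_data output and turns it into page ranges to extract: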

#!perl
use v5.24;
use warnings;

use File::Path qw(make_path);
use File::Spec::Functions qw(catfile);

# Adjust this path to wherever pdftk is installed on your system.
my $pdftk = '/usr/local/bin/pdftk';

die "Usage: $0 file.pdf\n" unless @ARGV;
my $file = $ARGV[0];
my $split_dir = $ENV{PDF_SPLIT_DIR} // 'pdf_splits';

die "Can't find $file\n" unless -e $file;

# Read the data that pdftk spits out.
open my $pdftk_fh, '-|', $pdftk, $file, 'dump_data'
    or die "Can't run $pdftk: $!\n";

my @chapters;
while( <$pdftk_fh> ) {
    state $chapter = 0;
    next unless /\ABookmark/;

    if( /\ABookmarkBegin/ ) {
        # The three lines after BookmarkBegin are expected to hold
        # the title, the level, and the page number, in that order.
        my( $title ) = <$pdftk_fh> =~ /\ABookmarkTitle:\s+(.+)/;
        my( $level ) = <$pdftk_fh> =~ /\ABookmarkLevel:\s+(.+)/;

        my( $page_number ) = <$pdftk_fh> =~ /\ABookmarkPageNumber:\s+(.+)/;

        # I only want to split on chapters, so I skip higher
        # level numbers (higher means more nesting, 1 is lowest).
        next unless $level == 1;

        # If you have front matter (preface, etc) then this numbering
        # will be off. Chapter 1 might be called Chapter 3.
        push @chapters, {
            title         => $title,
            start_page    => $page_number,
            chapter       => $chapter++,
            };
        }
    }

die "No level-1 bookmarks found in $file\n" unless @chapters;

# The end page for one chapter is one before the start page for
# the next chapter. There might be some blank pages at the end
# of the split for PDFs where the next chapter needs to start on
# an odd page.
foreach my $i ( 0 .. $#chapters - 1 ) {
    my $last_page = $chapters[$i+1]->{start_page} - 1;
    $chapters[$i]->{last_page} = $last_page;
    }
$chapters[$#chapters]->{last_page} = 'end';

make_path $split_dir;
foreach my $chapter ( @chapters ) {
    my( $start, $end ) = $chapter->@{qw(start_page last_page)};

    # Slugify the title so it can be used as a filename. Chapters
    # whose titles slugify to the same name overwrite each other.
    my $title = lc( $chapter->{title} =~ s/[^a-z]+/-/gri );

    my $path = catfile( $split_dir, "$title.pdf" );
    say "Outputting $path";

    # Use pdftk to extract that part of the PDF
    system( $pdftk, $file, 'cat', "$start-$end", 'output', $path ) == 0
        or warn "pdftk failed for $path\n";
    }
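For readers who would rather stay in the shell, the same parsing idea (keep only level-1 bookmarks, end each part one page before the next one starts, slugify the title) can be sketched with awk. This is an illustrative sketch, not the author's program: the sample data and the function names sample_dump and chapter_ranges are hypothetical, and it assumes each bookmark record lists title, level and page number in that order.

```shell
#!/bin/bash

# Hypothetical bookmark records in the style of pdftk dump_data output.
sample_dump() {
  cat <<'EOF'
BookmarkBegin
BookmarkTitle: Preface
BookmarkLevel: 1
BookmarkPageNumber: 1
BookmarkBegin
BookmarkTitle: Chapter 1: Basics
BookmarkLevel: 1
BookmarkPageNumber: 5
BookmarkBegin
BookmarkTitle: Installing
BookmarkLevel: 2
BookmarkPageNumber: 6
BookmarkBegin
BookmarkTitle: Chapter 2: Advanced Use
BookmarkLevel: 1
BookmarkPageNumber: 12
EOF
}

# Read dump_data text on stdin; print "START-END SLUG" per level-1 bookmark.
chapter_ranges() {
  awk '
    /^BookmarkTitle: /      { title = substr($0, 16) }  # text after "BookmarkTitle: "
    /^BookmarkLevel: /      { level = $2 }
    /^BookmarkPageNumber: / {
      if (level == 1) { n++; start[n] = $2; name[n] = title }
    }
    END {
      for (i = 1; i <= n; i++) {
        # A part ends one page before the next part starts;
        # the final part runs to the end of the document.
        last = (i < n) ? start[i + 1] - 1 : "end"
        slug = tolower(name[i])
        gsub(/[^a-z]+/, "-", slug)
        print start[i] "-" last, slug
      }
    }
  '
}

sample_dump | chapter_ranges
# prints:
# 1-4 preface
# 5-11 chapter-basics
# 12-end chapter-advanced-use
```

Each output line could then drive one extraction, for example by reading it with `while read -r range slug` and running `pdftk in.pdf cat "$range" output "$slug.pdf"` inside the loop.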
