Python-filled PDF: space above the field value when built with pypdf2, but not when built with pdftk

Posted 2025-02-07 10:18:22

This is the workflow we've been using to generate a fillable pdf, then fill it in python. The filling process requires installation of pdftk, and we'd like to get rid of that dependency.

  1. generate .odt in LibreOffice Writer (7.3.4.2), with form fields ('controls')
  2. export as .pdf
    (LibreOffice can export a fillable pdf directly, but apparently with Word you would need to then do some post-processing in Acrobat, which we don't have, to turn it into a fillable pdf?)
  3. in python, generate .fdf with the filled values (using fdfgen)
  4. in python, call the external pdftk tool to merge the fdf with the fillable pdf, and generate a flattened (non-editable) .pdf

In other words, we'd like to not require pdftk or any other external tool, removing step 4 above (and potentially step 3 if .fdf is no longer needed).

We've tried the Python modules pypdf2, pdfrw (actually pdfrw2), and fillpdf. pypdf2 and pdfrw both result in vertical space above and/or below the filled field value. This is a problem because the field backgrounds are opaque and we don't want to cover up the field labels, so the extra space reduces the usable vertical space on the form. (The fields are shown below with a gray background and borders to illustrate the problem; normally they are white with no border.) fillpdf does not seem to know that it's a multi-line text field, and generates a clipped single-line result.
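
Whether a text field is multi-line is recorded in its `/Ff` (field flags) integer: in the PDF specification, bit position 1 is ReadOnly and bit position 13 is Multiline. The sketch below is a minimal, library-independent way to decode such a value, for example to check what a filler library should be seeing for `titleField`. The names `TextFieldFlags`, `describe_flags`, and `set_read_only` are illustrative only and are not part of PyPDF2, pdfrw, or fillpdf.

```python
from enum import IntFlag


class TextFieldFlags(IntFlag):
    """Subset of the /Ff bits (bit position n has the value 1 << (n - 1))."""
    READ_ONLY = 1 << 0    # bit position 1
    REQUIRED = 1 << 1     # bit position 2
    NO_EXPORT = 1 << 2    # bit position 3
    MULTILINE = 1 << 12   # bit position 13
    PASSWORD = 1 << 13    # bit position 14


def describe_flags(ff):
    """Return the names of the known flags set in a /Ff value (None counts as 0)."""
    value = int(ff or 0)
    return [flag.name for flag in TextFieldFlags if value & flag]


def set_read_only(ff):
    """Set the ReadOnly bit without disturbing the other bits."""
    return int(ff or 0) | int(TextFieldFlags.READ_ONLY)


if __name__ == "__main__":
    print(describe_flags(4096))   # a field with only the Multiline bit set
    print(set_read_only(4096))
```

Passing the `/Ff` value of each annotation through `describe_flags` would show whether the exported field actually carries the Multiline bit, which helps separate an export problem from a filler-library problem.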

The pdftk method does not result in the problematic added vertical space.

Is there a way to either a) get rid of this added vertical space in the LibreOffice PDF export, or b) get rid of the added vertical space in the filled pdf using a python-only solution, or c) set the field backgrounds to transparent, so that we can safely move the entire fields upwards, 'overlapping' but not blocking out the field labels?
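
For option c), the PDF specification stores a widget's background colour in the `/BG` entry of its `/MK` (appearance characteristics) dictionary, and an absent or empty `/BG` means no background colour. A minimal sketch of removing that entry follows. It assumes the annotation behaves like a Python dict (as PyPDF2's `DictionaryObject` does), the helper name `clear_widget_background` is made up for illustration, and it has not been tested against this particular LibreOffice export. Any existing `/AP` appearance stream may already contain the painted background, so the appearance would presumably still need to be regenerated by whichever tool or viewer renders the field.

```python
def clear_widget_background(annot):
    """Remove the /BG entry from a widget annotation's /MK dictionary.

    `annot` is any dict-like annotation object. Returns True if a
    background entry was removed, False if there was nothing to remove.
    """
    if '/MK' not in annot:
        return False
    mk = annot['/MK']  # item access, so dict subclasses can resolve references
    if '/BG' in mk:
        del mk['/BG']
        return True
    return False


if __name__ == "__main__":
    # Plain-dict stand-in for a widget annotation (hypothetical values).
    widget = {'/T': 'titleField',
              '/MK': {'/BG': [0.8, 0.8, 0.8], '/BC': [0, 0, 0]}}
    print(clear_widget_background(widget))
    print(widget['/MK'])
```

In the pypdf2 loop this could be called on each `writer_annot` before writing the output, leaving the border colour (`/BC`) and all other entries untouched.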

Here's the document being edited in LibreOffice - notice, no vertical space above the default value:

Here's the fillable .pdf exported from LibreOffice - notice this does have the added vertical space, so, one might think the issue lies with LibreOffice PDF Export, except that the space does not appear with the pdftk method below:

Filled pdf, generated by pypdf2 (looks the same when generated from pdfrw):


Filled pdf, with the current workflow as above, using fdfgen and external call to pdftk:

For reference, here's the code I used to make the various filled pdf versions:

import json
import os

cluePdfName='filled.pdf'
clueFdfName='filled.fdf'
fillableClueReportPdfFileName='clueReportFillable.pdf'

fields={
    'titleField':'SEARCH AND RESCUE\nSEARCH AND RESCUE\nSEARCH AND RESCUE SEARCH AND RESCUE SEARCH AND RESCUE ',
    'incidentNameField':'incident',
    'instructionsCollectField':True,
    'instructionsOtherField':True,
    'descriptionField':'long description long description long description long description long description long description long description long description long description long description long description long description long description',
    'locationField':'long description long description long description long description long description long description long description long description long description long description long description long description long description',
    'locationRadioGPSField':'(Radio GPS: 12345  67890)',
    'instructionsOtherTextField':'do some stuff'}

################################################
#  PART 1: pypdf2
################################################

from PyPDF2 import PdfReader,PdfWriter
from PyPDF2.generic import NameObject,TextStringObject,NumberObject,BooleanObject

reader=PdfReader(fillableClueReportPdfFileName)
writer=PdfWriter()
page=reader.pages[0]
# print(str(page))
# print('annots:'+json.dumps(page['/Annots'],indent=3))
pdfFields=reader.get_fields()
# print('fields:'+json.dumps(pdfFields,indent=3))
writer.add_page(page)

# override PdfWriter.update_page_form_field_values
#  based on https://stackoverflow.com/a/48412434/3577105
# - fill text fields and boolean (checkbox '/Btn' fields)
# - set /AS to the same value, to address not-visible-until-clicked issues
# - set readonly flag for all fields afterwards
for j in range(0, len(page['/Annots'])):
    writer_annot = page['/Annots'][j].get_object()  # getObject() is the deprecated spelling
    for field in fields:
        if writer_annot.get('/T') == field:
            val=fields[field]
            valObj=TextStringObject('---')
            className=val.__class__.__name__
            if className=='str':
                valObj=TextStringObject(val)
            elif className=='bool':
                # checkboxes want a NameObject, either /Yes or /Off - seems odd but it works
                if val:
                    valObj=NameObject('/Yes')
                else:
                    valObj=NameObject('/Off')
            elif className in ['int','float']:
                valObj=TextStringObject(str(val))
            # print('updating '+str(field)+' --> '+str(fields[field])+' ['+className+':'+str(valObj)+']')
            print('updating '+str(field)+' --> '+str(valObj))
            writer_annot.update({
                NameObject("/V"): valObj,
                # NameObject("/AS"): valObj
                # NameObject('/Ff'): NumberObject(1) # set readonly flag for this field
            })
    ff=writer_annot.get('/Ff')
    if ff: # ff will not exist for all fields
        newff=ff|1 # set readonly flag for this field, without changing the other bits
        print('Ff: '+str(ff)+' --> '+str(newff))
        writer_annot.update({NameObject('/Ff'): NumberObject(newff)})
    else:
        writer_annot.update({NameObject('/Ff'): NumberObject(1)})

            
# writer.add_page(page)

with open(cluePdfName,'wb') as out:
    writer.write(out)


################################################
#  PART 2: fdfgen+pdftk
################################################

import subprocess
from fdfgen import forge_fdf

fdf=forge_fdf("",fields.items(),[],[],[])
# the with-block closes the file even if the write fails
with open(clueFdfName,"wb") as fdf_file:
    fdf_file.write(fdf)

cluePdfTkName=cluePdfName.replace('.pdf','_pdftk.pdf')
# pass the arguments as a list so file names with spaces need no shell quoting
pdftk_cmd=['pdftk',fillableClueReportPdfFileName,'fill_form',clueFdfName,'output',cluePdfTkName,'flatten']
print("Calling pdftk with the following command:")
print(' '.join(pdftk_cmd))
subprocess.run(pdftk_cmd)


################################################
#  PART 3: pdfrw
# from https://akdux.com/python/2020/10/31/python-fill-pdf-files.html
# apparently, after doing pip install pdfrw2, 'import pdfrw' uses pdfrw2 (v0.5)
################################################

import pdfrw

ANNOT_KEY = '/Annots'
ANNOT_FIELD_KEY = '/T'
ANNOT_VAL_KEY = '/V'
ANNOT_RECT_KEY = '/Rect'
SUBTYPE_KEY = '/Subtype'
WIDGET_SUBTYPE_KEY = '/Widget'

def fill_pdf(input_pdf_path, output_pdf_path, data_dict):
    template_pdf = pdfrw.PdfReader(input_pdf_path)
    for page in template_pdf.pages:
        annotations = page[ANNOT_KEY] or []  # a page without /Annots yields None
        for annotation in annotations:
            if annotation[SUBTYPE_KEY] == WIDGET_SUBTYPE_KEY:
                if annotation[ANNOT_FIELD_KEY]:
                    key = annotation[ANNOT_FIELD_KEY][1:-1]
                    if key in data_dict.keys():
                        if isinstance(data_dict[key], bool):
                            if data_dict[key]:
                                annotation.update(pdfrw.PdfDict(
                                    AS=pdfrw.PdfName('Yes')))
                        else:
                            annotation.update(
                                pdfrw.PdfDict(V='{}'.format(data_dict[key]))
                            )
                            annotation.update(pdfrw.PdfDict(AP=''))
    template_pdf.Root.AcroForm.update(pdfrw.PdfDict(NeedAppearances=pdfrw.PdfObject('true')))
    pdfrw.PdfWriter().write(output_pdf_path, template_pdf)

fill_pdf(fillableClueReportPdfFileName,cluePdfName.replace('.pdf','_pdfrw.pdf'),fields)


################################################
#  PART 4: fillpdf
################################################
import fillpdf
from fillpdf import fillpdfs

fillpdfs.write_fillable_pdf(fillableClueReportPdfFileName,cluePdfName.replace('.pdf','_fillpdf.pdf'),fields,flatten=True)
