Extracting paragraphs from a PDF

Posted on 2024-10-22 03:49:10

I'm doing topic modelling on a PDF e-book and need to extract the text paragraph by paragraph. For this I use Apache PDFBox, which extracts text from PDFs efficiently.

// pdDoc is a PDDocument that has already been loaded
PDFTextStripper pdfStrip = new PDFTextStripper();
String parsedText = pdfStrip.getText(pdDoc);

But I cannot extract the paragraphs separately. The tool provides a way to set the paragraph start/end identifiers, but to use that I need to know what identifies a paragraph break.

Is there a way to do this, or is there some other tool that can extract paragraphs effectively?
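
For illustration, here is a minimal sketch of one way this could work. It assumes you pick the marker string yourself and pass it to PDFTextStripper's setParagraphEnd before calling getText. As far as I know, PDFBox detects paragraph boundaries heuristically from the page layout, so how well this works depends on the book. The class ParagraphSplitter, the MARKER value, and the sample text are all hypothetical, and the PDFBox calls appear only in comments so that the snippet runs without the library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox: splits already-extracted text
// into paragraphs on a marker string.
class ParagraphSplitter {
    // Assumed marker; any string that cannot occur in the book's text works.
    static final String MARKER = "<<PARA>>";

    static List<String> split(String text, String marker) {
        List<String> paragraphs = new ArrayList<>();
        // Pattern.quote makes the marker a literal, not a regular expression.
        for (String chunk : text.split(Pattern.quote(marker))) {
            // Collapse line breaks and runs of whitespace inside a paragraph.
            String cleaned = chunk.replaceAll("\\s+", " ").trim();
            if (!cleaned.isEmpty()) {
                paragraphs.add(cleaned);
            }
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        // Stand-in for the string returned by pdfStrip.getText(pdDoc) after
        // calling pdfStrip.setParagraphEnd(MARKER) on the stripper.
        String parsedText = "First paragraph,\nsecond line." + MARKER
                + "Second paragraph." + MARKER;
        for (String p : split(parsedText, MARKER)) {
            System.out.println(p);
        }
    }
}
```

Each element of the returned list can then be fed to the topic model as one document. If paragraphs come out merged or split in the wrong places, the stripper's drop and indent thresholds are probably the settings to experiment with.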
