Can I read PDF or Word documents with Node.js?

Posted 2024-12-29 09:36:17

I can't find any packages to do this. I know PHP has a ton of libraries for PDFs (like http://www.fpdf.org/) but anything for Node?

Comments (9)

美胚控场 2025-01-05 09:36:17

textract is a great lib that supports PDFs, Doc, Docx, etc.

多情出卖 2025-01-05 09:36:17

It looks like there are a few for PDF, but I didn't find any for Word.

CPU-bound processing like that isn't really Node's strong point anyway (i.e. you get no additional benefit from using Node for it over any other language). A pragmatic approach would be to find a good tool and invoke it from Node.

I have heard good things around the office about docsplit: http://documentcloud.github.com/docsplit/

While it's not Node, you could easily invoke it from Node with child_process.exec: http://nodejs.org/docs/latest/api/all.html#child_process.exec
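As a rough illustration of calling an external tool from Node, here is a minimal sketch. It uses `child_process.execFile` rather than `exec`, so that file names are passed as an argument array and never go through a shell. The helper name `runTool` is made up for this example, and the command being run is just the current Node binary standing in for a real converter; swap in the actual command line of whichever tool you install.

```javascript
const { execFile } = require("node:child_process");
const { promisify } = require("node:util");

const execFileAsync = promisify(execFile);

// Hypothetical helper: run an external command and resolve with its stdout.
// Arguments are passed as an array, so they are not interpreted by a shell.
async function runTool(command, args) {
  const { stdout } = await execFileAsync(command, args, {
    maxBuffer: 64 * 1024 * 1024, // extracted text can be large
  });
  return stdout;
}

// Stand-in "tool": the current Node binary printing a fixed string.
// Replace the command and arguments with those of a real converter.
runTool(process.execPath, ["-e", "console.log('extracted text')"])
  .then((text) => console.log(text.trim()))
  .catch((err) => console.error("tool failed:", err.message));
```

The returned promise rejects if the command cannot be started or exits with a non-zero status, so conversion failures surface in the `catch` handler.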

故事灯 2025-01-05 09:36:17

You can easily convert one into the other, or use, for example, a .doc template to generate a .pdf file, but you will probably want to use an existing web service for this task.

This can be done using the Livedocx service, for example.

To use this service from Node, see node-livedocx. (Disclaimer: I am the author of this Node module.)

千年*琉璃梦 2025-01-05 09:36:17

I would suggest looking into unoconv for your initial conversion. It uses LibreOffice or OpenOffice for the actual conversion, which adds some overhead.

I'd set up a few workers with all the necessary dependencies installed, and use a request/response queue for handling the conversions (you may want to look into kue or zmq).

In general this is a CPU-bound, heavy task that should be offloaded. Pandoc and others specifically mention .docx, not .doc, so they may or may not be options as well.
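To make the worker idea concrete, here is a minimal sketch of a conversion queue. It is not kue or zmq: it is a small in-process stand-in that caps how many conversions run at once, which is the same principle on a single machine. The names `unoconvToPdf` and `createQueue` are invented for this example, and it assumes `unoconv` is installed and on the PATH.

```javascript
const { execFile } = require("node:child_process");

// Convert one file to PDF with unoconv (assumed installed and on PATH).
// By default unoconv writes the result next to the source file.
function unoconvToPdf(inputPath) {
  return new Promise((resolve, reject) => {
    execFile("unoconv", ["-f", "pdf", inputPath], (err) => {
      if (err) reject(err);
      else resolve(inputPath);
    });
  });
}

// Minimal in-process job queue: at most `concurrency` jobs run at once,
// the rest wait in FIFO order.
function createQueue(worker, concurrency) {
  const pending = [];
  let active = 0;

  function next() {
    if (active >= concurrency || pending.length === 0) return;
    const { job, resolve, reject } = pending.shift();
    active += 1;
    Promise.resolve()
      .then(() => worker(job))
      .then(resolve, reject)
      .finally(() => {
        active -= 1;
        next(); // start the next waiting job, if any
      });
  }

  // Returns a promise that settles with the worker's result for this job.
  return function enqueue(job) {
    return new Promise((resolve, reject) => {
      pending.push({ job, resolve, reject });
      next();
    });
  };
}

// Usage (requires unoconv):
// const convert = createQueue(unoconvToPdf, 2);
// convert("report.doc").then((file) => console.log("converted", file));
```

Keeping the concurrency low matters here because each conversion starts a heavyweight office process. For multiple machines, a real queue such as the ones mentioned above would replace `createQueue`.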


Note: I know this question is old; I just wanted to provide a current answer for others coming across it.

つ可否回来 2025-01-05 09:36:17

You can use pdf-text for PDF files. It extracts the text from a PDF into an array of text 'chunks', which is useful for fuzzy parsing of structured PDF text.

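var fs = require('fs')
var path = require('path')
var pdfText = require('pdf-text')

var pathToPdf = path.join(__dirname, 'info.pdf')

// Pass a file path...
pdfText(pathToPdf, function(err, chunks) {
  if (err) return console.error(err)
  // chunks is an array of strings
  // loosely corresponding to text objects within the pdf
  // for a more concrete example, view the test file in the pdf-text repo
  console.log(chunks)
})

// ...or pass a Buffer holding the PDF's contents
var buffer = fs.readFileSync(pathToPdf)
pdfText(buffer, function(err, chunks) {
  if (err) return console.error(err)
  console.log(chunks)
})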

For .docx files you can use mammoth, which extracts the text from .docx files.

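var mammoth = require("mammoth");

mammoth.extractRawText({path: "./doc.docx"})
    .then(function(result){
        var text = result.value; // The raw text
        var messages = result.messages; // Any warnings raised during extraction
        console.log(text);
    })
    .catch(function(error){
        // Report failures such as a missing or unreadable file
        console.error(error);
    });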

I hope this will help.

猫弦 2025-01-05 09:36:17

For parsing PDF files you can use the pdf2json Node module.

It lets you convert a PDF file to JSON as well as to raw text data.

动听の歌 2025-01-05 09:36:17

Another good option if you only need to convert from Word documents is Mammoth.js.

Mammoth is designed to convert .docx documents, such as those created
by Microsoft Word, and convert them to HTML. Mammoth aims to produce
simple and clean HTML by using semantic information in the document,
and ignoring other details. For instance, Mammoth converts any
paragraph with the style Heading 1 to h1 elements, rather than
attempting to exactly copy the styling (font, text size, colour, etc.)
of the heading.

There's a large mismatch between the structure used by .docx and the
structure of HTML, meaning that the conversion is unlikely to be
perfect for more complicated documents. Mammoth works best if you only
use styles to semantically mark up your document.

许你一世情深 2025-01-05 09:36:17

Here is an example showing how to download and extract text from a PDF using PDF.js:

import superagent from 'superagent';
import pdf from 'pdfjs-dist';

const url = 'http://unec.edu.az/application/uploads/2014/12/pdf-sample.pdf';

const main = async () => {
  // Download the PDF into memory
  const response = await superagent.get(url).buffer();
  // Newer pdfjs-dist versions expect a Uint8Array rather than a Buffer
  const data = new Uint8Array(response.body);
  // getDocument() returns a loading task; its `promise` resolves to the document
  const doc = await pdf.getDocument({ data }).promise;
  // PDF pages are numbered from 1
  for (let pageNumber = 1; pageNumber <= doc.numPages; pageNumber++) {
    const page = await doc.getPage(pageNumber);
    const content = await page.getTextContent();
    for (const { str } of content.items) {
      console.log(str);
    }
  }
};

main().catch(error => console.error(error));
病毒体 2025-01-05 09:36:17

You can use Aspose.Words Cloud SDK for Node.js to extract text from DOC/DOCX, OpenOffice, and PDF files. It's a paid API, but the free plan provides 150 free API calls per month.

P.S. I'm a developer evangelist at Aspose.

const { WordsApi, ConvertDocumentRequest } = require("asposewordscloud");
const fs = require('fs');

// Get Customer ID and Customer Key from https://dashboard.aspose.cloud/
const wordsApi = new WordsApi("xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx", "xxxxxxxxxxxxxxxxxxxx");

// Ask the service to convert the uploaded PDF to plain text
const request = new ConvertDocumentRequest({
    format: "txt",
    document: fs.createReadStream("C:/Temp/02_pages.pdf"),
});
const outputFile = "C:/Temp/ConvertPDFtotxt.txt";

wordsApi.convertDocument(request).then((result) => {
    console.log(result.response.statusCode);
    console.log(result.body.byteLength);
    // Save the converted text to disk
    fs.writeFileSync(outputFile, result.body);
}).catch(function(err) {
    // Deal with an error
    console.log(err);
});