How do I extract text from MS Word?
I'm trying to open a Word document, extract all the text in it, and display it to the user using Win32::OLE:
#!/usr/bin/perl
# OLEWord.pl
# Enable strict mode and print warnings
use strict;
use warnings;
# Load OLE plus the Word constants (wdFormatDocument and friends)
use Win32::OLE;
use Win32::OLE::Const 'Microsoft Word';
# Die with a full message on any OLE error
$Win32::OLE::Warn = 3;
# Set the file to be opened (absolute path)
my $file = 'C:\work\Test.docx';
# Start Word through Win32::OLE, die if the application could not be started
my $MSWord = Win32::OLE->new('Word.Application', 'Quit')
    or die "Unable to start Word: ", Win32::OLE->LastError();
# Make the window visible, so that you can see what is going on
$MSWord->{'Visible'} = 1;
# Open the requested file or die with the OLE error
my $Doc = $MSWord->Documents->Open($file)
    or die "Could not open ", $file, " Error: ", Win32::OLE->LastError();
#$MSWord->ActiveDocument->SaveAs({Filename => 'AlteredTest.docx',
#FileFormat => wdFormatDocument});
# Print every property of an OLE object
sub ShowObjs {
    my $obj = shift;
    foreach (sort keys %$obj) {
        print "Keys: $_ - $obj->{$_}\n";
    }
}
my $paragraphs = $Doc->Paragraphs;
ShowObjs($paragraphs);
# Get and print the text inside the opened file
my $txt = $Doc->Range->Text;
print $txt;
$Doc->Close;
$MSWord->Quit;
I'm getting this error code:
OLE exception from "Microsoft Word":
Command Failed
Win32::OLE(0.1709) error 0x800a1066
in METHOD/PROPERTYGET "Open" at OLEWord.pl line 20
Update: I can open the Word application fine; the problem only occurs when I try to open the file.
Comments (2)
I have a few scripts that convert DOC files to various output formats using Win32::OLE. They usually start like this:
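A minimal sketch of such an opening, under stated assumptions: the variable name $input_file_path is the one this answer mentions, the path C:\work\Test.docx is borrowed from the question, and the ReadOnly argument is an illustrative choice rather than something the original scripts are known to use.

```perl
use strict;
use warnings;
use File::Spec;
use Win32::OLE;
use Win32::OLE::Const 'Microsoft Word';

# Croak with a full message on any OLE error
$Win32::OLE::Warn = 3;

# Assumption: path borrowed from the question. rel2abs leaves an
# absolute path as it is and anchors a relative one at the current directory.
my $input_file_path = File::Spec->rel2abs('C:\work\Test.docx');

# Start Word; 'Quit' is invoked when $word is destroyed
my $word = Win32::OLE->new('Word.Application', 'Quit')
    or die "Cannot start Word: ", Win32::OLE->LastError();

# Show the window and let Word display its prompts while debugging
$word->{Visible}       = 1;
$word->{DisplayAlerts} = wdAlertsAll;

# Illustrative choice: open read-only so the script never modifies the file
my $doc = $word->Documents->Open({
    FileName => $input_file_path,
    ReadOnly => 1,
}) or die "Cannot open $input_file_path: ", Win32::OLE->LastError();
```

Once the script works, Visible can be set back to 0 and DisplayAlerts to wdAlertsNone for unattended runs.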
Please note that $input_file_path has to contain the absolute path to your file. You can also enable Visible and DisplayAlerts to see any error Word might give you.
Edit: You can traverse paragraphs using the in enumerator, or you can use Word's own exporting method and save the document as one of the supported formats.
The advantage of the latter method is that formatting is retained where possible, which produces better results for bullets, numbering and such.
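Both approaches might look roughly like the sketch below. The two file paths are hypothetical placeholders, and wdFormatText is just one of the formats Word can export, picked here because the question asks for plain text.

```perl
use strict;
use warnings;
use Win32::OLE qw(in);
use Win32::OLE::Const 'Microsoft Word';

# Croak with a full message on any OLE error
$Win32::OLE::Warn = 3;

# Assumption: both paths are hypothetical placeholders
my $input_file_path  = 'C:\work\Test.docx';
my $output_file_path = 'C:\work\Test.txt';

my $word = Win32::OLE->new('Word.Application', 'Quit')
    or die "Cannot start Word: ", Win32::OLE->LastError();
my $doc = $word->Documents->Open($input_file_path)
    or die "Cannot open $input_file_path: ", Win32::OLE->LastError();

# Option 1: walk the Paragraphs collection with the in() enumerator
for my $paragraph (in($doc->Paragraphs)) {
    my $text = $paragraph->Range->Text;
    $text =~ s/\r\z//;    # Word ends each paragraph with a carriage return
    print "$text\n";
}

# Option 2: let Word export the document itself, here as plain text
$doc->SaveAs({
    FileName   => $output_file_path,
    FileFormat => wdFormatText,
});

$doc->Close(0);    # 0 = wdDoNotSaveChanges
$word->Quit;
```

Option 1 gives per-paragraph control over the output, while option 2 leaves the conversion entirely to Word.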
Win32::OLE can be a little funny about interaction. If anything triggers a prompt, you may get a message like this. Typically, for example, Word may want to open the file read-only and put up a dialog, but these dialogs can break under the default initialization of Win32::OLE. If this is the case, changing that initialization before you do anything like instantiate any objects (i.e., before Win32::OLE->new) might do the trick.
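The exact call this answer has in mind is not shown. One candidate, offered as an assumption rather than as the author's own line, is the Initialize class method documented in Win32::OLE, used with the COINIT_OLEINITIALIZE mode:

```perl
use strict;
use warnings;
use Win32::OLE;

# Assumption: COINIT_OLEINITIALIZE is the mode meant here.
# Initialize must run before the first OLE object is created.
Win32::OLE->Initialize(Win32::OLE::COINIT_OLEINITIALIZE());

# Croak with a full message on any OLE error
$Win32::OLE::Warn = 3;

# Only now create the Word object
my $word = Win32::OLE->new('Word.Application', 'Quit')
    or die "Cannot start Word: ", Win32::OLE->LastError();
```

Whether this removes the "Command failed" error depends on what Word is actually prompting for, so it is worth trying together with Visible and DisplayAlerts enabled.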