Extracting raw data from Word tables using Perl

Posted 2024-11-25 08:08:59 · 1753 characters · 2 views · 0 comments

I'm trying to extract data from multiple tables in a Word document. When I try to convert the table data to text, I get an error. The ConvertToText method takes two optional parameters (how to separate the data, and a boolean). Here is my current code:

#usr/bin/perl
#OLEWord.pl

#Use strict and print warnings
use strict; use warnings;
#Using OLE + OLE constants for Variants and OLE enumeration for Enumerations
use Win32::OLE qw(in);
use Win32::OLE::Const 'Microsoft Word';
use Win32::OLE::Variant;

my $var1 = Win32::OLE::Variant->new(VT_BOOL, 'true');

$Win32::OLE::Warn = 3;

#set the file to be opened
my $file = 'C:\work\SCL_International Financial New Fund Setup Questionnaire V1.6.docx';

#Create a new instance of Win32::OLE for the Word application, die if could not open the application
my $MSWord = Win32::OLE->GetActiveObject('Excel.Application') or Win32::OLE->new('Word.Application','Quit');

#Set the screen to Visible, so that you can see what is going on
$MSWord->{'Visible'} = 1;
$MSWord->{'DisplayAlerts'} = 0; #Suppress alerts, such as 'Save As....'

#open the request file or die and print warning message
my $Doc = $MSWord->{'Documents'}->Open($file) or die "Could not open ", $file, " Error:", Win32::OLE->LastError();

#$MSWord->ActiveDocument->SaveAs({Filename => 'AlteredTest.docx', 
                            #FileFormat => wdFormatDocument});

my $tables = $MSWord->ActiveDocument->{'Tables'};

for my $table (in $tables){
   my $tableText = $table->ConverToText(wdSeparateByParagraphs,$var1);
   print "Table: ", $tableText, "\n";
}


$MSWord->ActiveDocument->Close;
$MSWord->Quit;

I'm getting this error:

Bareword "VT_BOOL" not allowed while "strict subs" in use at OLEWord.pl line 31
Bareword "true" not allowed while "strict subs" in use at OLEWord.pl line 31
Execution of OLEWord.pl aborted due to compilation errors.

Comments (4)

凉世弥音 2024-12-02 08:08:59

When names like VT_BOOL are not defined as constants, Perl treats them as barewords. Others have already provided information on barewords.

The root cause of your problem is that the constants exported by the Win32::OLE::Variant module are missing. Add:

use Win32::OLE::Variant;

to your script to remove the first error. The second error is a similar problem: true is not defined either. Replace it with 1, or define the constant yourself with:

use constant true => 1;
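
As a minimal sketch of the point above, the following pure-Perl snippet compiles two tiny code strings and reports whether each one passes under strict. The helper name compile_error and the Sandbox package names are hypothetical and exist only for this illustration; no Win32::OLE is involved.

```perl
use strict;
use warnings;

# Compile a snippet inside a string eval so that a compile-time
# failure can be captured instead of aborting the whole script.
# Returns '' on success, or the error message on failure.
sub compile_error {
    my ($package, $code) = @_;
    my $ok = eval "package $package; use strict; $code; 1";
    return $ok ? '' : $@;
}

# Without a definition, "true" is a bareword and strict rejects it.
my $without = compile_error('SandboxA', 'my $flag = true');

# With a constant defined first, "true" is an ordinary constant sub.
my $with = compile_error('SandboxB', 'use constant true => 1; my $flag = true');

print "without constant: ", ($without || "ok\n");
print "with constant:    ", ($with    || "ok\n");
```

The first line of output should carry the same "Bareword ... not allowed" message as in the question, while the second snippet compiles cleanly.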

Edit: Here is an example of extracting the table text:

my $tables = $MSWord->ActiveDocument->{'Tables'};
for my $table (in $tables){
   my $tableText = $table->ConvertToText({ Separator => wdSeparateByTabs });
   print "Table: ", $tableText->Text(), "\n";
}

Your code also has a typo in the method name: ConverToText should be ConvertToText. In addition, the method returns a Range object, so you have to call its Text method to get the actual text.
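
Once the text is available, it usually still has to be split into rows and cells. The sketch below assumes the string returned by Text() after a tab-separated conversion has one paragraph per table row, terminated by a carriage return, with cells separated by tabs; cells that themselves contain paragraph breaks would not survive this scheme. The function name rows_from_converted_text and the sample data are hypothetical.

```perl
use strict;
use warnings;

# Split text such as "a\tb\rc\td\r" into an array of rows,
# where each row is an array reference of cell strings.
sub rows_from_converted_text {
    my ($text) = @_;
    my @rows;
    for my $line (split /\r\n?|\n/, $text) {
        next if $line eq '';                    # skip empty paragraphs
        push @rows, [ split /\t/, $line, -1 ];  # -1 keeps trailing empty cells
    }
    return \@rows;
}

# Stand-in for the value of $tableText->Text() in the example above.
my $sample = "Fund\tCurrency\rAlpha\tUSD\rBeta\t\r";

my $rows = rows_from_converted_text($sample);
for my $row (@$rows) {
    print join(' | ', @$row), "\n";
}
```

Passing -1 as the limit to split matters here: without it, Perl drops trailing empty fields, so a row whose last cell is empty would appear to have fewer columns.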

一向肩并 2024-12-02 08:08:59

A 'Bareword' error is caused by a syntax error in your code. A
'runaway multi-line' usually pinpoints where the start of the error
is, and usually means that a line has not been completed, often
because of mismatched brackets or quote marks.

As has been pointed out by several SO-ers, that doesn't look like
Perl! The Perl interpreter is balking on a syntax error because it
doesn't speak that particular language! Source

Without use strict you will not get this error (but you should keep it enabled for good code).

Read about barewords so that you know what they are and can work out for yourself how to correct this error.

Here are some links for studying barewords:
1. perl.com
2. alumnus

孤者何惧 2024-12-02 08:08:59

Removing "use strict" will remove the "Bareword" errors.

梦魇绽荼蘼 2024-12-02 08:08:59

Extract all the doc tables into a single xls file:

     sub doParseDoc {

           my $msg     = '' ; 
           my $ret     = 1 ; # assume failure at the beginning ...

           $msg        = 'START --- doParseDoc' ; 
           $objLogger->LogDebugMsg( $msg );
           $msg        = 'using the following DocFile: "' . $DocFile . '"' ; 
           $objLogger->LogInfoMsg( $msg );
           #-----------------------------------------------------------------------
           #Using OLE + OLE constants for Variants and OLE enumeration for Enumerations


           # Create a new Excel workbook
           my $objWorkBook = Spreadsheet::WriteExcel->new("$DocFile" . '.xls');

           # Add a worksheet
           my $objWorkSheet = $objWorkBook->add_worksheet();


           my $var1 = Win32::OLE::Variant->new(VT_BOOL, 'true');

           Win32::OLE->Option(Warn => \&Carp::croak);
           use constant true => 1;

           # at this point you should have the Word application open in the UI
           # with the DocFile
           # build the MS Word object during run-time 
           my $objMSWord = Win32::OLE->GetActiveObject('Word.Application')
                             or Win32::OLE->new('Word.Application', 'Quit');  

           # build the doc object during run-time 
           my $objDoc   = $objMSWord->Documents->Open($DocFile)
                 or die "Could not open ", $DocFile, " Error:", Win32::OLE->LastError();

           #Set the screen to Visible, so that you can see what is going on
           $objMSWord->{'Visible'} = 1;
           # try NOT printing directly to the file


            #$objMSWord->ActiveDocument->SaveAs({Filename => 'AlteredTest.docx', 
                                        #FileFormat => wdFormatDocument});

           my $tables        = $objMSWord->ActiveDocument->Tables();
           my $tableText     = '' ;   
           my $xlsRow        = 1 ; 

           for my $table (in $tables){
              # extract the table text as a single string
              #$tableText = $table->ConvertToText({ Separator => 'wdSeparateByTabs' });
              # cheated those properties from here: 
              # https://msdn.microsoft.com/en-us/library/aa537149(v=office.11).aspx#officewordautomatingtablesdata_populateatablewithdata
              my $RowsCount = $table->{'Rows'}->{'Count'} ; 
              my $ColsCount = $table->{'Columns'}->{'Count'} ; 

              # discard tables that do not have exactly 5 columns
              next unless ( $ColsCount == 5 ) ;

              $msg           = "Rows Count: $RowsCount " ; 
              $msg           .= "Cols Count: $ColsCount " ; 
              $objLogger->LogDebugMsg ( $msg ) ; 

              #my $tableRange = $table->ConvertToText({ Separator => '##' });
              # OBS !!! simple print WILL print to your doc file use Select ?!
              #$objLogger->LogDebugMsg ( $tableRange . "\n" );
              # Word table cells are 1-based, so start both loops at 1
              # (start $row at 2 to skip a header row)
              foreach my $row ( 1..$RowsCount ) {
                 foreach my $col ( 1..$ColsCount ) {

                    # nope ... $table->cell($row,$col)->->{'WrapText'} = 1 ; 
                    # nope $table->cell($row,$col)->{'WordWrap'} = 1  ;
                    # so so $table->cell($row,$col)->WordWrap() ; 

                    my $txt = ''; 
                    # well some 1% of the values are so nasty that we really give up on them ... 
                    eval {
                       $txt = $table->cell($row,$col)->range->{'Text'}; 
                       #replace all the ctrl chars by space
                       $txt =~ s/\r/ /g   ; 
                       $txt =~ s/[^\040-\176]/ /g  ; 
                       # perform some cleansing - ColName<primary key>=> ColName
                       #$txt =~ s#^(.[a-zA-Z_0-9]*)(\<.*)#$1#g ; 

                       # this will most probably break your cmd ... 
                       # $objLogger->LogDebugMsg ( "row: $row , col: $col with txt: $txt \n" ) ; 
                    } or $txt = 'N/A' ; 

                    # Write a formatted and unformatted string, row and column notation.
                    $objWorkSheet->write($xlsRow, $col, $txt);

                 } #eof foreach col

                 # we just want to dump all the tables into the one sheet
                 $xlsRow++ ; 
               } #eof foreach row
               sleep 1 ; 
           }  #eof foreach table

           # close the opened in the UI document
           $objMSWord->ActiveDocument->Close;

           # OBS !!! now we are able to print 
           $objLogger->LogDebugMsg ( $tableText . "\n" );

           # exit the whole Word application
           $objMSWord->Quit;

           return ( $ret , $msg ) ; 
     }
     #eof sub doParseDoc
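
The cleanup step inside the eval block can be pulled out into a small helper that is easy to test without Word running. This is a sketch under one assumption: the text of a Word table cell ends with an end-of-cell marker (a carriage return followed by the BEL character, \x07). The name clean_cell_text is hypothetical; the printable-ASCII filter mirrors the regex used in the sub above.

```perl
use strict;
use warnings;

# Normalise the raw text of one table cell for writing to a spreadsheet.
sub clean_cell_text {
    my ($raw) = @_;
    return 'N/A' unless defined $raw;       # same fallback as the sub above

    (my $txt = $raw) =~ s/\r?\x07\z//;      # drop the assumed end-of-cell marker
    $txt =~ s/[\r\n\x0B]+/ /g;              # breaks inside the cell become spaces
    $txt =~ s/[^\x20-\x7E]/ /g;             # keep printable ASCII only
    $txt =~ s/\s+\z//;                      # trim trailing whitespace
    return $txt;
}

# Example values shaped like what range->{'Text'} is assumed to return.
for my $raw ("Alpha\r\x07", "Line1\rLine2\r\x07", undef) {
    print '[', clean_cell_text($raw), "]\n";
}
```

Keeping this logic in its own function means the eval in the loop only has to guard the OLE call itself, so a failure while reading a cell is not confused with a problem in the cleanup.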