Convert a Word doc or docx file to a text file?

Posted on 2024-07-26 20:50:48

I need a way to convert .doc or .docx files to .txt without installing anything. Obviously, I also don't want to open Word manually to do this; it just needs to run automatically.

I was thinking that either Perl or VBA could do the trick, but I can't find anything online for either.

Any suggestions?

Comments (11)

红焚 2024-08-02 20:50:48

A simple Perl only solution for docx:

  1. Use Archive::Zip to get the word/document.xml file from your docx file. (A docx is just a zipped archive.)

  2. Use XML::LibXML to parse it.

  3. Then use XML::LibXSLT to transform it into text or HTML format. Search the web to find a nice docx2txt.xsl file :)

Cheers !

J.
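
The three steps above can also be sketched with nothing but a standard library. The following is a minimal illustration in Python rather than Perl: it swaps Archive::Zip and XML::LibXML for Python's zipfile and xml.etree.ElementTree and skips the XSLT step entirely. It assumes that each w:p element is a paragraph and that the visible text sits in w:t elements; the function name docx_to_text is hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements inside word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_to_text(path):
    """Return the text of a .docx file, one line per paragraph (hypothetical helper)."""
    # A .docx is a zip archive; the main body lives in word/document.xml.
    with zipfile.ZipFile(path) as archive:
        xml_bytes = archive.read("word/document.xml")

    root = ET.fromstring(xml_bytes)
    paragraphs = []
    # Each w:p is a paragraph; its text runs are stored in w:t descendants.
    for para in root.iter(f"{{{W_NS}}}p"):
        runs = [node.text or "" for node in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

Calling docx_to_text("input.docx") and writing the result to a .txt file would complete the conversion. This sketch ignores tabs, manual line breaks, headers, footers and footnotes, so a real stylesheet such as the docx2txt.xsl mentioned above will likely give better output.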

趁年轻赶紧闹 2024-08-02 20:50:48

Note that an excellent source of information for Microsoft Office applications is the Object Browser. You can access it via Tools → Macro → Visual Basic Editor. Once you are in the editor, hit F2 to browse the interfaces, methods, and properties provided by Microsoft Office applications.

Here is an example using Win32::OLE:

#!/usr/bin/perl

use strict;
use warnings;

use File::Spec::Functions qw( catfile );

use Win32::OLE;
use Win32::OLE::Const 'Microsoft Word';
$Win32::OLE::Warn = 3;

my $word = get_word();
$word->{Visible} = 0;

my $doc = $word->{Documents}->Open(catfile $ENV{TEMP}, 'test.docx');

$doc->SaveAs(
    catfile($ENV{TEMP}, 'test.txt'),
    wdFormatTextLineBreaks
);

$doc->Close(0);

sub get_word {
    my $word;
    eval {
        $word = Win32::OLE->GetActiveObject('Word.Application');
    };

    die "$@\n" if $@;

    unless(defined $word) {
        $word = Win32::OLE->new('Word.Application', sub { $_[0]->Quit })
            or die "Oops, cannot start Word: ",
                   Win32::OLE->LastError, "\n";
    }
    return $word;
}
__END__

捶死心动 2024-08-02 20:50:48

For .doc, I've had some success with the Linux command-line tool antiword. It extracts the text from a .doc very quickly and renders indentation well. You can then redirect its output to a text file in bash.

For .docx, I've used the OOXML SDK, as some other users mentioned. It is just a .NET library that makes it easier to work with the OOXML zipped up in an OOXML file. There is a lot of metadata that you will want to discard if you are only interested in the text. I see that other people have already written this code: DocXToText.

I have also found that Aspose.Words has a very simple API with great support.

There is also this bash command from commandlinefu.com which works by unzipping the .docx:

unzip -p some.docx word/document.xml | sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g'

恋你朝朝暮暮 2024-08-02 20:50:48

I strongly recommend Aspose.Words if you can work in Java or .NET. It can convert between all major text file types without Word installed.

酷炫老祖宗 2024-08-02 20:50:48

If you have some flavour of unix installed, you can use the 'strings' utility to find and extract all readable strings from the document. There will be some mess before and after the text you are looking for, but the results will be readable.
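
To make the idea concrete, here is a rough sketch in Python of what a strings-style scan does: it pulls out every run of printable ASCII bytes at or above a minimum length. It is an approximation of the technique, not a reimplementation of the real utility, and the function name printable_strings and the default of four characters are assumptions chosen for illustration.

```python
import re


def printable_strings(data, min_len=4):
    """Yield runs of at least min_len printable ASCII bytes found in data."""
    # Printable ASCII is 0x20 (space) through 0x7E (~); tab is allowed too.
    pattern = re.compile(rb"[\t\x20-\x7e]{%d,}" % min_len)
    for match in pattern.finditer(data):
        yield match.group().decode("ascii")
```

Reading a file in binary mode and printing each item from printable_strings(handle.read()) gives output comparable to the command-line tool. Because this sketch only recognises single-byte ASCII, any text that the document stores in a two-byte encoding would be missed.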

顾铮苏瑾 2024-08-02 20:50:48

Note that you can also use OpenOffice to perform miscellaneous document, drawing, spreadsheet, etc. conversions on both Windows and *nix platforms.

You can access OpenOffice programmatically (in a way analogous to COM on Windows) via UNO from a variety of languages for which a UNO binding exists, including from Perl via the OpenOffice::UNO module.

On the OpenOffice::UNO page you will also find a sample Perl scriptlet that opens a document. All you then need to do is export it to txt using the document.storeToURL() method; see a Python example, which can easily be adapted to your Perl needs.

℉服软 2024-08-02 20:50:48

Sinan Ünür's method works well.
However, Word crashed on some of the files I was converting.

Another method is to use Win32::OLE and Win32::Clipboard as follows:

  • Open the Word document
  • Select all the text
  • Copy it to the clipboard
  • Print the content of the clipboard to a txt file
  • Empty the clipboard and close the Word document

Based on the script given by Sigvald Refsu in http://computer-programming-forum.com/53-perl/c44063de8613483b.htm, I came up with the following script.

Note: I chose to save the txt file with the same basename as the .docx file and in the same folder, but this can easily be changed.

########################################### 
use strict; 
use File::Spec::Functions qw( catfile );
use FindBin '$Bin';
use Win32::OLE qw(in with); 
use Win32::OLE::Const 'Microsoft Word'; 
use Win32::Clipboard; 

my $monitor_word=0; # set 1 to watch MS Word being opened and closed

sub docx2txt {
    # Note: the path shall be in the form "C:\dir\ with\ space\file.docx"; 
    my $docx_file=shift; 
    
    # MS Word object
    my $Word = Win32::OLE->new('Word.Application', 'Quit') or die "Couldn't run Word"; 
    # Monitor what happens in MS Word 
    $Word->{Visible} = 1 if $monitor_word; 
    
    #Open file 
    my $Doc = $Word->Documents->Open($docx_file); 
    with ($Doc, ShowRevisions => 0); # Turn off revision marks 
    
    # Select the complete document
    $Doc->Select(); 
    my $Range = $Word->Selection();
    with ($Range, ExtendMode => 1);
    $Range->SelectAll(); 
    
    # Copy selection to clipboard 
    $Range->Copy();
    
    # Create txt file 
    my $txt_file=$docx_file; 
    $txt_file =~ s/\.docx$/.txt/;
    # Use a lexical filehandle and three-argument open; $! holds the OS error message
    open(my $text_fh, '>', $txt_file) or die "Error while trying to write to $txt_file ($!)"; 
    print {$text_fh} Win32::Clipboard::Get(), "\n"; 
    close $text_fh; 

    # Empty the Clipboard (to prevent warning about "huge amount of data in clipboard")
    Win32::Clipboard::Set("");
    
    # Close Word file without saving 
    $Doc->Close({SaveChanges => wdDoNotSaveChanges});

    # Disconnect OLE 
    undef $Word; 
}

Hope it helps.

飘然心甜 2024-08-02 20:50:48

.doc files that use WordprocessingML, and .docx files with their XML format, can have their XML parsed to retrieve the actual text of the document. You'll have to read the specifications to figure out which tags contain readable text.
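
As an illustration of why the specification matters, the sketch below (Python, standard library only) collects text from one paragraph of .docx XML. It assumes, based on my reading of the format, that visible text lives in w:t elements inside runs (w:r), that w:tab and w:br or w:cr inside a run stand for a tab and a line break, and that a w:tab under the paragraph properties is only a tab-stop definition and must be ignored. The function name paragraph_text is hypothetical.

```python
import xml.etree.ElementTree as ET

# Clark-notation prefix for the WordprocessingML namespace.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def paragraph_text(para):
    """Return the readable text of one w:p element (hypothetical helper)."""
    pieces = []
    # Only look inside runs, so tab-stop definitions in w:pPr are skipped.
    for run in para.iter(W + "r"):
        for child in run:
            if child.tag == W + "t":
                pieces.append(child.text or "")
            elif child.tag == W + "tab":
                pieces.append("\t")
            elif child.tag in (W + "br", W + "cr"):
                pieces.append("\n")
    return "".join(pieces)
```

Other elements, such as deleted text under tracked changes, are left out here on purpose; whether to include them depends on what you count as the document's text.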

笑红尘 2024-08-02 20:50:48

You can't do it in VBA if you don't want to start Word (or another Office application). Even if you meant VB, you'd still have to start a (hidden) instance of Word to do the processing.

淑女气质 2024-08-02 20:50:48

I need a way to convert .doc or .docx extensions to .txt without installing anything

for I in *.doc *.docx; do mv "$I" "$(echo "$I" | sed -E 's/\.docx?$/.txt/')"; done

Just joking.

You could use antiword for the older versions of Word documents, and try to parse the XML of the new ones.

£烟消云散 2024-08-02 20:50:48

With docxtemplater, you can easily get the full text of a Word document (works with docx only).

Here's the code (Node.JS):

DocxTemplater=require('docxtemplater');
doc=new DocxTemplater().loadFromFile("input.docx");
result=doc.getFullText();

This is just three lines of code and doesn't depend on any Word instance (it is all plain JS).
