What is the best way to programmatically convert a Word document with table structures to XML?

Posted on 2024-07-15 11:34:20

So, I have this Word document that has a whole bunch of tables, some of which are pretty long. In some cases a table spans many pages. I need to programmatically convert this thing to XML.

I was initially told we could just copy and paste into Excel and save it as a CSV, then I could convert from there, which would be pretty easy. However, because of the formatting of some of the fields, the spreadsheet would need a lot of extra manipulation after copying to Excel to get it to look right and to have the CSV come out correctly.

I should note that this is an add-on for an old app written in VB.NET on the 1.1 framework (cue frowny face). However, I'm debating just writing a separate command-line tool in C# on .NET 3.5 if that'll make it easier. It seems like C# has some Word interop stuff that I doubt was in the 1.1 framework, but I haven't investigated that too far.

So, I'm just looking for the best/quickest way this can be achieved. It doesn't matter so much how it's achieved, as long as it's done programmatically. Some of the steps could be done manually if they aren't too tough. For example, if getting the document into some other format first would save a bunch of coding and isn't too difficult, that would be fine.

Has anyone done anything like this before? Any ideas?

Update
Ok, so here is an example of exactly what I'd need to do.

I have a Word doc that looks something like this...

PROTOCOL:  BIRDS           

Field Name      Data Type      Required      Length      Total Digits      Fraction Digits      ValidValues/Comparison      Description
OBSERVATION_ID  Text           Yes           16          n/a               n/a                                              Unique observation identification.  Primary key. 

So, there's the table with its vendor and name (PROTOCOL and BIRDS in this case). As an example it just has one field. Valid values/comparisons can have multiple things separated by commas, where each thing would be enclosed by value tags inside the XML.

Now what I basically need to do is get that to convert to this XML...

<?xml version="1.0" encoding="utf-8"?>
<Formats xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="Formats.xsd">
  <VendorFormats Vendor="PROTOCOL" LastModified="2005-9-13">
    <Format Name="BIRDS" Version="3" VersionDate="2005-9-10">
      <BaseTable>BIRDS</BaseTable>
      <StageTable>STAGE_BIRDS</StageTable>
      <Fields>
        <Text Name="OBSERVATION_ID" Required="Y">
          <NullValue />
          <Description>Unique observation identification.  Primary key.</Description>
          <Length>16</Length>
        </Text>
      </Fields>
    </Format>
  </VendorFormats>
</Formats>

There will always be a base table and a stage table. The base table has the same name as whatever follows the colon in the table heading (PROTOCOL: BIRDS, so it would be BIRDS), and the stage table is always STAGE_ followed by that same name. You'll also notice the version, the last modified date, and the version date in the XML. These can be worried about later and perhaps added manually.
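To make the mapping above concrete, here is a minimal Python sketch of just the XML-building step. It assumes the table has already been pulled out of Word as a heading string plus one dict per row, keyed by the column names shown above. The function names `split_heading` and `build_formats` are hypothetical, and so are three choices the post doesn't pin down: writing Required="N" for anything other than "Yes", emitting each valid value as a `Value` element directly under the field, and skipping Total Digits / Fraction Digits (their element names aren't given). Version and date attributes are left out since they may be added by hand later.

```python
import xml.etree.ElementTree as ET


def split_heading(heading):
    """Split 'PROTOCOL:  BIRDS' into ('PROTOCOL', 'BIRDS')."""
    vendor, _, name = heading.partition(":")
    return vendor.strip(), name.strip()


def build_formats(heading, rows):
    """Build the Formats XML string for one table (hypothetical helper)."""
    vendor, name = split_heading(heading)
    # Writing xmlns as a plain attribute is a shortcut that keeps the sketch short.
    root = ET.Element("Formats", {"xmlns": "Formats.xsd"})
    vendor_el = ET.SubElement(root, "VendorFormats", {"Vendor": vendor})
    fmt = ET.SubElement(vendor_el, "Format", {"Name": name})
    ET.SubElement(fmt, "BaseTable").text = name
    ET.SubElement(fmt, "StageTable").text = "STAGE_" + name
    fields = ET.SubElement(fmt, "Fields")
    for row in rows:
        # Assumption: anything other than "Yes" is written as "N".
        required = "Y" if row["Required"].strip().lower() == "yes" else "N"
        # The element name is the data type, as in <Text Name="..."> above.
        field = ET.SubElement(
            fields,
            row["Data Type"].strip(),
            {"Name": row["Field Name"].strip(), "Required": required},
        )
        ET.SubElement(field, "NullValue")
        ET.SubElement(field, "Description").text = row["Description"].strip()
        length = row.get("Length", "").strip()
        if length and length.lower() != "n/a":
            ET.SubElement(field, "Length").text = length
        # Assumption: one <Value> element per comma-separated valid value.
        for value in row.get("ValidValues/Comparison", "").split(","):
            if value.strip():
                ET.SubElement(field, "Value").text = value.strip()
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample_row = {
        "Field Name": "OBSERVATION_ID",
        "Data Type": "Text",
        "Required": "Yes",
        "Length": "16",
        "ValidValues/Comparison": "",
        "Description": "Unique observation identification.  Primary key.",
    }
    print(build_formats("PROTOCOL:  BIRDS", [sample_row]))
```

The output is not pretty-printed, and a data type containing spaces would not be a valid element name, so real data may need a lookup table from Word's type names to element names.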

Comments (1)

微暖i 2024-07-22 11:34:20

You should realize that there is no such thing as an MS Word document. There are numerous formats, and some early formats are not deserving of the name; they are better described as memory dumps of hacky compressed text.
You're not really in need of XML; that is a later concern. First you have to take control of the data in the document. Unless it is in one of the newest, somewhat documented formats, you have but one option: hack it out. Write a program to manipulate the document until you get what you want.
The only one who knows MS-Word formats is MS-Word herself. So if you can convince her to dump the content to a more-or-less defined format like RTF, you have a better starting point.
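If Word can be convinced to save the file as .docx (one of the newer, documented formats), the tables can be read without Word at all, because a .docx is a zip archive whose main part, `word/document.xml`, stores tables as `w:tbl` / `w:tr` / `w:tc` elements. The following is a rough Python sketch under that assumption, using only the standard library. The function name `read_docx_tables` is hypothetical, and the sketch ignores merged cells and treats nested tables naively.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the main document part of a .docx file.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def read_docx_tables(source):
    """Return every table as a list of rows, each row a list of cell strings.

    `source` is a path or a binary file-like object holding a .docx file.
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    tables = []
    for tbl in root.iter(W + "tbl"):
        rows = []
        for tr in tbl.findall(W + "tr"):
            cells = []
            for tc in tr.findall(W + "tc"):
                # A cell's text is spread over one or more w:t runs.
                text = "".join(t.text or "" for t in tc.iter(W + "t"))
                cells.append(text.strip())
            rows.append(cells)
        tables.append(rows)
    return tables
```

Each returned table is a plain list of rows, so the first row can serve as the column headers when turning the remaining rows into dicts for the XML step.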
