I've been tasked with extracting structured information from hundreds of human-readable documents (mostly MS Word) and putting it into a database. Most of the data is embedded in tables spread throughout each document, but there is a lot of text between the tables, and although the documents are very similar in structure, there are a few differences. The documents change fairly often (we get an updated version every few months).
So far the only viable option I can think of is to manually go through all the documents and insert/update the information, but I thought I'd ask here whether anyone thinks it's possible to scrape the documents in some way.
Oh, and the data has to be fairly correct...
Comments (1)
I did similar work (without tables, though) using an RTF-to-FO converter.
You have to convert the docs to RTF, and then to FO, which gives you a nice XML structure of the document. You can then easily parse it and scrape the data.
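For the parsing step, here is a rough sketch of how the tables might be pulled out once you have the FO file. It assumes "FO" means XSL-FO, so that tables appear as `fo:table` / `fo:table-row` / `fo:table-cell` elements in the standard XSL-FO namespace. The `extract_tables` function and the `SAMPLE_FO` snippet are made up for illustration; the output of a real converter will carry far more layout markup and may nest tables, which this sketch does not handle separately.

```python
import xml.etree.ElementTree as ET

# Standard XSL-FO namespace; assumes the converter emits plain XSL-FO.
FO = "http://www.w3.org/1999/XSL/Format"
NS = {"fo": FO}


def extract_tables(fo_xml: str) -> list[list[list[str]]]:
    """Return every table as a list of rows, each row a list of cell strings."""
    root = ET.fromstring(fo_xml)
    tables = []
    for table in root.iter(f"{{{FO}}}table"):
        rows = []
        # Note: iter() also descends into nested tables, if there are any.
        for row in table.iter(f"{{{FO}}}table-row"):
            cells = []
            for cell in row.findall("fo:table-cell", NS):
                # Join all text inside the cell and collapse whitespace.
                cells.append(" ".join("".join(cell.itertext()).split()))
            rows.append(cells)
        tables.append(rows)
    return tables


# Hypothetical, heavily simplified FO fragment used only for illustration.
SAMPLE_FO = """<fo:root xmlns:fo="http://www.w3.org/1999/XSL/Format">
  <fo:block>Some text between tables.</fo:block>
  <fo:table>
    <fo:table-body>
      <fo:table-row>
        <fo:table-cell><fo:block>Part</fo:block></fo:table-cell>
        <fo:table-cell><fo:block>Quantity</fo:block></fo:table-cell>
      </fo:table-row>
      <fo:table-row>
        <fo:table-cell><fo:block>Widget</fo:block></fo:table-cell>
        <fo:table-cell><fo:block>12</fo:block></fo:table-cell>
      </fo:table-row>
    </fo:table-body>
  </fo:table>
</fo:root>"""

if __name__ == "__main__":
    for table in extract_tables(SAMPLE_FO):
        for row in table:
            print(row)
```

Since the data has to be fairly correct, it is probably worth checking each extracted row against what you expect (number of columns, header names) and flagging any document that deviates for manual review rather than loading it blindly.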