Scraping structured information from hundreds of Word documents?
I've been tasked with extracting some structured information from hundreds of human-readable documents (mostly MS Word) and putting it into a database. The data is mostly embedded in tables throughout each document, but there is a lot of text between the tables, and although the documents are very similar in structure, there are a few differences. The documents change fairly often (we get an updated version every few months).
So far the only viable option I can think of is to go through all the documents manually and insert/update the information, but I thought I'd ask here whether anyone thinks it's possible to scrape the documents in some way?
Oh, and the data has to be fairly correct...
Comments (1)
I did similar work (without tables though) using a converter from RTF to FO.
You have to convert the docs to RTF, and then to FO, which gives you a nice XML structure of the document. You can then easily parse it and scrape the data.
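To illustrate the last step, here is a minimal sketch of pulling table cells out of FO output with Python's standard library. It assumes the converter emits standard XSL-FO table elements (fo:table, fo:table-row, fo:table-cell) in the usual http://www.w3.org/1999/XSL/Format namespace. The function name extract_tables and the sample document are made up for illustration, and real converter output will be larger and may nest things differently.

```python
import xml.etree.ElementTree as ET

# Standard XSL-FO namespace; assumed to be what the RTF-to-FO converter emits.
FO_URI = "http://www.w3.org/1999/XSL/Format"
FO_NS = {"fo": FO_URI}


def extract_tables(fo_xml: str) -> list[list[list[str]]]:
    """Return every table as a list of rows, each row a list of cell texts.

    Hypothetical helper: it ignores all text between tables and does not
    treat nested tables specially (rows of a nested table would also show
    up in its parent table).
    """
    root = ET.fromstring(fo_xml)
    tables = []
    for table in root.iter(f"{{{FO_URI}}}table"):
        rows = []
        for row in table.iter(f"{{{FO_URI}}}table-row"):
            cells = []
            for cell in row.findall("fo:table-cell", FO_NS):
                # Join all text inside the cell and collapse whitespace.
                text = " ".join("".join(cell.itertext()).split())
                cells.append(text)
            rows.append(cells)
        tables.append(rows)
    return tables


# Simplified, made-up sample; a real FO file also has page masters, flows, etc.
SAMPLE_FO = """<fo:root xmlns:fo="http://www.w3.org/1999/XSL/Format">
  <fo:block>Some text between tables.</fo:block>
  <fo:table>
    <fo:table-body>
      <fo:table-row>
        <fo:table-cell><fo:block>Part</fo:block></fo:table-cell>
        <fo:table-cell><fo:block>Quantity</fo:block></fo:table-cell>
      </fo:table-row>
      <fo:table-row>
        <fo:table-cell><fo:block>Widget</fo:block></fo:table-cell>
        <fo:table-cell><fo:block>12</fo:block></fo:table-cell>
      </fo:table-row>
    </fo:table-body>
  </fo:table>
</fo:root>"""

if __name__ == "__main__":
    for table in extract_tables(SAMPLE_FO):
        for row in table:
            print(row)
```

Since the data has to be fairly correct and the documents differ slightly, it is probably worth checking each extracted table against the header row you expect before inserting anything into the database, and flagging the ones that do not match for manual review.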