Extract text from Word or PDF based on format (font name and size)
I need to parse a large document (about 1,000 pages, either Word or PDF) and place some of its text into database fields.
The only thing that distinguishes the text I want to extract is its formatting: it is always "Helvetica-Condensed", size 12.
Can I do that? I know how to use string functions, but what should I use to test the formatting?
As I said, the text is stored in a Word document or a PDF.
If a third-party component can do this, that is no problem; please point me to it.
Thanks
Comments (2)
There is QuickPDF. The price is $249.00.
The other option is to code it yourself. The file specification is available online, and if you're only trying to rip the text out of the document, it should guide you most of the way.
The only thing to be careful of is documents that are built entirely from images. In that scenario (no matter what you use to read the file) you will also need an OCR-type application. To check whether this is the case, open a sample of the type of file you want to extract text from, select the text, copy it, and try pasting it into Notepad. If no text comes through, the document is image-based.
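For the PDF case, here is a minimal sketch of the "test the format" step in Python. It assumes the open-source PyMuPDF library (imported as `fitz`), which was not mentioned in this thread; its `page.get_text("dict")` call reports a font name and size for each text span. The helper names (`matches_format`, `filter_spans`, `iter_pdf_spans`) and the 0.1 pt size tolerance are made up for illustration, and the font name your PDF actually reports may differ from "Helvetica-Condensed", so print a few spans first to check.

```python
from typing import Iterable, Iterator

TARGET_FONT = "Helvetica-Condensed"  # font name taken from the question
TARGET_SIZE = 12.0                   # point size taken from the question


def matches_format(span: dict, font: str = TARGET_FONT,
                   size: float = TARGET_SIZE, tolerance: float = 0.1) -> bool:
    """Return True if a span dict carries the wanted font name and size."""
    # Embedded font subsets can be named like "ABCDEF+Helvetica-Condensed",
    # so drop any prefix before the "+" (harmless if there is none).
    name = span.get("font", "").split("+")[-1]
    # Sizes are floats in the PDF, so compare with a small tolerance (assumed).
    return name == font and abs(span.get("size", 0.0) - size) <= tolerance


def filter_spans(spans: Iterable[dict]) -> Iterator[str]:
    """Yield the stripped text of every non-empty span that matches."""
    for span in spans:
        text = span.get("text", "").strip()
        if text and matches_format(span):
            yield text


def iter_pdf_spans(path: str) -> Iterator[dict]:
    """Yield every text span of a PDF, page by page (requires PyMuPDF)."""
    import fitz  # PyMuPDF, imported lazily so the filter works without it

    with fitz.open(path) as doc:
        for page in doc:
            for block in page.get_text("dict")["blocks"]:
                # Image blocks have no "lines" key, so default to an empty list.
                for line in block.get("lines", []):
                    yield from line["spans"]
```

Usage would look like `for text in filter_spans(iter_pdf_spans("report.pdf")): ...`, where `report.pdf` is a placeholder path, with each `text` then written to the database. Because pages are processed one at a time, a 1,000-page file does not need to be held in memory as a whole. For Word files the same filter idea should carry over to whatever library exposes per-run font information (python-docx, for example, has `run.font.name` and `run.font.size`), with the caveat that those values can be empty when the formatting is inherited from a style.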