Get bibliographic data from the text of a PDF and export it to a Windows Form
I use iText 5 for .NET (iTextSharp) to extract text from a PDF with the code below.
// Requires: using System.Text; using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser;
private void button1_Click(object sender, EventArgs e)
{
    // Open the PDF once and reuse the same reader.
    PdfReader reader = new PdfReader("Scharfetter1969.pdf");
    try
    {
        StringBuilder text = new StringBuilder();
        // Only the first page is read here; change the condition to
        // i <= reader.NumberOfPages to read the whole file.
        for (int i = 1; i < 2; i++)
        {
            // Use a fresh strategy per page: SimpleTextExtractionStrategy accumulates text across calls.
            ITextExtractionStrategy its = new SimpleTextExtractionStrategy();
            text.Append(PdfTextExtractor.GetTextFromPage(reader, i, its));
        }
        // .NET strings are already Unicode, so no re-encoding step is needed.
        textBox1.Text = text.ToString();
    }
    finally
    {
        reader.Close();
    }
}
But I want to get the bibliographic data from a research paper PDF.
Here is an example of the data that was extracted from this PDF (in EndNote format). Here's a link!
%0 Journal Article
%T Repeated temperature modulation epitaxy for p-type doping and light-emitting diode based on ZnO
%A Tsukazaki, A.
%A Ohtomo, A.
%A Onuma, T.
%A Ohtani, M.
%A Makino, T.
%A Sumiya, M.
%A Ohtani, K.
%A Chichibu, S.F.
%A Fuke, S.
%A Segawa, Y.
%J Nature Materials
%V 4
%N 1
%P 42-46
%@ 1476-1122
%D 2004
%I Nature Publishing Group
Keep in mind that this bibliographic information is not available in the PDF's metadata. I want to access the article type (%0), title (%T), authors (%A), date (%D), and publisher (%I), and show each one in its own text box on a Windows Form.
I am using C#. If anyone has code for this, or can guide me on how to do it, please share.
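For the display half of the problem, here is a minimal sketch of how an EndNote-tagged record like the one above could be split into fields once it is available as plain text. It assumes every field sits on its own line in the form "%X value". `EndNoteParser`, `Parse`, `Get`, and the text box names are hypothetical names introduced for illustration. It does not solve the harder step of recovering the record from the PDF.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of iTextSharp.
public static class EndNoteParser
{
    // Maps each tag (e.g. "%T") to the list of values seen for it, in order.
    public static Dictionary<string, List<string>> Parse(string record)
    {
        Dictionary<string, List<string>> fields = new Dictionary<string, List<string>>();
        string[] lines = record.Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
        foreach (string raw in lines)
        {
            string line = raw.Trim();
            // Assumed line shape: percent sign, one tag character, a space, then the value.
            if (line.Length < 3 || line[0] != '%' || line[2] != ' ')
                continue;
            string tag = line.Substring(0, 2);
            string value = line.Substring(3).Trim();
            List<string> values;
            if (!fields.TryGetValue(tag, out values))
            {
                values = new List<string>();
                fields[tag] = values;
            }
            values.Add(value);
        }
        return fields;
    }

    // Returns all values for a tag joined with a separator, or "" if the tag is absent.
    public static string Get(Dictionary<string, List<string>> fields, string tag, string separator)
    {
        List<string> values;
        return fields.TryGetValue(tag, out values) ? string.Join(separator, values.ToArray()) : "";
    }
}

// Usage inside a form (text box names are placeholders):
//   Dictionary<string, List<string>> f = EndNoteParser.Parse(endNoteText);
//   typeTextBox.Text      = EndNoteParser.Get(f, "%0", "; ");
//   titleTextBox.Text     = EndNoteParser.Get(f, "%T", " ");
//   authorsTextBox.Text   = EndNoteParser.Get(f, "%A", "; ");
//   dateTextBox.Text      = EndNoteParser.Get(f, "%D", "; ");
//   publisherTextBox.Text = EndNoteParser.Get(f, "%I", "; ");
```

Repeated tags such as %A are kept as a list, so the ten authors in the sample stay in their original order and can be joined however the form needs.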
Comments (1)
PDF is a one-way format. You put data in so that it renders consistently on all devices (monitors, printers, etc.), but the format was never intended for pulling data back out. Any attempt to do that is pure guesswork. iText's PdfTextExtractor works, but you will have to piece things together based on your own arbitrary set of rules, and these rules will probably change from PDF to PDF. The supplied PDF was created by InDesign, which does such a great job of making text look good that it actually makes it even harder to parse the data back out.
That said, if your PDFs are all visually consistent, you could try to pull the data out while retaining formatting and use the formatting rules to guess what is what. That post will get you some HTML formatting that you could guess at. (If this actually works I'd recommend returning something more specific than HTML, but I'll leave that up to you.)
Running it against your supplied PDF shows that the title uses the font HelveticaNeue-LightExt at about 17pt, so you could write a rule that looks for all lines using that font at that size and combines them. Authors are set in HelveticaNeue-Condensed at about 10pt, so that's another rule.
The below code is a modified version of the one linked to above. It's a full working C# 2010 WinForms app targeting iTextSharp 5.1.1.0. It pulls out the title and authors for the supplied PDF, but you'll need to tweak it for other PDFs and metadata. See the comments in the code for specific implementation details.
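As a rough illustration of such a font-and-size rule (a sketch, not the answer's original listing), a custom extraction strategy might keep only the text chunks whose font name and approximate height match. `FontRuleExtractionStrategy` is a hypothetical class name, "paper.pdf" is a placeholder path, and the size ranges are assumptions chosen around the "about 17pt" and "about 10pt" figures. The height is estimated from the distance between the baseline and the ascent line, so it is only approximate and assumes unrotated text.

```csharp
using System;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical strategy: collects text whose font name and approximate size match one rule.
public class FontRuleExtractionStrategy : ITextExtractionStrategy
{
    private readonly string fontName;
    private readonly float minSize;
    private readonly float maxSize;
    private readonly StringBuilder result = new StringBuilder();

    public FontRuleExtractionStrategy(string fontName, float minSize, float maxSize)
    {
        this.fontName = fontName;
        this.minSize = minSize;
        this.maxSize = maxSize;
    }

    public void RenderText(TextRenderInfo renderInfo)
    {
        // Embedded subset fonts are often prefixed (e.g. "ABCDEF+Name"), so match on the suffix.
        string curFont = renderInfo.GetFont().PostscriptFontName;
        Vector baseline = renderInfo.GetBaseline().GetStartPoint();
        Vector ascent = renderInfo.GetAscentLine().GetStartPoint();
        // Approximate size: vertical distance from baseline to ascent line.
        float size = ascent[Vector.I2] - baseline[Vector.I2];

        if (curFont != null
            && curFont.EndsWith(fontName, StringComparison.Ordinal)
            && size >= minSize && size <= maxSize)
        {
            result.Append(renderInfo.GetText());
        }
    }

    public string GetResultantText() { return result.ToString(); }

    // Required by the interface; nothing to do for this rule.
    public void BeginTextBlock() { }
    public void EndTextBlock() { }
    public void RenderImage(ImageRenderInfo renderInfo) { }
}

public static class FontRuleDemo
{
    public static void Run()
    {
        PdfReader reader = new PdfReader("paper.pdf"); // placeholder path
        try
        {
            // Size ranges are guesses around the values reported above.
            string title = PdfTextExtractor.GetTextFromPage(
                reader, 1, new FontRuleExtractionStrategy("HelveticaNeue-LightExt", 16f, 18f));
            string authors = PdfTextExtractor.GetTextFromPage(
                reader, 1, new FontRuleExtractionStrategy("HelveticaNeue-Condensed", 9f, 11f));
            Console.WriteLine(title);
            Console.WriteLine(authors);
        }
        finally
        {
            reader.Close();
        }
    }
}
```

Each rule gets its own strategy instance and its own pass over the page, which keeps the rules independent at the cost of parsing the page more than once. The sketch also appends chunks as they arrive, so word spacing and line breaks may need extra handling for a given PDF.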