Extracting information from PDFs of research papers

Published 2024-08-13


双马尾 2024-08-20 05:22:38

I'm only allowed one link per posting so this is it:
pdfinfo Linux manual page

This might get the title and authors. Look at the bottom of the manual page, and there's a link to www.foolabs.com/xpdf where the open source for the program can be found, as well as binaries for various platforms.
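pdfinfo prints plain "Key: value" lines, so its output is trivial to post-process. A minimal Python sketch; the sample output below is invented — in practice you would capture real output, e.g. via subprocess.run(["pdfinfo", "paper.pdf"], capture_output=True):

```python
# Parse pdfinfo-style "Key: value" output into a dict.
# The sample string is an illustrative stand-in for a real pdfinfo run.
def parse_pdfinfo(output):
    info = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip():
            info[key.strip()] = value.strip()
    return info

sample = """Title:          An Example Paper
Author:         A. Author
Pages:          12"""

print(parse_pdfinfo(sample)["Title"])
```

Fields like CreationDate and Producer come back in the same dict; whether Title and Author are actually populated depends entirely on the PDF.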

To pull out bibliographic references, look at cb2bib:

cb2Bib is a free, open source, multiplatform application for rapidly extracting unformatted or unstandardized bibliographic references from email alerts, journal Web pages, and PDF files.

You might also want to check the discussion forums at www.zotero.org where this topic has been discussed.

夜光 2024-08-20 05:22:38

We ran a contest to solve this problem at Dev8D in London, Feb 2010 and we got a nice little GPL tool created as a result. We've not yet integrated it into our systems but it's there in the world.

https://code.google.com/p/pdfssa4met/

北方。的韩爷 2024-08-20 05:22:38

Might be a tad simplistic, but Googling "bibtex + paper title" usually gets you a formatted BibTeX entry from the ACM, Citeseer, or other such reference-tracking sites. Of course, this assumes the paper isn't from a non-computing journal :D

-- EDIT --

I have a feeling you won't find a custom solution for this, you might want to write to citation trackers such as citeseer, ACM and google scholar to get ideas for what they have done. There are tons of others and you might find their implementations are not closed source but not in a published form. There is tons of research material on the subject.

The research team I am part of has looked at such problems and we have come to the conclusion that hand written extraction algorithms or machine learning are the way to do it. Hand written algorithms are probably your best bet.

This is quite a hard problem due to the amount of variation possible. I suggest normalizing the PDFs to text (which you can get from any of the dozens of programmatic PDF libraries). You then need to implement custom text-scraping algorithms.

I would start backward from the end of the PDF and look at what sort of citation keys exist -- e.g., [1], [author-year], (author-year) -- and then try to parse the sentence following. You will probably have to write code to normalize the text you get from a library (removing extra whitespace and such). I would only look for citation keys as the first word of a line, and only in the last 10 pages per document -- the first word must have key delimiters, e.g., '[' or '('. If no keys can be found in 10 pages, ignore the PDF and flag it for human intervention.
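A minimal sketch of that backward scan in Python; the key regex and the stop rule are illustrative assumptions (e.g., a wrapped reference line would end the scan early), not a production parser:

```python
import re

# Match lines whose first token is a citation key like "[1]" or
# "(Author 2002)". The pattern and length bounds are heuristics.
KEY = re.compile(r"^\s*(\[[^\]]{1,40}\]|\([^)]{1,40}\))\s+(.*)")

def scrape_references(text):
    refs = []
    # Walk the normalized text backwards, since references sit at the end.
    for line in reversed(text.splitlines()):
        m = KEY.match(line)
        if m:
            refs.append((m.group(1), m.group(2)))
        elif refs:
            # First non-key line after collecting started: assume we
            # walked past the bibliography and stop.
            break
    refs.reverse()
    return refs

sample = """Some body text.
References
[1] Na, Lee, Cheong. Voronoi diagrams on the sphere. 2002.
[2] Another Author. Another paper. 2005.
"""
print(scrape_references(sample))
```

A real version would also limit the scan to the last 10 pages and fall back to flagging the document for human intervention when no keys match.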

You might want a library that you can further consult programmatically for formatting metadata within citations -- e.g., italics have a special meaning.

I think you might end up spending quite some time to get a working solution, followed by a continual process of tuning and adding to the scraping algorithms/engine.

谁许谁一生繁华 2024-08-20 05:22:38

In this case I would recommend TET from PDFLIB.

If you need to get a quick feel for what it can do, take a look at the TET Cookbook

This is not an open source solution, but it's currently the best option in my opinion. It's not platform-dependent, and it has a rich set of language bindings and commercial backing.

I would be happy if someone pointed me to an equivalent or better open source alternative.

To extract text you would use the TET_xxx() functions and to query metadata you can use the pcos_xxx() functions.

You can also use the command-line tool to generate an XML file containing all the information you need.

tet --tetml word file.pdf

There are examples on how to process TETML with XSLT in the TET Cookbook

What’s included in TETML?

TETML output is encoded in UTF-8 (on zSeries with USS or MVS: EBCDIC-UTF-8, see www.unicode.org/reports/tr16) and includes the following information:

  - general document information and metadata
  - text contents of each page (words or paragraphs)
  - glyph information (font name, size, coordinates)
  - structure information, e.g. tables
  - information about images placed on the page
  - resource information, i.e. fonts, colorspaces, and images
  - error messages if an exception occurred during PDF processing
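Since TETML is ordinary XML, downstream processing needs no special tooling beyond XSLT or any XML library. A Python sketch with ElementTree; note the snippet below is a simplified, invented stand-in -- real TETML uses its own (namespaced) element names, so check actual tet output before relying on this shape:

```python
import xml.etree.ElementTree as ET

# Illustrative, simplified TETML-like document. Element names and
# nesting are assumptions for this sketch, not the real TETML schema.
tetml = """<TET>
  <Document filename="file.pdf">
    <Pages>
      <Page number="1">
        <Content>
          <Word><Text>Hello</Text></Word>
          <Word><Text>world</Text></Word>
        </Content>
      </Page>
    </Pages>
  </Document>
</TET>"""

root = ET.fromstring(tetml)
# Collect the text of every word element, page by page.
words = [w.text for w in root.iter("Text")]
print(" ".join(words))
```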

无畏 2024-08-20 05:22:38

Another Java library to try would be PDFBox. PDFs are really designed to be viewed and printed, so you definitely want a library to do some of the heavy lifting for you. Even so, you might have to glue pieces of text back together to get the data you want extracted. Good luck!

自此以后,行同陌路 2024-08-20 05:22:38

Just found pdftk... it's amazing; it comes in binary distributions for Win/Lin/Mac as well as source.

In fact, I solved my other problem with it (look at my profile; I asked and then answered another PDF question... can't link due to the one-link limitation).

It can do pdf metadata extraction, for example, this will return the line containing the title:

 pdftk test.pdf dump_data output - | grep -A 1 "InfoKey: Title" | grep "InfoValue"

It can dump title, author, mod-date, and even bookmarks and page numbers (test pdf had bookmarks)... obviously a bit of work will be needed to properly grep the output, but I think this should fit your needs.

If your PDFs don't have metadata (i.e., no "Abstract" metadata), you can dump the text using a different tool like pdf2text and use some grep tricks like the above. If your PDFs aren't OCR'd, you have a much bigger problem, and ad hoc querying of the PDFs will be painfully slow (best to OCR them first).

Regardless, I would recommend you build an index of your documents instead of having each query scan the file metadata/text.
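The dump_data report is a flat list of InfoBegin/InfoKey/InfoValue lines, so pairing keys with values programmatically (e.g., when building such an index) is straightforward. A small Python sketch over an invented sample dump:

```python
# Pair pdftk dump_data's alternating "InfoKey: X" / "InfoValue: Y"
# lines into a dict. The sample dump is illustrative; real input would
# come from running `pdftk file.pdf dump_data output -`.
def parse_dump_data(dump):
    info, key = {}, None
    for line in dump.splitlines():
        if line.startswith("InfoKey: "):
            key = line[len("InfoKey: "):]
        elif line.startswith("InfoValue: ") and key is not None:
            info[key] = line[len("InfoValue: "):]
            key = None
    return info

sample = """InfoBegin
InfoKey: Title
InfoValue: An Example Paper
InfoBegin
InfoKey: Author
InfoValue: A. Author
NumberOfPages: 12"""

print(parse_dump_data(sample))
```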

自在安然 2024-08-20 05:22:38

Take a look at iText. It is a Java library that will let you read PDFs. You will still face the problem of finding the right data, but the library will provide formatting and layout information that might be usable to infer purpose.

路还长,别太狂 2024-08-20 05:22:38

PyPDF might be of help. It provides an extensive API for reading and writing the content of (unencrypted) PDF files, and it's written in Python, an easy language.

記憶穿過時間隧道 2024-08-20 05:22:38

Have a look at this research paper - Accurate Information Extraction from Research Papers using Conditional Random Fields

You might want to use an open-source package like Stanford NER to get started on CRFs.

Or perhaps, you could try importing them (the research papers) to Mendeley. Apparently, it should extract the necessary information for you.

Hope this helps.

入画浅相思 2024-08-20 05:22:38

Here is what I do using linux and cb2bib.

  1. Open up cb2bib and make sure that the clipboard connection is ON and that your reference database is loaded
  2. Find your paper on Google Scholar
  3. Click 'import to bibtex' underneath the paper
  4. Select (highlight) everything on the next page (i.e., the BibTeX code)
  5. It should now appear formatted in cb2bib
  6. Optionally press network search (the globe icon) to add additional info
  7. Press save in cb2bib to add the paper to your reference database

Repeat this for all the papers. I think in the absence of a method that reliably extracts metadata from PDFs, this is the easiest solution I found.

云巢 2024-08-20 05:22:38

I recommend gscholar in combination with pdftotext.

Although PDF provides metadata, it is seldom populated with correct content. Often "None" or "Adobe-Photoshop" or some other dumb string sits in place of the title field, for example. That is why none of the above tools can reliably derive correct information from PDFs: the title might be anywhere in the document. Another example: many conference-proceedings papers also carry the title of the conference, or the names of the editors, which confuses automatic extraction tools. The results are then dead wrong when you are interested in the real authors of the paper.

So I suggest a semi-automatic approach involving Google Scholar.

  1. Render the PDF to text so you can extract the author and title.
  2. Copy and paste some of this info and query Google Scholar. To automate this, I employ the cool Python script gscholar.py.
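Step 1's "extract author and title" can be approximated with a crude filter over the pdftotext output: drop URL-like and number-heavy header lines and keep the first couple of prose-looking ones. A Python sketch (the 0.8 letters-and-spaces threshold and the two-line cutoff are arbitrary illustrations):

```python
# Heuristic: from pdftotext output, keep lines that look like prose
# (no URLs, mostly letters/spaces) and join the first n of them --
# usually title plus authors -- into a query string for gscholar.py.
def query_candidates(text, n=2):
    lines = [l.strip() for l in text.splitlines() if l.strip()]
    prose = [l for l in lines
             if "www." not in l and "http" not in l
             and sum(c.isalpha() or c.isspace() for c in l) / len(l) > 0.8]
    return " ".join(prose[:n])

# First lines of the example paper's pdftotext output.
sample = """Computational Geometry 23 (2002) 183\u2013194
www.elsevier.com/locate/comgeo

Voronoi diagrams on the sphere \u2729
Hyeon-Suk Na a , Chung-Nim Lee a , Otfried Cheong b,\u2217
"""
print(query_candidates(sample))
```

The journal/volume header is rejected by the letter-ratio test (digits and parentheses dominate), so the query ends up being the title plus the author line, much like the hand-built query in the session below.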

So in real life this is what I do:

me@box> pdftotext 10.1.1.90.711.pdf - | head
Computational Geometry 23 (2002) 183–194
www.elsevier.com/locate/comgeo

Voronoi diagrams on the sphere ✩
Hyeon-Suk Na a , Chung-Nim Lee a , Otfried Cheong b,∗
a Department of Mathematics, Pohang University of Science and Technology, South Korea
b Institute of Information and Computing Sciences, Utrecht University, P.O. Box 80.089, 3508 TB Utrecht, The Netherlands

Received 28 June 2001; received in revised form 6 September 2001; accepted 12 February 2002
Communicated by J.-R. Sack
me@box> gscholar.py "Voronoi diagrams on the sphere Hyeon-Suk" 
@article{na2002voronoi,
  title={Voronoi diagrams on the sphere},
  author={Na, Hyeon-Suk and Lee, Chung-Nim and Cheong, Otfried},
  journal={Computational Geometry},
  volume={23},
  number={2},
  pages={183--194},
  year={2002},
  publisher={Elsevier}
}

EDIT: Be careful, you might encounter captchas. Another great script is bibfetch.
