Python library to extract "epub" information

Posted 2024-09-07 03:22:28

Comments (4)

故事↓在人 2024-09-14 03:22:28

An .epub file is a zip archive. It contains a META-INF directory holding a file named container.xml, which points to another file, usually named Content.opf, that indexes all the other files making up the e-book (summary based on http://www.jedisaber.com/eBooks/tutorial.asp; full spec at http://www.idpf.org/2007/opf/opf2.0/download/).
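
The lookup chain described above can be sketched with only the standard library. This is a minimal illustration, not a full reader: `list_epub_manifest` is a hypothetical helper name (not part of any library), and it assumes the OPF file uses the OPF 2.0 namespace with a `manifest` element containing `item` entries.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespaces for container.xml and the OPF package document
NS = {
    "n": "urn:oasis:names:tc:opendocument:xmlns:container",
    "pkg": "http://www.idpf.org/2007/opf",
}


def list_epub_manifest(fname):
    """Return (opf_path, [href, ...]) for the files indexed by the OPF.

    Hypothetical helper for illustration only.
    """
    with zipfile.ZipFile(fname) as zf:
        # container.xml names the OPF file in its rootfile element
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        rootfile = container.find("n:rootfiles/n:rootfile", NS)
        opf_path = rootfile.get("full-path")

        # the OPF manifest lists the files that make up the book
        package = ET.fromstring(zf.read(opf_path))
        items = package.findall("pkg:manifest/pkg:item", NS)
        return opf_path, [item.get("href") for item in items]
```

The returned hrefs are normally relative to the directory that holds the OPF file, not to the root of the archive, so they may need to be joined with that directory before being read from the zip.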

The following Python code will extract the basic meta-information from an .epub file and return it as a dict.

import zipfile
from lxml import etree

def epub_info(fname):
    """Return basic Dublin Core metadata from an .epub file as a dict."""

    def xpath(element, path):
        # return the first match; raises IndexError if nothing matches
        return element.xpath(
            path,
            namespaces={
                "n": "urn:oasis:names:tc:opendocument:xmlns:container",
                "pkg": "http://www.idpf.org/2007/opf",
                "dc": "http://purl.org/dc/elements/1.1/",
            },
        )[0]

    # open the .epub file (a zip archive) and close it when done
    with zipfile.ZipFile(fname) as zip_content:
        # find the contents metafile
        cfname = xpath(
            etree.fromstring(zip_content.read("META-INF/container.xml")),
            "n:rootfiles/n:rootfile/@full-path",
        )

        # grab the metadata block from the contents metafile
        metadata = xpath(
            etree.fromstring(zip_content.read(cfname)), "/pkg:package/pkg:metadata"
        )

    # repackage the data
    return {
        s: xpath(metadata, f"dc:{s}/text()")
        for s in ("title", "language", "creator", "date", "identifier")
    }

Sample output:

{
    'date': '2009-12-26T17:03:31',
    'identifier': '25f96ff0-7004-4bb0-b1f2-d511ca4b2756',
    'creator': 'John Grisham',
    'language': 'UND',
    'title': 'Ford County'
}
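
One caveat: because the `xpath` helper in `epub_info` takes element `[0]` of the result, a book that lacks any of the five fields raises an IndexError. A possible workaround is sketched below. It is an assumption-laden variant, not the code above: `epub_info_tolerant` is a hypothetical name, and it swaps lxml for the standard library's `xml.etree.ElementTree`, mapping absent fields to None.

```python
import zipfile
import xml.etree.ElementTree as ET

NS = {
    "n": "urn:oasis:names:tc:opendocument:xmlns:container",
    "pkg": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
}
FIELDS = ("title", "language", "creator", "date", "identifier")


def epub_info_tolerant(fname):
    """Hypothetical variant of epub_info: absent fields become None."""
    with zipfile.ZipFile(fname) as zf:
        # locate the OPF file via container.xml, then parse it
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        opf_path = container.find("n:rootfiles/n:rootfile", NS).get("full-path")
        package = ET.fromstring(zf.read(opf_path))

    metadata = package.find("pkg:metadata", NS)
    if metadata is None:
        return {name: None for name in FIELDS}

    # findtext returns the default (None) when the element is missing
    return {
        name: metadata.findtext(f"dc:{name}", default=None, namespaces=NS)
        for name in FIELDS
    }
```

This still returns only the first occurrence of each field, so a book with several `dc:creator` entries would report just one of them.
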

梦初启 2024-09-14 03:22:28

Something like epub-tools, for example? But that's mostly about writing the epub format (from various possible sources), as is epubtools (similar spelling, different project). For reading it, I'd try the companion project threepress, a Django app for showing epub books in a browser. I haven't looked at that code, but I imagine that in order to show the book it must surely first be able to read it ;-).

南汐寒笙箫 2024-09-14 03:22:28

Check out the epub module. It looks like an easy option.

攀登最高峰 2024-09-14 03:22:28

I wound up here after looking for something similar and was inspired by Mr. Bothwell's code snippet to start my own project. If anyone is interested ... http://epubzilla.odeegan.com/
