Is there a master list of the tags in MHTML files and their meanings?

Posted 2024-12-05 23:58:44

I am trying to read and extract data from .xls files that are really Single File Web Pages. Each file carries this notice:

This document is a Single File Web Page, also known as a Web Archive file.  

I am trying to figure out the meaning of all of the tags so that I can be sure I am parsing them correctly with lxml.

For example, here is one of the tags:

 <th class=3Dtl colspan=3D1 rowspan=3D2

While I am having success with the few files I am toying with, I want to find out whether I am making assumptions that will come back to haunt me later. A list of these tags and their meanings would therefore be great.

Comments (1)

少年亿悲伤 2024-12-12 23:58:44

If the MHTML is generated from Microsoft Word, it's probably a combination of WordprocessingML and HTML4 tags.
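As a side note on the tag quoted in the question: the `=3D` sequences are quoted-printable escapes for `=`, because MHTML is a MIME container and its HTML parts are usually transfer-encoded. The following is only a minimal sketch, assuming the file is a well-formed MIME multipart message; `SAMPLE_MHTML` and `extract_html_parts` are made-up names for illustration, and the sample payload is invented rather than taken from a real export. It decodes the HTML parts with Python's standard `email` module so that a parser such as lxml sees ordinary HTML attributes.

```python
import email
from email import policy

# Hypothetical, hand-written MHTML payload used only for this illustration.
SAMPLE_MHTML = (
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/related; boundary="----=_NextPart_demo"\r\n'
    b"\r\n"
    b"------=_NextPart_demo\r\n"
    b"Content-Location: file:///C:/demo/sheet001.htm\r\n"
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b'Content-Type: text/html; charset="us-ascii"\r\n'
    b"\r\n"
    b"<table><tr><th class=3Dtl colspan=3D1 rowspan=3D2>Total</th></tr></table>\r\n"
    b"------=_NextPart_demo--\r\n"
)


def extract_html_parts(raw_bytes):
    """Return the decoded text of every text/html part in an MHTML payload."""
    message = email.message_from_bytes(raw_bytes, policy=policy.default)
    parts = []
    for part in message.walk():
        if part.get_content_type() == "text/html":
            # get_content() undoes the transfer encoding (here quoted-printable)
            # and decodes the bytes using the part's declared charset.
            parts.append(part.get_content())
    return parts


html_parts = extract_html_parts(SAMPLE_MHTML)
print(html_parts[0])
```

Each decoded string could then be handed to an HTML parser (for example `lxml.html.fromstring`), at which point `colspan` and `rowspan` are plain HTML table attributes.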

The top-level elements in a WordprocessingML document are:

SmartTagType element describes a Smart Tag type used in the document.
DocumentProperties element contains Office Document Properties.
CustomDocumentProperties element contains Custom Office Document Properties.
schemaLibrary element defines a collection of schemas that comprise a document's schema library.
fonts element (wordDocumentElt complexType) contains font information.
frameset element (wordDocumentElt complexType) contains HTML Frameset definitions.
styles element (wordDocumentElt complexType) contains style definitions.
divs element contains HTML DIV information.
shapeDefaults element contains drawing defaults.
docOleData element contains supplemental data containing storages for OLE objects.
docSuppData element contains supplemental data containing toolbar customizations, envelope data, and the Microsoft Visual Basic project.
docPr element contains document options.
shapeDefaults element contains the wrapper representing the shape defaults.
bgPict element contains background picture information.
body element contains the document body.

However, the simplest WordprocessingML document consists of just five elements (and a single namespace). The five elements are:

wordDocument element: The root element for a WordprocessingML document.
body element: The container for the displayable text.
p element: A paragraph.
r element: A contiguous set of WordprocessingML components with a consistent set of properties.
t element: A piece of text.
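
To make the five-element structure concrete, here is a small sketch of such a document and of reading its text with Python's standard `xml.etree.ElementTree`. It assumes the Word 2003 `wordml` namespace URI shown in `W_NS`; `MINIMAL_DOC` and `paragraph_texts` are illustrative names, not part of any library.

```python
import xml.etree.ElementTree as ET

# Assumption: the Word 2003 WordprocessingML namespace.
W_NS = "http://schemas.microsoft.com/office/word/2003/wordml"

# wordDocument > body > p > r > t, plus the single namespace declaration.
MINIMAL_DOC = f"""<?xml version="1.0"?>
<w:wordDocument xmlns:w="{W_NS}">
  <w:body>
    <w:p>
      <w:r>
        <w:t>Hello, World.</w:t>
      </w:r>
    </w:p>
  </w:body>
</w:wordDocument>"""


def paragraph_texts(xml_text):
    """Return one string per w:p, joining the text of its w:r/w:t children."""
    root = ET.fromstring(xml_text)
    ns = {"w": W_NS}
    paragraphs = []
    for p in root.findall("./w:body/w:p", ns):
        runs = [t.text or "" for t in p.findall("./w:r/w:t", ns)]
        paragraphs.append("".join(runs))
    return paragraphs


print(paragraph_texts(MINIMAL_DOC))
```

The same namespace-aware lookups work in lxml, which accepts a prefix-to-URI mapping in `findall` and `xpath` in a similar way.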