I have an HTML file (encoded in UTF-8). I open it with codecs.open(). The file structure is:
<html>
// header
<body>
// some text
<table>
// some rows with cells here
// some cells contain tables
</table>
// maybe some text here
<table>
// a form and other stuff
</table>
// probably some more text
</body></html>
I need to retrieve only the first table (discarding the one with the form), omitting all input before the first <table> and after the corresponding </table>. Some cells also contain paragraphs, bold text and scripts. There is no more than one nested table per row of the main table.
How can I extract it to get a list of rows, where each element holds the plain (unicode string) cell data and a list of rows for each nested table? There is no more than one level of nesting.
I tried HTMLParse, PyParse and the re module, but couldn't get it working.
I'm quite new to Python.
Comments (3)
You may like lxml. I'm not sure I really understood what you want to do with that structure, but maybe this example will help...
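The example that originally accompanied this answer is not preserved here. What follows is only a minimal sketch of one way to do it with lxml, assuming lxml is installed. The names `SAMPLE`, `table_rows`, `cell_text`, `plain_rows` and `extract_first_table` are made up for illustration, and the sample markup is a guess at the structure described in the question. Each cell of the main table becomes a `(text, nested_rows)` tuple, where `nested_rows` is `None` when the cell holds no nested table.

```python
import lxml.html

# Hypothetical sample that mimics the structure described in the question.
SAMPLE = """
<html><head><title>t</title></head><body>
intro text
<table>
  <tr><td><p>plain <b>cell</b></p><script>var x = 1;</script></td><td>second</td></tr>
  <tr><td>outer<table><tr><td>n1</td><td>n2</td></tr></table></td></tr>
</table>
text between
<table><tr><td><form><input name="q"></form></td></tr></table>
</body></html>
"""


def table_rows(table):
    # Direct rows only, so rows of a nested table are not mixed in.
    return table.xpath("./tr | ./tbody/tr | ./thead/tr | ./tfoot/tr")


def cell_text(cell):
    # Keep text that sits at the cell's own table depth, which skips the
    # contents of nested tables; script and style text is skipped as well.
    depth = cell.xpath("count(ancestor::table)")
    pieces = cell.xpath(
        ".//text()[not(ancestor::script) and not(ancestor::style)"
        " and count(ancestor::table) = $depth]",
        depth=depth,
    )
    # Join the pieces and collapse runs of whitespace.
    return " ".join(" ".join(pieces).split())


def plain_rows(table):
    # Rows of a table as lists of plain strings.
    return [[cell_text(c) for c in tr.xpath("./td | ./th")]
            for tr in table_rows(table)]


def extract_first_table(html):
    doc = lxml.html.fromstring(html)
    tables = doc.xpath("(//table)[1]")  # first table in document order
    if not tables:
        return []
    rows = []
    for tr in table_rows(tables[0]):
        row = []
        for cell in tr.xpath("./td | ./th"):
            # Assumption from the question: at most one level of nesting.
            nested = cell.xpath(".//table")
            row.append((cell_text(cell),
                        plain_rows(nested[0]) if nested else None))
        rows.append(row)
    return rows


rows = extract_first_table(SAMPLE)
```

The second table (the one with the form) is never visited, because only the first `<table>` in document order is selected.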
If the HTML is well-formed you can parse it into a DOM tree and use XPath to extract the table you want. I usually use lxml for parsing XML, and it can parse HTML as well.
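The XPath for pulling out the first table would be "(//table)[1]". The parentheses matter: a bare "//table[1]" matches every table that is the first table child of its parent, which also picks up nested tables.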
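A quick way to see how the positional predicate binds is to run both forms against a small document. This is only a sketch, assuming lxml is installed; the markup and the `id` values are invented for illustration.

```python
import lxml.html

# Hypothetical document: a main table with one nested table, then a second table.
doc = lxml.html.fromstring(
    "<html><body>"
    "<table id='main'><tr><td>"
    "<table id='nested'><tr><td>x</td></tr></table>"
    "</td></tr></table>"
    "<table id='form'><tr><td>y</td></tr></table>"
    "</body></html>"
)

# [1] applies per parent: every table that is the first table child of its parent.
per_parent = [t.get("id") for t in doc.xpath("//table[1]")]

# Parenthesised: collect all tables first, then take the first in document order.
first_only = [t.get("id") for t in doc.xpath("(//table)[1]")]

print(per_parent)  # ['main', 'nested']
print(first_only)  # ['main']
```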
Try Beautiful Soup.
In principle you need to use a real parser (which Beautiful Soup is); regexes cannot deal with nested elements, for computer-sciencey reasons (finite state machines can't parse context-free grammars, IIRC).
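To make the nesting problem concrete, here is a small sketch using only the standard `re` module. The `HTML` and `FIRST_TABLE` strings are invented for illustration. Neither the lazy nor the greedy pattern recovers the first table when it contains a nested one.

```python
import re

# Hypothetical input: the first table contains a nested table, a second table follows.
FIRST_TABLE = (
    "<table><tr><td>outer"
    "<table><tr><td>inner</td></tr></table>"
    "</td></tr></table>"
)
HTML = FIRST_TABLE + "<table><tr><td>form</td></tr></table>"

# Lazy match: stops at the first </table>, which closes the nested table.
lazy = re.search(r"<table>.*?</table>", HTML, re.S).group(0)

# Greedy match: runs on to the last </table>, swallowing the second table too.
greedy = re.search(r"<table>.*</table>", HTML, re.S).group(0)

print(lazy == FIRST_TABLE)    # False: cut off early
print(greedy == FIRST_TABLE)  # False: too much
```

Picking the right closing tag requires counting how many tables are open at each point, which is exactly the bookkeeping a parser does for you.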