How to parse an HTML file containing a table with Python
I have an HTML file containing a table (it is large, so only a sample is given). I want to retrieve the values in the table. I tried the HTMLParser library from Python.
I started coding as shown below. Then I found that the attribute name "class" is the same as a reserved keyword in Python, so it gives me an error.
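    class MyHTMLParser(HTMLParser):
        def handle_starttag(self, tag, attrs):
            if tag == 'tr':
                for class in attrs:  # SyntaxError: "class" is a reserved keyword
                    if class == 'Table_row'  # unfinished: no colon and no body yet

    p = MyHTMLParser()
    p.feed(ht)  # ht holds the HTML text shown below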
HTML code for table
<table class="Table_rows" cellspacing="0" rules="all" border="1" id="MyDataGrid" style="width:700px;border-collapse:collapse;">
<tr class="Table_Heading">
<td>STATION CODE</td><td>STATION NAME</td><td>SCHEDULED ARRIVAL</td><td>SCHEDULED DEPARTURE</td><td>ACTUAL/ EXPECTED ARRIVAL</td><td>ACTUAL/ EXPECTED DEPARTURE</td>
</tr><tr class="Table_row">
<td>TVC </td><td style="width:160px;">ORIGON</td><td>Starting Station </td><td>05:00, 07 May 2011</td><td>Starting Station</td><td>05:00, 07 May 2011</td>
</tr><tr class="alternat_table_row">
<td>TVP </td><td>NEY YORK</td><td>05:04, 07 May 2011</td><td>05:05, 07 May 2011</td><td>05:04, 07 May 2011</td><td>05:05, 07 May 2011</td>
</tr>
</table>
UPDATE
How could I get data between the tags?
Comments (3)
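Note that the documentation of the handle_starttag method states that attrs is a list of (name, value) pairs containing the attributes found inside the tag's <> brackets. So each item in attrs is a pair such as ('class', 'Table_row'), and "class" only ever appears as a string. You're probably looking for a loop that unpacks each pair into two ordinary variable names (for example, name and value) and compares those, rather than using the reserved word class as the loop variable.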
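The code that originally accompanied this answer was not preserved on this page. As a stand-in, here is a minimal sketch of the same idea using Python 3's html.parser module. The class name TableRowParser, the wanted parameter, and the shortened two-column sample in ht are assumptions made for illustration, not part of the original answer. It also uses handle_data, which is the hook that receives the text between tags.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the cell text of every <tr> whose class is in `wanted`."""

    def __init__(self, wanted=("Table_row", "alternat_table_row")):
        super().__init__()
        self.wanted = set(wanted)
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, so "class" is just a string key
        if tag == "tr" and dict(attrs).get("class") in self.wanted:
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Receives the text between tags; keep it only while inside a wanted cell
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Shortened version of the table from the question (two columns only)
ht = """
<table class="Table_rows" id="MyDataGrid">
<tr class="Table_Heading">
<td>STATION CODE</td><td>STATION NAME</td>
</tr><tr class="Table_row">
<td>TVC </td><td style="width:160px;">ORIGON</td>
</tr><tr class="alternat_table_row">
<td>TVP </td><td>NEY YORK</td>
</tr>
</table>
"""

parser = TableRowParser()
parser.feed(ht)
parser.close()
for row in parser.rows:
    print(row)
```

With the shortened sample this prints one list per data row and skips the Table_Heading row. Passing a different tuple as wanted changes which row classes are collected.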
P.S. I also recommend BeautifulSoup for parsing HTML with Python.
You can do it like this with BeautifulSoup.
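The code from this answer did not survive on this page either. The following is only a sketch of how it might look with the current bs4 package (installed with pip install beautifulsoup4), which is an assumption, since the original answer may have targeted an older BeautifulSoup release. The shortened two-column sample in ht and the variable names are likewise illustrative.

```python
from bs4 import BeautifulSoup  # third-party package: beautifulsoup4

# Shortened version of the table from the question (two columns only)
ht = """
<table class="Table_rows" id="MyDataGrid">
<tr class="Table_Heading">
<td>STATION CODE</td><td>STATION NAME</td>
</tr><tr class="Table_row">
<td>TVC </td><td style="width:160px;">ORIGON</td>
</tr><tr class="alternat_table_row">
<td>TVP </td><td>NEY YORK</td>
</tr>
</table>
"""

# "html.parser" selects the parser from the standard library, so no extra dependency
soup = BeautifulSoup(ht, "html.parser")
table = soup.find("table", id="MyDataGrid")

wanted = {"Table_row", "alternat_table_row"}
rows = []
for tr in table.find_all("tr"):
    # The class attribute is multi-valued, so bs4 returns it as a list of strings
    if wanted.intersection(tr.get("class", [])):
        rows.append([td.get_text(strip=True) for td in tr.find_all("td")])

for row in rows:
    print(row)
```

Here "class" is only a dictionary-style key, so the keyword clash from the question never arises, and get_text returns the text between the td tags directly.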
I would highly recommend using the BeautifulSoup library. It handles even broken HTML with ease.
http://www.crummy.com/software/BeautifulSoup/