BeautifulSoup or regex: HTML table to a data structure?
I've got an HTML table that I'm trying to parse information from. However, some of the cells span multiple rows/columns, so what I would like to do is use something like BeautifulSoup to parse the table into some type of Python structure. I'm thinking of just using a list of lists, so I would turn something like
<tr>
<td>1,1</td>
<td>1,2</td>
</tr>
<tr>
<td>2,1</td>
<td>2,2</td>
</tr>
into
[['1,1', '1,2'],
['2,1', '2,2']]
That should (I think) be fairly straightforward. However, there are some slight complications because some of the cells span multiple rows/columns. Plus, there's a lot of completely unnecessary information:
<td ondblclick="DoAdd('/student_center/sc_all_rooms/d05/09/2010/editformnew?display=W&style=L&positioning=A&adddirect=yes&accessid=CreateNewEdit&filterblock=N&popeditform=yes&returncalendar=student_center/sc_all_rooms')"
class="listdefaultmonthbg"
style="cursor:crosshair;"
width="5%"
nowrap="1"
rowspan="1">
<a class="listdatelink"
href="/student_center/sc_all_rooms/d05/09/2010/edit?style=L&display=W&positioning=A&filterblock=N&adddirect=yes&accessid=CreateNewEdit">Sep 5</a>
</td>
And the real markup is even worse. All I really need out of there is:
<td rowspan="1">Sep 5</td>
Two rows later, there is a <td> with a rowspan of 17. For multi-row spans I was thinking that something like this:
<tr>
<td rowspan="2">Sep 5</td>
<td>Some event</td>
</tr>
<tr>
<td>Some other event</td>
</tr>
would end up like this:
[["Sep 5", "Some event"],
[None, "Some other event"]]
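One way to get that shape, as a minimal sketch rather than a definitive implementation: walk the rows in order and remember which columns are still covered by a rowspan from an earlier row. The function name table_to_grid is made up for illustration, the code uses the bs4 package (where findAll is spelled find_all), and it assumes a well-formed table with no nested tables.

```python
from bs4 import BeautifulSoup


def table_to_grid(table):
    """Return a list of rows; cells covered by a span are filled with None."""
    grid = []
    pending = {}  # column index -> number of upcoming rows still covered by a rowspan
    for tr in table.find_all("tr"):
        row = []
        col = 0
        for cell in tr.find_all(["td", "th"]):
            # Skip columns occupied by a rowspan that started in an earlier row.
            while pending.get(col, 0) > 0:
                row.append(None)
                pending[col] -= 1
                col += 1
            text = cell.get_text(strip=True)
            rowspan = int(cell.get("rowspan", 1))
            colspan = int(cell.get("colspan", 1))
            for offset in range(colspan):
                # Only the first column of a colspan carries the text.
                row.append(text if offset == 0 else None)
                if rowspan > 1:
                    pending[col] = rowspan - 1
                col += 1
        # Columns at the end of the row may also be covered by an earlier rowspan.
        while pending.get(col, 0) > 0:
            row.append(None)
            pending[col] -= 1
            col += 1
        grid.append(row)
    return grid


html = """
<table>
<tr><td rowspan="2">Sep 5</td><td>Some event</td></tr>
<tr><td>Some other event</td></tr>
</table>
"""
table = BeautifulSoup(html, "html.parser").find("table")
grid = table_to_grid(table)
print(grid)
```

A rowspan of 17 would be handled the same way: the first column of the next 16 rows comes out as None.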
There are multiple tables on the page, and I can already find the one I want; I'm just not sure how to parse out the information I need. I know I can use BeautifulSoup's renderContents, but in some cases there are link tags that I need to get rid of (while keeping the text).
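For the link tags, a rough sketch of two options in bs4: get_text() returns only the text of a cell, and unwrap() removes a tag while leaving its contents in place. The cell below is a shortened stand-in for the one shown earlier, with a placeholder href, so treat it as an assumption about the real markup.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the real cell; the href is a placeholder.
html = """<td class="listdefaultmonthbg" style="cursor:crosshair;" width="5%" nowrap="1" rowspan="1">
<a class="listdatelink" href="#">Sep 5</a>
</td>"""

soup = BeautifulSoup(html, "html.parser")
cell = soup.find("td")

# Option 1: take only the text and ignore all markup inside the cell.
text = cell.get_text(strip=True)
print(text)

# Option 2: clean the cell in place.
for link in cell.find_all("a"):
    link.unwrap()  # drop the <a> tag but keep its text
# Keep only the attributes that matter for the table layout.
cell.attrs = {k: v for k, v in cell.attrs.items() if k in ("rowspan", "colspan")}
print(cell.attrs)
```

If only the text is needed for the list of lists, option 1 is enough on its own.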
I was thinking of a process something like this:
- Find table
- Count rows in table (len(table.findAll('tr'))?)
- Create list
- Parse table into list (BeautifulSoup syntax???)
- ???
- Profit! (Well, it's a purely internal program, so not really... )
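The first four steps above could be sketched roughly like this for the simple case with no spans. This assumes the bs4 package, where findAll is spelled find_all, and uses the small 2x2 table from the start of the question as sample input.

```python
from bs4 import BeautifulSoup

html = """
<table>
<tr><td>1,1</td><td>1,2</td></tr>
<tr><td>2,1</td><td>2,2</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Step 1: find the table (here simply the first one on the page).
table = soup.find("table")

# Step 2: count the rows.
trs = table.find_all("tr")
row_count = len(trs)

# Steps 3 and 4: create the list and fill it with the text of each cell.
rows = []
for tr in trs:
    rows.append([cell.get_text(strip=True) for cell in tr.find_all(["td", "th"])])

print(row_count)
print(rows)
```

Counting the rows up front is not strictly needed, since the list grows as the rows are visited.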
Comments (2)
There was a recent discussion in the Python group on LinkedIn about a similar issue, and apparently lxml is the most recommended Pythonic parser for HTML pages.
http://www.linkedin.com/groupItem?view=&gid=25827&type=member&item=27735259&qid=d2948a0e-6c0c-4256-851b-5e7007859553&goback=.gmp_25827
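For comparison, here is a minimal sketch of the same kind of extraction with lxml. It assumes the lxml package is installed, and the sample markup is invented for illustration; text_content() returns the text of a cell with any nested tags such as links removed.

```python
import lxml.html

# Invented sample page; the real table will have far more attributes.
html = """
<html><body>
<table>
<tr><td><a href="#">Sep 5</a></td><td>Some event</td></tr>
<tr><td>Sep 6</td><td>Some other event</td></tr>
</table>
</body></html>
"""

doc = lxml.html.fromstring(html)
table = doc.xpath("//table")[0]  # first table in the document

rows = [
    [cell.text_content().strip() for cell in tr.xpath("./td|./th")]
    for tr in table.xpath(".//tr")
]
print(rows)
```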
You'll probably need to identify the table by some attribute, such as its id or name.
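As a small sketch of what that might look like with bs4: the id "calendar" and the class "listtable" below are hypothetical, so substitute whatever the real page uses.

```python
from bs4 import BeautifulSoup

# Hypothetical page with two tables; only the second one is wanted.
html = """
<table id="nav"><tr><td>Home</td></tr></table>
<table id="calendar" class="listtable"><tr><td>Sep 5</td></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

by_id = soup.find("table", id="calendar")                 # match on the id attribute
by_class = soup.find("table", class_="listtable")         # match on a CSS class
by_attrs = soup.find("table", attrs={"id": "calendar"})   # match on arbitrary attributes

print(by_id.find("td").get_text(strip=True))
```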
Edit: What should the program do if there are multiple links?
Ex: