BeautifulSoup 或正则表达式 HTML 表到数据结构?

发布于 2024-09-24 01:46:04 字数 1881 浏览 1 评论 0原文

我有一个 HTML 表,我正在尝试从中解析信息。然而,有些表跨越多行/列,所以我想做的是使用 BeautifulSoup 之类的东西将表解析为某种类型的 Python 结构。我考虑只使用列表列表,这样我就会把类似的东西变成

<tr>
  <td>1,1</td>
  <td>1,2</td>
</tr>
<tr>
  <td>2,1</td>
  <td>2,2</td>
</tr>

[['1,1', '1,2'],
 ['2,1', '2,2']]

(认为)应该相当简单的东西。但是,由于某些单元格跨越多行/列,因此存在一些轻微的复杂性。另外,还有很多完全不必要的信息:

    <td ondblclick="DoAdd('/student_center/sc_all_rooms/d05/09/2010/editformnew?display=W&amp;style=L&amp;positioning=A&amp;adddirect=yes&amp;accessid=CreateNewEdit&amp;filterblock=N&amp;popeditform=yes&amp;returncalendar=student_center/sc_all_rooms')"
     class="listdefaultmonthbg" 
     style="cursor:crosshair;" 
     width="5%" 
     nowrap="1" 
     rowspan="1">
       <a class="listdatelink" 
          href="/student_center/sc_all_rooms/d05/09/2010/edit?style=L&amp;display=W&amp;positioning=A&amp;filterblock=N&amp;adddirect=yes&amp;accessid=CreateNewEdit">Sep 5</a>
    </td>

代码的实际情况甚至更糟。我真正需要的是:

<td rowspan="1">Sep 5</td>

两行之后,有一个行跨度为 17。对于多行跨度,我在想这样的事情:

<tr>
  <td rowspan="2">Sep 5</td>
  <td>Some event</td>
</tr>
<tr>
  <td>Some other event</td>
</tr>

结果会是这样:

[["Sep 5", "Some event"],
 [None, "Some other event"]]

页面上有多个表,并且我已经可以找到我想要的信息了,我只是不知道如何解析出我需要的信息。我知道我可以使用 BeautfulSoup 来“RenderContents”,但在某些情况下,我需要删除一些链接标签(同时保留文本)。

我正在考虑这样的过程:

  1. 查找表
  2. 计算表中的行数(len(table.findAll('tr'))?)
  3. 创建列表
  4. 将表解析为列表(BeautifulSoup语法???)
  5. ???
  6. 利润! (嗯,这是一个纯粹的内部程序,所以并不是真的......)

I've got an HTML table that I'm trying to parse the information from. However, some of the tables span multiple rows/columns, so what I would like to do is use something like BeautifulSoup to parse the table into some type of Python structure. I'm thinking of just using a list of lists so I would turn something like

<tr>
  <td>1,1</td>
  <td>1,2</td>
</tr>
<tr>
  <td>2,1</td>
  <td>2,2</td>
</tr>

into

[['1,1', '1,2'],
 ['2,1', '2,2']]

Which I (think) should be fairly straightforward. However, there are some slight complications because some of the cells span multiple rows/cols. Plus there's a lot of completely unnecessary information:

    <td ondblclick="DoAdd('/student_center/sc_all_rooms/d05/09/2010/editformnew?display=W&style=L&positioning=A&adddirect=yes&accessid=CreateNewEdit&filterblock=N&popeditform=yes&returncalendar=student_center/sc_all_rooms')"
     class="listdefaultmonthbg" 
     style="cursor:crosshair;" 
     width="5%" 
     nowrap="1" 
     rowspan="1">
       <a class="listdatelink" 
          href="/student_center/sc_all_rooms/d05/09/2010/edit?style=L&display=W&positioning=A&filterblock=N&adddirect=yes&accessid=CreateNewEdit">Sep 5</a>
    </td>

And what the code really looks like is even worse. All I really need out of there is:

<td rowspan="1">Sep 5</td>

Two rows later, there is a with a rowspan of 17. For multi-row spans I was thinking something like this:

<tr>
  <td rowspan="2">Sep 5</td>
  <td>Some event</td>
</tr>
<tr>
  <td>Some other event</td>
</tr>

would end out like this:

[["Sep 5", "Some event"],
 [None, "Some other event"]]

There are multiple tables on the page, and I can find the one I want already, I'm just not sure how to parse out the information I need. I know I can use BeautfulSoup to "RenderContents", but in some cases there are link tags that I need to get rid of (while keeping the text).

I was thinking of a process something like this:

  1. Find table
  2. Count rows in tables (len(table.findAll('tr'))?)
  3. Create list
  4. Parse table into list (BeautifulSoup syntax???)
  5. ???
  6. Profit! (Well, it's a purely internal program, so not really... )

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

漫雪独思 2024-10-01 01:46:04

linkedin 上的 python 小组最近就类似的问题进行了讨论,显然 lxml 是最值得推荐的 html 页面的 pythonic 解析器。

http://www.linkedin.com/groupItem?view=&gid=25827&type=member&item=27735259&qid=d2948a0e-6c0c-4256-851b-5e7007859553&goback=。 gmp_25827

There was a recent discussion on the python group on linkedin about a similar issue, and apparently lxml is the most recommended pythonic parser for html pages.

http://www.linkedin.com/groupItem?view=&gid=25827&type=member&item=27735259&qid=d2948a0e-6c0c-4256-851b-5e7007859553&goback=.gmp_25827

深府石板幽径 2024-10-01 01:46:04

您可能需要使用一些属性、ID 或名称来标识表。

from BeautifulSoup import BeautifulSoup

data = """
<table>
<tr>
  <td>1,1</td>
  <td>1,2</td>
</tr>
<tr>
  <td>2,1</td>
  <td>2,2</td>
</tr>
</table>
"""

soup = BeautifulSoup(data)

for t in soup.findAll('table'):
    for tr in t.findAll('tr'):
        print [td.contents for td in tr.findAll('td')]

编辑:如果有多个链接,程序应该做什么?

前任:

<td><a href="#">A</a> B <a href="#">C</a></td>

You'll probably need to identify the table with some attrs, id or name.

from BeautifulSoup import BeautifulSoup

data = """
<table>
<tr>
  <td>1,1</td>
  <td>1,2</td>
</tr>
<tr>
  <td>2,1</td>
  <td>2,2</td>
</tr>
</table>
"""

soup = BeautifulSoup(data)

for t in soup.findAll('table'):
    for tr in t.findAll('tr'):
        print [td.contents for td in tr.findAll('td')]

Edit: What should do the program if there're multiple links?

Ex:

<td><a href="#">A</a> B <a href="#">C</a></td>
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文