Python BeautifulSoup: automatically tracking content table rows and columns

Posted 2024-10-30 09:33:09


First let me say that I am new to Stack and to Python; I only started working with Python last week. I am, however, a seasoned PHP/C++/Pascal/ADA/B/Forth (showing my age) programmer.

I have written a script that pulls product pages from a website and stores them in my local MySQL database. I did this so that I can crawl the site late at night when the load is light. I now need to go through the HTML of each page and extract the product descriptions. These are laid out in tables, but the needed values may be in different rows/columns on each page.

The things I can be sure of are:

Each table has a heading that defines the data in the rows/columns below it.
The heading text is consistent for each value, i.e. 'Part' always describes the part type and 'Part No.' always describes a part number.
Not all pages will contain all the desired data, so if a value is not found the script must still save whatever it does find.

Of the two parts below, it is the second one, getting the data values, that I am having trouble with. How do I select the nth column from a row?

My current approach is:

To Get Desired Columns

Get the HTML doc from the db.
Grab the table (my table is always contained in the only div on the page).
Grab all the rows (really only need to do this for the first row).
For each row, record the row and column indexes whenever I find a desired field name.

To Get Data Values

For each row:
Skip the row if it is a header (the row indexes of the header rows were saved in the first part).
For each desired column, grab the text value.
Save the values to the db.
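
The column-selection step above could be sketched as follows. This is only a minimal illustration, assuming the bs4 package (BeautifulSoup 4) and Python 3; the helper name nth_cell and the shortened sample table are hypothetical and not part of the original script.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical sample in the same shape as the real pages.
html = """<table>
<tr><td> </td><td><b>Item</b></td><td> </td><td><b>Part No.</b></td></tr>
<tr><td> </td><td>Toaster</td><td> </td><td>#25713</td></tr>
</table>"""


def nth_cell(row, n):
    """Return the stripped text of the nth <td> in a row (0-based),
    or None when the row has fewer than n + 1 cells."""
    cells = row.find_all("td")
    return cells[n].get_text(strip=True) if n < len(cells) else None


soup = BeautifulSoup(html, "html.parser")
rows = soup.find_all("tr")

# Column 3 of the second row holds the part number in this sample.
print(nth_cell(rows[1], 3))
```

Indexing the list returned by find_all("td") avoids the whitespace text nodes that sit between tags, which is why it is used here instead of row.contents.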

The important part of my page looks like this:

<div>
   ... 
   <table>
      <tr><td> </td><td><b>Item</b></td><td> </td><td><b>Description</b></td><td> </td><td><b>Part No.</b></td><td> </td><td><b>Color</b></td><td> </td></tr>
      <tr><td> </td><td> </td><td> </td><td> </td><td> </td><td> </td><td> </td><td> </td><td> </td></tr>
      <tr><td> </td><td>Toaster</td><td> </td><td>2-Slice</td><td> </td><td>#25713</td><td> </td><td>Chorme</td><td> </td></tr>
   </table>
   ...
</div>

A big thank you to anyone who responds.


梦途 2024-11-06 09:33:09

Here's how I'd tackle it:

# BeautifulSoup 4 (the bs4 package) on Python 3; the older
# "from BeautifulSoup import BeautifulSoup" import is BeautifulSoup 3 / Python 2 only
from bs4 import BeautifulSoup

doc = '''<div>
   <table>
      <tr><td> </td><td><b>Item</b></td><td> </td><td><b>Description</b></td><td> </td><td><b>Part No.</b></td><td> </td><td><b>Color</b></td><td> </td></tr>
      <tr><td> </td><td> </td><td> </td><td> </td><td> </td><td> </td><td> </td><td> </td><td> </td></tr>
      <tr><td> </td><td>Toaster</td><td> </td><td>2-Slice</td><td> </td><td>#25713</td><td> </td><td>Chorme</td><td> </td></tr>
   </table>
</div>'''

soup = BeautifulSoup(doc, "html.parser")
# find the table element in the HTML document
table = soup.find("table")
# collect the <tr> elements; table.contents would also include whitespace text nodes
rows = table.find_all("tr")
# grab the top row
first_row = rows[0]
# find how many columns there are
number_of_columns = len(first_row.find_all("td"))
rest_of_rows = rows[1:]
for row in rest_of_rows:
    cells = row.find_all("td")
    # guard against rows that are shorter than the top row
    for x in range(min(number_of_columns, len(cells))):
        print("column data: %s" % cells[x].get_text(strip=True))

This extracts the first table element from the document, works out the number of columns from the first row, and then loops through the remaining rows, printing the data in each cell.

Useful link to BS docs: http://www.crummy.com/software/BeautifulSoup/documentation.html
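
Since the question is about finding values by their heading text rather than by a fixed position, the same idea can be taken one step further. The following is a rough sketch only, assuming bs4 on Python 3; the function name extract_records and the WANTED list are hypothetical, and "Weight" is included purely to show a heading that is missing from the page.

```python
from bs4 import BeautifulSoup

doc = '''<div>
   <table>
      <tr><td> </td><td><b>Item</b></td><td> </td><td><b>Description</b></td><td> </td><td><b>Part No.</b></td><td> </td><td><b>Color</b></td><td> </td></tr>
      <tr><td> </td><td> </td><td> </td><td> </td><td> </td><td> </td><td> </td><td> </td><td> </td></tr>
      <tr><td> </td><td>Toaster</td><td> </td><td>2-Slice</td><td> </td><td>#25713</td><td> </td><td>Chorme</td><td> </td></tr>
   </table>
</div>'''

# Hypothetical list of headings to look for; "Weight" is not on this page.
WANTED = ["Item", "Description", "Part No.", "Color", "Weight"]


def extract_records(html, wanted=WANTED):
    """Return one dict per data row, keyed by the heading text that was found."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:
        return []
    rows = [tr.find_all("td") for tr in table.find_all("tr")]

    # The header row is taken to be the first row containing a wanted heading.
    columns, header_index = {}, None
    for i, cells in enumerate(rows):
        texts = [cell.get_text(strip=True) for cell in cells]
        found = {text: j for j, text in enumerate(texts) if text in wanted}
        if found:
            columns, header_index = found, i
            break
    if header_index is None:
        return []

    records = []
    for cells in rows[header_index + 1:]:
        # Only headings that were actually found are read, so missing ones are skipped.
        record = {name: cells[j].get_text(strip=True)
                  for name, j in columns.items() if j < len(cells)}
        if any(record.values()):  # drop spacer rows whose cells are all empty
            records.append(record)
    return records


for record in extract_records(doc):
    print(record)
```

Each record contains only the headings present on the page, so a page without, say, a "Weight" column still yields the values it does have, ready to be written to the database.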


草莓味的萝莉 2024-11-06 09:33:09

Here is how you do it with HTQL:

import htql

doc = '''<div>     <table>
    <tr><td> </td><td><b>Item</b></td><td> </td><td><b>Description</b></td><td>         </td><td><b>Part No.</b></td><td> </td><td><b>Color</b></td><td> </td></tr>
    <tr><td> </td><td> </td><td> </td><td> </td><td> </td><td> </td><td> </td><td> </td><td> </td></tr>
    <tr><td> </td><td>Toaster</td><td> </td><td>2-Slice</td><td> </td><td>#25713</td><td> </td><td>Chorme</td><td> </td></tr>
  </table>  </div>'''

query = "<div>.<table>.<tr>{item=<td (th='Item')>&tx; desc=<td (th='Description')>&tx | item<>'Item'}"

for item, desc in htql.HTQL(doc, query):
    print(item, desc)
