Parsing an HTML table

Posted on 2024-11-28 15:31:00


To start off here's my current code in its entirety:

import urllib
from BeautifulSoup import BeautifulSoup
import sgmllib
import re

page = 'http://www.sec.gov/Archives/edgar/data/\
8177/000114036111018563/form10k.htm'

sock = urllib.urlopen(page)
raw = sock.read()
soup = BeautifulSoup(raw)

tablelist = soup.findAll('table')

class MyParser(sgmllib.SGMLParser):

    def parse(self, segment):
        self.feed(segment)
        self.close()

    def __init__(self, verbose=0):
        sgmllib.SGMLParser.__init__(self, verbose)
        self.descriptions = []
        self.inside_td_element = 0
        self.starting_description = 0

    def start_td(self, attributes):
        for name, value in attributes:
            if name == "valign":
                self.inside_td_element = 1
                self.starting_description = 1
            else:
                self.inside_td_element = 1
                self.starting_description = 1

    def end_td(self):
        self.inside_td_element = 0

    def handle_data(self, data):
        if self.inside_td_element:
            if self.starting_description:
                self.descriptions.append(data)
                self.starting_description = 0
            else:
                self.descriptions[-1] += data

    def get_descriptions(self):
        return self.descriptions

counter = 0
trlist = []
dtablelist = []

while counter < len(tablelist):
    trsegment = tablelist[counter].findAll('tr')
    trlist.append(trsegment)
    strsegment = str(trsegment)
    myparser = MyParser()
    myparser.parse(strsegment)
    sub = myparser.get_descriptions()
    dtablelist.append(sub)
    counter = counter + 1

ex = []

dtablelist = [s for s in dtablelist if s != ex]

So what I want to accomplish is take all the tables from an html document, then reprint them onto an Excel spreadsheet. So when I create trlist the output looks like this:

print trlist[1]
[<tr>
<td valign="top" width="25%"><font style="DISPLAY: inline; FONT-WEIGHT: bold; FONT-SIZE: 10pt; FONT-FAMILY: times new roman; TEXT-DECORATION: underline"> </font></td>
<td valign="top" width="25%">
<div style="DISPLAY: block; MARGIN-LEFT: 0pt; TEXT-INDENT: 0pt; MARGIN-RIGHT: 0pt" align="center"><font style="DISPLAY: inline; FONT-WEIGHT: bold; FONT-SIZE: 10pt; FONT-FAMILY: times new roman; TEXT-DECORATION: underline">Title of each class</font></div>
</td>
<td valign="top" width="25%" style="TEXT-ALIGN: center"><font style="DISPLAY: inline; FONT-WEIGHT: bold; FONT-SIZE: 10pt; FONT-FAMILY: times new roman; TEXT-DECORATION: underline">Name of exchange</font></td>
<td valign="top" width="25%" style="TEXT-ALIGN: center"><font style="DISPLAY: inline; FONT-WEIGHT: bold; FONT-SIZE: 10pt; FONT-FAMILY: times new roman; TEXT-DECORATION: underline"> </font></td>
</tr>, <tr>
<td valign="top" width="25%"><font style="DISPLAY: inline; FONT-SIZE: 10pt; FONT-FAMILY: times new roman"> </font></td>
<td valign="top" width="25%">
<div style="DISPLAY: block; MARGIN-LEFT: 0pt; TEXT-INDENT: 0pt; MARGIN-RIGHT: 0pt" align="center"><font style="DISPLAY: inline; FONT-SIZE: 10pt; FONT-FAMILY: times new roman"><font style="DISPLAY: inline; FONT-WEIGHT: bold">Common Stock, par value</font>    </font></div>
</td>
<td valign="top" width="25%">
<div style="DISPLAY: block; MARGIN-LEFT: 0pt; TEXT-INDENT: 0pt; MARGIN-RIGHT: 0pt" align="center">
<div style="DISPLAY: block; MARGIN-LEFT: 0pt; TEXT-INDENT: 0pt; MARGIN-RIGHT: 0pt" align="center"><font style="DISPLAY: inline; FONT-WEIGHT: bold; FONT-SIZE: 10pt; FONT-FAMILY: times new roman"><font style="FONT-WEIGHT: bold"><font style="FONT-WEIGHT: bold"><font style="FONT-WEIGHT: bold">NASDAQ Global Market</font></font></font></font></div>
</div>
</td>
<td valign="top" width="25%"><font style="DISPLAY: inline; FONT-WEIGHT: bold; FONT-SIZE: 10pt; FONT-FAMILY: times new roman"> </font></td>
</tr>,...

As you can see, each item in trlist is an individual row (<tr>...</tr>) of the table, which is what I want. But when I run each trlist item through my sgmllib parser to retrieve the contents between the tags, I get this output:

print dtablelist[1]
['\nTitle of each class\n', 'Name of exchange', '\nCommon Stock, par value\n', '\n\nNASDAQ Global Market\n\n', '\n$1.00 per share\n']

As you can see, the output is a flat list with each cell's contents as its own string, instead of one list of contents per table row (<tr>). So essentially I want this output:

[['\nTitle of each class\n', 'Name of exchange'], ['\nCommon Stock, par value\n', '\n\nNASDAQ Global Market\n\n'], ['\n$1.00 per share\n']]

Is it because I have to turn trlist into a string before I parse it with MyParser? Does anyone know any way around this, allowing me to parse lists within lists (aka Inception shit)?


醉梦枕江山 2024-12-05 15:31:00


Using lxml.html:

>>> import lxml.html
>>> data = ["<tr><td>test</td><td>help</td></tr>", "<tr><td>data1</td><td>data2</td></tr>"]
>>> [lxml.html.fromstring(tr).xpath(".//text()") for tr in data]
[['test', 'help'], ['data1', 'data2']]

And here is some more complete code. It stores the text in a list of tables; each table is a list of rows (one per tr), and each row is a list of the non-empty text strings found in it.

import urllib
import lxml.html

data = urllib.urlopen('http://www.sec.gov/Archives/edgar/data/8177/000114036111018563/form10k.htm').read()
tree = lxml.html.fromstring(data)

tables = []
for tbl in tree.iterfind('.//table'):
    tele = []
    tables.append(tele)
    for tr in tbl.iterfind('.//tr'):
        text = [e.strip() for e in tr.xpath('.//text()') if len(e.strip()) > 0]
        tele.append(text)

print tables
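
Since the stated goal is to get the tables into an Excel spreadsheet, one possible follow-up is to write the nested list out as CSV, which Excel can open. This is only a sketch in Python 3 using the standard csv module; the function name write_tables_csv and the choice of a blank row between tables are assumptions made for illustration, not part of the answer above.

```python
import csv
import io


def write_tables_csv(tables, out):
    """Write every row of every table to `out` as CSV.

    `tables` is a list of tables, each table a list of rows,
    each row a list of strings (the shape built above).
    `write_tables_csv` is a hypothetical helper name.
    """
    writer = csv.writer(out)
    for index, table in enumerate(tables):
        if index:
            writer.writerow([])  # blank row separates consecutive tables
        writer.writerows(table)


# Small in-memory demonstration with made-up rows
buffer = io.StringIO()
write_tables_csv(
    [
        [["Title of each class", "Name of exchange"]],
        [["Common Stock, par value", "NASDAQ Global Market"]],
    ],
    buffer,
)
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with open(path, "w", newline="") as the out argument, as the csv documentation recommends.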

Hope this helps, cheers!


迷路的信 2024-12-05 15:31:00


If somebody is looking for a solution to the same problem but is using Python 3:

You don't have to use an external library to parse an HTML table, even in Python 3. There, the SGMLParser class was replaced by HTMLParser from html.parser. I've written a simple class derived from HTMLParser; it is available in a GitHub repo. It simply remembers the current scope of a <td>, <tr> or <table> tag. The advantages over using etree are that it runs correctly on HTML that is not XML-compliant and that it doesn't use external libraries.

You can use that class (here named HTMLTableParser) the following way:

import urllib.request
from html_table_parser import HTMLTableParser

target = 'http://www.twitter.com'

# get website content
req = urllib.request.Request(url=target)
f = urllib.request.urlopen(req)
xhtml = f.read().decode('utf-8')

# instantiate the parser and feed it
p = HTMLTableParser()
p.feed(xhtml)
print(p.tables)

The output is a list of 2D lists, one per table. It might look like this:

[[['   ', ' Anmelden ']],
 [['Land', 'Code', 'Für Kunden von'],
  ['Vereinigte Staaten', '40404', '(beliebig)'],
  ['Kanada', '21212', '(beliebig)'],
  ...
  ['3424486444', 'Vodafone'],
  ['  Zeige SMS-Kurzwahlen für andere Länder ']]]
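
The repository code is not reproduced in this thread, so the following is only a minimal sketch of the scope-tracking idea described above, not the author's actual HTMLTableParser. The names TableTextParser and parse_tables are hypothetical, and the sketch assumes tables are not nested inside one another.

```python
from html.parser import HTMLParser


class TableTextParser(HTMLParser):
    """Hypothetical sketch: collect tables -> rows -> cell strings."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tables = []    # finished tables, each a list of rows
        self._table = None  # rows of the table currently open
        self._row = None    # cells of the row currently open
        self._cell = None   # text fragments of the cell currently open

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._table = []
        elif tag == "tr" and self._table is not None:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text is kept only while a cell is open
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join fragments split by nested tags such as <font> or <div>
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self._table.append(self._row)
            self._row = None
            self._cell = None  # discard a cell left unclosed
        elif tag == "table" and self._table is not None:
            self.tables.append(self._table)
            self._table = None
            self._row = None
            self._cell = None


def parse_tables(html):
    """Return a list of tables; each table is a list of rows of strings."""
    parser = TableTextParser()
    parser.feed(html)
    parser.close()
    return parser.tables
```

Because each closing </tr> starts a new inner list, the result is grouped per row, which is the nesting the question asks for; cells containing only whitespace come back as empty strings rather than being dropped.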
