python - How to parse HTML tables

Posted 2024-12-04 07:28:47


I have an HTML page with about 50 tables on it. Each table has the same layout, but with different values, eg:

<table align="right" class="customTableClass">
<tr align="center">
<td width="25" height="25" class="usernum">value1</td>
<td width="25" height="25" class="usernum">value2</td>
<td width="25" height="25" class="usernum">value3</td>
<td width="25" height="25" class="usernum">value4</td>
<td width="25" height="25" class="usernum">value5</td>
<td width="25" height="25" class="usernum">value6</td>
<td width="25" height="25" class="totalnum">otherVal</td>
</tr>
</table>

My REST server is running Django/Python, so in my urls.py I am calling my parse_url() function, which is obviously where I want to do all the work. My problem is, I'm pretty much a newbie when it comes to Python, so I literally just don't know where to put my code. I have taken some code from the HTMLParser Python docs and changed it as follows:

import urllib, urllib2
from django.http import HttpResponse
from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print "Encountered the beginning of a %s tag" % tag

    def handle_endtag(self, tag):
        print "Encountered the end of a %s tag" % tag

    def handle_data(self, data):
        HttpResponse("Encountered data %s" % data)


def parse_url(request):
    p = MyHTMLParser()
    url = 'http://www.mysite.com/lists.asp'
    content = urllib.urlopen(url).read()
    p.feed(content)
    return HttpResponse('DONE')

This code, at the moment, doesn't output anything useful. It just prints out DONE, which isn't very useful.

How do I use the class methods such as handle_starttag()? Shouldn't these be called automatically when I use p.feed(content)?

Basically, what I'm trying to accomplish in the end is, when I go to mysite.com/showlist, to be able to output a list saying:

value1
value2
value3
value4
value5
value6

othervalue

This needs to be done in a loop, because there are roughly 50 tables with different values in each table.

Thanks for helping a beginner!


长安忆 2024-12-11 07:28:47


You are printing the beginning of the answer to stdout, which Django never sees. Here is how to get HTMLParser to do your bidding:

import urllib, urllib2
from django.http import HttpResponse
from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):
    def __init__(self, *args, **kwargs):
        # Track whether we are currently inside a <td>, and collect its text.
        self.capture_data = False
        self.data_list = []
        HTMLParser.__init__(self, *args, **kwargs)

    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            self.capture_data = True

    def handle_endtag(self, tag):
        if tag == 'td':
            self.capture_data = False

    def handle_data(self, data):
        # Skip the whitespace-only runs between tags.
        if self.capture_data and data and not data.isspace():
            self.data_list.append(data)

def parse_url(request):
    p = MyHTMLParser()
    url = 'http://www.mysite.com/lists.asp'
    content = urllib.urlopen(url).read()
    p.feed(content)
    return HttpResponse(str(p.data_list))

I would recommend putting the class into a utils.py file and keeping it in the same folder as your views.py, then importing it. This will help keep your views.py manageable by only containing views.
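For anyone reading this on Python 3: the answer above is Python 2 code (the HTMLParser module is now html.parser, urllib.urlopen is now urllib.request.urlopen, and print is a function). A rough Python 3 sketch of the same parser, fed a sample table like the one in the question rather than a live URL:

```python
from html.parser import HTMLParser

class TableParser(HTMLParser):
    """Collect the text of every <td> cell fed to the parser."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.capture_data = False  # True while inside a <td>
        self.data_list = []

    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            self.capture_data = True

    def handle_endtag(self, tag):
        if tag == 'td':
            self.capture_data = False

    def handle_data(self, data):
        # feed() calls this for every text run, including whitespace
        # between tags, so filter out the empty runs.
        if self.capture_data and data.strip():
            self.data_list.append(data.strip())

p = TableParser()
p.feed('<table><tr>'
       '<td class="usernum">value1</td>'
       '<td class="totalnum">otherVal</td>'
       '</tr></table>')
print(p.data_list)  # ['value1', 'otherVal']
```

In a Django view you would feed it the downloaded page body instead and return `HttpResponse(str(p.data_list))`, as in the answer above.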

岁月如刀 2024-12-11 07:28:47


Check out BeautifulSoup; here is the documentation: http://www.crummy.com/software/BeautifulSoup/documentation.html

PS: It will be much more flexible, including for future requirements!
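That documentation link is for the old BeautifulSoup 3; the current package is beautifulsoup4 (`pip install beautifulsoup4`). A minimal sketch against the table layout from the question, grouping the `usernum` cells and the `totalnum` cell per table:

```python
from bs4 import BeautifulSoup

# Sample input mirroring one of the ~50 tables in the question.
html = """
<table align="right" class="customTableClass">
<tr align="center">
<td class="usernum">value1</td>
<td class="usernum">value2</td>
<td class="totalnum">otherVal</td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
for table in soup.find_all("table", class_="customTableClass"):
    usernums = [td.get_text(strip=True)
                for td in table.find_all("td", class_="usernum")]
    total = table.find("td", class_="totalnum").get_text(strip=True)
    print(usernums, total)  # ['value1', 'value2'] otherVal
```

Because the loop iterates over every matching `<table>`, the same code handles all 50 tables on the real page; the class-based lookup also keeps the `usernum` values separate from the `totalnum` value, which the tag-only HTMLParser approach cannot do without inspecting `attrs`.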
