python - How to parse HTML tables
I have an HTML page with about 50 tables on it. Each table has the same layout but different values, e.g.:
<table align="right" class="customTableClass">
    <tr align="center">
        <td width="25" height="25" class="usernum">value1</td>
        <td width="25" height="25" class="usernum">value2</td>
        <td width="25" height="25" class="usernum">value3</td>
        <td width="25" height="25" class="usernum">value4</td>
        <td width="25" height="25" class="usernum">value5</td>
        <td width="25" height="25" class="usernum">value6</td>
        <td width="25" height="25" class="totalnum">otherVal</td>
    </tr>
</table>
My REST server runs Django/Python, so my urls.py routes to my parse_url() function, which is where I want to do all the work. My problem is that I'm pretty much a newbie when it comes to Python, so I literally don't know where to put my code. I took some code from the HTMLParser Python docs and changed it as follows:
import urllib, urllib2
from django.http import HttpResponse
from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print "Encountered the beginning of a %s tag" % tag

    def handle_endtag(self, tag):
        print "Encountered the end of a %s tag" % tag

    def handle_data(self, data):
        HttpResponse("Encountered data %s" % data)

def parse_url(request):
    p = MyHTMLParser()
    url = 'http://www.mysite.com/lists.asp'
    content = urllib.urlopen(url).read()
    p.feed(content)
    return HttpResponse('DONE')
At the moment this code doesn't output anything useful. It just returns DONE.
How do I use the class methods such as handle_starttag()? Shouldn't these be called automatically when I use p.feed(content)?
Basically, what I'm trying to accomplish in the end is that, when I go to mysite.com/showlist, the page outputs a list saying:
value1
value2
value3
value4
value5
value6
othervalue
This needs to be done in a loop, because there are roughly 50 tables with different values in each table.
Thanks for helping a beginner!!
Comments (2)
Your handler methods are being called, but print writes to stdout, not to the Django response. Here is how to get HTMLParser to do your bidding:
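A minimal sketch of that idea, under a few assumptions: it targets Python 3, where the parser lives in html.parser rather than the HTMLParser module used in the question, and the names TableValueParser and extract_values are made up for illustration. Instead of printing, the handlers store the text of each matching cell on the parser instance so the view can read it after feed() returns.

```python
# Python 3 module name; the question's Python 2 code imports it from `HTMLParser`.
from html.parser import HTMLParser


class TableValueParser(HTMLParser):
    """Collect the text of every <td> whose class is one of `wanted_classes`.

    The class name and the `wanted_classes` parameter are hypothetical,
    chosen to match the markup shown in the question.
    """

    def __init__(self, wanted_classes=("usernum", "totalnum")):
        super().__init__()
        self.wanted_classes = set(wanted_classes)
        self.in_cell = False   # True while inside a matching <td>
        self.values = []       # one string per matching cell

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "td" and dict(attrs).get("class") in self.wanted_classes:
            self.in_cell = True
            self.values.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current cell.
        if self.in_cell:
            self.values[-1] += data


def extract_values(html):
    """Feed the markup to the parser and return the collected cell texts."""
    parser = TableValueParser()
    parser.feed(html)
    parser.close()
    return [value.strip() for value in parser.values]
```

In the view, the page content would be passed to extract_values() and the result joined into the response body, for example HttpResponse("\n".join(values), content_type="text/plain"). Because the parser walks the whole document, all 50 tables are covered by a single feed() call with no explicit loop.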
I would recommend putting the class into a utils.py file and keeping it in the same folder as your views.py, then importing it. This keeps your views.py manageable, because it contains only views.
Check out BeautifulSoup.
Here is the documentation: http://www.crummy.com/software/BeautifulSoup/documentation.html
PS: It will be much more flexible, including for future requirements!
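As a rough illustration of this suggestion, the sketch below assumes the current Beautiful Soup 4 package (installed as beautifulsoup4 and imported as bs4), which may differ from the version the linked page documents. The function name extract_table_values is hypothetical. It returns one list of cell texts per table with the class used in the question.

```python
from bs4 import BeautifulSoup  # third-party package: pip install beautifulsoup4


def extract_table_values(html):
    """Return one list of cell texts per table with class "customTableClass"."""
    # "html.parser" selects Python's built-in parser, so no extra dependency is needed.
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for table in soup.find_all("table", class_="customTableClass"):
        # Every <td> in the table, in document order, with surrounding whitespace removed.
        rows.append([cell.get_text(strip=True) for cell in table.find_all("td")])
    return rows
```

Grouping the values per table keeps the 50 tables separate, so the view can decide whether to print them as one flat list or with a break between tables.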