How do I extract the required data from an HTML file?
This is the HTML I have:
p_tags = '''<p class="foo-body">
<font class="test-proof">Full name</font> Foobar<br />
<font class="test-proof">Born</font> July 7, 1923, foo, bar<br />
<font class="test-proof">Current age</font> 27 years 226 days<br />
<font class="test-proof">Major teams</font> <span style="white-space: nowrap">Japan,</span> <span style="white-space: nowrap">Jakarta,</span> <span style="white-space: nowrap">bazz,</span> <span style="white-space: nowrap">foo,</span> <span style="white-space: nowrap">foobazz</span><br />
<font class="test-proof">Also</font> bar<br />
<font class="test-proof">foo style</font> hand <br />
<font class="test-proof">bar style</font> ball<br />
<font class="test-proof">foo position</font> bak<br />
<br class="bar" />
</p>'''
This is my Python code, using Beautiful Soup:
def get_info(p_tags):
    """Returns brief information."""
    head_list = []
    detail_list = []
    # This works fine
    for head in p_tags.findAll('font', 'test-proof'):
        head_list.append(head.contents[0])
    # Some problem with this?
    for index in xrange(2, 30, 4):
        detail_list.append(p_tags.contents[index])
    return dict([(l, detail_list[head_list.index(l)]) for l in head_list])
I get the proper head_list from the HTML, but the detail_list is not working.
head_list = [u'Full name', u'Born', u'Current age', u'Major teams', u'Also', u'foo style', u'bar style', u'foo position']
I want something like this:
{ 'Full name': 'Foobar', 'Born': 'July 7, 1923, foo, bar', 'Current age': '78 years 226 days', 'Major teams': 'Japan, Jakarta, bazz, foo, foobazz', 'Also': 'bar', 'foo style': 'hand', 'bar style': 'ball', 'foo position': 'bak' }
Any help would be appreciated. Thanks in advance.
Comments (4)
I started answering this before I realised you were using Beautiful Soup, but here's a parser, written with the HTMLParser library, that I think works with your example string.
Gives the result:
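A minimal sketch of an HTMLParser-based approach along these lines, not the answerer's original code. It assumes Python 3's `html.parser` module; the class name `InfoParser` and the shortened sample HTML are hypothetical. Text inside a `<font>` element is treated as a label, and all text up to the next `<font>` is treated as that label's value.

```python
from html.parser import HTMLParser


class InfoParser(HTMLParser):
    """Collects {label: value} pairs; labels are the text inside <font> tags."""

    def __init__(self):
        super().__init__()
        self.info = {}
        self._in_label = False  # True while inside a <font> element
        self._label = None      # label whose value is being collected
        self._parts = []        # text fragments of the current value

    def handle_starttag(self, tag, attrs):
        if tag == "font":
            self._flush()       # a new label ends the previous value
            self._in_label = True
            self._label = ""

    def handle_endtag(self, tag):
        if tag == "font":
            self._in_label = False

    def handle_data(self, data):
        if self._in_label:
            self._label += data
        elif self._label is not None:
            self._parts.append(data)

    def _flush(self):
        """Store the pending label/value pair, collapsing whitespace."""
        if self._label is not None:
            value = " ".join("".join(self._parts).split())
            self.info[self._label.strip()] = value
        self._parts = []

    def close(self):
        super().close()
        self._flush()           # store the last pair
        self._label = None


# Shortened, hypothetical version of the HTML from the question.
html = '''<p class="foo-body">
<font class="test-proof">Full name</font> Foobar<br />
<font class="test-proof">Major teams</font> <span style="white-space: nowrap">Japan,</span> <span style="white-space: nowrap">Jakarta</span><br />
<font class="test-proof">foo style</font> hand <br />
<br class="bar" />
</p>'''

parser = InfoParser()
parser.feed(html)
parser.close()
info = parser.info
print(info)
# {'Full name': 'Foobar', 'Major teams': 'Japan, Jakarta', 'foo style': 'hand'}
```

Because values are accumulated from every text fragment between two labels, the `<span>` elements in the "Major teams" row need no special handling.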
The issue is that your HTML is not very well thought out: you have a "mixed content model" where your labels and your data are interleaved. Your labels are wrapped in <font> tags, but your data is in NavigableString nodes.
You need to iterate over the contents of p_tag. There will be two kinds of nodes: Tag nodes (which have your <font> tags) and NavigableString nodes, which have the other bits of text.
Something approximately like that.
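The iteration described above might be sketched as follows. This is an illustration under assumptions, not the answerer's code: it uses the `bs4` package (Beautiful Soup 4) with the built-in `html.parser` backend rather than the Beautiful Soup 3 API used in the question, and the sample HTML is a shortened, hypothetical version of the original.

```python
from bs4 import BeautifulSoup, Tag


def get_info(p_tag):
    """Walk the direct children of p_tag and pair each <font> label
    with the text that follows it, up to the next <font>."""
    info = {}
    label = None
    parts = []

    def store():
        if label is not None:
            # Collapse runs of whitespace left over from the markup.
            info[label] = " ".join("".join(parts).split())

    for node in p_tag.contents:
        if isinstance(node, Tag) and node.name == "font":
            store()                      # finish the previous pair
            label = node.get_text(strip=True)
            parts = []
        elif label is not None:
            # Other Tags (<span>, <br>) contribute their inner text;
            # NavigableString nodes are str subclasses.
            parts.append(node.get_text() if isinstance(node, Tag) else str(node))
    store()                              # finish the last pair
    return info


html = '''<p class="foo-body">
<font class="test-proof">Full name</font> Foobar<br />
<font class="test-proof">Major teams</font> <span style="white-space: nowrap">Japan,</span> <span style="white-space: nowrap">Jakarta</span><br />
<font class="test-proof">foo style</font> hand <br />
<br class="bar" />
</p>'''

soup = BeautifulSoup(html, "html.parser")
info = get_info(soup.find("p", class_="foo-body"))
print(info)
# {'Full name': 'Foobar', 'Major teams': 'Japan, Jakarta', 'foo style': 'hand'}
```

Keying on the node type instead of on fixed indices is what makes this robust to rows such as "Major teams", where the value spans several child nodes.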
Sorry for the unnecessarily complex code, I badly need a big dose of caffeine ;)
You want to find the strings preceded by > and followed by <, ignoring trailing or leading whitespace. You can do this quite easily with a loop looking at each character in the string, or regular expressions could help. Something like >[ \t]*[^<]+[ \t]*<.
You could also use re.split with a regex that matches the tags, something like <[^>]*>, as the splitter. You will get some empty entries in the array, but these are easily deleted.
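A rough sketch of the re.split idea, using only the standard library. Splitting on `<br ...>` first, so that each record holds one label and its value, is an added assumption that goes beyond the suggestion above, and the sample HTML is a shortened, hypothetical version of the original. Regular expressions are fragile on arbitrary HTML, so this only suits markup as regular as the example.

```python
import re

html = '''<p class="foo-body">
<font class="test-proof">Full name</font> Foobar<br />
<font class="test-proof">Major teams</font> <span style="white-space: nowrap">Japan,</span> <span style="white-space: nowrap">Jakarta</span><br />
<font class="test-proof">foo style</font> hand <br />
<br class="bar" />
</p>'''

TAG = re.compile(r"<[^>]*>")      # any tag, used as the splitter
BREAK = re.compile(r"<br[^>]*>")  # assumption: one record per <br>

# Plain re.split on tags: strip each piece and drop the empty entries.
tokens = [t.strip() for t in TAG.split(html) if t.strip()]
print(tokens)
# ['Full name', 'Foobar', 'Major teams', 'Japan,', 'Jakarta', 'foo style', 'hand']

# Build the dict: the first piece of each record is the label,
# the remaining pieces together form the value.
info = {}
for record in BREAK.split(html):
    pieces = [p.strip() for p in TAG.split(record) if p.strip()]
    if len(pieces) >= 2:
        info[pieces[0]] = " ".join(pieces[1:])
print(info)
# {'Full name': 'Foobar', 'Major teams': 'Japan, Jakarta', 'foo style': 'hand'}
```

The flat `tokens` list alone cannot tell labels from values once a value spans several tags, which is why the second step groups the pieces per record.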