Printing certain HTML with Python Mechanize

Posted on 2024-12-09 16:04:57


I'm making a small Python script for auto-logon to a website, but I'm stuck.

I'm looking to print to the terminal a small part of the HTML, located within this tag in the site's HTML file:

<td class=h3 align='right'>&nbsp;&nbsp;John Appleseed</td><td>&nbsp;<a href="members_myaccount.php"><img border=0 src="../tbs_v7_0/images/myaccount.gif" alt="My Account"></a></td>

But how do I extract and print just the name, John Appleseed?

I'm using Python's Mechanize on a Mac, by the way.


3 Answers

暖风昔人 2024-12-16 16:04:57


Mechanize is only good for fetching the HTML. Once you want to extract information from the HTML, you could use, for example, BeautifulSoup. (See also my answer to a similar question: Web mining or scraping or crawling? What tool/library should I use?)
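
For the fetching/login step itself, a rough mechanize sketch might look like the following; the URL and form field names are placeholders, not taken from the question, and it assumes the login form is the first form on the page:

import mechanize

br = mechanize.Browser()
br.open("http://example.com/login.php")  # hypothetical login URL
br.select_form(nr=0)                     # assumes the login form is the first form
br["username"] = "john"                  # hypothetical field names
br["password"] = "secret"
response = br.submit()
html = response.read()                   # this is the html you've fetched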

Depending on where the <td> is located in the HTML (it's unclear from your question), you could use the following code:

# Python 2 with the old BeautifulSoup 3 module
from BeautifulSoup import BeautifulSoup

html = ...  # this is the html you've fetched

soup = BeautifulSoup(html)
# use this (gets all <td> elements)
cols = soup.findAll('td')
# or this (gets only <td> elements with class='h3')
cols = soup.findAll('td', attrs={"class": 'h3'})
print cols[0].renderContents()  # print content of first <td> element
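
On current Python 3 with BeautifulSoup 4 (bs4), a minimal equivalent sketch, using the snippet from the question as the input, would be:

from bs4 import BeautifulSoup  # BeautifulSoup 4 on Python 3

html = """<td class=h3 align='right'>&nbsp;&nbsp;John Appleseed</td>"""
soup = BeautifulSoup(html, "html.parser")
cell = soup.find("td", class_="h3")  # first <td> with class="h3"
print(cell.get_text(strip=True))     # -> John Appleseed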
很酷又爱笑 2024-12-16 16:04:57


As you have not provided the full HTML of the page, the only option right now is either using string.find() or regular expressions.

But the standard way of finding this is using XPath. See this question: How to use Xpath in Python?

You can obtain the XPath for an element using the "Inspect Element" feature of Firefox.

For example, if you want to find the XPath for the username on the Stack Overflow site:

  • Open Firefox, log in to the website, right-click the username (shadyabhi in my case) and select "Inspect Element".
  • Hover your mouse over the tag, or right-click it and choose "Copy XPath".

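As a concrete illustration of the XPath route, here is a rough sketch using lxml on Python 3 (the table wrapper is an assumption added so the HTML parser keeps the <td> elements in place; it is not part of the question's snippet):

from lxml import html

# The question's cells, wrapped in a minimal (hypothetical) table.
snippet = """<table><tr>
<td class=h3 align='right'>&nbsp;&nbsp;John Appleseed</td>
<td>&nbsp;<a href="members_myaccount.php">My Account</a></td>
</tr></table>"""

doc = html.fromstring(snippet)
# Take the text of the first <td> whose class attribute is "h3".
name = doc.xpath("//td[@class='h3']/text()")[0]
print(name.strip())  # -> John Appleseed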

还在原地等你 2024-12-16 16:04:57


You can use a parser to extract any information in a document. I suggest you use the lxml module.

Here is an example:

from lxml import etree
from StringIO import StringIO  # Python 2 location of StringIO

parser = etree.HTMLParser()
tree = etree.parse(StringIO("""<td class=h3 align='right'>  John Appleseed</td><td> <a href="members_myaccount.php"><img border=0 src="../tbs_v7_0/images/myaccount.gif" alt="My Account"></a></td>"""), parser)

>>> tree.xpath("string()").strip()
u'John Appleseed'

More information about lxml here
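
The snippet above is Python 2 (the StringIO module and the u'' literal). A minimal sketch of the same idea on Python 3, assuming lxml is installed:

from io import StringIO  # Python 3 home of StringIO
from lxml import etree

parser = etree.HTMLParser()
tree = etree.parse(StringIO("""<td class=h3 align='right'>  John Appleseed</td>"""), parser)
print(tree.xpath("string()").strip())  # -> John Appleseed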
