Python 3 HTML parser
I'm sure everyone will groan, and tell me to look at the documentation (which I have) but I just don't understand how to achieve the same as the following:
curl -s http://www.maxmind.com/app/locate_my_ip | awk '/align="center">/{getline;print}'
All I have in python3 so far is:
import urllib.request
f = urllib.request.urlopen('http://www.maxmind.com/app/locate_my_ip')
for lines in f.readlines():
    print(lines)
f.close()
Seriously, any suggestions? (Please don't tell me to read http://docs.python.org/release/3.0.1/library/html.parser.html; I have been learning Python for one day and get confused easily.) A simple example would be amazing!
This is based on larsmans's answer, above.
Explanation:
for line in f iterates over the lines in the file-like object f. Python lets you iterate over the lines of a file just as you would over the items in a list.
if b'align="center">' in line checks whether the current line contains align="center">. The b prefix indicates that the literal is a buffer of bytes rather than a string. urllib.request.urlopen returns the response as binary data rather than as Unicode strings, and an unadorned 'align="center">' would be interpreted as a Unicode string. (That was the source of the TypeError above.)
next(f) takes the next line of the file, because your original awk script printed the line after 'align="center">' rather than the current line.
The decode method (bytes objects, like strings, have methods in Python) takes the binary data and converts it to a printable Unicode string. The rstrip() method strips any trailing whitespace (namely, the newline at the end of each line).
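Putting those pieces together, a minimal sketch might look like the following. The function names lines_after_marker and fetch_and_print are hypothetical, and decoding as UTF-8 (the default for decode()) is an assumption about the page's encoding.

```python
import urllib.request

MARKER = b'align="center">'  # bytes literal, because the response yields bytes


def lines_after_marker(lines, marker=MARKER):
    """Yield the decoded line that follows each line containing marker."""
    it = iter(lines)
    for line in it:
        if marker in line:
            # Like awk's getline: consume the next line instead of the current one.
            following = next(it, None)  # None if the marker was on the last line
            if following is not None:
                # bytes -> str (assumes UTF-8), then drop the trailing newline
                yield following.decode().rstrip()


def fetch_and_print(url):
    """Fetch url and print every line that follows a marker line."""
    with urllib.request.urlopen(url) as f:
        for text in lines_after_marker(f):
            print(text)
```

Calling fetch_and_print('http://www.maxmind.com/app/locate_my_ip') would then play the role of the curl | awk pipeline, provided the page still has that layout.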
But be sure to read the Python tutorial.
I would probably use a regular expression to get the IP itself, printing the first string of the form: 1-3 digits, period, 1-3 digits, period, and so on.
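A minimal sketch of that idea, assuming the page has already been fetched and decoded to a string. The helper name first_ip is hypothetical, and the pattern is deliberately loose: it only checks the shape of the address, so it would also accept something like 999.999.999.999.

```python
import re

# Four groups of 1-3 digits separated by literal dots.
IP_PATTERN = re.compile(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')


def first_ip(html):
    """Return the first dotted-quad string found in html, or None."""
    matches = IP_PATTERN.findall(html)
    return matches[0] if matches else None
```

For example, print(first_ip(page_text)) would print the first address on the page, where page_text is whatever string you got from decoding the response.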
If you were looking for the whole line, you could simply extend the pattern passed to findall() to take care of that (see the Python docs for re for more details).
By the way, the r in front of the match string makes it a raw string, so you don't need to escape Python escape characters inside it (but you still need to escape RE escape characters).
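To make the raw-string point concrete, here is a small illustration; the sample text '1.2 3x4' and the variable names are made up for the example.

```python
import re

# Without the r prefix, every backslash meant for the regex engine must be doubled.
plain = '\\d{1,3}\\.\\d{1,3}'
raw = r'\d{1,3}\.\d{1,3}'
same_pattern = (plain == raw)  # both spell the same pattern

# A raw string does not remove the need for regex-level escapes:
# an unescaped dot matches any character, an escaped dot only a literal period.
unescaped_dot = re.findall(r'\d.\d', '1.2 3x4')
escaped_dot = re.findall(r'\d\.\d', '1.2 3x4')

print(same_pattern)
print(unescaped_dot)
print(escaped_dot)
```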
Hope that helps