Screen scraping in Python
I am currently trying to screen scrape a website to put info into a dictionary. I am using urllib2 and BeautifulSoup. I cannot figure out how to parse the web page's source to get what I want and read it into a dictionary. The info I want is displayed as <title>Nov 24 | 8:00AM | Sole In. Peace Out. </title>
in the source code. I am thinking of using a regular expression to read in the line, convert the time and date to a datetime, and then parse the line to read the data into a dictionary. The dictionary output should be something along the lines of
[
    {
        "date": datetime(2010, 11, 24, 23, 59),
        "title": "Sole In. Peace Out.",
    }
]
Current Code:
from BeautifulSoup import BeautifulSoup
import re
import urllib2
url = 'http://events.cmich.edu/RssStudentEvents.aspx'
response = urllib2.urlopen(url)
html = response.read()
soup = BeautifulSoup(html)
Sorry for the wall of text, and thank you for your time and help!
Comments (3)
Something like this..
findAll() returns all instances of the searched element, so you can treat the result like any other list. That should just about do it :)
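A minimal sketch of treating the matches as a list. It swaps in the standard library's xml.etree.ElementTree (Python 3) for BeautifulSoup, where findall() plays the role of findAll(). The sample feed is made up to mirror the <title> format from the question, and the channel/item/title layout is an assumption about the real feed.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample shaped like the feed in the question; the real feed
# may nest its elements differently.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Student Events</title>
    <item><title>Nov 24 | 8:00AM | Sole In. Peace Out. </title></item>
    <item><title>Nov 25 | 11:59PM | Another Event </title></item>
  </channel>
</rss>"""

root = ET.fromstring(SAMPLE_FEED)

# findall() returns a plain list of matching elements, so it can be
# looped over, indexed, or sliced like any other list.
title_elements = root.findall("./channel/item/title")
titles = [element.text.strip() for element in title_elements]

for title in titles:
    print(title)
```

Restricting the path to item titles skips the channel's own <title>, which does not follow the "date | time | name" pattern.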
Edit: small fix
Edit2: fix from comments below
EDIT: I did not realize it's not an HTML page, so take a look at the correction by Chris. The below would work for HTML pages.
You can use:
or:
Take a look here:
Note that this also includes the time of day.
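One way to get a datetime that carries the time of day, sketched with plain string splitting and datetime.strptime rather than a regular expression. The function name parse_event and the year argument are assumptions for illustration: the title text has no year, so the caller has to supply one.

```python
from datetime import datetime


def parse_event(raw_title, year):
    """Split 'Nov 24 | 8:00AM | Title' into a dict with a datetime and a title."""
    # Split only on the first two pipes so a pipe inside the title survives.
    date_part, time_part, title = [part.strip() for part in raw_title.split("|", 2)]
    # The feed text carries no year, so the caller supplies one (assumption).
    when = datetime.strptime(
        "%d %s %s" % (year, date_part, time_part), "%Y %b %d %I:%M%p"
    )
    return {"date": when, "title": title}


event = parse_event("Nov 24 | 8:00AM | Sole In. Peace Out. ", 2010)
print(event)
```

The %b and %p directives depend on the locale, so this sketch assumes English month abbreviations and AM/PM markers.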
And then, as I think you want all the items,
If you wanted more complex year handling you could do it. You get the idea.
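As one possible take on year handling, here is a small sketch that guesses the year from today's date. The helper name guess_year and the rule itself are assumptions: it supposes the feed lists upcoming events only, so a month earlier than the current one is taken to mean next year.

```python
from datetime import date, datetime


def guess_year(month_day, today):
    """Guess the year for a 'Nov 24' style string, given today's date."""
    # Parse against a leap year so 'Feb 29' is accepted; only the month is used.
    month = datetime.strptime("2000 " + month_day, "%Y %b %d").month
    # Assumption: the feed lists upcoming events only, so a month earlier
    # than the current one must belong to next year.
    return today.year + 1 if month < today.month else today.year


print(guess_year("Nov 24", date(2010, 11, 20)))
print(guess_year("Jan 5", date(2010, 12, 20)))
```

If the feed also kept past events, this rule would push them a year ahead, so it would need adjusting.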
Final addition: a generator would be a nice way of using this.
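A rough sketch of the generator idea, again using the standard library's xml.etree.ElementTree (Python 3) in place of BeautifulSoup. The name iter_events, the year argument, and the sample feed are hypothetical, and the item/title layout is an assumption about the real feed.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Hypothetical sample shaped like the feed in the question.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Student Events</title>
    <item><title>Nov 24 | 8:00AM | Sole In. Peace Out. </title></item>
    <item><title>Nov 25 | 11:59PM | Another Event </title></item>
    <item><title>Untimed notice</title></item>
  </channel>
</rss>"""


def iter_events(feed_xml, year):
    """Yield one {'date', 'title'} dict per well-formed <item> title."""
    root = ET.fromstring(feed_xml)
    for item in root.iter("item"):
        raw = item.findtext("title", default="")
        parts = [part.strip() for part in raw.split("|", 2)]
        if len(parts) != 3:
            continue  # skip titles that are not 'date | time | name'
        date_part, time_part, title = parts
        # The year is not in the feed text, so it is passed in (assumption).
        when = datetime.strptime(
            "%d %s %s" % (year, date_part, time_part), "%Y %b %d %I:%M%p"
        )
        yield {"date": when, "title": title}


for event in iter_events(SAMPLE_FEED, 2010):
    print(event["date"], event["title"])
```

Because it yields events one at a time, the caller can stop early or wrap it in list() when the full list of dictionaries is wanted. With a live feed, the bytes returned by urllib.request.urlopen(url).read() could be passed in place of the sample string.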