Getting a list of all churches in a state using Python
I am pretty good with Python, so pseudo-code will suffice when the details are trivial. Please get me started on the task: how do I go about crawling the net for the snail-mail addresses of churches in my state? Once I have a one-liner such as "123 Old West Road #3 Old Lyme City MD 01234", I can probably parse it into city, state, street, number, and apartment with enough trial and error. My problem is: if I use white pages online, how do I deal with all the HTML junk, HTML tables, ads, etc.? I do not think I need their phone numbers, but it will not hurt - once parsed, I can always throw them out. Even if your solution is half-manual (such as save to PDF, then open Acrobat, save as text), I might still be happy with it. Thanks! Heck, I will even accept Perl snippets - I can translate them myself.
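For the parsing step the question already plans to brute-force, a regular expression over the one-line format is one starting sketch. The pattern below is an assumption of mine, not a known-good rule: it expects a two-letter state code, a five-digit ZIP, and an optional "#apt" after the street, and the street/city boundary is still inherently ambiguous in free text.

    import re

    # Rough pattern for one-liners like
    # "123 Old West Road #3 Old Lyme City MD 01234".
    # Assumptions (mine): two-letter state code, five-digit ZIP,
    # optional "#apt" between street and city.  The street/city split
    # is ambiguous without the "#" marker, so expect to tune this
    # per source.
    ADDRESS_RE = re.compile(
        r"^(?P<number>\d+)\s+"
        r"(?P<street>.+?)"
        r"(?:\s+#(?P<apt>\w+))?\s+"
        r"(?P<city>[A-Z][\w ]+?)\s+"
        r"(?P<state>[A-Z]{2})\s+"
        r"(?P<zip>\d{5})$"
    )

    m = ADDRESS_RE.match("123 Old West Road #3 Old Lyme City MD 01234")
    if m:
        print(m.groupdict())
    # {'number': '123', 'street': 'Old West Road', 'apt': '3',
    #  'city': 'Old Lyme City', 'state': 'MD', 'zip': '01234'}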
5 Answers
You could use mechanize. It's a Python library that simulates a browser, so you can crawl through the white pages much as you would manually.
To deal with the 'HTML junk', Python has a library for that too: BeautifulSoup.
It is a lovely way to get the data you want out of HTML (it does assume you know a little about HTML, as you will still have to navigate the parse tree).
Update: As to your follow-up question on how to click through multiple pages: mechanize is a library to do just that. Take a closer look at their examples, especially the follow_link method. As I said, it simulates a browser, so 'clicking' can be realized quickly in Python.
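A minimal sketch of the combination (my code, not the answerer's): mechanize does the 'clicking', BeautifulSoup digs the listings out of each page. The URL, the listing markup, and the "Next" link text are placeholders for whatever directory you actually crawl.

    import re
    import mechanize
    from bs4 import BeautifulSoup

    br = mechanize.Browser()
    br.set_handle_robots(False)  # mechanize obeys robots.txt by default
    br.open("http://www.example.com/churches?state=MD")  # placeholder URL

    while True:
        soup = BeautifulSoup(br.response().read(), "html.parser")
        # Hypothetical markup: one listing per <div class="listing">.
        for listing in soup.find_all("div", class_="listing"):
            print(listing.get_text(" ", strip=True))
        try:
            # 'Click' the pagination link, as in mechanize's follow_link examples.
            br.follow_link(text_regex=re.compile("Next"))
        except mechanize.LinkNotFoundError:
            break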
Try
lynx --dump <url>
to download the web pages. All the troublesome HTML tags will be stripped from the output, and all the links from the page will appear together.
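If you want that inside the Python script rather than at the shell, a small wrapper around lynx works. This sketch assumes lynx is installed and on the PATH; the URL is a placeholder.

    import subprocess

    def dump_page(url):
        # lynx --dump renders the page as plain text and appends a
        # numbered list of the page's links at the bottom.
        return subprocess.check_output(["lynx", "--dump", url], text=True)

    text = dump_page("http://www.example.com/churches")  # placeholder URL
    print(text)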
What you're trying to do is called scraping, or web scraping.
If you do some searches on Python and scraping (for instance, this article on web scraping with Python: http://www.packtpub.com/article/web-scraping-with-python), you may find a list of tools that will help.
(I have never used scrapy, but its site looks promising :)
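For reference, a Scrapy spider for this kind of job is only a few lines. This sketch is mine, not the answerer's; the start URL and both CSS selectors are placeholders to adapt to the site you crawl.

    import scrapy

    class ChurchSpider(scrapy.Spider):
        name = "churches"
        start_urls = ["http://www.example.com/churches?state=MD"]  # placeholder

        def parse(self, response):
            # Hypothetical markup: one address per <td class="address">.
            for address in response.css("td.address::text").getall():
                yield {"address": address.strip()}
            # Follow a "Next" pagination link if the page has one.
            next_page = response.css("a.next::attr(href)").get()
            if next_page:
                yield response.follow(next_page, self.parse)

Running it with "scrapy runspider spider.py -o churches.csv" writes the yielded addresses to a CSV.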
Beautiful Soup is a no-brainer. Here's a site you might start at: http://www.churchangel.com/. They have a huge list and the formatting is very regular -- translation: easy to set up BSoup to scrape.
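A bare-bones version of that (my sketch; the tag name is a guess, so inspect churchangel.com's actual markup and adjust the selector):

    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(urlopen("http://www.churchangel.com/"), "html.parser")

    # Because the formatting is regular, one find_all() usually covers
    # every listing; "p" is a stand-in for whatever element holds an
    # address on the real pages.
    for entry in soup.find_all("p"):
        line = entry.get_text(" ", strip=True)
        if line:
            print(line)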
Python scripts might not be the best tool for this job if you're just looking for addresses of churches in a geographic area.
The US census provides a data set of churches for use with geographic information systems. If finding all the x in a spatial area is a recurring problem, invest in learning a GIS. Then you can bring your Python skills to bear on many geographic tasks.
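As a small illustration of the payoff (my sketch; the filename and column names are assumptions about whatever export you end up with): once the data is a table of churches, filtering one state is trivial.

    import csv

    # Hypothetical export from a GIS/census data set, with columns
    # name, address, city, state, zip.
    with open("churches.csv", newline="") as f:
        for row in csv.DictReader(f):
            if row["state"] == "MD":
                print(row["name"], row["address"], row["city"], row["zip"])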