Get a list of all churches in a given state using Python
I am pretty good with Python, so pseudo-code will suffice when details are trivial. Please get me started on the task: how do I go about crawling the net for the snail mail addresses of churches in my state? Once I have a one-liner such as "123 Old West Road #3 Old Lyme City MD 01234", I can probably parse it into city, state, street, number, and apartment with enough trial and error. My problem is this: if I use white pages online, how do I deal with all the HTML junk, HTML tables, ads, etc.? I do not think I need their phone numbers, but it will not hurt; I can always throw them out once parsed. Even if your solution is half-manual (such as save to PDF, then open Acrobat, save as text), I might still be happy with it. Thanks! Heck, I will even accept Perl snippets; I can translate them myself.

You could use mechanize. It's a Python library that simulates a browser, so you could crawl through the white pages (similar to what you would do manually).
To deal with the 'HTML junk', Python has a library for that too: BeautifulSoup.
It is a lovely way to get the data you want out of HTML (of course, it assumes you know a little bit about HTML, as you will still have to navigate the parse tree).
Update: As to your follow-up question on how to click through multiple pages, mechanize is a library that does just that. Take a closer look at their examples, especially the follow_link method. As I said, it simulates a browser, so 'clicking' can be realized quickly in Python.
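To make "navigating the parse tree" a bit more concrete, here is a minimal sketch that pulls address cells out of a results table while skipping ads and scripts. It deliberately uses the standard library's html.parser instead of BeautifulSoup so that it runs without installing anything; BeautifulSoup would express the same idea in fewer lines. The markup in SAMPLE_HTML, the class name "addr", and the names CellCollector and extract_addresses are all made up for illustration; a real white-pages site will be laid out differently.

```python
from html.parser import HTMLParser

# Hypothetical listing markup; a real white-pages site will differ.
SAMPLE_HTML = """
<html><body>
<div class="ad">Buy now!</div>
<table id="results">
  <tr><td class="name">First Church</td>
      <td class="addr">123 Old West Road #3 Old Lyme City MD 01234</td></tr>
  <tr><td class="name">Grace Chapel</td>
      <td class="addr">9 Main Street Smallville MD 01235</td></tr>
</table>
<script>trackVisitor();</script>
</body></html>
"""


class CellCollector(HTMLParser):
    """Collect the text of every <td> whose class equals wanted_class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        if tag == "td" and dict(attrs).get("class") == self.wanted_class:
            self.in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        # Text outside the wanted cells (ads, scripts, names) is ignored.
        if self.in_cell:
            self.cells[-1] += data


def extract_addresses(html):
    """Return the address strings found in the page, whitespace-normalized."""
    parser = CellCollector("addr")
    parser.feed(html)
    parser.close()
    return [" ".join(cell.split()) for cell in parser.cells]


if __name__ == "__main__":
    for address in extract_addresses(SAMPLE_HTML):
        print(address)
```

In practice you would fetch each results page (for example with mechanize), feed the HTML to something like extract_addresses, and then follow the "next page" link until there are no more pages.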
Try
lynx --dump <url>
to download the web pages. All the troublesome HTML tags will be stripped from the output, and all the links from the page will appear together.
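Once the dump is saved to a file, a few lines of Python can separate the page text from the link list. The sketch below assumes the usual dump layout: body text with [n] markers, followed by a "References" section of numbered URLs. SAMPLE_DUMP is made-up text imitating that layout, and split_dump is a hypothetical helper name, so check it against a real dump before relying on it.

```python
import re

# Made-up text imitating the layout of a lynx dump: body text with [n]
# link markers, then a "References" list of numbered URLs.
SAMPLE_DUMP = """\
   Churches in MD

   [1]First Church
   123 Old West Road #3 Old Lyme City MD 01234

   [2]Next page

References

   1. http://example.com/first-church
   2. http://example.com/page2
"""

# Matches lines such as "   1. http://example.com/first-church".
LINK_LINE = re.compile(r"^\s*(\d+)\.\s+(\S+)\s*$")


def split_dump(dump):
    """Return (body_text, {link_number: url}) from lynx-style dump text."""
    # Split on the last "References" heading so that the word appearing
    # earlier in the body does not confuse the split.
    body, sep, refs = dump.rpartition("\nReferences\n")
    if not sep:
        # No link list found: treat everything as body text.
        return dump, {}
    links = {}
    for line in refs.splitlines():
        match = LINK_LINE.match(line)
        if match:
            links[int(match.group(1))] = match.group(2)
    return body, links


if __name__ == "__main__":
    body, links = split_dump(SAMPLE_DUMP)
    print(body)
    print(links)
```

The numbered links make it straightforward to find the "next page" URL and repeat the dump for each page of results.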
What you're trying to do is called scraping, or web scraping.
If you do some searches on Python and scraping, you may find a list of tools that will help.
(I have never used scrapy, but its site looks promising :)
Beautiful Soup is a no-brainer. Here's a site you might start with: http://www.churchangel.com/. They have a huge list and the formatting is very regular -- translation: easy to set up BSoup to scrape.
Python scripts might not be the best tool for this job, if you're just looking for addresses of churches in a geographic area.
The US census provides a data set of churches for use with geographic information systems. If finding all the
in a spatial area is a recurring problem, invest in learning a GIS. Then you can bring your Python skills to bear on many geographic tasks.