Getting a list of all churches in a state using Python

Posted 2024-08-15 00:19:32

I am pretty good with Python, so pseudo-code will suffice when details are trivial. Please get me started on the task: how do I go about crawling the net for the snail mail addresses of churches in my state? Once I have a one-liner such as "123 Old West Road #3 Old Lyme City MD 01234", I can probably parse it into city, state, street, number, and apartment with enough trial and error. My problem is this: if I use the white pages online, how do I deal with all the HTML junk, HTML tables, ads, etc.? I do not think I need their phone numbers, but it will not hurt; I can always throw them out once parsed. Even if your solution is half-manual (such as save to PDF, then open Acrobat, save as text), I might still be happy with it. Thanks! Heck, I will even accept Perl snippets; I can translate them myself.
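For reference, the parsing step mentioned in the question could be sketched roughly as below. This is only a minimal sketch under a strong assumption: every line follows the layout number, street, "#apt", city, two-letter state, ZIP, as in the example address. The names ADDRESS_RE and parse_address are made up for illustration. Without the "#" marker there is no reliable way to tell where the street ends and the city begins, so real data would need more rules.

```python
import re

# Assumed layout (hypothetical, based only on the single example line):
#   <number> <street> #<apt> <city> <ST> <zip>
ADDRESS_RE = re.compile(
    r"^(?P<number>\d+)\s+"          # house number
    r"(?P<street>.+?)\s+"           # street name, up to the '#' marker
    r"#(?P<apt>\S+)\s+"             # apartment / unit after '#'
    r"(?P<city>.+?)\s+"             # city, up to the state code
    r"(?P<state>[A-Z]{2})\s+"       # two-letter state abbreviation
    r"(?P<zip>\d{5}(?:-\d{4})?)$"   # 5-digit ZIP, optional +4 suffix
)


def parse_address(line):
    """Return a dict of address parts, or None if the line does not fit."""
    match = ADDRESS_RE.match(line.strip())
    return match.groupdict() if match else None


print(parse_address("123 Old West Road #3 Old Lyme City MD 01234"))
```

Lines that do not fit the assumed layout come back as None, which makes it easy to collect them for manual review.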


Comments (5)

往日情怀 2024-08-22 00:19:32

You could use mechanize. It's a Python library that simulates a browser, so you could crawl through the white pages much as you would manually.

To deal with the 'HTML junk', Python has a library for that too: BeautifulSoup. It is a lovely way to get the data you want out of HTML (of course, it assumes you know a little bit about HTML, as you will still have to navigate the parse tree).

Update: As to your follow-up question on how to click through multiple pages, mechanize is a library to do just that. Take a closer look at its examples, especially the follow_link method. As I said, it simulates a browser, so 'clicking' can be realized quickly in Python.
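To give an idea of what "navigating the parse tree" to get past the HTML junk looks like, here is a rough sketch. Note that it does not use BeautifulSoup or mechanize; it swaps in the standard library's html.parser so that it runs without extra installs. The sample HTML, the class TableTextParser, and the function extract_rows are invented for illustration, and a real white-pages listing will be laid out differently (nested tables, for instance, are not handled).

```python
from html.parser import HTMLParser


class TableTextParser(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a table cell (ads, navigation, ...) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join fragments split by tags such as <br>, collapse whitespace.
            self._row.append(" ".join(" ".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_rows(html):
    """Return the table rows found in html as lists of cell strings."""
    parser = TableTextParser()
    parser.feed(html)
    return parser.rows


# Hypothetical page fragment, made up for this sketch.
SAMPLE = """
<html><body>
<div class="ad">Buy now!</div>
<table>
  <tr><th>Name</th><th>Address</th></tr>
  <tr><td>First Church</td>
      <td>123 Old West Road #3<br>Old Lyme City MD 01234</td></tr>
</table>
</body></html>
"""

for row in extract_rows(SAMPLE):
    print(row)
```

With BeautifulSoup the same idea is shorter, since it lets you search the tree for rows and cells directly instead of tracking state by hand.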

抽个烟儿 2024-08-22 00:19:32

Try lynx --dump <url> to download the web pages. All the troublesome HTML tags will be stripped from the output, and all the links from the page will appear together.
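Once the dump is saved as text, the body and the grouped links can be separated in Python. The sketch below assumes the links appear in a trailing "References" section as numbered lines, which is how lynx dumps usually look but may vary between versions; the sample text and the function name split_lynx_dump are made up for illustration.

```python
import re

# A numbered link line such as "   1. http://example.com/page"
LINK_LINE = re.compile(r"^\s*(\d+)\.\s+(\S+)\s*$")


def split_lynx_dump(dump_text):
    """Split dump text into (body, links), assuming a 'References' footer."""
    body, sep, refs = dump_text.rpartition("\nReferences\n")
    if not sep:
        # No References section found: everything is body, no links.
        return dump_text, []
    links = []
    for line in refs.splitlines():
        match = LINK_LINE.match(line)
        if match:
            links.append(match.group(2))
    return body, links


# Hypothetical dump output, made up for this sketch.
SAMPLE_DUMP = """   First Church [1]details
   123 Old West Road #3 Old Lyme City MD 01234

References

   1. http://example.com/first-church
   2. http://example.com/next-page
"""

body, links = split_lynx_dump(SAMPLE_DUMP)
print(body)
print(links)
```

The extracted links can then be fed back to lynx one by one to walk through multi-page listings.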

霊感 2024-08-22 00:19:32

What you're trying to do is called scraping, or web scraping.

If you do some searches on Python and scraping (for example www.packtpub.com/article/web-scraping-with-python), you may find a list of tools that will help.

(I have never used scrapy, but its site looks promising :)

离去的眼神 2024-08-22 00:19:32

Beautiful Soup is a no-brainer. Here's a site you might start at: http://www.churchangel.com/. They have a huge list and the formatting is very regular, which means it is easy to set up BSoup to scrape it.

弱骨蛰伏 2024-08-22 00:19:32

Python scripts might not be the best tool for this job if you're just looking for addresses of churches in a geographic area.

The US Census provides a data set of churches for use with geographic information systems. If finding all the x in a spatial area is a recurring problem, invest in learning a GIS. Then you can bring your Python skills to bear on many geographic tasks.
