Python multi-threading / multi-processing code
In the code below, I am considering using multi-threading or multi-processing to fetch data from the URL. I think a pool would be ideal; can anyone suggest a solution?
Idea: a pool of threads/processes that collects the data. My preference is processes over threads, but I'm not sure.
import urllib

URL = "http://download.finance.yahoo.com/d/quotes.csv?s=%s&f=sl1t1v&e=.csv"
symbols = ('GGP', 'JPM', 'AIG', 'AMZN', 'GGP', 'JPM', 'AIG', 'AMZN')
#symbols = ('GGP')

def fetch_quote(symbols):
    url = URL % '+'.join(symbols)
    fp = urllib.urlopen(url)
    try:
        data = fp.read()
    finally:
        fp.close()
    return data

def main():
    data_fp = fetch_quote(symbols)
    # print data_fp

if __name__ == '__main__':
    main()
Comments (4)
You have a single request that asks for information on several symbols at once. Let's try to fetch the information one by one instead.
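A rough sketch of that serial version, reusing the question's fetch_quote and symbols (each symbol is wrapped in a one-element tuple so the existing '+'.join(symbols) call keeps working):

def main():
    # Sequential version: one request per symbol, one after another.
    for symbol in symbols:
        data = fetch_quote((symbol,))  # one-element tuple keeps '+'.join happy
        print data

if __name__ == '__main__':
    main()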
So main() calls each URL one by one to get the data.
Let's multiprocess it with a pool:
In the following main, a separate process is created to request each symbol's URL.
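A rough sketch using multiprocessing.Pool; the pool size of 5 is arbitrary, and each symbol is again passed as a one-element tuple to the question's fetch_quote:

from multiprocessing import Pool

def main():
    pool = Pool(processes=5)
    # map() hands each one-element tuple to fetch_quote in a worker process
    results = pool.map(fetch_quote, [(s,) for s in symbols])
    pool.close()
    pool.join()
    for data in results:
        print data

if __name__ == '__main__':
    main()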
Note: in Python, because the GIL is present, multithreading should mostly be considered the wrong solution.
For documentation, see: Multiprocessing in Python
So here's a very simple example. It iterates over symbols, passing one at a time to fetch_quote.
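A minimal sketch of such a loop, assuming a multiprocessing.Pool and the question's fetch_quote and symbols (the pool size and the use of apply_async are assumptions on my part):

from multiprocessing import Pool

if __name__ == '__main__':
    pool = Pool(4)
    # submit one job per symbol, then collect the results in order
    jobs = [pool.apply_async(fetch_quote, [(symbol,)]) for symbol in symbols]
    for job in jobs:
        print job.get()
    pool.close()
    pool.join()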
Actually it's possible to do it with neither. You can get it done in one thread using asynchronous calls, for example twisted.web.client.getPage from Twisted Web.
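A minimal sketch of that asynchronous approach, assuming Twisted is installed and reusing the question's URL template and symbols; all requests run in a single thread, and the reactor is stopped once every page has arrived:

from twisted.internet import reactor, defer
from twisted.web.client import getPage

def print_results(results):
    # results is a list of (success, page_body) pairs from DeferredList
    for success, data in results:
        if success:
            print data
    reactor.stop()

def main():
    # fire one non-blocking request per symbol, all in the same thread
    deferreds = [getPage(URL % symbol) for symbol in symbols]
    defer.DeferredList(deferreds).addCallback(print_results)
    reactor.run()

if __name__ == '__main__':
    main()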
As you may know, multi-threading in Python is not actually parallel due to the GIL; essentially only a single thread runs at any given time. So in your program, if you want multiple URLs to be fetched at any given time, multi-threading might not be the way to go. Also, after the crawl, do you store the data in a single file or in some persistent DB? The decision here could affect your performance.
Multi-processing is more efficient in that respect, but it has the time and memory overhead of spawning extra processes. I have explored both of these options in Python recently. Here's the url (with code) -
python -> multiprocessing module