Java or Python for distributed computing jobs (on a student budget)?
I have a large dataset (about 40 GB) that I want to run some NLP over (largely embarrassingly parallel) on a couple of computers in the lab. I do not have root access to them, and I have only 1 GB of user space.
I experimented with Hadoop, but of course this was dead in the water: the data is stored on an external USB hard drive, and I can't load it onto the DFS because of the 1 GB user space cap.
I have been looking into a couple of Python-based options (I'd rather use NLTK than Java's LingPipe if I can help it), and the distributed compute options seem to be:
- IPython
- Disco
After my Hadoop experience, I want to make an informed choice this time. Any help on which might be more appropriate would be greatly appreciated.
Amazon's EC2 and the like are not really an option, as I have next to no budget.
Comments (4)
Speak with the IT department at your school (especially if you are in college). If it is for an assignment or research, I bet they would be more than happy to give you more disk space.
No actual answers here; I'd have posted this as a comment, but on this site you can only post answers while you're still a noob.
If it's genuinely as parallel as that, and it's only a couple of computers, could you not split the dataset up manually ahead of time?
Have you confirmed that there isn't going to be a firewall or similar stopping you from using something like IPython or Disco anyway?
You may only have 1 GB of user space, but if it's Linux, what about /tmp? (If it's Windows, what about %temp%?)
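The manual split suggested above could be sketched roughly as follows, assuming the 40 GB is already stored as many separate files on the USB drive. The names `scan`, `assign_files` and the host labels are hypothetical, made up for this illustration; nothing here comes from IPython, Disco or Hadoop. It just balances files across machines by size, biggest file first onto whichever machine currently has the least data.

```python
import heapq
from pathlib import Path


def scan(root):
    """Map every file under root to its size in bytes."""
    return {str(p): p.stat().st_size for p in Path(root).rglob("*") if p.is_file()}


def assign_files(sizes, hosts):
    """Greedy balancing: largest file first, onto the least-loaded host.

    sizes: dict of file name -> size in bytes
    hosts: list of machine labels (hypothetical, e.g. "lab1")
    Returns a dict of host -> list of file names.
    """
    heap = [(0, host) for host in hosts]  # (bytes assigned so far, host)
    heapq.heapify(heap)
    plan = {host: [] for host in hosts}
    # Sort by descending size; break ties by name so the plan is repeatable.
    for name, size in sorted(sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        load, host = heapq.heappop(heap)
        plan[host].append(name)
        heapq.heappush(heap, (load + size, host))
    return plan
```

Calling `assign_files(scan("/path/to/usb"), ["lab1", "lab2"])` would give one file list per machine; each machine then reads only its own list, so nothing has to be copied into the 1 GB home directory. The path and host names are placeholders.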
Definitely speak with the IT department at your school. It's not a good idea to utilize computer resources that don't belong to you.
I found JPPF, which enables applications with large processing power requirements to be run on any number of computers. I'm not sure if you need to install software on the client machines, but certain ports need to be open on the client machines.
If getting more resources from your computing department is a no-go, you're going to have to consider breaking your data set down into manageable chunks before you do any work on it, and then reduce the results down into a meaningful set.
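As a rough sketch of that chunk-then-reduce idea, assuming the task is something like counting tokens: the function names are hypothetical, and the whitespace `split()` is only a stand-in for whatever NLTK tokenizer or tagger the real job would use.

```python
from collections import Counter
from itertools import islice


def iter_chunks(lines, chunk_size):
    """Yield lists of at most chunk_size lines from any iterable of lines."""
    it = iter(lines)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def map_chunk(chunk):
    """Per-chunk work: count lowercase whitespace-separated tokens."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def reduce_counts(partials):
    """Merge the per-chunk counters into one overall result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total
```

Passing an open file object as `lines` keeps memory bounded by the chunk size rather than the 40 GB total, since the file is read lazily. Each partial `Counter` is small, so the partial results (not the raw data) are what would need to fit in user space or be sent between machines.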
More resources from IT would be the way to go.
Good luck!
Ben