Python or awk/sed for cleaning data
I use R for data analysis and am very happy with it. Cleaning data could be a bit easier, however. I am thinking about learning another language suited to this task. Specifically, I am looking for a tool to use to take raw data, remove unnecessary variables or observations, and format it for easy loading in R. Contents would be mostly numeric and string data, as opposed to multi-line text.
I am considering the awk/sed combination versus Python. (I recognize that Perl would be another option, but, if I were going to learn another full language, Python seems to be a better, more extensible choice.)
The advantage of sed/awk is that it would be quicker to learn. The disadvantage is that this combination isn't as extensible as Python. Indeed, I might imagine some "mission creep" if I learned Python, which would be fine, but not my goal.
The other consideration that I had is applications to large data sets. As I understand it, awk/sed operate line-by-line, while Python would typically pull all the data into memory. This could be another advantage for sed/awk.
Are there other issues that I'm missing? Any advice that you can offer would be appreciated. (I included the R tag for R users to offer their cleaning recommendations.)
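For concreteness, the kind of pre-processing described above (dropping variables, rejecting incomplete observations, writing a file that loads cleanly into R) is only a few lines in Python, and it can be done streaming, one row at a time. A minimal sketch; the file layout, column names, and missing-value codes are hypothetical:

```python
import csv

KEEP = ["id", "height", "weight"]   # hypothetical variables to retain

def clean(src, dst):
    """Stream src row by row, keep only the selected columns, drop
    observations with missing values, and write dst for R's read.csv.
    Only one row is held in memory at a time."""
    with open(src, newline="") as fin, open(dst, "w", newline="") as fout:
        reader = csv.DictReader(fin)
        writer = csv.DictWriter(fout, fieldnames=KEEP)
        writer.writeheader()
        for row in reader:
            vals = [row[k] for k in KEEP]
            # reject observations with empty or "NA" fields
            if all(v not in ("", "NA") for v in vals):
                writer.writerow(dict(zip(KEEP, vals)))
```

The resulting file can then be read in R with read.csv() without further munging.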
6 Answers
Not to spoil your adventure, but I'd say no and here is why:
And most importantly: you already know R.
That said, of course sed/awk are great for small programs or even one-liners, and Python is a fine language. But I would consider sticking with R.
I use Python and Perl regularly. I know sed fairly well and once used awk a lot. I've used R in fits and starts. Perl is the best of the bunch for data-transformation functions and speed.
I'm honestly at a loss to think why one would learn sed and awk over Perl.
For the record, I'm not "a Perl guy". I like it as a swiss army knife, not as a religion.
I would recommend sed/awk along with the wealth of other command-line tools available on UNIX-like platforms: comm, tr, sort, cut, join, grep, and built-in shell capabilities like looping and whatnot. You really don't need to learn another programming language, as R can handle data manipulation as well as, if not better than, the other popular scripting languages.
I would recommend investing for the long term in a proper language for processing data files, like python or perl or ruby, versus the short-term sed/awk solution. I think that all data analysts need at least three languages; I use C for hefty computations, perl for processing data files, and R for interactive analysis and graphics.
I learned perl before python had become popular. I've heard great things about ruby so you might want to try that instead.
For any of these you can work with files line-by-line; python doesn't need to read the full file in advance.
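The line-by-line claim above is worth making concrete, since the question assumed Python "would typically pull all the data into memory": a Python file object is an iterator over lines, so a filter runs in constant memory regardless of file size. A minimal sketch, assuming a hypothetical whitespace-delimited file where "NA" in the first field marks an invalid observation:

```python
def count_valid(path):
    """Count valid observations in a whitespace-delimited file,
    reading one line at a time (constant memory use)."""
    n = 0
    with open(path) as fh:
        for line in fh:            # yields one line at a time; the
            fields = line.split()  # whole file is never loaded at once
            if fields and fields[0] != "NA":
                n += 1
    return n
```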
I would recommend 'awk' for this type of processing.
Presumably you are just searching/rejecting invalid observations in simple text files.
awk is lightning fast at this task and is very simple to program.
If you need to do anything more complex, you can do that too.
Python is also a possibility if you don't mind the performance hit. The "rpy" library can be used to closely integrate the python and R components.
I agree with Dirk. I thought about the same thing and used other languages a bit, too. But in the end I was surprised again and again by what more experienced users do with R. Packages like ddply or plyr might be very interesting to you. That being said, SQL has often helped me with data juggling.