Python or awk/sed for cleaning data

Posted on 2024-12-05 10:35:06

I use R for data analysis and am very happy with it. Cleaning data could be a bit easier, however. I am thinking about learning another language suited to this task. Specifically, I am looking for a tool to use to take raw data, remove unnecessary variables or observations, and format it for easy loading in R. Contents would be mostly numeric and string data, as opposed to multi-line text.

I am considering the awk/sed combination versus Python. (I recognize that Perl would be another option, but if I were going to learn another full language, Python seems to be a better, more extensible choice.)

The advantage of sed/awk is that it would be quicker to learn. The disadvantage is that this combination isn't as extensible as Python. Indeed, I might imagine some "mission creep" if I learned Python, which would be fine, but not my goal.

The other consideration that I had is applications to large data sets. As I understand it, awk/sed operate line-by-line, while Python would typically pull all the data into memory. This could be another advantage for sed/awk.

Are there other issues that I'm missing? Any advice that you can offer would be appreciated. (I included the R tag for R users to offer their cleaning recommendations.)

Comments (6)

旧伤还要旧人安 2024-12-12 10:35:06

Not to spoil your adventure, but I'd say no and here is why:

  • R is vectorised where sed/awk are not
  • R already has both Perl-style regular expressions and extended regular expressions
  • R can more easily fall back on statistical routines (say, imputation) if you need them
  • R can visualize, summarize, ...

and most importantly: you already know R.

That said, of course sed/awk are great for small programs or even one-liners, and Python is a fine language. But I would consider sticking with R.

初见你 2024-12-12 10:35:06

I use Python and Perl regularly. I know sed fairly well and once used awk a lot. I've used R in fits and starts. Perl is the best of the bunch for data transformation, in both functionality and speed.

  • Perl can do essentially everything sed and awk can do, and a lot more as well. (In fact, a2p and s2p, which shipped with perl through version 5.20, convert awk and sed scripts to Perl.)
  • Perl is included with most Linux/Unix systems. When that wasn't the case, there was good reason to learn sed and awk. That reason is long dead.
  • Perl has a rich set of modules that provide much more power than one can get from awk or sed. For example, these modules enable one-liners that reverse complement DNA sequences, compute statistics, parse CSV files, or calculate MD5s. (see http://cpan.org/ for packages)
  • Perl is essentially as terse as sed and awk. For people like me (and, I suspect, you), quickly transforming data on the command line is a great boon. Python's too wordy for efficient command line use.

I'm honestly at a loss to think why one would learn sed and awk over Perl.

For the record, I'm not "a Perl guy". I like it as a swiss army knife, not as a religion.

怎樣才叫好 2024-12-12 10:35:06

I would recommend sed/awk along with the wealth of other command line tools available on UNIX-like platforms: comm, tr, sort, cut, join, grep, and built-in shell capabilities like looping and whatnot. You really don't need to learn another programming language, as R can handle data manipulation as well as, if not better than, the other popular scripting languages.

樱花坊 2024-12-12 10:35:06

I would recommend investing for the long term in a proper language for processing data files, like Python, Perl, or Ruby, rather than the short-term sed/awk solution. I think that all data analysts need at least three languages; I use C for hefty computations, Perl for processing data files, and R for interactive analysis and graphics.

I learned Perl before Python had become popular. I've heard great things about Ruby, so you might want to try that instead.

For any of these you can work with files line by line; Python doesn't need to read the full file in advance.
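As a minimal sketch of that line-by-line approach, the following uses Python's standard `csv` module to drop unwanted columns and rows while holding only one row in memory at a time. The column names, the sample data, and the helper names `clean_rows` and `write_clean_csv` are made up for illustration; they are not from the question.

```python
import csv
import io


def clean_rows(lines, keep_columns, is_valid):
    """Yield cleaned rows one at a time; only the current row is held in memory."""
    reader = csv.DictReader(lines)
    for row in reader:
        if is_valid(row):
            # Keep only the requested variables (columns).
            yield {name: row[name] for name in keep_columns}


def write_clean_csv(src, dst, keep_columns, is_valid):
    """Stream rows from src to dst, writing a header plus the rows that pass."""
    writer = csv.DictWriter(dst, fieldnames=keep_columns, lineterminator="\n")
    writer.writeheader()
    for row in clean_rows(src, keep_columns, is_valid):
        writer.writerow(row)


# Hypothetical sample data standing in for a raw file.
raw = io.StringIO(
    "id,age,notes,income\n"
    "1,34,ok,52000\n"
    "2,,missing age,41000\n"
    "3,29,ok,38000\n"
)
out = io.StringIO()

# Drop the "notes" column and any observation with an empty "age".
write_clean_csv(raw, out, ["id", "age", "income"], lambda r: r["age"] != "")
print(out.getvalue(), end="")
```

With real data, `raw` and `out` would be file objects opened with `newline=""`, and the result should load in R with `read.csv`.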

南街女流氓 2024-12-12 10:35:06

I would recommend 'awk' for this type of processing.

Presumably you are just searching/rejecting invalid observations in simple text files.

awk is lightning fast at this task and is very simple to program.

If you need to do anything more complex, awk can handle that too.

Python is also a possibility if you don't mind the performance hit. The "rpy" library can be used to closely integrate the Python and R components.
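For comparison, here is a rough Python sketch of the same kind of field-based rejection on a simple whitespace-separated text file. The sample rows, the valid range, and the function names `is_number` and `keep_valid` are assumptions for illustration only, and this is not meant as a translation of any particular awk program.

```python
import io


def is_number(text):
    """Return True if text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def keep_valid(lines, field_index, low, high):
    """Yield lines whose chosen whitespace-separated field is a number in [low, high]."""
    for line in lines:
        fields = line.split()
        if len(fields) <= field_index:
            continue  # too few fields: reject the observation
        value = fields[field_index]
        if is_number(value) and low <= float(value) <= high:
            yield line


# Hypothetical observations: id, age, income.
sample = io.StringIO(
    "a01 34 52000\n"
    "a02 NA 41000\n"
    "a03 -5 38000\n"
    "a04 29 47000\n"
)

# Reject rows whose second field is missing, non-numeric, or outside 0..120.
kept = list(keep_valid(sample, field_index=1, low=0, high=120))
print("".join(kept), end="")
```

Because `keep_valid` is a generator, it also processes its input one line at a time.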

最丧也最甜 2024-12-12 10:35:06

I agree with Dirk. I thought about the same thing and used other languages a bit, too. But in the end I was surprised again and again by what more experienced users do with R. Packages like plyr (with functions such as ddply) might be very interesting to you. That being said, SQL has often helped me with data juggling.
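To give a flavour of the SQL route, here is a small sketch that loads a few rows into an in-memory SQLite database through Python's standard `sqlite3` module and selects only the wanted columns and observations. The table name `survey`, its columns, and the rows are invented for illustration.

```python
import sqlite3

# Hypothetical raw observations: (id, age, income); None marks a missing age.
rows = [("a01", 34, 52000), ("a02", None, 41000), ("a03", 29, 38000)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE survey (id TEXT, age INTEGER, income INTEGER)")
conn.executemany("INSERT INTO survey VALUES (?, ?, ?)", rows)

# Keep two columns and drop observations with a missing age.
cleaned = conn.execute(
    "SELECT id, income FROM survey WHERE age IS NOT NULL ORDER BY id"
).fetchall()
conn.close()

print(cleaned)
```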
