Sampling large data files
I currently work as a data warehouse programmer, and as such I have to put numerous flat files through an ETL process. Of course, before loading a file I need to know its contents. The problem is that the majority of the files are larger than 1 GB, and I cannot open them using my dear old friend "notepad". Kidding. I usually use Vim or Notepad++, but it still takes a while to open the file. Can I perform a "partial" read of the file using Vim or some other editor?
P.S. I know that I could write a 10-line script to "data sample" the file, but it would be easier to convince team members to use a feature of an editor than a script that I wrote.
Thank you for any insight you might have.
Comments (6)
If you want to stick with using vim, you could have a look at the LargeFile script.
Alternatively, I've always found that UltraEdit opens large files extremely quickly.
You said you use Vim, which makes me wonder whether you have a Unix environment as well.
If so, you can run the file through the Unix utility
head
and display the raw input on your screen. Like this: EDIT: (thanks Honk)
terminal$> head -n 15 file.csv
(where 15 means you want to see only the first 15 lines).
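For anyone without a Unix shell at hand, the same idea can be sketched in Python. This is only a minimal illustration, not something from the thread: the function name head_lines, the UTF-8 default and the file name file.csv are assumptions. It stops reading after the requested number of lines, so the size of the file does not matter.

```python
from itertools import islice


def head_lines(path, n=15, encoding="utf-8"):
    """Return the first n lines of a text file without reading the rest.

    head_lines is a hypothetical helper name; the encoding default is an
    assumption and should match the actual flat file.
    """
    # Iterating over the file object reads it lazily, line by line;
    # islice stops the iteration after n lines.
    with open(path, "r", encoding=encoding, errors="replace") as f:
        return [line.rstrip("\n") for line in islice(f, n)]


# Example call (file.csv is the file name used in the answer above):
# for line in head_lines("file.csv", 15):
#     print(line)
```

Because only the first n lines are ever read, this returns almost immediately even on multi-gigabyte files.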
Pretty sure there are loads of similar questions, but hey, Textpad is a good choice for this.
Use the head command.
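Note that head only shows the start of a file. If a sample drawn from across the whole file is wanted, one option is reservoir sampling, which reads the file once and keeps only k lines in memory. The following is a rough sketch under assumptions: the name sample_lines and its parameters are made up for illustration, and the file is assumed to be line-oriented text.

```python
import random


def sample_lines(path, k=15, seed=None, encoding="utf-8"):
    """Return k lines chosen uniformly at random from a text file.

    Uses reservoir sampling (Algorithm R): one streaming pass, with at
    most k lines held in memory. sample_lines is a hypothetical name.
    """
    rng = random.Random(seed)
    sample = []
    with open(path, "r", encoding=encoding, errors="replace") as f:
        for i, line in enumerate(f):
            line = line.rstrip("\n")
            if i < k:
                # Fill the reservoir with the first k lines.
                sample.append(line)
            else:
                # Keep the new line with probability k / (i + 1).
                j = rng.randint(0, i)
                if j < k:
                    sample[j] = line
    return sample
```

Unlike head, this still has to read the entire file once, so it takes as long as a full scan; the gain is that memory use stays constant.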
Use 'less' on Solaris, or the same through Cygwin on Windows. On mainframes this problem doesn't arise; the ISPF editor handles it pretty well.
UltraEdit claims to handle files over 4GB...