Splitting a large text file into columns

Published 2025-01-03 07:36:15 · 2,673 characters · 0 views · 0 comments


SOLVED: A good friend of mine wrote the following program for me:

#!/bin/bash

filename="my_input_file"
context="channel"               # key which separates the blocks in the input file
desired_column_separator=","    # separates the columns in the output file
output_prefix="modified_"       # prefix for the output file

if [ -d ./tmp ]; then
    echo ""
    echo "***WARNING***"
    echo "This script uses and deletes a ./tmp/ directory, but one already exists."
    echo "Please remove/rename it, or alter the script."
    echo ""
    exit 1
fi

mkdir ./tmp
cd ./tmp || exit 1

# Split the input into one file per block (xx0000, xx0001, ...)
csplit -z -n 4 "../$filename" "/$context/" '{*}' > /dev/null

filenum=$(ls -1 ./ | wc -l)
limit=$((filenum - 1))
lines=$(wc -l < xx0000)

touch tmp.dat

for j in $(seq 1 "$lines"); do
    oldstring=''

    for i in $(seq 0 "$limit"); do
        inputNo=$(printf "%04d" "$i")
        # j-th line of the i-th block
        string=$(head -n "$j" "xx$inputNo" | tail -n 1)
        oldstring="$oldstring$string$desired_column_separator"
    done

    # strip any stray carriage returns / newlines
    finalstring=$(echo "$oldstring" | tr -d '\r\n')

    echo "working on line $j out of $lines"
    echo -n "$finalstring" >> tmp.dat
    echo -e "\r" >> tmp.dat    # CRLF line ending, as in the original
done

mv tmp.dat "../$output_prefix$filename"
cd ..
rm -rf ./tmp/

echo "...done!"
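For the record, `paste` can do the per-line joining that the nested loops above do by hand, which avoids the slow `head | tail` rescans. This is a minimal sketch, assuming GNU coreutils (`csplit`, `paste`) and blocks of equal length; the sample data, file name, and key below are placeholders standing in for the real input:

```shell
# Minimal sketch: csplit splits the blocks apart, paste rejoins them as columns.
# Assumes GNU coreutils and that every block starts with a "channel" line.
printf 'channel names:\nA\nchannel names:\nB\n' > my_input_file   # tiny sample input

mkdir tmp
cd tmp
csplit -z -n 4 ../my_input_file '/channel/' '{*}' > /dev/null   # one xxNNNN file per block
paste -d ',' xx* > ../modified_my_input_file                    # join the pieces side by side
cd ..
rm -rf tmp
```

If the blocks differ in length, `paste` simply emits empty fields for exhausted columns, so the careful per-line loop in the script is only needed when ragged blocks require special handling.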

Original: I know splitting text files has been done to death on this forum, but I couldn't find a method specific to my problem.
I want to split a large file (>200 MB) into columns on a text line, but the 'split' function puts every column in its own file. 3,000-odd individual text files are a pain to load into other programs, to be honest. On top of this, I would also like to extract part of the text file to use as the header for my data (the last part of line 4).
The initial file consists of a single column, like so:

channel names:
03/02/2012 12:03:03 - TDS3k(CH1)
start times:
03/02/2012 12:08:02.311422
dt:
0.000000
data:
-8.000000E-4
-8.000000E-4
-1.600000E-3
... (9,994 lines omitted)
-2.400000E-3
-1.600000E-3
-2.400000E-3
channel names:
03/02/2012 12:03:03 - TDS3k(CH1)
start times:
03/02/2012 12:33:11.169533
dt:
0.000000
data:
-8.000000E-4
-1.600000E-3
-1.600000E-3
... (another 9,997 lines omitted)

I would like it to look like so:

channel names:                     channel names:
03/02/2012 12:03:03 - TDS3k(CH1)   03/02/2012 12:03:03 - TDS3k(CH1)
start times:                       start times:
03/02/2012 12:08:02.311422         03/02/2012 12:33:11.169533
dt:                                dt:
0.000000                           0.000000
data:                              data:
-8.000000E-4                       -8.000000E-4   ...
-8.000000E-4                       -1.600000E-3   ...
-1.600000E-3                       -1.600000E-3   ...
...                                ...

I suspect getting the split in the right place is easier to do than the header, but I'm not good enough to do either.

Thanks in advance

EDIT: I'm not using any particular language yet. I just need the data in a format where I can analyse it in R. I'll go with whatever you guys can suggest that will work.
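Since any workable approach is acceptable, the same reshaping can also be sketched as a single awk pass. This is a hypothetical one-pass variant, assuming every block opens with a line containing the key `channel` and that a comma separates columns; the sample data and file names are placeholders:

```shell
# One-pass awk sketch: buffer each block as a column, then print row by row.
# Assumes the key "channel" opens every block (hypothetical placeholder data).
printf 'channel names:\nA\nchannel names:\nB\n' > my_input_file   # tiny sample input

awk '
  /channel/ { col++; row = 0 }                  # a key line starts a new column
  { rows[++row, col] = $0; if (row > maxrow) maxrow = row }
  END {
    for (r = 1; r <= maxrow; r++) {
      line = rows[r, 1]
      for (c = 2; c <= col; c++) line = line "," rows[r, c]
      print line
    }
  }' my_input_file > modified_my_input_file
```

Note that this buffers the whole file in memory, which may matter for a >200 MB input; the csplit-based script trades memory for temporary files on disk.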


Comments (1)

月亮邮递员 2025-01-10 07:36:15


What language are you using? How many 'data' entries are there per entry?

Using python, the easiest way to do it is to first break the data up into 'entries', and then write a parsing function for each entry to produce only the values you wish to see in your final output. Then simply join the final output, or write it using the csv module.

# Python 3 version ('input' renamed to raw_text to avoid shadowing the builtin)
raw_text = """channel names:
03/02/2012 12:03:03 - TDS3k(CH1)
start times:
03/02/2012 12:33:11.169533
dt:
0.000000
data:
-8.000000E-4
-1.600000E-3
-1.600000E-3
channel names:
03/02/2012 12:03:03 - TDS3k(CH1)
start times:
03/02/2012 12:33:11.169533
dt:
0.000000
data:
-8.000000E-4
-1.600000E-3
-1.600000E-3
"""

LINES_PER_ENTRY = 10

def parse_entry(entry):
    # placeholder: return only the values wanted in the final output
    return entry

raw = raw_text.splitlines()

# break the flat line list into fixed-size entries
entries = [raw[i * LINES_PER_ENTRY:(i + 1) * LINES_PER_ENTRY]
           for i in range(len(raw) // LINES_PER_ENTRY)]

parsed_entries = [parse_entry(entry) for entry in entries]

# one tab-separated line per entry
with open('outfile.txt', 'w') as outfile:
    for parsed_entry in parsed_entries:
        outfile.write('\t'.join(parsed_entry) + "\n")
print(parsed_entries)