在 shell 脚本中从 CSV 中提取数据(Sed、AWK、Grep?)

发布于 2024-08-05 11:22:37 字数 4497 浏览 2 评论 0原文

我需要从 CSV 文件中提取一些数据。 CSV 是一个包含多条记录的 2 列文件。第一列是日期,第二列是需要提取的数据。 CSV 文件的第一行是列标题,因此可以跳过。我已经为提取数据的 csv 文件创建了列标题,因此不需要这样做,我只需使用 >>将数据导入其中。

以下是 CSV 文件中的 1 条记录/行(许多条记录/行):

"2009-09-20 00:12:37","a:2:{s:15:""info_buyRequest"";a:5:{s:4:""uenc"";s:116:""aHR0cDovL3N0b3JlLmZvcmdldGhhbmdvdmVycy5jb20vcGF0Y2hlcy9pbmRpdmlkdWFsLXBhdGNoZXMvZnJlZS1zYW1wbGUuaHRtbD9fX19TSUQ9VQ,,"";s:7:""product"";s:1:""1"";s:15:""related_product"";s:0:"""";s:7:""options"";a:13:{i:17;s:2:""59"";i:16;s:2:""50"";i:15;s:2:""49"";i:14;s:2:""47"";i:13;s:2:""41"";i:12;s:2:""34"";i:11;s:2:""25"";i:10;s:2:""23"";i:9;s:2:""19"";i:8;s:2:""17"";i:7;s:2:""12"";i:6;s:1:""9"";i:5;s:1:""5"";}s:3:""qty"";i:1;}s:7:""options"";a:13:{i:0;a:7:{s:5:""label"";s:25:""How did you hear about us"";s:5:""value"";s:22:""Friend / Family Member"";s:11:""print_value"";s:22:""Friend / Family Member"";s:9:""option_id"";s:2:""17"";s:11:""option_type"";s:9:""drop_down"";s:12:""option_value"";s:2:""59"";s:11:""custom_view"";b:0;}i:1;a:7:{s:5:""label"";s:3:""Age"";s:5:""value"";s:5:""21-24"";s:11:""print_value"";s:5:""21-24"";s:9:""option_id"";s:2:""16"";s:11:""option_type"";s:9:""drop_down"";s:12:""option_value"";s:2:""50"";s:11:""custom_view"";b:0;}i:2;a:7:{s:5:""label"";s:14:""Marital Status"";s:5:""value"";s:9:""UnMarried"";s:11:""print_value"";s:9:""UnMarried"";s:9:""option_id"";s:2:""15"";s:11:""option_type"";s:5:""radio"";s:12:""option_value"";s:2:""49"";s:11:""custom_view"";b:0;}i:3;a:7:{s:5:""label"";s:3:""Sex"";s:5:""value"";s:6:""Female"";s:11:""print_value"";s:6:""Female"";s:9:""option_id"";s:2:""14"";s:11:""option_type"";s:5:""radio"";s:12:""option_value"";s:2:""47"";s:11:""custom_view"";b:0;}i:4;a:7:{s:5:""label"";s:10:""Occupation"";s:5:""value"";s:7:""Student"";s:11:""print_value"";s:7:""Student"";s:9:""option_id"";s:2:""13"";s:11:""option_type"";s:9:""drop_down"";s:12:""option_value"";s:2:""41"";s:11:""custom_view"";b:0;}i:5;a:7:{s:5:""label"";s:9:""Education"";s:5:""value"";s:16:""College Graduate"";s:11:""print_value"";s:16:""College Graduate"";s:9:""option_id"";s:2:""12"";s:11:""option_type"";s:9:""drop_down"";s:12:""option_value"";s:2:""34"";s:11:""custom_view"";b:0;}i:6;a:7:{s:5:""label"";s:16:""Household Income"";s:5:""value"";s:7:""30K-50K"";s:11:""print_value"";s:7:""30K-50K"";s:9:""option_id"";s:2:""11"";s:11:""option_type"";s:9:""drop_down"";s:12:""option_value"";s:2:""25"";s:11:""custom_view"";b:0;}i:7;a:7:{s:5:""label"";s:23:""Do You Take Supplements"";s:5:""value"";s:2:""No"";s:11:""print_value"";s:2:""No"";s:9:""option_id"";s:2:""10"";s:11:""option_type"";s:5:""radio"";s:12:""option_value"";s:2:""23"";s:11:""custom_view"";b:0;}i:8;a:7:{s:5:""label"";s:40:""How would you rank your typical hangover"";s:5:""value"";s:4:""Mild"";s:11:""print_value"";s:4:""Mild"";s:9:""option_id"";s:1:""9"";s:11:""option_type"";s:9:""drop_down"";s:12:""option_value"";s:2:""19"";s:11:""custom_view"";b:0;}i:9;a:7:{s:5:""label"";s:51:""What type of establishments do you typically prefer"";s:5:""value"";s:10:""Nightclubs"";s:11:""print_value"";s:10:""Nightclubs"";s:9:""option_id"";s:1:""8"";s:11:""option_type"";s:9:""drop_down"";s:12:""option_value"";s:2:""17"";s:11:""custom_view"";b:0;}i:10;a:7:{s:5:""label"";s:40:""How often do you usually go out per week"";s:5:""value"";s:3:""1-2"";s:11:""print_value"";s:3:""1-2"";s:9:""option_id"";s:1:""7"";s:11:""option_type"";s:9:""drop_down"";s:12:""option_value"";s:2:""12"";s:11:""custom_view"";b:0;}i:11;a:7:{s:5:""label"";s:49:""How many drinks do you typically consume per week"";s:5:""value"";s:3:""6-8"";s:11:""print_value"";s:3:""6-8"";s:9:""option_id"";s:1:""6"";s:11:""option_type"";s:9:""drop_down"";s:12:""option_value"";s:1:""9"";s:11:""custom_view"";b:0;}i:12;a:7:{s:5:""label"";s:53:""How would you prefer to buy our Products"";s:5:""value"";s:6:""Online"";s:11:""print_value"";s:6:""Online"";s:9:""option_id"";s:1:""5"";s:11:""option_type"";s:9:""drop_down"";s:12:""option_value"";s:1:""5"";s:11:""custom_view"";b:0;}}}"

输出应该是此处找到的数据:

  • ""print_value";s:?:""{DATA}""

如果 ? 是一个数字,{DATA} 是正在提取的数据,

那么此 1 条记录的输出将是:

"2009-09-20 00:12:37","Friend / Family Member","21-24","UnMarried","Female","Student","College Graduate","30K-50K","No","Mild","Nightclubs","1-2","6-8","Online"

我不是。精通 Sed、AWK 或 Grep,但我知道如果不是全部三个工具,也可以使用其中一种工具来完成,我们将非常感激任何帮助或推动正确的方向。

I need to extract some data from a CSV file. The CSV is a 2 column file with multiple records. The first column is the date, the second column is the data that needs to be extracted. The first row of the CSV file is the column headers, so it can be skipped. And I've already created the column header for the extracted data's csv file, so theres no need for that, I'll simply use >> to import the data into it.

Here is 1 record/line (of many) in the CSV file:

"2009-09-20 00:12:37","a:2:{s:15:""info_buyRequest"";a:5:{s:4:""uenc"";s:116:""aHR0cDovL3N0b3JlLmZvcmdldGhhbmdvdmVycy5jb20vcGF0Y2hlcy9pbmRpdmlkdWFsLXBhdGNoZXMvZnJlZS1zYW1wbGUuaHRtbD9fX19TSUQ9VQ,,"";s:7:""product"";s:1:""1"";s:15:""related_product"";s:0:"""";s:7:""options"";a:13:{i:17;s:2:""59"";i:16;s:2:""50"";i:15;s:2:""49"";i:14;s:2:""47"";i:13;s:2:""41"";i:12;s:2:""34"";i:11;s:2:""25"";i:10;s:2:""23"";i:9;s:2:""19"";i:8;s:2:""17"";i:7;s:2:""12"";i:6;s:1:""9"";i:5;s:1:""5"";}s:3:""qty"";i:1;}s:7:""options"";a:13:{i:0;a:7:{s:5:""label"";s:25:""How did you hear about us"";s:5:""value"";s:22:""Friend / Family Member"";s:11:""print_value"";s:22:""Friend / Family Member"";s:9:""option_id"";s:2:""17"";s:11:""option_type"";s:9:""drop_down"";s:12:""option_value"";s:2:""59"";s:11:""custom_view"";b:0;}i:1;a:7:{s:5:""label"";s:3:""Age"";s:5:""value"";s:5:""21-24"";s:11:""print_value"";s:5:""21-24"";s:9:""option_id"";s:2:""16"";s:11:""option_type"";s:9:""drop_down"";s:12:""option_value"";s:2:""50"";s:11:""custom_view"";b:0;}i:2;a:7:{s:5:""label"";s:14:""Marital Status"";s:5:""value"";s:9:""UnMarried"";s:11:""print_value"";s:9:""UnMarried"";s:9:""option_id"";s:2:""15"";s:11:""option_type"";s:5:""radio"";s:12:""option_value"";s:2:""49"";s:11:""custom_view"";b:0;}i:3;a:7:{s:5:""label"";s:3:""Sex"";s:5:""value"";s:6:""Female"";s:11:""print_value"";s:6:""Female"";s:9:""option_id"";s:2:""14"";s:11:""option_type"";s:5:""radio"";s:12:""option_value"";s:2:""47"";s:11:""custom_view"";b:0;}i:4;a:7:{s:5:""label"";s:10:""Occupation"";s:5:""value"";s:7:""Student"";s:11:""print_value"";s:7:""Student"";s:9:""option_id"";s:2:""13"";s:11:""option_type"";s:9:""drop_down"";s:12:""option_value"";s:2:""41"";s:11:""custom_view"";b:0;}i:5;a:7:{s:5:""label"";s:9:""Education"";s:5:""value"";s:16:""College Graduate"";s:11:""print_value"";s:16:""College Graduate"";s:9:""option_id"";s:2:""12"";s:11:""option_type"";s:9:""drop_down"";s:12:""option_value"";s:2:""34"";s:11:""custom_view"";b:0;}i:6;a:7:{s:5:""label"";s:16:""Household Income"";s:5:""value"";s:7:""30K-50K"";s:11:""print_value"";s:7:""30K-50K"";s:9:""option_id"";s:2:""11"";s:11:""option_type"";s:9:""drop_down"";s:12:""option_value"";s:2:""25"";s:11:""custom_view"";b:0;}i:7;a:7:{s:5:""label"";s:23:""Do You Take Supplements"";s:5:""value"";s:2:""No"";s:11:""print_value"";s:2:""No"";s:9:""option_id"";s:2:""10"";s:11:""option_type"";s:5:""radio"";s:12:""option_value"";s:2:""23"";s:11:""custom_view"";b:0;}i:8;a:7:{s:5:""label"";s:40:""How would you rank your typical hangover"";s:5:""value"";s:4:""Mild"";s:11:""print_value"";s:4:""Mild"";s:9:""option_id"";s:1:""9"";s:11:""option_type"";s:9:""drop_down"";s:12:""option_value"";s:2:""19"";s:11:""custom_view"";b:0;}i:9;a:7:{s:5:""label"";s:51:""What type of establishments do you typically prefer"";s:5:""value"";s:10:""Nightclubs"";s:11:""print_value"";s:10:""Nightclubs"";s:9:""option_id"";s:1:""8"";s:11:""option_type"";s:9:""drop_down"";s:12:""option_value"";s:2:""17"";s:11:""custom_view"";b:0;}i:10;a:7:{s:5:""label"";s:40:""How often do you usually go out per week"";s:5:""value"";s:3:""1-2"";s:11:""print_value"";s:3:""1-2"";s:9:""option_id"";s:1:""7"";s:11:""option_type"";s:9:""drop_down"";s:12:""option_value"";s:2:""12"";s:11:""custom_view"";b:0;}i:11;a:7:{s:5:""label"";s:49:""How many drinks do you typically consume per week"";s:5:""value"";s:3:""6-8"";s:11:""print_value"";s:3:""6-8"";s:9:""option_id"";s:1:""6"";s:11:""option_type"";s:9:""drop_down"";s:12:""option_value"";s:1:""9"";s:11:""custom_view"";b:0;}i:12;a:7:{s:5:""label"";s:53:""How would you prefer to buy our Products"";s:5:""value"";s:6:""Online"";s:11:""print_value"";s:6:""Online"";s:9:""option_id"";s:1:""5"";s:11:""option_type"";s:9:""drop_down"";s:12:""option_value"";s:1:""5"";s:11:""custom_view"";b:0;}}}"

The Output should be the data found here:

  • ""print_value";s:?:""{DATA}""

Were the ? is a number, and {DATA} is the data being extracted.

So the output for example of this 1 record would be:

"2009-09-20 00:12:37","Friend / Family Member","21-24","UnMarried","Female","Student","College Graduate","30K-50K","No","Mild","Nightclubs","1-2","6-8","Online"

I am not proficient in Sed,AWK, or Grep, but I know it can be done using one of these tools if not all three. Any help or nudges in the right direction would be GREATLY appreciated.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

白鸥掠海 2024-08-12 11:22:37

我建议你使用 PHP 来反序列化该结构。
然而,这里有一个使用 sedtr 实现的快速但肮脏的版本。当然你可以做得更好:

cat file.csv | \
tr ",;" "\n" | \
sed -e 's/[asbi]:[0-9]*[:]*//g' -e '/^[{}]/d' -e 's/""//g' -e '/^"{/d' | \
sed -n -e '/^"/p' -e '/^print_value$/,/^option_id$/p' | \
sed -e '/^option_id/d' -e '/^print_value/d' -e 's/^"\(.*\)"$/\1/' | \
tr "\n" "," | \
sed -e 's/,\([0-9]*-[0-9]*-[0-9]*\)/\n\1/g' -e 's/,$//' | \
sed -e 's/^/"/g' -e 's/$/"/g' -e 's/,/","/g'

解释:

  • 用逗号和分号分割
  • 删除删除php结构语法s:X:Y,b:X,...并删除以{或}或“{
  • 提取部分” 开头的行从 print_value 到下一个 option_id,还保留日期(以 " 开头的行)
  • 删除这些标签(打印和选项),并删除日期周围的引号
  • 用逗号分隔行连接所有行
  • (以日期模式开头),并删除多余的逗号最后
  • 在所有字段周围添加引号

哇,我知道这很尴尬:)

I suggest you use PHP to de-serialize the structure.
However, here's a quick and dirty version of what you want using sed and tr. Certainly you can do this much much better:

cat file.csv | \
tr ",;" "\n" | \
sed -e 's/[asbi]:[0-9]*[:]*//g' -e '/^[{}]/d' -e 's/""//g' -e '/^"{/d' | \
sed -n -e '/^"/p' -e '/^print_value$/,/^option_id$/p' | \
sed -e '/^option_id/d' -e '/^print_value/d' -e 's/^"\(.*\)"$/\1/' | \
tr "\n" "," | \
sed -e 's/,\([0-9]*-[0-9]*-[0-9]*\)/\n\1/g' -e 's/,$//' | \
sed -e 's/^/"/g' -e 's/$/"/g' -e 's/,/","/g'

The explanation:

  • split by commas and semicolons
  • remove remove the php structure syntax s:X:Y, b:X, ... and remove lines starting with { or } or "{
  • extract the section from print_value to the next option_id, also keep the date (line start with ")
  • remove those labels (print and option), and remove quotations around the date
  • concat all lines with commas
  • seperate lines (starting with date pattern), and remove extra comma at end
  • add quotations around all fields

Wow, I know it's embarrassing :)

公布 2024-08-12 11:22:37

这是我的答案:


cat TestData \
    | grep -o -P "print_value\"\";.*?:\"\".*?\"\";" \
    | perl -pe 's|print_value.*:\"\"(.*?)\"\";|\1|'

第一行显示数据(存储在 TestData 中)。

第二行要求 grep 将 print_value 中的每个匹配项分隔到最接近的 '"";'。
请注意,我使用了“.*?”对于非贪婪匹配(需要使用“-P”)。

最后一行使用 perl 删除所有不需要的内容。请注意,我使用“(.*?)”来匹配所需的组,并使用“\1”来显示该组。

希望这有帮助。

Here is my anwser:


cat TestData \
    | grep -o -P "print_value\"\";.*?:\"\".*?\"\";" \
    | perl -pe 's|print_value.*:\"\"(.*?)\"\";|\1|'

The first line show the data (stored in TestData).

The second line asks grep to separate each match from print_value to the nearest '"";'.
Notice that I use '.*?' for non greedy match (needs to use '-P' with it).

The last line use perl to strip all un-needed. See that I use '(.*?)' to match the needed group and use '\1' to show the group.

Hope this helps.

述情 2024-08-12 11:22:37

这是一个 sed oneliner:

sed -nr 's/^([^,]+),(.*)$/\2@%@\1/;:a;s/""print_value"";s:[0-9]+:""([^"]+)""(.*)$/\2,"\1"/;ta;s/^.*@%@//p' <source

基本上提取数据并使用唯一的分隔符“@%@”将其附加到行尾。

当循环/替换构造失败(即不再有数据)时,丢弃原始行的剩余内容,使数据格式良好。

Here's a sed oneliner:

sed -nr 's/^([^,]+),(.*)$/\2@%@\1/;:a;s/""print_value"";s:[0-9]+:""([^"]+)""(.*)$/\2,"\1"/;ta;s/^.*@%@//p' <source

Basically extract the data and append it to the end of the line using a unique delimiter '@%@'.

When the loop/substitute construct fails (i.e. no more data), throw away what is left of the original line leaving the data nicely formatted.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文