Converting tallies to relative probabilities

Posted 2024-10-16 00:03:38

Background

Create a probability lexicon based on a CSV file of words and tallies. This is a prelude to a text segmentation problem, not a homework problem.

Problem

Given a CSV file with the following words and tallies:

aardvark,10
aardwolf,9
armadillo,9
platypus,5
zebra,1

Create a file with probabilities relative to the largest tally in the file:

aardvark,1
aardwolf,0.9
armadillo,0.9
platypus,0.5
zebra,0.1

For example, aardvark,1 is calculated as 10/10, and platypus,0.5 as 5/10.

Question

What is the most efficient way to implement a shell script to create the file of relative probabilities?

Constraints

Neither the words nor the numbers are in any order.
No major programming language (such as Perl, Ruby, Python, Java, C, Fortran, or Cobol).
Standard Unix tools such as awk, sed, or sort are welcome.
All probabilities must be relative to the highest tally in the file.
The words are unique, the numbers are not.
The tallies are natural numbers.

Thank you!

余生再见 2024-10-23 00:03:38

awk 'BEGIN{max=0;OFS=FS=","}  $NF>max{max=$NF}NR>FNR {print $1,($2/max) }' file file
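
The one-liner above names the same file twice, so awk reads it in two passes: while NR equals FNR (first pass) it only tracks the maximum, and once NR exceeds FNR (second pass) it prints each word with its tally divided by that maximum. A more spread-out sketch of the same idea follows; the file names tallies.csv and relative.csv are hypothetical, and the sample data is taken from the question.

```shell
# Sample input from the question; both file names are hypothetical.
printf '%s\n' aardvark,10 aardwolf,9 armadillo,9 platypus,5 zebra,1 > tallies.csv

awk '
    BEGIN { FS = OFS = "," }
    # First pass (NR == FNR): only track the largest tally.
    NR == FNR { if ($2 + 0 > max) max = $2 + 0; next }
    # Second pass: print each word with its tally divided by the maximum.
    { print $1, $2 / max }
' tallies.csv tallies.csv > relative.csv

cat relative.csv
```

Because nothing is stored per word, memory use stays constant regardless of file size, at the cost of reading the file twice.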

错爱 2024-10-23 00:03:38

No need to read the file twice:

awk 'BEGIN {OFS = FS = ","} {a[$1] = $2} $2 > max {max=$2} END {for (w in a) print w, a[w]/max}' inputfile
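
Note that for (w in a) visits the keys in an unspecified order, so the output lines may not match the input order. If the input order matters, one possible variation (a sketch, not part of the original answer) is to index both arrays by line number instead of by word. The file names tallies.csv and ordered.csv are hypothetical.

```shell
# Deliberately unsorted sample input; both file names are hypothetical.
printf '%s\n' zebra,1 platypus,5 aardvark,10 > tallies.csv

awk '
    BEGIN { FS = OFS = "," }
    # Store word and tally under the line number and track the maximum.
    { word[NR] = $1; tally[NR] = $2; if ($2 + 0 > max) max = $2 + 0 }
    # Replay the lines in input order once the maximum is known.
    END { for (i = 1; i <= NR; i++) print word[i], tally[i] / max }
' tallies.csv > ordered.csv

cat ordered.csv
```

This still reads the file once, but like the one-liner above it holds every line in memory until the end.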

If you need the output sorted by word:

awk ... | sort

or, with GNU awk (asort() is a gawk extension):

awk 'BEGIN {OFS = FS = ","} {a[$1] = $2; ind[j++] = $1} $2 > max {max=$2} END {n = asort(ind); for (i=1; i<=n; i++) print ind[i], a[ind[i]]/max}' inputfile

If you need the output sorted by probability:

awk ... | sort -t, -k2,2n -k1,1

再可℃爱ぅ一点好了 2024-10-23 00:03:38

This is not error-proof but something like this should work:

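#!/bin/bash

INPUT=data.csv
OUTPUT=tally.csv
DIGITS=1

OLDIFS=$IFS
IFS=,

maxval=0  # Assuming all $val are positive

# First pass: find the largest tally
while read -r name val
do
    if (( val > maxval )); then maxval=$val; fi
done < "$INPUT"

# Start with an empty $OUTPUT, since the loop below appends to it
: > "$OUTPUT"

# Second pass: divide each tally by the maximum. bc omits the leading zero
# for values between 0 and 1, so print it first (print and && need GNU bc)
while read -r name val
do
    tally=$(echo "scale=$DIGITS; result=$val/$maxval; if (0 < result && result < 1) { print 0 }; print result" | bc)
    echo "$name,$tally" >> "$OUTPUT"
done < "$INPUT"

IFS=$OLDIFS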

Borrowed from this question, and various googling.
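
The script above starts one bc process per input line, which can become slow on large files. As a rough alternative sketch (not part of the original answer), the same fixed number of digits can be produced inside a single awk run by building the printf format from a variable. The variable name digits and the file names tallies.csv and fixed.csv are hypothetical. One difference to be aware of: printf rounds to the requested precision, whereas bc truncates at its scale.

```shell
# Sample input from the question; both file names are hypothetical.
printf '%s\n' aardvark,10 aardwolf,9 armadillo,9 platypus,5 zebra,1 > tallies.csv

awk -v digits=1 '
    # Build a format such as "%s,%.1f\n" from the requested precision.
    BEGIN { FS = ","; fmt = "%s,%." digits "f\n" }
    # First pass: track the largest tally.
    NR == FNR { if ($2 + 0 > max) max = $2 + 0; next }
    # Second pass: print the ratio with a fixed number of decimals.
    { printf fmt, $1, $2 / max }
' tallies.csv tallies.csv > fixed.csv

cat fixed.csv
```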
