Converting tallies to relative probabilities

Posted 2024-10-16 00:03:38

Background

Create a probability lexicon based on a CSV file of words and tallies. This is a prelude to a text segmentation problem, not a homework problem.

Problem

Given a CSV file with the following words and tallies:

aardvark,10
aardwolf,9
armadillo,9
platypus,5
zebra,1

Create a file with probabilities relative to the largest tally in the file:

aardvark,1
aardwolf,0.9
armadillo,0.9
platypus,0.5
zebra,0.1

Where, for example, aardvark,1 is calculated as aardvark,10/10 and platypus,0.5 is calculated as platypus,5/10.

Question

What is the most efficient way to implement a shell script to create the file of relative probabilities?

Constraints

  • Neither the words nor the numbers are in any order.
  • No major programming language (such as Perl, Ruby, Python, Java, C, Fortran, or Cobol).
  • Standard Unix tools such as awk, sed, or sort are welcome.
  • All probabilities must be relative to the highest tally in the file.
  • The words are unique, the numbers are not.
  • The tallies are natural numbers.

Thank you!

Comments (3)

余生再见 2024-10-23 00:03:38
awk 'BEGIN{max=0;OFS=FS=","}  $NF>max{max=$NF}NR>FNR {print $1,($2/max) }' file file
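
The one-liner above names the same file twice so that awk can read it in two passes: while the first copy is being read, NR equals FNR, and only once the second copy starts does NR exceed FNR. A more explicit sketch of the same two-pass idea is shown here. The function name relative_two_pass is hypothetical, and the sketch assumes a well-formed word,tally file with positive tallies.

```shell
# Hypothetical helper: print "word,tally/max" for the CSV file given as $1.
relative_two_pass() {
    awk 'BEGIN { FS = OFS = "," }
         # First pass (NR == FNR): only track the largest tally, print nothing.
         NR == FNR { if ($2 + 0 > max) max = $2 + 0; next }
         # Second pass: scale each tally by the maximum, keeping input order.
         { print $1, $2 / max }
    ' "$1" "$1"
}
```

Called as relative_two_pass file, it keeps the lines in their original order and holds only the maximum in memory, at the cost of reading the file twice.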
错爱 2024-10-23 00:03:38

No need to read the file twice:

awk 'BEGIN {OFS = FS = ","} {a[$1] = $2} $2 > max {max=$2} END {for (w in a) print w, a[w]/max}' inputfile

If you need the output sorted by word:

awk ... | sort

or, staying inside awk (note that asort is a GNU awk extension, so this needs gawk):

awk 'BEGIN {OFS = FS = ","} {a[$1] = $2; ind[j++] = $1} $2 > max {max=$2} END {n = asort(ind); for (i=1; i<=n; i++) print ind[i], a[ind[i]]/max}' inputfile

If you need the output sorted by probability:

awk ... | sort -t, -k2,2n -k1,1
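
For a lexicon it may be more convenient to list the most probable words first. The sort keys above give ascending order; a minimal sketch of the descending variant follows, combined with the single-pass awk command. The function name relative_sorted_desc is hypothetical, and LC_ALL=C is an assumption added so that sort treats "." as the decimal point regardless of locale.

```shell
# Hypothetical helper: relative probabilities for the CSV file in $1,
# highest probability first, ties broken alphabetically by word.
relative_sorted_desc() {
    awk 'BEGIN { FS = OFS = "," }
         # Single pass: remember every tally and track the maximum.
         { tally[$1] = $2; if ($2 + 0 > max) max = $2 + 0 }
         # for-in order is unspecified, so ordering is left to sort.
         END { for (w in tally) print w, tally[w] / max }
    ' "$1" | LC_ALL=C sort -t, -k2,2nr -k1,1
}
```

The r modifier on the second key reverses only the numeric comparison, so words with equal probability still come out in ascending alphabetical order.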
再可℃爱ぅ一点好了 2024-10-23 00:03:38

This is not error-proof but something like this should work:

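#!/bin/bash

INPUT=data.csv
OUTPUT=tally.csv
DIGITS=1

maxval=0  # Assuming all $val are positive

# First pass: find the largest tally.
while IFS=, read -r name val
do
    if (( val > maxval )); then maxval=$val; fi
done < "$INPUT"

# Start from an empty $OUTPUT so repeated runs do not append to old results
: > "$OUTPUT"

# Second pass: divide each tally by the maximum.
while IFS=, read -r name val
do
    # bc omits the leading zero for values below 1, so print one explicitly
    # (the print statement is a GNU bc extension).
    tally=$(echo "scale=$DIGITS; result=$val/$maxval; if (0 <= result && result < 1) { print \"0\" }; print result" | bc)
    echo "$name,$tally" >> "$OUTPUT"
done < "$INPUT"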

Borrowed from this question, and various googling.
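
The script above starts one bc process per line, and bc's scale truncates the quotient rather than rounding it. If rounding to a fixed number of digits is acceptable, the same result can be sketched with a single awk call using printf. The function name relative_fixed and its arguments are hypothetical, and LC_ALL=C is an assumption to keep "." as the decimal point.

```shell
# Hypothetical helper: relative_fixed FILE DIGITS
# Prints "word,probability" with DIGITS decimal places (rounded, not truncated).
relative_fixed() {
    LC_ALL=C awk -v digits="$2" '
        BEGIN { FS = ","; fmt = "%s,%." digits "f\n" }   # e.g. "%s,%.2f\n"
        # First pass: find the largest tally.
        NR == FNR { if ($2 + 0 > max) max = $2 + 0; next }
        # Second pass: print each scaled tally with fixed precision.
        { printf fmt, $1, $2 / max }
    ' "$1" "$1"
}
```

For example, relative_fixed data.csv 1 > tally.csv would mirror the INPUT, OUTPUT, and DIGITS settings of the script, except that values are rounded.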
