Converting tallies to relative probabilities
Background
Create a probability lexicon based on a CSV file of words and tallies. This is a prelude to a text segmentation problem, not a homework problem.
Problem
Given a CSV file with the following words and tallies:
aardvark,10
aardwolf,9
armadillo,9
platypus,5
zebra,1
Create a file with probabilities relative to the largest tally in the file:
aardvark,1
aardwolf,0.9
armadillo,0.9
platypus,0.5
zebra,0.1
Where, for example, aardvark,1 is calculated as aardvark,10/10 and platypus,0.5 is calculated as platypus,5/10.
Question
What is the most efficient way to implement a shell script to create the file of relative probabilities?
Constraints
- Neither the words nor the numbers are in any order.
- No major programming language (such as Perl, Ruby, Python, Java, C, Fortran, or Cobol).
- Standard Unix tools such as awk, sed, or sort are welcome.
- All probabilities must be relative to the highest probability in the file.
- The words are unique, the numbers are not.
- The tallies are natural numbers.
Thank you!
Comments (3)
No need to read the file twice:
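A minimal sketch of what a single-pass approach could look like, assuming awk is available: every count is stored in an array while the maximum is tracked, and the division happens only after the last line has been read. The function name relative_probs is hypothetical, and the sample data is the one from the question.

```shell
# relative_probs is a hypothetical helper name; it reads "word,count" lines
# from standard input and writes "word,probability" lines.
relative_probs() {
    awk -F, '
        # Remember every count and track the largest one seen so far.
        { count[$1] = $2; if ($2 + 0 > max) max = $2 + 0 }
        # After the last line, scale each count by the maximum.
        # Note: "for (word in count)" visits words in no particular order.
        END { for (word in count) print word "," count[word] / max }
    '
}

# Sample data from the question, deliberately shuffled.
result=$(printf '%s\n' 'platypus,5' 'aardvark,10' 'zebra,1' 'aardwolf,9' 'armadillo,9' | relative_probs)
printf '%s\n' "$result"
```

The trade-off is memory: the whole lexicon is held in the awk array, which should be fine for word lists but is an assumption worth checking for very large files.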
If you need the output sorted by word:
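One way to get word order, as a sketch rather than the only option, is to pipe the single-pass awk output through sort on the first comma-separated field. The function name relative_probs_by_word is hypothetical.

```shell
# relative_probs_by_word is a hypothetical helper name; it reads "word,count"
# lines from standard input and writes "word,probability" lines sorted by word.
relative_probs_by_word() {
    awk -F, '
        # Store each count and keep track of the maximum.
        { count[$1] = $2; if ($2 + 0 > max) max = $2 + 0 }
        # Emit the scaled values; the order here is unspecified.
        END { for (word in count) print word "," count[word] / max }
    ' | LC_ALL=C sort -t, -k1,1   # order by the word field only
}

result=$(printf '%s\n' 'platypus,5' 'aardvark,10' 'zebra,1' 'aardwolf,9' 'armadillo,9' | relative_probs_by_word)
printf '%s\n' "$result"
```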
or
If you need the output sorted by probability:
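For probability order, a possible sketch is to sort by count first, so that the largest tally arrives on the first line and awk needs no array at all. The function name relative_probs_by_prob is hypothetical, and the secondary sort key on the word is an added assumption to make ties come out in a predictable order.

```shell
# relative_probs_by_prob is a hypothetical helper name; it reads "word,count"
# lines from standard input and writes them highest probability first.
relative_probs_by_prob() {
    # Sort numerically descending on the count; break ties by word.
    LC_ALL=C sort -t, -k2,2nr -k1,1 |
    awk -F, '
        # The first line now carries the largest count.
        NR == 1 { max = $2 }
        { print $1 "," $2 / max }
    '
}

result=$(printf '%s\n' 'platypus,5' 'aardvark,10' 'zebra,1' 'aardwolf,9' 'armadillo,9' | relative_probs_by_prob)
printf '%s\n' "$result"
```

Because awk only ever holds one line at a time here, the memory cost moves to sort, which can spill to temporary files on large inputs.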
This is not error-proof but something like this should work:
Borrowed from this question, and various googling.
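A two-pass variant in that spirit might look like the sketch below: the maximum is found first with cut, sort and tail, then passed to awk as a variable. The temporary file stands in for the real CSV and is an assumption, and nothing here guards against an empty file or a maximum of zero, either of which would break the division.

```shell
# Stand-in for the real CSV file (assumption); sample data from the question.
input=$(mktemp)
printf '%s\n' 'platypus,5' 'aardvark,10' 'zebra,1' 'aardwolf,9' 'armadillo,9' > "$input"

# Pass 1: pull out the counts, sort them numerically, keep the largest.
max=$(cut -d, -f2 "$input" | sort -n | tail -n 1)

# Pass 2: divide every count by that maximum; input order is preserved.
result=$(awk -F, -v max="$max" '{ print $1 "," $2 / max }' "$input")
printf '%s\n' "$result"

rm -f "$input"
```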