Counting the most frequent words in a text file with awk, cat, head, sort, and tr


Count the most frequently used words in a text file.

cat WAR_AND_PEACE_By_LeoTolstoi.txt | tr -cs "[:alnum:]" "\n" | tr "[:lower:]" "[:upper:]" | awk '{h[$1]++} END {for (i in h) {print h[i] " " i}}' | sort -nr | cat -n | head -n 30
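The pipeline above can be read one stage at a time. The following is a minimal sketch of the same stages, assuming a made-up sample sentence in place of the book and a hypothetical helper name `count_words`; neither comes from the original command.

```shell
#!/bin/sh
# Hypothetical sample text standing in for the real input file.
sample='The cat and the dog. The end!'

# count_words is an illustrative helper name, not part of the original tip.
count_words() {
    # 1. tr -cs: turn every run of non-alphanumeric characters into one newline,
    #    which leaves one word per line
    # 2. tr: fold lowercase to uppercase so "The" and "the" are counted together
    # 3. awk: tally each word in an associative array, then print "count word"
    # 4. sort -nr: order numerically by count, highest first
    tr -cs '[:alnum:]' '\n' |
        tr '[:lower:]' '[:upper:]' |
        awk '{h[$1]++} END {for (i in h) print h[i] " " i}' |
        sort -nr
}

# Most frequent word in the sample.
top=$(printf '%s\n' "$sample" | count_words | head -n 1)
echo "$top"
```

In the full command, `cat -n` then adds a rank number to each line and `head -n 30` keeps the top 30.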
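Using cat WAR_AND_PEACE_By_LeoTolstoi.txt | tr -cs "[:alnum:]" "\n" | tr "[:lower:]" "[:upper:]" | sort -S16M | uniq -c | sort -nr | cat -n | head -n 30 also does the job (the -S buffer-size option, as in "sort -S1G", is Linux/GNU sort only), but it has some drawbacks on bigger files because of the space and time complexity of sorting.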
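To make the difference between the two variants concrete, here is a small sketch that runs both on the same input and checks that they agree. The sample sentence, the helper name `words`, and the extra `sort -k1,1nr -k2` normalization step are assumptions added for illustration; the GNU-only `-S` option is left out so the sketch stays portable.

```shell
#!/bin/sh
# Hypothetical sample text standing in for the real input file.
sample='to be or not to be'

# words is an illustrative helper: one uppercase word per line.
words() { tr -cs '[:alnum:]' '\n' | tr '[:lower:]' '[:upper:]'; }

# Variant A: awk keeps one counter per distinct word, so only the
# (usually much shorter) list of distinct words is sorted afterwards.
via_awk=$(printf '%s\n' "$sample" | words |
    awk '{h[$1]++} END {for (i in h) print h[i] " " i}' |
    sort -k1,1nr -k2)

# Variant B: sort has to order every single word occurrence before
# uniq -c can count adjacent duplicates. The awk step only strips the
# leading padding that uniq -c adds in front of each count.
via_uniq=$(printf '%s\n' "$sample" | words | sort | uniq -c |
    awk '{print $1 " " $2}' |
    sort -k1,1nr -k2)

echo "$via_awk"
[ "$via_awk" = "$via_uniq" ] && echo "both variants agree"
```

Both variants produce the same counts. The cost differs because the first `sort` in variant B works on the whole token stream, while variant A only sorts one line per distinct word.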
Sample output
# get some input http://www.gutenberg.org
$ cat WAR_AND_PEACE_By_LeoTolstoi.txt | tr -cs "[:alnum:]" "\n" | tr "[:lower:]" "[:upper:]" | awk '{h[$1]++} END {for (i in h) {print h[i] " " i}}' | sort -nr | cat -n | head -n 30
 1 34720 THE
 2 22300 AND
 3 16753 TO
 4 15007 OF
 5 10608 A
 6 10004 HE
 7 9036 IN
 8 8204 THAT
 9 7984 HIS
 10 7359 WAS
 11 5710 WITH
 12 5617 IT
 13 5365 HAD
 14 4725 HER
 15 4697 NOT
 16 4637 HIM
 17 4547 AT
 18 4524 I
 19 4414 S
 20 4054 BUT
 21 4035 AS
 22 4014 ON
 23 3871 YOU
 24 3555 FOR
 25 3488 SHE
 26 3347 IS
 27 2842 SAID
 28 2813 ALL
 29 2709 FROM
 30 2458 BY