GEN: More on Lojban Letter Frequencies
John Cowan's letter frequency data posted the other day is probably more
accurate than he thought. I did the same exercise on a significantly
larger set of Lojban text (and one perhaps less weighted by Nick's
massive contribution to the corpus of Lojban text). The results were
almost identical, except for the letter 'o', and I suspect the value for
that letter may be an arithmetic or copying error on his part, since his
data sums to slightly less than 1000.
My data is based on 75315 words of Lojban text (367K) compared to
Cowan's 20K words, and probably includes the vast majority of the text
in the archives that is longer than a single sentence. I was
similarly careful in removing non-Lojban material from the text body,
even to the point of manually removing the contents of zoi and la'o quotes.
My raw and normalized frequencies are in the two left columns below.
The third column is John's data. The 4th and 5th columns are the old
static data. The 6th column gives the normalized static results, based on
taking only one copy of each word in the raw Lojban text I used for the
dynamic data, combined with the gismu list, cmavo list, and Nick's lujvo
list. This approximates a maximal list of words that could appear in
the dictionary, though it probably has a small excess of meaningful
cmavo compounds.
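
The difference between the dynamic and static counts described above can be sketched in a few lines of Python. This is only an illustration, not the procedure actually used for the numbers in this post: the function name `letter_counts`, the `unique_words` flag, and the sample sentence are all made up for the example, and it assumes the text has already been stripped of non-Lojban material by hand.

```python
from collections import Counter

# The 25 symbols that appear in the table in this post (24 letters plus comma).
LOJBAN_SYMBOLS = set("'abcdefgijklmnoprstuvxyz,")


def letter_counts(text, unique_words=False):
    """Tally Lojban letters in whitespace-separated text.

    unique_words=False counts every occurrence (the "dynamic" case);
    unique_words=True counts each distinct word once (the "static" case).
    """
    words = text.lower().split()
    if unique_words:
        words = set(words)
    counts = Counter()
    for word in words:
        # Anything outside the table's symbol set (e.g. '.') is ignored.
        counts.update(ch for ch in word if ch in LOJBAN_SYMBOLS)
    return counts


# Hypothetical sample text, used only to show the two modes.
sample = "mi klama le zarci .i mi klama le zdani"
dynamic = letter_counts(sample)
static = letter_counts(sample, unique_words=True)
print(dynamic["m"], static["m"])
```

In the static mode a word list (gismu, cmavo, lujvo) would simply be appended to the text before counting, since each distinct word is tallied once either way.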
                                    Old results           Current
                                 static     static        static
letter         dynamic           no-lujvo   with-lujvo    with-lujvo/cmavo
           raw  Lojbab  Cowan
  '      13888    048    045       037        028           057
  a      30431    106    105       118        084           125
  b       6016    021    021       025        024           024
  c      11849    041    042       043        029           037
  d       7123    025    023       026        024           023
  e      26810    094    095       059        044           075
  f       3678    013    013       017        017           013
  g       4159    015    014       017        016           018
  i      37295    130    132       124        076           107
  j       5108    018    017       029        028           022
  k       9546    033    033       034        031           031
  l      21156    074    073       041        039           048
  m       8971    031    032       030        029           034
  n      15557    054    055       067        058           051
  o      17890    062    057       047        029           042
  p       6062    021    022       024        024           026
  r      11410    040    039       054        084           058
  s      10229    036    037       040        038           045
  t       7762    027    026       043        038           034
  u      21556    075    076       076        050           067
  v       3310    012    010       014        013           015
  x       2180    008    008       012        015           011
  y       1487    005    004       002        158           022
  z       3101    011    009       010        010           013
  ,         71
        ______
         75315 wds                                          9300 words/compounds
        367090 char                                        67361 char
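
As a cross-check on the table, the normalization can be sketched as below. This is a rough reconstruction, not necessarily the calculation actually used: it assumes each normalized value is the raw count divided by the sum of the 24 letter counts (comma excluded), scaled to 1000 and rounded. The names `RAW`, `LOJBAB`, `COWAN` and `per_mille` are just labels for this example; the numbers are copied from the table.

```python
# Raw dynamic counts and the two normalized dynamic columns, from the table.
RAW = {
    "'": 13888, "a": 30431, "b": 6016, "c": 11849, "d": 7123, "e": 26810,
    "f": 3678, "g": 4159, "i": 37295, "j": 5108, "k": 9546, "l": 21156,
    "m": 8971, "n": 15557, "o": 17890, "p": 6062, "r": 11410, "s": 10229,
    "t": 7762, "u": 21556, "v": 3310, "x": 2180, "y": 1487, "z": 3101,
}
LOJBAB = {
    "'": 48, "a": 106, "b": 21, "c": 41, "d": 25, "e": 94, "f": 13, "g": 15,
    "i": 130, "j": 18, "k": 33, "l": 74, "m": 31, "n": 54, "o": 62, "p": 21,
    "r": 40, "s": 36, "t": 27, "u": 75, "v": 12, "x": 8, "y": 5, "z": 11,
}
COWAN = {
    "'": 45, "a": 105, "b": 21, "c": 42, "d": 23, "e": 95, "f": 13, "g": 14,
    "i": 132, "j": 17, "k": 33, "l": 73, "m": 32, "n": 55, "o": 57, "p": 22,
    "r": 39, "s": 37, "t": 26, "u": 76, "v": 10, "x": 8, "y": 4, "z": 9,
}

# Assumption: the denominator is the total of the 24 letter counts.
total = sum(RAW.values())
per_mille = {ch: round(1000 * n / total) for ch, n in RAW.items()}

print("matches Lojbab column:", per_mille == LOJBAB)
print("Lojbab column sum:", sum(LOJBAB.values()))
print("Cowan column sum:", sum(COWAN.values()))
```

Under that assumption the recomputed values agree with the Lojbab column, which sums to 1000, while the Cowan column as posted sums to 988.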
The two dynamic data sets gave identical rank orderings, thus confirming
my observation that almost 4x the amount of data had little effect. The
new static data differed significantly from theory, and was not all that
far from the dynamic data ordering: no letter moved more than 4
positions from its dynamic rank except 'r' (5 positions) and 'y', the
latter probably being used excessively in current Lojban text because
people don't know the rafsi well enough to use reduced forms wherever
they could.
both dynamic: iaeul on'cr skmtd pbjgf vzxy
old no-lujvo: iaune rotcl s'kmj dbpfg vxzy
old with-lujvo: yarin uelts kcmoj 'pbdf gxvz
new static: aieur 'nlso ctmkp bdjyg vfzx
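
The rank comparison can be checked mechanically from the ordering strings above. A minimal sketch, with the two strings copied from this post and the helper name `rank_shifts` invented for the example:

```python
def rank_shifts(order_a, order_b):
    """Return {letter: |rank in order_a - rank in order_b|}.

    Spaces in the ordering strings are only visual grouping and are dropped.
    """
    a = order_a.replace(" ", "")
    b = order_b.replace(" ", "")
    pos_b = {ch: i for i, ch in enumerate(b)}
    return {ch: abs(i - pos_b[ch]) for i, ch in enumerate(a)}


both_dynamic = "iaeul on'cr skmtd pbjgf vzxy"
new_static = "aieur 'nlso ctmkp bdjyg vfzx"

shifts = rank_shifts(both_dynamic, new_static)
# Letters that moved more than 4 positions between the two orderings.
movers = sorted(ch for ch, d in shifts.items() if d > 4)
print(movers, max(shifts.values()))
```

On the strings as posted this reports 'r' and 'y' at 5 positions each and every other letter at 4 or fewer.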
lojbab