
GEN: More on Lojban Letter Frequencies



John Cowan's letter frequency data, posted the other day, is probably
more accurate than he thought.  I did the same exercise on a
significantly larger set of Lojban text (and one perhaps less weighted
by Nick's massive contribution to the corpus).  The results were almost
identical except for the letter 'o', and I suspect his value for that
letter may be an arithmetic or copying error, since his data sums to
988 rather than 1000.

My data is based on 75315 words of Lojban text (367K characters),
compared to Cowan's 20K words, and probably includes the vast majority
of the text in the archives that is longer than a single sentence.  I
was similarly careful to remove non-Lojban material from the text body,
even to the point of manually removing the contents of zoi and la'o
quotes.

My raw and normalized frequencies are in the two left columns below;
normalized values are parts per thousand letters.  The third column is
John's data.  The 4th and 5th columns are the old static data.  The 6th
column is the normalized static result, based on taking only one copy
of each word in the raw Lojban text I used for the dynamic data,
combined with the gismu list, cmavo list, and Nick's lujvo list.  This
approximates a maximal list of words that could appear in the
dictionary, though it probably has a small excess of meaningful cmavo
compounds.
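
For anyone wanting to repeat the exercise, the difference between the
dynamic and static counts can be sketched as below.  This is only an
illustration of the idea, not the procedure actually used for the
table: the function names and the sample sentence are made up, and it
assumes the text has already been cleaned and split into words.

```python
from collections import Counter

# The 24 symbols tallied in the table; the comma is left out here.
LOJBAN_LETTERS = "'abcdefgijklmnoprstuvxyz"


def letter_counts(words):
    """Count Lojban letters over an iterable of words."""
    counts = Counter()
    for word in words:
        for ch in word.lower():
            if ch in LOJBAN_LETTERS:
                counts[ch] += 1
    return counts


def per_mille(counts):
    """Normalize raw counts to parts per thousand, rounded."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {ch: round(1000 * n / total) for ch, n in counts.items()}


def dynamic_frequencies(text_words):
    """Every occurrence of a word in running text contributes."""
    return per_mille(letter_counts(text_words))


def static_frequencies(text_words, word_lists=()):
    """Each distinct word contributes once, merged with any word lists."""
    vocabulary = set(text_words)
    for words in word_lists:
        vocabulary.update(words)
    return per_mille(letter_counts(vocabulary))


# Hypothetical sample; a real run would read the cleaned corpus instead.
sample = "mi klama le zarci mi klama".split()
print(dynamic_frequencies(sample))
print(static_frequencies(sample, word_lists=[["gismu", "cmavo"]]))
```

Repeated words such as "mi" and "klama" weigh more heavily in the
dynamic figures than in the static ones, which is the whole difference
between the two kinds of column.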

                                   Old results       Current
              dynamic          static    static       static
 letter   raw Lojbab Cowan    no-lujvo  with-lujvo  with-lujvo/cmavo
 '      13888  048   045         037       028         057
 a      30431  106   105         118       084         125
 b       6016  021   021         025       024         024
 c      11849  041   042         043       029         037
 d       7123  025   023         026       024         023
 e      26810  094   095         059       044         075
 f       3678  013   013         017       017         013
 g       4159  015   014         017       016         018
 i      37295  130   132         124       076         107
 j       5108  018   017         029       028         022
 k       9546  033   033         034       031         031
 l      21156  074   073         041       039         048
 m       8971  031   032         030       029         034
 n      15557  054   055         067       058         051
 o      17890  062   057         047       029         042
 p       6062  021   022         024       024         026
 r      11410  040   039         054       084         058
 s      10229  036   037         040       038         045
 t       7762  027   026         043       038         034
 u      21556  075   076         076       050         067
 v       3310  012   010         014       013         015
 x       2180  008   008         012       015         011
 y       1487  005   004         002       158         022
 z       3101  011   009         010       010         013
 ,         71
       ______
        75315 wds                                     9300 words/compounds
       367090 char                                   67361 char
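
The normalization and the column sums can be rechecked from the table
itself.  The sketch below copies the letter, raw, Lojbab and Cowan
columns and assumes the normalized values are raw counts per thousand
letters, with the 71 commas left out of the total; the variable names
are just for illustration.

```python
# Rows copied from the table above: letter, raw count, Lojbab, Cowan.
TABLE = """\
' 13888 048 045
a 30431 106 105
b 6016 021 021
c 11849 041 042
d 7123 025 023
e 26810 094 095
f 3678 013 013
g 4159 015 014
i 37295 130 132
j 5108 018 017
k 9546 033 033
l 21156 074 073
m 8971 031 032
n 15557 054 055
o 17890 062 057
p 6062 021 022
r 11410 040 039
s 10229 036 037
t 7762 027 026
u 21556 075 076
v 3310 012 010
x 2180 008 008
y 1487 005 004
z 3101 011 009
"""

rows = [line.split() for line in TABLE.splitlines()]
raw = {letter: int(count) for letter, count, _, _ in rows}
lojbab = {letter: int(value) for letter, _, value, _ in rows}
cowan = {letter: int(value) for letter, _, _, value in rows}

# Assumption: normalize against the letter total only (commas excluded).
total_letters = sum(raw.values())
normalized = {ch: round(1000 * n / total_letters) for ch, n in raw.items()}

# Letters where the recomputed value disagrees with the Lojbab column.
mismatches = [ch for ch in raw if normalized[ch] != lojbab[ch]]

# Column sums, and the letter where the two dynamic columns differ most.
cowan_sum = sum(cowan.values())
largest_gap = max(raw, key=lambda ch: abs(lojbab[ch] - cowan[ch]))

print("recomputed values differing from the table:", mismatches)
print("sum of Cowan column:", cowan_sum)
print("largest Lojbab/Cowan gap at letter:", largest_gap)
```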


The two dynamic data sets gave identical rank orderings, confirming my
observation that almost 4x the amount of data had little effect.  The
new static data differed significantly from theory, and was not all that
far from the dynamic ordering - no letter moved more than 4 positions
from its dynamic rank except 'r' and 'y', which each moved 5.  'y' is
probably used excessively in current Lojban text because people don't
know the rafsi well enough to use reduced forms wherever they could.

   both dynamic:        iaeul on'cr skmtd pbjgf vzxy
   old no-lujvo:        iaune rotcl s'kmj dbpfg vxzy
   old with-lujvo:      yarin uelts kcmoj 'pbdf gxvz
   new static:          aieur 'nlso ctmkp bdjyg vfzx
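
How far each letter moves between two of these orderings can be
checked mechanically.  The sketch below takes the "both dynamic" and
"new static" strings as given above; the helper names are made up, and
note that a few letters are tied in the static column (t/m, j/y, f/z),
so their exact positions there are somewhat arbitrary.

```python
# Rank-order strings copied from the lines above.
DYNAMIC = "iaeul on'cr skmtd pbjgf vzxy"
NEW_STATIC = "aieur 'nlso ctmkp bdjyg vfzx"


def ranks(ordering):
    """Map each letter to its 1-based position, ignoring the spaces."""
    letters = ordering.replace(" ", "")
    return {ch: pos for pos, ch in enumerate(letters, start=1)}


def displacement(first, second):
    """Absolute change in rank for every letter between two orderings."""
    r1, r2 = ranks(first), ranks(second)
    return {ch: abs(r1[ch] - r2[ch]) for ch in r1}


moves = displacement(DYNAMIC, NEW_STATIC)
largest = max(moves.values())
movers = sorted(ch for ch, d in moves.items() if d == largest)

print("largest move:", largest, "positions, by", movers)
```

The same two helpers work for the old static orderings if those
strings are substituted.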

lojbab