
Re: GEN: More on Lojban Letter Frequencies



la lojbab cusku di'e

> The results [of his and my frequency counts] were
> almost identical, except for the letter 'o', and I suspect the value for
> that letter may be an arithmetic or copying error on his part, since his
> data sums to slightly less than 1000.

My data sums to less than 1000 because I used a truncating, rather than
rounding, calculation.  However, I checked the number of "o" characters,
using a different method (filtering out all non-"o" characters and
counting them), and the fraction of the running text is correct.
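
For anyone who wants to check the arithmetic, here is a minimal Python sketch of why truncating makes the per-thousand figures sum to less than 1000. It is an illustration only: it uses the raw counts from lojbab's table quoted further down (the counts behind the Cowan column are not given in this thread), and the names per_mille and count_letter are made up for the example.

```python
# Raw letter counts copied from the "raw" column of lojbab's table
# (the 71 commas are left out, since they are not letters).
RAW_COUNTS = {
    "'": 13888, "a": 30431, "b": 6016, "c": 11849, "d": 7123, "e": 26810,
    "f": 3678, "g": 4159, "i": 37295, "j": 5108, "k": 9546, "l": 21156,
    "m": 8971, "n": 15557, "o": 17890, "p": 6062, "r": 11410, "s": 10229,
    "t": 7762, "u": 21556, "v": 3310, "x": 2180, "y": 1487, "z": 3101,
}


def per_mille(counts, rounding):
    """Return frequencies in parts per 1000, truncated or rounded to nearest."""
    total = sum(counts.values())
    # Adding half the divisor before integer division rounds to nearest;
    # adding nothing simply truncates.
    bias = total // 2 if rounding else 0
    return {ch: (n * 1000 + bias) // total for ch, n in counts.items()}


def count_letter(text, letter):
    """Cross-check: filter out everything except `letter`, then count."""
    return len([ch for ch in text if ch == letter])


truncated = per_mille(RAW_COUNTS, rounding=False)
rounded = per_mille(RAW_COUNTS, rounding=True)
print("truncated sum:", sum(truncated.values()))
print("rounded sum:  ", sum(rounded.values()))
print("o's in sample:", count_letter("e'osai ko sarji la lojban.", "o"))
```

Each truncated figure loses less than one part per thousand, so with 24 letters the truncated column can fall short of 1000 by at most 23, while a rounded column lands on or near 1000.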

BTW, the exact statistics of the raw input were: 19579 words in 96703
characters of running text, representing 76485 letters (the rest were
whitespace or dots).

> My data is based on 75315 words of Lojban text (367K) compared to
> Cowan's 20K words, and probably includes the vast majority of such text
> in the archives which is greater than single sentence length.

Well, if that's the best we can do, that's the best we can do.  For the
record, the most massive frequency count of the pre-computer era was one
made on a German corpus of 59,298,274 letters.

> The 6th column is the normalized static results based on
> taking only 1 copy of each word in the raw Lojban text I used for the
> dynamic data combined with the gismu list, cmavo list, and Nick's lujvo
> list.  This approximates a maximal list of words that could appear in
> the dictionary, though it probably has a small excess of meaningful
> cmavo compounds.

Now why didn't I think of that?  :-)

--
John Cowan              sharing account <lojbab@access.digex.net> for now
                e'osai ko sarji la lojban.
>
>                                    Old results          Current
>               dynamic           static     static      static
>  letter   raw Lojbab Cowan     no-lujvo  with-lujvo  with-lujvo/cmavo
>  '      13888  048   045         037       028         057
>  a      30431  106   105         118       084         125
>  b       6016  021   021         025       024         024
>  c      11849  041   042         043       029         037
>  d       7123  025   023         026       024         023
>  e      26810  094   095         059       044         075
>  f       3678  013   013         017       017         013
>  g       4159  015   014         017       016         018
>  i      37295  130   132         124       076         107
>  j       5108  018   017         029       028         022
>  k       9546  033   033         034       031         031
>  l      21156  074   073         041       039         048
>  m       8971  031   032         030       029         034
>  n      15557  054   055         067       058         051
>  o      17890  062   057         047       029         042
>  p       6062  021   022         024       024         026
>  r      11410  040   039         054       084         058
>  s      10229  036   037         040       038         045
>  t       7762  027   026         043       038         034
>  u      21556  075   076         076       050         067
>  v       3310  012   010         014       013         015
>  x       2180  008   008         012       015         011
>  y       1487  005   004         002       158         022
>  z       3101  011   009         010       010         013
>  ,         71
>        ______
>         75315 wds                                     9300 words/compounds
>        367090 char                                   67361 char
>
>
> The two dynamic data-sets gave identical rank-ordering, thus confirming
> my observation that almost 4x the amount of data had little effect.  The
> new static data significantly differed from theory, and was not all that
> far from the dynamic data ordering - no letter moved more than 4
> positions from the dynamic rank except 'y' which is probably used
> excessively in current Lojban text because people don't know the rafsi
> well enough to use reduced forms all the times that they could.
>
>    both dynamic:        iaeul on'cr skmtd pbjgf vzxy
>    old no-lujvo:        iaune rotcl s'kmj dbpfg vxzy
>    old with-lujvo:      yarin uelts kcmoj 'pbdf gxvz
>    new static:          aieur 'nlso ctmkp bdjyg vfzx
>
> lojbab
>
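
The rank orderings quoted above can be rederived and compared mechanically. The following is a rough Python sketch, not anything lojbab is known to have used: rank_order and rank_shifts are hypothetical helper names, the raw counts are copied from the table, and the two ordering strings are copied from the list above with the spaces removed.

```python
# Raw letter counts from the "raw" column of the table (commas omitted).
RAW_COUNTS = {
    "'": 13888, "a": 30431, "b": 6016, "c": 11849, "d": 7123, "e": 26810,
    "f": 3678, "g": 4159, "i": 37295, "j": 5108, "k": 9546, "l": 21156,
    "m": 8971, "n": 15557, "o": 17890, "p": 6062, "r": 11410, "s": 10229,
    "t": 7762, "u": 21556, "v": 3310, "x": 2180, "y": 1487, "z": 3101,
}

# Orderings as quoted in the message, with the grouping spaces removed.
DYNAMIC = "iaeul on'cr skmtd pbjgf vzxy".replace(" ", "")
NEW_STATIC = "aieur 'nlso ctmkp bdjyg vfzx".replace(" ", "")


def rank_order(counts):
    """Letters sorted from most to least frequent."""
    return "".join(sorted(counts, key=counts.get, reverse=True))


def rank_shifts(reference, other):
    """Number of positions each letter moves between two orderings."""
    return {ch: abs(reference.index(ch) - other.index(ch)) for ch in reference}


# The raw counts should give back the quoted dynamic ordering.
print(rank_order(RAW_COUNTS) == DYNAMIC)

# List any letter that moves more than 4 positions between the orderings.
shifts = rank_shifts(DYNAMIC, NEW_STATIC)
for ch, moved in sorted(shifts.items(), key=lambda kv: -kv[1]):
    if moved > 4:
        print(ch, moved)
```

Ranking from the raw counts rather than the three-digit figures avoids ties such as p and b, which both show as 021 in the Lojbab column.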

