Re: GEN: More on Lojban Letter Frequencies
la lojbab cusku di'e
> The results [of his and my frequency counts] were
> almost identical, except for the letter 'o', and I suspect the value for
> that letter may be an arithmetic or copying error on his part, since his
> data sums to slightly less than 1000.
My data sums to less than 1000 because I used a truncating, rather than
rounding, calculation. However, I checked the number of "o" characters,
using a different method (filtering out all non-"o" characters and
counting them), and the fraction of the running text is correct.
BTW, the exact statistics of the raw input were: 19579 words in 96703
characters of running text, representing 76485 letters (the rest were
whitespace or dots).
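To make the truncating-versus-rounding point concrete, here is a minimal
sketch in Python. It assumes the figures are parts per thousand and that the
letter set is the 24 symbols listed as rows in the table; the function names
and the toy input are hypothetical, and this is not the program actually used
for either count.

```python
from collections import Counter

# Assumption: the 24 symbols that appear as rows in the frequency table.
LETTERS = "'abcdefgijklmnoprstuvxyz"

def letter_counts(text):
    """Count only Lojban letters, ignoring whitespace, dots, and anything else."""
    return Counter(ch for ch in text.lower() if ch in LETTERS)

def per_mille(counts, rounding=False):
    """Normalize counts to parts per thousand, truncating unless rounding=True."""
    total = sum(counts.values())
    if rounding:
        # Integer round-half-up: add half the divisor before dividing.
        return {ch: (n * 1000 + total // 2) // total for ch, n in counts.items()}
    return {ch: n * 1000 // total for ch, n in counts.items()}

def count_by_filtering(text, letter):
    """Independent check: discard every other character and count what is left."""
    return len([ch for ch in text.lower() if ch == letter])

sample = "e'osai ko sarji la lojban."  # toy input, not the real corpus
counts = letter_counts(sample)
truncated = per_mille(counts)
rounded = per_mille(counts, rounding=True)

# Truncation can only lose fractions, so this sum never exceeds 1000.
print(sum(truncated.values()), sum(rounded.values()))
# The two ways of counting "o" should agree.
print(counts["o"], count_by_filtering(sample, "o"))
```

Each truncated figure drops up to one part per thousand, so over 24 letters a
shortfall of a dozen or so in the total is what one would expect.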
> My data is based on 75315 words of Lojban text (367K) compared to
> Cowan's 20K words, and probably includes the vast majority of such text
> in the archives that is longer than a single sentence.
Well, if that's the best we can do, that's the best we can do. For the
record, the most massive frequency count of the pre-computer era was one
made on a German corpus of 59,298,274 letters.
> The 6th column is the normalized static results based on
> taking only 1 copy of each word in the raw Lojban text I used for the
> dynamic data combined with the gismu list, cmavo list, and Nick's lujvo
> list. This approximates a maximal list of words that could appear in
> the dictionary, though it probably has a small excess of meaningful
> cmavo compounds.
Now why didn't I think of that? :-)
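For anyone who wants to try the same thing, the difference between the two
kinds of count can be sketched roughly as follows. This is only an
illustration under my own assumptions: the function names are hypothetical,
words are taken to be already split out of the text, and the extra word list
is a stand-in for the gismu, cmavo, and lujvo lists rather than the real data.

```python
from collections import Counter

def dynamic_counts(words):
    """Letter counts over running text: every occurrence of a word counts."""
    return Counter(ch for word in words for ch in word)

def static_counts(words, *word_lists):
    """Letter counts over distinct words only: one copy of each word,
    taken from the text plus any extra word lists supplied."""
    vocabulary = set(words)
    for extra in word_lists:
        vocabulary.update(extra)
    return Counter(ch for word in vocabulary for ch in word)

running_text = "mi klama le zarci mi".split()  # toy text; "mi" occurs twice
extra_list = ["klama", "gismu"]                # stand-in for a real word list

dynamic = dynamic_counts(running_text)
static = static_counts(running_text)
static_with_list = static_counts(running_text, extra_list)

print(dynamic["m"], static["m"], static_with_list["m"])
```

Common short words dominate the dynamic count, while in the static count each
word contributes its letters exactly once however often it is used.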
--
John Cowan sharing account <lojbab@access.digex.net> for now
e'osai ko sarji la lojban.
>
>              dynamic              old static          current static
> letter    raw  Lojbab  Cowan  no-lujvo  with-lujvo  with-lujvo/cmavo
> '       13888     048    045       037         028               057
> a       30431     106    105       118         084               125
> b        6016     021    021       025         024               024
> c       11849     041    042       043         029               037
> d        7123     025    023       026         024               023
> e       26810     094    095       059         044               075
> f        3678     013    013       017         017               013
> g        4159     015    014       017         016               018
> i       37295     130    132       124         076               107
> j        5108     018    017       029         028               022
> k        9546     033    033       034         031               031
> l       21156     074    073       041         039               048
> m        8971     031    032       030         029               034
> n       15557     054    055       067         058               051
> o       17890     062    057       047         029               042
> p        6062     021    022       024         024               026
> r       11410     040    039       054         084               058
> s       10229     036    037       040         038               045
> t        7762     027    026       043         038               034
> u       21556     075    076       076         050               067
> v        3310     012    010       014         013               015
> x        2180     008    008       012         015               011
> y        1487     005    004       002         158               022
> z        3101     011    009       010         010               013
> ,          71
>        ______
>         75315 wds                                 9300 words/compounds
>        367090 char                               67361 char
>
>
> The two dynamic data-sets gave identical rank-ordering, thus confirming
> my observation that almost 4x the amount of data had little effect. The
> new static data significantly differed from theory, and was not all that
> far from the dynamic data ordering: no letter moved more than 4
> positions from the dynamic rank except 'y', which is probably used
> excessively in current Lojban text because people don't know the rafsi
> well enough to use reduced forms whenever they could.
>
> both dynamic: iaeul on'cr skmtd pbjgf vzxy
> old no-lujvo: iaune rotcl s'kmj dbpfg vxzy
> old with-lujvo: yarin uelts kcmoj 'pbdf gxvz
> new static: aieur 'nlso ctmkp bdjyg vfzx
>
> lojbab
>
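The rank-ordering strings can be reproduced from a column of the table. The
sketch below is an illustration, not lojbab's procedure: it uses the raw
dynamic counts copied from the table, and the function name is hypothetical.

```python
# Raw dynamic letter counts, copied from the "raw" column of the table.
RAW = {
    "'": 13888, "a": 30431, "b": 6016, "c": 11849, "d": 7123, "e": 26810,
    "f": 3678, "g": 4159, "i": 37295, "j": 5108, "k": 9546, "l": 21156,
    "m": 8971, "n": 15557, "o": 17890, "p": 6062, "r": 11410, "s": 10229,
    "t": 7762, "u": 21556, "v": 3310, "x": 2180, "y": 1487, "z": 3101,
}

def rank_string(freqs, group=5):
    """Letters from most to least frequent, written in groups of `group`."""
    ordered = sorted(freqs, key=lambda ch: freqs[ch], reverse=True)
    letters = "".join(ordered)
    return " ".join(letters[i:i + group] for i in range(0, len(letters), group))

print(rank_string(RAW))  # prints: iaeul on'cr skmtd pbjgf vzxy
```

The raw counts contain no ties, so the result matches the "both dynamic"
line. The three-digit normalized columns do contain ties (for example 'p' and
'b' at 021), so ranking from those figures alone may order tied letters
differently from the strings quoted above.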