Character frequencies for Lojban -- a first cut
Some time ago (back in JL9:33-34), lojbab generated a list of static
letter frequencies for Lojban: how often each letter a-z and ' occurs
in: 1) the gismu and cmavo lists; 2) those plus a rough guess at what
a lujvo list would look like (at that time, we didn't have one).
Of course, this data totally ignored the fact that some words occur
more often than others, so it was suitable for making a Lojban Scrabble set,
but not for Lojban cryptanalysis.
Well, I took 20,000 words of Lojban I had on my PC, very carefully
excluded all English stuff, folded case (upper case is so marginal in
Lojban it's not worth treating as separate), and stripped everything
except a-z and ' (the . character is really optional, though strongly
recommended, and some writers don't use it).
Then I could generate a first cut at dynamic frequencies of characters
based on actual running text. Some of the text is not fully grammatical,
but it's probably all "lexically sound", which is all that really matters.
I tried to make sure that multiple versions of the same text weren't
included, to avoid biases.
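The counting procedure described above can be sketched roughly as follows. This is a minimal illustration, not the program actually used: the function name letter_frequencies is made up here, and reporting the result in truncated parts per thousand is an assumption (suggested by the fact that each column of the table below sums to a little under 1000).

```python
from collections import Counter
import string

# Characters that survive stripping: a-z and the apostrophe, as described in the post.
KEPT = set(string.ascii_lowercase + "'")

def letter_frequencies(text):
    """Return {character: parts per thousand} for running text.

    Hypothetical helper. Truncating to whole parts per thousand is an
    assumption about how the table's figures were produced.
    """
    folded = text.lower()                      # fold case
    counts = Counter(ch for ch in folded if ch in KEPT)  # strip everything else, "." included
    total = sum(counts.values())
    if total == 0:
        return {}
    return {ch: counts[ch] * 1000 // total for ch in sorted(counts)}

if __name__ == "__main__":
    sample = "e'osai ko sarji la lojban."
    for ch, freq in letter_frequencies(sample).items():
        print(ch, f"{freq:03d}")
```

Excluding English material and duplicate versions of a text is left out here; that part was done by hand and has no obvious mechanical equivalent.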
Here are the results, plus lojbab's old data:
                  static    static
letter  dynamic   no-lujvo  with-lujvo
'       045       037       028
a       105       118       084
b       021       025       024
c       042       043       029
d       023       026       024
e       095       059       044
f       013       017       017
g       014       017       016
i       132       124       076
j       017       029       028
k       033       034       031
l       073       041       039
m       032       030       029
n       055       067       058
o       057       047       029
p       022       024       024
r       039       054       084
s       037       040       038
t       026       043       038
u       076       076       050
v       010       014       013
x       008       012       015
y       004       002       158
z       009       010       010
Here are the three different rank orders:
dynamic: iaeul on'cr skmtd pbjgf vzxy
no-lujvo: iaune rotcl s'kmj dbpfg vxzy
with-lujvo: yarin uelts kcmoj 'pbdf gxvz
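For anyone who wants to rebuild these rank orders from the table, here is a small sketch. The figures are the dynamic column copied from the table above; the name rank_order and the grouping into blocks of five are just conveniences chosen to match the layout used here. Ties (there are none in the dynamic column, but the static columns have several) are broken by table order, which is an arbitrary choice and need not match the orders listed above.

```python
# Dynamic column of the table, copied by hand.
DYNAMIC = {
    "'": 45, "a": 105, "b": 21, "c": 42, "d": 23, "e": 95, "f": 13, "g": 14,
    "i": 132, "j": 17, "k": 33, "l": 73, "m": 32, "n": 55, "o": 57, "p": 22,
    "r": 39, "s": 37, "t": 26, "u": 76, "v": 10, "x": 8, "y": 4, "z": 9,
}

def rank_order(freqs, group=5):
    """Return the letters sorted from most to least frequent, in blocks of `group`.

    sorted() is stable, so letters with equal frequency keep the order
    in which they appear in `freqs` (an arbitrary tie-break).
    """
    letters = "".join(sorted(freqs, key=lambda ch: -freqs[ch]))
    return " ".join(letters[i:i + group] for i in range(0, len(letters), group))

if __name__ == "__main__":
    print("dynamic:", rank_order(DYNAMIC))
```

Run on the dynamic column, this reproduces the "dynamic" line given above.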
As you can see, the dynamic rank-ordering agrees fairly well with the
no-lujvo static rank-ordering, especially at the top and the bottom.
The with-lujvo rank-ordering puts "y" at the top, which reflects the fact
that the "lujvo-list" used to build it contained mostly proposals that
had never been used, many of them dating back to pre-Lojban days.
But otherwise it too is fairly sane.
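One way to put a number on "agrees fairly well" is a Spearman rank correlation between the orderings. This is not something I computed for the figures above; it is only a sketch of how the comparison could be made, using the three rank-order strings as listed. The function name spearman_rho is made up, and the simple formula below assumes each ordering has no ties, which holds for the strings as written even though some of the underlying frequencies are equal.

```python
# Rank orders copied from the three lines above.
DYNAMIC = "iaeul on'cr skmtd pbjgf vzxy"
NO_LUJVO = "iaune rotcl s'kmj dbpfg vxzy"
WITH_LUJVO = "yarin uelts kcmoj 'pbdf gxvz"

def spearman_rho(order_a, order_b):
    """Spearman rank correlation between two orderings of the same letters.

    Uses rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), which is valid when
    neither ranking contains ties.
    """
    a = order_a.replace(" ", "")
    b = order_b.replace(" ", "")
    if sorted(a) != sorted(b):
        raise ValueError("the two orderings must contain the same letters")
    n = len(a)
    position_in_b = {ch: i for i, ch in enumerate(b)}
    d_squared = sum((i - position_in_b[ch]) ** 2 for i, ch in enumerate(a))
    return 1 - 6 * d_squared / (n * (n * n - 1))

if __name__ == "__main__":
    print("dynamic vs no-lujvo:  ", round(spearman_rho(DYNAMIC, NO_LUJVO), 3))
    print("dynamic vs with-lujvo:", round(spearman_rho(DYNAMIC, WITH_LUJVO), 3))
```

A value near 1 means the two orderings nearly coincide. A single badly displaced letter such as "y" pulls the figure down sharply, since the differences are squared.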
As I said in the Subject header, all this is a first cut. We will need
our 50,000-word dictionary for honest static frequencies, and maybe
500,000 words of running text for honest dynamic frequencies. Watch
this space. :-)
--
John Cowan sharing account <lojbab@access.digex.net> for now
e'osai ko sarji la lojban.