ljo
|
2016-01-25
, 11:33
|
Posts: 102 |
Thanked: 187 times |
Joined on Jan 2010
|
#231
|
|
2016-01-25
, 11:49
|
Posts: 105 |
Thanked: 205 times |
Joined on Dec 2015
@ Spain
|
#232
|
|
2016-01-25
, 20:02
|
Posts: 27 |
Thanked: 35 times |
Joined on Jan 2016
@ Sweden
|
#233
|
@spidernik84 et al., this should rather be between 0.7 and 1.8 million word forms, but not much more, based on the 92034 stems (roughly what we count as words). That is about the size of a standard working vocabulary in other Latin-script languages such as French (0.63 million aspell word forms). So something is wrong with the assumptions in the expansion processing.
nico@hendrix:~/aspell/aspell6-it-2.4-20070901-0$ aspell -l en dump master | aspell -l en expand | wc
119789 119789 1153336
nico@hendrix:~/aspell/aspell6-it-2.4-20070901-0$ aspell -l it dump master | aspell -l it expand | wc
95193 36636439 655315062
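To put the two wc results side by side, here is a small sketch that divides the word count by the line count. It assumes that each output line of `aspell expand` holds one dictionary entry followed by its expansions, and that the three wc columns are lines, words and bytes; the dictionary and function names are made up for the illustration.

```python
# Figures copied from the wc output above: (lines, words, bytes).
WC_OUTPUT = {
    "en": (119789, 119789, 1153336),
    "it": (95193, 36636439, 655315062),
}


def forms_per_line(lines, words):
    """Average number of whitespace-separated word forms per output line."""
    return words / lines


for lang, (lines, words, _size) in WC_OUTPUT.items():
    ratio = forms_per_line(lines, words)
    print(f"{lang}: {lines} lines, {words} word forms, {ratio:.1f} forms per line")
```

Under that assumption the English list has exactly one form per line, while the Italian one averages roughly 385 forms per line, which is where the 36 million figure comes from.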
The Following 3 Users Say Thank You to spidernik84 For This Useful Post: | ||
|
2016-01-25
, 20:06
|
Posts: 86 |
Thanked: 362 times |
Joined on Dec 2007
@ Paris / France
|
#234
|
aspell -l it dump master | aspell -l it expand | wc -w
The Following 5 Users Say Thank You to eber42 For This Useful Post: | ||
|
2016-01-25
, 20:17
|
Posts: 102 |
Thanked: 187 times |
Joined on Jan 2010
|
#235
|
The Following 4 Users Say Thank You to ljo For This Useful Post: | ||
|
2016-01-25
, 20:24
|
Posts: 27 |
Thanked: 35 times |
Joined on Jan 2016
@ Sweden
|
#236
|
From a quick look I see the following causes for the large size:
- Lots of words are repeated with prefixes such as dall' and sull'. At the moment my model handles words joined by apostrophes or hyphens as single words, so the same word with different prefixes is treated as several different words. The OKBoard roadmap contains an item for managing prefixes and suffixes (explicit ones with punctuation marks, or linked together as in German) as distinct words, but I don't know when (or if) I will work on it.
- Some words are repeated multiple times with weird capitalization. Are these different words: Sull'Acclimatatele, sull'Acclimatatele, sull'acclimatatele? At the moment words with different capitalizations are treated as different words (unless they are at the beginning of a sentence), but a word that appears with two different capitalizations is not handled very well.
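To illustrate the two causes above, here is a minimal sketch of how splitting elided prefixes and folding case would collapse such variants. This is not OKBoard code: the prefix list is a hypothetical, incomplete sample, and lowercasing everything is a simplification that would also merge proper nouns with common words.

```python
from typing import Iterable, List, Set

# Hypothetical, incomplete sample of elided Italian prefixes (not from OKBoard).
ELIDED_PREFIXES = ("dall'", "sull'", "dell'", "nell'", "all'", "l'", "un'")


def split_prefix(token: str) -> List[str]:
    """Split "sull'acclimatatele" into ["sull'", "acclimatatele"]."""
    lowered = token.lower()
    for prefix in ELIDED_PREFIXES:
        if lowered.startswith(prefix) and len(token) > len(prefix):
            return [token[:len(prefix)], token[len(prefix):]]
    return [token]


def unique_forms(tokens: Iterable[str]) -> Set[str]:
    """Collect distinct forms after prefix splitting and case folding."""
    seen: Set[str] = set()
    for token in tokens:
        seen.update(part.lower() for part in split_prefix(token))
    return seen


variants = ["Sull'Acclimatatele", "sull'Acclimatatele",
            "sull'acclimatatele", "dall'acclimatatele"]
print(sorted(unique_forms(variants)))
```

With this treatment the four variants reduce to three entries (two prefixes and one word) instead of four separate words, and each additional prefix costs one entry rather than one per word.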
Spidernik84's text corpus only contains 315k words (counting only those that are also known to aspell), so my short-term suggestion is to add an option to provide a (smaller) dictionary instead of using aspell's, or to trust the input corpus to be flawless.
What do you think?
Edit: ouch, spidernik84 was faster with wc
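As a rough illustration of the "smaller dictionary" idea, the sketch below keeps only the corpus tokens that also appear in a known word list. The function name, the punctuation set and the `min_count` parameter are assumptions made for the example, not an existing OKBoard option.

```python
from collections import Counter

# Punctuation stripped from token edges; apostrophes and hyphens are kept.
EDGE_PUNCTUATION = ".,;:!?\"()"


def corpus_vocabulary(corpus_text, known_words, min_count=1):
    """Return corpus tokens that are in known_words and occur at least min_count times."""
    tokens = (raw.strip(EDGE_PUNCTUATION) for raw in corpus_text.split())
    counts = Counter(token for token in tokens if token)
    return {word for word, count in counts.items()
            if word in known_words and count >= min_count}


corpus = "Il gatto dorme. Il gatto mangia, il cane xyzzy dorme."
known = {"il", "Il", "gatto", "dorme", "mangia", "cane", "corre"}
print(sorted(corpus_vocabulary(corpus, known)))
print(sorted(corpus_vocabulary(corpus, known, min_count=2)))
```

The resulting set is bounded by the corpus size rather than by the expanded aspell list, which is the point of the suggestion; raising `min_count` would additionally drop rare tokens that may be typos.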
|
2016-01-25
, 20:31
|
Posts: 102 |
Thanked: 187 times |
Joined on Jan 2010
|
#237
|
1) But the case of words with two different capitalization is not very well handled.
...
2) so my short term suggestion is to add an option to provide a (smaller) dictionary instead of using aspell's one or to trust the input corpus to be flawless.
What do you think ?
|
2016-01-25
, 20:50
|
Posts: 529 |
Thanked: 988 times |
Joined on Mar 2015
|
#238
|
|
2016-01-25
, 21:01
|
Posts: 27 |
Thanked: 35 times |
Joined on Jan 2016
@ Sweden
|
#239
|
Sorry guys, I don't know how many of you are Italian, but I am.
Dall', sull' and similar words could just be inserted as single words.
When you write sentences you actually leave a space between the preposition and the following word.
|
2016-01-25
, 21:08
|
Posts: 529 |
Thanked: 988 times |
Joined on Mar 2015
|
#240
|
The Following User Says Thank You to itdoesntmatt For This Useful Post: | ||
Tags |
bettertxtentry, huntnpeck sucks, okboard, sailfish, swype |
|