Posts: 27 | Thanked: 35 times | Joined on Jan 2016 @ Sweden
#233
Originally Posted by ljo
@spidernik84 et al, this should rather be between 0.7 and 1.8 million word forms, but not much more, based on the 92034 stems (roughly what we count as words), which is about the size of a standard working vocabulary of other Latin-script languages like French (0.63 million aspell word forms). So there is something wrong with the assumptions in the expansion processing.
I think you are right. Another generation attempt just failed: it ran out of 20 GB of RAM plus 5 GB of swap.
I compared the expansion with the English dictionary, and this is what I see:

Code:
nico@hendrix:~/aspell/aspell6-it-2.4-20070901-0$ aspell -l en dump master | aspell -l en expand | wc
 119789  119789 1153336
nico@hendrix:~/aspell/aspell6-it-2.4-20070901-0$ aspell -l it dump master | aspell -l it expand | wc
  95193 36636439 655315062
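The number of words generated for the Italian language is INSANE: about 36.6 million word forms from 95193 lines, roughly 385 per line, while the English dump comes out at exactly one word per line.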
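To see which entries are responsible, something like the following might help. It is only a sketch: top_expanders is a helper name I made up, and it assumes that expand prints every expansion of an entry on the same line, which is what the line and word counts from wc suggest.

```shell
# Hypothetical helper: reads `aspell expand` output on stdin, assuming one
# dictionary entry per line with its expansions separated by whitespace.
# Prints the N lines with the most word forms as "<count> <first word>".
top_expanders() {
    # NF is the number of whitespace-separated words on the line.
    awk '{ print NF, $1 }' | sort -k1,1nr | head -n "${1:-10}"
}

# Usage (assumes the Italian aspell dictionary is installed):
#   aspell -l it dump master | aspell -l it expand | top_expanders 20
```

If a handful of entries account for most of the 36 million words, their affix flags would be the first place to look.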
You seem to know a lot about this. Do you have any idea what can be done to keep the dictionary smaller? I've been searching for alternative aspell dictionaries with no luck...

Thanks. I surely hope we don't need to rent a Cray cluster to generate this dict...
 

The Following 3 Users Say Thank You to spidernik84 For This Useful Post: