Posts: 86 | Thanked: 362 times | Joined on Dec 2007 @ Paris / France
#234
As discussed with spidernik84, the Italian aspell dictionary contains 34M words (with affix expansion support that was added for Spanish).
Try this:
Code:
aspell -l it dump master | aspell -l it expand | wc -w
In the current process, aspell is used to filter out misspelled words (because the available texts sometimes contain errors).
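To illustrate that filtering step, here is a rough sketch: `aspell list` prints only the words it does *not* know, so its output can serve as a reject list. The file names and the stand-in word lists below are hypothetical, not part of the actual build scripts; in practice the reject list would come from `aspell -l it list < corpus.txt`.
Code:
```shell
# Stand-in corpus (one word per line) and a stand-in aspell reject list.
printf 'ciao\nerorre\nmondo\n' > corpus.txt
printf 'erorre\n' > rejects.txt
# Keep only corpus words that are NOT on the reject list:
# -F fixed strings, -v invert match, -x whole line, -f patterns from file.
grep -Fvxf rejects.txt corpus.txt > clean.txt
cat clean.txt
```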

Even if we fix the corpus reader script, the keyboard was not built to handle this volume: my largest language (French) contains ~100k words (and only 45k of them are used by the word prediction engine; the others are handled in "best effort" mode).

From a quick look I see the following causes for the large size:
  • lots of words are repeated with elided prefixes such as dall' and sull'. At the moment my model treats words joined by apostrophes or hyphens as single words, so the same word with different prefixes counts as several distinct words. The OKBoard roadmap includes an item for handling prefixes and suffixes (explicit ones marked with punctuation, or agglutinated as in German) as distinct words, but I don't know when (or if) I will work on it.
  • some words are repeated multiple times with odd capitalization: are Sull'Acclimatatele, sull'Acclimatatele and sull'acclimatatele really different words? At the moment words with different capitalizations are treated as different words (unless they appear at the beginning of a sentence), and the case of a word having two different capitalizations is not handled very well.
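As a rough sketch of how much these two effects could be collapsed, the pipeline below strips a leading elided prefix and lowercases everything before counting distinct words. The prefix-stripping regex is my own guess, not OKBoard's actual behaviour, and words.txt is a hypothetical file.
Code:
```shell
# The three capitalization variants from the post, one per line.
printf "Sull'Acclimatatele\nsull'Acclimatatele\nsull'acclimatatele\n" > words.txt
# Strip a leading "prefix'" (dall', sull', ...) and normalize case,
# then keep unique entries; the three variants collapse to one word.
sed "s/^[a-zA-Z]*'//" words.txt | tr '[:upper:]' '[:lower:]' | sort -u
```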


Spidernik84's text corpus only contains 315k distinct words (counting only those also known to aspell), so my short-term suggestion is to add an option to provide a (smaller) dictionary instead of using aspell's, or to simply trust the input corpus to be error-free.
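The smaller-dictionary option could be sketched as a simple intersection: keep only the corpus words that also appear in the user-provided list. The file names and word lists below are hypothetical stand-ins; `comm` requires both inputs to be sorted.
Code:
```shell
# Stand-in user-provided small dictionary and corpus word list.
printf 'casa\nsole\nmare\n' > dict-small.txt
printf 'sole\nxyzzy\nmare\n' > corpus-words.txt
sort dict-small.txt -o dict-small.txt
sort corpus-words.txt -o corpus-words.txt
# -12 suppresses lines unique to each file, leaving words present in both.
comm -12 dict-small.txt corpus-words.txt
```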

What do you think?

Edit: ouch, spidernik84 was faster with wc

The Following 5 Users Say Thank You to eber42 For This Useful Post: