Posts: 27 | Thanked: 35 times | Joined on Jan 2016 @ Sweden
#236
Originally Posted by eber42
From a quick look I see the following causes for the large size:
  • lots of words are repeated with prefixes such as dall', sull'. At the moment my model handles words separated by apostrophes or hyphens as single words, so words with different prefixes are treated as different words. The OKBoard roadmap contains an item for managing prefixes and suffixes (explicit ones with punctuation signs, or linked together as in German) as distinct words, but I don't know when (and if) I will work on it.
  • some words are repeated multiple times with weird capitalization: are these different words: Sull'Acclimatatele, sull'Acclimatatele, sull'acclimatatele ? At the moment words with different capitalizations are treated as different words (unless they are at the beginning of a sentence). But the case of words with two different capitalizations is not very well handled.
Hello Eber!
I have never heard those words before.
I can tell you that the forms dall' and sull' are certainly correct, if a bit too formulaic. They are "articulated prepositions" placed in front of nouns, so they should be treated as words in their own right. Example:

dall'anima
dall'oceano

The nouns are "anima" and "oceano", while "dall'" is the preposition. That does not justify creating a word for each preposition+word combination!
There are additional rules, naturally: for instance, that form is only used with words starting with a vowel...
Good that you are thinking of handling this situation.
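
To make the idea concrete, here is a minimal sketch of the kind of splitting I mean. It is not OKBoard code: the function name and the prefix list are made up for illustration, the list is far from complete, and it only handles the plain ' apostrophe.

```python
import re

# Hypothetical and incomplete list of elided Italian prefixes (assumption).
ELIDED_PREFIXES = {"dall", "sull", "dell", "nell", "all", "l", "un"}

# A run of letters, an apostrophe, then a word starting with a vowel.
_ELISION = re.compile(r"^([a-z]+)'([aeiouàèéìòù]\w*)$", re.IGNORECASE)


def split_elision(token):
    """Return [prefix + "'", word] for a known elided prefix, else [token]."""
    match = _ELISION.match(token)
    if match and match.group(1).lower() in ELIDED_PREFIXES:
        return [match.group(1) + "'", match.group(2)]
    return [token]


for token in ["dall'anima", "dall'oceano", "anima"]:
    print(token, "->", split_elision(token))
```

With something like this, "anima" and "oceano" would each be learned once, and "dall'" would be counted as a single separate word.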

As for the capitalization: I would not consider it common for a word to have several capitalised variants. Most words are either capitalised or not, so I'd prioritise the lower-case form when multiple variants are found.
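
A rough sketch of that rule, assuming the word counts are available as a simple word-to-count mapping (the function name and data layout are my own guesses, not how OKBoard stores things):

```python
from collections import Counter


def merge_case_variants(counts):
    """Fold capitalisation variants of each word into a single entry."""
    # Group every spelling under its lower-cased form.
    groups = {}
    for word, n in counts.items():
        groups.setdefault(word.lower(), Counter())[word] += n

    merged = Counter()
    for lower, variants in groups.items():
        # Prefer the lower-case spelling when it was seen at all;
        # otherwise keep the most frequent spelling (e.g. a proper noun).
        keep = lower if lower in variants else variants.most_common(1)[0][0]
        merged[keep] = sum(variants.values())
    return merged


counts = {"Sull'Acclimatatele": 1, "sull'Acclimatatele": 2,
          "sull'acclimatatele": 1, "Roma": 3}
print(merge_case_variants(counts))
```

Words that only ever appear capitalised, such as proper nouns, keep their capital this way.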


Spidernik84's text corpus only contains 315k words (only counting those which are also known by aspell), so my short-term suggestion is to add an option to provide a (smaller) dictionary instead of using aspell's, or to trust the input corpus to be flawless.

What do you think ?

Edit: ouch, spidernik84 was faster with wc
We can certainly try skipping aspell just for my language... I'm afraid of the results, though: spelling mistakes are definitely common in the corpus.
It's worth a shot, I'll see what happens. Thanks for your help.
 

The Following 2 Users Say Thank You to spidernik84 For This Useful Post: