Posts: 102 | Thanked: 187 times | Joined on Jan 2010
#204
@ssahla, @mautz, yes, a good mix of genres, plus a basic understanding of the lexicon and of the corpus data you have collected, can probably save some time and effort. For the corpus data, roughly 100 MB compressed is probably the minimum to aim for. Zipf's law makes it hard to find more than a fraction of the inflected forms in morphologically rich languages, so a good frequency dictionary could help if corpus data is hard to come by.
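To put a rough number on the Zipf's law point, here is a minimal sketch in Python. It assumes word forms follow an idealized Zipf distribution (probability proportional to 1/rank), which real corpora only approximate. The function names and the vocabulary and sample sizes are made up for illustration. It computes the expected share of distinct forms that show up at least once in a sample of a given size.

```python
def zipf_probs(vocab_size, exponent=1.0):
    """Probabilities of an idealized Zipf distribution over ranks 1..vocab_size.

    exponent=1.0 is the classic Zipf case; exponent=0.0 gives a uniform
    distribution, which is handy as a comparison baseline.
    """
    weights = [1.0 / rank ** exponent for rank in range(1, vocab_size + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def expected_coverage(vocab_size, sample_tokens, exponent=1.0):
    """Expected fraction of distinct forms seen at least once in the sample.

    A form with probability p is missed by all sample_tokens independent
    draws with probability (1 - p) ** sample_tokens.
    """
    probs = zipf_probs(vocab_size, exponent)
    seen = sum(1.0 - (1.0 - p) ** sample_tokens for p in probs)
    return seen / vocab_size


if __name__ == "__main__":
    # Hypothetical sizes: 100,000 distinct inflected forms, growing corpus.
    for tokens in (10_000, 100_000, 1_000_000):
        share = expected_coverage(100_000, tokens)
        print(f"{tokens:>9} tokens: {share:.1%} of forms expected to be seen")
```

Under these assumptions coverage grows slowly with corpus size, because most of the forms sit in the long low-frequency tail. That is why a frequency dictionary can fill gaps that more raw text would only fill slowly.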
 

The Following User Says Thank You to ljo For This Useful Post: