Posts: 27 | Thanked: 35 times | Joined on Jan 2016 @ Sweden
#241
Originally Posted by itdoesntmatt
Hello to you, and many thanks for your effort!
Ciao! No problem... it's getting more complicated than I thought.
I get what you mean now. It's an option, but this is up to Eber.

I tried the various approaches suggested here. I'll try further with this:

Code:
tr '[:upper:]' '[:lower:]' < sanitized.list > sanitized.lower
tr -s '[:space:]' '\n' < sanitized.lower | sort -u > sanitized.uniq
So, essentially: lowercase everything, put each word on its own line, sort, remove duplicates.
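For anyone who wants to try the same steps on a small sample first, here is a minimal sketch that chains them into one function. The name normalise_words is made up for illustration, and LC_ALL=C is my assumption to get a stable byte-wise sort order. Note that tr works on bytes in common implementations, so accented capitals may not be lowercased.

```shell
# Hypothetical helper (not from the original post): reads text on stdin,
# writes one lowercase word per line, sorted, without duplicates.
normalise_words() {
    tr '[:upper:]' '[:lower:]' |      # lowercase everything
        tr -s '[:space:]' '\n' |      # runs of whitespace become one newline
        LC_ALL=C sort -u              # sort and drop duplicates
}
```

Usage: printf 'Ciao ciao\nRoma  roma\n' | normalise_words prints ciao and roma, one per line.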

It is a bit better:
Code:
wc sanitized.uniq
 13615329  13615329 236920596 sanitized.uniq
I'll then fetch a list of Italian proper nouns (names of people and cities) and add them to the file, so as to preserve some basic capitalisation.
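One possible way to do that step, as a rough sketch only: instead of appending the names, swap each lowercase word for its capitalised form when it appears in the name list. The function name recapitalise and the file proper_nouns.list are hypothetical, and I am assuming one capitalised name per line. How awk's tolower handles accented letters depends on the awk implementation and locale.

```shell
# Hypothetical helper (not from the original post).
# $1 = proper-noun list, one capitalised name per line (assumed format)
# $2 = lowercased word list, one word per line
recapitalise() {
    awk 'FILENAME == ARGV[1] { proper[tolower($0)] = $0; next }  # load names
         { print (($0 in proper) ? proper[$0] : $0) }            # swap if known
        ' "$1" "$2"
}
```

Usage would be something like: recapitalise proper_nouns.list sanitized.uniq > sanitized.cased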

I'll try to generate the file again tomorrow. Any ideas or help with the dict are more than welcome.
 

The Following 2 Users Say Thank You to spidernik84 For This Useful Post: