Poll: What advanced text entry method(s) would you like to see on Sailfish?
Posts: 86 | Thanked: 362 times | Joined on Dec 2007 @ Paris / France
#191
Originally Posted by ferlanero View Post
I sent you a PM with the input file.
Two issues found:
  • My script uses aspell as the initial dictionary when building language files. It did not handle affix compression (I had never heard of this before), so it failed with the Spanish aspell dictionary (which explains the very low word count). This is now fixed in the git repository.
  • Your corpus file contains some UTF-8 encoding errors and some characters outside the allowed range. The error count is very small (~5k errors in a 1 GB file), so you can just use the simple cleaning script in okb-engine/tools/clean_corpus.py, which will remove these (a rough sketch of the idea follows the command below):
    Code:
    lbzip2 -d < corpus.txt.bz2 | clean_corpus.py | lbzip2 > new_corpus.txt.bz2
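For illustration, here is a rough sketch of what such a cleaning pass amounts to (this is not the actual clean_corpus.py, just the idea: decode leniently so invalid byte sequences are dropped, then strip characters outside the allowed range):

Code:
#!/usr/bin/env python3
# Sketch only, not okb-engine/tools/clean_corpus.py: read the corpus on stdin,
# drop bytes that are not valid UTF-8, remove control characters other than
# newline/tab, and write the cleaned text to stdout.
import sys
import unicodedata

for raw in sys.stdin.buffer:
    # errors="ignore" silently drops the few invalid byte sequences
    line = raw.decode("utf-8", errors="ignore")
    line = "".join(ch for ch in line
                   if ch in "\n\t" or unicodedata.category(ch)[0] != "C")
    sys.stdout.write(line)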
 

The Following 5 Users Say Thank You to eber42 For This Useful Post:
Posts: 105 | Thanked: 205 times | Joined on Dec 2015 @ Spain
#192
Originally Posted by eber42 View Post
Two issues found:
  • My script uses aspell as the initial dictionary when building language files. It did not handle affix compression (I had never heard of this before), so it failed with the Spanish aspell dictionary (which explains the very low word count). This is now fixed in the git repository.
  • Your corpus file contains some UTF-8 encoding errors and some characters outside the allowed range. The error count is very small (~5k errors in a 1 GB file), so you can just use the simple cleaning script in okb-engine/tools/clean_corpus.py, which will remove these:
    Code:
    lbzip2 -d < corpus.txt.bz2 | clean_corpus.py | lbzip2 > new_corpus.txt.bz2
Hi eber42

Thanks for your response.

Just one question: when I use your cleaning tool on the possible UTF-8 errors, it says:

Code:
Traceback (most recent call last):
  File "/home/ferlanero/okb-engine-master/tools/clean_corpus.py", line 23, in <module>
    line = sys.stdin.readline()
  File "/usr/lib/python3.5/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8a in position 4650: invalid start byte
And the process stops. Do you know how to fix it? By the way, I'm continuing with my investigations...

Thanks!
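For reference, byte 0x8a can never start a UTF-8 sequence, so a strict decode fails exactly like this; a minimal reproduction, independent of okb-engine:

Code:
# Minimal reproduction of the error above: 0x8a is not a valid UTF-8 start byte,
# so a strict decode raises UnicodeDecodeError.
try:
    b"\x8a".decode("utf-8")
except UnicodeDecodeError as e:
    print(e)  # 'utf-8' codec can't decode byte 0x8a in position 0: invalid start byte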
 
Posts: 105 | Thanked: 205 times | Joined on Dec 2015 @ Spain
#193
I think I have solved the previous problem... I'll let you know later.
 
Posts: 86 | Thanked: 362 times | Joined on Dec 2007 @ Paris / France
#194
Originally Posted by ferlanero View Post
Hi eber42
Just one question: when I use your cleaning tool on the possible UTF-8 errors, it says:
You have to fetch the latest version from the git repository. I committed a fix yesterday.
 

The Following 3 Users Say Thank You to eber42 For This Useful Post:
Posts: 105 | Thanked: 205 times | Joined on Dec 2015 @ Spain
#195
I already did that, and now everything is working. I'm now working hard on the Spanish language. Thanks!
 

The Following 2 Users Say Thank You to ferlanero For This Useful Post:
Posts: 105 | Thanked: 205 times | Joined on Dec 2015 @ Spain
#196
During the process I get this error. Any ideas?

Code:
[ferlanero@ferlanero-XPS ~]$ export CORPUS_DIR=/home/ferlanero/okboard/langs
[ferlanero@ferlanero-XPS ~]$ export WORK_DIR=/home/ferlanero/okboard/langs
[ferlanero@ferlanero-XPS ~]$ cd /home/ferlanero/okb-engine-master/
[ferlanero@ferlanero-XPS okb-engine-master]$ db/build.sh es
Building for languages:  es
~/okb-engine-master/ngrams ~/okb-engine-master/db
running build
running build_ext
running build
running build_ext
~/okb-engine-master/db
~/okb-engine-master/cluster ~/okb-engine-master/db
g++ -c -pipe -g -Wall -W -D_REENTRANT -fPIC -DQT_GUI_LIB -DQT_CORE_LIB -I. -isystem /usr/include/qt -isystem /usr/include/qt/QtGui -isystem /usr/include/qt/QtCore -Ibuild -I/usr/lib/qt/mkspecs/linux-g++ -o build/cluster.o cluster.cpp
g++  -o build/cluster build/cluster.o   -lQt5Gui -lQt5Core -lGL -lpthread 
~/okb-engine-master/db
«/home/ferlanero/okb-engine-master/db/lang-en.cf» -> «/home/ferlanero/okboard/langs/lang-en.cf»
«/home/ferlanero/okb-engine-master/db/lang-es.cf» -> «/home/ferlanero/okboard/langs/lang-es.cf»
«/home/ferlanero/okb-engine-master/db/lang-fr.cf» -> «/home/ferlanero/okboard/langs/lang-fr.cf»
«/home/ferlanero/okb-engine-master/db/lang-nl.cf» -> «/home/ferlanero/okboard/langs/lang-nl.cf»
«/home/ferlanero/okb-engine-master/db/add-words-fr.txt» -> «/home/ferlanero/okboard/langs/add-words-fr.txt»
«/home/ferlanero/okb-engine-master/db/db.version» -> «/home/ferlanero/okboard/langs/db.version»
make: '.depend-es' está actualizado.
( [ -f "add-words-es.txt" ] && cat "add-words-es.txt" ; aspell -l es dump master | aspell -l es expand | tr ' ' '\n') | sort | uniq > es-full.dict
lbzip2 -d < /home/ferlanero/okboard/langs/corpus-es.txt.bz2 | /home/ferlanero/okb-engine-master/db/../tools/corpus-splitter.pl 200 50 es-learn.tmp.bz2 es-test.tmp.bz2
mv -vf es-learn.tmp.bz2 es-learn.txt.bz2
«es-learn.tmp.bz2» -> «es-learn.txt.bz2»
mv -vf es-test.tmp.bz2 es-test.txt.bz2
«es-test.tmp.bz2» -> «es-test.txt.bz2»
set -o pipefail ; lbzip2 -d < es-learn.txt.bz2 | /home/ferlanero/okb-engine-master/db/../tools/import_corpus.py es-full.dict | sort -rn | lbzip2 -9 > grams-es-full.csv.bz2.tmp
[5] 18889/885412 words, 315866 n-grams, read 1 MB
...
[1155] 135339/885412 words, 22358980 n-grams, read 244 MB
Traceback (most recent call last):
  File "/home/ferlanero/okb-engine-master/db/../tools/import_corpus.py", line 209, in <module>
    ci.parse_line(line)
  File "/home/ferlanero/okb-engine-master/db/../tools/import_corpus.py", line 76, in parse_line
    if re.match(r'[\.:;\!\?]', mo.group(0)): self.next_sentence()
  File "/home/ferlanero/okb-engine-master/db/../tools/import_corpus.py", line 167, in next_sentence
    self.count(roll)
  File "/home/ferlanero/okb-engine-master/db/../tools/import_corpus.py", line 180, in count
    self.count2(elts)
  File "/home/ferlanero/okb-engine-master/db/../tools/import_corpus.py", line 189, in count2
    if key not in self.grams: self.grams[key] = 0
MemoryError
/home/ferlanero/okb-engine-master/db/makefile:43: fallo en las instrucciones para el objetivo 'grams-es-full.csv.bz2'
make: *** [grams-es-full.csv.bz2] Error 1
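The traceback shows where the memory goes: import_corpus.py keeps every n-gram count in an in-memory dict (self.grams), so with a large corpus the table eventually exhausts RAM. A hypothetical illustration of the pattern (not the okb-engine code) with one common workaround, pruning rare entries to bound the working set:

Code:
# Hypothetical illustration (not import_corpus.py itself): counting n-grams in a
# plain in-memory dict keeps every distinct key resident, which is what blows up
# on a large corpus.
from collections import Counter

grams = Counter()

def count_ngrams(words, n=3):
    # one entry per distinct n-gram -> memory grows with vocabulary and context
    for i in range(len(words) - n + 1):
        grams[tuple(words[i:i + n])] += 1

def prune(min_count=2):
    # common workaround: periodically drop n-grams seen fewer than min_count
    # times, trading a little accuracy for a bounded working set
    for key in [k for k, v in grams.items() if v < min_count]:
        del grams[key]

count_ngrams("the quick brown fox jumps over the lazy dog".split())
prune()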
 
Posts: 105 | Thanked: 205 times | Joined on Dec 2015 @ Spain
#197
Another error. I'm posting it in case it helps. The error from the previous post was solved by reducing the size of the corpus file. This is the new error:

Code:
[...]
[1760] 30000/30000 words, 16164573 n-grams, read 308 MB
[1765] 30000/30000 words, 16198654 n-grams, read 309 MB
mv -f grams-es-learn.csv.bz2.tmp grams-es-learn.csv.bz2
Computing clusters for language es. Please make some coffee ...
 (logs can be found in clusters-es.log)
set -o pipefail ; lbzip2 -d < grams-es-learn.csv.bz2 | sort -rn | sed -n "1,13500000 p" \
 | /home/ferlanero/okb-engine-master/db/../tools/cluster -n 10 -o clusters-es.tmp > clusters-es.log 2>&1
/bin/sh: línea 1:  1870 Hecho                   lbzip2 -d < grams-es-learn.csv.bz2
      1871 Tubería rota           | sort -rn
      1872 Tubería rota           | sed -n "1,13500000 p"
      1873 Terminado (killed)      | /home/ferlanero/okb-engine-master/db/../tools/cluster -n 10 -o clusters-es.tmp > clusters-es.log 2>&1
/home/ferlanero/okb-engine-master/db/makefile:86: fallo en las instrucciones para el objetivo 'clusters-es.txt'
make: *** [clusters-es.txt] Error 137
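For what it's worth, Error 137 means the cluster tool exited with status 128 + 9, i.e. it was killed with SIGKILL ("Terminado (killed)" in the output), which on Linux usually means the kernel's OOM killer stepped in. A quick way to decode such a status:

Code:
# Decode a shell exit status above 128 into the signal that terminated the process.
import signal

status = 137                                   # what make reported
if status > 128:
    print(signal.Signals(status - 128).name)   # prints: SIGKILL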
 
Posts: 102 | Thanked: 187 times | Joined on Jan 2010
#198
Originally Posted by ferlanero View Post
Another error. I'm posting it in case it helps. The error from the previous post was solved by reducing the size of the corpus file.

Did you run make clean (or use the -r flag) after the run that hit the memory problem?
 
Posts: 105 | Thanked: 205 times | Joined on Dec 2015 @ Spain
#199
Originally Posted by ljo View Post
Did you run make clean (or use the -r flag) after the run that hit the memory problem?
I think not. Do I have to do that? I'm now testing with a smaller corpus file... maybe the problem is due to having only 4 GB of RAM? In any case, I have already ordered 16 GB of RAM to work with okboard.

@ljo & eber42: how much RAM do your PCs have for processing the corpus files? And how large is your corpus.txt file before compressing it with bz2? Mine is almost 3 GB...
 
Posts: 102 | Thanked: 187 times | Joined on Jan 2010
#200
@ferlanero, these are userland processes, so they will usually not use more than 4 GB each (I reran it and it never went above 6.8 GB for your full data). But of course it is beneficial if you have room for the OS on top of that. It seems to be quite sensitive to fragmentation, so a fresh restart won't hurt before running a corpus of this size (1.5 GB uncompressed, 470 MB compressed). I don't see why you cannot just cut the size in half for the two largest corpora (leipzig and wiki). I didn't see much difference with that size compared to using all of your data. There was still some noise which could be scrapped.
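If you want to cut a corpus in half, a small filter along these lines will do it (a hypothetical helper, not part of okb-engine):

Code:
#!/usr/bin/env python3
# Hypothetical helper (not part of okb-engine): keep roughly half of the lines
# of a corpus read on stdin, e.g.
#   lbzip2 -d < corpus-es.txt.bz2 | ./halve_corpus.py | lbzip2 -9 > corpus-es-half.txt.bz2
import sys
import random

random.seed(0)        # deterministic, so repeated runs keep the same subset
KEEP_FRACTION = 0.5   # "cut the size in half"

for line in sys.stdin.buffer:
    if random.random() < KEEP_FRACTION:
        sys.stdout.buffer.write(line)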

Last edited by ljo; 2016-01-18 at 00:03.
 

The Following 2 Users Say Thank You to ljo For This Useful Post: