eber42
|
2016-01-14, 19:26
|
Posts: 86 |
Thanked: 362 times |
Joined on Dec 2007
@ Paris / France
|
#191
|
2016-01-15, 10:00
|
Posts: 105 |
Thanked: 205 times |
Joined on Dec 2015
@ Spain
|
#192
|
Two issues found:
- My script uses aspell as the initial dictionary when building language files. It did not handle affix compression (I had never heard of it before), so it failed with the Spanish aspell dictionary, which explains the very low word count. This is now fixed in the git repository.
- Your corpus file contains some UTF-8 encoding errors and some characters outside the allowed range. The error count is very small (~5k errors in a 1GB file), so you can just use the simple cleaning script in okb-engine/tools/clean_corpus.py, which will remove them.
Code:
lbzip2 -d < corpus.txt.bz2 | clean_corpus.py | lbzip2 > new_corpus.txt.bz2
Traceback (most recent call last):
  File "/home/ferlanero/okb-engine-master/tools/clean_corpus.py", line 23, in <module>
    line = sys.stdin.readline()
  File "/usr/lib/python3.5/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8a in position 4650: invalid start byte
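The traceback shows the decode failing inside sys.stdin.readline(), i.e. before any cleaning code gets a chance to look at the line. A minimal sketch of a filter that avoids this by reading raw bytes and dropping whatever is not valid UTF-8 could look like the following. This is not the actual clean_corpus.py; the names clean_line and clean_stream are made up for illustration, and the control-character filter is only an assumed stand-in for the real "allowed range".

```python
import sys


def clean_line(raw: bytes) -> str:
    """Decode one raw line, silently dropping bytes that are not valid UTF-8."""
    text = raw.decode("utf-8", errors="ignore")
    # Assumption: control characters other than tab/newline are unwanted.
    # The real allowed range used by okb-engine is not known here.
    return "".join(ch for ch in text if ch in "\t\n" or ord(ch) >= 32)


def clean_stream(src, dst) -> None:
    """Copy binary stream src to dst line by line, cleaning each line."""
    # Reading bytes (not text) means a bad byte cannot raise UnicodeDecodeError.
    for raw in src:
        dst.write(clean_line(raw).encode("utf-8"))
```

To use it as a pipe filter in place of the script above, one would call clean_stream(sys.stdin.buffer, sys.stdout.buffer) at the end of the file.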
|
2016-01-15, 13:04
|
Posts: 105 |
Thanked: 205 times |
Joined on Dec 2015
@ Spain
|
#193
|
|
2016-01-15, 18:54
|
Posts: 86 |
Thanked: 362 times |
Joined on Dec 2007
@ Paris / France
|
#194
|
2016-01-16, 14:45
|
Posts: 105 |
Thanked: 205 times |
Joined on Dec 2015
@ Spain
|
#195
|
2016-01-17, 13:07
|
Posts: 105 |
Thanked: 205 times |
Joined on Dec 2015
@ Spain
|
#196
|
[ferlanero@ferlanero-XPS ~]$ export CORPUS_DIR=/home/ferlanero/okboard/langs
[ferlanero@ferlanero-XPS ~]$ export WORK_DIR=/home/ferlanero/okboard/langs
[ferlanero@ferlanero-XPS ~]$ cd /home/ferlanero/okb-engine-master/
[ferlanero@ferlanero-XPS okb-engine-master]$ db/build.sh es
Building for languages: es
~/okb-engine-master/ngrams ~/okb-engine-master/db
running build
running build_ext
running build
running build_ext
~/okb-engine-master/db
~/okb-engine-master/cluster ~/okb-engine-master/db
g++ -c -pipe -g -Wall -W -D_REENTRANT -fPIC -DQT_GUI_LIB -DQT_CORE_LIB -I. -isystem /usr/include/qt -isystem /usr/include/qt/QtGui -isystem /usr/include/qt/QtCore -Ibuild -I/usr/lib/qt/mkspecs/linux-g++ -o build/cluster.o cluster.cpp
g++ -o build/cluster build/cluster.o -lQt5Gui -lQt5Core -lGL -lpthread
~/okb-engine-master/db
«/home/ferlanero/okb-engine-master/db/lang-en.cf» -> «/home/ferlanero/okboard/langs/lang-en.cf»
«/home/ferlanero/okb-engine-master/db/lang-es.cf» -> «/home/ferlanero/okboard/langs/lang-es.cf»
«/home/ferlanero/okb-engine-master/db/lang-fr.cf» -> «/home/ferlanero/okboard/langs/lang-fr.cf»
«/home/ferlanero/okb-engine-master/db/lang-nl.cf» -> «/home/ferlanero/okboard/langs/lang-nl.cf»
«/home/ferlanero/okb-engine-master/db/add-words-fr.txt» -> «/home/ferlanero/okboard/langs/add-words-fr.txt»
«/home/ferlanero/okb-engine-master/db/db.version» -> «/home/ferlanero/okboard/langs/db.version»
make: '.depend-es' está actualizado.
( [ -f "add-words-es.txt" ] && cat "add-words-es.txt" ; aspell -l es dump master | aspell -l es expand | tr ' ' '\n') | sort | uniq > es-full.dict
lbzip2 -d < /home/ferlanero/okboard/langs/corpus-es.txt.bz2 | /home/ferlanero/okb-engine-master/db/../tools/corpus-splitter.pl 200 50 es-learn.tmp.bz2 es-test.tmp.bz2
mv -vf es-learn.tmp.bz2 es-learn.txt.bz2
«es-learn.tmp.bz2» -> «es-learn.txt.bz2»
mv -vf es-test.tmp.bz2 es-test.txt.bz2
«es-test.tmp.bz2» -> «es-test.txt.bz2»
set -o pipefail ; lbzip2 -d < es-learn.txt.bz2 | /home/ferlanero/okb-engine-master/db/../tools/import_corpus.py es-full.dict | sort -rn | lbzip2 -9 > grams-es-full.csv.bz2.tmp
[5] 18889/885412 words, 315866 n-grams, read 1 MB
...
[1155] 135339/885412 words, 22358980 n-grams, read 244 MB
Traceback (most recent call last):
  File "/home/ferlanero/okb-engine-master/db/../tools/import_corpus.py", line 209, in <module>
    ci.parse_line(line)
  File "/home/ferlanero/okb-engine-master/db/../tools/import_corpus.py", line 76, in parse_line
    if re.match(r'[\.:;\!\?]', mo.group(0)): self.next_sentence()
  File "/home/ferlanero/okb-engine-master/db/../tools/import_corpus.py", line 167, in next_sentence
    self.count(roll)
  File "/home/ferlanero/okb-engine-master/db/../tools/import_corpus.py", line 180, in count
    self.count2(elts)
  File "/home/ferlanero/okb-engine-master/db/../tools/import_corpus.py", line 189, in count2
    if key not in self.grams: self.grams[key] = 0
MemoryError
/home/ferlanero/okb-engine-master/db/makefile:43: fallo en las instrucciones para el objetivo 'grams-es-full.csv.bz2'
make: *** [grams-es-full.csv.bz2] Error 1
|
2016-01-17, 18:16
|
Posts: 105 |
Thanked: 205 times |
Joined on Dec 2015
@ Spain
|
#197
|
[...]
[1760] 30000/30000 words, 16164573 n-grams, read 308 MB
[1765] 30000/30000 words, 16198654 n-grams, read 309 MB
mv -f grams-es-learn.csv.bz2.tmp grams-es-learn.csv.bz2
Computing clusters for language es. Please make some coffee ... (logs can be found in clusters-es.log)
set -o pipefail ; lbzip2 -d < grams-es-learn.csv.bz2 | sort -rn | sed -n "1,13500000 p" \
 | /home/ferlanero/okb-engine-master/db/../tools/cluster -n 10 -o clusters-es.tmp > clusters-es.log 2>&1
/bin/sh: línea 1:
 1870 Hecho lbzip2 -d < grams-es-learn.csv.bz2
 1871 Tubería rota | sort -rn
 1872 Tubería rota | sed -n "1,13500000 p"
 1873 Terminado (killed) | /home/ferlanero/okb-engine-master/db/../tools/cluster -n 10 -o clusters-es.tmp > clusters-es.log 2>&1
/home/ferlanero/okb-engine-master/db/makefile:86: fallo en las instrucciones para el objetivo 'clusters-es.txt'
make: *** [clusters-es.txt] Error 137
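A note on reading "Error 137": by the usual shell convention, an exit status above 128 means the process was ended by a signal, and the signal number is the status minus 128. So 137 corresponds to signal 9 (SIGKILL), which matches the "Terminado (killed)" line for the cluster process. On Linux a SIGKILL during a memory-heavy step is often the kernel's out-of-memory killer, but that is only a guess here, since the log does not say who sent the signal. A small sketch of the decoding (the function name describe_exit_status is made up for illustration):

```python
import signal


def describe_exit_status(status: int) -> str:
    """Interpret a shell-style exit status (values above 128 mean 'killed by signal')."""
    if status > 128:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"signal {signum}"
        return f"killed by {name}"
    return f"exited with status {status}"


print(describe_exit_status(137))
```

The "Error 1" from the earlier build falls in the other branch: an ordinary non-zero exit, there caused by the Python MemoryError.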
|
2016-01-17, 19:17
|
Posts: 102 |
Thanked: 187 times |
Joined on Jan 2010
|
#198
|
Another error. I'm posting it in case it helps. The error from the previous post was solved by reducing the size of the corpus file. Now this is the new error:
Code:
[...]
[1760] 30000/30000 words, 16164573 n-grams, read 308 MB
[1765] 30000/30000 words, 16198654 n-grams, read 309 MB
mv -f grams-es-learn.csv.bz2.tmp grams-es-learn.csv.bz2
Computing clusters for language es. Please make some coffee ... (logs can be found in clusters-es.log)
set -o pipefail ; lbzip2 -d < grams-es-learn.csv.bz2 | sort -rn | sed -n "1,13500000 p" \
 | /home/ferlanero/okb-engine-master/db/../tools/cluster -n 10 -o clusters-es.tmp > clusters-es.log 2>&1
/bin/sh: línea 1:
 1870 Hecho lbzip2 -d < grams-es-learn.csv.bz2
 1871 Tubería rota | sort -rn
 1872 Tubería rota | sed -n "1,13500000 p"
 1873 Terminado (killed) | /home/ferlanero/okb-engine-master/db/../tools/cluster -n 10 -o clusters-es.tmp > clusters-es.log
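For anyone hitting the same MemoryError: the post above does not say how the corpus file was reduced, so the following is only one possible way to do it, sketched under that assumption. It copies whole lines from a bz2 corpus into a new bz2 file until a byte budget (counted on the uncompressed text) is reached. The function name truncate_corpus and the file names in the usage note are hypothetical.

```python
import bz2


def truncate_corpus(src_path: str, dst_path: str, max_bytes: int) -> int:
    """Copy whole lines from src to dst, stopping before max_bytes
    of uncompressed text would be exceeded. Returns the bytes written."""
    written = 0
    with bz2.open(src_path, "rb") as src, bz2.open(dst_path, "wb") as dst:
        for line in src:
            if written + len(line) > max_bytes:
                break
            dst.write(line)
            written += len(line)
    return written
```

For example, truncate_corpus("corpus-es.txt.bz2", "corpus-es-small.txt.bz2", 300 * 1024 * 1024) would keep roughly the first 300 MB of text. Keeping only the head of the file may skew the word statistics if the corpus is ordered by source, so sampling lines throughout the file could be a better choice.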
|
2016-01-17, 19:41
|
Posts: 105 |
Thanked: 205 times |
Joined on Dec 2015
@ Spain
|
#199
|
|
2016-01-17, 22:08
|
Posts: 102 |
Thanked: 187 times |
Joined on Jan 2010
|
#200
|
Tags |
bettertxtentry, huntnpeck sucks, okboard, sailfish, swype |
|