Poll: What advanced text entry method(s) would you like to see on Sailfish?
Posts: 86 | Thanked: 362 times | Joined on Dec 2007 @ Paris / France
#191
Originally Posted by ferlanero View Post
I sent you a PM with the input file.
Two issues found:
  • My script uses aspell as the initial dictionary when building language files. It did not handle affix compression (I had never heard of this before), so it failed with the Spanish aspell dictionary (which explains the very low word count). This is now fixed in the git repository.
  • Your corpus file contains some UTF-8 encoding errors and some characters outside the allowed range. The error count is very small (~5k errors in a 1 GB file), so you can just use the simple cleaning script in okb-engine/tools/clean_corpus.py, which will remove these (a rough sketch of the idea follows the command below):
    Code:
    lbzip2 -d < corpus.txt.bz2 | clean_corpus.py | lbzip2 > new_corpus.txt.bz2
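For illustration, here is a rough sketch of what such a cleaning pass amounts to (this is not the actual clean_corpus.py, just the idea: decode leniently so invalid byte sequences are dropped, then strip characters outside the allowed range):

Code:
#!/usr/bin/env python3
# Sketch only, not okb-engine/tools/clean_corpus.py: read the corpus on stdin,
# drop bytes that are not valid UTF-8, remove control characters other than
# newline/tab, and write the cleaned text to stdout.
import sys
import unicodedata

for raw in sys.stdin.buffer:
    # errors="ignore" silently drops the few invalid byte sequences
    line = raw.decode("utf-8", errors="ignore")
    line = "".join(ch for ch in line
                   if ch in "\n\t" or unicodedata.category(ch)[0] != "C")
    sys.stdout.write(line)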
 

The Following 5 Users Say Thank You to eber42 For This Useful Post:
Posts: 105 | Thanked: 205 times | Joined on Dec 2015 @ Spain
#192
Originally Posted by eber42 View Post
Two issues found:
  • My script uses aspell as the initial dictionary when building language files. It did not handle affix compression (I had never heard of this before), so it failed with the Spanish aspell dictionary (which explains the very low word count). This is now fixed in the git repository.
  • Your corpus file contains some UTF-8 encoding errors and some characters outside the allowed range. The error count is very small (~5k errors in a 1 GB file), so you can just use the simple cleaning script in okb-engine/tools/clean_corpus.py, which will remove these:
    Code:
    lbzip2 -d < corpus.txt.bz2 | clean_corpus.py | lbzip2 > new_corpus.txt.bz2
Hi eber42

Thanks for your response.

Just one question: when I use your cleaning tool on the possible UTF-8 errors, it says:

Code:
Traceback (most recent call last):
  File "/home/ferlanero/okb-engine-master/tools/clean_corpus.py", line 23, in <module>
    line = sys.stdin.readline()
  File "/usr/lib/python3.5/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8a in position 4650: invalid start byte
And the process stops. Do you know how to fix it? By the way, I'm continuing with my investigations...

Thanks!
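For reference, byte 0x8a can never start a UTF-8 sequence, so a strict decode fails exactly like this; a minimal reproduction, independent of okb-engine:

Code:
# Minimal reproduction of the error above: 0x8a is not a valid UTF-8 start byte,
# so a strict decode raises UnicodeDecodeError.
try:
    b"\x8a".decode("utf-8")
except UnicodeDecodeError as e:
    print(e)  # 'utf-8' codec can't decode byte 0x8a in position 0: invalid start byte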
 
Posts: 105 | Thanked: 205 times | Joined on Dec 2015 @ Spain
#193
I think I have solved the previous problem... I'll let you know later.
 
Posts: 86 | Thanked: 362 times | Joined on Dec 2007 @ Paris / France
#194
Originally Posted by ferlanero View Post
Hi eber42
Just one question: when I use your cleaning tool on the possible UTF-8 errors, it says:
You have to fetch the latest version from the git repository. I committed a fix yesterday.
 

The Following 3 Users Say Thank You to eber42 For This Useful Post:
Posts: 105 | Thanked: 205 times | Joined on Dec 2015 @ Spain
#195
I already did that, and now everything is working. I'm now working hard on the Spanish language. Thanks!
 

The Following 2 Users Say Thank You to ferlanero For This Useful Post:
Posts: 105 | Thanked: 205 times | Joined on Dec 2015 @ Spain
#196
During the process I get this error. Any ideas?

Code:
[ferlanero@ferlanero-XPS ~]$ export CORPUS_DIR=/home/ferlanero/okboard/langs
[ferlanero@ferlanero-XPS ~]$ export WORK_DIR=/home/ferlanero/okboard/langs
[ferlanero@ferlanero-XPS ~]$ cd /home/ferlanero/okb-engine-master/
[ferlanero@ferlanero-XPS okb-engine-master]$ db/build.sh es
Building for languages:  es
~/okb-engine-master/ngrams ~/okb-engine-master/db
running build
running build_ext
running build
running build_ext
~/okb-engine-master/db
~/okb-engine-master/cluster ~/okb-engine-master/db
g++ -c -pipe -g -Wall -W -D_REENTRANT -fPIC -DQT_GUI_LIB -DQT_CORE_LIB -I. -isystem /usr/include/qt -isystem /usr/include/qt/QtGui -isystem /usr/include/qt/QtCore -Ibuild -I/usr/lib/qt/mkspecs/linux-g++ -o build/cluster.o cluster.cpp
g++  -o build/cluster build/cluster.o   -lQt5Gui -lQt5Core -lGL -lpthread 
~/okb-engine-master/db
«/home/ferlanero/okb-engine-master/db/lang-en.cf» -> «/home/ferlanero/okboard/langs/lang-en.cf»
«/home/ferlanero/okb-engine-master/db/lang-es.cf» -> «/home/ferlanero/okboard/langs/lang-es.cf»
«/home/ferlanero/okb-engine-master/db/lang-fr.cf» -> «/home/ferlanero/okboard/langs/lang-fr.cf»
«/home/ferlanero/okb-engine-master/db/lang-nl.cf» -> «/home/ferlanero/okboard/langs/lang-nl.cf»
«/home/ferlanero/okb-engine-master/db/add-words-fr.txt» -> «/home/ferlanero/okboard/langs/add-words-fr.txt»
«/home/ferlanero/okb-engine-master/db/db.version» -> «/home/ferlanero/okboard/langs/db.version»
make: '.depend-es' está actualizado.
( [ -f "add-words-es.txt" ] && cat "add-words-es.txt" ; aspell -l es dump master | aspell -l es expand | tr ' ' '\n') | sort | uniq > es-full.dict
lbzip2 -d < /home/ferlanero/okboard/langs/corpus-es.txt.bz2 | /home/ferlanero/okb-engine-master/db/../tools/corpus-splitter.pl 200 50 es-learn.tmp.bz2 es-test.tmp.bz2
mv -vf es-learn.tmp.bz2 es-learn.txt.bz2
«es-learn.tmp.bz2» -> «es-learn.txt.bz2»
mv -vf es-test.tmp.bz2 es-test.txt.bz2
«es-test.tmp.bz2» -> «es-test.txt.bz2»
set -o pipefail ; lbzip2 -d < es-learn.txt.bz2 | /home/ferlanero/okb-engine-master/db/../tools/import_corpus.py es-full.dict | sort -rn | lbzip2 -9 > grams-es-full.csv.bz2.tmp
[5] 18889/885412 words, 315866 n-grams, read 1 MB
...
[1155] 135339/885412 words, 22358980 n-grams, read 244 MB
Traceback (most recent call last):
  File "/home/ferlanero/okb-engine-master/db/../tools/import_corpus.py", line 209, in <module>
    ci.parse_line(line)
  File "/home/ferlanero/okb-engine-master/db/../tools/import_corpus.py", line 76, in parse_line
    if re.match(r'[\.:;\!\?]', mo.group(0)): self.next_sentence()
  File "/home/ferlanero/okb-engine-master/db/../tools/import_corpus.py", line 167, in next_sentence
    self.count(roll)
  File "/home/ferlanero/okb-engine-master/db/../tools/import_corpus.py", line 180, in count
    self.count2(elts)
  File "/home/ferlanero/okb-engine-master/db/../tools/import_corpus.py", line 189, in count2
    if key not in self.grams: self.grams[key] = 0
MemoryError
/home/ferlanero/okb-engine-master/db/makefile:43: fallo en las instrucciones para el objetivo 'grams-es-full.csv.bz2'
make: *** [grams-es-full.csv.bz2] Error 1
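The traceback shows where the memory goes: import_corpus.py keeps every n-gram count in an in-memory dict (self.grams), so with a large corpus the table eventually exhausts RAM. A hypothetical illustration of the pattern (not the okb-engine code) with one common workaround, pruning rare entries to bound the working set:

Code:
# Hypothetical illustration (not import_corpus.py itself): counting n-grams in a
# plain in-memory dict keeps every distinct key resident, which is what blows up
# on a large corpus.
from collections import Counter

grams = Counter()

def count_ngrams(words, n=3):
    # one entry per distinct n-gram -> memory grows with vocabulary and context
    for i in range(len(words) - n + 1):
        grams[tuple(words[i:i + n])] += 1

def prune(min_count=2):
    # common workaround: periodically drop n-grams seen fewer than min_count
    # times, trading a little accuracy for a bounded working set
    for key in [k for k, v in grams.items() if v < min_count]:
        del grams[key]

count_ngrams("the quick brown fox jumps over the lazy dog".split())
prune()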
 
Posts: 105 | Thanked: 205 times | Joined on Dec 2015 @ Spain
#197
Another error. I'm posting it in case it helps. The error from the previous post was solved by reducing the size of the corpus file. This is the new error:

Code:
[...]
[1760] 30000/30000 words, 16164573 n-grams, read 308 MB
[1765] 30000/30000 words, 16198654 n-grams, read 309 MB
mv -f grams-es-learn.csv.bz2.tmp grams-es-learn.csv.bz2
Computing clusters for language es. Please make some coffee ...
 (logs can be found in clusters-es.log)
set -o pipefail ; lbzip2 -d < grams-es-learn.csv.bz2 | sort -rn | sed -n "1,13500000 p" \
 | /home/ferlanero/okb-engine-master/db/../tools/cluster -n 10 -o clusters-es.tmp > clusters-es.log 2>&1
/bin/sh: línea 1:  1870 Hecho                   lbzip2 -d < grams-es-learn.csv.bz2
      1871 Tubería rota           | sort -rn
      1872 Tubería rota           | sed -n "1,13500000 p"
      1873 Terminado (killed)      | /home/ferlanero/okb-engine-master/db/../tools/cluster -n 10 -o clusters-es.tmp > clusters-es.log 2>&1
/home/ferlanero/okb-engine-master/db/makefile:86: fallo en las instrucciones para el objetivo 'clusters-es.txt'
make: *** [clusters-es.txt] Error 137
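For what it's worth, Error 137 means the cluster tool exited with status 128 + 9, i.e. it was killed with SIGKILL ("Terminado (killed)" in the output), which on Linux usually means the kernel's OOM killer stepped in. A quick way to decode such a status:

Code:
# Decode a shell exit status above 128 into the signal that terminated the process.
import signal

status = 137                                   # what make reported
if status > 128:
    print(signal.Signals(status - 128).name)   # prints: SIGKILL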
 
Posts: 102 | Thanked: 187 times | Joined on Jan 2010
#198
Originally Posted by ferlanero View Post
Another error. I'm posting it in case it helps. The error from the previous post was solved by reducing the size of the corpus file.

Did you run make clean (or use the -r flag) after the run that hit the memory problem?
 
Posts: 105 | Thanked: 205 times | Joined on Dec 2015 @ Spain
#199
Originally Posted by ljo View Post
Did you run make clean (or use the -r flag) after the run that hit the memory problem?
I think not. Do I have to do that? I'm now testing with a smaller corpus file... maybe the problem is due to having only 4 GB of RAM? In any case, I have already ordered 16 GB of RAM to work with okboard.

@ljo & eber42: how much RAM do your PCs have for processing the corpus files? And how large is your corpus.txt file before compressing it with bz2? Mine is almost 3 GB...
 
Posts: 102 | Thanked: 187 times | Joined on Jan 2010
#200
@ferlanero, these are userland processes, so they will usually not use more than 4 GB each (I reran it and it never went above 6.8 GB for your full data). But of course it is beneficial if you have room for the OS on top of that. It seems to be quite sensitive to fragmentation, so a fresh restart won't hurt before running a corpus of this size (1.5 GB uncompressed, 470 MB compressed). I don't see why you cannot just cut the size in half for the two largest corpora (leipzig and wiki). I didn't see much difference with that size compared to using all of your data. There was still some noise which could be scrapped.
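If you want to cut a corpus in half, a small filter along these lines will do it (a hypothetical helper, not part of okb-engine):

Code:
#!/usr/bin/env python3
# Hypothetical helper (not part of okb-engine): keep roughly half of the lines
# of a corpus read on stdin, e.g.
#   lbzip2 -d < corpus-es.txt.bz2 | ./halve_corpus.py | lbzip2 -9 > corpus-es-half.txt.bz2
import sys
import random

random.seed(0)        # deterministic, so repeated runs keep the same subset
KEEP_FRACTION = 0.5   # "cut the size in half"

for line in sys.stdin.buffer:
    if random.random() < KEEP_FRACTION:
        sys.stdout.buffer.write(line)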

Last edited by ljo; 2016-01-18 at 00:03.
 

The Following 2 Users Say Thank You to ljo For This Useful Post: