|
2016-01-08
, 12:43
|
Posts: 102 |
Thanked: 187 times |
Joined on Jan 2010
|
#182
|
Thank you again. Now I understand how these files work with OKBoard. I have a few more questions:
1 - When I add corpora files, do I have to copy all the sentences and words into one file and then compress it to bz2, or can I take multiple files, compress them into one bz2, and then run db/build.sh?
2 - Which tools do you use to fix the format of these dump files? Is there an easy way to do that? Usually I have to remove the first columns of the files, but other times I have to remove the last characters... So I don't know how to do it properly.
3 - Related to this: in the correct dump file format for OKBoard, must sentences end with a full stop, and must there be blank lines between them, or is that unnecessary?
Thanks in advance, folks!
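Regarding question 1: a minimal sketch (not the official OKBoard tooling; the file names are just examples) of concatenating several plain-text corpus files into the single bz2 file that db/build.sh reads:

```python
# Hypothetical helper: merge several plain-text corpus files into one
# .bz2 archive. Uses strict UTF-8 so badly encoded input fails here,
# before the okb-engine build does.
import bz2

def merge_corpora(input_paths, output_path):
    """Append every input text file, in order, into one .bz2 archive."""
    with bz2.open(output_path, "wt", encoding="utf-8") as out:
        for path in input_paths:
            with open(path, "r", encoding="utf-8") as f:
                for line in f:
                    out.write(line)
```

You would call it with your own file list, e.g. `merge_corpora(["wiki.txt", "subs.txt"], "corpus-es.txt.bz2")`; the build script then only sees one compressed corpus.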
|
2016-01-12
, 14:17
|
Posts: 105 |
Thanked: 205 times |
Joined on Dec 2015
@ Spain
|
#183
|
Traceback (most recent call last):
  File "/home/ferlanero/okb-engine-master/db/../tools/import_corpus.py", line 210, in <module>
    line = sys.stdin.readline()
  File "/usr/lib/python3.5/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 35: invalid start byte
/home/ferlanero/okb-engine-master/db/makefile:43: recipe for target 'grams-es-full.csv.bz2' failed
make: *** [grams-es-full.csv.bz2] Error 1
|
2016-01-12
, 14:18
|
Posts: 105 |
Thanked: 205 times |
Joined on Dec 2015
@ Spain
|
#184
|
[ferlanero@ferlanero-imac okb-engine-master]$ db/build.sh es
Building for languages: es
~/okb-engine-master/ngrams ~/okb-engine-master/db
running build
running build_ext
running build
running build_ext
~/okb-engine-master/db ~/okb-engine-master/cluster ~/okb-engine-master/db
make: Nothing to be done for 'first'.
~/okb-engine-master/db
«/home/ferlanero/okb-engine-master/db/lang-en.cf» -> «/home/ferlanero/okboard/langs/lang-en.cf»
«/home/ferlanero/okb-engine-master/db/lang-es.cf» -> «/home/ferlanero/okboard/langs/lang-es.cf»
«/home/ferlanero/okb-engine-master/db/lang-fr.cf» -> «/home/ferlanero/okboard/langs/lang-fr.cf»
«/home/ferlanero/okb-engine-master/db/lang-nl.cf» -> «/home/ferlanero/okboard/langs/lang-nl.cf»
«/home/ferlanero/okb-engine-master/db/add-words-fr.txt» -> «/home/ferlanero/okboard/langs/add-words-fr.txt»
«/home/ferlanero/okb-engine-master/db/db.version» -> «/home/ferlanero/okboard/langs/db.version»
make: '.depend-es' is up to date.
( [ -f "add-words-es.txt" ] && cat "add-words-es.txt" ; aspell -l es dump master ) | sort | uniq > es-full.dict
lbzip2 -d < /home/ferlanero/okboard/langs/corpus-es.txt.bz2 | /home/ferlanero/okb-engine-master/db/../tools/corpus-splitter.pl 200 50 es-learn.tmp.bz2 es-test.tmp.bz2
mv -vf es-learn.tmp.bz2 es-learn.txt.bz2
«es-learn.tmp.bz2» -> «es-learn.txt.bz2»
mv -vf es-test.tmp.bz2 es-test.txt.bz2
«es-test.tmp.bz2» -> «es-test.txt.bz2»
set -o pipefail ; lbzip2 -d < es-learn.txt.bz2 | /home/ferlanero/okb-engine-master/db/../tools/import_corpus.py es-full.dict | sort -rn | lbzip2 -9 > grams-es-full.csv.bz2.tmp
[5] 597/56329 words, 3232 n-grams, read 1 MB
[10] 739/56329 words, 5115 n-grams, read 3 MB
[15] 821/56329 words, 6611 n-grams, read 4 MB
[...]
[765] 1720/56329 words, 70617 n-grams, read 255 MB
[770] 1722/56329 words, 70927 n-grams, read 256 MB
[775] 1724/56329 words, 71187 n-grams, read 258 MB
Traceback (most recent call last):
  File "/home/ferlanero/okb-engine-master/db/../tools/import_corpus.py", line 210, in <module>
    line = sys.stdin.readline()
  File "/usr/lib/python3.5/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 35: invalid start byte
/home/ferlanero/okb-engine-master/db/makefile:43: recipe for target 'grams-es-full.csv.bz2' failed
make: *** [grams-es-full.csv.bz2] Error 1
[ferlanero@ferlanero-imac okb-engine-master]$
|
2016-01-12
, 22:25
|
Posts: 102 |
Thanked: 187 times |
Joined on Jan 2010
|
#185
|
Hi again, guys!
How can I find the character that causes the error in the original corpus-es file? I mean, does this message point to somewhere in that file?
Code: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 35: invalid start byte
|
2016-01-13
, 11:18
|
Posts: 105 |
Thanked: 205 times |
Joined on Dec 2015
@ Spain
|
#186
|
@ferlanero, it says the first byte of a multibyte character is wrong, so the file probably contains faulty UTF-8.
You can either search for the named byte by its hexadecimal value directly, or convert it to a representation your tool (e.g. perl, python, grep) understands.
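To sketch that search concretely (an illustrative helper, not part of okb-engine): the script below scans a file in binary mode, so it never hits a decoding error itself, and reports the line number and column of each occurrence of the offending byte.

```python
# Locate a raw byte (default 0x96) in a file, reporting up to `limit` hits
# as (line number, byte offset within the line, surrounding bytes).
def find_byte(path, needle=0x96, limit=10):
    hits = []
    with open(path, "rb") as f:  # binary mode: no decoding, no UnicodeDecodeError
        for lineno, raw in enumerate(f, start=1):
            col = raw.find(bytes([needle]))
            if col != -1:
                # keep a little context around the hit to recognise the spot
                hits.append((lineno, col, raw[max(0, col - 10):col + 10]))
                if len(hits) >= limit:
                    break
    return hits
```

Worth noting: 0x96 is an en dash in the Windows-1252 encoding, so the corpus is likely partly cp1252 rather than UTF-8; if that turns out to be the case, re-encoding the affected source (for instance with `iconv -f WINDOWS-1252 -t UTF-8`) may fix all such bytes at once.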
|
2016-01-13
, 14:57
|
Posts: 102 |
Thanked: 187 times |
Joined on Jan 2010
|
#187
|
About the data I have gathered: yes, I have tons of it, from many, many different sources, so I decided to process all of it.
Once that is done, I would like your support in making a proper RPM to add Spanish language support to OKBoard. Would that be possible?
|
2016-01-13
, 19:22
|
Posts: 86 |
Thanked: 362 times |
Joined on Dec 2007
@ Paris / France
|
#188
|
|
2016-01-13
, 22:57
|
Posts: 102 |
Thanked: 187 times |
Joined on Jan 2010
|
#189
|
|
2016-01-14
, 00:21
|
Posts: 105 |
Thanked: 205 times |
Joined on Dec 2015
@ Spain
|
#190
|
@ferlanero, according to your logs, the first 258 MB of your text corpus use only about 1700 unique words (which is roughly the vocabulary of a two-year-old child). The n-gram count is incredibly low as well.
The corpus URL you quoted should be all right, so maybe something went wrong when you converted the files.
Could you share your input file (or at least a sample)?
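That vocabulary figure is easy to double-check before a full build. A quick sanity check (an assumed helper, not okb-engine code; the size cutoff is approximate since it counts characters, not bytes) that tallies unique lowercase words in the start of a bz2 corpus:

```python
# Count unique words in roughly the first `max_chars` characters of a
# bz2-compressed text corpus. A healthy natural-language corpus of this
# size should yield tens of thousands of distinct words, not ~1700.
import bz2
import re

def vocab_size(path, max_chars=258 * 2**20):
    words = set()
    seen = 0
    with bz2.open(path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            seen += len(line)
            words.update(re.findall(r"\w+", line.lower()))
            if seen >= max_chars:
                break
    return len(words)
```

Running this on the corpus before db/build.sh would reveal immediately whether the conversion step collapsed the text.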
|