Poll: What advanced text entry method(s) would you like to see on Sailfish?
Posts: 105 | Thanked: 205 times | Joined on Dec 2015 @ Spain
#181
Thank you again. I've now figured out how these files work with OKBoard. Now I have three more questions:

1 - When I add corpus files, do I have to copy all the sentences and words into a single file and then compress it to bz2, or can I take multiple files, compress them into one bz2, and then run db/build.sh?

2 - Also, which tools do you use to fix the format of those dump files? Is there an easy way to do it? Usually I have to remove the first columns of a file, but other times I have to remove the last characters of those files... so I don't know how to do it properly.

3 - And related to this: in the correct dump file format for OKBoard, must sentences end with a period and be separated by blank lines, or is that unnecessary?

Thanks in advance, folks!
 
Posts: 102 | Thanked: 187 times | Joined on Jan 2010
#182
Originally Posted by ferlanero View Post
Thank you again. I've now figured out how these files work with OKBoard. Now I have three more questions:

1 - When I add corpus files, do I have to copy all the sentences and words into a single file and then compress it to bz2, or can I take multiple files, compress them into one bz2, and then run db/build.sh?

2 - Also, which tools do you use to fix the format of those dump files? Is there an easy way to do it? Usually I have to remove the first columns of a file, but other times I have to remove the last characters of those files... so I don't know how to do it properly.

3 - And related to this: in the correct dump file format for OKBoard, must sentences end with a period and be separated by blank lines, or is that unnecessary?

Thanks in advance, folks!
1 - One file. Just use cat to concatenate them all together:
cat file1 file2 file3 file4 file5 > corpus-es.txt
2 - iconv, sed, perl, python.
3 - Either punctuation at the end of sentences or empty lines in between is needed, not both.
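
Put together, the whole preparation step could look like this (a minimal sketch: the corpus-es.txt name, the ~/okboard/langs location, and lbzip2 are taken from the build log later in this thread; adjust them to your setup):

Code:
# concatenate all corpus files into a single text file
cat file1 file2 file3 file4 file5 > corpus-es.txt

# compress it (plain bzip2 works too, lbzip2 is just faster)
lbzip2 -9 < corpus-es.txt > corpus-es.txt.bz2

# put it where the build expects it, then build the Spanish database
mv corpus-es.txt.bz2 ~/okboard/langs/
db/build.sh es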
 
Posts: 105 | Thanked: 205 times | Joined on Dec 2015 @ Spain
#183
Hi again, guys!

After gathering all the necessary corpus databases and adjusting them to OKBoard's requirements, the build gives me this error, which I don't know how to solve. Any ideas, please?

Code:
Traceback (most recent call last):
  File "/home/ferlanero/okb-engine-master/db/../tools/import_corpus.py", line 210, in <module>
    line = sys.stdin.readline()
  File "/usr/lib/python3.5/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 35: invalid start byte
/home/ferlanero/okb-engine-master/db/makefile:43: recipe for target 'grams-es-full.csv.bz2' failed
make: *** [grams-es-full.csv.bz2] Error 1
How can I find the character that causes the error in the original corpus-es file? I mean, does this message point to a specific location in that file?

Code:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 35: invalid start byte
 
Posts: 105 | Thanked: 205 times | Joined on Dec 2015 @ Spain
#184
Maybe the full log helps in any case:

Code:
[ferlanero@ferlanero-imac okb-engine-master]$ db/build.sh es
Building for languages:  es
~/okb-engine-master/ngrams ~/okb-engine-master/db
running build
running build_ext
running build
running build_ext
~/okb-engine-master/db
~/okb-engine-master/cluster ~/okb-engine-master/db
make: Nothing to be done for 'first'.
~/okb-engine-master/db
«/home/ferlanero/okb-engine-master/db/lang-en.cf» -> «/home/ferlanero/okboard/langs/lang-en.cf»
«/home/ferlanero/okb-engine-master/db/lang-es.cf» -> «/home/ferlanero/okboard/langs/lang-es.cf»
«/home/ferlanero/okb-engine-master/db/lang-fr.cf» -> «/home/ferlanero/okboard/langs/lang-fr.cf»
«/home/ferlanero/okb-engine-master/db/lang-nl.cf» -> «/home/ferlanero/okboard/langs/lang-nl.cf»
«/home/ferlanero/okb-engine-master/db/add-words-fr.txt» -> «/home/ferlanero/okboard/langs/add-words-fr.txt»
«/home/ferlanero/okb-engine-master/db/db.version» -> «/home/ferlanero/okboard/langs/db.version»
make: '.depend-es' is up to date.
( [ -f "add-words-es.txt" ] && cat "add-words-es.txt" ; aspell -l es dump master ) | sort | uniq > es-full.dict
lbzip2 -d < /home/ferlanero/okboard/langs/corpus-es.txt.bz2 | /home/ferlanero/okb-engine-master/db/../tools/corpus-splitter.pl 200 50 es-learn.tmp.bz2 es-test.tmp.bz2
mv -vf es-learn.tmp.bz2 es-learn.txt.bz2
«es-learn.tmp.bz2» -> «es-learn.txt.bz2»
mv -vf es-test.tmp.bz2 es-test.txt.bz2
«es-test.tmp.bz2» -> «es-test.txt.bz2»
set -o pipefail ; lbzip2 -d < es-learn.txt.bz2 | /home/ferlanero/okb-engine-master/db/../tools/import_corpus.py es-full.dict | sort -rn | lbzip2 -9 > grams-es-full.csv.bz2.tmp
[5] 597/56329 words, 3232 n-grams, read 1 MB
[10] 739/56329 words, 5115 n-grams, read 3 MB
[15] 821/56329 words, 6611 n-grams, read 4 MB
[... some 150 similar progress lines omitted; the word and n-gram counts keep growing very slowly ...]
[775] 1724/56329 words, 71187 n-grams, read 258 MB
Traceback (most recent call last):
  File "/home/ferlanero/okb-engine-master/db/../tools/import_corpus.py", line 210, in <module>
    line = sys.stdin.readline()
  File "/usr/lib/python3.5/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 35: invalid start byte
/home/ferlanero/okb-engine-master/db/makefile:43: recipe for target 'grams-es-full.csv.bz2' failed
make: *** [grams-es-full.csv.bz2] Error 1
[ferlanero@ferlanero-imac okb-engine-master]$
 

The Following User Says Thank You to ferlanero For This Useful Post:
Posts: 102 | Thanked: 187 times | Joined on Jan 2010
#185
Originally Posted by ferlanero View Post
Hi again, guys!

How can I find the character that causes the error in the original corpus-es file? I mean, does this message point to a specific location in that file?

Code:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 35: invalid start byte
@ferlanero, it says the first byte of a multibyte character is invalid, so probably some faulty UTF-8 encoding.
You can either find a way to search for the named byte by its hexadecimal value, or convert it to some representation your tool (perl, python, grep) understands.
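
For example, with GNU grep this could look like the following (a sketch: corpus-es.txt is your concatenated input file, and byte 0x96 is an en dash in Windows-1252, a common culprit when text was exported from Windows tools):

Code:
# print the line numbers of lines containing the raw byte 0x96
LC_ALL=C grep -nP '\x96' corpus-es.txt

# or print the absolute byte offset of each occurrence
LC_ALL=C grep -obaP '\x96' corpus-es.txt | head
Note that the "position 35" in the Python traceback is an offset within the chunk of data being decoded, not within the whole file, so searching by byte value is more reliable than seeking to that position.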

Or PM me with proper URLs to download your selected resources. This goes for everyone who has made the effort and collected enough useful resources to build the required model data for their language, and I will maintain the language package for the community. I will, however, reject any request not containing at least 40 million running words from several types of source material, as discussed earlier in this thread.
 

The Following 3 Users Say Thank You to ljo For This Useful Post:
Posts: 105 | Thanked: 205 times | Joined on Dec 2015 @ Spain
#186
Originally Posted by ljo View Post
@ferlanero, it says the first byte of a multibyte character is invalid, so probably some faulty UTF-8 encoding.
You can either find a way to search for the named byte by its hexadecimal value, or convert it to some representation your tool (perl, python, grep) understands.
Great! I found it! Now I can continue with the process. Thanks, ljo.

About the data I have gathered: yes, I have tons of it, from many, many different sources. I've decided to process all of it, and when I'm done I would like your support in making a proper RPM to add Spanish language support to OKBoard. Would that be possible?

I have about 3 GB of corpus data, so it may take me a while to process everything...
 

The Following 3 Users Say Thank You to ferlanero For This Useful Post:
Posts: 102 | Thanked: 187 times | Joined on Jan 2010
#187
Originally Posted by ferlanero View Post
Great! I found it! Now I can continue with the process. Thanks, ljo.
Great!
NB: Don't forget to use the Thanks! link in the replies, since that is the formal way to send thanks in this forum.

Originally Posted by ferlanero View Post
About the data I have gathered: yes, I have tons of it, from many, many different sources. I've decided to process all of it
Good, let's hope the mix is OK then, so it makes good predictions.

Originally Posted by ferlanero View Post
and when I'm done I would like your support in making a proper RPM to add Spanish language support to OKBoard. Would that be possible?
Yes, that is possible.

Originally Posted by ferlanero View Post
I have about 3 GB of corpus data, so it may take me a while to process everything...
Sounds like you will end up with a good language resource for Spanish.
 

The Following 3 Users Say Thank You to ljo For This Useful Post:
Posts: 86 | Thanked: 362 times | Joined on Dec 2007 @ Paris / France
#188
@ferlanero, according to your logs, the first 258 MB of your text corpus use only about 1700 unique words (roughly the vocabulary of a two-year-old child). The n-gram count is also incredibly low.

The corpus URLs you quoted should be all right, so maybe there was an issue when you converted the files.
Could you share your input file (or at least a sample)?

For everybody: I recommend using the OpenSubtitles.org corpus collected by OPUS. It should be more relevant for casual chat, and it can be used together with more formal text samples (news, books, Wikipedia, ...).
 

The Following 4 Users Say Thank You to eber42 For This Useful Post:
Posts: 102 | Thanked: 187 times | Joined on Jan 2010
#189
@eber42, yes, the OPUS corpus is compiled by a former colleague of mine. It is good for a lot of things.
Let us see what @ferlanero comes up with.
Could those who waved their hands for producing resources for German please tell us their status? Otherwise I will do one on February 1st.
 

The Following 3 Users Say Thank You to ljo For This Useful Post:
Posts: 105 | Thanked: 205 times | Joined on Dec 2015 @ Spain
#190
Originally Posted by eber42 View Post
@ferlanero, according to your logs, the first 258 MB of your text corpus use only about 1700 unique words (roughly the vocabulary of a two-year-old child). The n-gram count is also incredibly low.
The file behind that log is over 1 GB and is the result of merging news, Wikipedia, and newscrawl corpora from 2006 to 2011 from Uni-Leipzig. Since adding the files from the other sources increases the final size enormously, I decided to split the data into several files and check each one separately before processing everything together. As I'm having several problems with UTF-8 encoding, I have to check each corpus on its own. For example, I have already checked the one containing colloquial speech, and it is ready to process; the same goes for the academic corpus and a dictionary. But as I said, my main trouble is with the UTF-8/ASCII encoding errors.
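
If those errors come from files that are really Windows-1252 or Latin-1 rather than UTF-8 (byte 0x96 points at Windows-1252), a conversion along these lines may help (a sketch with a hypothetical file name; verify the actual source encoding first, e.g. with the file command):

Code:
# guess the current encoding
file corpus-part.txt

# convert from Windows-1252 to UTF-8; iconv aborts on bytes that are
# invalid in the source encoding, which makes remaining problems visible
iconv -f WINDOWS-1252 -t UTF-8 corpus-part.txt > corpus-part.utf8.txt

# as a last resort, -c silently drops characters that cannot be converted
iconv -c -f WINDOWS-1252 -t UTF-8 corpus-part.txt > corpus-part.utf8.txt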

Originally Posted by eber42 View Post
The corpus URLs you quoted should be all right, so maybe there was an issue when you converted the files.
Could you share your input file (or at least a sample)?
If you want, I have no problem sharing my files so their validity can be checked and any errors corrected. As I said, I have enough processing power to help add support for more languages to OKBoard, so I'd welcome your guidelines on how to do that correctly.

I'll send you a PM with the input file.

Thanks for your support and your work on OKBoard!
 

The Following 4 Users Say Thank You to ferlanero For This Useful Post:
Tags
bettertxtentry, huntnpeck sucks, okboard, sailfish, swype


 