Posts: 102 | Thanked: 187 times | Joined on Jan 2010
#185
Originally Posted by ferlanero
Hi again guys!

How can I find the character that causes the error in the original corpus-es file? I mean, does this message point to somewhere in that file?

Code:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 35: invalid start byte
@ferlanero, it says that byte 0x96 turned up where the first byte of a character was expected, and 0x96 is not a valid start byte. So the file probably contains some faulty UTF-8 encoding.
Either search the file for that byte by its hexadecimal value, or convert the value to a base your tool understands (not needed with perl, python or grep, which accept hex directly).
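
If you go the Python route, here is a minimal sketch of such a search. It is only an illustration, not part of any tool discussed in this thread: the function name find_bad_bytes is made up, and it assumes the corpus is a newline-separated text file that is supposed to be UTF-8. It reads the file in binary mode and reports the line number, the byte offset within that line and the value of every byte sequence that fails to decode. Note that the "position 35" in the traceback may be relative to the chunk being decoded rather than to the whole file, so a scan like this is usually easier than counting bytes by hand.

```python
def find_bad_bytes(path):
    """Return (line_number, byte_offset_in_line, byte_value) for each
    spot where a line of the file fails to decode as UTF-8."""
    hits = []
    # Binary mode: nothing is decoded while reading, so reading cannot fail.
    with open(path, "rb") as fh:
        for lineno, raw in enumerate(fh, start=1):
            start = 0
            while start < len(raw):
                try:
                    raw[start:].decode("utf-8")
                    break  # the rest of this line is valid UTF-8
                except UnicodeDecodeError as err:
                    # err.start and err.end are relative to the slice we decoded.
                    offset = start + err.start
                    hits.append((lineno, offset, raw[offset]))
                    start += err.end  # resume after the offending sequence
    return hits
```

You would call it as find_bad_bytes("corpus-es.txt") (the file name here is just a placeholder) and print each value with hex() to compare against the 0x96 in the error. As a hint, 0x96 is the en dash in Windows-1252, so it may be that part of the corpus was saved in that encoding rather than in UTF-8, but check the surrounding text before converting anything.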

Or PM me with proper URLs for downloading your selected resources. This applies to all of you who have made the effort to collect enough useful resources to build the required model data for your language, and I will maintain the language package for the community. I will, however, reject any request that does not contain at least 40 million running words from several types of source material, as discussed earlier in this thread.
 
