Hm, according to @eber42 on tjc, you need corpus files: "Corpus files are compressed plain text files (.txt.bz2) and not tar files. So you can just join different sources before compression. They should contain only sentences separated with dots and/or blank lines, and with proper capitalization."
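If it helps, here is a rough sketch of the "join different sources before compression" step using only Python's standard `bz2` module. The function name `build_corpus` and the file names in the usage comment are made up for illustration, and I'm assuming the sources are UTF-8 plain text that already contain properly capitalized sentences; it does not clean or validate the text.

```python
import bz2
from pathlib import Path


def build_corpus(sources, out_path):
    """Join plain-text files and write them as one bz2-compressed text file.

    Hypothetical helper: each source is separated from the next by a blank
    line. No sentence splitting or capitalization fixing is done here.
    """
    with bz2.open(out_path, "wt", encoding="utf-8") as out:
        for src in sources:
            text = Path(src).read_text(encoding="utf-8").strip()
            if text:  # skip empty sources
                out.write(text + "\n\n")
    return out_path


# Example usage (file names are placeholders):
# build_corpus(["source_a.txt", "source_b.txt"], "corpus.txt.bz2")
```

The result is a single compressed text stream rather than an archive, which matches the "not tar files" part of the quote. On the command line, concatenating the files and running `bzip2` on the result should give an equivalent file.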