
e-mail: dvkazakov @ gmail.com
(remove spaces on both sides of @)

Phone/WhatsApp: +7-916-909-7864

Telegram: @denis_v_kazakov

GitHub

Skype: denis.v.kazakov




Google Translate detected!

Using machine learning to establish authorship: comparing human and machine translation


Jupyter notebook: click this link to download the project notebook.

"Google Translate detected!" is the battle cry of translators seeing that a translation was done by a computer rather than a human translator (implying that the translation is poor and it is clearly visible).

I myself have been working as a translator for many years. Quite often I start reading a text and think, "This is not an original; it must be a translation," and it turns out to be one. There are certain markers that distinguish a good translation from a bad one, and there are differences between human and machine translations. This is really a text classification problem, so it would be interesting to see whether an ML algorithm could handle it.

This is a study project comparing human and machine translations of four books by Charles Dickens: The Pickwick Papers, Oliver Twist, David Copperfield and Nicholas Nickleby. The choice was driven by their large volume and by the fact that they were all translated by the same team of translators.

Data preparation

I used the existing translations by a single team of translators (Lann, Krivtsova, et al.) and had each book translated by Google Translate as well.

I also manually checked that the names of at least the main characters were translated the same way as in the human translation. I did not read the entire text, of course, but used MS Word search instead. (A preliminary experiment showed that an ML model easily identifies authorship by proper names.)

Each book was saved as a txt file split into sentences (by periods), i.e. each sentence was a separate paragraph. The sentences were then merged into text segments no shorter than a predefined minimum number of characters. After that, all files were joined into one with an authorship column (0 for human translation, 1 for Google Translate). Finally, the data were split into train and test sets.
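The segment-merging and labeling step might look roughly like the sketch below. This is a simplified illustration rather than the actual notebook code; the file names and helper functions are hypothetical, and each input txt file is assumed to contain one sentence per line.

```python
# Simplified sketch of the data-preparation step (not the actual notebook code).
# Each input txt file is assumed to contain one sentence per line.
import pandas as pd
from sklearn.model_selection import train_test_split

def make_segments(path, min_len):
    """Merge consecutive sentences into segments of at least min_len characters."""
    segments, current = [], ""
    with open(path, encoding="utf-8") as f:
        for sentence in f:
            current = (current + " " + sentence.strip()).strip()
            if len(current) >= min_len:
                segments.append(current)
                current = ""
    return segments

def build_dataset(human_files, google_files, min_len):
    """Label each segment: 0 for human translation, 1 for Google Translate."""
    rows = [(seg, 0) for path in human_files for seg in make_segments(path, min_len)]
    rows += [(seg, 1) for path in google_files for seg in make_segments(path, min_len)]
    return pd.DataFrame(rows, columns=["text", "author"])

# Hypothetical file names; in the project there were four books per class.
df = build_dataset(["pickwick_human.txt"], ["pickwick_google.txt"], min_len=1800)
train_df, test_df = train_test_split(df, test_size=0.2, stratify=df["author"], random_state=42)
```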

Model hyperparameter optimization

I started with a simple model: a neural network without hidden layers and a sigmoid as the activation function. Words were vectorized with an embedding layer.
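As a point of reference, such a baseline could be expressed in Keras roughly as follows. This is a minimal sketch; the vocabulary size, embedding dimension and pooling choice are my assumptions, not the values actually used in the project.

```python
# Minimal sketch of the baseline: an embedding layer followed directly by a single
# sigmoid output unit (no hidden layers). All settings are illustrative assumptions.
import tensorflow as tf

VOCAB_SIZE = 20_000   # assumed vocabulary size
EMBED_DIM = 32        # assumed embedding dimension

def build_model():
    return tf.keras.Sequential([
        tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM),
        tf.keras.layers.GlobalAveragePooling1D(),        # average the word embeddings
        tf.keras.layers.Dense(1, activation="sigmoid"),  # output layer only
    ])

model = build_model()
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```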

Several model hyperparameters were optimized.

The most important factor determining classification accuracy is segment length: accuracy increased with segment length even though longer segments meant a smaller training set. The minimum segment length was varied between 100 and 4096 characters. A reasonable reference length is 1800 characters (a standard page in the translation industry).

After hyperparameter optimization, accuracy was measured on the test set for several segment lengths.
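Putting the pieces together, the sweep over segment lengths could be sketched as below, reusing the build_dataset and build_model helpers from the sketches above; the segment lengths, tokenization settings and number of epochs are illustrative, not the values actually used.

```python
# Illustrative sweep over minimum segment lengths, reusing build_dataset and
# build_model from the sketches above. Tokenization settings are assumptions.
import tensorflow as tf
from sklearn.model_selection import train_test_split

for min_len in (100, 250, 1800):
    df = build_dataset(["pickwick_human.txt"], ["pickwick_google.txt"], min_len)
    train_df, test_df = train_test_split(
        df, test_size=0.2, stratify=df["author"], random_state=42)

    # Turn text into padded integer sequences for the embedding layer.
    vectorizer = tf.keras.layers.TextVectorization(
        max_tokens=VOCAB_SIZE, output_sequence_length=400)
    vectorizer.adapt(train_df["text"].tolist())
    X_train = vectorizer(tf.constant(train_df["text"].tolist()))
    X_test = vectorizer(tf.constant(test_df["text"].tolist()))

    model = build_model()
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(X_train, train_df["author"].values, epochs=5, verbose=0)
    _, acc = model.evaluate(X_test, test_df["author"].values, verbose=0)
    print(f"min segment length {min_len}: test accuracy {acc:.2f}")
```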

Results

Accuracy on the test set was 85% with a minimum segment length of 100 characters; 91% for 250 characters; and 99% for 1800 characters (a standard page in translation).

All these results were obtained with the most primitive neural network, one without hidden layers, which is equivalent to logistic regression. Adding hidden layers or using recurrent networks (including LSTM) did not bring a significant improvement, probably because of the inherent limitations of the task: the model only compares word choice and word order, and in short segments the differences are probably too small.

The same task performed by humans

I prepared two samples for comparison: 40 segments with a minimum length of 100 characters and 25 segments with a minimum length of 250 characters. I first tried to tell human from machine translation myself.

My accuracy on the 100-character sample was 0.70, versus 0.85 for the model. On the sample with a minimum segment length of 250 characters, I scored 0.84 and the model 0.76.

However, as shown by McNemar's test, the difference is not statistically significant in either case (p100 = 0.18; p250 = 0.73).
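For reference, McNemar's test can be run in Python with statsmodels along these lines; the contingency counts below are placeholders, not the actual numbers from the experiment.

```python
# McNemar's test on paired human/model predictions over the same segments,
# assuming statsmodels. The counts below are placeholders, not real results.
from statsmodels.stats.contingency_tables import mcnemar

# Rows: human correct / human wrong; columns: model correct / model wrong.
table = [[20, 8],
         [12, 10]]

result = mcnemar(table, exact=True)  # exact binomial test, suitable for small samples
print(f"statistic = {result.statistic}, p-value = {result.pvalue:.3f}")
```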

It then makes sense to estimate the power of the test. I did not find a function for this purpose in Python, so I used R. The power is insufficient: 0.35 in the first case and 0.09 in the second.

Notebooks: click this link to download the Jupyter notebook with hypothesis testing in Python (without statistical power estimation), or this link for the same calculations with statistical power estimation in R.

The next step would be to try larger samples. My colleagues have also promised to run this test with their students.