More Progress Towards Rusyn Artificial Intelligence

tl;dr: the lem2en NMT project is now on AWS EC2 and it's alive!

After initial success processing a Lemko-to-English translation segment using TensorFlow (#tf) neural machine translation (#NMT) Sequence-to-Sequence (#seq2seq), it appears training would take literally days, or even months, to finish on the available hardware (a Lenovo M30 laptop: 1.7GHz, 4GB RAM). The solution is cloud computing, specifically Amazon Web Services (#aws). This has been accomplished.

Training Data

Eventually, we are going to feed the machine tens of thousands of perfectly aligned, premium-quality sentence pairs generously made available by an NGO. But first, we're going to give it something very easy to digest and see how long it takes and what the bill is.

What could be simpler than "My name is X":

#train.lem
Называм ся Параскєва.
Называм ся Петро.
Называм ся Ваньо.
Я ся называм Катерина тераз.
Я ся называм Параска.
Я называм ся Митро.
Называм ся Ярослав.
Называм ся Мария.
The English version:
#train.en
My name is Paraskiewa.
My name is Petro.
My name is Wanio.
My name is now Kateryna.
My name is Paraska.
My name is Mytro.
My name is Jarosław.
My name is Maria.
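Before feeding the two files to the trainer, it's worth sanity-checking that they really are parallel, i.e. that the source and target files have exactly the same number of lines. A minimal sketch (the check itself is my own addition, not part of the NMT tutorial):

```python
# Verify that a source/target file pair is line-aligned: an NMT
# trainer pairs line N of the source with line N of the target,
# so the line counts must match exactly.
def count_lines(path):
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)

def check_parallel(src_path, tgt_path):
    src, tgt = count_lines(src_path), count_lines(tgt_path)
    if src != tgt:
        raise ValueError(f"misaligned corpus: {src} vs {tgt} lines")
    return src
```

For the toy data above this should report 8 lines on each side.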
Next, it's time to create the development and testing data. The Stanford NLP group's English-Vietnamese data, pulled from the International Workshop on Spoken Language Translation (#IWSLT2015), has a train:dev:test ratio of about 98:1:1, i.e.:
Dataset      Filename    Lines   Percent
Training     train.vi    133317  97.928%
Training     train.en    133317  97.928%
Development  tst2012.vi    1553   1.141%
Development  tst2012.en    1553   1.141%
Testing      tst2013.vi    1268   0.931%
Testing      tst2013.en    1268   0.931%
We'll have to content ourselves with a train:dev:test ratio of 8:1:1 for now.
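With ten sentence pairs the split can be done by hand, but the same 8:1:1 idea can be sketched in a few lines (the function name and the take-from-the-end strategy are my own illustration):

```python
# Split a list of aligned sentence pairs into train/dev/test sets.
# Here dev and test are simply the last pairs of the corpus; for a
# real corpus you would shuffle first to avoid ordering bias.
def split_corpus(pairs, dev_n=1, test_n=1):
    train = pairs[: len(pairs) - dev_n - test_n]
    dev = pairs[len(train): len(train) + dev_n]
    test = pairs[len(train) + dev_n:]
    return train, dev, test
```

Applied to the 10 pairs above, this yields 8 training, 1 development, and 1 test pair.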

Development data

#dev.lem
Я ся называм Ярослав.
#dev.en
My name is Jarosław.

Test data

#test.lem
Я ся называм Мария.
#test.en
My name is Maria.



Vocab file

Surprisingly, the Stanford NLP group's English-Vietnamese vocabulary file is just the training-data words listed (no dev or test data), without any stemming or word-frequency sorting; each word form occurs exactly once. So recreating it for Lemko will be pretty easy. Meanwhile, I have created a program, Agni, to stem and sort by frequency if need be. It's currently operational for Hungarian.
The vocab file needs to start with the unknown token "<unk>", the sentence-start symbol "<s>", and the end-of-sentence marker "</s>".
#vocab.lem
<unk>
<s>
</s>
Называм
ся
Параскєва
.
Петро
Ваньо
Я
называм
Катерина
тераз
Параска
Митро
Ярослав
Мария
And the same for the English:
#vocab.en
<unk>
<s>
</s>
My
name
is
Paraskiewa
.
Petro
Wanio
now
Kateryna
Paraska
Mytro
Jarosław
Maria
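Since the vocab file is just each word form listed once, in order of first appearance, generating it from the training data takes only a few lines. A sketch (the special tokens match what the files above start with; the naive split-off-punctuation tokenization is my own assumption, chosen so that "." becomes its own entry as it does in the lists above):

```python
import re

SPECIALS = ["<unk>", "<s>", "</s>"]

def build_vocab(lines):
    """List each token once, in order of first appearance,
    prefixed by the required special tokens."""
    vocab = list(SPECIALS)
    seen = set(SPECIALS)
    for line in lines:
        # \w+ matches Unicode letters (so Cyrillic works);
        # anything else non-space (e.g. ".") is its own token
        for tok in re.findall(r"\w+|[^\w\s]", line):
            if tok not in seen:
                seen.add(tok)
                vocab.append(tok)
    return vocab
```

Writing the result out with one token per line reproduces the vocab.lem and vocab.en files shown above.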

Output

Now the tutorial says, "Perform external evaluation." But that will have to wait until next week ;-)

For all you language hackers out there, here's the secret sauce:

tmux
source activate tensorflow_p36
git clone https://github.com/tensorflow/nmt/
wget https://raw.githubusercontent.com/pgleasonjr/nmt/master/scripts/download_lemko.sh
sudo chmod +x download_lemko.sh
./download_lemko.sh /tmp/nmt_data
vim nmt/utils/misc_utils.py   # fix line 34, then ZZ to save and quit
mkdir /tmp/nmt_model
cd nmt
python -m nmt.nmt \
    --src=lem --tgt=en \
    --vocab_prefix=/tmp/nmt_data/vocab  \
    --train_prefix=/tmp/nmt_data/train \
    --dev_prefix=/tmp/nmt_data/dev  \
    --test_prefix=/tmp/nmt_data/test \
    --out_dir=/tmp/nmt_model \
    --num_train_steps=12000 \
    --steps_per_stats=100 \
    --num_layers=2 \
    --num_units=128 \
    --dropout=0.2 \
    --metrics=bleu
C-a "          # tmux: split vertically (top/bottom)
C-a %          # tmux: split horizontally (left/right)
