Uses a character-level encoder-decoder network of LSTMs. The encoder network reads the input sentence character by character and summarizes the sentence in its state. This state is then used as the initial state of the decoder network, which produces the translated sentence one character at a time. During prediction, the decoder also receives its previous output as input for the next time step. For training we use a technique called "teacher forcing", i.e. we feed the decoder the actual previous character instead of its previous prediction, which greatly benefits training.
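To make the difference between training and prediction concrete, here is a minimal plain-Python sketch. It is not taken from the workflow itself: the start/end marker characters and the `step_fn` callable (a stand-in for one step of the trained decoder LSTM) are assumptions made for illustration only.

```python
# Assumed start/end markers; the actual workflow may use different ones.
START, END = "\t", "\n"


def teacher_forcing_pairs(target):
    """Build decoder inputs and targets for one target sentence.

    With teacher forcing, the input at step t is the TRUE character
    from step t-1, so the inputs are the targets shifted by one.
    """
    seq = START + target + END
    decoder_inputs = list(seq[:-1])   # what the decoder is fed
    decoder_targets = list(seq[1:])   # what it should predict
    return decoder_inputs, decoder_targets


def greedy_decode(step_fn, state, max_len=50):
    """Prediction loop: feed the decoder its own previous output.

    step_fn(prev_char, state) -> (next_char, new_state) is a
    hypothetical stand-in for one step of the decoder network;
    `state` starts as the encoder's final state.
    """
    prev, out = START, []
    for _ in range(max_len):
        prev, state = step_fn(prev, state)
        if prev == END:
            break
        out.append(prev)
    return "".join(out)
```

During training the decoder never sees its own mistakes, because `decoder_inputs` always comes from the reference translation. During prediction no reference exists, so `greedy_decode` feeds each predicted character back in until the end marker appears or `max_len` is reached.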
This is a companion discussion topic for the original entry at https://kni.me/w/x_Oc-UjBgviZq7Ff