English to German Translation using the Convolutional Sequence to Sequence Architecture

Rahul Karanam
5 min read · Jan 28, 2021

This is a step-by-step implementation of language translation using convolutional sequence-to-sequence (seq2seq) learning in PyTorch.

Please refer to the accompanying notebook for the full implementation.

This is largely based on the Convolutional Sequence to Sequence Learning paper from the Facebook AI Research team.

The code is adapted from Charon Guo.

This post is divided into four parts:

  1. Introduction
  2. Encoder
  3. Decoder
  4. Inference and Future Scope

Introduction

Since 2014, there has been tremendous improvement in language modelling and machine translation. From RNN-based architectures to the current state-of-the-art models (BERT, ELMo, GPT-3…), we have seen major changes in encoder and decoder design.

Until 2017, encoder-decoder models built from bidirectional LSTMs with an attention mechanism were considered state of the art (SOTA). In 2017, Jonas Gehring et al. from Facebook AI Research introduced a convolution-based approach to language translation.

RNNs are primarily used for sequence data, such as language modelling and machine translation, while convolutions are mostly used on images for classification, recognition, and other vision tasks. Applying convolutions to text is a newer technique: a convolution processes all positions of the sequence in parallel, whereas an RNN processes the sequence one timestep at a time.

For a better understanding of RNNs and LSTMs, you can read this blog.

The major advantages of using a CNN instead of an RNN are speed and accuracy. A CNN runs faster because it operates on the whole sequence in parallel, whereas an RNN must process it sequentially.

We are going to translate English to German using convolution-based seq2seq learning.

The dataset used is WMT.

First, load the English and German models from spaCy.

Import necessary libraries.

Initialize the random seeds so that runs are reproducible.
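The three setup steps above might look like this. The spaCy pipeline names (en_core_web_sm, de_core_news_sm) and the torchtext.legacy import path are assumptions about the environment, not code copied from the notebook.

```python
import random

import numpy as np
import spacy
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchtext.legacy import data, datasets

# Fix the random seeds so results are reproducible across runs.
SEED = 1234
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

# Load the spaCy pipelines, used only for tokenization.
# (Install them first: python -m spacy download en_core_web_sm de_core_news_sm)
spacy_en = spacy.load('en_core_web_sm')
spacy_de = spacy.load('de_core_news_sm')
```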

Load the WMT dataset and create English and German tokenizers.

Create text Fields for the source and target data.
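A sketch of the tokenizers and Fields, using the legacy torchtext API; the field options shown (lowercasing, <sos>/<eos> tokens, batch_first=True) are typical choices for this model rather than necessarily the notebook's exact settings.

```python
def tokenize_en(text):
    """Tokenize an English sentence into a list of token strings."""
    return [tok.text for tok in spacy_en.tokenizer(text)]


def tokenize_de(text):
    """Tokenize a German sentence into a list of token strings."""
    return [tok.text for tok in spacy_de.tokenizer(text)]


# batch_first=True because the convolutional model expects [batch_size, seq_len] inputs.
SRC = data.Field(tokenize=tokenize_en, init_token='<sos>', eos_token='<eos>',
                 lower=True, batch_first=True)
TRG = data.Field(tokenize=tokenize_de, init_token='<sos>', eos_token='<eos>',
                 lower=True, batch_first=True)
```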

Split the dataset into train, validation, and test sets.

Create train, validation, and test batches using a data loader with a given batch size.
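A sketch of the splitting and batching steps. The post uses a WMT corpus; Multi30k appears below only as a stand-in because it exposes the same splits() interface, so swap in the actual dataset as needed.

```python
# exts picks the file extensions: English as the source, German as the target.
train_data, valid_data, test_data = datasets.Multi30k.splits(
    exts=('.en', '.de'), fields=(SRC, TRG))

# Build the vocabularies from the training split only.
SRC.build_vocab(train_data, min_freq=2)
TRG.build_vocab(train_data, min_freq=2)

BATCH_SIZE = 128
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# BucketIterator groups sentences of similar length to minimise padding per batch.
train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data), batch_size=BATCH_SIZE, device=device)
```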

We divide the whole model into two parts:

A. Encoder

B. Decoder

Encoder

First, create the encoder class and initialise it with the necessary details.

First, the input tokens are turned into word embeddings. We also create positional embeddings that encode each token's position, because a CNN, unlike an RNN, has no built-in notion of the order of its input. We then add padding to the start and end of the sequence, since a convolution shortens the output according to its kernel size: a kernel of size 3, for example, reduces the length by 2, so we pad on both sides to compensate (the sentence itself is also wrapped in <sos> and <eos> tokens).

The word embeddings and positional embeddings are then combined element-wise (summed).

The combined embedding is then passed through a fully connected (linear) layer, and its output is fed to the convolutional layers.

First, we create a ModuleList from PyTorch's nn module to hold the CNN layers inside the encoder class. We also need to reorder the dimensions of the conv input: PyTorch's Conv1d expects input of shape [batch_size, hid_dim, src_len], but the linear layer produces [batch_size, src_len, hid_dim]. Using the permute function we can change the order.
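In code, that reordering is a single permute call (assuming conv_input holds the output of the linear layer):

```python
# conv_input arrives as [batch_size, src_len, hid_dim]; nn.Conv1d expects the
# channel dimension second, i.e. [batch_size, hid_dim, src_len].
conv_input = conv_input.permute(0, 2, 1)
```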

We create the convolutional network based on input_dim, output_dim, the number of layers, the kernel size, the maximum length, and the padding size.

Here we use GLU (Gated Linear Unit) as the activation function, as in the paper.

GLU (Gated Linear Unit) is a gated activation: like the gates in an LSTM, it controls how much information flows through the layer. It splits its input in half along the channel dimension and uses one half (passed through a sigmoid) to gate the other, so its output has half as many channels as its input. That is why we double the channels of the conv output beforehand (out_channels = 2 * hid_dim).
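A quick way to see the halving behaviour (plain PyTorch, independent of the notebook):

```python
x = torch.randn(1, 2 * 512, 30)  # [batch_size, 2 * hid_dim, src_len]
y = F.glu(x, dim=1)              # splits dim 1 in half and gates one half with the other
print(y.shape)                   # torch.Size([1, 512, 30])
```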

In the paper, the researchers also use residual (skip) connections, which give gradients a shorter path during backpropagation and help the loss decrease faster than with plainly stacked layers. The output of each convolutional layer is added to its input, and at the end of the stack a fully connected layer maps back to the embedding dimension, where the (word + positional) embedding is added to the output of the convolutional network.

The convolution operation repeats according to the number of layers and the kernel size. The encoder class returns two outputs: conved (the output of the CNN stack after the final fully connected layer) and combined (conved plus the source embedding, which gives the decoder access to the embedded input).
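Putting the pieces above together, here is a condensed sketch of such an encoder. It follows the structure described in the paper; the layer names, max_length, and the √0.5 scaling are conventions from common reimplementations rather than code lifted from the notebook, and it reuses the imports from the setup block above.

```python
class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, hid_dim, n_layers, kernel_size,
                 dropout, device, max_length=100):
        super().__init__()
        assert kernel_size % 2 == 1, "kernel size must be odd for symmetric padding"
        self.device = device
        self.scale = torch.sqrt(torch.FloatTensor([0.5])).to(device)

        self.tok_embedding = nn.Embedding(input_dim, emb_dim)
        self.pos_embedding = nn.Embedding(max_length, emb_dim)
        self.emb2hid = nn.Linear(emb_dim, hid_dim)
        self.hid2emb = nn.Linear(hid_dim, emb_dim)

        # Each conv outputs 2 * hid_dim channels so that GLU can halve it back to hid_dim.
        self.convs = nn.ModuleList([
            nn.Conv1d(in_channels=hid_dim, out_channels=2 * hid_dim,
                      kernel_size=kernel_size, padding=(kernel_size - 1) // 2)
            for _ in range(n_layers)])
        self.dropout = nn.Dropout(dropout)

    def forward(self, src):
        # src = [batch_size, src_len]
        batch_size, src_len = src.shape
        pos = torch.arange(0, src_len).unsqueeze(0).repeat(batch_size, 1).to(self.device)

        # Token and positional embeddings are summed element-wise.
        embedded = self.dropout(self.tok_embedding(src) + self.pos_embedding(pos))

        conv_input = self.emb2hid(embedded)       # [batch_size, src_len, hid_dim]
        conv_input = conv_input.permute(0, 2, 1)  # [batch_size, hid_dim, src_len]

        for conv in self.convs:
            conved = conv(self.dropout(conv_input))      # [batch_size, 2 * hid_dim, src_len]
            conved = F.glu(conved, dim=1)                # [batch_size, hid_dim, src_len]
            conved = (conved + conv_input) * self.scale  # residual (skip) connection
            conv_input = conved

        # Project back to the embedding size and add the original embedding.
        conved = self.hid2emb(conved.permute(0, 2, 1))   # [batch_size, src_len, emb_dim]
        combined = (conved + embedded) * self.scale
        return conved, combined
```

The two return values correspond exactly to the conved and combined outputs discussed above.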

Please follow the notebook linked above for a better understanding of the code behind each of the blocks described here.

Decoder

Now it's time for the decoder to decode the encoder's output and translate the source sentence. The decoder's construction is similar to the encoder's, except that we add an attention mechanism so it can look over all the input states for better predictions.

The main change is the attention block (with its own residual connection) added to each CNN layer: the encoder outputs are fed into the decoder's convolutional layers together with the target sentence.

Initially, the target sentence is padded at the start with kernel_size − 1 tokens (two, for a kernel of size 3), so that the CNN cannot predict a word by looking at that word or at future words. The target is then embedded, producing both word and positional embeddings, which are combined element-wise (summed).

In each CNN layer of the decoder, an attention mechanism is introduced. The decoder state (combined with its embedding) is compared against the encoder's conved output to produce attention scores, which are passed through a softmax to obtain attention probabilities. Those probabilities are used to take a weighted sum of the encoder's combined output; the result is projected back to the hidden dimension and added to the decoder state before the next layer. After the final layer, a fully connected network over the vocabulary produces word probabilities, and the decoder predicts the highest-probability word at each step.
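Here is a sketch of that attention step, written as a standalone function for readability. In the real decoder class the two projection layers (hid_dim → emb_dim and emb_dim → hid_dim) and the scale factor would live in __init__; the argument names here are illustrative, not the notebook's.

```python
def calculate_attention(embedded, conved, encoder_conved, encoder_combined,
                        attn_hid2emb, attn_emb2hid, scale):
    # embedded         = [batch_size, trg_len, emb_dim]  (decoder embedding)
    # conved           = [batch_size, hid_dim, trg_len]  (decoder conv output after GLU)
    # encoder_conved   = [batch_size, src_len, emb_dim]
    # encoder_combined = [batch_size, src_len, emb_dim]
    conved_emb = attn_hid2emb(conved.permute(0, 2, 1))  # project decoder state to emb_dim
    combined = (conved_emb + embedded) * scale          # mix in the decoder embedding

    # Dot-product scores against the encoder's conved output, softmax over src_len.
    energy = torch.matmul(combined, encoder_conved.permute(0, 2, 1))
    attention = F.softmax(energy, dim=2)                # [batch_size, trg_len, src_len]

    # Weighted sum of the encoder's combined output, projected back to hid_dim.
    attended = torch.matmul(attention, encoder_combined)
    attended = attn_emb2hid(attended)

    # Residual connection with the decoder's conved state.
    attended_combined = (conved + attended.permute(0, 2, 1)) * scale
    return attention, attended_combined
```

The rest of the decoder mirrors the encoder's conv + GLU + residual loop, with this attention step inserted in every layer.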

Training and Inference

We create a Seq2Seq class that instantiates both the encoder and the decoder, passes the source sentence through the encoder, and feeds the encoder outputs to the decoder, which in turn returns the predictions and the attention weights.
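A minimal wrapper along those lines (a sketch, assuming the Encoder and Decoder interfaces described above):

```python
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, src, trg):
        # Encode the source once; the decoder attends over both encoder outputs.
        encoder_conved, encoder_combined = self.encoder(src)
        # output    = [batch_size, trg_len, output_dim]
        # attention = [batch_size, trg_len, src_len]
        output, attention = self.decoder(trg, encoder_conved, encoder_combined)
        return output, attention
```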

After creating the classes, we train the model with our chosen hyperparameters: kernel size, input dimension, embedding dimension, and so on.

The model is trained for a set number of epochs, keeping the checkpoint with the lowest validation loss.
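A sketch of how the model might be built and trained; the hyperparameter values are illustrative, and the Decoder constructor is assumed to mirror the encoder's signature plus the target pad index.

```python
INPUT_DIM, OUTPUT_DIM = len(SRC.vocab), len(TRG.vocab)
EMB_DIM, HID_DIM, N_LAYERS, KERNEL_SIZE, DROPOUT = 256, 512, 10, 3, 0.25
TRG_PAD_IDX = TRG.vocab.stoi[TRG.pad_token]

enc = Encoder(INPUT_DIM, EMB_DIM, HID_DIM, N_LAYERS, KERNEL_SIZE, DROPOUT, device)
dec = Decoder(OUTPUT_DIM, EMB_DIM, HID_DIM, N_LAYERS, KERNEL_SIZE, DROPOUT,
              TRG_PAD_IDX, device)
model = Seq2Seq(enc, dec).to(device)

optimizer = torch.optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss(ignore_index=TRG_PAD_IDX)

for epoch in range(10):
    model.train()
    for batch in train_iterator:
        src, trg = batch.src, batch.trg
        optimizer.zero_grad()
        # Teacher forcing: feed the target without its last token and compare
        # the predictions against the target without its initial <sos> token.
        output, _ = model(src, trg[:, :-1])
        output = output.contiguous().view(-1, output.shape[-1])
        loss = criterion(output, trg[:, 1:].contiguous().view(-1))
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.1)
        optimizer.step()
    # (Evaluate on valid_iterator here and keep the checkpoint with the lowest loss.)
```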

Later, for inference, we create a function called translate_sentence that takes an input sentence and predicts the output sentence.
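A typical greedy-decoding implementation of such a function looks roughly like this (a sketch, not the notebook's exact code; it reuses spacy_en and the fields defined earlier):

```python
def translate_sentence(sentence, src_field, trg_field, model, device, max_len=50):
    """Greedily decode one English sentence into German."""
    model.eval()
    tokens = [tok.text.lower() for tok in spacy_en.tokenizer(sentence)]
    tokens = [src_field.init_token] + tokens + [src_field.eos_token]
    src_indexes = [src_field.vocab.stoi[t] for t in tokens]
    src_tensor = torch.LongTensor(src_indexes).unsqueeze(0).to(device)

    with torch.no_grad():
        encoder_conved, encoder_combined = model.encoder(src_tensor)

    # Start with <sos> and repeatedly feed the growing prefix back into the decoder.
    trg_indexes = [trg_field.vocab.stoi[trg_field.init_token]]
    for _ in range(max_len):
        trg_tensor = torch.LongTensor(trg_indexes).unsqueeze(0).to(device)
        with torch.no_grad():
            output, attention = model.decoder(trg_tensor, encoder_conved, encoder_combined)
        pred_token = output.argmax(2)[:, -1].item()
        trg_indexes.append(pred_token)
        if pred_token == trg_field.vocab.stoi[trg_field.eos_token]:
            break

    trg_tokens = [trg_field.vocab.itos[i] for i in trg_indexes]
    return trg_tokens[1:], attention  # drop the leading <sos>
```

Called as translate_sentence('two people are walking.', SRC, TRG, model, device), it returns the predicted German tokens and the final attention weights.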

To visualize the attention mechanism, we create a function that shows the attention of each output word over the input words.
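One way to draw that as a heatmap with matplotlib (the function name and arguments here are illustrative):

```python
import matplotlib.pyplot as plt


def display_attention(src_tokens, trg_tokens, attention):
    """Heatmap of attention: rows are predicted tokens, columns are source tokens."""
    attn = attention.squeeze(0).cpu().detach().numpy()  # [trg_len, src_len]
    fig, ax = plt.subplots(figsize=(10, 10))
    ax.matshow(attn, cmap='bone')
    # The token lists should match the attention matrix dimensions.
    ax.set_xticks(range(len(src_tokens)))
    ax.set_xticklabels(src_tokens, rotation=45)
    ax.set_yticks(range(len(trg_tokens)))
    ax.set_yticklabels(trg_tokens)
    plt.show()
```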

Future Scope

Implementing a CNN for text-based tasks improves overall performance, as it is much faster than an RNN. CNN-based seq2seq models can be used for machine translation, language modelling, audio generation, and more.

I hope you get the gist of the whole paper. I haven't explained every step in this blog, but you can find everything in the notebook; this post is meant to give an overall picture of the paper. I hope you understood the nuances of the topic, as it combines several ideas such as convolutions, encoder-decoder models, and GLU.

Resources for a better understanding of the above concepts:

GLU(Gated Linear Unit)

Attention is all you need

Convolution detailed explanation

Slides from the paper

Summary of the paper
