A CNN-LSTM model that generates a sentence describing the contents or scene of an image and establishes the spatial relationships (position, activity, etc.) among the entities in it.
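To illustrate the data flow such a model implies, here is a minimal, dependency-free sketch. It assumes the common arrangement in which a CNN encodes the image into a feature vector and an LSTM decodes that vector into words one at a time. Everything in it is hypothetical and not taken from this project: the toy vocabulary, the sizes `FEAT` and `HID`, the helper names, and the random, untrained weights. The CNN itself is replaced by a hand-written feature vector, so the output words are meaningless; only the encode-then-decode loop is the point.

```python
import math
import random

# Hypothetical toy vocabulary and sizes, chosen only for illustration.
VOCAB = ["<start>", "<end>", "a", "dog", "on", "grass"]
FEAT, HID = 4, 5  # image feature size, LSTM hidden/embedding size

rng = random.Random(0)  # fixed seed so the sketch is deterministic


def rand_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Untrained placeholders standing in for learned parameters (biases omitted).
W_init = rand_matrix(HID, FEAT)          # image features -> initial hidden state
EMBED = rand_matrix(len(VOCAB), HID)     # one embedding row per word
W_gates = rand_matrix(4 * HID, 2 * HID)  # input, forget, output, candidate
W_out = rand_matrix(len(VOCAB), HID)     # hidden state -> vocabulary scores


def lstm_step(x, h, c):
    """One LSTM cell update for input x, hidden state h, cell state c."""
    z = matvec(W_gates, x + h)  # x + h concatenates the two lists
    i = [sigmoid(v) for v in z[0:HID]]
    f = [sigmoid(v) for v in z[HID:2 * HID]]
    o = [sigmoid(v) for v in z[2 * HID:3 * HID]]
    g = [math.tanh(v) for v in z[3 * HID:4 * HID]]
    c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
    h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
    return h_new, c_new


def generate_caption(image_features, max_len=10):
    """Greedy decoding: feed each predicted word back in as the next input."""
    h = [math.tanh(v) for v in matvec(W_init, image_features)]
    c = [0.0] * HID
    word = VOCAB.index("<start>")
    caption = []
    for _ in range(max_len):
        h, c = lstm_step(EMBED[word], h, c)
        scores = matvec(W_out, h)
        word = max(range(len(VOCAB)), key=scores.__getitem__)
        if VOCAB[word] == "<end>":
            break
        caption.append(VOCAB[word])
    return caption


features = [0.2, -0.1, 0.7, 0.3]  # stand-in for the output of a CNN encoder
print(generate_caption(features))
```

In a real system the feature vector would come from a trained convolutional network and the weights would be learned from image-caption pairs. Greedy decoding is used here for brevity; beam search is a common alternative.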