Well, some of you might say “A white dog in a grassy area”, some may say “White dog with brown spots”, and yet others might say “A dog on grass and some pink flowers”.
Definitely, all of these captions are relevant for this image, and there may be others as well. But the point I want to make is this: it’s so easy for us, as human beings, to just glance at a picture and describe it in appropriate language. Even a 5-year-old could do this with utmost ease.
But can you write a computer program that takes an image as input and produces a relevant caption as output?