it's demonstrated that the simple pre-schooling task of predicting which caption goes with which picture is really an economical and scalable way to find out SOTA picture representations from scratch on the dataset of https://k2spiceshop.com/product/liquid-k2-on-paper-online/