Vision And Language Understanding With Localized Evidence