This is probably because it can't actually understand images; it's relying on other services to handle them. It can do an image search to guess what something might be an image of, or use OCR to extract text, but it fails on tasks that involve the idiosyncrasies of each particular image.
I believe it uses something like an image-to-text model to give the LLM a general understanding of the image, maybe even in the form of embedding vectors. That makes it good at summarizing the content of an image but unable to do the operations you listed.
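A minimal sketch of that hypothesized pipeline (this is my assumption about the architecture, not Bard's actual implementation; the service functions here are hypothetical stand-ins). The point it illustrates: if the LLM only ever sees a lossy text summary, two images that differ at the pixel level can collapse to the same description, so tasks like spot-the-difference or counting become impossible in principle.

```python
def caption_service(image_pixels):
    # Stand-in for an image-to-text model: returns a coarse,
    # fixed-vocabulary description regardless of fine detail.
    return "a photo of a small grid pattern"

def ocr_service(image_pixels):
    # Stand-in for OCR: would extract any embedded text (none here).
    return ""

def ask_llm_about_image(image_pixels, question):
    # The "LLM" only receives text, never pixels, so pixel-level
    # questions are unanswerable no matter how good the LLM is.
    context = (caption_service(image_pixels) + " " + ocr_service(image_pixels)).strip()
    return f"Context: '{context}'. Question: {question}"

# Two images that differ in exactly one pixel...
img_a = [[0, 1], [1, 0]]
img_b = [[0, 1], [1, 1]]

# ...produce identical summaries, so the difference is lost upstream.
print(caption_service(img_a) == caption_service(img_b))  # True
```

This would also explain why OCR-style tasks work well: the text extraction step preserves exactly the information those tasks need, while everything else is thrown away.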
Some things I tried that it failed at:
* Simple "spot the difference" a young child could do
* Counting chess pieces on a board or coins in a Mario screenshot
* Evaluating a graph of y=x and one of y=x^2
* Explaining a meme image (said it can't do anything with images of people)
* Taking a screenshot & giving HTML for it (gave random HTML unrelated to the image)
* Taking an image and converting it to SVG (gave random SVG unrelated to the image)
* Describing a photo of a car (it was parked on a road, Bard said it was driving in front of a brick wall)
It was quite good at OCR though, even on images that are pretty tough for other models (e.g. serial numbers on industrial parts).