This repository was archived by the owner on Dec 16, 2025. It is now read-only.

Gemini multi-image capability seems unstable and fails a very basic question #706

@ChenChengKuan

Description


Description of the bug:

Hello,
I followed the sample code for analyzing multiple images in the official documentation.

One odd thing I found is that the model cannot correctly identify how many images were uploaded, so it only responds about one of them.

These are the images I tested with (I think any images would reproduce the problem):

fire
lighthouse
tree

```python
from google import genai
from PIL import Image

img1 = Image.open("fire.jpg")
img2 = Image.open("lighthouse.jpg")
img3 = Image.open("tree.jpg")

MODEL_NAME = 'gemini-2.0-flash-001'
client = genai.Client(api_key="My api key")
response = client.models.generate_content(model=MODEL_NAME,
                                          contents=['Describe each image', img1, img2, img3])
print(response.text)
```

The output only describes the last image:
The image shows a single, large tree standing in a green field against a bright blue sky with scattered white clouds. The tree has a full, round crown of green leaves. The trunk of the tree is thick and gray. The field is a vibrant green, and the sky is a clear blue with wispy white clouds. The image is well-lit and has a clear, crisp focus.

When I then ask "How many images do you see?":

```python
response = client.models.generate_content(model=MODEL_NAME,
                                          contents=['How many images do you see?', img1, img2, img3])
print(response.text)
```

I see one image.
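Because the reply to this question seems to vary between runs, one way to quantify the instability is to send the same request repeatedly and tally the counts the model reports. This is a minimal sketch under stated assumptions: `parse_count` and `tally_counts` are hypothetical helpers (not part of the google-genai SDK), the parser only recognizes digits and the words "one" to "ten", and it takes the first number in the reply, so "I see one image and 15 cropped tiles" is tallied as 1.

```python
import re
from collections import Counter
from typing import Callable, Optional

# Hypothetical helper data: word forms the model might use instead of digits.
WORD_NUMBERS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
                "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10}

_COUNT_RE = re.compile(r"\b(\d+|" + "|".join(WORD_NUMBERS) + r")\b")


def parse_count(text: str) -> Optional[int]:
    """Return the first number (digits or a small number word) in a reply, else None."""
    match = _COUNT_RE.search(text.lower())
    if match is None:
        return None
    token = match.group(1)
    return int(token) if token.isdigit() else WORD_NUMBERS[token]


def tally_counts(ask: Callable[[], str], trials: int) -> Counter:
    """Call `ask` `trials` times and count how often each parsed answer occurs."""
    return Counter(parse_count(ask()) for _ in range(trials))


# Hypothetical wiring with the client and images from the snippets above:
# ask = lambda: client.models.generate_content(
#     model=MODEL_NAME, contents=['How many images do you see?', img1, img2, img3]).text
# print(tally_counts(ask, trials=20))
```

Passing the request as a zero-argument callable keeps the tally logic independent of the SDK, so the same function can compare runs with and without the system instruction.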

One workaround I found is to add a system prompt that forces the model to consider all uploaded images before answering, but this feels very unnatural:

```python
from google.genai import types
config = types.GenerateContentConfig(system_instruction="Consider all images uploaded by users before answering any question")
response = client.models.generate_content(model=MODEL_NAME, config=config, contents=['Describe each image', img1, img2, img3])
print(response.text)
```

The output:
Here are the descriptions of each image:

  1. A bonfire that appears to be made out of branches is ablaze, with flames reaching high into the sky.
  2. A lighthouse stands on a shore by the ocean, its light shining across the water under a purple sky with the setting sun.
  3. A large green tree stands tall in a field of grass under a bright blue sky with fluffy white clouds.

When asking how many images were uploaded:

```python
config = types.GenerateContentConfig(system_instruction="Consider all images uploaded by users before answering any question")
response = client.models.generate_content(model=MODEL_NAME, config=config, contents=['How many images do you see?', img1, img2, img3])
print(response.text)
```

I see 3 images.
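A variant that was not tested in this report is to place a numbered text label before each image in `contents`, so the prompt itself marks where each image starts. The helper `labeled_contents` is hypothetical; it only builds the list that would be passed as `contents=` and does not call the API, and whether it changes the model's count is an open question.

```python
from typing import Any, List, Sequence


def labeled_contents(question: str, images: Sequence[Any]) -> List[Any]:
    """Build [question, "Image 1:", img1, "Image 2:", img2, ...] (hypothetical helper)."""
    contents: List[Any] = [question]
    for index, image in enumerate(images, start=1):
        contents.append(f"Image {index}:")  # text label placed right before each image
        contents.append(image)
    return contents


# Hypothetical usage with the objects from the snippets above:
# client.models.generate_content(model=MODEL_NAME,
#                                contents=labeled_contents('Describe each image', [img1, img2, img3]))
```

The question stays first to match the ordering used in the original snippets, so only the labels differ between the two requests.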

Actual vs expected behavior:

I would expect the model to know how many images to analyze without extra prompting. Needing a system prompt for this feels awkward.

Any other information you'd like to share?

What I have tried:

  1. Appending the correct number of image tags (<image>) to the end of the question helps, but it is not as reliable as the system prompt.
  2. Sometimes, when asked how many images there are, the model also counts the cropped tiles. It answers something like "I see one image and X cropped tiles", or just gives the total of original plus cropped images (in this case 3 + 15 = 18), which is confusing from a user's perspective. This is somewhat hard to trigger, but it shows up if you try several times.
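The `<image>` tag workaround from item 1 can be sketched as a small helper that appends one tag per image to the question. This is a minimal sketch: `with_image_tags` and its `tag` parameter are hypothetical, and nothing here assumes the model treats the literal `<image>` string specially; it only reproduces the prompt shape described above.

```python
from typing import Any, List, Sequence


def with_image_tags(question: str, images: Sequence[Any], tag: str = "<image>") -> List[Any]:
    """Append one `tag` per image to the question, then list the images (hypothetical helper)."""
    tags = " ".join(tag for _ in images)
    prompt = f"{question} {tags}" if tags else question  # no trailing space when there are no images
    return [prompt, *images]


# Hypothetical usage with the objects from the snippets above:
# client.models.generate_content(model=MODEL_NAME,
#                                contents=with_image_tags('How many images do you see?', [img1, img2, img3]))
```

Deriving the tag count from `len(images)` keeps it in sync with the images actually sent, which is the "correct number" the workaround depends on.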
