Image to Image

Overview

Google Gemini 2.5 Flash Image, a powerful new image generation and editing model with advanced features and creative control.

Image + Text-to-Image (Editing): Provide an image and use text prompts to add, remove, or modify elements, change the style, or adjust the color grading.
Multi-Image to Image (Composition & Style Transfer): Use multiple input images to compose a new scene or transfer the style from one image to another.
Iterative Refinement: Engage in a conversation to progressively refine your image over multiple turns, making small adjustments until it’s perfect.
High-Fidelity Text Rendering: Accurately generate images that contain legible and well-placed text, ideal for logos, diagrams, and posters.

Supported inputs & outputs :

Inputs: Text and Images Outputs: Text and image

Authentication

This endpoint requires authentication using a Bearer token.

Authorization

string

default:"sk-***********"

required

Your API key in the format: YOUR_API_KEY

Request Body

contents

array

required

Show properties

parts

array

required

Hide properties

text

string

required

The prompt for the generation.

Hide properties

inline_data

object

required

Hide properties

mime_type

string

required

Supported MIME types:image/png, image/jpeg, image/webp

Hide properties

data

string

required

image base64.

generationConfig

object

Show properties

responseModalities

string[]

The model defaults to returning text and image responses (['Text', 'Image']). You can configure the response to return only images without text using ( ['Image']).

imageConfig

object

Hide properties

aspectRatio

string

1:1, 3:2, 2:3, 3:4, 4:3, 4:5, 5:4, 9:16, 16:9, 21:9

curl -X POST "https://gptproto.com/v1beta/models/gemini-2.5-flash-image:generateContent" \
  -H "Authorization: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
  "contents": [
    {
      "parts": [
        {
          "text": "Using the provided image of my cat, please add a small, knitted wizard hat on its head. Make it look like it's sitting comfortably and not falling off."
        },
        {
          "inline_data": {
            "mime_type": "image/jpeg",
            "data": "iVBORw0KGgoAAAANSUhEUgAAANQAAAFPCA...."
          }
        }
      ]
    }
  ],
  "generationConfig": {
    "responseModalities": [
      "TEXT",
      "IMAGE"
    ]
  }
}'

Technical specifications

Maximum images per prompt: 3
Maximum image size: 7 MB
Supported aspect ratios: 1:1, 3:2, 2:3, 3:4, 4:3, 4:5, 5:4, 9:16, 16:9, 21:9
Supported MIME types: image/png, image/jpeg, image/webp

Generate Content

You can optionally configure the response modalities and aspect ratio of the model’s output in the config field of generate_content calls.

Output types

The model defaults to returning text and image responses (i.e. response_modalities=['Text', 'Image']). You can configure the response to return only images without text using response_modalities=['Image'].

Aspect ratios

The model defaults to matching the output image size to that of your input image, or otherwise generates 1:1 squares. You can control the aspect ratio of the output image using the aspect_ratio field under image_config in the response request: The different ratios available and the size of the image generated are listed in this table:

Aspect ratio	Resolution
1:1	1024x1024
2:3	832x1248
3:2	1248x832
3:4	864x1184
4:3	1184x864
4:5	896x1152
5:4	1152x896
9:16	768x1344
16:9	1344x768
21:9	1536x672

Response

{
    "candidates": [
        {
            "content": {
                "role": "model",
                "parts": [
                    {
                        "inlineData": {
                            "mimeType": "image/png",
                            "data": "image base64"
                        }
                    }
                ]
            },
            "finishReason": "STOP"
        }
    ],
    "usageMetadata": {
        "promptTokenCount": 1302,
        "candidatesTokenCount": 1290,
        "totalTokenCount": 2592,
        "thoughtsTokenCount": 0,
        "promptTokensDetails": [
            {
                "modality": "IMAGE",
                "tokenCount": 1290
            },
            {
                "modality": "TEXT",
                "tokenCount": 12
            }
        ]
    },
    "modelVersion": "gemini-2.5-flash-image"
}

{
  "error": {
    "message": "Invalid signature",
    "type": "401"
  }
}

Request Example

Adding and removing elements

Provide an image and describe your change. The model will match the original image’s style, lighting, and perspective.

curl --location 'https://gptproto.com/v1beta/models/gemini-2.5-flash-image:generateContent' \
--header 'Authorization: sk-xxxx' \
--header 'Content-Type: application/json' \
--data '{
      "contents": [{
        "parts":[
            {"text": "Using the provided image of my cat, please add a small, knitted wizard hat on its head. Make it look like it's sitting comfortably and not falling off."},
            {
              "inline_data": {
                "mime_type":"image/png",
                "data": "iVBORw0KGgoAAAANSUhEUgAAANQAAAFPCA...."
              }
            }
        ]
      }]
    }'

Input	Output

A photorealistic picture of a fluffy ginger cat…	Using the provided image of my cat, please add a small, knitted wizard hat…

Advanced composition: Combining multiple images

Provide multiple images as context to create a new, composite scene. This is perfect for product mockups or creative collages.

curl --location 'https://gptproto.com/v1beta/models/gemini-2.5-flash-image:generateContent' \
--header 'Authorization: sk-xxxx' \
--header 'Content-Type: application/json' \
--data '{
      "contents": [{
        "parts":[
            {
              "inline_data": {
                "mime_type":"image/png",
                "data": "iVBORw0KGgoAAAANSUhEUgAAANQAAAFPCA...."
              }
            },
            {
              "inline_data": {
                "mime_type":"image/png",
                "data": "{{gemini_png_base64_2}}"
              }
            },
            {"text": "Create a professional e-commerce fashion photo. Take the blue floral dress from the first image and let the woman from the second image wear it. Generate a realistic, full-body shot of the woman wearing the dress, with the lighting and shadows adjusted to match the outdoor environment."}
        ]
      }]
    }'

Input1	Input2	Output

A professionally shot photo of a blue floral summer dress…	Full-body shot of a woman with her hair in a bun…	Create a professional e-commerce fashion photo…

Best Practices

To elevate your results from good to great, incorporate these professional strategies into your workflow.

Be Hyper-Specific: The more detail you provide, the more control you have. Instead of “fantasy armor,” describe it: “ornate elven plate armor, etched with silver leaf patterns, with a high collar and pauldrons shaped like falcon wings.”
Provide Context and Intent: Explain the purpose of the image. The model’s understanding of context will influence the final output. For example, “Create a logo for a high-end, minimalist skincare brand” will yield better results than just “Create a logo.”
Iterate and Refine: Don’t expect a perfect image on the first try. Use the conversational nature of the model to make small changes. Follow up with prompts like, “That’s great, but can you make the lighting a bit warmer?” or “Keep everything the same, but change the character’s expression to be more serious.”
Use Step-by-Step Instructions: For complex scenes with many elements, break your prompt into steps. “First, create a background of a serene, misty forest at dawn. Then, in the foreground, add a moss-covered ancient stone altar. Finally, place a single, glowing sword on top of the altar.”
Use “Semantic Negative Prompts”: Instead of saying “no cars,” describe the desired scene positively: “an empty, deserted street with no signs of traffic.”
Control the Camera: Use photographic and cinematic language to control the composition. Terms like wide-angle shot, macro shot, low-angle perspective.

Limitations

For best performance, use the following languages: EN, es-MX, ja-JP, zh-CN, hi-IN.
Image generation does not support audio or video inputs.
The model won’t always follow the exact number of image outputs that the user explicitly asks for.
The model works best with up to 3 images as an input.
When generating text for an image, Gemini works best if you first generate the text and then ask for an image with the text.
Uploading images of children is not currently supported in EEA, CH, and UK.

API Reference

​Overview

​Supported inputs & outputs :

​Authentication

​Request Body

​Image to Image

​Generate Content

​Output types

​Aspect ratios

​Response

​Request Example

​Adding and removing elements

​Advanced composition: Combining multiple images

​Best Practices

​Limitations

Overview

Supported inputs & outputs :

Authentication

Request Body

Image to Image

Generate Content

Output types

Aspect ratios

Response

Request Example

Adding and removing elements

Advanced composition: Combining multiple images

Best Practices

Limitations