POST /api/v3/kwaivgi/kling-lipsync/audio-to-video
Audio To Video (Lip Sync)
curl --request POST \
  --url https://gptproto.com/api/v3/kwaivgi/kling-lipsync/audio-to-video \
  --header 'Authorization: <authorization>' \
  --header 'Content-Type: application/json' \
  --data '{
  "audio": "<string>",
  "video": "<string>"
}'

Overview

Kling LipSync takes an audio track and a source video and produces a talking-head video whose lip movements are synchronized to the audio. It is available as a ready-to-use REST inference API with no cold starts and affordable pricing.

Authentication

This endpoint requires authentication using a Bearer token.
Authorization
string
default:"sk-***********"
required
Your API key, passed as the value of the Authorization header (for example, sk-***********).

Request Body

audio
string
required
URL of the audio file that drives the lip movements. Supported formats: .mp3, .wav, .m4a, .aac. Maximum file size: 5MB.
video
string
required
URL of the video file whose lip movements will be synchronized to the audio. Supported formats: .mp4, .mov. Maximum file size: 100MB. Duration: at least 2s and at most 10s. Only 720p and 1080p are supported, and both width and height must be between 720px and 1920px.
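
The limits above can be checked on the client before a request is sent. The following is a minimal sketch, not part of the API: the names `VideoInfo`, `check_audio`, and `check_video` are hypothetical, the caller is assumed to have already measured file size, duration, and dimensions, and 1MB is assumed to mean 1024 * 1024 bytes. The 720p/1080p rule is not checked because the page does not define it precisely.

```python
import os
from dataclasses import dataclass
from urllib.parse import urlparse

AUDIO_EXTENSIONS = {".mp3", ".wav", ".m4a", ".aac"}
VIDEO_EXTENSIONS = {".mp4", ".mov"}
MAX_AUDIO_BYTES = 5 * 1024 * 1024    # 5MB (assumes 1MB = 1024 * 1024 bytes)
MAX_VIDEO_BYTES = 100 * 1024 * 1024  # 100MB (same assumption)


@dataclass
class VideoInfo:
    """Hypothetical container for metadata the caller has already probed."""
    size_bytes: int
    duration_s: float
    width: int
    height: int


def url_extension(url: str) -> str:
    """Return the lowercase extension of the URL path, ignoring any query string."""
    return os.path.splitext(urlparse(url).path)[1].lower()


def check_audio(url: str, size_bytes: int) -> list[str]:
    """Return a list of violated audio constraints (empty if none)."""
    problems = []
    if url_extension(url) not in AUDIO_EXTENSIONS:
        problems.append("audio format must be .mp3, .wav, .m4a, or .aac")
    if size_bytes > MAX_AUDIO_BYTES:
        problems.append("audio file exceeds 5MB")
    return problems


def check_video(url: str, info: VideoInfo) -> list[str]:
    """Return a list of violated video constraints (empty if none)."""
    problems = []
    if url_extension(url) not in VIDEO_EXTENSIONS:
        problems.append("video format must be .mp4 or .mov")
    if info.size_bytes > MAX_VIDEO_BYTES:
        problems.append("video file exceeds 100MB")
    if not 2 <= info.duration_s <= 10:
        problems.append("video duration must be between 2s and 10s")
    if not (720 <= info.width <= 1920 and 720 <= info.height <= 1920):
        problems.append("video width and height must be between 720px and 1920px")
    return problems
```

An empty list from both functions means none of the documented limits covered by this sketch were violated; the server remains the final authority.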

Request Example

curl --location 'https://gptproto.com/api/v3/kwaivgi/kling-lipsync/audio-to-video' \
--header 'Authorization: YOUR_API_KEY' \
--header 'Content-Type: application/json' \
--data '{
  "audio": "https://d1q70pf5vjeyhc.cloudfront.net/predictions/1a91a2f60dff4594b2ad9d5396aef8de/1.mp3",
  "video": "https://d2p7pge43lyniu.cloudfront.net/output/2e375cce-8989-4d72-9629-be85e616b295-u1_07850bf7-c365-46c6-a636-9e5fa6822173.mp4"
}'
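
The same request can be issued from Python using only the standard library. This is a sketch under assumptions: `build_request` and `submit` are hypothetical helper names, and the API key is sent as the raw Authorization header value, mirroring the curl example above. Because the Authentication section mentions a Bearer token, a `Bearer ` prefix may be required in practice.

```python
import json
import urllib.request

ENDPOINT = "https://gptproto.com/api/v3/kwaivgi/kling-lipsync/audio-to-video"


def build_request(api_key: str, audio_url: str, video_url: str) -> urllib.request.Request:
    """Build the POST request without sending it."""
    body = json.dumps({"audio": audio_url, "video": video_url}).encode("utf-8")
    return urllib.request.Request(
        ENDPOINT,
        data=body,
        method="POST",
        headers={
            # Raw key, as in the curl example; a "Bearer " prefix is an open question.
            "Authorization": api_key,
            "Content-Type": "application/json",
        },
    )


def submit(api_key: str, audio_url: str, video_url: str, timeout: float = 30.0) -> dict:
    """Send the request and return the decoded JSON body.

    urllib raises urllib.error.HTTPError for 4xx/5xx responses, so callers
    that want to inspect an error body need to catch that exception.
    """
    request = build_request(api_key, audio_url, video_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Separating `build_request` from `submit` keeps the request construction inspectable without any network access.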

Response

data.id
string
Unique identifier for the prediction (the task ID)
data.status
string
Status of the task: created, processing, completed, or failed
Example error response:
{
    "error": {
        "message": "Invalid signature",
        "type": "401"
    }
}
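
A decoded response body can be handled along the following lines. This is a sketch, assuming a successful body nests `id` and `status` under a top-level `data` object (inferred from the dotted field names above) and that a failure carries the `error` object shown in the example. `Task`, `ApiError`, and `parse_response` are hypothetical names.

```python
from dataclasses import dataclass


class ApiError(Exception):
    """Raised when the response body contains an "error" object."""

    def __init__(self, message: str, error_type: str):
        super().__init__(f"{error_type}: {message}")
        self.message = message
        self.error_type = error_type


@dataclass
class Task:
    id: str
    status: str  # created, processing, completed, or failed


def parse_response(payload: dict) -> Task:
    """Turn a decoded JSON body into a Task, or raise ApiError."""
    error = payload.get("error")
    if error:
        raise ApiError(error.get("message", "unknown error"), str(error.get("type", "")))
    data = payload["data"]  # assumed envelope; raises KeyError if the shape differs
    return Task(id=data["id"], status=data["status"])
```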

Usage Notes

  • The video should contain a clear, visible face for optimal lip synchronization
  • Audio duration should match or be shorter than the video duration
  • Processing time varies based on video length and complexity
  • Use the task ID returned in the response to query the generation status via the Query Task endpoint
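
Polling for completion might look like the sketch below. The Query Task endpoint's URL and response shape are not given on this page, so the lookup is passed in as a caller-supplied `fetch_status` function (hypothetical) that returns one of the documented status strings. The interval and timeout values are arbitrary defaults, not recommendations from the API.

```python
import time
from typing import Callable

TERMINAL_STATUSES = {"completed", "failed"}


def wait_for_task(
    task_id: str,
    fetch_status: Callable[[str], str],
    interval_s: float = 2.0,
    timeout_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll fetch_status(task_id) until it returns a terminal status.

    Returns "completed" or "failed"; raises TimeoutError if neither is
    reached within timeout_s seconds.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status(task_id)
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task {task_id} still {status!r} after {timeout_s}s")
        sleep(interval_s)
```

Injecting `sleep` makes the loop easy to exercise in tests without real delays.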