Work in Progress: This page is under development. Use the feedback button on the bottom right to help us improve it.

Schema

Schemas define the structure and format of data in your tables. Every source and sink table requires a schema that specifies the data format and field definitions.

Schema Object Structure

A schema consists of:

PropertyRequiredDescription
formatYesData serialization format (JSON, Avro, Parquet, etc.)
fieldsYesArray of field definitions
primary_keysNoArray of field names that form the primary key
bad_dataNoHow to handle malformed records (Fail or Drop)
inferredNoWhether the schema was auto-inferred

format

json

The most common format for streaming data.

OptionTypeDefaultDescription
confluentSchemaRegistrybooleanfalseUse Confluent Schema Registry
schemaIdnumber-Schema version ID from registry
includeSchemabooleanfalseInclude schema in each message
debeziumbooleanfalseEnable Debezium CDC mode
unstructuredbooleanfalseTreat payload as single unstructured field
timestampFormatstring"rfc3339"rfc3339 or unix_millis
decimalEncodingstring"number"number, string, or bytes
{
  "format": {
    "json": {
      "timestampFormat": "rfc3339"
    }
  }
}

avro

Apache Avro binary format.

OptionTypeDefaultDescription
confluentSchemaRegistrybooleanfalseUse Confluent Schema Registry
rawDatumsbooleanfalseRaw Avro datums without container
intoUnstructuredJsonbooleanfalseConvert to unstructured JSON
readerSchemastring-Override reader schema
schemaIdnumber-Schema version ID
{
  "format": {
    "avro": {
      "confluentSchemaRegistry": true
    }
  }
}

parquet

Apache Parquet columnar format, typically used for sink tables.

OptionTypeDefaultDescription
compressionstring"uncompressed"uncompressed, snappy, gzip, zstd, lz4
rowGroupBytesnumber-Target row group size in bytes
{
  "format": {
    "parquet": {
      "compression": "snappy",
      "rowGroupBytes": 134217728
    }
  }
}

protobuf

Protocol Buffers binary format.

OptionTypeDefaultDescription
messageNamestring-Protobuf message type name
compiledSchemabytes-Compiled protobuf descriptor
confluentSchemaRegistrybooleanfalseUse Confluent Schema Registry
lengthDelimitedbooleanfalseUse length-delimited framing
intoUnstructuredJsonbooleanfalseConvert to unstructured JSON
{
  "format": {
    "protobuf": {
      "confluentSchemaRegistry": true,
      "messageName": "MyMessage"
    }
  }
}

rawString

Single UTF-8 string field. Requires exactly one field of type Utf8.

{
  "format": {
    "rawString": {}
  }
}

rawBytes

Single binary field. Requires exactly one field of type Bytes.

{
  "format": {
    "rawBytes": {}
  }
}

fields

Primitive Types

TypeDescriptionSQL Equivalent
Int3232-bit signed integerINTEGER
Int6464-bit signed integerBIGINT
UInt3232-bit unsigned integer-
UInt6464-bit unsigned integer-
F3232-bit floating pointFLOAT
F6464-bit floating pointDOUBLE
BoolBooleanBOOLEAN
Utf8UTF-8 stringVARCHAR / TEXT
BytesBinary dataBYTEA
DateTimeRFC3339 datetimeTIMESTAMP
Date32Date (days since epoch)DATE
UnixMillisUnix timestamp (milliseconds)TIMESTAMP
UnixMicrosUnix timestamp (microseconds)TIMESTAMP
UnixNanosUnix timestamp (nanoseconds)TIMESTAMP
JsonJSON dataJSON

Field Definition

Each field in the fields array has this structure:

{
  "field_name": "user_id",
  "field_type": {
    "type": {
      "primitive": "Int64"
    }
  },
  "nullable": false
}

struct (Nested Objects)

For nested objects, use the struct type with nested fields:

{
  "field_name": "address",
  "field_type": {
    "type": {
      "struct": {
        "fields": [
          {
            "field_name": "street",
            "field_type": { "type": { "primitive": "Utf8" } },
            "nullable": true
          },
          {
            "field_name": "city",
            "field_type": { "type": { "primitive": "Utf8" } },
            "nullable": false
          },
          {
            "field_name": "zip",
            "field_type": { "type": { "primitive": "Utf8" } },
            "nullable": true
          }
        ]
      }
    }
  },
  "nullable": true
}

list (Arrays)

For arrays, use the list type with an element definition:

{
  "field_name": "tags",
  "field_type": {
    "type": {
      "list": {
        "field_name": "item",
        "field_type": { "type": { "primitive": "Utf8" } },
        "nullable": false
      }
    }
  },
  "nullable": true
}

Complete Examples

Simple JSON Schema

{
  "format": { "json": {} },
  "fields": [
    {
      "field_name": "event_id",
      "field_type": { "type": { "primitive": "Utf8" } },
      "nullable": false
    },
    {
      "field_name": "user_id",
      "field_type": { "type": { "primitive": "Int64" } },
      "nullable": false
    },
    {
      "field_name": "event_type",
      "field_type": { "type": { "primitive": "Utf8" } },
      "nullable": false
    },
    {
      "field_name": "amount",
      "field_type": { "type": { "primitive": "F64" } },
      "nullable": true
    },
    {
      "field_name": "timestamp",
      "field_type": { "type": { "primitive": "DateTime" } },
      "nullable": false
    }
  ]
}

Schema with Nested Struct

{
  "format": { "json": {} },
  "fields": [
    {
      "field_name": "order_id",
      "field_type": { "type": { "primitive": "Utf8" } },
      "nullable": false
    },
    {
      "field_name": "customer",
      "field_type": {
        "type": {
          "struct": {
            "fields": [
              {
                "field_name": "id",
                "field_type": { "type": { "primitive": "Int64" } },
                "nullable": false
              },
              {
                "field_name": "name",
                "field_type": { "type": { "primitive": "Utf8" } },
                "nullable": false
              },
              {
                "field_name": "email",
                "field_type": { "type": { "primitive": "Utf8" } },
                "nullable": true
              }
            ]
          }
        }
      },
      "nullable": false
    },
    {
      "field_name": "items",
      "field_type": {
        "type": {
          "list": {
            "field_name": "item",
            "field_type": {
              "type": {
                "struct": {
                  "fields": [
                    {
                      "field_name": "sku",
                      "field_type": { "type": { "primitive": "Utf8" } },
                      "nullable": false
                    },
                    {
                      "field_name": "quantity",
                      "field_type": { "type": { "primitive": "Int32" } },
                      "nullable": false
                    }
                  ]
                }
              }
            },
            "nullable": false
          }
        }
      },
      "nullable": false
    }
  ]
}

Parquet Sink Schema

{
  "format": {
    "parquet": {
      "compression": "snappy",
      "rowGroupBytes": 134217728
    }
  },
  "fields": [
    {
      "field_name": "id",
      "field_type": { "type": { "primitive": "Utf8" } },
      "nullable": false
    },
    {
      "field_name": "value",
      "field_type": { "type": { "primitive": "F64" } },
      "nullable": false
    },
    {
      "field_name": "event_date",
      "field_type": { "type": { "primitive": "Date32" } },
      "nullable": false
    }
  ]
}

bad_data

Control how malformed records are handled:

ModeDescription
FailPipeline fails on bad records (default)
DropSkip bad records and continue processing
{
  "format": { "json": {} },
  "bad_data": { "Drop": {} },
  "fields": [...]
}