diff --git a/docs/connect-data-in.md b/docs/connect-data-in.md index 3d55ac06..6bcfa901 100644 --- a/docs/connect-data-in.md +++ b/docs/connect-data-in.md @@ -11,6 +11,9 @@ Timeplus supports multiple ways to load data into the system, or access the exte - On top of the REST API and SDKs, Timeplus Enterprise adds integrations with [Kafka Connect](/kafka-connect), [AirByte](https://airbyte.com/connectors/timeplus), [Sling](/sling), and seatunnel. - Last but not the least, if you are not ready to load your real data into Timeplus, or just want to play with the system, you can use the web console to [create sample streaming data](#streamgen), or use SQL to create random streams. +When working with external data sources, you can use various [data formats](/data-formats) including JSON, CSV, Protobuf, and Avro. See the [Data Formats](/data-formats) page for comprehensive information. + + ## Add new data via web console Choose "Data Collection" from the navigation menu to setup data access to other systems. There are two categories: diff --git a/docs/data-formats.md b/docs/data-formats.md new file mode 100644 index 00000000..54a633b9 --- /dev/null +++ b/docs/data-formats.md @@ -0,0 +1,663 @@ +--- +title: Data Formats +--- + +# Data Formats + +Timeplus supports multiple data formats for reading from and writing to external systems like Apache Kafka, Apache Pulsar, Apache Iceberg, S3, NATS JetStream, and others. + +This page provides comprehensive guidance on all supported data formats. 
+ +## Supported Data Formats {#formats} + +The following data formats are supported across various sources and sinks: + +| Format | Description | Use Case | +| ---------------- | ---------------------------------------- | -------- | +| `RawBLOB` | Raw text, no parsing (default) | Plain text or binary data | +| `JSONEachRow` | Parses one JSON document per line | Most common format for JSON data | +| `CSV` | Parses comma-separated values | Legacy systems, data exports | +| `TSV` | Tab-separated values | Like CSV, but tab-delimited | +| `ProtobufSingle` | One Protobuf message per message | Protobuf-encoded streaming data | +| `Protobuf` | Multiple Protobuf messages per message | Protobuf-encoded batches | +| `Avro` | Avro-encoded messages | Schema-first data serialization | + + +## RawBLOB + +Reads or writes each row as raw text. Only one column can be defined when using `RawBLOB`. + +**Example** + +The value of the external stream's `raw` column is the same as the Kafka message data. + +```sql +CREATE EXTERNAL STREAM my_stream( + raw string +) +SETTINGS type='kafka', + brokers='localhost:9092', + topic='events', + data_format='RawBLOB' +``` + +## JSONEachRow + +Each line (or message) contains a single JSON string with key/value pairs mapping to columns. + +**Example** + +Each Kafka message is plain text in JSON Lines format. For example: +```json +{"user_id": 123, "action": "click", "timestamp": "2024-01-15T10:30:00Z"} +``` + +The Kafka external stream is defined with columns: `user_id`, `action` and `timestamp`. +```sql +CREATE EXTERNAL STREAM my_stream( + user_id int, + action string, + timestamp datetime64(3) +) +SETTINGS type='kafka', + brokers='localhost:9092', + topic='events', + data_format='JSONEachRow', + one_message_per_row=true +``` + +When reading the external stream, each fetched Kafka message is parsed as JSON. Each column takes its value from the JSON key with the same name. If a column name is not found among the keys of the parsed JSON object, the column is filled with its default value. 
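The column-filling rule for `JSONEachRow` described above can be modeled with a short Python sketch. This is an illustration only, not Timeplus code: `COLUMN_DEFAULTS` and `parse_json_row` are hypothetical names, the default values are assumptions chosen for the sketch, and ignoring undeclared keys is likewise an assumption.

```python
import json

# Hypothetical per-column defaults for this sketch; in a real stream the
# default depends on each column's declared type.
COLUMN_DEFAULTS = {"user_id": 0, "action": "", "timestamp": None}


def parse_json_row(message: str) -> dict:
    """Map one JSON message to a row, filling missing columns with defaults."""
    doc = json.loads(message)
    # Assumption of this sketch: keys that are not declared as columns are ignored.
    return {col: doc.get(col, default) for col, default in COLUMN_DEFAULTS.items()}


# A message without a "timestamp" key still yields a complete row.
row = parse_json_row('{"user_id": 123, "action": "click"}')
print(row)
```

The dictionary comprehension iterates over the declared columns rather than the JSON keys, which is what makes a missing key fall back to the column default.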
+ + +**Related Settings** +- `one_message_per_row=true`: Ensures each message contains exactly one JSON document, which is especially useful when writing data. When set to `false`, a single message may contain multiple JSON documents. + +## CSV and TSV + +Use CSV or TSV for processing comma-separated or tab-separated data. + +**Example CSV:** +```csv +123,click,2024-01-15T10:30:00Z +124,view,2024-01-15T10:31:00Z +``` + +**Configuration:** +```sql +CREATE EXTERNAL STREAM csv_stream( + user_id int, + action string, + timestamp datetime64(3) +) +SETTINGS type='kafka', + brokers='localhost:9092', + topic='csv_events', + data_format='CSV' +``` + +## Protobuf {#protobuf} + +Timeplus supports reading and writing [Protobuf](https://protobuf.dev/) formatted messages. You can use Protobuf with or without a [Schema Registry](/kafka-schema-registry). + +There are two data formats for Protobuf: `ProtobufSingle` and `Protobuf`. They are encoded differently and cannot be used interchangeably. + +**ProtobufSingle** Each message contains exactly one Protobuf message. This is the most commonly used format. + +**Protobuf** Each message may contain one or more Protobuf messages. The length of each Protobuf message is prepended to its payload so that it can be decoded. + +### Create + +When not using a Schema Registry, you need to define the Protobuf schema using SQL. + +1. Create Protobuf Schema + +```sql +CREATE OR REPLACE FORMAT SCHEMA schema_name AS ' + syntax = "proto3"; + + message SearchRequest { + string query = 1; + int32 page_number = 2; + int32 results_per_page = 3; + } +' TYPE Protobuf +``` + +2. 
Create External Stream + +Then refer to this schema while creating an external stream: + +```sql +CREATE EXTERNAL STREAM stream_name( + query string, + page_number int32, + results_per_page int32) +SETTINGS type='kafka', + brokers='localhost:9092', + topic='topic_name', + data_format='ProtobufSingle', + format_schema='schema_name:SearchRequest' +``` + +The `format_schema` setting contains two parts: the registered schema name (in this example: `schema_name`), and the message type (in this example: `SearchRequest`). Combine them with a colon. + +### Column Inference + +If you omit the column definitions entirely when creating an external stream with a Protobuf or Avro `data_format`, the column names and types are inferred automatically from the format schema and added to the stream. + +For example, the `query`, `page_number` and `results_per_page` columns are created automatically by the SQL below. + +```sql +CREATE EXTERNAL STREAM stream_name +SETTINGS type='kafka', + brokers='localhost:19092', + topic='topic_name', + data_format='ProtobufSingle', + format_schema='schema_name:SearchRequest' +``` + +### Examples For Complex Protobuf Schema {#protobuf_complex} + +#### Nested Schema + +```sql +CREATE FORMAT SCHEMA simple_nested AS ' + syntax = "proto3"; + + message Name { + string first = 1; + string last = 2; + } + + message Person { + string email = 1; + Name name = 2; + int32 age = 3; + map<string, int32> skills = 4; + } +' TYPE Protobuf +``` + +```sql +CREATE EXTERNAL STREAM people( + email string, + name_first string, + name.last string, + skills map(string, int32), + age int32 +) +SETTINGS type='kafka', + brokers='localhost:9092', + topic='people', + data_format='ProtobufSingle', + format_schema='simple_nested:Person' +``` + +**Notes** +1. `Person` is the top-level message type. It refers to the `Name` message type. +2. Use `name` as the prefix for the column names. Use either `_` or `.` to connect the prefix with the nested field names. +3. You don't have to define all possible columns. 
Only the columns you defined will be read. Other columns/fields are skipped. + +#### Enum + +If your Protobuf definition includes an enum type: + +```protobuf +enum Level { + LevelOne = 0; + LevelTwo = 1; +} +``` + +You can use the enum type in Timeplus: + +```sql +CREATE EXTERNAL STREAM ..( + .. + level enum8('LevelOne'=0,'LevelTwo'=1), + .. +) +``` + +#### Repeated (Arrays) + +If your Protobuf definition has a repeated field: + +```protobuf +repeated string Status +``` + +Use the array type in Timeplus: + +```sql +CREATE EXTERNAL STREAM ..( + .. + status array(string), + .. +) +``` + +#### Repeated and Nested {#repeat_nested} + +For fields that are both custom types and repeated: + +```protobuf +syntax = "proto3"; +message DataComponent { + optional string message = 1; + message Params { + message KeyValue { + optional string name = 1; + optional string value = 2; + } + repeated KeyValue Param = 1; + } + optional Params params = 2; +} +``` + +Use the tuple type in Timeplus: + +```sql +CREATE EXTERNAL STREAM ..( + message string, + params tuple(Param nested( name string, value string )) +) +``` + +The streaming data will be shown as: + +| message | params | +| ------- | --------------------------------------------------------------- | +| No. 0 | ([('key_1','value_1'),('key_2','value_2'),('key_3','value_3')]) | + +#### Package + +If your Protobuf definition includes a package: + +```protobuf +package demo; +message StockRecord { +.. +} +``` + +If there is only 1 package in the Protobuf definition, you don't have to include the package name: + +```sql +CREATE EXTERNAL STREAM ..( + .. +) +SETTINGS .. format_schema="schema_name:StockRecord" +``` + +If there are multiple packages, you can use the fully qualified name: + +```sql +CREATE EXTERNAL STREAM ..( + .. +) +SETTINGS .. 
format_schema="schema_name:demo.StockRecord" +``` + +#### Import Schemas + +If you have created a format schema, you can create another schema and import it: + +```sql +CREATE FORMAT SCHEMA import_example AS ' + import "schema_name.proto"; + message Test { + required string ID = 1; + optional Level TheLevel = 2; + } +' TYPE Protobuf +``` + +Make sure to add `.proto` as the suffix. + +## Avro {#avro} + +Timeplus supports reading and writing [Avro](https://avro.apache.org) formatted messages. Available since Timeplus Proton 1.5.10. You can use Avro with or without a [Schema Registry](/kafka-schema-registry). + +### Create + +When not using a Schema Registry, you need to define the Avro schema using SQL. + +1. Create Avro Schema + +```sql +CREATE OR REPLACE FORMAT SCHEMA avro_schema AS '{ + "namespace": "example.avro", + "type": "record", + "name": "User", + "fields": [ + {"name": "name", "type": "string"}, + {"name": "favorite_number", "type": ["int", "null"]}, + {"name": "favorite_color", "type": ["string", "null"]} + ] +} +' TYPE Avro; +``` + +2. 
Create External Stream + +Then refer to this schema while creating an external stream: + +```sql +CREATE EXTERNAL STREAM stream_avro( + name string, + favorite_number nullable(int32), + favorite_color nullable(string)) +SETTINGS type='kafka', + brokers='localhost:9092', + topic='topic_name', + data_format='Avro', + format_schema='avro_schema' +``` + +Then you can write data to the topic: + +```sql +INSERT INTO stream_avro(name,favorite_number,favorite_color) VALUES('test',1,'red') +``` + +### Avro Data Types Mapping {#avro_types} + +#### Avro Primitive Types + +The table below shows supported Avro primitive data types and how they match Timeplus data types: + +|Avro data type|Timeplus data type| +|---|---| +|int|int8,int16,int32,uint8,uint16,uint32| +|long|int64,uint64| +|float|float32| +|double|float64| +|bytes,string|string| +|fixed(N)|fixed_string(N)| +|enum|enum8,enum16| +|array(T)|array(T)| +|map(k,v)|map(k,v)| +|union(null,T)|nullable(T)| +|null|nullable(nothing)| +|int(date)|date,date32| +|long (timestamp-millis)|datetime64(3)| +|long (timestamp-micros)|datetime64(6)| +|string (uuid) | uuid| +|record|tuple| + +#### Avro Logical Types + +If you use `logicalType` in your Avro schema, Timeplus will automatically map it to the corresponding Timeplus data type: + +- UUID: maps to `uuid`. +- Date: maps to `date`. +- Timestamp (millisecond precision): maps to `datetime64(3)`. +- Timestamp (microsecond precision): maps to `datetime64(6)`. + +Other logical types are not implemented yet. 
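The primitive and logical type mappings listed above can be expressed as a small lookup, sketched here in Python. This is an illustration under stated assumptions rather than Timeplus's implementation: `timeplus_type` is a hypothetical helper, the sketch picks one representative Timeplus type per Avro primitive (the table allows several integer widths), and raising an error for other logical types is a choice made for the sketch.

```python
# Logical-type mappings taken from the list above.
LOGICAL_TYPES = {
    "uuid": "uuid",
    "date": "date",
    "timestamp-millis": "datetime64(3)",
    "timestamp-micros": "datetime64(6)",
}

# One representative Timeplus type per Avro primitive (an assumption of this
# sketch; the table above permits other integer widths as well).
PRIMITIVE_TYPES = {
    "int": "int32",
    "long": "int64",
    "float": "float32",
    "double": "float64",
    "string": "string",
    "bytes": "string",
}


def timeplus_type(avro_type) -> str:
    """Return a Timeplus column type for a primitive or logical Avro type."""
    if isinstance(avro_type, dict):
        logical = avro_type.get("logicalType")
        if logical is not None:
            if logical not in LOGICAL_TYPES:
                # Sketch-only behavior for logical types that are not listed.
                raise ValueError(f"unsupported logical type: {logical}")
            return LOGICAL_TYPES[logical]
        avro_type = avro_type["type"]
    return PRIMITIVE_TYPES[avro_type]


print(timeplus_type({"type": "long", "logicalType": "timestamp-millis"}))
print(timeplus_type("double"))
```

Checking `logicalType` before the underlying primitive matters because a `long` with `timestamp-millis` should become `datetime64(3)` rather than `int64`.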
+ +**Example:** + +Given the following Avro schema: + +```json +{ + "type": "record", + "name": "schema", + "fields": [ + { + "name": "time", + "type": { "type": "long", "logicalType": "timestamp-millis" } + }, + { "name": "key", "type": "string" }, + { "name": "value", "type": "double" } + ] +} +``` + +The external stream would be: + +```sql +CREATE EXTERNAL STREAM avro ( + time datetime64(3), + key string, + value float64 +) SETTINGS ...; +``` + +#### Record + +There are two ways to map a `record`. The simple one is using `tuple`: + +Given an Avro schema: +```json +{ + "name": "Root", + "type": "record", + "fields": [{ + "name": "a_record_field", + "type": { + "name": "a_record_field", + "type": "record", + "fields": [ + {"name": "one", "type": "string"}, + {"name": "two", "type": "int"} + ] + } + }] +} +``` + +The external stream uses tuple: + +```sql +CREATE EXTERNAL STREAM avro ( + a_record_field tuple(one string, two int32) +) SETTINGS ...; +``` + +The other way is flattening the fields: + +```sql +CREATE EXTERNAL STREAM avro ( + `a_record_field.one` string, + `a_record_field.two` int32 +) SETTINGS ...; +``` + +The column name for each field will be the record field name followed by a dot (`.`), and the field name. + +#### Array of Record + +To map an array of records, you can use either `array(tuple(...))` or `nested()` (they are the same). By default, Timeplus will flatten the columns. 
+ +Given an Avro schema: +```json +{ + "name": "Root", + "type": "record", + "fields": [{ + "name": "an_array_of_records", + "type": { + "type": "array", + "items": { + "name": "record_inside_an_array", + "type": "record", + "fields": [ + {"name": "one", "type": "string"}, + {"name": "two", "type": "int"} + ] + } + } + }] +} +``` + +Create a stream like this: +```sql +CREATE EXTERNAL STREAM avro ( + an_array_of_records array(tuple(one string, two int32)) +) SETTINGS ...; +``` + +This will become: +```sql +CREATE EXTERNAL STREAM avro ( + `an_array_of_records.one` array(string), + `an_array_of_records.two` array(int32) +) SETTINGS ...; +``` + +The Avro output format handles this properly. + +You can use `SET flatten_nested = 0` to disable the flatten behavior. + +#### Union + +Since Timeplus does not support native union types, there is no perfect way to handle Avro unions. One stream can only handle one of the union elements (except for `null`). If you need to generate values for different element types, you will need to create multiple streams. + +**Example** + +Given an Avro schema: +```json +{ + "name": "Root", + "type": "record", + "fields": [{ + "name": "int_or_string", + "type": ["int", "string"] + }] +} +``` + +When creating the stream, you can only map the `int_or_string` field to either int or string: + +```sql +CREATE EXTERNAL STREAM avro ( + int_or_string int32 +) SETTINGS ...; +``` + +This stream can only write `int` values. For string values, create another stream: + +```sql +CREATE EXTERNAL STREAM avro ( + int_or_string string +) SETTINGS ...; +``` + +You can also use the flatten naming convention: + +```sql +-- using the `int` element +CREATE EXTERNAL STREAM avro ( + `int_or_string.int` int32 +) SETTINGS ...; + +-- using the `string` element +CREATE EXTERNAL STREAM avro ( + `int_or_string.string` string +) SETTINGS ...; +``` + +For named types (record, fixed, and enum), use the name property instead of the type name. 
For example: + +```json +{ + "name": "Root", + "type": "record", + "fields": [{ + "name": "int_or_record", + "type": ["int", { + "name": "details", + "type": "record", + "fields": [...] + }] + }] +} +``` + +To map to the record element: + +```sql +CREATE EXTERNAL STREAM avro ( + `int_or_record.details` tuple(...) -- use the name "details", not "record" +) SETTINGS ...; +``` + +**Note:** The Avro input format only supports the flatten naming convention. If you create a stream using: + +```sql +CREATE EXTERNAL STREAM avro ( + int_or_string int32 +) SETTINGS ...; +``` + +Then `SELECT * FROM avro` won't work. + +#### Nullable + +There is a special case for union: when the union has two elements and one of them is `null`, this union field will be mapped to a nullable column. + +**Example:** + +Avro schema: +```json +{ + "name": "Root", + "type": "record", + "fields": [{ + "name": "null_or_int", + "type": ["null", "int"] + }] +} +``` + +Stream: +```sql +CREATE EXTERNAL STREAM avro ( + null_or_int nullable(int32) +) SETTINGS ...; +``` + +However, in Timeplus, `nullable` cannot be applied to all types. For instance, `nullable(tuple(...))` is invalid. If a field in the Avro schema is `"type": ["null", {"type": "record"}]`, you can only map it to a `tuple`, and it can't be `null`. + +## Managing Format Schemas {#manage-schemas} + +When working with custom Protobuf or Avro schemas (without Schema Registry), you can manage schemas using SQL commands. + +### List Schemas + +List all schemas in the current Timeplus deployment: + +```sql +SHOW FORMAT SCHEMAS +``` + +### Show Details For A Schema + +```sql +SHOW CREATE FORMAT SCHEMA schema_name +``` + +### Drop A Schema + +```sql +DROP FORMAT SCHEMA schema_name; +``` + +## Advanced Settings {#advanced} + +### input_format_ignore_parsing_errors + +When reading an external stream, errors may occur while parsing the raw data in the specified data format. By default, an exception is thrown and the query is terminated. 
Set this setting to `true` to ignore errors that occur while parsing input data (the errors are logged). + +### max_insert_block_size + +For data formats that write multiple rows into a single message (such as `JSONEachRow` or `CSV`), this setting controls the maximum number of rows that can be written into one message. + +### max_insert_block_bytes + +For data formats that write multiple rows into a single message, this setting controls the maximum size (in bytes) that one message can be. diff --git a/docs/kafka-schema-registry.md b/docs/kafka-schema-registry.md index c581ccbc..0e98264b 100644 --- a/docs/kafka-schema-registry.md +++ b/docs/kafka-schema-registry.md @@ -1,5 +1,7 @@ # Kafka Schema Registry +This page covers using Kafka Schema Registry with Timeplus. For general information about Protobuf and Avro data formats (including custom schemas without Schema Registry), see the [Data Formats](/data-formats) page. + ## Read Messages in Protobuf or Avro Schema {#read} To consume Kafka data using **Avro** or **Protobuf** via a Schema Registry, create an external stream using the `kafka_schema_registry_url` and associated settings. @@ -85,4 +87,4 @@ INSERT INTO my_ex_stream SETTINGS force_refresh_schema=true ... ``` ::: -For the data type mappings between Avro and Timeplus data type, please check [this doc](/timeplus-format-schema#avro_types). +For the data type mappings between Avro and Timeplus, see the [Data Formats page](/data-formats#avro_types). diff --git a/docs/send-data-out.md b/docs/send-data-out.md index b692b195..56b21d77 100644 --- a/docs/send-data-out.md +++ b/docs/send-data-out.md @@ -12,6 +12,8 @@ Timeplus supports various systems as the downstreams: * [Notify others via Slack](#slack) * [Send data to other systems via Redpanda Connect](#rpconnect) +When sending data to external systems, you can use various [data formats](/data-formats) including JSON, CSV, Protobuf, and Avro. 
See the [Data Formats](/data-formats) page for comprehensive information. + ## Send data to Kafka{#kafka} You can leverage Timeplus for various streaming analysis, such as diff --git a/docs/shared/kafka-external-stream-read.md b/docs/shared/kafka-external-stream-read.md index efd0c27a..c398fc28 100644 --- a/docs/shared/kafka-external-stream-read.md +++ b/docs/shared/kafka-external-stream-read.md @@ -1,6 +1,6 @@ ## Read Data from Kafka -Timeplus allows reading Kafka messages in multiple data formats, including: +Timeplus allows reading Kafka messages in multiple [data formats](/data-formats), including: * Plain string (raw) * CSV / TSV @@ -136,7 +136,7 @@ SETTINGS data_format='TSV'; ### Read Avro or Protobuf Messages -To read Avro-encoded / Protobuf-encoded Kafka message, please refer to [Schema](/timeplus-format-schema) and [Schema Registry](/kafka-schema-registry) for details. +To read Avro-encoded or Protobuf-encoded Kafka messages, please refer to [Avro Schema](/data-formats#avro), [Protobuf Schema](/data-formats#protobuf) and [Schema Registry](/kafka-schema-registry) for details. ### Access Kafka Message Metadata diff --git a/docs/shared/kafka-external-stream-write.md b/docs/shared/kafka-external-stream-write.md index b01206bd..7615faba 100644 --- a/docs/shared/kafka-external-stream-write.md +++ b/docs/shared/kafka-external-stream-write.md @@ -99,11 +99,11 @@ Same as CSV, but uses **tab characters** as delimiters instead of commas. ### Write as ProtobufSingle -To write Protobuf-encoded messages from Kafka topics, please refer to [Protobuf Schema](/timeplus-format-schema), and [Kafka Schema Registry](/kafka-schema-registry) pages for details. +To write Protobuf-encoded messages to Kafka topics, please refer to the [Protobuf Schema](/data-formats#protobuf) and [Kafka Schema Registry](/kafka-schema-registry) pages for details. 
### Write as Avro -To write Avro-encoded messages from Kafka topics, please refer to [Avro Schema](/timeplus-format-schema), and [Kafka Schema Registry](/kafka-schema-registry) pages for details. +To write Avro-encoded messages to Kafka topics, please refer to the [Avro Schema](/data-formats#avro) and [Kafka Schema Registry](/kafka-schema-registry) pages for details. ### Write Kafka Message Metadata diff --git a/docs/shared/kafka-external-stream.md b/docs/shared/kafka-external-stream.md index bf2f3ab9..3033831b 100644 --- a/docs/shared/kafka-external-stream.md +++ b/docs/shared/kafka-external-stream.md @@ -128,6 +128,8 @@ Defines how Kafka messages are parsed and written. Supported formats are | `Avro` | Avro-encoded messages | | `RawBLOB` | Raw text, no parsing (default) | +For detailed information on each format, including type mappings, examples, and usage with Protobuf and Avro, see the [Data Formats](/data-formats) page. + #### format_schema Required for these data formats: diff --git a/docs/shared/nats-jetstream-external-stream-read.md b/docs/shared/nats-jetstream-external-stream-read.md index b77eb5bc..1f34ab85 100644 --- a/docs/shared/nats-jetstream-external-stream-read.md +++ b/docs/shared/nats-jetstream-external-stream-read.md @@ -138,7 +138,7 @@ SETTINGS data_format='TSV'; ### Read Avro or Protobuf Messages -To read Avro-encoded or Protobuf-encoded NATS messages, please refer to [Schema](/timeplus-format-schema) documentation. +To read Avro-encoded or Protobuf-encoded NATS messages, please refer to the [Avro Schema](/data-formats#avro) and [Protobuf Schema](/data-formats#protobuf) documentation. ### Access NATS Message Metadata diff --git a/docs/shared/nats-jetstream-external-stream-write.md b/docs/shared/nats-jetstream-external-stream-write.md index 4873a758..8cdf35b2 100644 --- a/docs/shared/nats-jetstream-external-stream-write.md +++ b/docs/shared/nats-jetstream-external-stream-write.md @@ -65,7 +65,7 @@ Each row is encoded as one CSV/TSV line. 
### Write as Protobuf / Avro -To write Protobuf-encoded or Avro-encoded messages, please refer to [Schema](/timeplus-format-schema) documentation. +To write Protobuf-encoded or Avro-encoded messages, please refer to [Protobuf Schema](/data-formats#protobuf) and [Avro Schema](/data-formats#avro) documentation. ### Specify Subject with `_nats_subject` diff --git a/docs/shared/nats-jetstream-external-stream.md b/docs/shared/nats-jetstream-external-stream.md index 23f908f8..22df54b0 100644 --- a/docs/shared/nats-jetstream-external-stream.md +++ b/docs/shared/nats-jetstream-external-stream.md @@ -115,6 +115,8 @@ Common formats include: | `Protobuf` | Multiple Protobuf messages per NATS message | | `Avro` | Avro-encoded messages | +For detailed information on each format, including type mappings, examples, and usage with Protobuf and Avro, see the [Data Formats](/data-formats) page. + #### format_schema Required for `ProtobufSingle`, `Protobuf`, and `Avro` formats. Defines the schema for message serialization. diff --git a/docs/shared/pulsar-external-stream-write.md b/docs/shared/pulsar-external-stream-write.md index 801519e1..35aa887f 100644 --- a/docs/shared/pulsar-external-stream-write.md +++ b/docs/shared/pulsar-external-stream-write.md @@ -128,7 +128,7 @@ Then you can run `INSERT INTO` or use a materialized view to write data to the t INSERT INTO stream_name(query,page_number,results_per_page) VALUES('test',1,100) ``` -Please refer to [Protobuf/Avro Schema](/timeplus-format-schema) for more details. +Please refer to [Protobuf/Avro Schema](/data-formats) for more details. #### Avro @@ -166,7 +166,7 @@ Then you can run `INSERT INTO` or use a materialized view to write data to the t INSERT INTO stream_avro(name,favorite_number,favorite_color) VALUES('test',1,'red') ``` -Please refer to [Protobuf/Avro Schema](/timeplus-format-schema) for more details. +Please refer to [Protobuf/Avro Schema](/data-formats) for more details. 
### Continuously Write to Pulsar via MV diff --git a/docs/shared/pulsar-external-stream.md b/docs/shared/pulsar-external-stream.md index 8bbdc9a4..dd22d812 100644 --- a/docs/shared/pulsar-external-stream.md +++ b/docs/shared/pulsar-external-stream.md @@ -106,7 +106,7 @@ Like [Kafka External Stream](/kafka-source), Pulsar External Stream also support #### data_format The supported values for `data_format` are: -- JSONEachRow: parse each row of the message as a single JSON document. The top level JSON key/value pairs will be parsed as the columns. +- JSONEachRow: parse each row of the message as a single JSON document. The top level JSON key/value pairs will be parsed as the columns. - CSV: less commonly used. - TSV: similar to CSV but tab as the separator - ProtobufSingle: for single Protobuf message per message @@ -114,6 +114,8 @@ The supported values for `data_format` are: - Avro - RawBLOB: the default value. Read/write message as plain text. +For detailed information on each format, including type mappings, examples, and usage with Protobuf and Avro, see the [Data Formats](/data-formats) page. + For data formats which write multiple rows into one single message (such as `JSONEachRow` or `CSV`), two more advanced settings are available: #### max_insert_block_size diff --git a/docs/shared/s3-external-table.md b/docs/shared/s3-external-table.md index 46fdbc04..ea5dece9 100644 --- a/docs/shared/s3-external-table.md +++ b/docs/shared/s3-external-table.md @@ -164,6 +164,8 @@ The supported values for `data_format` are: - Avro - RawBLOB: the default value. Read/write message as plain text. +For detailed information on each format, including type mappings, examples, and usage with Protobuf and Avro, see the [Data Formats](/data-formats) page. 
+ For data formats which write multiple rows into one single message (such as `JSONEachRow` or `CSV`), two more advanced settings are available: #### read_from diff --git a/docs/sql-create-format-schema.md b/docs/sql-create-format-schema.md index b29c21e4..5c0d488c 100644 --- a/docs/sql-create-format-schema.md +++ b/docs/sql-create-format-schema.md @@ -39,7 +39,7 @@ Please note: 1. If you want to ensure there is only a single Protobuf message per Kafka message, please set `data_format` to `ProtobufSingle`. If you set it to `Protobuf`, then there could be multiple Protobuf messages in a single Kafka message. 2. The `format_schema` setting contains two parts: the registered schema name (in this example: schema_name), and the message type (in this example: SearchRequest). Combining them together with a semicolon. 3. You can use this external stream to read or write Protobuf messages in the target Kafka/Confluent topics. -4. For more advanced use cases, please check the [examples for complex schema](/timeplus-format-schema#protobuf_complex). +4. For more advanced use cases, please check the [Data Formats page](/data-formats#protobuf_complex). ## Avro Available since Proton 1.5.10. diff --git a/docs/timeplus-format-schema.md b/docs/timeplus-format-schema.md index 259e33ed..1fadbd3e 100644 --- a/docs/timeplus-format-schema.md +++ b/docs/timeplus-format-schema.md @@ -1,6 +1,6 @@ # Protobuf / Avro Schema -Timeplus supports reading or writing messages in [Protobuf](https://protobuf.dev/) or [Avro](https://avro.apache.org) format for [Kafka External Stream](/kafka-source) or [Pulsar External Stream](/pulsar-source). This document covers how to process data without a Schema Registry. Check [this page](/kafka-schema-registry) if your Kafka topics are associated with a Schema Registry. +Timeplus supports reading or writing messages in [Protobuf](https://protobuf.dev/) or [Avro](https://avro.apache.org) format for external streams. 
This document covers how to process data without a Schema Registry. Check [this page](/kafka-schema-registry) if your Kafka topics are associated with a Schema Registry. ## Create Schema {#create} @@ -46,7 +46,7 @@ Please note: 1. If you want to ensure there is only a single Protobuf message per Kafka message, please set `data_format` to `ProtobufSingle`. If you set it to `Protobuf`, then there could be multiple Protobuf messages in a single Kafka message. 2. The `format_schema` setting contains two parts: the registered schema name (in this example: schema_name), and the message type (in this example: SearchRequest). Combining them together with a semicolon. 3. You can use this external stream to read or write Protobuf messages in the target Kafka/Confluent topics. -4. For more advanced use cases, please check the [examples for complex schema](/timeplus-format-schema#protobuf_complex). +4. For more advanced use cases, please check the [Data Formats page](/data-formats#protobuf_complex). ### Avro Available since Timeplus Proton 1.5.10. 
diff --git a/sidebars.js b/sidebars.js index 21b9f455..be0e512e 100644 --- a/sidebars.js +++ b/sidebars.js @@ -97,12 +97,12 @@ const sidebars = { id: "howtos", label: "How Tos", }, - ] + ], }, { type: "category", label: "CONNECT DATA IN", - link : { + link: { type: "doc", id: "connect-data-in", }, @@ -133,7 +133,7 @@ const sidebars = { type: "doc", id: "kafka-source", }, - items: ["kafka-schema-registry", "timeplus-format-schema"], + items: ["kafka-schema-registry"], }, { type: "doc", @@ -223,7 +223,7 @@ const sidebars = { id: "syslog-input", label: "Syslog Input", }, - ] + ], }, ], }, @@ -268,7 +268,7 @@ const sidebars = { id: "materialized-view-monitoring", label: "Monitoring", }, - ] + ], }, { type: "category", @@ -303,7 +303,7 @@ const sidebars = { id: "bidirectional-range-join", label: "Bidirectional Range Join", }, - ] + ], }, { type: "category", @@ -333,22 +333,22 @@ const sidebars = { id: "session-aggregation", label: "Session", }, - ] + ], }, { type: "doc", id: "shuffle-data", - label: "Shuffle Data" + label: "Shuffle Data", }, { type: "doc", id: "partition-data", - label: "Partition Data" + label: "Partition Data", }, { type: "doc", id: "jit", - label: "Just-In-Time Compilation" + label: "Just-In-Time Compilation", }, { type: "doc", @@ -378,7 +378,7 @@ const sidebars = { "remote-udf", ], }, - ] + ], }, { type: "category", @@ -424,7 +424,7 @@ const sidebars = { id: "append-stream-tiered-storage", label: "Tier Storage", }, - ] + ], }, { type: "doc", @@ -462,7 +462,7 @@ const sidebars = { id: "mutable-stream-ttl", label: "TTL", }, - ] + ], }, { type: "doc", @@ -473,14 +473,14 @@ const sidebars = { }, { type: "doc", - id: "dictionary", + id: "dictionary", label: "Dictionary", }, { type: "doc", id: "viz", }, - ] + ], }, { type: "category", @@ -511,9 +511,9 @@ const sidebars = { id: "iceberg-sink", }, { - type: "doc", - label: "BigQuery", - id: "bigquery-external", + type: "doc", + label: "BigQuery", + id: "bigquery-external", }, { type: "doc", @@ -521,24 
+521,24 @@ const sidebars = { id: "clickhouse-external-table", }, { - type: "doc", - label: "Databricks", - id: "databricks-external", + type: "doc", + label: "Databricks", + id: "databricks-external", }, { - type: "doc", - label: "Datadog", - id: "datadog-external", + type: "doc", + label: "Datadog", + id: "datadog-external", }, { - type: "doc", - label: "Elasticsearch", - id: "elastic-external", + type: "doc", + label: "Elasticsearch", + id: "elastic-external", }, { - type: "doc", - label: "HTTP", - id: "http-external-stream", + type: "doc", + label: "HTTP", + id: "http-external-stream", }, { type: "doc", @@ -551,9 +551,9 @@ const sidebars = { id: "s3-sink", }, { - type: "doc", - label: "Splunk", - id: "splunk-external", + type: "doc", + label: "Splunk", + id: "splunk-external", }, // { // type: "doc", @@ -571,9 +571,9 @@ const sidebars = { // id: "mongo-external-table", // }, { - type: "doc", - label: "Slack", - id: "slack-external", + type: "doc", + label: "Slack", + id: "slack-external", }, { type: "doc", @@ -584,7 +584,7 @@ const sidebars = { type: "doc", id: "alert", }, - ] + ], }, { type: "category", @@ -592,6 +592,11 @@ const sidebars = { items: [ "query-syntax", "query-settings", + { + type: "doc", + label: "Data Formats", + id: "data-formats", + }, "datatypes", { type: "category", @@ -741,6 +746,11 @@ const sidebars = { ], }, "grok", + { + type: "doc", + label: "Protobuf / Avro Schema", + id: "timeplus-format-schema", + }, ], }, { @@ -960,7 +970,14 @@ const sidebars = { { type: "category", label: "Older 2.x Releases", - items: ["enterprise-v2.8", "enterprise-v2.7", "enterprise-v2.6", "enterprise-v2.5", "enterprise-v2.4", "enterprise-v2.3"], + items: [ + "enterprise-v2.8", + "enterprise-v2.7", + "enterprise-v2.6", + "enterprise-v2.5", + "enterprise-v2.4", + "enterprise-v2.3", + ], }, // "v2-release-notes", {