This repository was archived by the owner on Sep 11, 2024. It is now read-only.

Commit bfcf680

docs: refactor record grouping and file name format
1 parent 1a4ea1e commit bfcf680

1 file changed: +12 -1 lines changed

README.md

Lines changed: 12 additions & 1 deletion
@@ -116,9 +116,19 @@ enabled).
 ### Record grouping
 
 Incoming records are grouped until they are flushed.
+The connector flushes grouped records to one file per `offset.flush.interval.ms` interval for partitions that have received new messages during that interval. The setting defaults to 60 seconds.
+
+Record grouping, like Kafka topics, has two modes:
+
+- Changelog: the connector groups all records in the order received from a Kafka topic and stores all of them in a file.
+- Compact: the connector groups records by an identity (e.g. the key) and keeps only the latest value for each, stored in a file.
+
+The mode is defined implicitly by the fields used in the [file name template](#file-name-format).
 
 #### Grouping by the topic and partition
 
+*Mode: Changelog*
+
 In this mode, the connector groups records by the topic and partition.
 When a file is written, the offset of the first record in it is added to
 its name.
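
As a rough illustration of the two modes this hunk documents, the Python sketch below groups one flush interval's worth of records the way the README text describes. It is not the connector's implementation: the record tuple layout, the function names `group_changelog` and `group_compact`, and the use of the key as the compact-mode file name are assumptions made for the example. The `<topic>-<partition>-<startoffset>` naming follows the S3 object name pattern quoted later in this diff.

```python
# Hypothetical record shape: (topic, partition, offset, key, value).

def group_changelog(records):
    """Changelog mode: group by (topic, partition), keeping arrival order."""
    groups = {}
    for topic, partition, offset, _key, value in records:
        groups.setdefault((topic, partition), []).append((offset, value))
    # Each file is named after the offset of the first record in its group.
    return {
        f"{topic}-{partition}-{batch[0][0]}": [value for _, value in batch]
        for (topic, partition), batch in groups.items()
    }


def group_compact(records):
    """Compact mode: keep only the latest value seen for each key."""
    latest = {}
    for _topic, _partition, _offset, key, value in records:
        latest[key] = value  # a later record replaces the earlier one
    # Assumption for this sketch: the file name is simply the key.
    return latest


if __name__ == "__main__":
    batch = [
        ("topicA", 0, 0, "k0", "v0"),
        ("topicA", 0, 1, "k1", "v1"),
        ("topicB", 0, 0, "k0", "v2"),
        ("topicA", 0, 2, "k0", "v3"),
    ]
    print(group_changelog(batch))
    print(group_compact(batch))
```

In the changelog sketch every record survives and the file name carries the starting offset. In the compact sketch the number of files is bounded by the number of distinct keys, which is why a later flush can overwrite an earlier file.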
@@ -153,6 +163,8 @@ In this case, there will be two files `topicA-part0-off0` and
 
 #### Grouping by the key
 
+*Mode: Compact*
+
 In this mode, the connector groups records by the Kafka key. It always
 puts one record in a file, the latest record that arrived before a flush
 for each key. Also, it overwrites files if later new records with the
@@ -223,7 +235,6 @@ Connector class name, in this case: `io.aiven.kafka.connect.s3.AivenKafkaConnect
 ### S3 Object Names
 
 The S3 connector stores a series of files in the specified bucket. Each object is named using the pattern `[<aws.s3.prefix>]<topic>-<partition>-<startoffset>[.gz]`. The `.gz` extension is added when gzip compression is enabled; see `file.compression.type` below.
-The connector creates one file per Apache Kafka Connect `offset.flush.interval.ms` setting for partitions that have received new messages during that period. The setting defaults to 60 seconds.
 
 ### Data File Format
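
The object naming pattern quoted in the S3 Object Names hunk, `[<aws.s3.prefix>]<topic>-<partition>-<startoffset>[.gz]`, can be sketched in Python as follows. The helper name `s3_object_name` and its parameters are hypothetical, and the sketch assumes the offset is written without padding, which the text in this diff does not specify.

```python
def s3_object_name(topic, partition, start_offset, prefix="", gzip=False):
    """Build an object name following the pattern quoted in the README.

    `prefix` stands in for the optional `aws.s3.prefix` value; `gzip`
    stands in for gzip compression being selected via
    `file.compression.type`.
    """
    name = f"{prefix}{topic}-{partition}-{start_offset}"
    # The `.gz` suffix is appended only when gzip compression is in use.
    return name + ".gz" if gzip else name


if __name__ == "__main__":
    print(s3_object_name("topicA", 0, 0))
    print(s3_object_name("topicA", 3, 1200, prefix="logs/", gzip=True))
```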
