@fupelaqu commented Dec 5, 2025

Add support for bulk indexing from multiple data sources:

Data Sources

| Source Type | Format | Description |
|-------------|--------|-------------|
| In-Memory | Scala objects | Direct streaming from collections |
| JSON | Text | Newline-delimited JSON (NDJSON) |
| JSON Array | Text | JSON array with nested structures |
| Parquet | Binary | Columnar storage format |
| Delta Lake | Directory | ACID transactional data lake |
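
For reference, the two text formats differ in layout: NDJSON places one complete JSON document per line, while the JSON Array format wraps all documents in a single top-level array (which is what allows nested structures to span lines). The sample records below are illustrative only:

```
// NDJSON: one self-contained document per line
{"id": "1", "name": "Keyboard"}
{"id": "2", "name": "Mouse"}

// JSON Array: a single array, documents may contain nested objects
[
  {"uuid": "a1", "name": "Alice", "address": {"city": "Paris"}},
  {"uuid": "b2", "name": "Bob", "address": {"city": "Lyon"}}
]
```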

Examples:

```scala
// High-performance file indexing
implicit val options: BulkOptions = BulkOptions(
  defaultIndex = "products",
  maxBulkSize = 10000,
  balance = 16,
  disableRefresh = true
)

implicit val hadoopConf: Configuration = new Configuration()

// Load from Parquet
client.bulkFromFile(
  filePath = "/data/products.parquet",
  format = Parquet,
  idKey = Some("id")
).foreach { result =>
  result.indices.foreach(client.refresh)
  println(s"Indexed ${result.successCount} docs at ${result.metrics.throughput} docs/sec")
}

// Load from Delta Lake
client.bulkFromFile(
  filePath = "/data/delta-products",
  format = Delta,
  idKey = Some("id"),
  update = Some(true)
).foreach { result =>
  println(s"Updated ${result.successCount} products from Delta Lake")
}

// Load JSON Array with nested objects
client.bulkFromFile(
  filePath = "/data/persons.json",
  format = JsonArray,
  idKey = Some("uuid")
).foreach { result =>
  println(s"Indexed ${result.successCount} persons with nested structures")
}
```
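
The plain NDJSON source listed in the table has no example above; a minimal sketch, assuming `bulkFromFile` accepts a `Json` format value analogous to `Parquet`, `Delta`, and `JsonArray`:

```scala
// Sketch only: assumes format = Json selects the NDJSON reader,
// following the same bulkFromFile signature as the examples above
client.bulkFromFile(
  filePath = "/data/products.ndjson",
  format = Json,
  idKey = Some("id")
).foreach { result =>
  println(s"Indexed ${result.successCount} docs from NDJSON")
}
```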

@fupelaqu marked this pull request as ready for review December 5, 2025 20:39
@fupelaqu merged commit e4b0e40 into main Dec 6, 2025
2 checks passed
@fupelaqu deleted the feature/bulkFromSourceFile branch December 9, 2025 11:05