
Commit b2234d6

Making max integrated cache staleness configurable for Spark connector (Azure#32592)
* Add benchmark code
* Add tupleSize for readMany-specific benchmarking
* Added configurability for MaxIntegratedCacheStaleness for reads
* Fixed compilation error
* Added edge case handling for tuning integrated cache staleness.
* Modified property name for tuning integrated cache staleness.
* Modified spec test to verify parsing of max integrated cache staleness.
* Add more duration granularities for max integrated cache staleness.
* Reduced duration granularities for max integrated cache staleness.
* Modified spark config name for max integrated cache staleness.
* Modified CHANGELOG.md and configuration-reference.md
* Modified CHANGELOG.md
* Removed locale information link verification issue.
* Modified CHANGELOG.md
* Used case-based equality check.
* Addressed review comments.
* Modified CHANGELOG.md
* Updated configuration-reference.md
1 parent 15014b8 commit b2234d6

File tree: 10 files changed, +63 −12 lines

sdk/cosmos/azure-cosmos-spark_3-1_2-12/CHANGELOG.md

Lines changed: 2 additions & 0 deletions

@@ -3,6 +3,8 @@
 ### 4.16.0-beta.1 (Unreleased)
 
 #### Features Added
+* Added the `spark.cosmos.read.maxIntegratedCacheStalenessInMS` configuration key
+  to make `MaxIntegratedCacheStaleness` tunable for caching queries. - See [PR 32592](https://github.com/Azure/azure-sdk-for-java/pull/32592)
 
 #### Breaking Changes
 
sdk/cosmos/azure-cosmos-spark_3-2_2-12/CHANGELOG.md

Lines changed: 2 additions & 0 deletions

@@ -3,6 +3,8 @@
 ### 4.16.0-beta.1 (Unreleased)
 
 #### Features Added
+* Added the `spark.cosmos.read.maxIntegratedCacheStalenessInMS` configuration key
+  to make `MaxIntegratedCacheStaleness` tunable for caching queries. - See [PR 32592](https://github.com/Azure/azure-sdk-for-java/pull/32592)
 
 #### Breaking Changes
 
sdk/cosmos/azure-cosmos-spark_3-3_2-12/CHANGELOG.md

Lines changed: 2 additions & 1 deletion

@@ -3,6 +3,8 @@
 ### 4.16.0-beta.1 (Unreleased)
 
 #### Features Added
+* Added the `spark.cosmos.read.maxIntegratedCacheStalenessInMS` configuration key
+  to make `MaxIntegratedCacheStaleness` tunable for caching queries. - See [PR 32592](https://github.com/Azure/azure-sdk-for-java/pull/32592)
 
 #### Breaking Changes
 
@@ -12,7 +14,6 @@
 
 ### 4.15.0 (2022-11-16)
 
-
 #### Features Added
 Spark 3.3 support: - See [PR 31666](https://github.com/Azure/azure-sdk-for-java/pull/31666).
 #### Other Changes
sdk/cosmos/azure-cosmos-spark_3_2-12/docs/configuration-reference.md

Lines changed: 6 additions & 4 deletions

@@ -38,10 +38,12 @@
 | `spark.cosmos.write.patch.filter` | None | Used for [Conditional patch](https://docs.microsoft.com/azure/cosmos-db/partial-document-update-getting-started#java) |
 
 ### Query Config
 | Config Property Name | Default | Description |
 | :--- | :--- | :--- |
 | `spark.cosmos.read.customQuery` | None | When provided, the custom query will be processed against the Cosmos endpoint instead of dynamically generating the query via predicate push down. Usually it is recommended to rely on Spark's predicate push down because that will generate the most efficient set of filters based on the query plan. But there are a couple of predicates like aggregates (count, group by, avg, sum etc.) that cannot be pushed down yet (at least in Spark 3.1) - so the custom query is a fallback to allow them to be pushed into the query sent to Cosmos. If specified, with schema inference enabled, the custom query will also be used to infer the schema. |
 | `spark.cosmos.read.maxItemCount` | `1000` | Overrides the maximum number of documents that can be returned for a single query or change feed request. The default value is `1000` - consider increasing this only for average document sizes significantly smaller than 1KB or when projection reduces the number of properties selected in queries significantly (like when only selecting "id" of documents etc.). |
+| `spark.cosmos.read.maxIntegratedCacheStalenessInMS` | None | Sets the max time window in milliseconds for which query results remain cached in the integrated cache in the dedicated gateway connectivity mode. Learn more about `MaxIntegratedCacheStaleness` [here](https://learn.microsoft.com/azure/cosmos-db/integrated-cache). |
 
 #### Schema Inference Config
 When doing read operations, users can specify a custom schema or allow the connector to infer it. Schema inference is enabled by default.

(The remaining churn in this hunk is whitespace-only re-alignment of the table's column padding, so the unchanged rows are shown as context above.)
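For a sense of how the new key is used in practice, here is a minimal usage sketch in Scala; the account endpoint, key, database, and container values are placeholders, and the 5000 ms bound is an arbitrary example. The setting only takes effect when requests are routed through a dedicated gateway with the integrated cache enabled:

```scala
// Minimal usage sketch - endpoint, key, database, and container are placeholders.
// With the integrated cache enabled on a dedicated gateway, queries may be served
// from the cache as long as the cached entry is at most 5000 ms old.
val df = spark.read
  .format("cosmos.oltp")
  .option("spark.cosmos.accountEndpoint", "https://<account>.sqlx.cosmos.azure.com:443/") // placeholder dedicated gateway endpoint
  .option("spark.cosmos.accountKey", "<account-key>")                                     // placeholder
  .option("spark.cosmos.database", "myDatabase")
  .option("spark.cosmos.container", "myContainer")
  .option("spark.cosmos.useGatewayMode", "true") // integrated cache requires gateway connectivity
  .option("spark.cosmos.read.maxIntegratedCacheStalenessInMS", "5000")
  .load()
```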

sdk/cosmos/azure-cosmos-spark_3_2-12/src/main/scala/com/azure/cosmos/spark/CosmosConfig.scala

Lines changed: 27 additions & 1 deletion

@@ -5,7 +5,7 @@ package com.azure.cosmos.spark
 
 import com.azure.cosmos.implementation.{SparkBridgeImplementationInternal, Strings}
 import com.azure.cosmos.implementation.routing.LocationHelper
-import com.azure.cosmos.models.{CosmosChangeFeedRequestOptions, CosmosParameterizedQuery, FeedRange}
+import com.azure.cosmos.models.{CosmosChangeFeedRequestOptions, CosmosParameterizedQuery, DedicatedGatewayRequestOptions, FeedRange}
 import com.azure.cosmos.spark.ChangeFeedModes.ChangeFeedMode
 import com.azure.cosmos.spark.ChangeFeedStartFromModes.{ChangeFeedStartFromMode, PointInTime}
 import com.azure.cosmos.spark.CosmosPatchOperationTypes.CosmosPatchOperationTypes
@@ -86,6 +86,7 @@ private[spark] object CosmosConfigNames {
   val ThroughputControlPreferredRegionsList = "spark.cosmos.throughputControl.preferredRegionsList"
   val ThroughputControlDisableTcpConnectionEndpointRediscovery = "spark.cosmos.throughputControl.disableTcpConnectionEndpointRediscovery"
   val ThroughputControlUseGatewayMode = "spark.cosmos.throughputControl.useGatewayMode"
+  val ReadMaxIntegratedCacheStalenessInMilliseconds = "spark.cosmos.read.maxIntegratedCacheStalenessInMS"
   val ThroughputControlName = "spark.cosmos.throughputControl.name"
   val ThroughputControlTargetThroughput = "spark.cosmos.throughputControl.targetThroughput"
   val ThroughputControlTargetThroughputThreshold = "spark.cosmos.throughputControl.targetThroughputThreshold"
@@ -153,6 +154,7 @@ private[spark] object CosmosConfigNames {
     ThroughputControlPreferredRegionsList,
     ThroughputControlDisableTcpConnectionEndpointRediscovery,
     ThroughputControlUseGatewayMode,
+    ReadMaxIntegratedCacheStalenessInMilliseconds,
     ThroughputControlName,
     ThroughputControlTargetThroughput,
     ThroughputControlTargetThroughputThreshold,
@@ -405,6 +407,7 @@ private case class CosmosReadConfig(forceEventualConsistency: Boolean,
                                     schemaConversionMode: SchemaConversionMode,
                                     maxItemCount: Int,
                                     prefetchBufferSize: Int,
+                                    dedicatedGatewayRequestOptions: DedicatedGatewayRequestOptions,
                                     customQuery: Option[CosmosParameterizedQuery])
 
 private object SchemaConversionModes extends Enumeration {
@@ -465,12 +468,34 @@ private object CosmosReadConfig {
     "See `reactor.util.concurrent.Queues.get(int)` for more details. This means by the max. memory used for " +
     "buffering is 5 MB multiplied by the effective prefetch buffer size for each Executor/CPU-Core.")
 
+  private val MaxIntegratedCacheStalenessInMilliseconds = CosmosConfigEntry[Duration](
+    key = CosmosConfigNames.ReadMaxIntegratedCacheStalenessInMilliseconds,
+    mandatory = false,
+    defaultValue = None,
+    parseFromStringFunction = queryText => Duration.ofMillis(queryText.toLong),
+    helpMessage = "The max integrated cache staleness is the time window in milliseconds within which subsequent reads and queries are served from " +
+      "the integrated cache configured with the dedicated gateway. The request is served from the integrated cache itself provided the data " +
+      "has not been evicted from the cache or a new read is run with a lower MaxIntegratedCacheStaleness than the age of the current cached " +
+      "entry."
+  )
+
   def parseCosmosReadConfig(cfg: Map[String, String]): CosmosReadConfig = {
     val forceEventualConsistency = CosmosConfigEntry.parse(cfg, ForceEventualConsistency)
     val jsonSchemaConversionMode = CosmosConfigEntry.parse(cfg, JsonSchemaConversion)
     val customQuery = CosmosConfigEntry.parse(cfg, CustomQuery)
     val maxItemCount = CosmosConfigEntry.parse(cfg, MaxItemCount)
     val prefetchBufferSize = CosmosConfigEntry.parse(cfg, PrefetchBufferSize)
+    val maxIntegratedCacheStalenessInMilliseconds = CosmosConfigEntry.parse(cfg, MaxIntegratedCacheStalenessInMilliseconds)
+    val dedicatedGatewayRequestOptions = {
+      val result = new DedicatedGatewayRequestOptions
+      maxIntegratedCacheStalenessInMilliseconds match {
+        case Some(stalenessProvidedByUser) =>
+          result.setMaxIntegratedCacheStaleness(stalenessProvidedByUser)
+        case None =>
+      }
+      result
+    }
+
 
     CosmosReadConfig(
       forceEventualConsistency.get,
@@ -487,6 +512,7 @@ private object CosmosReadConfig {
         case None => 8
       }
     ),
+      dedicatedGatewayRequestOptions,
       customQuery)
   }
 }
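Stripped of the connector plumbing, the wiring above boils down to the following calls on the Java SDK's public `DedicatedGatewayRequestOptions` and `CosmosQueryRequestOptions` types (a standalone sketch; the 5000 ms value is arbitrary):

```scala
import com.azure.cosmos.models.{CosmosQueryRequestOptions, DedicatedGatewayRequestOptions}
import java.time.Duration

// Parse the user-supplied millisecond value into a Duration, wrap it in
// DedicatedGatewayRequestOptions, and attach those options to the query
// request options used for reads and schema inference.
val gatewayOptions = new DedicatedGatewayRequestOptions()
gatewayOptions.setMaxIntegratedCacheStaleness(Duration.ofMillis(5000))

val queryOptions = new CosmosQueryRequestOptions()
queryOptions.setDedicatedGatewayRequestOptions(gatewayOptions)
```

This mirrors what `ItemsPartitionReader` and `CosmosTableSchemaInferrer` do with the parsed `dedicatedGatewayRequestOptions` in the hunks below.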

sdk/cosmos/azure-cosmos-spark_3_2-12/src/main/scala/com/azure/cosmos/spark/CosmosConstants.scala

Lines changed: 2 additions & 0 deletions

@@ -7,6 +7,8 @@ import com.azure.core.util.CoreUtils
 import com.azure.cosmos.implementation.HttpConstants
 import reactor.util.concurrent.Queues
 
+import java.time.Duration
+
 // cosmos db related constants
 private object CosmosConstants {
   private[this] val propertiesFileName = "azure-cosmos-spark.properties"

sdk/cosmos/azure-cosmos-spark_3_2-12/src/main/scala/com/azure/cosmos/spark/CosmosTableSchemaInferrer.scala

Lines changed: 4 additions & 1 deletion

@@ -3,12 +3,13 @@
 package com.azure.cosmos.spark
 
 import com.azure.cosmos.implementation.ImplementationBridgeHelpers
-import com.azure.cosmos.models.{CosmosQueryRequestOptions, FeedRange}
+import com.azure.cosmos.models.{CosmosQueryRequestOptions, DedicatedGatewayRequestOptions, FeedRange}
 import com.azure.cosmos.spark.diagnostics.BasicLoggingTrait
 import com.azure.cosmos.util.CosmosPagedIterable
 import com.fasterxml.jackson.databind.JsonNode
 import org.apache.spark.sql.catalyst.analysis.TypeCoercion
 
+import java.time.Duration
 import java.util.stream.Collectors
 
 // scalastyle:off underscore.import
@@ -117,6 +118,8 @@ private object CosmosTableSchemaInferrer
     SparkUtils.safeOpenConnectionInitCaches(sourceContainer, (msg, e) => logWarning(msg, e))
     val queryOptions = new CosmosQueryRequestOptions()
     queryOptions.setMaxBufferedItemCount(cosmosInferenceConfig.inferSchemaSamplingSize)
+    queryOptions.setDedicatedGatewayRequestOptions(cosmosReadConfig.dedicatedGatewayRequestOptions)
+
     val queryText = cosmosInferenceConfig.inferSchemaQuery match {
       case None =>
         ImplementationBridgeHelpers

sdk/cosmos/azure-cosmos-spark_3_2-12/src/main/scala/com/azure/cosmos/spark/ItemsPartitionReader.scala

Lines changed: 7 additions & 1 deletion

@@ -5,7 +5,7 @@ package com.azure.cosmos.spark
 
 import com.azure.cosmos.implementation.spark.{OperationContextAndListenerTuple, OperationListener}
 import com.azure.cosmos.implementation.{ImplementationBridgeHelpers, SparkBridgeImplementationInternal, SparkRowItem, Strings}
-import com.azure.cosmos.models.{CosmosParameterizedQuery, CosmosQueryRequestOptions, ModelBridgeInternal}
+import com.azure.cosmos.models.{CosmosParameterizedQuery, CosmosQueryRequestOptions, DedicatedGatewayRequestOptions, ModelBridgeInternal}
 import com.azure.cosmos.spark.BulkWriter.getThreadInfo
 import com.azure.cosmos.spark.diagnostics.{DiagnosticsContext, DiagnosticsLoader, LoggerHelper, SparkTaskContext}
 import org.apache.spark.TaskContext
@@ -16,6 +16,8 @@ import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
 import org.apache.spark.sql.connector.read.PartitionReader
 import org.apache.spark.sql.types.StructType
 
+import java.time.Duration
+
 // per spark task there will be one CosmosPartitionReader.
 // This provides iterator to read from the assigned spark partition
 // For now we are creating only one spark partition
@@ -49,6 +51,8 @@ private case class ItemsPartitionReader
     s"query: ${cosmosQuery.toString}, Context: ${operationContext.toString} ${getThreadInfo}")
 
   private val readConfig = CosmosReadConfig.parseCosmosReadConfig(config)
+
+
   private val clientCacheItem = CosmosClientCache(
     CosmosClientConfiguration(config, readConfig.forceEventualConsistency),
     Some(cosmosClientStateHandles.value.cosmosClientMetadataCaches),
@@ -138,6 +142,8 @@ private case class ItemsPartitionReader
       ).toInt
     )
 
+    queryOptions.setDedicatedGatewayRequestOptions(readConfig.dedicatedGatewayRequestOptions)
+
     ImplementationBridgeHelpers
       .CosmosQueryRequestOptionsHelper
       .getCosmosQueryRequestOptionsAccessor

sdk/cosmos/azure-cosmos-spark_3_2-12/src/test/scala/com/azure/cosmos/spark/CosmosConfigSpec.scala

Lines changed: 5 additions & 2 deletions

@@ -7,7 +7,7 @@ import com.azure.cosmos.spark.utils.CosmosPatchTestHelper
 import org.apache.spark.sql.types.{NumericType, StructType}
 
 import java.text.SimpleDateFormat
-import java.time.Instant
+import java.time.{Duration, Instant}
 import java.util.UUID
 import scala.collection.mutable.ListBuffer
 import scala.util.Random
@@ -245,11 +245,13 @@ class CosmosConfigSpec extends UnitSpec {
     config.customQuery shouldBe empty
     config.maxItemCount shouldBe 1000
     config.prefetchBufferSize shouldBe 8
+    config.dedicatedGatewayRequestOptions.getMaxIntegratedCacheStaleness shouldBe null
 
     userConfig = Map(
       "spark.cosmos.read.forceEventualConsistency" -> "false",
       "spark.cosmos.read.schemaConversionMode" -> "Strict",
-      "spark.cosmos.read.maxItemCount" -> "1000"
+      "spark.cosmos.read.maxItemCount" -> "1000",
+      "spark.cosmos.read.maxIntegratedCacheStalenessInMS" -> "1000"
     )
 
     config = CosmosReadConfig.parseCosmosReadConfig(userConfig)
@@ -259,6 +261,7 @@ class CosmosConfigSpec extends UnitSpec {
     config.customQuery shouldBe empty
     config.maxItemCount shouldBe 1000
     config.prefetchBufferSize shouldBe 8
+    config.dedicatedGatewayRequestOptions.getMaxIntegratedCacheStaleness shouldBe Duration.ofMillis(1000)
 
     userConfig = Map(
      "spark.cosmos.read.forceEventualConsistency" -> "false",

sdk/cosmos/azure-cosmos-spark_3_2-12/src/test/scala/com/azure/cosmos/spark/FilterAnalyzerSpec.scala

Lines changed: 6 additions & 2 deletions

@@ -2,10 +2,13 @@
 // Licensed under the MIT License.
 package com.azure.cosmos.spark
 
-import com.azure.cosmos.models.CosmosParameterizedQuery
+import com.azure.cosmos.models.{CosmosParameterizedQuery, DedicatedGatewayRequestOptions}
 import org.apache.spark.sql.sources.{AlwaysFalse, AlwaysTrue, EqualTo, Filter, In, IsNotNull, IsNull, StringContains, StringEndsWith, StringStartsWith}
 import org.assertj.core.api.Assertions.assertThat
+import org.assertj.core.api.InstanceOfAssertFactories.DURATION
 import reactor.util.concurrent.Queues
+
+import java.time.Duration
 // scalastyle:off underscore.import
 import scala.collection.JavaConverters._
 // scalastyle:on underscore.import
@@ -16,7 +19,7 @@ class FilterAnalyzerSpec extends UnitSpec {
 
   private[this] val readConfigWithoutCustomQuery =
     new CosmosReadConfig(
-      true, SchemaConversionModes.Relaxed, 100, Queues.XS_BUFFER_SIZE, None)
+      true, SchemaConversionModes.Relaxed, 100, Queues.XS_BUFFER_SIZE, new DedicatedGatewayRequestOptions, None)
   private[this] val queryText = "SELECT * FROM c WHERE c.abc='Hello World'"
   private[this] val query = Some(CosmosParameterizedQuery(
     queryText,
@@ -27,6 +30,7 @@ class FilterAnalyzerSpec extends UnitSpec {
     SchemaConversionModes.Relaxed,
     100,
     Queues.XS_BUFFER_SIZE,
+    new DedicatedGatewayRequestOptions,
     query)
 
   "many filters" should "be translated to cosmos predicates with AND" in {
