Merged

31 commits
cb55d11
feat(498): Add ownerReferences to managed entities
mbegenau Dec 4, 2022
7bd12ec
Merge branch 'master' into feat-498-ownerReferences
FxKu Jul 1, 2024
99b56cf
Update pkg/cluster/k8sres.go
FxKu Jul 1, 2024
6dc6cfe
Merge branch 'master' into mbegenau-feat-498-ownerReferences
FxKu Jul 8, 2024
a2c332e
Merge branch 'mbegenau-feat-498-ownerReferences' of github.com:zaland…
FxKu Jul 8, 2024
52e42c4
fix unit test for pdb
FxKu Jul 8, 2024
b7b6174
empty owner reference for cross namespace secret and more tests
FxKu Jul 8, 2024
731a4cd
update ownerReferences of existing resources
FxKu Jul 9, 2024
aa8473e
include diff also for PDB
FxKu Jul 10, 2024
47fc255
CR ownerReference on PVC blocks pvc retention policy of statefulset
FxKu Jul 12, 2024
3f3b8fe
make ownerreferences optional and disabled by default
FxKu Jul 15, 2024
279c775
update unit test to check len ownerReferences
FxKu Jul 16, 2024
eeacdd9
update codegen
FxKu Jul 16, 2024
1e19a6b
add owner references e2e test
FxKu Jul 17, 2024
6809cb5
update unit test and add debug print
FxKu Jul 17, 2024
fc1fb2c
add block_owner_deletion field to test owner reference
FxKu Jul 17, 2024
f832a4b
fix test owner reference
FxKu Jul 17, 2024
77aa634
why are owner refs different
FxKu Jul 17, 2024
4f1abb8
stop comparing dicts
FxKu Jul 17, 2024
27699e8
fix unit test
FxKu Jul 17, 2024
324677e
fix assertion in unit test
FxKu Jul 18, 2024
b9424e3
improve e2e test
FxKu Jul 18, 2024
3c794f3
minor fix in e2e test
FxKu Jul 18, 2024
787f37d
another minor fix in e2e test
FxKu Jul 18, 2024
fb197fa
removing ownerReference requires Update API call
FxKu Jul 18, 2024
8d175d9
debug e2e tests
FxKu Jul 18, 2024
b56cffd
owner reference toggle only via update call
FxKu Jul 19, 2024
3240c01
e2e now working - change names to re-order
FxKu Jul 19, 2024
2999dff
fix around delete protection
FxKu Jul 19, 2024
3abfba9
fix typos and update docs once more
FxKu Jul 22, 2024
b188c6d
reflect code feedback
FxKu Aug 1, 2024
7 changes: 5 additions & 2 deletions charts/postgres-operator/crds/operatorconfigurations.yaml
@@ -211,9 +211,9 @@ spec:
enable_init_containers:
type: boolean
default: true
enable_secrets_deletion:
enable_owner_references:
type: boolean
default: true
default: false
enable_persistent_volume_claim_deletion:
type: boolean
default: true
@@ -226,6 +226,9 @@ spec:
enable_readiness_probe:
type: boolean
default: false
enable_secrets_deletion:
type: boolean
default: true
enable_sidecars:
type: boolean
default: true
2 changes: 2 additions & 0 deletions charts/postgres-operator/templates/clusterrole.yaml
@@ -120,6 +120,7 @@ rules:
- create
- delete
- get
- patch
- update
# to check nodes for node readiness label
- apiGroups:
@@ -196,6 +197,7 @@ rules:
- get
- list
- patch
- update
# to CRUD cron jobs for logical backups
- apiGroups:
- batch
6 changes: 4 additions & 2 deletions charts/postgres-operator/values.yaml
@@ -129,8 +129,8 @@ configKubernetes:
enable_finalizers: false
# enables initContainers to run actions before Spilo is started
enable_init_containers: true
# toggles if operator should delete secrets on cluster deletion
enable_secrets_deletion: true
# toggles if child resources should have an owner reference to the postgresql CR
enable_owner_references: false
# toggles if operator should delete PVCs on cluster deletion
enable_persistent_volume_claim_deletion: true
# toggles pod anti affinity on the Postgres pods
@@ -139,6 +139,8 @@ configKubernetes:
enable_pod_disruption_budget: true
# toggles readiness probe for database pods
enable_readiness_probe: false
# toggles if operator should delete secrets on cluster deletion
enable_secrets_deletion: true
# enables sidecar containers to run alongside Spilo in the same pod
enable_sidecars: true

70 changes: 62 additions & 8 deletions docs/administrator.md
@@ -223,9 +223,9 @@ configuration:

Now, every cluster manifest must contain the configured annotation keys to
trigger the delete process when running `kubectl delete pg`. Note, that the
`Postgresql` resource would still get deleted as K8s' API server does not
block it. Only the operator logs will tell, that the delete criteria wasn't
met.
`Postgresql` resource would still get deleted because the operator does not
instruct K8s' API server to block it. Only the operator logs will tell that
the delete criteria were not met.

**cluster manifest**

@@ -243,11 +243,65 @@ spec:

In case, the resource has been deleted accidentally or the annotations were
simply forgotten, it's safe to recreate the cluster with `kubectl create`.
Existing Postgres cluster are not replaced by the operator. But, as the
original cluster still exists the status will show `CreateFailed` at first.
On the next sync event it should change to `Running`. However, as it is in
fact a new resource for K8s, the UID will differ which can trigger a rolling
update of the pods because the UID is used as part of backup path to S3.
Existing Postgres clusters are not replaced by the operator. But, while the
original cluster still exists, the status will show `CreateFailed` at first. On
the next sync event it should change to `Running`. However, because it is in
fact a new resource for K8s, the UID, and therefore the backup path to S3,
will differ and trigger a rolling update of the pods.

## Owner References and Finalizers

The Postgres Operator can set [owner references](https://kubernetes.io/docs/concepts/overview/working-with-objects/owners-dependents/) on most of a cluster's child resources to improve
monitoring with GitOps tools and enable cascading deletes. There are three
exceptions:

* Persistent Volume Claims, because they are handled by the [PV Reclaim Policy](https://kubernetes.io/docs/tasks/administer-cluster/change-pv-reclaim-policy/) of the StatefulSet
* The config endpoint + headless service resource because it is managed by Patroni
* Cross-namespace secrets, because owner references are not allowed across namespaces by design

The operator would clean these resources up with its regular delete loop,
provided they got synced correctly. If for some reason the initial cluster sync
fails, e.g. after a cluster creation or operator restart, a deletion of the
cluster manifest would leave orphaned resources behind which the user has to
clean up manually.
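
For illustration, a child resource owned by a cluster named
`acid-minimal-cluster` would carry metadata roughly like the sketch below. The
`uid` is cluster-specific, and `blockOwnerDeletion` is an assumption based on
this PR's commit history rather than a documented guarantee:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: acid-minimal-cluster
  ownerReferences:
    - apiVersion: acid.zalan.do/v1
      kind: postgresql
      name: acid-minimal-cluster
      uid: 1234abcd-…            # UID of the owning postgresql custom resource
      controller: true
      blockOwnerDeletion: true   # assumed; the e2e test in this PR checks kind and controller
```

With such a reference in place, deleting the `postgresql` custom resource lets
the Kubernetes garbage collector remove the owned objects without any further
operator action.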

Another option is to enable finalizers which first ensures the deletion of all
child resources before the cluster manifest gets removed. There is a trade-off
though: The deletion is only performed after the next two operator SYNC cycles
with the first one setting a `deletionTimestamp` and the latter reacting to it.
The final removal of the custom resource will add a DELETE event to the worker
queue but the child resources are already gone at this point. If you do not
desire this behavior, consider enabling owner references instead.

**postgres-operator ConfigMap**

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: postgres-operator
data:
enable_finalizers: "false"
enable_owner_references: "true"
```

**OperatorConfiguration**

```yaml
apiVersion: "acid.zalan.do/v1"
kind: OperatorConfiguration
metadata:
name: postgresql-operator-configuration
configuration:
kubernetes:
enable_finalizers: false
enable_owner_references: true
```

:warning: Please note that both options are disabled by default. When enabling
owner references, the operator cannot block cascading deletes, even when the
[delete protection annotations](administrator.md#delete-protection-via-annotations)
are in place. You would need a K8s admission controller that blocks the actual
`kubectl delete` API call, e.g. based on existing annotations.
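
A minimal sketch of such a guard, using the Kubernetes
`ValidatingAdmissionPolicy` API (stable since v1.30). The annotation key
`delete-clustername` and the policy names are illustrative assumptions and not
something the operator ships:

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: postgresql-delete-protection
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
      - apiGroups: ["acid.zalan.do"]
        apiVersions: ["v1"]
        operations: ["DELETE"]
        resources: ["postgresqls"]
  validations:
    # on DELETE only oldObject is populated; require the protection annotation
    # to exist and to match the cluster name
    - expression: >-
        has(oldObject.metadata.annotations) &&
        'delete-clustername' in oldObject.metadata.annotations &&
        oldObject.metadata.annotations['delete-clustername'] == oldObject.metadata.name
      message: "deletion blocked: delete protection annotation missing or mismatched"
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: postgresql-delete-protection-binding
spec:
  policyName: postgresql-delete-protection
  validationActions: ["Deny"]
```

With the binding in place, a `kubectl delete postgresql` call for a cluster
without the matching annotation is rejected by the API server before the
operator or the garbage collector ever see the request.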

## Role-based access control for the operator

49 changes: 25 additions & 24 deletions docs/reference/operator_parameters.md
@@ -263,6 +263,31 @@ Parameters to configure cluster-related Kubernetes objects created by the
operator, as well as some timeouts associated with them. In a CRD-based
configuration they are grouped under the `kubernetes` key.

* **enable_finalizers**
By default, a deletion of the Postgresql resource will trigger an event
that leads to a cleanup of all child resources. However, if the database
cluster is in a broken state (e.g. failed initialization) and the operator
cannot fully sync it, there can be leftovers. By enabling finalizers the
operator will ensure all managed resources are deleted prior to the
Postgresql resource. See also [admin docs](../administrator.md#owner-references-and-finalizers)
for more information. The default is `false`.

* **enable_owner_references**
The operator can set owner references on its child resources (except PVCs,
Patroni config service/endpoint, cross-namespace secrets) to improve cluster
monitoring and enable cascading deletion. The default is `false`. Warning:
enabling this option disables configured delete protection checks (see below).

* **delete_annotation_date_key**
key name for annotation that compares manifest value with current date in the
YYYY-MM-DD format. Allowed pattern: `'([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9]'`.
The default is empty which also disables this delete protection check.

* **delete_annotation_name_key**
key name for annotation that compares manifest value with Postgres cluster name.
Allowed pattern: `'([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9]'`. The default is
empty which also disables this delete protection check.

* **pod_service_account_name**
service account used by Patroni running on individual Pods to communicate
with the operator. Required even if native Kubernetes support in Patroni is
@@ -293,16 +318,6 @@ configuration they are grouped under the `kubernetes` key.
of a database created by the operator. If the annotation key is also provided
by the database definition, the database definition value is used.

* **delete_annotation_date_key**
key name for annotation that compares manifest value with current date in the
YYYY-MM-DD format. Allowed pattern: `'([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9]'`.
The default is empty which also disables this delete protection check.

* **delete_annotation_name_key**
key name for annotation that compares manifest value with Postgres cluster name.
Allowed pattern: `'([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9]'`. The default is
empty which also disables this delete protection check.

* **downscaler_annotations**
An array of annotations that should be passed from Postgres CRD on to the
statefulset and, if exists, to the connection pooler deployment as well.
@@ -332,20 +347,6 @@ configuration they are grouped under the `kubernetes` key.
drained if the node_readiness_label is not used. If this option is set to
`false` the `spilo-role=master` selector will not be added to the PDB.

* **enable_finalizers**
By default, a deletion of the Postgresql resource will trigger an event
that leads to a cleanup of all child resources. However, if the database
cluster is in a broken state (e.g. failed initialization) and the operator
cannot fully sync it, there can be leftovers. By enabling finalizers the
operator will ensure all managed resources are deleted prior to the
Postgresql resource. There is a trade-off though: The deletion is only
performed after the next two SYNC cycles with the first one updating the
internal spec and the latter reacting on the `deletionTimestamp` while
processing the SYNC event. The final removal of the custom resource will
add a DELETE event to the worker queue but the child resources are already
gone at this point.
The default is `false`.

* **persistent_volume_claim_retention_policy**
The operator tries to protect volumes as much as possible. If somebody
accidentally deletes the statefulset or scales in the `numberOfInstances` the
109 changes: 100 additions & 9 deletions e2e/tests/test_e2e.py
@@ -95,7 +95,7 @@ def setUpClass(cls):
print("Failed to delete the 'standard' storage class: {0}".format(e))

# operator deploys pod service account there on start up
# needed for test_multi_namespace_support()
# needed for test_multi_namespace_support and test_owner_references
cls.test_namespace = "test"
try:
v1_namespace = client.V1Namespace(metadata=client.V1ObjectMeta(name=cls.test_namespace))
@@ -1352,17 +1352,11 @@ def test_multi_namespace_support(self):
k8s.wait_for_pod_start("spilo-role=master", self.test_namespace)
k8s.wait_for_pod_start("spilo-role=replica", self.test_namespace)
self.assert_master_is_unique(self.test_namespace, "acid-test-cluster")
# acid-test-cluster will be deleted in test_owner_references test

except timeout_decorator.TimeoutError:
print('Operator log: {}'.format(k8s.get_operator_log()))
raise
finally:
# delete the new cluster so that the k8s_api.get_operator_state works correctly in subsequent tests
# ideally we should delete the 'test' namespace here but
# the pods inside the namespace stuck in the Terminating state making the test time out
k8s.api.custom_objects_api.delete_namespaced_custom_object(
"acid.zalan.do", "v1", self.test_namespace, "postgresqls", "acid-test-cluster")
time.sleep(5)

@timeout_decorator.timeout(TEST_TIMEOUT_SEC)
@unittest.skip("Skipping this test until fixed")
@@ -1573,6 +1567,71 @@ def test_overwrite_pooler_deployment(self):
self.eventuallyEqual(lambda: k8s.count_running_pods("connection-pooler="+pooler_name),
0, "Pooler pods not scaled down")

@timeout_decorator.timeout(TEST_TIMEOUT_SEC)
def test_owner_references(self):
'''
Enable owner references, test if resources get updated and test cascade deletion of test cluster.
'''
k8s = self.k8s
cluster_name = 'acid-test-cluster'
cluster_label = 'application=spilo,cluster-name={}'.format(cluster_name)
default_test_cluster = 'acid-minimal-cluster'

try:
# enable owner references in config
enable_owner_refs = {
"data": {
"enable_owner_references": "true"
}
}
k8s.update_config(enable_owner_refs)
self.eventuallyEqual(lambda: k8s.get_operator_state(), {"0": "idle"}, "Operator does not get in sync")

time.sleep(5) # wait for the operator to sync the cluster and update resources

# check if child resources were updated with owner references
self.assertTrue(self.check_cluster_child_resources_owner_references(cluster_name, self.test_namespace), "Owner references not set on all child resources of {}".format(cluster_name))
self.assertTrue(self.check_cluster_child_resources_owner_references(default_test_cluster), "Owner references not set on all child resources of {}".format(default_test_cluster))

# delete the new cluster to test owner references
# and also to make k8s_api.get_operator_state work better in subsequent tests
# ideally we should delete the 'test' namespace here but the pods
# inside the namespace stuck in the Terminating state making the test time out
k8s.api.custom_objects_api.delete_namespaced_custom_object(
"acid.zalan.do", "v1", self.test_namespace, "postgresqls", cluster_name)

# statefulset, pod disruption budget and secrets should be deleted via owner reference
self.eventuallyEqual(lambda: k8s.count_pods_with_label(cluster_label), 0, "Pods not deleted")
self.eventuallyEqual(lambda: k8s.count_statefulsets_with_label(cluster_label), 0, "Statefulset not deleted")
self.eventuallyEqual(lambda: k8s.count_pdbs_with_label(cluster_label), 0, "Pod disruption budget not deleted")
self.eventuallyEqual(lambda: k8s.count_secrets_with_label(cluster_label), 0, "Secrets were not deleted")

time.sleep(5) # wait for the operator to also delete the leftovers

# pvcs and Patroni config service/endpoint should not be affected by owner reference
# but deleted by the operator almost immediately
self.eventuallyEqual(lambda: k8s.count_pvcs_with_label(cluster_label), 0, "PVCs not deleted")
self.eventuallyEqual(lambda: k8s.count_services_with_label(cluster_label), 0, "Patroni config service not deleted")
self.eventuallyEqual(lambda: k8s.count_endpoints_with_label(cluster_label), 0, "Patroni config endpoint not deleted")

# disable owner references in config
disable_owner_refs = {
"data": {
"enable_owner_references": "false"
}
}
k8s.update_config(disable_owner_refs)
self.eventuallyEqual(lambda: k8s.get_operator_state(), {"0": "idle"}, "Operator does not get in sync")

time.sleep(5) # wait for the operator to remove owner references

# check if child resources were updated without Postgresql owner references
self.assertTrue(self.check_cluster_child_resources_owner_references(default_test_cluster, "default", True), "Owner references still present on some child resources of {}".format(default_test_cluster))

except timeout_decorator.TimeoutError:
print('Operator log: {}'.format(k8s.get_operator_log()))
raise

@timeout_decorator.timeout(TEST_TIMEOUT_SEC)
def test_password_rotation(self):
'''
@@ -1771,7 +1830,6 @@ def test_rolling_update_flag(self):
replica = k8s.get_cluster_replica_pod()
self.assertTrue(replica.metadata.creation_timestamp > old_creation_timestamp, "Old master pod was not recreated")


except timeout_decorator.TimeoutError:
print('Operator log: {}'.format(k8s.get_operator_log()))
raise
@@ -2202,6 +2260,39 @@ def assert_distributed_pods(self, target_nodes, cluster_labels='cluster-name=aci

return True

def check_cluster_child_resources_owner_references(self, cluster_name, cluster_namespace='default', inverse=False):
k8s = self.k8s

# check if child resources were updated with owner references
sset = k8s.api.apps_v1.read_namespaced_stateful_set(cluster_name, cluster_namespace)
self.assertTrue(self.has_postgresql_owner_reference(sset.metadata.owner_references, inverse), "statefulset owner reference check failed")

svc = k8s.api.core_v1.read_namespaced_service(cluster_name, cluster_namespace)
self.assertTrue(self.has_postgresql_owner_reference(svc.metadata.owner_references, inverse), "primary service owner reference check failed")
replica_svc = k8s.api.core_v1.read_namespaced_service(cluster_name + "-repl", cluster_namespace)
self.assertTrue(self.has_postgresql_owner_reference(replica_svc.metadata.owner_references, inverse), "replica service owner reference check failed")

ep = k8s.api.core_v1.read_namespaced_endpoints(cluster_name, cluster_namespace)
self.assertTrue(self.has_postgresql_owner_reference(ep.metadata.owner_references, inverse), "primary endpoint owner reference check failed")
replica_ep = k8s.api.core_v1.read_namespaced_endpoints(cluster_name + "-repl", cluster_namespace)
self.assertTrue(self.has_postgresql_owner_reference(replica_ep.metadata.owner_references, inverse), "replica owner reference check failed")

pdb = k8s.api.policy_v1.read_namespaced_pod_disruption_budget("postgres-{}-pdb".format(cluster_name), cluster_namespace)
self.assertTrue(self.has_postgresql_owner_reference(pdb.metadata.owner_references, inverse), "pod disruption owner reference check failed")

pg_secret = k8s.api.core_v1.read_namespaced_secret("postgres.{}.credentials.postgresql.acid.zalan.do".format(cluster_name), cluster_namespace)
self.assertTrue(self.has_postgresql_owner_reference(pg_secret.metadata.owner_references, inverse), "postgres secret owner reference check failed")
standby_secret = k8s.api.core_v1.read_namespaced_secret("standby.{}.credentials.postgresql.acid.zalan.do".format(cluster_name), cluster_namespace)
self.assertTrue(self.has_postgresql_owner_reference(standby_secret.metadata.owner_references, inverse), "standby secret owner reference check failed")

return True

def has_postgresql_owner_reference(self, owner_references, inverse):
if inverse:
return owner_references is None or owner_references[0].kind != 'postgresql'

return owner_references is not None and owner_references[0].kind == 'postgresql' and owner_references[0].controller

def list_databases(self, pod_name):
'''
Get list of databases we might want to iterate over
3 changes: 2 additions & 1 deletion manifests/configmap.yaml
@@ -49,7 +49,7 @@ data:
enable_master_pooler_load_balancer: "false"
enable_password_rotation: "false"
enable_patroni_failsafe_mode: "false"
enable_secrets_deletion: "true"
enable_owner_references: "false"
enable_persistent_volume_claim_deletion: "true"
enable_pgversion_env_var: "true"
# enable_pod_antiaffinity: "false"
@@ -59,6 +59,7 @@ data:
enable_readiness_probe: "false"
enable_replica_load_balancer: "false"
enable_replica_pooler_load_balancer: "false"
enable_secrets_deletion: "true"
# enable_shm_volume: "true"
# enable_sidecars: "true"
enable_spilo_wal_path_compat: "true"
2 changes: 2 additions & 0 deletions manifests/operator-service-account-rbac-openshift.yaml
@@ -94,6 +94,7 @@ rules:
- create
- delete
- get
- patch
- update
# to check nodes for node readiness label
- apiGroups:
Expand Down Expand Up @@ -166,6 +167,7 @@ rules:
- get
- list
- patch
- update
# to CRUD cron jobs for logical backups
- apiGroups:
- batch