Skip to content

Conversation

@AFaust
Copy link

@AFaust AFaust commented Nov 5, 2025

This PR provides changes to support ElasticSearch with regards to index health check and fix functionalities. This includes handling to determine whether SOLR or ElasticSearch plugins are actually relevant based on checks on the Search subsystem activation state (checking the sourceBeanName property dynamically).

The ElasticSearch support includes the following capabilities:

  • checking existence of node index entries
    • filtering based on limitations for event generation (stores which do not support behaviours at all - which are necessary for events to work - and node types filtered out by ACS event filtering)
    • doing checks for all nodes, including those deleted and not expected in the index (filtered by SOLR plugin setup), since deletion events may have been lost/missed
    • checking internal ALIVE state which seems to only be partially used by ACS (i.e. Admin Console only counts ALIVE index entries when reporting indexed nodes)
    • checking PATH field if path indexing is expected (issues during indexing based on timing / missed events may cause index entries not to have a PATH field)
  • purging node index entries that should not exist via an explicit direct delete request to ElasticSearch
  • triggering a re-index operation via the live indexer app by issueing a "fake" node creation event
    • if any out-of-process extension is relying on events, such events may interfere with that extension unless the extension checks and takes into account the actual creation date of the node or does secondary check that it has already processed a previous creation event for the same node
    • if issueing "fake" node creation events interferes with out-of-process extensions, it may be possible to set up an ACS instance running health processor which processes events via a separate ActiveMQ instance than is used for "real" events
    • additional effort to reverse-engineer the internal logic + events used between the live indexer mediator and other constitutent index components of the indexer app may in the future enhance the functionality of this PR to directly issue indexing events instead of using "fake" node creation events

The following issues have been encountered and identified with the ElasticSearch-based search subsystem in general:

  • it is not supported to perform any kind of indexing before having an ACS instance with ElasticSearch subsystem enabled perform the initial index setup including model mapping
    • indexer apps are technically able to perform index updates, but when ACS is later switch to use the prepared ElasticSearch instance, conflicts in the model mappings will be reported that indicate search behaviour will not be as expected
    • this is the reason the integration test setup in this PR for the Elastic plugins need to explicitly delete the index + mappings created during Docker bootstrap to ensure a cleanly initialised empty index
  • when activating the ElasticSearch search subsystem in ACS takes about 10-20 seconds to perform asynchronous index initialisation / mapping updates
    • operations relying on index-based searches (e.g. integration tests) may need to include some additional delay / wait time before they can reliably perform any kind of lookups
    • there is no way to check / identify whether the initialisation has completed apart from checking the log file
  • PATH indexing is extremely brittle, can cause a lot of error messages in the indexer app log, introduce significant delays, and cause nodes not to be found when the ACS Admin Console actually shows all nodes to be indexed
    • path information is not part of node events, only parent/ancestor node IDs are included (primaryHierarchy and secondaryParents)
    • the path indexing sub-component of the indexer app uses parent node IDs to lookup their index entry to lookup PATH values from there, and use these as prefixes to build the PATH values for the node to be indexed
    • if an index entry for any parent cannot be found or does not include a PATH field, an error is thrown and event processing is retries with a bit of a processing delay
    • path indexing is processed via a single consumer, so any error + retry with one node holds up indexing other nodes as well
    • when bulk index operations are necessary, the order of events may significantly influence the stability of path indexing, and maybe even its correctness
      • ACS sends node events for a transaction in the order the first updates are applied on nodes, so if node A and node B are initially not in a hierarchy, but then B is moved below A, and then A is renamed (all in the same txn), the event for node B is sent before the event for node A ==> it is unclear, but potentially possible, that this may result in an incorrect PATH for node B because the rename of node A may not be reflected in its PATH index values yet when they are used as prefixes to build the new PATH values of node B
    • the default ACS outbound event filtering based on node types blocks events from being sent for some types of nodes that may be relevant for the PATH of others
      • the default filter includes the wildcard sys:* type filter - this prevents indexing of any sys:container or sys:store_root nodes, which would prevent the Elastic node index fixer plugin of this PR from fixing PATH issues for any nodes immediately below such nodes and transitively for potentially more (the sys:store_root might actually block ANY PATH indexing unless the official re-indexer app was used to ensure the sys:store_root has been indexed)
      • it is highly advisable to adapt the event filter for node types on any ACS instances to avoid missing PATH index values and errors in the indexer app
      • the official re-indexer app indexes more nodes than would be indexed by only reacting to node events, and with the minimal suggested change to the event node type filter, the Elastic index fixer plugin will cause more nodes to be indexed than are estimated to be indexable in the ACS Admin Console Search Services tool
      • the suggested type filter for events is: sys:deleted,sys:descriptor,sys:reference,sys:lost_found,sys:archiveUser, rule:*, act:*, fm:*, cm:thumbnail, cm:failedThumbnail, cm:rating, rma:rmsite include_subtypes, usr:user
      • the default type filter for events is: sys:*, fm:*, cm:thumbnail, cm:failedThumbnail, cm:rating, rma:rmsite include_subtypes, usr:user
      • the suggested filter adds exclusions for nodes related to folder rules and persisted actions, which will always cause errors during PATH indexing because the node grouping all rules for a folder is of type cm:systemFolder, which is hard-coded to be filtered from event sending (as are all sub-types of the system folder type) - it seems questionable any out-of-process extension would ever be interested to react to any events on rules/actions
      • the suggested filter expands the sys:* wildcard to filter out any types of the system model except for the sys:container and sys:store_root - the former is relevant for PATH indexing of person or group authority nodes, the latter is relevant if the Elastic index fixer plugin of this PR is used to trigger an index build from a completely empty ElasticSearch index
    • NOTE: There seem to be functionally breaking changes between ACS 25.1 and 25.2 - in ACS 25.2 the node type event filter seems to be effectively ignored, which is due to the "Sync Service" overriding the default EnterpriseEventGenerator implementation
      • the "Sync Service" implementation is supposed to be virtually identical with additions only where necessary for its functionality, but apparently cause the type of a node to not be loaded yet when the event filter is applied, thus bypassing this particular filter
      • this proved to be a signficant source of confusion when testing this Health Processor PR on both versions of ACS (25.1 in integration tests, 25.2 in a Docker Compose setup external to the project source structure), as the breaking change actually meant ACS 25.2 was successfully running the index fixer plugins even without adapting the node type event filter configuration

@AFaust AFaust requested a review from pvriel November 5, 2025 21:10
@AFaust
Copy link
Author

AFaust commented Nov 5, 2025

@pvriel Would it be possible to add build secrets to the project so that the integration tests using Enterprise ACS 25.1 may run successfully? Unfortunately, since ACS Community does not support ElasticSearch, it was neccessary to add a sub-project for Enterprise to test this particular PR's feature.

@AllArgsConstructor(access = AccessLevel.PRIVATE)
public class NodeHealthReport implements Serializable {

private static final long serialVersionUID = -939101152283421058L;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I don't see a problem with doing this at the moment, I don't understand why we are adding this now?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not blocking for this PR, though.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was just fixing a compiler warning. Any serialisable class should have a serial version UID to ensure Java can detect when attempting to de-serialise an object from binary with an incompatible state. I believe when not explicitly set, JVM may dynamically generate a UID. It could be left out, much like other compiler warnings I tried to address so I wasn't cluttered with warnings to a degree I was missing any I might introduce.

};
try
{
ElasticResult searchResult = elasticRequestExecutor.checkNodeIndexed(searchEndpoint, expectedNodeRefStatuses);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is checkNodeIndexed an expensive operation? I'm wondering whether it makes sense to use a parallel stream (or thread executor) here instead of just using a for-loop.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

checkNodeIndexed is quite inexpensive. It is a multi-retrieval-by-ID kind of operation, and the post-processing (matching nodes into state buckets) is done based on node state already loaded into txn cache before checkNodeIndexed is even called (during the endpoint selection).

entry.getKey().getNodeRef(), messages);
healthReport.data(NodeIndexHealthReport.class).addAll(entry.getValue());
return healthReport;
}).filter(Objects::nonNull).forEach(healthReports::add);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm missing something here: how can this stream produce null values?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It no longer can. At some point in the development it might have, and I just forgot to remove the now redundant nonNull.

@Override
protected Set<NodeHealthReport> doProcess(Set<NodeRef> nodeRefs)
{
Set<NodeHealthReport> healthReports = new HashSet<>(nodeRefs.size());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The process of finding a search endpoint seems to be similar to the one from ElasticIndexValidationHealthProcessorPlugin. Do you think it's worth exploring some abstraction possibilities here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An abstraction for the for-loop would be possible and trivial. It felt wrong to move it into the abstract class I had already introduced for the subsystem-specific enablement logic, and I did not feel like it was worth to introduce yet another intermediate class at the time.

*/
@Slf4j
@AllArgsConstructor
public class ElasticEndpointSelectorBeanPostProcessor implements BeanDefinitionRegistryPostProcessor
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While this is not that difficult to understand, I'm wondering if we can have a more clean solution here instead of dynamically adjusting constructor arguments.
Would it be possible to:

  1. use the endpoint properties as a @ConfigurationProperties bean;
  2. use it as a @ConstructorBinding for the aggregate selector.

Unless I'm misunderstanding this code.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally, I would not use constructor arguments at all. I was mostly trying to follow the existing Lombok-heavy patterns in the health processor module. @ConfigurationProperties or @ConstructorBinding are Spring Boot annotations that should not be available in ACS. My trust in Spring annotations beyond trivial @Value is very low, especially in non-trivial systems like ACS. My personal idea of a cleaner solution would be to turn most (or everything) into regular Spring XML property instead of constructor arguments.

return;
}

BeanDefinition aggregateSelector = registry.getBeanDefinition("eu.xenit.alfresco.healthprocessor.endpoint.solr." + AggregateSearchEndpointSelector.class.getSimpleName());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cf. ElasticEndpointSelectorBeanPostProcessor

{
if (!first)
{
sb.append(',');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to double-check: why only the first one? Or are you only expecting two Statuses?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other way around: For all BUT the first one we are adding a comma, so that the JSON is structurally valid. This needs to support an arbitrary number of entries depending on the configured batch size in the processor / indexing strategy.

@pvriel pvriel force-pushed the XEN-3141-Support-Elastic branch from 2bbd8b8 to 7f88cef Compare November 19, 2025 10:01
@pvriel pvriel force-pushed the XEN-3141-Support-Elastic branch from 98b9131 to 162fcdc Compare November 19, 2025 10:49
@sonarqubecloud
Copy link

Quality Gate Failed Quality Gate failed

Failed conditions
38.4% Coverage on New Code (required ≥ 80%)
D Security Rating on New Code (required ≥ A)

See analysis details on SonarQube Cloud

Catch issues before they fail your Quality Gate with our IDE extension SonarQube for IDE

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants