Skip to content

Conversation

@kalyanalle
Copy link
Contributor

on multi card systems for each device(GPU)

Related-To: VLCLJ-2646

Note: left the debug prints for testing purpose, will remove in the final code.

Brief Function Logic Explanations
Data Collection Functions

collectMemoryData()
Purpose: Collects memory telemetry from a device
Enumerates memory modules using zesDeviceEnumMemoryModules()
For each module: gets bandwidth counters (read/write) and memory state (free/used)
Stores in deviceData.memoryBandwidth and deviceData.memoryStates
Returns gracefully if no memory modules found
collectPowerData()
Purpose: Collects power consumption telemetry from a device
Enumerates power domains using zesDeviceEnumPowerDomains()
For each domain: gets energy counters using zesPowerGetEnergyCounter()
Stores in deviceData.powerEnergy
Returns gracefully if no power domains found
collectTemperatureData()
Purpose: Collects temperature readings from a device
Enumerates temperature sensors using zesDeviceEnumTemperatureSensors()
For each sensor: gets temperature value using zesTemperatureGetState()
Stores in deviceData.temperatures
Returns gracefully if no temperature sensors found
collectPciData()
Purpose: CRITICAL - Collects PCI info and creates unique device ID
Gets PCI properties using zesDevicePciGetProperties()
Creates BDF string: "bus:device:function" (e.g., "3:0:0")
Gets PCI traffic stats using zesDevicePciGetStats()
Returns false if PCI properties fail (test-critical failure)

Validation Functions
validateUniquePciBdf()
Purpose: CORE PMT VALIDATION - Ensures no duplicate PCI addresses
Uses std::set to detect duplicate BDF identifiers
Returns false if duplicate found → PMT mapping error detected
Most critical validation - proves each device has unique address
validateMemoryDataIsolation()
Purpose: Ensures memory counters differ between all device pairs
Double loop: Compares every device pair (i vs j where j > i)
Checks memory bandwidth: read/write counters must differ
Checks memory state: free memory should differ between devices
EXPECT_FALSE on identical data → detects PMT cross-contamination
validatePowerDataIsolation()
Purpose: Ensures power readings differ between all device pairs
Double loop: Compares every device pair
Checks energy counters: power consumption values must differ
EXPECT_FALSE on identical energy → detects shared power data
validateTemperatureDataIsolation()
Purpose: Validates temperature readings are realistic per device
Double loop: Validates each device's temperature range
Range check: 0°C < temperature < 150°C per device
No uniqueness requirement (idle GPUs may have similar temps)
Ensures PMT thermal interface is accessible
validatePciDataIsolation()
Purpose: CRITICAL - Validates PCI bus uniqueness and traffic isolation
EXPECT_NE on PCI bus numbers → different devices must be on different buses
Compares PCI traffic stats: RX/TX/packet counters must differ
EXPECT_FALSE on identical stats → detects PMT interface sharing
Core PMT mapping validation - validates the commit's fix

SYSMAN_MULTI_DEVICE_TEST,
GivenMultipleDevicesWhenQueryingAllSysmanAPIsThenEachDeviceReturnsUniqueValues) {

LOG_INFO << "=== Multi-Device PMT Interface Validation Test ===";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to mention PMT

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

ASSERT_EQ(ZE_RESULT_SUCCESS,
zesDeviceGet(lzt::get_default_zes_driver(), &deviceCount, nullptr));

if (deviceCount < 2) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (deviceCount < 2) {
if (deviceCount > 1) {

This seems more intuitive.

Copy link
Contributor Author

@kalyanalle kalyanalle Dec 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code checks if there are fewer than 2 devices available. If true, it skips the test rather than failing it.
Without this check, the test would likely crash or fail when trying to access a second GPU that doesn't exist.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not question the need for this check. Just want to convey that the suggested change is more intuitive for a reader.

Comment on lines 346 to 506
GTEST_SKIP()
<< "Multi-device PMT validation requires at least 2 Intel GPUs. "
<< "Found: " << deviceCount << " device(s). "
<< "This test validates PMT interface isolation across multiple GPUs.";
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Avoid mention of PMT

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

noted

LOG_INFO << " Multi-device environment validated: " << deviceCount
<< " devices found";
LOG_INFO << "Testing " << devices.size()
<< " devices for PMT interface isolation";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here as well. Please search for this word and remove it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

for (auto tempHandle : tempHandles) {
double temperature = 0.0;
if (zesTemperatureGetState(tempHandle, &temperature) == ZE_RESULT_SUCCESS) {
deviceData.temperatures.push_back(temperature);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines 271 to 454
// Helper function to validate temperature data isolation between devices
void validateTemperatureDataIsolation(
const std::vector<MultiDeviceData> &deviceData) {
for (size_t i = 0; i < deviceData.size(); i++) {
for (size_t j = i + 1; j < deviceData.size(); j++) {

// Compare temperature readings
for (const auto &tempA : deviceData[i].temperatures) {
for (const auto &tempB : deviceData[j].temperatures) {

// Validate temperature ranges are realistic
EXPECT_GT(tempA, 0.0) << "Device " << deviceData[i].deviceIdentifier
<< " has invalid temperature reading";
EXPECT_LT(tempA, 150.0) << "Device " << deviceData[i].deviceIdentifier
<< " temperature seems unrealistic (>150°C)";

EXPECT_GT(tempB, 0.0) << "Device " << deviceData[j].deviceIdentifier
<< " has invalid temperature reading";
EXPECT_LT(tempB, 150.0) << "Device " << deviceData[j].deviceIdentifier
<< " temperature seems unrealistic (>150°C)";

// Note: Temperatures could be identical if both GPUs are idle,
// so we don't enforce uniqueness here, just realistic values
}
}
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

granted they can come similar maybe this could be made as a warning if it same instead of a failure?
I dont think we need to do the valid range check, since its already done in the temperature specific checks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines 465 to 472
LOG_DEBUG << "[DEBUG] Validating PCI statistics isolation...";
validatePciDataIsolation(deviceData);
LOG_DEBUG << "[DEBUG] PCI statistics isolation validation complete";

LOG_DEBUG << "[DEBUG] Step 3 complete: All isolation validations finished";

// Step 4: Final validation summary
LOG_DEBUG << "[DEBUG] Step 4: Generating final validation summary";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These seem like not necessary comments, please remove

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed all the debug prints in the main test

Comment on lines 416 to 422
LOG_DEBUG << "[DEBUG] Step 1 complete: All device data collected";

// Step 2: Validate PCI BDF uniqueness (PMT interface mapping validation)
LOG_DEBUG << "[DEBUG] Step 2: Validating PCI BDF uniqueness...";
ASSERT_TRUE(validateUniquePciBdf(deviceData))
<< "PMT interface mapping validation failed - duplicate PCI BDF detected";
LOG_DEBUG << "[DEBUG] Step 2 complete: PCI BDF uniqueness validated";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

debug is not needed to be mentioned since we are already using LOG_DEBUG calls.
Also remove usage of the term PMT

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines 212 to 213
for (const auto &bandwidthA : deviceData[i].memoryBandwidth) {
for (const auto &bandwidthB : deviceData[j].memoryBandwidth) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see you are avoiding the ordering issue using brute force here. This is fine, but could be optimized if you can generate the MultiDeviceData structure in an ordered manner. We can use module properties to understand the type of handle for each API and sort it based on that. All cards would return he same number of handles with similar handle types, so you always expect to have a 1:1 mapping.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done , taken care

Copy link
Contributor Author

@kalyanalle kalyanalle Dec 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed nested loops in memory state validation - now uses 1:1 comparison with sorted data
Added 1:1 comparison for power and temperature validation
Added count verification - ensures devices have same number of modules/domains/sensors
Fixed incomplete main test - added missing validation calls
Better error messages - include module/domain/sensor index numbers
Consistent logging - added success indicators

@kalyanalle kalyanalle requested a review from shubskmr December 8, 2025 11:00
@kalyanalle kalyanalle force-pushed the multi_gpu_pmt branch 3 times, most recently from a674d35 to 4d5f5e3 Compare December 11, 2025 11:20
on multi card systems for each device(GPU)

Related-To: VLCLJ-2646

Signed-off-by: Alle, Kalyan <kalyan.alle@intel.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants