-
Notifications
You must be signed in to change notification settings - Fork 66
Add new CTS test to validate if unique telemetry is reported by sysman #327
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
| SYSMAN_MULTI_DEVICE_TEST, | ||
| GivenMultipleDevicesWhenQueryingAllSysmanAPIsThenEachDeviceReturnsUniqueValues) { | ||
|
|
||
| LOG_INFO << "=== Multi-Device PMT Interface Validation Test ==="; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to mention PMT
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
| ASSERT_EQ(ZE_RESULT_SUCCESS, | ||
| zesDeviceGet(lzt::get_default_zes_driver(), &deviceCount, nullptr)); | ||
|
|
||
| if (deviceCount < 2) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| if (deviceCount < 2) { | |
| if (deviceCount > 1) { |
This seems more intuitive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code checks if there are fewer than 2 devices available. If true, it skips the test rather than failing it.
Without this check, the test would likely crash or fail when trying to access a second GPU that doesn't exist.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did not question the need for this check. Just want to convey that the suggested change is more intuitive for a reader.
| GTEST_SKIP() | ||
| << "Multi-device PMT validation requires at least 2 Intel GPUs. " | ||
| << "Found: " << deviceCount << " device(s). " | ||
| << "This test validates PMT interface isolation across multiple GPUs."; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Avoid mention of PMT
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
noted
| LOG_INFO << " Multi-device environment validated: " << deviceCount | ||
| << " devices found"; | ||
| LOG_INFO << "Testing " << devices.size() | ||
| << " devices for PMT interface isolation"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here as well. Please search for this word and remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
conformance_tests/sysman/test_sysman_device/src/test_sysman_multi_device.cpp
Show resolved
Hide resolved
| for (auto tempHandle : tempHandles) { | ||
| double temperature = 0.0; | ||
| if (zesTemperatureGetState(tempHandle, &temperature) == ZE_RESULT_SUCCESS) { | ||
| deviceData.temperatures.push_back(temperature); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
| // Helper function to validate temperature data isolation between devices | ||
| void validateTemperatureDataIsolation( | ||
| const std::vector<MultiDeviceData> &deviceData) { | ||
| for (size_t i = 0; i < deviceData.size(); i++) { | ||
| for (size_t j = i + 1; j < deviceData.size(); j++) { | ||
|
|
||
| // Compare temperature readings | ||
| for (const auto &tempA : deviceData[i].temperatures) { | ||
| for (const auto &tempB : deviceData[j].temperatures) { | ||
|
|
||
| // Validate temperature ranges are realistic | ||
| EXPECT_GT(tempA, 0.0) << "Device " << deviceData[i].deviceIdentifier | ||
| << " has invalid temperature reading"; | ||
| EXPECT_LT(tempA, 150.0) << "Device " << deviceData[i].deviceIdentifier | ||
| << " temperature seems unrealistic (>150°C)"; | ||
|
|
||
| EXPECT_GT(tempB, 0.0) << "Device " << deviceData[j].deviceIdentifier | ||
| << " has invalid temperature reading"; | ||
| EXPECT_LT(tempB, 150.0) << "Device " << deviceData[j].deviceIdentifier | ||
| << " temperature seems unrealistic (>150°C)"; | ||
|
|
||
| // Note: Temperatures could be identical if both GPUs are idle, | ||
| // so we don't enforce uniqueness here, just realistic values | ||
| } | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
granted they can come similar maybe this could be made as a warning if it same instead of a failure?
I dont think we need to do the valid range check, since its already done in the temperature specific checks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
| LOG_DEBUG << "[DEBUG] Validating PCI statistics isolation..."; | ||
| validatePciDataIsolation(deviceData); | ||
| LOG_DEBUG << "[DEBUG] PCI statistics isolation validation complete"; | ||
|
|
||
| LOG_DEBUG << "[DEBUG] Step 3 complete: All isolation validations finished"; | ||
|
|
||
| // Step 4: Final validation summary | ||
| LOG_DEBUG << "[DEBUG] Step 4: Generating final validation summary"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These seem like not necessary comments, please remove
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed all the debug prints in the main test
| LOG_DEBUG << "[DEBUG] Step 1 complete: All device data collected"; | ||
|
|
||
| // Step 2: Validate PCI BDF uniqueness (PMT interface mapping validation) | ||
| LOG_DEBUG << "[DEBUG] Step 2: Validating PCI BDF uniqueness..."; | ||
| ASSERT_TRUE(validateUniquePciBdf(deviceData)) | ||
| << "PMT interface mapping validation failed - duplicate PCI BDF detected"; | ||
| LOG_DEBUG << "[DEBUG] Step 2 complete: PCI BDF uniqueness validated"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
debug is not needed to be mentioned since we are already using LOG_DEBUG calls.
Also remove usage of the term PMT
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
| for (const auto &bandwidthA : deviceData[i].memoryBandwidth) { | ||
| for (const auto &bandwidthB : deviceData[j].memoryBandwidth) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see you are avoiding the ordering issue using brute force here. This is fine, but could be optimized if you can generate the MultiDeviceData structure in an ordered manner. We can use module properties to understand the type of handle for each API and sort it based on that. All cards would return he same number of handles with similar handle types, so you always expect to have a 1:1 mapping.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done , taken care
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed nested loops in memory state validation - now uses 1:1 comparison with sorted data
Added 1:1 comparison for power and temperature validation
Added count verification - ensures devices have same number of modules/domains/sensors
Fixed incomplete main test - added missing validation calls
Better error messages - include module/domain/sensor index numbers
Consistent logging - added success indicators
5944178 to
8835856
Compare
a674d35 to
4d5f5e3
Compare
on multi card systems for each device(GPU) Related-To: VLCLJ-2646 Signed-off-by: Alle, Kalyan <kalyan.alle@intel.com>
on multi card systems for each device(GPU)
Related-To: VLCLJ-2646
Note: left the debug prints for testing purpose, will remove in the final code.
Brief Function Logic Explanations
Data Collection Functions
collectMemoryData()
Purpose: Collects memory telemetry from a device
Enumerates memory modules using zesDeviceEnumMemoryModules()
For each module: gets bandwidth counters (read/write) and memory state (free/used)
Stores in deviceData.memoryBandwidth and deviceData.memoryStates
Returns gracefully if no memory modules found
collectPowerData()
Purpose: Collects power consumption telemetry from a device
Enumerates power domains using zesDeviceEnumPowerDomains()
For each domain: gets energy counters using zesPowerGetEnergyCounter()
Stores in deviceData.powerEnergy
Returns gracefully if no power domains found
collectTemperatureData()
Purpose: Collects temperature readings from a device
Enumerates temperature sensors using zesDeviceEnumTemperatureSensors()
For each sensor: gets temperature value using zesTemperatureGetState()
Stores in deviceData.temperatures
Returns gracefully if no temperature sensors found
collectPciData()
Purpose: CRITICAL - Collects PCI info and creates unique device ID
Gets PCI properties using zesDevicePciGetProperties()
Creates BDF string: "bus:device:function" (e.g., "3:0:0")
Gets PCI traffic stats using zesDevicePciGetStats()
Returns false if PCI properties fail (test-critical failure)
Validation Functions
validateUniquePciBdf()
Purpose: CORE PMT VALIDATION - Ensures no duplicate PCI addresses
Uses std::set to detect duplicate BDF identifiers
Returns false if duplicate found → PMT mapping error detected
Most critical validation - proves each device has unique address
validateMemoryDataIsolation()
Purpose: Ensures memory counters differ between all device pairs
Double loop: Compares every device pair (i vs j where j > i)
Checks memory bandwidth: read/write counters must differ
Checks memory state: free memory should differ between devices
EXPECT_FALSE on identical data → detects PMT cross-contamination
validatePowerDataIsolation()
Purpose: Ensures power readings differ between all device pairs
Double loop: Compares every device pair
Checks energy counters: power consumption values must differ
EXPECT_FALSE on identical energy → detects shared power data
validateTemperatureDataIsolation()
Purpose: Validates temperature readings are realistic per device
Double loop: Validates each device's temperature range
Range check: 0°C < temperature < 150°C per device
No uniqueness requirement (idle GPUs may have similar temps)
Ensures PMT thermal interface is accessible
validatePciDataIsolation()
Purpose: CRITICAL - Validates PCI bus uniqueness and traffic isolation
EXPECT_NE on PCI bus numbers → different devices must be on different buses
Compares PCI traffic stats: RX/TX/packet counters must differ
EXPECT_FALSE on identical stats → detects PMT interface sharing
Core PMT mapping validation - validates the commit's fix