Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
Greptile Summary

This PR introduces a thread-pool abstraction layer: it extracts a ThreadPool interface, renames the existing implementation to OldThreadPool, and adds a NewThreadPool together with a facade that delegates to it.

Key issues found:
Confidence Score: 2/5
Class Diagram

%%{init: {'theme': 'neutral'}}%%
classDiagram
class ThisThreadIdx {
+this_thread_idx() int
}
class ThreadPool {
<<interface>>
+AddWork(void() work, int64 priority)
+AddWork(void(int) work, int64 priority)
+RunAll(bool wait)
+WaitForWork()
+NumThreads() int
+GetThreadIds() vector~thread_id~
}
class ThreadPoolBase {
+Init(int n, on_start_fn)
+NumThreads() int
+GetThreadIds() vector~thread_id~
+this_thread_idx() int
}
class OldThreadPool {
-vector~thread~ threads_
-priority_queue work_queue_
+AddWork(void() work, int64 priority)
+AddWork(void(int) work, int64 priority)
+RunAll(bool wait)
+WaitForWork()
+NumThreads() int
+GetThreadIds() vector~thread_id~
}
class NewThreadPool {
-optional~int~ device_id_
-string name_
-NvmlInstance nvml_handle_
+NewThreadPool(int n, optional~int~ device_id, bool affinity, string name)
-OnThreadStart(int idx, bool affinity) any
}
class ThreadPoolFacade {
-ThreadPoolBase* tp_
-optional~Job~ job_
+AddWork(void() work, int64 priority)
+AddWork(void(int) work, int64 priority)
+RunAll(bool wait)
+WaitForWork()
+NumThreads() int
+GetThreadIds() vector~thread_id~
}
ThisThreadIdx <|-- ThreadPool
ThreadPool <|-- OldThreadPool
ThreadPool <|-- ThreadPoolFacade
ThreadPoolBase <|-- NewThreadPool
ThreadPoolFacade --> ThreadPoolBase : delegates to (tp_*)
NewThreadPool::NewThreadPool(
    int num_threads,
    std::optional<int> device_id,
    bool set_affinity,
    std::string name)
  : name_(name) {
  if (device_id.has_value() && *device_id == CPU_ONLY_DEVICE_ID)
    device_id = std::nullopt;
#if NVML_ENABLED
  // We use NVML only for setting thread affinity
  if (device_id.has_value() && set_affinity) {
    nvml_handle_ = nvml::NvmlInstance::CreateNvmlInstance();
  }
#endif
  Init(num_threads, [=, this](int thread_idx) {
    return OnThreadStart(thread_idx, set_affinity);
  });
}
device_id_ never assigned — CUDA device context silently lost
The constructor parameter device_id is modified (nullified for CPU_ONLY_DEVICE_ID) and used to conditionally initialize nvml_handle_, but it is never stored into the member device_id_. Because device_id_ defaults to std::nullopt, OnThreadStart at line 49 will never create the DeviceGuard, meaning threads spawned by NewThreadPool will never bind to the requested CUDA device.
NewThreadPool::NewThreadPool(
    int num_threads,
    std::optional<int> device_id,
    bool set_affinity,
    std::string name)
  : name_(name) {
  if (device_id.has_value() && *device_id == CPU_ONLY_DEVICE_ID)
    device_id = std::nullopt;
  device_id_ = device_id;  // ← add this line
#if NVML_ENABLED
  ...

Without this assignment, any GPU-bound pipeline using the new thread pool will run operator CPU threads without a current CUDA context, which will cause silent failures or hard crashes on any CUDA API call made from those threads.
if (UseNewThreadPool()) {
  std::cerr << "\n!!! Forced use of NewThreadPool !!!" << std::endl;
  if (graph_info_.num_cpu > 0) {
Debug std::cerr left in production code
This diagnostic print was presumably added during development but should not be shipped:
Suggested change:

  if (UseNewThreadPool()) {
-   std::cerr << "\n!!! Forced use of NewThreadPool !!!" << std::endl;
    if (graph_info_.num_cpu > 0) {
Remove the std::cerr line entirely — it will spam every user's stderr whenever they set DALI_USE_NEW_THREAD_POOL=1.
taskset --cpu-list 0-71 python hw_decoder_bench.py --width_hint 6000 --height_hint 6000 -b 408 -d 0 -g gpu -w 100 -t 100000 -i ${DALI_EXTRA_PATH}/db/single/jpeg -p rn50 -j 72 --hw_load 0.11 --experimental_decoder | tee ${LOG2_TP}
DALI_USE_NEW_THREAD_POOL=1 taskset --cpu-list 0-71 python hw_decoder_bench.py --width_hint 6000 --height_hint 6000 -b 408 -d 0 -g gpu -w 100 -t 100000 -i ${DALI_EXTRA_PATH}/db/single/jpeg -p rn50 -j 72 --hw_load 0.11 | tee ${LOG1}
DALI_USE_NEW_THREAD_POOL=1 taskset --cpu-list 0-71 python hw_decoder_bench.py --width_hint 6000 --height_hint 6000 -b 408 -d 0 -g gpu -w 100 -t 100000 -i ${DALI_EXTRA_PATH}/db/single/jpeg -p rn50 -j 72 --hw_load 0.11 --experimental_decoder | tee ${LOG2_TP}
GraceHopper branch: log-file naming errors corrupt baseline data
In the else (GraceHopper) branch, three separate bugs are introduced together:
- Line 44 — the non-new-tp experimental_decoder run writes to `${LOG2_TP}` instead of `${LOG2}`.
- Line 45 — the new-tp legacy run writes to `${LOG1}`, overwriting the baseline measurement captured on the line above it.
- Line 46 — the new-tp experimental_decoder run also writes to `${LOG2_TP}`, overwriting the result from line 44.
End state: LOG1 holds new-tp data (not baseline), LOG2 is never written, and LOG2_TP holds new-tp data (not old-tp). Downstream, PERF_RESULT3 (perf_check "${LOG2}" …) and PERF_RESULT2_TP (perf_check "${LOG2_TP}" "$(extract_perf "${LOG2}")" 2) will both silently evaluate against stale/empty values, making the performance gate meaningless for GraceHopper.
The Hopper branch (lines 29–34) follows the correct pattern and should be mirrored:
taskset --cpu-list 0-71 python hw_decoder_bench.py ... | tee ${LOG1}
taskset --cpu-list 0-71 python hw_decoder_bench.py ... --experimental_decoder | tee ${LOG2}
DALI_USE_NEW_THREAD_POOL=1 taskset --cpu-list 0-71 python hw_decoder_bench.py ... | tee ${LOG1_TP}
DALI_USE_NEW_THREAD_POOL=1 taskset --cpu-list 0-71 python hw_decoder_bench.py ... --experimental_decoder | tee ${LOG2_TP}

PERF_RESULT1_TP=$(perf_check "${LOG1_TP}" "$MIN_PERF")
PERF_RESULT2_TP=$(perf_check "${LOG2_TP}" "$MIN_PERF2")
Redundant first assignments to PERF_RESULT1_TP / PERF_RESULT2_TP
PERF_RESULT1_TP and PERF_RESULT2_TP are assigned here (absolute-minimum check) and then unconditionally reassigned to relative-comparison results on lines 93–94. The values computed on lines 87–88 are never read and should be removed to avoid confusion.
Suggested change:

- PERF_RESULT1_TP=$(perf_check "${LOG1_TP}" "$MIN_PERF")
- PERF_RESULT2_TP=$(perf_check "${LOG2_TP}" "$MIN_PERF2")
  PERF_RESULT1_NDD=$(perf_check "${LOG1_NDD}" "$MIN_PERF_NDD")
echo "PERF_RESULT1=${PERF_RESULT1_TP}"
echo "PERF_RESULT2=${PERF_RESULT2_TP}"
Wrong echo labels for new-thread-pool results
Both lines print PERF_RESULT1= and PERF_RESULT2= rather than PERF_RESULT1_TP= and PERF_RESULT2_TP=. The CI log will show duplicate keys, making it impossible to distinguish old- vs new-thread-pool results at a glance.
Suggested change:

- echo "PERF_RESULT1=${PERF_RESULT1_TP}"
- echo "PERF_RESULT2=${PERF_RESULT2_TP}"
+ echo "PERF_RESULT1_TP=${PERF_RESULT1_TP}"
+ echo "PERF_RESULT2_TP=${PERF_RESULT2_TP}"
Category:
Refactoring (Redesign of existing code that doesn't affect functionality)
Description:
This change does the following:
- Extracts an interface from ThreadPool (now called ThreadPool)
- Renames the old ThreadPool to OldThreadPool
- Adds NewThreadPool and a ThreadPool Facade that aggregates a pointer to a NewThreadPool and a Job object

Additional information:
Affected modules and functionalities:
Key points relevant for the review:
Tests:
New qa tests script
Checklist
Documentation
DALI team only
Requirements
REQ IDs: N/A
JIRA TASK: N/A