I'm opening this issue to keep track of the work needed to port the content of PR #996 to the main branch.
The idea is to split that PR (which is huge and based on a quite old version of the codebase) and, starting from the current state of the main branch, port its main elements through a series of smaller PRs.
I'll keep this issue updated as I work on this.
Many changes are not strictly related to supporting distributed training but may benefit Avalanche in general.
- I'm starting with porting the modernized object detection/segmentation dataset, strategies, and metrics. I'll also port the generalized batch collate functionality.
Changes in Distributed Training PR #996:
Legend:
- 🔲 Not ported
- ⌛ Work in progress
- 💬 PR opened, discussion in progress
- ✔️ Merged into main branch
Base elements

Strategy and plugins
- `supports_distributed` plugin field (Add base elements to support distributed comms. Add supports_distributed plugin flag. #1370)
- `_distributed_check` strategy field and related `_check_distributed_training_compatibility()` check (#1370)
- `wrap_distributed_model` strategy lifecycle method. Called from `..._observation.py`
- `_obtain_common_dataloader_parameters` strategy method (unrelated to distributed training) (#1370)
- Logic in `_backward()`, `_forward()`, ..., while wrapping happens in `backward`, `forward`. Wrapper methods should be final, but Python is not strict on this (flexibility).

Models
- `avalanche_forward`: generalize using `is_multi_task_module` to consider DDP wrapping (#1370)

Detection

Data Loader

Loggers and metrics
- `evaluator` constructor parameter: `evaluator=default_evaluator()` -> `evaluator=default_evaluator` (#1370)
- Change the `evaluator` parameter value to use a factory (#1370)

Unit tests
- 🔲 Called in both environment-update and unit-test actions
Typing
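For context on the `supports_distributed` / `_check_distributed_training_compatibility()` items above, the intended pattern can be sketched roughly as follows. The field and method names come from the checklist; the class layout and the failure behavior are my assumption, not Avalanche's actual implementation:

```python
# Hypothetical sketch of a supports_distributed compatibility check.
# Only the names supports_distributed, _distributed_check, and
# _check_distributed_training_compatibility come from the issue text;
# everything else is illustrative.

class BasePlugin:
    # Plugins must opt in explicitly; unported plugins default to False.
    supports_distributed = False


class SyncedPlugin(BasePlugin):
    supports_distributed = True


class BaseStrategy:
    def __init__(self, plugins, distributed=False):
        self.plugins = plugins
        self._distributed_check = distributed
        if self._distributed_check:
            self._check_distributed_training_compatibility()

    def _check_distributed_training_compatibility(self):
        # Fail fast if any plugin has not been adapted for distributed runs.
        unsupported = [
            type(p).__name__
            for p in self.plugins
            if not getattr(p, "supports_distributed", False)
        ]
        if unsupported:
            raise RuntimeError(
                "Plugins not supporting distributed training: "
                + ", ".join(unsupported)
            )


strategy = BaseStrategy([SyncedPlugin()], distributed=True)  # passes the check
```

The point of the flag is that an unported plugin fails loudly at construction time instead of silently producing divergent state across processes.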
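The `backward`/`_backward` split mentioned under "Strategy and plugins" is a template-method arrangement: the public method is the wrapper and subclasses override only the underscore hook. A minimal sketch of that shape (the wrapper's actual distributed bookkeeping is omitted; method names follow the checklist, the rest is invented for the example):

```python
# Illustrative template-method split, as described in the checklist:
# the public wrapper should be treated as final, subclasses override
# only the underscore hook. Python cannot enforce "final", which is
# the flexibility trade-off the item mentions.

class Strategy:
    def forward(self):
        # Public wrapper: the single place for cross-cutting concerns
        # (in the PR's case, distributed model wrapping/synchronization).
        result = self._forward()
        return result

    def _forward(self):
        # Hook that subclasses are meant to override.
        return "base-forward"


class MyStrategy(Strategy):
    def _forward(self):
        return "custom-forward"


assert MyStrategy().forward() == "custom-forward"
```

Keeping the wrapper untouched means distributed logic added later to `forward` automatically applies to every subclass.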
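The `evaluator=default_evaluator()` -> `evaluator=default_evaluator` change under "Loggers and metrics" swaps a pre-built default instance for a factory. A minimal sketch of why that matters (the `Evaluator`/`Strategy` classes here are hypothetical; only the `default_evaluator` name comes from the checklist): a default evaluated once at function definition would be shared by every strategy, so passing the callable and invoking it per instance avoids hidden shared state.

```python
# Why a factory default beats a shared-instance default.
# Hypothetical minimal classes; not Avalanche's real code.

class Evaluator:
    def __init__(self):
        self.metrics = []


def default_evaluator():
    return Evaluator()


class Strategy:
    def __init__(self, evaluator=default_evaluator):
        # Accept a factory (preferred) or an already-built instance.
        self.evaluator = evaluator() if callable(evaluator) else evaluator


a, b = Strategy(), Strategy()
a.evaluator.metrics.append("acc")
# Each strategy gets its own evaluator; no hidden sharing.
assert b.evaluator.metrics == []
```

With the old `evaluator=default_evaluator()` signature, both `a` and `b` would have received the same `Evaluator` object and the appended metric would have leaked across strategies.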