Skip to content

12 generalize downloader#13

Open
liellnima wants to merge 38 commits intomainfrom
12-generalize-downloader
Open

12 generalize downloader#13
liellnima wants to merge 38 commits intomainfrom
12-generalize-downloader

Conversation

@liellnima
Copy link
Collaborator

@liellnima liellnima commented Oct 26, 2024

I am creating a pull request, so the branches are not splitting up too much over time.

Current state:

  • removed some parts that were blocking generalizing. e.g. sticking to the selected_scenario file instead of the broader files. removed other files (or moved them somewhere else) if I realized we are not using them and don't need them in the future.
  • simplified the constant files: simple model list instead of dictionary. generalized to one node-link per project.
  • extended options: we can use all desired models for cmip6 now
  • extended scenarios: we can use all desired scenarios/experiments for cmip6 now
  • extended projects: first naive approach to extend the downloader to different projects. This will need some major rewriting though, which is why I stopped this and would like to do this in a separate PR. There is a first naive implementation. I think we need an abstract Downlaoder, and then children for the different projects (input4mips, cmip6, cmip7). This way the user can implement Downloader classes for the projects we are not supporting.
  • max_ensemble number of ensemble_members: this was already supported by the previous code and works now as well.
  • removed some other things that we don't need to simplify the code
  • implemented some failure/breaking points where needed (because code is otherwise downloading stuff the user doesn't need). still needs to be done for the input4mips part as well.
  • each project gets its own constant files (not sure this is elegant). this way we can load the relevant constants for this project and a separate node link (necessary!). each project might have different models / variables or e.g. no models at all (see input4mips)
  • cmip7 (or cmip6plus) config is there, and I tested it - unclear if it's an server issue / pyesgf issue / or issue on our side
  • updated the yaml downloader configs accordingly to the named changes (tag "project" - if ommited it should default to cmip6)

The other future_cases are not working yet and have not been implemented yet. Please ignore those.

Next immediate steps: (Input4Mips handling)

  • input4mips: rewrite some funcs, such as generate_variables
  • split the raw / model vars in the cmip6 constant files for input4mips vs cmip6
  • get input4mips configs running as expected
  • remove default downloader behaviour

Other next steps:

  • Rewrite downloader class, so that the project types are handled in a clean way. Each project needs its own downloading function, different handling for special variables if desired, and init_globs (or sth like that) imo.
  • grid handling: we should get rid of the explicit grid mentions. first of all, different variables will need different grids (oceans need gr instead of gn e.g.). secondly, each variable should only have one grid available, i.e. we want to retrieve it automatically. This is a potential failure point for downloading certain data.
  • extend configs for testing the downloader so each model component is tested once (not just atmosphere and sea-ice as it is now)
  • think about reducing logger output / terminal output. Is it too much?

Backlog:

  • time-window (we can calmly ignore that for now)

@liellnima liellnima linked an issue Oct 26, 2024 that may be closed by this pull request
@liellnima
Copy link
Collaborator Author

I haven't run the tests yet and I can already see that the linters are failing here (it worked locally for me though).

I just wanted to make sure that you have access to the latest version and can look into the changes. And I think it would be good to merge current changes before I continue adapting the downloader further.

With those changes all my cmip6 test cases are now working (6/6), so that's already a big step :D

@f-PLT f-PLT self-requested a review November 5, 2024 15:55
@f-PLT
Copy link
Collaborator

f-PLT commented Feb 27, 2025

@liellnima

Pushed my updates for review/testing.

  • Updated download configs
  • Added config classes that are used by the different downloaders
    • These config classes handle some of the validating and preparations
  • Fixed the downloader tests
  • Added an example script, scripts/download_example.py

Let me know what you thinks and if you find any bugs/typos of problems!

@liellnima
Copy link
Collaborator Author

Code looks all good, thank you so much! Here a summary of my tests:

Data can be retrieved from:

  • Input4MIPS (disclaimer: it downloads ALL the data, scenario specifier does not work rn)
    • CO2
    • CH4
    • SO2
    • BC
  • CMIP6
    • AWI sea ice
    • CanESM CO2 internal variable
    • FGOALS temperature ssp --> https request is not going through, I think the server is down
    • NorESM precipiation historical
    • NorESM temperature ssp
    • UKESM temperature piControl

I will add a commit to fix a typo in the CO2 configs, and update the configs for future reference of issues listed here.


Collecting the issues I have found, that will be separated out into issues though (and delete them from this comment here).

Issue 1: (highest priority) bc_historical.yaml is downloading all SSP scenario data (+historical data) instead of only historical data. The same is happening for bc_ssp.yaml (all instead of just the one ssp scenario). The same is happening for the CH4 config files. Same for the SO2 configs. Same for CO2.

Issue 2: Make sure failed data retrieval (Result len 0; or no overlap between requested and avail ensemble members) is not failing silently

  • Colored / adapted warning of current "WARNING: no overlap between available and desired ensemble members!"
  • Colored warning when "Result len 0" happens (especially when it looks like a match was found before) with the suggestion to try different ensemble members / scenarios / versions). Maybe we can even return an error code from esgf for more information?

Issue 3: Result len 0, when all inputs say it should actually be available?

  • Double check: when it's result len 0 - are we looping over the different available versions to retrieve alternative data? What’s going on here?

Issue 4: Try out the abstract downloader classes by setting it up for omip and cmip6plus (looking at the code, it should be really easy now!)

Issue 5: Far in the future: Add year span in downloader.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Generalize Downloader

2 participants