The current codebase uses pandas for data processing and relies on local or flat-file storage. As our data volume grows, performance has degraded, leading to slower analysis and longer runtimes.
Describe the solution you'd like
- Migrate data analysis code from pandas to polars, using polars syntax and idioms.
- Integrate MongoDB as the primary data source and sink, via pymongo or appropriate async libraries.
Describe alternatives you've considered
- Keeping pandas but optimizing with Dask or Vaex (still not as fast as Polars for most operations).
- Using a SQL database, but MongoDB offers more flexibility for semi-structured data.
Additional context
- Existing pandas code is located in src/data_analysis/.
- We rely on reading/writing large CSVs and DataFrames (often 1M+ rows).
- Please ensure all tests pass and update documentation/examples as needed.
Tasks Checklist
References