🚀 Feature Request: Migrate Codebase from Pandas to Polars & Integrate MongoDB for Faster Analysis #10

@Shaadalam9

Description

The current codebase uses pandas for data processing and reads from and writes to local flat files. As our data volume has grown, analysis runtimes have become noticeably longer.


Describe the solution you'd like

  • Migrate data analysis code from pandas to polars:

    • Polars runs DataFrame operations multi-threaded and offers a lazy API with query optimization, which is typically faster than pandas on large datasets.
    • Update all scripts, modules, and notebooks to use polars syntax and idioms.
    • Ensure output and results remain consistent.
  • Integrate MongoDB as the primary data source and sink:

    • Move relevant data storage/loading from CSV/Parquet/Excel to MongoDB collections.
    • Refactor data ingestion/extraction logic to use pymongo or appropriate async libraries.
    • Benchmark performance improvements for common analysis tasks.

Describe alternatives you've considered

  • Keeping pandas but optimizing with Dask or Vaex (expected to still be slower than Polars for most of our operations).
  • Using a SQL database, but MongoDB offers more flexibility for semi-structured data.

Additional context

  • Existing pandas code is located in: src/data_analysis/
  • We rely on reading/writing large CSVs and DataFrames (often 1M+ rows).
  • Please ensure all tests pass and update documentation/examples as needed.

Tasks Checklist

  • Inventory all pandas usages and data-loading code
  • Convert scripts and modules to use polars
  • Replace local file I/O with MongoDB queries where appropriate
  • Add/modify tests to cover new code paths
  • Update README and any usage docs
  • Provide before/after benchmarks (runtime, memory usage)
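
For the before/after benchmarks, a simple harness along these lines might be enough to start with. It is only a sketch using the standard library, and `benchmark` and `BenchResult` are hypothetical names. One important caveat: `tracemalloc` only sees allocations made through Python's allocator, so memory held in Polars' native buffers will be underreported, and process-level measurements (for example peak RSS from an external profiler) would be needed for a fair memory comparison.

```python
import time
import tracemalloc
from dataclasses import dataclass
from typing import Callable


@dataclass
class BenchResult:
    label: str
    best_seconds: float      # fastest of the timed runs
    peak_python_bytes: int   # peak Python-level allocation in one traced run


def benchmark(label: str, func: Callable[[], object], repeats: int = 5) -> BenchResult:
    """Time func several times, then trace Python allocations in a separate run."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)

    # Trace memory in its own run so tracing overhead does not skew the timings.
    tracemalloc.start()
    try:
        func()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()

    return BenchResult(label, min(timings), peak)


if __name__ == "__main__":
    # Placeholder workload; swap in the pandas and Polars versions of a task.
    result = benchmark("sum of squares", lambda: sum(i * i for i in range(100_000)))
    print(f"{result.label}: {result.best_seconds:.4f} s, "
          f"{result.peak_python_bytes / 1024:.1f} KiB peak (Python allocations only)")
```

Reporting the minimum of several runs reduces noise from other processes. Each benchmarked task should be run on the same input data for both the pandas and the Polars implementation.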

Labels

enhancement (New feature or request)
