Releases: ContextLab/data-wrangler

v0.4.0 (June, 2025)

14 Jun 11:59

🚀 High-Performance Polars Backend + Simplified Text API

🎯 Key Features

⚡ NEW: High-Performance Polars Backend (2-100x faster!)

  • Dual DataFrame Support: Choose between pandas (default) or Polars backends
  • Minimal Code Changes: Add backend='polars' to any operation to opt in to the Polars backend
  • Comprehensive Coverage: All data types (arrays, text, files) work with both backends
  • Smart Type Preservation: DataFrames keep their type (pandas or Polars) when no backend is specified
  • Global Configuration: Set default backend preference with set_dataframe_backend('polars')
  • Cross-Backend Conversion: Seamlessly convert between pandas and Polars DataFrames
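
One way to read the rules above (explicit backend argument, type preservation, global default) is as a precedence order. The sketch below is an assumption about how they might combine, not the library's actual implementation; `resolve_backend` and its parameter names are hypothetical, and the choice to let the input's type win over the global default is a guess.

```python
from typing import Optional

VALID_BACKENDS = ("pandas", "polars")


def resolve_backend(explicit: Optional[str] = None,
                    input_backend: Optional[str] = None,
                    global_default: str = "pandas") -> str:
    """Pick a backend: explicit argument, then input type, then global default.

    Hypothetical helper; the precedence order is an assumption.
    """
    for candidate in (explicit, input_backend, global_default):
        if candidate is not None:
            if candidate not in VALID_BACKENDS:
                raise ValueError(f"unknown backend: {candidate!r}")
            return candidate
    raise ValueError("no backend available")
```

Under this reading, a Polars DataFrame passed without `backend=` stays Polars, while passing `backend='pandas'` converts it.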

📊 Performance Gains with Polars

  • Array Processing: 2-100x faster conversion for large datasets
  • Text Embeddings: 3-10x faster document processing
  • Memory Efficiency: 30-70% reduction in memory usage
  • Parallel Processing: Built-in multi-core optimization
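
The ranges above will vary with data size and hardware, so it may be worth measuring on your own data. The following is a minimal timing sketch using only the standard library; `best_time` and `speedup` are hypothetical helper names, not part of datawrangler.

```python
import time
from typing import Callable


def best_time(fn: Callable[[], object], repeats: int = 5) -> float:
    """Return the fastest wall-clock time (in seconds) over `repeats` calls."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds
```

For example, `speedup(best_time(lambda: dw.wrangle(large_array)), best_time(lambda: dw.wrangle(large_array, backend='polars')))` would compare the two backends on the same array.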

🎨 Simplified Text Model API (80% reduction in verbosity)

  • Simple String Format: {'model': 'all-MiniLM-L6-v2'} now works everywhere
  • Automatic Normalization: All model formats converted to unified dict internally
  • List Support: Lists of models work with simplified format
  • Full Backward Compatibility: All existing verbose syntax continues working
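
The automatic normalization described above can be pictured as a small function that expands each accepted format into the verbose dict. This is a sketch under assumptions, not the library's internal code; `normalize_model` is a hypothetical name.

```python
from typing import Any, Dict, List, Union

ModelSpec = Union[str, Dict[str, Any]]


def normalize_model(spec: Union[ModelSpec, List[ModelSpec]]):
    """Expand a string, a dict, or a list of either into the verbose dict form."""
    if isinstance(spec, list):
        # Lists are normalized element by element.
        return [normalize_model(item) for item in spec]
    if isinstance(spec, str):
        # A bare model name gets empty args and kwargs.
        return {"model": spec, "args": [], "kwargs": {}}
    if isinstance(spec, dict):
        if "model" not in spec:
            raise ValueError("model dict needs a 'model' key")
        # Fill in whichever of 'args' / 'kwargs' the caller left out.
        return {
            "model": spec["model"],
            "args": list(spec.get("args", [])),
            "kwargs": dict(spec.get("kwargs", {})),
        }
    raise TypeError(f"unsupported model spec: {type(spec).__name__}")
```

Because the verbose form passes through unchanged, existing code keeps working while shorter spellings resolve to the same dict.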

📋 Quick Start Examples

High-Performance Processing

import datawrangler as dw
import numpy as np

# Large dataset example
large_array = np.random.rand(50000, 20)

# Traditional pandas backend
pandas_df = dw.wrangle(large_array)  # Default

# High-performance Polars backend (2-100x faster!)
polars_df = dw.wrangle(large_array, backend='polars')

# Set global preference
from datawrangler.core.configurator import set_dataframe_backend
set_dataframe_backend('polars')  # All operations now use Polars

Simplified Text Processing

# Before v0.4.0 (verbose)
text_kwargs = {
    'model': {
        'model': 'all-MiniLM-L6-v2',
        'args': [],
        'kwargs': {}
    }
}

# After v0.4.0 (simplified!)
text_kwargs = {'model': 'all-MiniLM-L6-v2'}

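# Works with Polars for 3-10x faster text processing
# (texts can be any list of strings; these two are placeholders)
texts = ['The quick brown fox.', 'Jumps over the lazy dog.']
fast_embeddings = dw.wrangle(texts, text_kwargs=text_kwargs, backend='polars')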

🔧 Additional Improvements

- Google Colab Fix: Eliminated installation warning popup
- Cleaner Dependencies: Removed redundant configparser
- Enhanced Documentation: All examples updated for both backends
- API Consistency: Fixed all docstring examples to use public API

📈 When to Use Each Backend

- Use pandas for: Small datasets, complex index operations, maximum ecosystem compatibility
- Use Polars for: Large datasets, performance-critical applications, memory efficiency

🚀 Installation

pip install --upgrade pydata-wrangler

# For full ML capabilities including sentence-transformers
pip install --upgrade "pydata-wrangler[hf]"
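
After upgrading, it can help to confirm which version is actually installed. Here is a minimal sketch using the standard library's `importlib.metadata`; `installed_version` is a hypothetical helper name, and the distribution name is taken from the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str = "pydata-wrangler") -> Optional[str]:
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


print(installed_version())
```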

🧪 Verified Quality

- All 45 tests passing
- Documentation builds successfully
- Full backward compatibility maintained
- Comprehensive API examples tested

This release maintains full backward compatibility while delivering significant performance improvements and API simplification. Upgrade today to experience the power of high-performance data wrangling!

v0.3.0 (June, 2025)

13 Jun 15:07

🎉 Major Release: NumPy 2.0+ Compatibility & Modern ML Libraries

This release brings full compatibility with NumPy 2.0+ and pandas 2.0+ while modernizing the text embedding infrastructure with sentence-transformers.

🚀 New Features

  • Full NumPy 2.0+ and pandas 2.0+ compatibility
  • Modern sentence-transformers integration for text embeddings
  • Support for latest scikit-learn, matplotlib, and scipy versions
  • Enhanced error handling for missing dependencies
  • Updated Python support (3.9-3.12)
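
Error handling for missing dependencies usually means importing an optional package lazily and raising a message that tells the user how to install it. The sketch below illustrates that general pattern only; `require_optional` is a hypothetical helper, and using the `hf` extra in the hint is an assumption based on the install command published for this package.

```python
import importlib


def require_optional(module_name: str, extra: str = "hf"):
    """Import an optional module, or raise ImportError with an install hint."""
    try:
        return importlib.import_module(module_name)
    except ImportError as err:
        raise ImportError(
            f"'{module_name}' is required for this feature. "
            f'Install it with: pip install "pydata-wrangler[{extra}]"'
        ) from err
```

Calling `require_optional("sentence_transformers")` only at the point where text embedding is requested keeps the base install usable without the heavier ML dependencies.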

🔧 Breaking Changes

  • Replaced Flair with sentence-transformers for text embeddings
  • Removed gensim dependency (eliminates NumPy version conflicts)
  • Updated text embedding API to use sentence-transformers models
  • Dropped Python 3.6-3.8 support in favor of modern Python versions

🐛 Bug Fixes

  • Fixed numpy.str_ deprecation that broke in NumPy 2.0+
  • Updated HuggingFace datasets import for API changes
  • Fixed sklearn model detection so that scikit-learn models are no longer mistakenly handled as sentence-transformers models
  • Fixed pandas iteritems deprecation for pandas 2.0+ compatibility
  • Replaced deprecated matplotlib.pyplot.imread

📚 Documentation & Examples

  • Updated all examples to use sentence-transformers syntax
  • Modernized installation instructions and model references
  • Comprehensive tutorial updates with new embedding approaches

🔄 Migration Guide

Old Flair syntax:
{'model': 'TransformerDocumentEmbeddings', 'args': ['bert-base-uncased']}

New sentence-transformers syntax:
{'model': 'all-mpnet-base-v2', 'args': [], 'kwargs': {}}

🛠️ Technical Changes

  • Sklearn models (CountVectorizer, etc.) now properly detected before sentence-transformers
  • Enhanced model detection prevents accidental model misclassification
  • Improved error messages for missing optional dependencies
  • Full compatibility with modern scientific Python stack
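
The detection order described above (scikit-learn first, sentence-transformers as the fallback) can be sketched as follows. This is an illustration under assumptions rather than the package's real logic: `classify_model` is a hypothetical name, and the set of scikit-learn class names is a small hand-picked subset.

```python
from typing import Collection

# Assumed, illustrative subset of scikit-learn models used for text.
SKLEARN_TEXT_MODELS = frozenset(
    {"CountVectorizer", "TfidfVectorizer", "LatentDirichletAllocation"}
)


def classify_model(name: str,
                   sklearn_names: Collection[str] = SKLEARN_TEXT_MODELS) -> str:
    """Check scikit-learn names first; anything else falls through."""
    if name in sklearn_names:
        return "sklearn"
    # Only names that are not scikit-learn models reach this point.
    return "sentence-transformers"
```

Checking the narrower category first is what prevents a name like `CountVectorizer` from being passed to sentence-transformers as if it were a pretrained model.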

v0.2.2 (July, 2022)

25 Jul 21:00

  • v0.2.1: Bug fixes for when the Hugging Face libraries aren't installed
  • v0.2.2: Better error handling when the Hugging Face libraries aren't installed and the user asks to embed text using Hugging Face models

v0.2.0 (July, 2022)

25 Jul 18:56

  • Adds CUDA (GPU) support for PyTorch models
  • Streamlines the package by not installing Hugging Face (🤗) support by default
  • Adds Python 3.10 support (and associated tests)
  • Relaxes some tests to support a wider range of platforms (mostly this is relevant for GitHub CI)
  • Relaxes requirements.txt versioning to improve compatibility with other libraries when installing via pip

v0.1.7 (August, 2021)

09 Aug 19:28

Updates model defaults to support more use cases

v0.1.6 (August, 2021)

09 Aug 19:23

Note: this version will be replaced by 0.1.7 shortly; tagging for archival purposes.

More fixes to dw.unstack.

v0.1.5 (August, 2021)

09 Aug 17:05

Fixes a bug in dw.unstack

v0.1.4 (August, 2021)

05 Aug 02:44

This seems to be an auspicious day for datawrangler releases! This minor version reflects a new flag added to apply_default_options that supports (optionally) supplying a customized "defaults" dictionary.
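
The idea of an optional custom "defaults" dictionary can be sketched as a merge in which caller-supplied options override defaults, and the defaults themselves can be swapped out. This is not the signature of apply_default_options; `with_defaults` and the values in `PACKAGE_DEFAULTS` are made up for illustration.

```python
from typing import Any, Dict, Optional

# Made-up stand-in for the package-wide defaults.
PACKAGE_DEFAULTS: Dict[str, Any] = {"n_components": 50, "lowercase": True}


def with_defaults(kwargs: Dict[str, Any],
                  defaults: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Merge caller options over the defaults, optionally using custom defaults."""
    base = PACKAGE_DEFAULTS if defaults is None else defaults
    merged = dict(base)    # copy so the defaults are never mutated
    merged.update(kwargs)  # explicit options win over defaults
    return merged
```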

v0.1.3 (August, 2021)

04 Aug 19:41

Fixed an annoying bug related to unstacking multi-index dataframes

v0.1.2 (August, 2021)

04 Aug 15:02

Minor release:

  • Fixes links to Khan Academy and NeurIPS corpora
  • Better handling of .npy and .npz files (thanks @paxtonfitzpatrick!)