14 Jun 11:59

jeremymanning

ebe866d

v0.4.0 (June, 2025) Latest

🚀 High-Performance Polars Backend + Simplified Text API

🎯 Key Features

⚡ NEW: High-Performance Polars Backend (2-100x faster!)

- Dual DataFrame Support: Choose between pandas (default) or Polars backends
- Zero Code Changes: Add backend='polars' to any operation for instant speedups
- Comprehensive Coverage: All data types (arrays, text, files) work with both backends
- Smart Type Preservation: DataFrames maintain their type when no backend specified
- Global Configuration: Set default backend preference with set_dataframe_backend('polars')
- Cross-Backend Conversion: Seamlessly convert between pandas and Polars DataFrames

📊 Performance Gains with Polars

- Array Processing: 2-100x faster conversion for large datasets
- Text Embeddings: 3-10x faster document processing
- Memory Efficiency: 30-70% reduction in memory usage
- Parallel Processing: Built-in multi-core optimization
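
The ranges quoted above will vary with data size, dtype, and hardware, so it can be worth measuring on your own workload. A minimal timing sketch using only the standard library is shown here; `best_time` and `speedup` are hypothetical helper names introduced for illustration and are not part of datawrangler.

```python
import time


def best_time(fn, *args, repeats=3, **kwargs):
    """Return the fastest wall-clock time (in seconds) over `repeats` calls of fn."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args, **kwargs)
        timings.append(time.perf_counter() - start)
    return min(timings)


def speedup(baseline_fn, candidate_fn, repeats=3):
    """Ratio of baseline time to candidate time; values above 1 favor the candidate."""
    baseline = best_time(baseline_fn, repeats=repeats)
    # Guard against a zero reading from the timer on very fast calls.
    candidate = max(best_time(candidate_fn, repeats=repeats), 1e-12)
    return baseline / candidate


# Example usage (requires datawrangler and numpy, so it is left commented out):
# import datawrangler as dw
# import numpy as np
# data = np.random.rand(50000, 20)
# print(speedup(lambda: dw.wrangle(data),
#               lambda: dw.wrangle(data, backend='polars')))
```

Taking the minimum over a few repeats reduces noise from caching and background processes; a single run can be misleading for short operations.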

🎨 Simplified Text Model API (80% reduction in verbosity)

- Simple String Format: {'model': 'all-MiniLM-L6-v2'} now works everywhere
- Automatic Normalization: All model formats converted to unified dict internally
- List Support: Lists of models work with simplified format
- Full Backward Compatibility: All existing verbose syntax continues working

📋 Quick Start Examples

High-Performance Processing

import datawrangler as dw
import numpy as np

# Large dataset example
large_array = np.random.rand(50000, 20)

# Traditional pandas backend
pandas_df = dw.wrangle(large_array)  # Default

# High-performance Polars backend (2-100x faster!)
polars_df = dw.wrangle(large_array, backend='polars')

# Set global preference
from datawrangler.core.configurator import set_dataframe_backend
set_dataframe_backend('polars')  # All operations now use Polars

Simplified Text Processing

# Before v0.4.0 (verbose)
text_kwargs = {
    'model': {
        'model': 'all-MiniLM-L6-v2',
        'args': [],
        'kwargs': {}
    }
}

# After v0.4.0 (simplified!)
text_kwargs = {'model': 'all-MiniLM-L6-v2'}

# Works with Polars for 3-10x faster text processing
texts = ["first example document", "second example document"]  # illustrative input
fast_embeddings = dw.wrangle(texts, text_kwargs=text_kwargs, backend='polars')
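
The release notes say that every model format is converted to a unified dict internally. The sketch below shows one way such a normalization could work for the verbose and simplified forms above, and for lists of models. It is an illustration only: `normalize_model_spec` and `normalize_text_kwargs` are hypothetical names, and the library's actual implementation may differ.

```python
def normalize_model_spec(spec):
    """Convert a model spec (string, dict, or list of either) to the verbose dict form."""
    if isinstance(spec, str):
        # Simplified form: just the model name.
        return {"model": spec, "args": [], "kwargs": {}}
    if isinstance(spec, dict):
        if "model" not in spec:
            raise ValueError("model spec dict must contain a 'model' key")
        # Verbose form: fill in any missing 'args' or 'kwargs'.
        return {
            "model": spec["model"],
            "args": list(spec.get("args", [])),
            "kwargs": dict(spec.get("kwargs", {})),
        }
    if isinstance(spec, (list, tuple)):
        # Lists of models are normalized element by element.
        return [normalize_model_spec(item) for item in spec]
    raise TypeError(f"unsupported model spec type: {type(spec).__name__}")


def normalize_text_kwargs(text_kwargs):
    """Return a copy of text_kwargs whose 'model' entry is in the verbose form."""
    normalized = dict(text_kwargs)
    if "model" in normalized:
        normalized["model"] = normalize_model_spec(normalized["model"])
    return normalized
```

Under this sketch, the simplified `{'model': 'all-MiniLM-L6-v2'}` and the pre-v0.4.0 verbose dict normalize to the same structure, which is what lets both syntaxes coexist.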

🔧 Additional Improvements

- Google Colab Fix: Eliminated installation warning popup
- Cleaner Dependencies: Removed redundant configparser
- Enhanced Documentation: All examples updated for both backends
- API Consistency: Fixed all docstring examples to use public API

📈 When to Use Each Backend

- Use pandas for: Small datasets, complex index operations, maximum ecosystem compatibility
- Use Polars for: Large datasets, performance-critical applications, memory efficiency

🚀 Installation

pip install --upgrade pydata-wrangler

# For full ML capabilities including sentence-transformers
pip install --upgrade "pydata-wrangler[hf]"

🧪 Verified Quality

- ✅ All 45 tests passing
- ✅ Documentation builds successfully
- ✅ Full backward compatibility maintained
- ✅ Comprehensive API examples tested

This release maintains full backward compatibility while delivering significant performance improvements and API simplification. Upgrade today to experience the power of high-performance data wrangling!

13 Jun 15:07

jeremymanning

v0.3.0

27bb16d

v0.3.0 (June, 2025)

🎉 Major Release: NumPy 2.0+ Compatibility & Modern ML Libraries

This release brings full compatibility with NumPy 2.0+ and pandas 2.0+ while modernizing the text embedding infrastructure with sentence-transformers.

🚀 New Features

- Full NumPy 2.0+ and pandas 2.0+ compatibility
- Modern sentence-transformers integration for text embeddings
- Support for latest scikit-learn, matplotlib, and scipy versions
- Enhanced error handling for missing dependencies
- Updated Python support (3.9-3.12)

🔧 Breaking Changes

- Replaced Flair with sentence-transformers for text embeddings
- Removed gensim dependency (eliminates NumPy version conflicts)
- Updated text embedding API to use sentence-transformers models
- Dropped Python 3.6-3.8 support in favor of modern Python versions

🐛 Bug Fixes

- Fixed numpy.str_ deprecation that broke in NumPy 2.0+
- Updated HuggingFace datasets import for API changes
- Fixed sklearn model detection preventing incorrect sentence-transformers usage
- Fixed pandas iteritems deprecation for pandas 2.0+ compatibility
- Replaced deprecated matplotlib.pyplot.imread

📚 Documentation & Examples

- Updated all examples to use sentence-transformers syntax
- Modernized installation instructions and model references
- Comprehensive tutorial updates with new embedding approaches

🔄 Migration Guide

Old Flair syntax:
{'model': 'TransformerDocumentEmbeddings', 'args': ['bert-base-uncased']}

New sentence-transformers syntax:
{'model': 'all-mpnet-base-v2', 'args': [], 'kwargs': {}}
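
When migrating an existing codebase, it may help to flag model specs that still use the old Flair syntax. The sketch below relies on a naming heuristic (Flair embedding class names such as `TransformerDocumentEmbeddings` end in "Embeddings"); this heuristic and the helper names are assumptions made for illustration, not part of datawrangler.

```python
def looks_like_flair_spec(spec):
    """Heuristically detect an old Flair-style spec by its model class name."""
    model = spec.get("model") if isinstance(spec, dict) else spec
    return isinstance(model, str) and model.endswith("Embeddings")


def find_legacy_specs(specs):
    """Return the positions of specs that appear to use the old Flair syntax."""
    return [i for i, spec in enumerate(specs) if looks_like_flair_spec(spec)]
```

Each flagged spec then needs a manually chosen sentence-transformers model; there is no general one-to-one mapping from Flair embeddings to sentence-transformers models.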

🛠️ Technical Changes

- Sklearn models (CountVectorizer, etc.) now properly detected before sentence-transformers
- Enhanced model detection prevents accidental model misclassification
- Improved error messages for missing optional dependencies
- Full compatibility with modern scientific Python stack
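
The detection order described above (check for sklearn models first, then fall back to sentence-transformers) can be sketched as follows. The hard-coded set of names is an assumption made to keep the example self-contained; the library presumably inspects scikit-learn itself, and `classify_text_model` is a hypothetical helper.

```python
# Hypothetical registry for illustration only; these are real scikit-learn
# class names, but the library's actual lookup mechanism may differ.
SKLEARN_TEXT_MODELS = {
    "CountVectorizer",
    "TfidfVectorizer",
    "LatentDirichletAllocation",
    "NMF",
}


def classify_text_model(name):
    """Return which backend a model name would be routed to in this sketch."""
    # Check sklearn first so its model names are never mistaken for
    # sentence-transformers model identifiers.
    if name in SKLEARN_TEXT_MODELS:
        return "sklearn"
    return "sentence-transformers"
```

Checking the narrower, enumerable category first is what prevents misclassification: any string could in principle be a sentence-transformers model identifier, so that branch has to be the fallback.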

25 Jul 21:00

jeremymanning

v0.2.2

9e991a7

v0.2.2 (July, 2022)

v0.2.1: Bug fixes when hugging-face libraries aren't installed
v0.2.2: Better error handling when hugging-face libraries aren't installed and user asks to embed text using hugging-face models
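
A common pattern for this kind of error handling is to import optional dependencies lazily and raise an actionable message when they are missing. The sketch below illustrates that pattern; `require_optional` is a hypothetical helper, and the `[hf]` extra name is taken from the v0.4.0 installation instructions rather than from this release.

```python
import importlib


def require_optional(module_name, extra="hf"):
    """Import an optional dependency, or raise ImportError with an install hint."""
    try:
        return importlib.import_module(module_name)
    except ImportError as err:
        raise ImportError(
            f"'{module_name}' is required for this feature. "
            f'Install it with: pip install "pydata-wrangler[{extra}]"'
        ) from err
```

Calling such a helper only at the point where a hugging-face model is requested keeps the base install lightweight while still giving a clear message when text embedding is attempted without the extra packages.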

25 Jul 18:56

jeremymanning

v0.2.0

b5e060c

v0.2.0 (July, 2022)

- Adds CUDA (GPU) support for pytorch models
- Streamlines the package by not installing hugging-face (🤗) support by default
- Adds Python 3.10 support (and associated tests)
- Relaxes some tests to support a wider range of platforms (mostly this is relevant for GitHub CI)
- Relaxes requirements.txt versioning to improve compatibility with other libraries when installing via pip
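
For context on the CUDA item above, PyTorch code typically picks a device at runtime and falls back to the CPU when no GPU is available. The sketch below shows that general pattern; it is not datawrangler's own code, and `pick_device` is a hypothetical helper that also tolerates PyTorch being absent, since hugging-face support is optional as of this release.

```python
def pick_device():
    """Return 'cuda' when PyTorch reports a usable GPU, otherwise 'cpu'."""
    try:
        import torch
    except ImportError:
        # PyTorch is an optional dependency; without it there is no GPU path.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"
```

A model can then be moved to the selected device with PyTorch's standard `model.to(device)` call.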

09 Aug 19:28

jeremymanning

v0.1.7

80cebdb

v0.1.7 (August, 2021)

Updates model defaults to support more use cases

09 Aug 19:23

jeremymanning

v0.1.6

890d4f3

v0.1.6 (August, 2021)

Note: this version will be replaced by 0.1.7 shortly; tagging for archival purposes.

More fixes to dw.unstack.

09 Aug 17:05

jeremymanning

v0.1.5

0cd5baf

v0.1.5

Fixes a bug in dw.unstack

05 Aug 02:44

jeremymanning

v0.1.4

82160d3

v0.1.4 (August, 2021)

This seems to be an auspicious day for datawrangler releases! This minor version reflects a new flag added to apply_default_options that supports (optionally) supplying a customized "defaults" dictionary.

04 Aug 19:41

jeremymanning

v0.1.3

5ae28c7

v0.1.3 (August, 2021)

Fixed an annoying bug related to unstacking multi-index dataframes

04 Aug 15:02

jeremymanning

v0.1.2

de4552e

v0.1.2 (August, 2021)

Minor release:

- Fixes links to Khan Academy and NeurIPS corpora
- Better handling of .npy and .npz files (thanks @paxtonfitzpatrick!)

Contributors

paxtonfitzpatrick

Releases: ContextLab/data-wrangler