bed_to_regions silently drops strand when polars Categorical #152

@bschilder

Summary

_dataset/_utils.py::bed_to_regions checks bed.schema.get('strand', None) == pl.Utf8 before mapping +/- to 1/-1. When strand is a Categorical column (which is what gvl.write stores when the input BED has a repeated strand vocabulary, a very common case), the check returns False, and the fallback branch cols.append(pl.col('strand')) appends the strand column as-is, without mapping it to integers.

.to_numpy() on the resulting polars frame then produces an array with dtype=object, because the integer columns are mixed with the string-valued strand. Downstream, the njit-compiled get_diffs_sparse and reconstruct_haplotypes_from_sparse cannot type an object array and fail with:

numba.core.errors.TypingError: Failed in nopython mode pipeline (step: nopython frontend)
non-precise type array(pyobject, 1d, A)

Reproduction

On GVL 0.21.4 + polars 1.40 + numba 0.65:

import polars as pl
import genvarloader as gvl

bed = pl.DataFrame({
    'chrom': ['chr17'] * 3,
    'chromStart': [100, 200, 300],
    'chromEnd': [150, 250, 350],
    'strand': pl.Series(['+', '+', '-'], dtype=pl.Categorical),
    'transcript_id': ['t1', 't1', 't2'],
    'exon_number': [1, 2, 1],
})
gvl.write('/tmp/test_gvl', bed, '<some_pgen>')
ds = gvl.Dataset.open('/tmp/test_gvl', reference='...').with_seqs('haplotypes')
ds[0, 0]  # -> numba TypingError

Fix

Two small changes in _dataset/_utils.py::bed_to_regions:

-    if bed.schema.get('strand', None) == pl.Utf8:
+    if bed.schema.get('strand', None) in (pl.Utf8, pl.String, pl.Categorical, pl.Enum):
         cols.append(
-            pl.col('strand').replace_strict({'+': 1, '-': -1}, return_dtype=pl.Int32)
+            pl.col('strand').cast(pl.String).replace_strict({'+': 1, '-': -1}, return_dtype=pl.Int32)
         )

(.cast(pl.String) is a no-op on String columns and unwraps Categorical/Enum cleanly. pl.Utf8 is an alias of pl.String in current polars, so listing both is redundant but harmless.)

Verified locally: ds._full_regions.dtype goes from object to int32, and ds[i, j] returns the expected Ragged without tripping numba.

Happy to open a PR if helpful.
