Summary
_dataset/_utils.py::bed_to_regions checks bed.schema.get('strand', None) == pl.Utf8 before mapping +/- to 1/-1. When strand is a Categorical column (which is what gvl.write stores when the input BED has a repeated strand vocabulary, a very common case), the check returns False, and the fallback branch cols.append(pl.col('strand')) appends the strand column as-is, without mapping it to integers.
Calling .to_numpy() on the resulting polars frame then produces an array with dtype=object (int columns + string strand). Downstream, the njit-compiled get_diffs_sparse and reconstruct_haplotypes_from_sparse fail with:
numba.core.errors.TypingError: Failed in nopython mode pipeline (step: nopython frontend)
non-precise type array(pyobject, 1d, A)
Reproduction
On GVL 0.21.4 + polars 1.40 + numba 0.65:
import polars as pl
import genvarloader as gvl

bed = pl.DataFrame({
    'chrom': ['chr17'] * 3,
    'chromStart': [100, 200, 300],
    'chromEnd': [150, 250, 350],
    'strand': pl.Series(['+', '+', '-'], dtype=pl.Categorical),
    'transcript_id': ['t1', 't1', 't2'],
    'exon_number': [1, 2, 1],
})
gvl.write('/tmp/test_gvl', bed, '<some_pgen>')
ds = gvl.Dataset.open('/tmp/test_gvl', reference='...').with_seqs('haplotypes')
ds[0, 0]  # -> numba TypingError
Fix
Two small changes in _dataset/_utils.py::bed_to_regions:
- if bed.schema.get('strand', None) == pl.Utf8:
+ if bed.schema.get('strand', None) in (pl.Utf8, pl.String, pl.Categorical, pl.Enum):
      cols.append(
-         pl.col('strand').replace_strict({'+': 1, '-': -1}, return_dtype=pl.Int32)
+         pl.col('strand').cast(pl.String).replace_strict({'+': 1, '-': -1}, return_dtype=pl.Int32)
      )
(.cast(pl.String) is a no-op on String columns and unwraps Categorical/Enum cleanly.)
Verified locally: ds._full_regions.dtype goes from object to int32, and ds[i, j] returns the expected Ragged without tripping numba.
Happy to open a PR if helpful.