BUG: Formula with categoricals results in singular matrix #920

@lorentzenchr

Description

Issue A

When a formula with a categorical "C(..)" is used without a penalty (alpha=0), no level is dropped, i.e. one-hot encoding is used instead of treatment coding.

from glum import GeneralizedLinearRegressor
import pandas as pd

df = pd.DataFrame({"y": [0, 1, 1, 2], "cat": pd.Series(["a", "a", "b", "b"], dtype="category")})
model = GeneralizedLinearRegressor(
    alpha=0,
    formula="y ~ C(cat)"
)
model.fit(df)

results in

LinAlgError: Matrix is singular.

model.feature_names_ outputs ['C(cat)[a]', 'C(cat)[b]'] and reveals that no reference level is chosen.
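
For illustration, here is a minimal sketch of why a missing reference level makes the problem singular. It assumes the model also fits an intercept, and it builds the design matrices by hand in plain Python rather than through glum or tabmat; the helpers `gram` and `det` are written just for this example.

```python
from fractions import Fraction


def gram(X):
    """Return X^T X for a matrix given as a list of rows."""
    cols = list(zip(*X))
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]


def det(M):
    """Determinant via Gaussian elimination with exact rational arithmetic."""
    M = [[Fraction(v) for v in row] for row in M]
    n = len(M)
    result = Fraction(1)
    for i in range(n):
        # Find a row at or below i with a non-zero entry in column i.
        pivot = next((r for r in range(i, n) if M[r][i] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != i:
            M[i], M[pivot] = M[pivot], M[i]
            result = -result
        result *= M[i][i]
        for r in range(i + 1, n):
            factor = M[r][i] / M[i][i]
            M[r] = [a - factor * b for a, b in zip(M[r], M[i])]
    return result


cat = ["a", "a", "b", "b"]

# Intercept plus one indicator per level, matching the reported feature names
# (the intercept column is an assumption of this sketch).
one_hot = [[1, int(c == "a"), int(c == "b")] for c in cat]

# Intercept plus an indicator for "b" only; "a" acts as the reference level.
treatment = [[1, int(c == "b")] for c in cat]

det_one_hot = det(gram(one_hot))
det_treatment = det(gram(treatment))
print(det_one_hot, det_treatment)
```

The two indicator columns sum to the intercept column, so X^T X has determinant zero in the one-hot case, while dropping one level gives an invertible matrix.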

There is no difference between:

  • "y ~ C(cat)"
  • "y ~ C(cat, spans_intercept=False)"
  • "y ~ C(cat, spans_intercept=True)"

However, the error goes away with GeneralizedLinearRegressor(drop_first=True, ..).

Issue B

According to formulaic, it should be possible to call

model = GeneralizedLinearRegressor(
    alpha=0,
    formula="y ~ C(cat, contr.treatment)"
)
model.fit(df)

But this results in

FactorEvaluationError: Unable to evaluate factor `C(cat, contr.treatment)`. [TypeError: _C() takes 1 positional argument but 2 were given]

Formulas are specified as strings, and neither formulaic, glum, nor tabmat has an API reference for, e.g., C. As a user, there is no way to infer the API: C only ever appears as a string inside a formula, e.g. "y ~ C(cat)", and the symbol C is not exposed anywhere.

The only way to see what happens is to dig into the source code. There I learned that glum delegates formulas to tabmat, but tabmat overrides the C of formulaic, see https://github.com/Quantco/tabmat/blob/0e2608fcaca9f11830bafdf873a700c89de1391f/src/tabmat/formula.py#L665. This "C" (or _C) has a different API with different parameters.
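
To illustrate the kind of signature mismatch that would produce this TypeError, here is a small sketch with a hypothetical stand-in for `_C`. The parameter names are assumptions made for the example and are not taken from tabmat's actual definition; the point is only that a function accepting one positional argument rejects a second one, such as a contrast specification.

```python
def _C(data, *, levels=None, spans_intercept=True):
    """Hypothetical stand-in: one positional argument, the rest keyword-only.

    This is NOT tabmat's real signature, only an illustration.
    """
    return {"data": data, "levels": levels, "spans_intercept": spans_intercept}


# Keyword arguments are accepted.
ok = _C(["a", "a", "b", "b"], spans_intercept=False)

# A second positional argument (here a placeholder string standing in for
# `contr.treatment`) is rejected before the function body ever runs.
try:
    _C(["a", "a", "b", "b"], "contr.treatment")
except TypeError as exc:
    message = str(exc)

print(message)
```

With a signature like this, a formulaic-style call such as C(cat, contr.treatment) cannot work, and the user only finds out from the error message.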

This deficiency of strings for formulas was one reason I proposed a programmatic alternative to strings in #731.
