Issue A
When a formula with a categorical "C(..)" is used without penalty, no level is dropped, i.e. one-hot encoding is used instead of treatment coding.
from glum import GeneralizedLinearRegressor
import pandas as pd
df = pd.DataFrame({"y": [0, 1, 1, 2], "cat": pd.Series(["a", "a", "b", "b"], dtype="category")})
model = GeneralizedLinearRegressor(
    alpha=0,
    formula="y ~ C(cat)",
)
model.fit(df)
results in
LinAlgError: Matrix is singular.
model.feature_names_ outputs ['C(cat)[a]', 'C(cat)[b]'] and reveals that no reference level is chosen.
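To make the singularity concrete, here is a minimal NumPy sketch that rebuilds the design matrix by hand. It does not call glum or tabmat; it assumes the model fits an intercept alongside the two indicator columns reported by feature_names_. Under that assumption the intercept column equals the sum of the indicator columns, so the matrix loses a rank, while dropping one level restores full column rank.

```python
import numpy as np

# Same categorical column as in the reproducer above.
cat = np.array(["a", "a", "b", "b"])
levels = np.unique(cat)  # sorted levels: 'a', 'b'

# Intercept column plus one indicator column per level (one-hot encoding).
intercept = np.ones((len(cat), 1))
one_hot = (cat[:, None] == levels[None, :]).astype(float)

# Assumption: this mirrors an intercept plus 'C(cat)[a]' and 'C(cat)[b]'.
X_full = np.hstack([intercept, one_hot])

# Treatment coding: drop the first level and keep the intercept.
X_treatment = np.hstack([intercept, one_hot[:, 1:]])

rank_full = np.linalg.matrix_rank(X_full)
rank_treatment = np.linalg.matrix_rank(X_treatment)

print("one-hot:  ", X_full.shape[1], "columns, rank", rank_full)
print("treatment:", X_treatment.shape[1], "columns, rank", rank_treatment)
```

With a rank-deficient design matrix and no penalty, the unpenalized normal equations have no unique solution, which is consistent with the LinAlgError above.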
There is no difference between:
"y ~ C(cat)"
"y ~ C(cat, spans_intercept=False)"
"y ~ C(cat, spans_intercept=True)"
However, the error goes away with GeneralizedLinearRegressor(drop_first=True, ..).
Issue B
According to formulaic, it should be possible to call
model = GeneralizedLinearRegressor(
    alpha=0,
    formula="y ~ C(cat, contr.treatment)",
)
model.fit(df)
But this results in
FactorEvaluationError: Unable to evaluate factor `C(cat, contr.treatment)`. [TypeError: _C() takes 1 positional argument but 2 were given]
Formulas are specified as strings, and neither formulaic nor glum nor tabmat has an API reference for, e.g., C. As a user, there is no way to infer the API: the symbol C is not exposed anywhere and only ever appears inside a formula string such as "y ~ C(cat)".
The only way to see what happens is to dig into the source code. There I learned that glum delegates formulas to tabmat, but tabmat overrides formulaic's C, see https://github.com/Quantco/tabmat/blob/0e2608fcaca9f11830bafdf873a700c89de1391f/src/tabmat/formula.py#L665. This "C" (or _C) has a different API with different parameters.
This deficiency of strings for formulas was one reason I proposed a programmatic alternative to strings in #731.