
Type Annotations#320

Open
nachomaiz wants to merge 27 commits into Roche:pyfile_dev from nachomaiz:pyfile_typehints

Conversation

@nachomaiz

@nachomaiz nachomaiz commented Feb 13, 2026

Hi @ofajardo!

This PR aims to fix #299, adding type annotations to all public interface functions and classes.

I based them on the docstrings and how I understand the code is operating with the different parameters and class attributes, but I might have missed something.

I wasn't able to compile the library on this machine; however, I have done a runtime check of the type annotations to make sure everything runs on py3.10+.

How it works:

  • pyclasses.py:
    • I've written TypedDict classes for missing ranges and MR sets.
    • Because the instance is not meant to be initialized by the user, I've set the type annotations for optional parameters to the final (non-optional) type. It might be better to turn it into a dataclass or add default values of the same type.
  • pyreadstat.py:
    • Created a FileLike protocol with the methods read and seek.
    • Used os.PathLike for flexibility with os.fsencode.
    • Added overloads to read functions for the different output format types.
    • Write functions accept any dataframe object supported by narwhals.
    • Chunk- and multi-read functions only accept a PyreadstatReadFunction callable type. Its first argument must be a path/file-like object and it must return a tuple of data and metadata.
  • pyfunctions.py:
    • Used narwhals type vars to signal that the same type of dataframe is returned.
  • py.typed: to signal to type checkers that the public interface of the library is type hinted.
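The overload pattern mentioned for the read functions can be sketched roughly like this. This is a hypothetical, simplified signature for illustration only: the toy DataFrame and Metadata classes stand in for the real pandas/polars/numpy return types, and the actual pyreadstat functions take many more parameters.

```python
from typing import Any, Literal, overload

# Toy stand-ins so the sketch is self-contained; the real functions
# return pandas/polars dataframes, dicts of numpy arrays, and a
# metadata_container object.
class DataFrame: ...
class Metadata: ...

@overload
def read_sav(path: str, *, output_format: Literal["pandas"] = ...) -> tuple[DataFrame, Metadata]: ...
@overload
def read_sav(path: str, *, output_format: Literal["dict"]) -> tuple[dict[str, list], Metadata]: ...
def read_sav(path: str, *, output_format: str = "pandas") -> tuple[Any, Metadata]:
    # Runtime behavior here is a dummy; only the overloaded signatures
    # matter for the typing sketch. A type checker resolves the return
    # type from the literal value passed as output_format.
    if output_format == "dict":
        return {}, Metadata()
    return DataFrame(), Metadata()
```

With this pattern, `read_sav("f.sav")` is seen by the checker as returning a dataframe, while `read_sav("f.sav", output_format="dict")` is seen as returning a dict, without the user having to cast.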

While I added type annotations, I saw a few issues with the docstrings. I took the liberty to sync them up with the type annotations.

I also used a formatter for the function signatures as they were getting unwieldy. This changed the formatting of some of the code within the functions, so let me know if you would prefer I revert those.

Looking forward to your feedback.

@ofajardo
Collaborator

hi @nachomaiz thanks a lot! I am a bit snowed under right now, but will take a look as soon as I get some time.
By looking at what you wrote here, I have two comments:

  • If you like, you can transform the metadata container into a dataclass if you think it will work better.
  • For the write functions, they should only accept a pandas or a polars dataframe; accepting any dataframe that narwhals would accept is misleading, because only those two are supported.

@jonathon-love please check this PR; it probably addresses the same thing as yours?

@nachomaiz
Author

nachomaiz commented Feb 16, 2026

Thank you for the feedback!

I've now changed metadata_container to a dataclass. I saw that all fields are assigned to in _readstat_parser, so I've removed the Nones and added equivalent, type-compatible values. datetime fields default to datetime.now, but hopefully it should be a minimal amount of extra compute.
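A minimal sketch of what that change might look like, assuming a small subset of the real fields (the concrete field names and defaults here are illustrative; the real metadata_container has many more attributes, and the thread later reverts these defaults back to None):

```python
from dataclasses import dataclass, field
from datetime import datetime

# Hypothetical, abbreviated sketch of the metadata_container-as-dataclass
# idea described above. Field names beyond number_rows/number_columns are
# guesses; the 0/""/datetime.now defaults were later reverted to None.
@dataclass
class metadata_container:
    number_rows: int = 0
    number_columns: int = 0
    file_encoding: str = ""
    # default_factory avoids sharing one datetime across instances
    creation_time: datetime = field(default_factory=datetime.now)
```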

I also took a quick look at @jonathon-love's PR, and I realized I was missing specific types for PathLike and np.ndarray, so I ported over some of those nicer type definitions. Could I ask for your thoughts as well?

I switched the dataframe types for the write functions to pandas.DataFrame | polars.DataFrame to keep the hints restricted to those libraries only.

Additionally, I've added py.typed to MANIFEST.in and setup.py, as I got reminded by @jonathon-love to include that as well.

@ofajardo
Collaborator

hi @nachomaiz thanks for your efforts!

I have cloned your fork, compiled it (all ok) and then run the tests.

test_basic.py and test_narwhalified.py fail with 3 errors. The origin is probably that in the metadata_container class, if you look carefully, there are some members like number_rows that before were None by default, and now you are defining them as 0. This creates an inconsistency when using metadataonly, which also breaks read_in_chunks when reading an export file. So, could you please review those members and adapt them to be as they were before?

oh now I see your comment

I saw that all fields are assigned to in _readstat_parser, so I've removed the Nones and added equivalent, type-compatible values. datetime fields default to datetime.now, but hopefully it should be a minimal amount of extra compute.

No, that is not correct: those values are not always set and therefore the Nones need to stay. Also, do not default the datetime fields to datetime.now but to None.

Another one: typing_test.py raises a lot of errors. I am less familiar with mypy so I have not checked what they are about.

Found 22 errors in 4 files (checked 1 source file)

I think you have to get a machine where you can compile pyreadstat and run the tests. Please run test_basic.py and test_narwhalified.py with backend==pandas and backend==polars, as well as test_http_integration.py. BTW, please rename typing_test.py to test_typing.py just to keep the naming pattern.

Last one:
In setup.py there are these two additions:

package_data={"pyreadstat": ["py.typed"]},
include_package_data=True,

I think this might be unnecessary if you included py.typed in the manifest. The issue with package data is that on Windows, when people install Python from the Windows app store, it installs the package and the package data into different places (can't remember exactly), and I am not sure if in such a case the IDE will see the py.typed (maybe yes?). I had such an issue in the past when I had to deliver dll files for Windows, and Python was not able to find them. I think this has to be tested.
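For reference, the manifest route alone would be a single line like this (assuming the package directory is pyreadstat/):

```
# MANIFEST.in — include the PEP 561 marker in the source distribution
include pyreadstat/py.typed
```

One caveat worth noting: MANIFEST.in by itself only affects the sdist. For the file to end up inside wheels, setuptools still needs either package_data or include_package_data=True, so removing both setup.py lines could leave py.typed out of the wheel even when the manifest entry is present.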

Otherwise it looks good! =) Speed is also the same as before when I converted the files from pyx to py, so it seems the dataclass change is neutral in terms of performance.

@nachomaiz
Author

Hello!

Thank you for reviewing and for your feedback!

I'm working on setting up a machine to be able to compile and run tests, will hopefully have it soon.

In the meantime, just wanted to get your thoughts on a couple of the things you mentioned above.

I have now gone through the code a bit more carefully and found the places where the distinction between 0 and None for number_rows is made. I see that it's generally related to POR and XPORT (?) files not having row counts in their metadata, and how that interacts with metadataonly=True and chunk/multiprocessing reads...

What makes it a bit complicated in my view is that if we set number_rows as int | None, any access of number_rows for any other file type will always need to be preceded by:

if meta.number_rows is not None:
    ...

That may be the easier way to handle things in the end, but it also feels redundant when for many users it would never be None. Would there be any alternatives? Maybe a subclass of metadata_container only for POR and XPORT files, in which number_rows can be None? But you're probably more familiar with the code in terms of other potential ways to keep the logic working.

...

On the typing_tests.pyi file, do note that it's a PYI file, so it's not executable. It exists only to run mypy against, with mypy tests/typing_tests.pyi. I was worried that naming it test_typing.py would make the test runner think it's a file with actual runtime tests. I suppose since it's a PYI file I could rename it to test_typing.pyi and it should be ok.

There are a few rows which should error; there should be 5 errors in that file (comments in the file point them out). mypy also analyzes the other files that the test file imports, so I'm ignoring those files for the purpose of these annotations.

I noticed that there's an import error in the file, so at the moment it doesn't work correctly; I'll fix that soon. I should also mention that both polars and pandas-stubs must be installed for mypy to do the type checks correctly.

I'll fix those few bits and remove the extra setup.py lines in my next batch of commits, but I'd be keen to hear your thoughts on alternatives to setting int | None for all types of files.

@ofajardo
Collaborator

hi @nachomaiz

Regarding the topic of num_rows being int or None: None signals that it was not possible to recover the information from the metadata and therefore it is undefined. It is not correct to say that this happens only for POR and XPORT files; theoretically it can happen for any file type if the writing application did not write that information. For example, in the case of SPSS SAV files, some applications do not write the number of rows, so it cannot be determined and should stay as None (see for example #109).

However, I am not 100% sure what the problem is ... this is the way it has been for years and there have been no problems so far. I am also reluctant to change the interface unless it is strictly necessary. So can you please explain a bit more what your concern is? If you mean the user needs to check the possibility that num_rows is None: yes, the user should do that if they want to be strict; there is no way around that, for the reason explained before.

Please also notice that I would like all the members that were None before to stay as they are, not only num_rows.

@nachomaiz
Author

Ok! That makes sense. My mistake for assuming things. 😅

Will bring back all the None values, try to run the tests, and push new commits, aiming for later today.

Hopefully that gets it to a good place to merge!

@ofajardo
Collaborator

hi @nachomaiz thanks!

Regarding the typing tests, please indicate in the comment at the top of the file, where you indicate that it has to be run with mypy, which other packages need to be installed in order to run.

We need the tests to be executable: they should have assertions which should all pass if everything is fine and fail if something is wrong. These tests will then be run when making the wheels and are expected to pass, so reveal_type is not enough. So please transform your tests into an executable form and rename the file as suggested before. I have never done this, so I'm not sure what is better; a quick search says you can use either assert_type (would be nice, as no extra package is needed, and then you could do similarly to test_narwhalified.py) or pytest-mypy-plugins (would require installing extra stuff, but apparently you can write negative tests more easily).

@nachomaiz
Author

nachomaiz commented Feb 24, 2026

Hello again!

So I've managed to set up my machine to run tests! And yes, I saw where they were failing. Sorry about that.

I've done the following:

  • Reverted the previously optional values to None by default
  • Added int | None to MRSet["counted_value"] as I saw that it was a valid type in the tests.
  • Created typing test cases to run with pytest-mypy-plugins. This was actually pretty interesting, so thanks for the challenge! 😄

The typing tests are run using the following command (also available at the top of the file):

pytest tests/test_typing.yml --mypy-ini-file=tests/test_mypy_setup.ini --mypy-only-local-stub

The mypy-* flags are needed because mypy checks imported libraries by default (numpy and narwhals), and that brings up tons of noise from typing errors there.

pytest will also pick up these tests if you just run pytest tests for example, so the mypy flags are required there too.

I've made just a few quick type tests on the metadata_container object, but they can be expanded as needed.
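For readers unfamiliar with pytest-mypy-plugins, a case file is a YAML list of named code snippets that are type-checked with mypy, with expected notes/errors given as inline comments. A minimal hypothetical case (not one of the actual cases in tests/test_typing.yml) could look like:

```yaml
# Hypothetical minimal case; the real file exercises pyreadstat's
# read/write functions and output formats.
- case: number_rows_is_optional
  main: |
    import pyreadstat
    _, meta = pyreadstat.read_sav("sample.sav")
    reveal_type(meta.number_rows)  # N: Revealed type is "Union[builtins.int, None]"
```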

Let me know if you see anything else to change!

PS. I didn't go with assert_type because it's not available in Python 3.10.

@nachomaiz
Author

Hi again,

Apologies, just realized that I missed number_columns when reverting default values to None.

Should hopefully be correct now! 😄

@ofajardo
Collaborator

hi @nachomaiz sorry, I am quite busy these days. Today I ran the tests, and the old tests are all passing for me as well, so all looks good!
For the typing tests however, several tests are failing for me, so maybe first thing is to check if I am running them correctly.
I am doing:

pytest tests/test_typing.yml --mypy-ini-file=tests/test_mypy_setup.ini --mypy-only-local-stub

To run these I installed:

pip install pytest mypy pytest-mypy-plugins pandas-stubs

Anything else to install?

Below I paste the errors I am getting. They seem to refer mostly to numpy. What version of numpy are you using? I was first using 2.3.2 and then updated to 2.4.3 ...

tests/test_typing.yml ...F....F....F....F....F....F.....F...FF........F.                                                                                             [ 98%]

================================================================================= FAILURES =================================================================================
__ read_sav_output_types[output_format=dict,expected_type=builtins.dict[builtins.str, numpy.ndarray[builtins.tuple[builtins.int, ...], numpy.dtype[numpy.generic[Any]]]]] __
/myhome/rancher/githubrepos/pub/pyreadstat_nachomaiz/tests/test_typing.yml:25:
E   pytest_mypy_plugins.utils.TypecheckAssertionError: Invalid output:
E   Actual:
E     main:4: note: Revealed type is "builtins.dict[builtins.str, numpy.ndarray[builtins.tuple[Any, ...], numpy.dtype[numpy.generic[Any]]]]" (diff)
E   Expected:
E     main:4: note: Revealed type is "builtins.dict[builtins.str, numpy.ndarray[builtins.tuple[builtins.int, ...], numpy.dtype[numpy.generic[Any]]]]" (diff)
E   Alignment of first line difference:
E     E: ...numpy.ndarray[builtins.tuple[builtins.int, ...], numpy.dtype[numpy.ge...
E     A: ...numpy.ndarray[builtins.tuple[Any, ...], numpy.dtype[numpy.generic[Any...
E                                        ^
__ read_dta_output_types[output_format=dict,expected_type=builtins.dict[builtins.str, numpy.ndarray[builtins.tuple[builtins.int, ...], numpy.dtype[numpy.generic[Any]]]]] __
/myhome/rancher/githubrepos/pub/pyreadstat_nachomaiz/tests/test_typing.yml:52:
E   pytest_mypy_plugins.utils.TypecheckAssertionError: Invalid output:
E   Actual:
E     main:3: note: Revealed type is "builtins.dict[builtins.str, numpy.ndarray[builtins.tuple[Any, ...], numpy.dtype[numpy.generic[Any]]]]" (diff)
E   Expected:
E     main:3: note: Revealed type is "builtins.dict[builtins.str, numpy.ndarray[builtins.tuple[builtins.int, ...], numpy.dtype[numpy.generic[Any]]]]" (diff)
E   Alignment of first line difference:
E     E: ...numpy.ndarray[builtins.tuple[builtins.int, ...], numpy.dtype[numpy.ge...
E     A: ...numpy.ndarray[builtins.tuple[Any, ...], numpy.dtype[numpy.generic[Any...
E                                        ^
__ read_por_output_types[output_format=dict,expected_type=builtins.dict[builtins.str, numpy.ndarray[builtins.tuple[builtins.int, ...], numpy.dtype[numpy.generic[Any]]]]] __
/myhome/rancher/githubrepos/pub/pyreadstat_nachomaiz/tests/test_typing.yml:79:
E   pytest_mypy_plugins.utils.TypecheckAssertionError: Invalid output:
E   Actual:
E     main:3: note: Revealed type is "builtins.dict[builtins.str, numpy.ndarray[builtins.tuple[Any, ...], numpy.dtype[numpy.generic[Any]]]]" (diff)
E   Expected:
E     main:3: note: Revealed type is "builtins.dict[builtins.str, numpy.ndarray[builtins.tuple[builtins.int, ...], numpy.dtype[numpy.generic[Any]]]]" (diff)
E   Alignment of first line difference:
E     E: ...numpy.ndarray[builtins.tuple[builtins.int, ...], numpy.dtype[numpy.ge...
E     A: ...numpy.ndarray[builtins.tuple[Any, ...], numpy.dtype[numpy.generic[Any...
E                                        ^
_ read_sas7bdat_output_types[output_format=dict,expected_type=builtins.dict[builtins.str, numpy.ndarray[builtins.tuple[builtins.int, ...], numpy.dtype[numpy.generic[Any]]]]] _
/myhome/rancher/githubrepos/pub/pyreadstat_nachomaiz/tests/test_typing.yml:106:
E   pytest_mypy_plugins.utils.TypecheckAssertionError: Invalid output:
E   Actual:
E     main:3: note: Revealed type is "builtins.dict[builtins.str, numpy.ndarray[builtins.tuple[Any, ...], numpy.dtype[numpy.generic[Any]]]]" (diff)
E   Expected:
E     main:3: note: Revealed type is "builtins.dict[builtins.str, numpy.ndarray[builtins.tuple[builtins.int, ...], numpy.dtype[numpy.generic[Any]]]]" (diff)
E   Alignment of first line difference:
E     E: ...numpy.ndarray[builtins.tuple[builtins.int, ...], numpy.dtype[numpy.ge...
E     A: ...numpy.ndarray[builtins.tuple[Any, ...], numpy.dtype[numpy.generic[Any...
E                                        ^
_ read_xport_output_types[output_format=dict,expected_type=builtins.dict[builtins.str, numpy.ndarray[builtins.tuple[builtins.int, ...], numpy.dtype[numpy.generic[Any]]]]] _
/myhome/rancher/githubrepos/pub/pyreadstat_nachomaiz/tests/test_typing.yml:133:
E   pytest_mypy_plugins.utils.TypecheckAssertionError: Invalid output:
E   Actual:
E     main:3: note: Revealed type is "builtins.dict[builtins.str, numpy.ndarray[builtins.tuple[Any, ...], numpy.dtype[numpy.generic[Any]]]]" (diff)
E   Expected:
E     main:3: note: Revealed type is "builtins.dict[builtins.str, numpy.ndarray[builtins.tuple[builtins.int, ...], numpy.dtype[numpy.generic[Any]]]]" (diff)
E   Alignment of first line difference:
E     E: ...numpy.ndarray[builtins.tuple[builtins.int, ...], numpy.dtype[numpy.ge...
E     A: ...numpy.ndarray[builtins.tuple[Any, ...], numpy.dtype[numpy.generic[Any...
E                                        ^
_ read_sas7bcat_output_types[output_format=dict,expected_type=builtins.dict[builtins.str, numpy.ndarray[builtins.tuple[builtins.int, ...], numpy.dtype[numpy.generic[Any]]]]] _
/myhome/rancher/githubrepos/pub/pyreadstat_nachomaiz/tests/test_typing.yml:160:
E   pytest_mypy_plugins.utils.TypecheckAssertionError: Invalid output:
E   Actual:
E     main:3: note: Revealed type is "builtins.dict[builtins.str, numpy.ndarray[builtins.tuple[Any, ...], numpy.dtype[numpy.generic[Any]]]]" (diff)
E   Expected:
E     main:3: note: Revealed type is "builtins.dict[builtins.str, numpy.ndarray[builtins.tuple[builtins.int, ...], numpy.dtype[numpy.generic[Any]]]]" (diff)
E   Alignment of first line difference:
E     E: ...numpy.ndarray[builtins.tuple[builtins.int, ...], numpy.dtype[numpy.ge...
E     A: ...numpy.ndarray[builtins.tuple[Any, ...], numpy.dtype[numpy.generic[Any...
E                                        ^
________________________________________________________________ read_file_multiprocessing_invalid_callable ________________________________________________________________
/myhome/rancher/githubrepos/pub/pyreadstat_nachomaiz/tests/test_typing.yml:195:
E   pytest_mypy_plugins.utils.TypecheckAssertionError: Invalid output:
E   Actual:
E     main:4: error: Argument 1 to "read_file_multiprocessing" has incompatible type "Callable[[int], int]"; expected "def (str | bytes | PathLike[str] | PathLike[bytes] | FileLike, /, *Any, **Any) -> tuple[pandas.core.frame.DataFrame | polars.dataframe.frame.DataFrame | dict[str, ndarray[tuple[Any, ...], dtype[generic[Any]]]], metadata_container]"  [arg-type] (diff)
E     ...
E   Expected:
E     main:4: error: Argument 1 to "read_file_multiprocessing" has incompatible type "Callable[[int], int]"; expected "def (str | bytes | PathLike[str] | PathLike[bytes] | FileLike, /, *Any, **Any) -> tuple[pandas.core.frame.DataFrame | polars.dataframe.frame.DataFrame | dict[str, ndarray[tuple[int, ...], dtype[generic[Any]]]], metadata_container]"  [arg-type] (diff)
E     ...
E   Alignment of first line difference:
E     E: ...[str, ndarray[tuple[int, ...], dtype[generic[Any]]]], metadata_contai...
E     A: ...[str, ndarray[tuple[Any, ...], dtype[generic[Any]]]], metadata_contai...
E                               ^
_ read_file_in_chunks_output_types[output_format=dict,expected_type=builtins.dict[builtins.str, numpy.ndarray[builtins.tuple[builtins.int, ...], numpy.dtype[numpy.generic[Any]]]]] _
/myhome/rancher/githubrepos/pub/pyreadstat_nachomaiz/tests/test_typing.yml:217:
E   pytest_mypy_plugins.utils.TypecheckAssertionError: Invalid output:
E   Actual:
E     main:3: note: Revealed type is "builtins.dict[builtins.str, numpy.ndarray[builtins.tuple[Any, ...], numpy.dtype[numpy.generic[Any]]]]" (diff)
E   Expected:
E     main:3: note: Revealed type is "builtins.dict[builtins.str, numpy.ndarray[builtins.tuple[builtins.int, ...], numpy.dtype[numpy.generic[Any]]]]" (diff)
E   Alignment of first line difference:
E     E: ...numpy.ndarray[builtins.tuple[builtins.int, ...], numpy.dtype[numpy.ge...
E     A: ...numpy.ndarray[builtins.tuple[Any, ...], numpy.dtype[numpy.generic[Any...
E                                        ^
___________________________________________________________________ read_file_in_chunks_invalid_callable ___________________________________________________________________
/myhome/rancher/githubrepos/pub/pyreadstat_nachomaiz/tests/test_typing.yml:224:
E   pytest_mypy_plugins.utils.TypecheckAssertionError: Invalid output:
E   Actual:
E     main:4: error: Argument 1 to "read_file_in_chunks" has incompatible type "Callable[[int], int]"; expected "def (str | bytes | PathLike[str] | PathLike[bytes] | FileLike, /, *Any, **Any) -> tuple[pandas.core.frame.DataFrame | polars.dataframe.frame.DataFrame | dict[str, ndarray[tuple[Any, ...], dtype[generic[Any]]]], metadata_container]"  [arg-type] (diff)
E     ...
E   Expected:
E     main:4: error: Argument 1 to "read_file_in_chunks" has incompatible type "Callable[[int], int]"; expected "def (str | bytes | PathLike[str] | PathLike[bytes] | FileLike, /, *Any, **Any) -> tuple[pandas.core.frame.DataFrame | polars.dataframe.frame.DataFrame | dict[str, ndarray[tuple[int, ...], dtype[generic[Any]]]], metadata_container]"  [arg-type] (diff)
E     ...
E   Alignment of first line difference:
E     E: ...dict[str, ndarray[tuple[int, ...], dtype[generic[Any]]]], metadata_co...
E     A: ...dict[str, ndarray[tuple[Any, ...], dtype[generic[Any]]]], metadata_co...
E                                   ^
_______________________________________________________________________________ worker_types _______________________________________________________________________________
/myhome/rancher/githubrepos/pub/pyreadstat_nachomaiz/tests/test_typing.yml:311:
E   pytest_mypy_plugins.utils.TypecheckAssertionError: Invalid output:
E   Actual:
E     main:5: note: Revealed type is "pandas.core.frame.DataFrame | polars.dataframe.frame.DataFrame | builtins.dict[builtins.str, numpy.ndarray[builtins.tuple[Any, ...], numpy.dtype[Any]]]" (diff)
E     main:6: error: Incompatible types in assignment (expression has type "tuple[Callable[[Any], str], str, int, int, dict[str, Any]]", variable has type "tuple[def (str | bytes | PathLike[str] | PathLike[bytes] | FileLike, /, *Any, **Any) -> tuple[pandas.core.frame.DataFrame | polars.dataframe.frame.DataFrame | dict[str, ndarray[tuple[Any, ...], dtype[generic[Any]]]], metadata_container], str | bytes | PathLike[Any] | FileLike, int, int, dict[str, Any]]")  [assignment] (diff)
E     ...
E   Expected:
E     main:5: note: Revealed type is "pandas.core.frame.DataFrame | polars.dataframe.frame.DataFrame | builtins.dict[builtins.str, numpy.ndarray[Any, Any]]" (diff)
E     main:6: error: Incompatible types in assignment (expression has type "tuple[Callable[[Any], str], str, int, int, dict[str, Any]]", variable has type "tuple[def (str | bytes | PathLike[str] | PathLike[bytes] | FileLike, /, *Any, **Any) -> tuple[pandas.core.frame.DataFrame | polars.dataframe.frame.DataFrame | dict[str, ndarray[tuple[int, ...], dtype[generic[Any]]]], metadata_container], str | bytes | PathLike[Any] | FileLike, int, int, dict[str, Any]]")  [assignment] (diff)
E     ...
E   Alignment of first line difference:
E     E: ...[builtins.str, numpy.ndarray[Any, Any]]"...
E     A: ...[builtins.str, numpy.ndarray[builtins.tuple[Any, ...], numpy.dtype[An...
E                                        ^
========================================================================= short test summary info ==========================================================================
FAILED tests/test_typing.yml::read_sav_output_types[output_format=dict,expected_type=builtins.dict[builtins.str, numpy.ndarray[builtins.tuple[builtins.int, ...], numpy.dtype[numpy.generic[Any]]]]] -
FAILED tests/test_typing.yml::read_dta_output_types[output_format=dict,expected_type=builtins.dict[builtins.str, numpy.ndarray[builtins.tuple[builtins.int, ...], numpy.dtype[numpy.generic[Any]]]]] -
FAILED tests/test_typing.yml::read_por_output_types[output_format=dict,expected_type=builtins.dict[builtins.str, numpy.ndarray[builtins.tuple[builtins.int, ...], numpy.dtype[numpy.generic[Any]]]]] -
FAILED tests/test_typing.yml::read_sas7bdat_output_types[output_format=dict,expected_type=builtins.dict[builtins.str, numpy.ndarray[builtins.tuple[builtins.int, ...], numpy.dtype[numpy.generic[Any]]]]] -
FAILED tests/test_typing.yml::read_xport_output_types[output_format=dict,expected_type=builtins.dict[builtins.str, numpy.ndarray[builtins.tuple[builtins.int, ...], numpy.dtype[numpy.generic[Any]]]]] -
FAILED tests/test_typing.yml::read_sas7bcat_output_types[output_format=dict,expected_type=builtins.dict[builtins.str, numpy.ndarray[builtins.tuple[builtins.int, ...], numpy.dtype[numpy.generic[Any]]]]] -
FAILED tests/test_typing.yml::read_file_multiprocessing_invalid_callable -
FAILED tests/test_typing.yml::read_file_in_chunks_output_types[output_format=dict,expected_type=builtins.dict[builtins.str, numpy.ndarray[builtins.tuple[builtins.int, ...], numpy.dtype[numpy.generic[Any]]]]] -
FAILED tests/test_typing.yml::read_file_in_chunks_invalid_callable -
FAILED tests/test_typing.yml::worker_types -
================================================================ 10 failed, 40 passed in 455.83s (0:07:35) =============================

@ofajardo
Collaborator

Other observations about test_typing.yml:

  • read_sas7bcat_default_types is defined twice, but I think the first one should be for xport

@nachomaiz
Author

Hello again! Apologies I took a few days of vacation and wasn't able to look into this until now.

It seems like numpy type definitions are a bit finicky. I originally had the Any version, but mypy was giving me actual type information. I'll need to check which versions of numpy and mypy I was using, but I imagine they would be the latest versions that support Python 3.10 (though numpy has ended support for 3.10 in its most recent versions), since that is the minimum supported version for this library.

I'm now thinking that might be the issue... I was using Python 3.10 to test the type definitions, so the most recent version of numpy I could have used is v2.2.6. The main typing change in v2.3.0 I can see is the removal of numpy.typing.mypy_plugin, which is probably behind these test discrepancies. (numpy/numpy#28129)

So I think I'd need your guidance on next steps. It might be worth considering whether Python 3.10 should still be supported by this library, given the direct dependency on numpy and its lack of support for that version. Alternatively, to fix it and allow the tests to work across all supported Python versions, I'd need to change the type tests to use regex instead and just check that the returned type is some form of numpy.ndarray without qualifiers.

Thank you as well for catching the duplicated test, apologies for that. I'll fix that when I get back to the machine I can run the tests on.

@ofajardo
Collaborator

hi @nachomaiz, welcome back! I hope you had a nice vacation!

Thanks for looking into this. In my opinion the best would be to make it good for Python 3.11+ and just forget about 3.10. Although I like backward compatibility a lot, at some point it becomes difficult, and 3.10 is relatively old now, so I would be happy to provide support for 3.11+ only. Anyway, if somebody needs it for 3.10, they can still use pyreadstat; it is just that the type annotations would not work 100%, which I think is fair enough.

So if you are cool with that, please proceed to make it compatible with 3.11+ .

Another, maybe a bit too aggressive, thought is that the problem seems to be in the dict output. I wonder how many people use that one. It was meant as a bridge to import the data into polars, but now that there is polars output, maybe nobody uses it. As I don't know, I will not remove it, but I also wonder how important it is to have very good type annotations for it. For now I would say let's just go as you did; later, if it breaks again because the numpy annotations are not stable, maybe it would pay off to have a more generic type (like just saying it is a numpy ndarray, without specifying the types of the inner elements).

Author

@nachomaiz nachomaiz left a comment


Hi @ofajardo,

Thank you!

I've reworked the numpy type annotations for 3.11+ and have successfully run all the tests.

I've also fixed the duplicated test and replaced it with the correct xport test.

It seems like pytest-mypy-plugins was updated last week, and with the new version the behavior of the --mypy-only-local-stub flag is the default, so it's not needed anymore if you update the package to 4.0. I've reflected that in the test file as well.

Let me know if you have any other comments! Cheers!

@nachomaiz
Author

Hello again,

While I was updating the Python version of my venv for 3.11+, I found it a bit tricky to make sure I had all test dependencies.

There's been an option to define dependency groups for some time, and it might be helpful to bring that into this project's pyproject.toml.

Something like:

[dependency-groups]
dev = ["pandas[pyarrow]>=2.0.0", "polars>=1.30.0"]
test = ["pytest>9.0.0", "pytest-mypy-plugins>=4.0.0"]

Then, the environment setup command for testing or development could look like this (requires pip>=25.1):

pip install --group dev --group test

Some tools like uv even auto-install the dev group by default when setting up the project environment.

Given this PR adds a new test dependency (pytest-mypy-plugins), would you like me to include this within the PR?

@ofajardo
Collaborator

hi @nachomaiz thanks for the improvements!

  • Regarding the dependency-groups, I think that is a nice idea; please add it. For dev, I would say cython is needed in addition to what you wrote; I don't think you need pyarrow at the moment, though. For test, in my hands I also needed pandas-stubs and mypy in addition to what you indicated.

  • Regarding the tests, one of them is failing for me; can you please check?

========================================================== FAILURES ==========================================================
________________________________________________________ worker_types ________________________________________________________
Traceback (most recent call last):
  File "/opt/conda/lib/python3.12/site-packages/_pytest/runner.py", line 353, in from_call
    result: TResult | None = func()
                             ^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/_pytest/runner.py", line 245, in <lambda>
    lambda: runtest_hook(item=item, **kwds),
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/pluggy/_hooks.py", line 512, in __call__
    return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/pluggy/_manager.py", line 120, in _hookexec
    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/pluggy/_callers.py", line 167, in _multicall
    raise exception
  File "/opt/conda/lib/python3.12/site-packages/pluggy/_callers.py", line 139, in _multicall
    teardown.throw(exception)
  File "/opt/conda/lib/python3.12/site-packages/_pytest/logging.py", line 850, in pytest_runtest_call
    yield
  File "/opt/conda/lib/python3.12/site-packages/pluggy/_callers.py", line 139, in _multicall
    teardown.throw(exception)
  File "/opt/conda/lib/python3.12/site-packages/_pytest/capture.py", line 900, in pytest_runtest_call
    return (yield)
            ^^^^^
  File "/opt/conda/lib/python3.12/site-packages/pluggy/_callers.py", line 139, in _multicall
    teardown.throw(exception)
  File "/opt/conda/lib/python3.12/site-packages/_pytest/skipping.py", line 268, in pytest_runtest_call
    return (yield)
            ^^^^^
  File "/opt/conda/lib/python3.12/site-packages/pluggy/_callers.py", line 121, in _multicall
    res = hook_impl.function(*args)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/_pytest/runner.py", line 179, in pytest_runtest_call
    item.runtest()
  File "/opt/conda/lib/python3.12/site-packages/pytest_mypy_plugins/item.py", line 441, in runtest
    ).run()
      ^^^^^
  File "/opt/conda/lib/python3.12/site-packages/pytest_mypy_plugins/item.py", line 294, in run
    self.output_checker.check(returncode, stdout, stderr)
  File "/opt/conda/lib/python3.12/site-packages/pytest_mypy_plugins/item.py", line 251, in check
    assert_expected_matched_actual(expected=self.expected_output, actual=output_lines)
  File "/opt/conda/lib/python3.12/site-packages/pytest_mypy_plugins/utils.py", line 225, in assert_expected_matched_actual
    if e is None or a is None or not e.matches(a)
                                     ^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/pytest_mypy_plugins/utils.py", line 81, in matches
    return bool(regex.match(pattern, actual))
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/regex/_main.py", line 256, in match
    pat = _compile(pattern, flags, ignore_unused, kwargs, True)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/regex/_main.py", line 552, in _compile
    raise error(caught_exception.msg, caught_exception.pattern,
regex._regex_core.error: multiple repeat at position 231
================================================== short test summary info ===================================================
FAILED tests/test_typing.yml::worker_types - regex._regex_core.error: multiple repeat at position 231
========================================== 1 failed, 49 passed in 412.95s (0:06:52) ==========================================

@ofajardo
Collaborator

I also asked Claude Code what it thinks about the tests. It says several things; most of them I consider too picky, but a couple caught my eye, so I write them here for your consideration:

1. read_file_multiprocessing_output_types doesn't actually test its parametrization (line 177-188)

  The parametrized block defines 3 variants (pandas/polars/dict) with expected_type, but the main block hardcodes output_format="polars" on line 187 and checks against
  polars.dataframe.frame.DataFrame on line 188. The {{ output_format }} and {{ expected_type }} templates are never used. This means the pandas and dict overloads of
  read_file_multiprocessing are not actually tested — only the polars path runs. Compare with line 24 where the template is correctly used: output_format="{{ output_format}}".

5. worker_types expected error is fragile (line 312)
  The full error message for the invalid lambda is inlined as one massive line. Any change to mypy's error formatting, or to the type aliases, will break this test even if
  the type checking is still correct. The # E: pattern is unavoidable with pytest-mypy-plugins, but the companion out: block on line 313-314 adds a second assertion on a
  note: that's an implementation detail of mypy's lambda inference. This makes the test brittle.

I also asked about the type hints in pyreadstat.py, and a couple intrigued me. Particularly number 4 ... is FileLike not going to work with worker.py? If that's true, that is something to correct. Number 7 is a good one, not related to the typing, so I can also fix it later, or you could fix it if you feel like it ... Number 5 makes sense ... what do you think about that one? It may become float, but it may also be int most of the time.

4. worker.py Input type is a flat tuple (worker.py:29)

  Input: TypeAlias = "tuple[PyreadstatReadFunction, str | bytes | PathLike | FileLike, int, int, dict[str, Any]]"

  This duplicates the path types already defined as FilePathLike in pyreadstat.py — it should reuse FilePathLike instead. Also FileLike shouldn't be here since worker.py
  only works with file paths (you can't pickle a file-like object across processes).

5. read_file_multiprocessing has row_limit typed as float at runtime (pyreadstat.py:1146)

  row_limit = kwargs.pop("row_limit", float("inf"))

  The signature says the function returns tuple[DataFrame | DictOutput, metadata_container], but row_limit defaults to float("inf") internally. This isn't a type hint issue
   per se, but numrows = min(max(numrows - row_offset, 0), row_limit) on line 1159 then mixes int and float, which can produce a float where an int is expected downstream.

  6. read_por overloads have bare defaults instead of ... (lines 722-723)

  row_limit: int = 0,
  row_offset: int = 0,

  All other overloads use ... for defaults (the correct convention for overloads). This is clearly a copy-paste miss.

7. __init__.py has no __all__

  The public API is defined by explicit imports (line 17-22) which is fine, but adding __all__ would make the public surface explicit for type checkers and documentation
  generators. Not critical but a small gap.

  8. Docstrings still reference "pandas dataframe" (pyfunctions.py:29, 43, 123, 139)

  The function signatures correctly use IntoDataFrameT (supporting both pandas and polars), but the docstrings say "pandas dataframe" everywhere. This is misleading for
  users who rely on docstrings.
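For reference on point 6, the `...` convention in overload stubs looks like this (a generic sketch with made-up signatures, not pyreadstat's actual read_por):

```python
from typing import overload

# Overload stubs use `...` for defaults; the concrete default values
# belong only on the implementation signature.
@overload
def read(path: str, row_limit: int = ..., row_offset: int = ...) -> dict: ...
@overload
def read(path: bytes, row_limit: int = ..., row_offset: int = ...) -> dict: ...
def read(path, row_limit: int = 0, row_offset: int = 0) -> dict:
    # real defaults live here, on the runtime implementation
    return {"path": path, "row_limit": row_limit, "row_offset": row_offset}

print(read("example.sav"))
```

Writing a literal default like `row_limit: int = 0` inside an `@overload` stub is accepted at runtime but goes against the stub convention, which is why linters flag it as a copy-paste miss.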

@nachomaiz
Author

nachomaiz commented Mar 31, 2026

Oh very cool! Thanks for checking and running this through Claude, and sorry for the back and forth. There must be something that changes enough in each combination of pandas/mypy/numpy typing to break some of these tests. I agree that they are a bit flaky because of that; they pass when I run them, which is a bit concerning.

I'll need to read through the rest properly & implement changes, though it looks like most are simple fixes due to copy-pasting mistakes (apologies).

On the "inf" row limit question, I'd also be surprised to get a float row limit, because rows can only be whole numbers. Getting a float out is actually logically impossible: every possible integer value is smaller than "inf", so the min-max clamp always returns the int. I think it's fine to have type hints only at the API boundaries; as long as the logic is consistent, it doesn't need to be fully typed within the function, so I'd recommend ignoring that. But if we want to be type correct there, I'd suggest setting the default in the pop call to an arbitrarily large number, like the maximum value of an unsigned long long (0xFFFFFFFFFFFFFFFF or 18_446_744_073_709_551_615, though probably overkill) or unsigned int (0xFFFFFFFF or 4_294_967_295) for the clamp operation, and keeping the type as int.
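To illustrate, that int-only clamp could look like this (clamp_numrows is a hypothetical helper for illustration, not a pyreadstat function):

```python
_MAX_ROWS = 0xFFFFFFFFFFFFFFFF  # max unsigned 64-bit value; any real row count is smaller

def clamp_numrows(numrows: int, row_offset: int, row_limit: int = _MAX_ROWS) -> int:
    # same clamp as in pyreadstat.py, but every operand stays an int,
    # so no int/float mixing is visible to the type checker
    return min(max(numrows - row_offset, 0), row_limit)

print(clamp_numrows(1000, 100))     # 900
print(clamp_numrows(50, 100))       # 0 (offset past the end clamps to zero)
print(clamp_numrows(1000, 0, 200))  # 200 (explicit row_limit wins)
```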

On the file-like question for worker, yes it should have been PathLike only, will correct soon.

And I'll read through the rest & implement changes soon as well.

@ofajardo
Collaborator

ofajardo commented Apr 1, 2026

hi,

Yes, I am also a bit concerned about the maintainability of the tests if they sometimes pass and sometimes not. Let's try to find something that is more robust. If you find the reason for the problem and the solution, please explain what was happening.

Regarding the row limit thing, let's leave it as is. Regarding the other Claude comments, check them, but it is an LLM, so please share your thoughts on why you are changing or not changing something.

I have one more question: in pyreadstat.py lines 34 to 44 (if TYPE_CHECKING), you try to import pandas and polars, and if the import fails you pass, which is cool, because the user may not have both of them installed. But on line 44 you then define the DataFrame type alias as either the polars or the pandas dataframe. If one of them failed to get imported, will this just default to the one that is present? Have you tested what happens if you have only one of them installed?
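As a side note, the runtime half of this question is safe by construction: an if TYPE_CHECKING block never executes outside a type checker, and string annotations are never evaluated at runtime, so a missing library cannot raise. A minimal demonstration (ncols is a hypothetical function, not from pyreadstat):

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    import pandas as pd  # seen only by type checkers; never imported at runtime

def ncols(df: "pd.DataFrame") -> int:
    # the string annotation is never evaluated at runtime, so no NameError
    # occurs even if pandas is not installed in the environment
    return len(df.columns)

print(TYPE_CHECKING)  # False
```

Which type the checker resolves the alias to when only one library's stubs are available is a separate question that depends on the checker's handling of the try/except.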

Thanks again for your great work, I think we are very close to finish!

@nachomaiz
Author

While I was fixing some of the issues above, I started testing a few things with the numpy output formats at runtime, and I noticed something:

>>> data, meta = pyreadstat.read_sav("test_data/basic/example.sav", output_format="dict")
>>> type(data["mychar"])
list  # Expected `numpy.ndarray` based on documentation

I started looking for where the creation of the column data container happens, and I found it here in _readstat_parser.pyx:

https://github.com/nachomaiz/pyreadstat/blob/60bde002c04dcffb415c3a7cf535fa4cb4fe944d/pyreadstat/_readstat_parser.pyx#L559-L573

So it seems the data is never stored as numpy arrays when "dict" is passed, only with "pandas". Then I see the data being read and written in data_container_to_dict and handle_value, with specific "pandas" paths doing numpy manipulation, but types are handled consistently with the passed output format.

This actually would simplify this typing issue with numpy; we can set the output type to dict[str, list[Any]] and make the type tests much simpler.

I'm not sure if there's a reason to create numpy arrays for pandas specifically; I'd expect lists to work too, and the data dictionary is not visible to the user. In any case, I'd imagine the docstrings and type hints should reflect this behavior with "dict" as the output format.

Before marking the outputs as list, I wanted to check if that sounds good to you, or if numpy is actually the output you want for "dict", in which case the code would need to be changed.

Author

@nachomaiz nachomaiz left a comment


Actually, sorry, one correction: read_file_multiprocessing does cast the column data to numpy:

https://github.com/nachomaiz/pyreadstat/blob/60bde002c04dcffb415c3a7cf535fa4cb4fe944d/pyreadstat/pyreadstat.py#L1178-L1182

This seems a bit inconsistent overall... I'd suggest picking one format and sticking with it throughout the API, so I'll wait for your thoughts before making any changes.
