Description
This might be related to #193, but I tried to investigate the situation and found some interesting bits.
I have a 77 GB sas7bdat file (see below for some details about the file). Since that does not fit in my server's memory, I tried pyreadstat.read_file_in_chunks(), which worked but... took 25 hours to complete (using the default chunksize of 100_000).
First, reading only the metadata of that file (using empty_df, meta = pyreadstat.read_sas7bdat(src_path, metadataonly=True)) took more than 7 minutes! I don't think it is supposed to be that slow, but I guess this is a readstat problem rather than a pyreadstat problem.
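For reference, this is essentially the call I timed (src_path is a placeholder for the actual file path):

```python
import time

import pyreadstat

src_path = "my_big_file.sas7bdat"  # placeholder for the real 77 GB file

start = time.perf_counter()
# metadataonly=True returns an empty dataframe plus the metadata container
empty_df, meta = pyreadstat.read_sas7bdat(src_path, metadataonly=True)
print(f"metadata-only read: {time.perf_counter() - start:.1f} s")
print(meta.number_rows, "rows,", meta.number_columns, "columns")
```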
However, after investigation, I think that the way chunking is implemented in pyreadstat (by repeatedly calling the "normal" read function) also makes things a lot slower than they could be, especially with "small" chunk sizes.
Even when reading the file repeatedly (so I guess Windows caches it), there is a fixed overhead of around 10 seconds to read a single chunk, and I suppose it is related to reading the metadata over and over again for each chunk. Here are some timings for reading the first chunk with varying chunk sizes (the measurement loop is sketched after the list). As you can see, if you subtract approximately 10 seconds from each, the numbers are almost perfectly linear, as they should be.
Chunk size -> time:
- 100_000 -> 10.96 seconds (9124 rows per second)
- 200_000 -> 12.97 seconds (15417 rows per second)
- 400_000 -> 14.67 seconds (27274 rows per second)
- 800_000 -> 19.30 seconds (41455 rows per second)
- 1_000_000 -> 20.99 seconds (47639 rows per second)
- 2_000_000 -> 30.20 seconds (66232 rows per second)
- 4_000_000 -> 51.29 seconds (77991 rows per second)
- 8_000_000 -> 87.97 seconds (90943 rows per second)
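The numbers above come from a loop along these lines, timing only the first chunk yielded by read_file_in_chunks (the path is a placeholder, and the exact timings will of course vary):

```python
import time

import pyreadstat

src_path = "my_big_file.sas7bdat"  # placeholder path

for chunksize in (100_000, 200_000, 400_000, 800_000,
                  1_000_000, 2_000_000, 4_000_000, 8_000_000):
    reader = pyreadstat.read_file_in_chunks(
        pyreadstat.read_sas7bdat, src_path, chunksize=chunksize
    )
    start = time.perf_counter()
    df, meta = next(reader)  # only time the first chunk
    elapsed = time.perf_counter() - start
    print(f"{chunksize:_} -> {elapsed:.2f} seconds "
          f"({len(df) / elapsed:,.0f} rows per second)")
```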
I also noticed that reading gets slower and slower as you increase the row offset. I suppose there is some overhead to move to a given row offset in the file. This probably cannot be helped on pyreadstat's side of things, but it might be useful to investigate nonetheless (the measurement code is sketched after the list).
- 5_000_000-row chunks
  - first chunk done in 52.19 seconds (95799 rows per second)
  - last chunk done in 1 minute 31.37 seconds (54722 rows per second)
- 10_000_000-row chunks
  - first chunk done in 1 minute 34.19 seconds (106168 rows per second)
  - last chunk done in 2 minutes 30.06 seconds (66642 rows per second)
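The first/last figures were obtained by reading a single chunk at a given offset via row_offset/row_limit, roughly like this (the path is a placeholder, and it assumes number_rows is not an exact multiple of chunksize):

```python
import time

import pyreadstat

src_path = "my_big_file.sas7bdat"  # placeholder path
chunksize = 5_000_000

_, meta = pyreadstat.read_sas7bdat(src_path, metadataonly=True)
last_offset = (meta.number_rows // chunksize) * chunksize  # start of last chunk

for label, offset in (("first", 0), ("last", last_offset)):
    start = time.perf_counter()
    df, _ = pyreadstat.read_sas7bdat(src_path, row_offset=offset,
                                     row_limit=chunksize)
    elapsed = time.perf_counter() - start
    print(f"{label} chunk done in {elapsed:.2f} seconds "
          f"({len(df) / elapsed:,.0f} rows per second)")
```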
In any case, I think it would help a lot (and work around both problems) if the chunking approach were changed so that the "setup overhead" (reading the metadata and whatever else is done per file) is not paid over and over: read the chunks one after another instead of closing the file and re-seeking to the "current" chunk each time.
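To make the suggestion concrete: as far as I can tell, chunked reading currently behaves, at the user level, like the loop below, where every iteration re-opens the file, re-parses the metadata and skips past all previously read rows before producing data. A streaming implementation that keeps the parser state alive between chunks would pay that setup cost only once. (This is just an illustration of the behaviour I observe, not pyreadstat's actual internals.)

```python
import pyreadstat

src_path = "my_big_file.sas7bdat"  # placeholder path
chunksize = 100_000

offset = 0
while True:
    # every iteration re-opens the file, re-reads the metadata and
    # seeks past `offset` rows before yielding any data, so the ~10 s
    # fixed overhead observed above is paid for every chunk
    df, meta = pyreadstat.read_sas7bdat(src_path,
                                        row_offset=offset,
                                        row_limit=chunksize)
    if len(df) == 0:
        break
    # ... process df ...
    offset += chunksize
```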
I don't think this is relevant, but I am on 64-bit Windows using Python 3.12 from Anaconda, and pyreadstat was installed with pip.
Relevant metadata about my file:
file_encoding = WINDOWS-1252
file_format = sas7bdat
number_columns = 32
number_rows = 385169714
readstat_variable_types = {'year': 'double', 'ID_DEMO_C': 'string', 'CD_SEX': 'string', 'MS_AGE': 'double', 'CD_REFNIS': 'double', 'arr': 'double', 'reg': 'string', 'CD_NATLTY': 'string', 'natgroup': 'string', 'CD_CNTRY_BTH': 'string', 'CD_FST_NATLTY': 'string', 'HH_TYPE_LIPRO': 'double', 'HH_POS_LIPRO': 'double', 'DT_REFDATE': 'double', 'DT_BTH': 'double', 'CD_CIV': 'string', 'CD_REG': 'string', 'ID_DEMO_HH_HD_C': 'string', 'ID_HH_MEMBR': 'double', 'CD_REL_HH_HD': 'string', 'ID_HH_C': 'string', 'ID_DEMO_PAR_1_C': 'string', 'ID_DEMO_PAR_2_C': 'string', 'CD_SEX_PAR_1': 'string', 'CD_SEX_PAR_2': 'string', 'CD_FST_NATLTY_PAR_1': 'string', 'CD_FST_NATLTY_PAR_2': 'string', 'CD_NATLTY_PAR_1': 'string', 'CD_NATLTY_PAR_2': 'string', 'CD_DSCNT_NATLTY': 'string', 'CD_NAT_DSCNT': 'double', 'CD_DSCNT': 'string'}
variable_storage_width = {'year': 4, 'ID_DEMO_C': 16, 'CD_SEX': 1, 'MS_AGE': 3, 'CD_REFNIS': 8, 'arr': 3, 'reg': 6, 'CD_NATLTY': 3, 'natgroup': 8, 'CD_CNTRY_BTH': 3, 'CD_FST_NATLTY': 3, 'HH_TYPE_LIPRO': 3, 'HH_POS_LIPRO': 3, 'DT_REFDATE': 8, 'DT_BTH': 8, 'CD_CIV': 1, 'CD_REG': 1, 'ID_DEMO_HH_HD_C': 16, 'ID_HH_MEMBR': 8, 'CD_REL_HH_HD': 2, 'ID_HH_C': 16, 'ID_DEMO_PAR_1_C': 16, 'ID_DEMO_PAR_2_C': 16, 'CD_SEX_PAR_1': 1, 'CD_SEX_PAR_2': 1, 'CD_FST_NATLTY_PAR_1': 3, 'CD_FST_NATLTY_PAR_2': 3, 'CD_NATLTY_PAR_1': 3, 'CD_NATLTY_PAR_2': 3, 'CD_DSCNT_NATLTY': 3, 'CD_NAT_DSCNT': 3, 'CD_DSCNT': 25}