Recently, I've been looking at a problem where records are joined with enrichment data stored in a Pandas dataframe. For each record being processed, a lookup is performed and a single record is selected from the enrichment data based on a key. Essentially, this is a SQL join operation, but one of the datasets couldn't fit into memory.
Having had experience with dataframes being slow (particularly when iterating through rows), I investigated whether the lookup would be faster if the enrichment data was stored in a Python dict as opposed to a dataframe.
In my experiments, I was able to get a 70 times speed improvement using a dict over a dataframe, even when indexing the dataframe.
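The comparison can be sketched along these lines; this is a minimal benchmark with synthetic keys and values (the original experiment's data and exact code are not shown here), timing a keyed lookup against an indexed dataframe versus a plain dict:

```python
import timeit

import pandas as pd

# Synthetic enrichment data: 100,000 keyed records (hypothetical stand-in
# for the real dataset, which is not shown in the post).
n = 100_000
df = pd.DataFrame({"key": range(n), "value": [f"v{i}" for i in range(n)]})

# Index the dataframe on the key column so .at lookups avoid a full scan.
indexed = df.set_index("key")

# Equivalent dict keyed on the same column.
lookup = dict(zip(df["key"], df["value"]))

def df_lookup(k):
    # Scalar access on an indexed dataframe.
    return indexed.at[k, "value"]

def dict_lookup(k):
    # Plain hash-table lookup.
    return lookup[k]

df_time = timeit.timeit(lambda: df_lookup(12345), number=10_000)
dict_time = timeit.timeit(lambda: dict_lookup(12345), number=10_000)
print(f"indexed dataframe: {df_time:.3f}s  dict: {dict_time:.3f}s  "
      f"speedup: {df_time / dict_time:.0f}x")
```

Even with the dataframe indexed, each `.at` access goes through Pandas' indexing machinery, while the dict lookup is a single hash-table probe, which is where the large speedup comes from.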
The Python code used in the experiments is here: