Skip to main content

Using Pandas dataframes to perform a lookup

Recently, I've been looking at a problem where data is joined with enrichment data stored in a Pandas dataframe. For each record being processed, a lookup is performed and the single record from the enrichment data selected based on a key. Essentially, this is a SQL join operation, but one of the datasets couldn't fit into memory.
Having had experience with dataframes being slow (particularly when iterating through rows), I investigated whether the lookup would be faster if a enrichment data was stored in a Python dict as opposed to a dataframe.
In my experiments, I was able to get a 70 times speed improvement using a dict over a dataframe, even when indexing the dataframe.
The Python code used in the experiments is here:
# Parameters
num_entries = 1000000 # Number of rows of data to generate
num_columns = 10 # Number of columns in each row
num_lookups = 10000 # Number of test lookups to perform
# Import the required libraries
from faker import Faker
import pandas as pd
import random
import timeit
fake = Faker()
def generate_value():
return fake.sentence()
def generate_row(num_columns):
"""Generate a single row of data."""
return {f"column-{idx}": generate_value() for idx in range(num_columns)}
def generate_id(idx):
"""Generate an ID (used in the lookup)."""
return f"ID-{idx}"
def generate_dataframe(num_entries, num_columns):
"""Generate dataset."""
rows_for_df = []
rows = {}
for idx in range(num_entries):
data = generate_row(num_columns)
rows[generate_id(idx)] = data
data["id"] = generate_id(idx)
rows_for_df.append(data)
return pd.DataFrame(rows_for_df), rows
# Generate the dataset
df, rows = generate_dataframe(num_entries, num_columns)
# Index the dataframe by the 'id' column for a faster lookup
df = df.set_index("id")
print(f"Generated {len(rows)} rows of data")
def perform_lookup_on_dicts(rows, num_lookups, max_id):
for _ in range(num_lookups):
idx = random.randint(0, max_id-1)
data = rows[generate_id(idx)]
def perform_lookup_on_dataframe(df, num_lookups, max_id):
for _ in range(num_lookups):
idx = random.randint(0, max_id-1)
data = df.loc[generate_id(idx)]
%timeit perform_lookup_on_dicts(rows, num_lookups, num_entries)
%timeit perform_lookup_on_dataframe(df, num_lookups, num_entries)

Comments

Popular posts from this blog

Python libraries on an air-gapped machine

The Problem Development and test clusters may be air-gapped so that client data or sensitive software under development is less likely to be leaked. This can cause problems when trying to install libraries, e.g. for Python-based software, especially if the cluster has an old version of a Linux OS installed. The Solution Setup Python's pip on the remote machine On an Internet-enabled machine, download the Wheel file for pip from  https://pypi.python.org/pypi/pip , such as pip-9.0.1-py2.py3-none-any.whl. Copy the Wheel file (e.g. pip-9.0.1-py2.py3-none-any.whl) to the remote machine, e.g. using scp: scp pip-9.0.1-py2.py3-none-any.whl user@host:/path On the remote machine: python pip-9.0.1-py2.py3-none-any.whl/pip install --no-index pip-9.0.1-py2.py3-none-any.whl pip --version # this should display the version number if correctly installed Download the required libraries On an Internet-enabled machine, download the library and its d...

Getting started with Kafka and Kafka Tool

This provides a quick introduction to setting up a local Kafka instance and using Kafka Tool to view the messages. The initial part of the Kafka tutorial is adapted slightly from  http://kafka.apache.org/quickstart.html . The following are steps for: Downloading and unpacking Apache Kafka in Linux Starting the Zookeeper and Kafka servers Creating a new topic Adding and viewing messages using the command line tools Use Kafka Tool to view the messages in a topic Step 1: Download Kafka cd ./Downloads/ wget http://mirror.catn.com/pub/apache/kafka/0.10.0.0/kafka_2.11-0.10.0.0.tgz tar -zxf kafka_2.11-0.10.0.0.tgz cd kafka_2.11-0.10.0.0 Step 2: Start the Zookeeper and Kafka servers Check the Kafka port in zookeeper.properties by looking at clientPort (typically 2181) in config/zookeeper.properties Start Zookeeper with: bin/zookeeper-server-start.sh config/zookeeper.properties and then start Kafka: bin/kafka-server-start.sh config/server.properties ...

Python on a Windows Mobile PDA

I've been doing development work in Python and as an experiment I thought I'd have a go at getting Python installed on my Windows Mobile PDA . It's an HP iPaq running Windows Mobile 6 (CE OS 5.2.1.1616). The Python CE Wiki is located at: http://pythonce.sourceforge.net/Wikka/HomePage I installed PythonCE-2.5-20061219-setup.exe from sourceforget.net and amazingly it worked first time! After clicking Start -> Programs -> Python I was able to verify that it worked by typing >>> print 'Hello World' Hello World Using Ilium Software Screen Capture software I was able to get a screen shot very easily. It can be downloaded from http://www.mobiletopsoft.com/pocket-pc/download-ilium-software-screen-capture-free-1-1.html