Recently, I've been looking at a problem where a stream of records is joined with enrichment data stored in a Pandas dataframe. For each record being processed, a lookup is performed and a single record is selected from the enrichment data based on a key. Essentially, this is a SQL join operation, performed record by record because one of the datasets couldn't fit into memory.
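To make the access pattern concrete, here is a minimal sketch of that per-record join; the record stream, the "key" field, and the enrichment mapping are hypothetical stand-ins, not the actual pipeline:

def enrich_records(records, enrichment):
    """Join each incoming record with its enrichment row, selected by key."""
    for record in records:
        extra = enrichment[record["key"]]  # one keyed lookup per record, like a SQL join
        yield {**record, **extra}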
Having had experience with dataframes being slow (particularly when iterating through rows), I investigated whether the lookup would be faster if the enrichment data was stored in a Python dict as opposed to a dataframe.

In my experiments, I was able to get a 70 times speed improvement using a dict over a dataframe, even when indexing the dataframe.
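As a practical note, pandas can turn an indexed dataframe into exactly this kind of dict with to_dict; a minimal sketch, assuming df is already indexed by the lookup key:

rows = df.to_dict(orient="index")  # {index value -> {column name -> value}}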
The Python code used in the experiments is here:
# Parameters
num_entries = 1000000  # Number of rows of data to generate
num_columns = 10       # Number of columns in each row
num_lookups = 10000    # Number of test lookups to perform

# Import the required libraries
from faker import Faker
import pandas as pd
import random
import timeit

fake = Faker()

def generate_value():
    """Generate a fake value for a single cell."""
    return fake.sentence()

def generate_row(num_columns):
    """Generate a single row of data."""
    return {f"column-{idx}": generate_value() for idx in range(num_columns)}

def generate_id(idx):
    """Generate an ID (used in the lookup)."""
    return f"ID-{idx}"

def generate_dataframe(num_entries, num_columns):
    """Generate the dataset as both a dataframe and a dict keyed by ID."""
    rows_for_df = []
    rows = {}
    for idx in range(num_entries):
        data = generate_row(num_columns)
        rows[generate_id(idx)] = data
        data["id"] = generate_id(idx)
        rows_for_df.append(data)
    return pd.DataFrame(rows_for_df), rows

# Generate the dataset
df, rows = generate_dataframe(num_entries, num_columns)

# Index the dataframe by the 'id' column for a faster lookup
df = df.set_index("id")

print(f"Generated {len(rows)} rows of data")

def perform_lookup_on_dicts(rows, num_lookups, max_id):
    """Perform num_lookups random lookups against the dict."""
    for _ in range(num_lookups):
        idx = random.randint(0, max_id - 1)
        data = rows[generate_id(idx)]

def perform_lookup_on_dataframe(df, num_lookups, max_id):
    """Perform num_lookups random lookups against the indexed dataframe."""
    for _ in range(num_lookups):
        idx = random.randint(0, max_id - 1)
        data = df.loc[generate_id(idx)]

# Time both approaches (IPython/Jupyter magics)
%timeit perform_lookup_on_dicts(rows, num_lookups, num_entries)
%timeit perform_lookup_on_dataframe(df, num_lookups, num_entries)
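The %timeit lines above assume an IPython or Jupyter session. Outside a notebook, the already-imported timeit module can drive the same comparison; a minimal sketch (the repetition count of 10 is an arbitrary choice):

dict_time = timeit.timeit(
    lambda: perform_lookup_on_dicts(rows, num_lookups, num_entries), number=10)
df_time = timeit.timeit(
    lambda: perform_lookup_on_dataframe(df, num_lookups, num_entries), number=10)
print(f"dict:      {dict_time / 10:.4f} s per run")
print(f"dataframe: {df_time / 10:.4f} s per run")
print(f"speedup:   {df_time / dict_time:.0f}x")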