def(x: Experience):Code

Posts

Using Pandas dataframes to perform a lookup

Recently, I've been looking at a problem where data is joined with enrichment data stored in a Pandas dataframe. For each record being processed, a lookup is performed and a single record is selected from the enrichment data based on a key. Essentially, this is a SQL join operation, but one of the datasets couldn't fit into memory. Having had experience with dataframes being slow (particularly when iterating through rows), I investigated whether the lookup would be faster if the enrichment data was stored in a Python dict as opposed to a dataframe. In my experiments, I was able to get a 70-fold speed improvement using a dict over a dataframe, even when indexing the dataframe. The Python code used in the experiments is here:
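The original experiment code is not reproduced in this excerpt. As a rough illustration only, the two lookup styles being compared might look like the sketch below. The column names (`key`, `country`, `score`) and the helper functions are hypothetical, and the sketch assumes each key appears exactly once in the enrichment data.

```python
import pandas as pd

# Hypothetical enrichment data; the column names are illustrative only.
enrichment = pd.DataFrame(
    {"key": ["a", "b", "c"], "country": ["UK", "FR", "DE"], "score": [1, 2, 3]}
)

# Option 1: index the dataframe by key and select one row per lookup.
indexed = enrichment.set_index("key")


def lookup_dataframe(key):
    """Return the enrichment record for `key` via the indexed dataframe."""
    row = indexed.loc[key]  # a Series, because the key is unique
    return row.to_dict()


# Option 2: convert the dataframe once into a dict of dicts keyed by `key`.
as_dict = indexed.to_dict(orient="index")


def lookup_dict(key):
    """Return the enrichment record for `key` via a plain dict."""
    return as_dict[key]


# Both styles return the same record; only the lookup cost differs.
print(lookup_dataframe("b") == lookup_dict("b"))
```

To compare the two on real data, each function could be timed with the standard library's `timeit` module over a representative set of keys. The speed-up will depend on the size of the data and the pandas version.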

Neo4j 4.1 in Docker

Getting Neo4j 4.1.0 to work in Docker has been a real struggle! The docker-compose file was:

    version: "3"
    services:
      neo4j:
        image: neo4j
        container_name: neo4j
        ports:
          - 7474:7474
          - 7687:7687
        environment:
          - "NEO4J_AUTH=none"

Note that authentication has been turned off, so just log in with a blank username and password.

The browser kept returning:

    WebSocket connection failure. Due to security constraints in your web browser, the reason for the failure is not available to this Neo4j Driver. Please use your browsers development console to determine the root cause of the failure. Common reasons include the database being unavailable, using the wrong connection URL or temporary network problems. If you have enabled encryption, ensure your browser is configured to trust the certificate Neo4j is configured to use. WebSocket `readyState` is: 3

To solve this, when you log in, change neo4j:// to bolt:// in the connection URL.

Python libraries on an air-gapped machine

The Problem

Development and test clusters may be air-gapped so that client data or sensitive software under development is less likely to be leaked. This can cause problems when trying to install libraries, e.g. for Python-based software, especially if the cluster has an old version of a Linux OS installed.

The Solution

Set up Python's pip on the remote machine

On an Internet-enabled machine, download the Wheel file for pip from https://pypi.python.org/pypi/pip , such as pip-9.0.1-py2.py3-none-any.whl.

Copy the Wheel file to the remote machine, e.g. using scp:

    scp pip-9.0.1-py2.py3-none-any.whl user@host:/path

On the remote machine:

    python pip-9.0.1-py2.py3-none-any.whl/pip install --no-index pip-9.0.1-py2.py3-none-any.whl
    pip --version # this should display the version number if correctly installed

Download the required libraries

On an Internet-enabled machine, download the library and its d...

Getting started with Kafka and Kafka Tool

This provides a quick introduction to setting up a local Kafka instance and using Kafka Tool to view the messages. The initial part of the Kafka tutorial is adapted slightly from http://kafka.apache.org/quickstart.html .

The following are steps for:

- Downloading and unpacking Apache Kafka in Linux
- Starting the Zookeeper and Kafka servers
- Creating a new topic
- Adding and viewing messages using the command line tools
- Using Kafka Tool to view the messages in a topic

Step 1: Download Kafka

    cd ./Downloads/
    wget http://mirror.catn.com/pub/apache/kafka/0.10.0.0/kafka_2.11-0.10.0.0.tgz
    tar -zxf kafka_2.11-0.10.0.0.tgz
    cd kafka_2.11-0.10.0.0

Step 2: Start the Zookeeper and Kafka servers

Check the Zookeeper port by looking at clientPort (typically 2181) in config/zookeeper.properties.

Start Zookeeper with:

    bin/zookeeper-server-start.sh config/zookeeper.properties

and then start Kafka:

    bin/kafka-server-start.sh config/server.properties

...

Generating nullable objects in Scala

ScalaCheck has completely changed how I approach testing. In a recent problem, I needed to be able to easily generate objects that could be null. Here's how I approached it:

Apache Pig - getting the maximum and minimum values in a column

Apache Pig is fantastic for writing Big Data algorithms without having to write Map-Reduce jobs from scratch. I recently encountered a problem where I needed to get the maximum and minimum values in a column of data. For example, say the data (in CSV format) looks like this:

    Cat, 5
    Mouse, 8
    Dog, 4

This Pig Latin script reads in the data and finds the required values:

    animal_ages = LOAD 'data.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage() AS (animal:chararray, age:int);
    ages = FOREACH animal_ages GENERATE age;
    ages_grp = GROUP ages ALL;
    min_age = FOREACH ages_grp GENERATE MIN(ages) as min_val;
    max_age = FOREACH ages_grp GENERATE MAX(ages) as max_val;

The minimum and maximum ages can be used as scalars:

    min_age.min_val
    max_age.max_val
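As a sanity check on the sample data, the same minimum and maximum can be computed in a few lines of plain Python. This is only a small-scale cross-check, not a replacement for the Pig script. The inline `data` string stands in for `data.csv`, and the names `min_val` and `max_val` mirror the aliases used in the script.

```python
import csv
import io

# Inline stand-in for data.csv, using the three sample rows from the post.
data = "Cat, 5\nMouse, 8\nDog, 4\n"

# skipinitialspace drops the blank that follows each comma in the sample.
reader = csv.reader(io.StringIO(data), skipinitialspace=True)

# Keep only the age column, as the `ages` relation does in the Pig script.
ages = [int(age) for _animal, age in reader]

min_val = min(ages)
max_val = max(ages)
print(min_val, max_val)  # prints: 4 8
```

For this sample, the Pig aliases min_age.min_val and max_age.max_val should therefore hold 4 and 8 respectively.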