
Apache Pig - getting the maximum and minimum values in a column

Apache Pig is fantastic for writing Big Data algorithms without having to write Map-Reduce jobs from scratch.

I recently encountered a problem where I needed to get the maximum and minimum values in a column of data. For example, say the data (in CSV format) looks like this:

Cat, 5
Mouse, 8
Dog, 4

This Pig Latin script reads in the data and finds the required values:

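-- CSVExcelStorage lives in Piggybank, so the jar must be registered first
-- (adjust the path to match your installation).
REGISTER piggybank.jar;

animal_ages = LOAD 'data.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage() AS
(animal:chararray, age:int);

-- Keep only the age column, then collect every row into a single group.
ages = FOREACH animal_ages GENERATE age;
ages_grp = GROUP ages ALL;
-- Each result is a relation with exactly one row and one field.
min_age = FOREACH ages_grp GENERATE MIN(ages.age) AS min_val;
max_age = FOREACH ages_grp GENERATE MAX(ages.age) AS max_val;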
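
As a possible variation (not part of the original script), both aggregates could be computed in a single FOREACH over the same group, giving one single-row relation with two fields. The relation name age_stats is made up for illustration.

```pig
-- Hypothetical variation: one single-row relation holding both values.
age_stats = FOREACH ages_grp GENERATE
    MIN(ages.age) AS min_val,
    MAX(ages.age) AS max_val;

-- Print the result to the console to check it.
DUMP age_stats;
```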

Because min_age and max_age each contain exactly one row, their fields can be referenced as scalars in later expressions:

min_age.min_val
max_age.max_val
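
As a sketch of how these scalars might be used, the following rescales each age to the range 0 to 1. The names scaled_ages and scaled_age are made up for illustration, and the example assumes the input is non-empty, so that min_age and max_age each hold exactly one row.

```pig
-- Hypothetical example: min-max scaling of age using the scalars above.
-- The casts to double avoid integer division.
scaled_ages = FOREACH animal_ages GENERATE
    animal,
    age,
    (double)(age - min_age.min_val) / (double)(max_age.max_val - min_age.min_val) AS scaled_age;
```

Pig requires a relation used as a scalar to contain a single row; if it holds more than one, the job fails at runtime.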
