
Apache Pig - getting the maximum and minimum values in a column

Apache Pig is fantastic for writing big-data processing jobs without having to write MapReduce code from scratch.

I recently encountered a problem where I needed to get the maximum and minimum values in a column of data. For example, say the data (in CSV format) looks like this:

Cat, 5
Mouse, 8
Dog, 4

This Pig Latin script reads in the data and finds the required values:

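-- CSVExcelStorage lives in Piggybank, so register the jar first (adjust the path for your installation).
REGISTER piggybank.jar;
animal_ages = LOAD 'data.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage() AS
(animal:chararray, age:int);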

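-- Keep only the column of interest.
ages = FOREACH animal_ages GENERATE age;
-- GROUP ... ALL puts every row into one group, so each aggregate below yields a single row.
ages_grp = GROUP ages ALL;
-- Project the age field out of the grouped bag before aggregating.
min_age = FOREACH ages_grp GENERATE MIN(ages.age) AS min_val;
max_age = FOREACH ages_grp GENERATE MAX(ages.age) AS max_val;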
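
If both values are needed together, the two aggregates can also be produced by a single FOREACH over the grouped data. The following is a minimal sketch of that variant, not a replacement for the script above. The relation names all_rows and age_range are made up for illustration, and the piggybank.jar path is an assumption that depends on your installation.

```pig
-- Sketch only: the jar path is an assumption; adjust it for your installation.
REGISTER piggybank.jar;

animal_ages = LOAD 'data.csv'
    USING org.apache.pig.piggybank.storage.CSVExcelStorage()
    AS (animal:chararray, age:int);

-- GROUP ... ALL collects every row into a single group.
all_rows = GROUP animal_ages ALL;

-- One FOREACH emits both aggregates as fields of the same tuple.
age_range = FOREACH all_rows GENERATE
    MIN(animal_ages.age) AS min_val,
    MAX(animal_ages.age) AS max_val;

DUMP age_range;
```

Keeping the two values in one relation means later steps only have one single-row relation to refer to, at the cost of slightly longer field references such as age_range.min_val.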

Because min_age and max_age each contain exactly one row, the minimum and maximum ages can be referenced as scalars in later expressions:

min_age.min_val
max_age.max_val
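
As an illustration of where such scalars are useful, the sketch below compares each animal's age against the overall minimum and maximum. It repeats the script above so that it can be run on its own. The relation age_offsets and the fields years_above_min and years_below_max are hypothetical names chosen for this example, and the piggybank.jar path is again an assumption.

```pig
-- Sketch only: the jar path is an assumption; adjust it for your installation.
REGISTER piggybank.jar;

animal_ages = LOAD 'data.csv'
    USING org.apache.pig.piggybank.storage.CSVExcelStorage()
    AS (animal:chararray, age:int);

ages = FOREACH animal_ages GENERATE age;
ages_grp = GROUP ages ALL;
min_age = FOREACH ages_grp GENERATE MIN(ages.age) AS min_val;
max_age = FOREACH ages_grp GENERATE MAX(ages.age) AS max_val;

-- min_age and max_age are single-row relations, so Pig lets their
-- fields be used as scalar values inside another relation's FOREACH.
age_offsets = FOREACH animal_ages GENERATE
    animal,
    age - min_age.min_val AS years_above_min,
    max_age.max_val - age AS years_below_max;

DUMP age_offsets;
```

Note that a scalar reference only works when the relation it points at has a single row. If it has more than one, Pig fails at runtime, which is why the GROUP ... ALL step matters here.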
