Showing posts from October, 2014

Apache Pig - getting the maximum and minimum values in a column

Apache Pig is fantastic for writing Big Data algorithms without having to write MapReduce jobs from scratch. I recently encountered a problem where I needed to get the maximum and minimum values in a column of data. For example, say the data (in CSV format) looks like this:

    Cat, 5
    Mouse, 8
    Dog, 4

This Pig Latin script reads in the data and finds the required values:

    -- CSVExcelStorage is part of Piggybank; REGISTER piggybank.jar first
    -- if it is not already on the classpath.
    animal_ages = LOAD 'data.csv'
        USING org.apache.pig.piggybank.storage.CSVExcelStorage()
        AS (animal:chararray, age:int);

    -- Keep only the column of interest.
    ages = FOREACH animal_ages GENERATE age;

    -- GROUP ... ALL puts every row into a single bag, so the aggregates see the whole column.
    ages_grp = GROUP ages ALL;

    min_age = FOREACH ages_grp GENERATE MIN(ages) AS min_val;
    max_age = FOREACH ages_grp GENERATE MAX(ages) AS max_val;

Because min_age and max_age each contain exactly one row, the minimum and maximum ages can be used as scalars in later statements:

    min_age.min_val
    max_age.max_val
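To show what using these values as scalars might look like in practice, here is a minimal sketch rather than a definitive recipe. It computes both extremes in one FOREACH and then refers to them from other statements. The relation names all_rows, stats, oldest and scaled are made up for illustration, and the REGISTER path is an assumption about where piggybank.jar lives on your system.

```pig
-- Assumption: piggybank.jar is in the working directory; adjust the path for your install.
REGISTER piggybank.jar;

animal_ages = LOAD 'data.csv'
    USING org.apache.pig.piggybank.storage.CSVExcelStorage()
    AS (animal:chararray, age:int);

-- Group every row into a single bag so the aggregates see the whole column.
all_rows = GROUP animal_ages ALL;

-- Hypothetical relation "stats": one row holding both extremes.
stats = FOREACH all_rows GENERATE
    MIN(animal_ages.age) AS min_val,
    MAX(animal_ages.age) AS max_val;

-- Scalar projection: stats has exactly one row, so stats.max_val acts like a constant.
oldest = FILTER animal_ages BY age == stats.max_val;

-- Rescale ages to the range [0, 1]; the cast avoids integer division.
scaled = FOREACH animal_ages GENERATE
    animal,
    (double)(age - stats.min_val) / (stats.max_val - stats.min_val) AS scaled_age;

DUMP oldest;
DUMP scaled;
```

The scalar syntax only works when the relation being referenced has a single row; if it has more, Pig fails at runtime. Grouping with ALL before aggregating is what guarantees that here.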