Apache Pig is fantastic for writing Big Data algorithms without having to write MapReduce jobs from scratch.
I recently encountered a problem where I needed to get the maximum and minimum values in a column of data. For example, say the data (in CSV format) looks like this:
Cat, 5
Mouse, 8
Dog, 4
This Pig Latin script reads in the data and finds the required values:
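-- CSVExcelStorage lives in Piggybank, so register the jar first (adjust the path for your installation)
REGISTER piggybank.jar;
animal_ages = LOAD 'data.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage() AS
(animal:chararray, age:int);
-- Keep only the column of interest
ages = FOREACH animal_ages GENERATE age;
-- GROUP ... ALL collects every row into a single bag
ages_grp = GROUP ages ALL;
min_age = FOREACH ages_grp GENERATE MIN(ages.age) AS min_val;
max_age = FOREACH ages_grp GENERATE MAX(ages.age) AS max_val;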
Each result is a single-row relation, so the minimum and maximum ages can be used as scalars in later statements:
min_age.min_val
max_age.max_val
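As a rough sketch of what using these scalars might look like, the statements below reuse the animal_ages, min_age and max_age relations from the script above. The relation names oldest and scaled, and the field name scaled_age, are made up for illustration. Pig only allows a relation to be used as a scalar when it contains exactly one row, which holds here because of GROUP ... ALL.

```pig
-- Keep only the rows whose age equals the overall maximum
oldest = FILTER animal_ages BY age == max_age.max_val;

-- Rescale each age relative to the overall minimum and maximum;
-- the cast to double avoids integer division
scaled = FOREACH animal_ages GENERATE
    animal,
    (double)(age - min_age.min_val) / (double)(max_age.max_val - min_age.min_val) AS scaled_age;

DUMP oldest;
DUMP scaled;
```

If every row has the same age, the denominator in the second statement is zero, so that case would need handling separately.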