Microsoft® Big Data Solutions by Adam Jorgensen & James Rowland-Jones & John Welch & Dan Clark & Christopher Price & Brian Mitchell

Author: Adam Jorgensen & James Rowland-Jones & John Welch & Dan Clark & Christopher Price & Brian Mitchell
Language: eng
Format: mobi
ISBN: 9781118729557
Published: 2014-03-03T00:00:00+00:00


Building Your Own UDFs for Pig

As mentioned earlier, writing your own UDF is not trivial unless you are an experienced Java programmer. However, if you have experience in another object-oriented programming language such as C#, you should be able to transition to writing UDFs in Java without too much difficulty. To make things easier, you may want to download and install a Java integrated development environment (IDE) such as Eclipse (http://www.eclipse.org/). If you are used to working in Visual Studio, you should be comfortable developing in Eclipse.

You can create several types of UDFs, depending on the functionality. The most common type is the eval function. An eval function accepts a tuple as input, performs some processing on it, and returns the result. Eval functions are typically used in conjunction with a FOREACH statement in Pig Latin. For example, the following script calls a custom UDF to convert string values to lowercase:

Register C:\hdp\hadoop\pig-0.11.0.1.3.0.0-0380\SampleUDF.jar;
Define lcase com.BigData.hadoop.pig.SampleUDF.Lower;
FlightData = LOAD '/user/test/FlightPerformance.csv' using PigStorage(',')
    as (flight_date:chararray, airline_cd:int, airport_cd:chararray,
        delay:int, dep_time:int);
Lower = FOREACH FlightData GENERATE lcase(airport_cd);

To create the UDF, you first add a reference to the pig.jar file. After doing so, you need to create a class that extends the EvalFunc class, which is the base class for all eval functions. The import statements at the top of the file indicate the classes you are going to use from the referenced jar files:

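// The package must match the name used in the script's Define statement.
package com.BigData.hadoop.pig.SampleUDF;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class Lower extends EvalFunc<String> {
}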

The next step is to add an exec function that implements the processing. It has an input parameter of a tuple and an output of a string:

public String exec(Tuple arg0) throws IOException {
    if (arg0 == null || arg0.size() == 0)
        return null;
    try {
        String str = (String)arg0.get(0);
        return str.toLowerCase();
    }
    catch(Exception e) {
        throw new IOException("Caught exception processing input row ", e);
    }
}

The first part of the code checks that the input tuple is not null or empty, and returns null if it is. The try block then casts the first field to a string, converts it to lowercase, and returns it. If an error occurs in the try block, the catch block wraps it in an IOException and throws it to the caller.
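If you want to try this logic before building a jar, the following is a minimal stand-alone sketch, not the UDF itself. It assumes a plain java.util.List in place of Pig's Tuple so that it runs without pig.jar on the classpath; the class name LowerSketch and the method name lower are hypothetical.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Stand-alone sketch: a List stands in for Pig's Tuple (an assumption made
// only so the logic can run without pig.jar).
class LowerSketch {

    // Mirrors the shape of exec(): null or empty input yields null, and any
    // failure is rethrown as an IOException with the original cause attached.
    static String lower(List<?> fields) throws IOException {
        if (fields == null || fields.isEmpty()) {
            return null;
        }
        try {
            String str = (String) fields.get(0);
            return str.toLowerCase();
        } catch (Exception e) {
            throw new IOException("Caught exception processing input row ", e);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(lower(Arrays.asList("ORD")));   // prints: ord
        System.out.println(lower(null));                   // prints: null
        try {
            lower(Arrays.asList(42));                      // first field is not a String
        } catch (IOException e) {
            // The cast failure is available as the cause of the IOException.
            System.out.println(e.getCause().getClass().getSimpleName()); // prints: ClassCastException
        }
    }
}
```

Wrapping the failure rather than swallowing it keeps the original exception available to whoever inspects the error, which is the same choice the exec function makes.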

Next, you need to build the class and export it to a jar file. Place the jar file in the Pig directory, and you are ready to use it in your scripts.

Another common type of function is the filter function. Filter functions are eval functions that return a Boolean result. For example, the isPositive function is used here to filter out rows whose integer delay value is zero or negative:

Register C:\hdp\hadoop\pig-0.11.0.1.3.0.0-0380\SampleUDF.jar;
Define isPos com.BigData.hadoop.pig.SampleUDF.isPositive;
FlightData = LOAD '/user/test/FlightPerformance.csv' using PigStorage(',')
    as (flight_date:chararray, airline_cd:int, airport_cd:chararray,
        delay:int, dep_time:int);
PosDelay = Filter FlightData BY isPos(delay);

The code for the isPositive UDF is shown here:

package com.BigData.hadoop.pig.SampleUDF;

import java.io.IOException;
import org.apache.pig.FilterFunc;
import org.apache.pig.data.Tuple;

public class isPositive extends FilterFunc {
    @Override
    public Boolean exec(Tuple arg0) throws IOException {
        if (arg0 == null || arg0.size() != 1)
            return null;
        try {
            if (arg0.get(0) instanceof Integer) {
                if ((Integer)arg0.get(0)>0)
                    return true;
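The filtering rule described above (keep a row only when its single field is an integer greater than zero) can be tried in plain Java. This is a minimal sketch rather than the UDF itself: a java.util.List stands in for Pig's Tuple so it runs without pig.jar, the names PositiveFilterSketch and isPositive are hypothetical, and returning false for values that are not positive integers is an assumption.

```java
import java.util.Arrays;
import java.util.List;

// Stand-alone sketch: a List stands in for Pig's Tuple (an assumption made
// only so the rule can be exercised without pig.jar).
class PositiveFilterSketch {

    // Returns null when the input is missing or does not hold exactly one
    // field, true for an Integer greater than zero, and false otherwise.
    // The false branch is an assumption for illustration.
    static Boolean isPositive(List<?> fields) {
        if (fields == null || fields.size() != 1) {
            return null;
        }
        Object value = fields.get(0);
        if (value instanceof Integer) {
            return (Integer) value > 0;
        }
        return false;
    }

    public static void main(String[] args) {
        // Each delay is checked as a one-field row, as in Filter ... BY isPos(delay).
        for (Integer delay : Arrays.asList(15, 0, -3)) {
            System.out.println(delay + " -> " + isPositive(Arrays.asList(delay)));
        }
    }
}
```

In this sketch only the delay of 15 passes the check; the zero and negative delays are rejected.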


