300+ TOP Apache Pig Hadoop Interview Questions and Answers

Apache Pig Interview Questions for Freshers and Experienced:

1. What is Pig?
Pig is an Apache open source project that runs on Hadoop and provides an engine for executing data flows in parallel. It includes a language called Pig Latin for expressing these data flows. Pig Latin includes operations such as join, sort, and filter, and lets you write User Defined Functions (UDFs) for processing, reading, and writing data. Pig uses both HDFS (for storage) and MapReduce (for processing).

2. What is the difference between Pig and SQL?
Pig Latin is a procedural counterpart to SQL. The two have some similarities but more differences. SQL is a declarative query language: the user states a question as a query, and the query says what answer is wanted but not how to compute it. If a user wants to perform multiple operations on tables, they must write multiple queries and store intermediate results in temporary tables, or use subqueries. Many SQL users find subqueries confusing and difficult to form properly, and subqueries create an inside-out design in which the first step in the data pipeline is the innermost query. Pig is designed with a long series of data operations in mind, so there is no need to write the data pipeline as an inverted set of subqueries or to worry about storing data in temporary tables.
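
The step-by-step style described here can be sketched as a short Pig Latin pipeline. The input name 'visits', its fields, and the output path are assumptions made for illustration only; each statement names one intermediate result, so no subqueries or temporary tables are needed.

```pig
-- Hypothetical input: tab-delimited records of (user, url, time_ms)
visits  = LOAD 'visits' AS (user:chararray, url:chararray, time_ms:long);
-- Step 1: keep only slow page loads
slow    = FILTER visits BY time_ms > 1000L;
-- Step 2: collect the slow loads for each user
by_user = GROUP slow BY user;
-- Step 3: count them per user
counts  = FOREACH by_user GENERATE group AS user, COUNT(slow) AS slow_visits;
-- Step 4: sort and write out the result
ranked  = ORDER counts BY slow_visits DESC;
STORE ranked INTO 'slow_visits_by_user';
```

The statements read top to bottom in the same order in which the data flows, which is the opposite of the inside-out order of nested SQL subqueries.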

3. How does Pig differ from MapReduce?
In MapReduce, a group-by operation is performed on the reducer side, while filtering and projection can be implemented in the map phase. Pig Latin provides these standard operations (order by, filter, group by, and so on) as built-in operators. Because Pig can analyze a Pig Latin script and understand its data flow, it can also perform error checking early. Pig Latin is much cheaper to write and maintain than Java code for MapReduce.

4. What is Pig useful for?
Pig is used in three main categories: 1) ETL data pipelines, 2) research on raw data, and 3) iterative processing.
The most common use case for Pig is the data pipeline. For example, web-based companies collect web logs, and before storing that data in a warehouse they apply transformations to it, such as cleaning and aggregation.
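
A minimal sketch of such a web log pipeline might look like the following. The input name, the field layout, and the output path are all hypothetical, chosen only to show a cleaning step followed by an aggregation step.

```pig
-- Hypothetical input: tab-delimited web log records (ip, url, status, bytes)
raw     = LOAD 'weblogs' AS (ip:chararray, url:chararray, status:int, bytes:long);
-- Cleaning: drop records with missing fields and keep successful requests only
clean   = FILTER raw BY ip IS NOT NULL AND url IS NOT NULL AND status == 200;
-- Aggregation: hits and total bytes served per URL
by_url  = GROUP clean BY url;
traffic = FOREACH by_url GENERATE group AS url, COUNT(clean) AS hits, SUM(clean.bytes) AS total_bytes;
-- Write the transformed data to the (hypothetical) warehouse location
STORE traffic INTO 'warehouse/url_traffic';
```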

5. What are the scalar data types in Pig?
The scalar data types are:
int (4 bytes),
float (4 bytes),
double (8 bytes),
long (8 bytes),
chararray,
bytearray

6. What are the complex data types in Pig?
map:
A map in Pig is a mapping from a chararray key to a data element, where the element can be of any Pig data type, including a complex type.
Example of a map: ['city'#'hyd','pin'#500086]
In the above example, city and pin are keys that map to the values hyd and 500086.

tuple:
A tuple is a fixed-length, ordered collection of fields, and each field can be of any data type.
For example, ('hyd', 500086) is a tuple containing two fields.

bag:
A bag is an unordered collection of tuples. Bag constants are constructed using braces, with tuples in the bag separated by commas. For example, {('hyd', 500086), ('chennai', 510071), ('bombay', 500185)}

7. Is the Pig Latin language case-sensitive?

  • Pig Latin is case-insensitive in some places and case-sensitive in others. Keywords are not case-sensitive: for example, LOAD is equivalent to load.
  • Relation and field names are case-sensitive: A = load 'b' is not equivalent to a = load 'b'.
  • UDF names are also case-sensitive: count is not equivalent to COUNT.
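
The three rules above can be seen together in a small sketch. The file name 'b' comes from the example above; the single-field schema is an assumption added for illustration.

```pig
-- Keywords are case-insensitive: both statements use the same LOAD operator.
A = LOAD 'b' AS (x:int);
a = load 'b' as (x:int);
-- Aliases are case-sensitive: A and a are two separate relations.
grouped = GROUP A ALL;
-- Built-in function names are case-sensitive: COUNT resolves, count would not.
total = FOREACH grouped GENERATE COUNT(A);
DUMP total;
```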

8. How is the ‘load’ keyword useful in Pig scripts?
The first step in a data flow is to specify the input, which is done with the ‘load’ keyword. By default, load looks for your data on HDFS in a tab-delimited file using the default load function PigStorage. If we want to load data from HBase, we would use the HBase loader, HBaseStorage.
Example of the PigStorage loader:
A = LOAD '/home/ravi/work/flight.tsv' USING PigStorage('\t') AS (origincode:chararray, destinationcode:chararray, origincity:chararray, destinationcity:chararray, passengers:int, seats:int, flights:int, distance:int, year:int, originpopulation:int, destpopulation:int);
Example of the HBaseStorage loader:
x = load 'a' using HBaseStorage();
If you do not specify a load function, the built-in function PigStorage is used.
The ‘load’ statement can also have an ‘as’ clause, which allows you to specify the schema of the data you are loading.
PigStorage and TextLoader are the two built-in Pig load functions that operate on HDFS files.

9. How is the ‘store’ keyword useful in Pig scripts?
After processing is complete, the result must be written somewhere. Pig provides the store statement for this purpose:
store processed into '/data/ex/process';
If you do not specify a store function, PigStorage will be used. You can specify a different store function with a using clause:

  • store processed into 'processed' using HBaseStorage();
  • We can also pass arguments to the store function, for example: store processed into 'processed' using PigStorage(',');

10. What is the purpose of the ‘dump’ keyword in Pig?

  • dump displays the output of a relation on the screen.
  • Example: dump processed;

11. What are the relational operations in Pig Latin?
They are:

  1. foreach
  2. order by
  3. filter
  4. group
  5. distinct
  6. join
  7. limit

12. How do you use the ‘foreach’ operation in Pig scripts?
foreach takes a set of expressions and applies them to every record in the data pipeline:
A = load 'input' as (user:chararray, id:long, address:chararray, phone:chararray, preferences:map[]);
B = foreach A generate user, id;
Positional references are preceded by a $ (dollar sign) and start from 0:
C = foreach D generate $2 - $1; -- D is any relation whose second and third fields are numeric

13. How do you write a ‘foreach’ statement for the map data type in Pig scripts?
For a map, use the hash (‘#’) operator followed by the key:
bball = load 'baseball' as (name:chararray, team:chararray, position:bag{t:(p:chararray)}, bat:map[]);
avg = foreach bball generate bat#'batting_average';

14. How do you write a ‘foreach’ statement for the tuple data type in Pig scripts?
For a tuple, use the dot (‘.’) operator followed by a field name or position:
A = load 'input' as (t:tuple(x:int, y:int));
B = foreach A generate t.x, t.$1;

15. How do you write a ‘foreach’ statement for the bag data type in Pig scripts?
When you project fields in a bag, you are creating a new bag with only those fields:
A = load 'input' as (b:bag{t:(x:int, y:int)});
B = foreach A generate b.x;
We can also project multiple fields in a bag:
A = load 'input' as (b:bag{t:(x:int, y:int)});
B = foreach A generate b.(x, y);

16. Why should we use ‘filter’ in Pig scripts?
A filter is similar to the where clause in SQL. A filter contains a predicate. If that predicate evaluates to true for a given record, that record is passed down the pipeline; otherwise, it is not. Predicates can contain comparison operators such as ==, !=, >=, and <=. Of these, == and != can also be applied to maps and tuples.
A = load 'inputs' as (name, address);
B = filter A by name matches 'CM.*';

17. Why should we use the ‘group’ keyword in Pig scripts?
The group statement collects together records with the same key. In SQL, the group by clause creates a group that must feed directly into one or more aggregate functions. In Pig Latin, there is no direct connection between group and aggregate functions.
input2 = load 'daily' as (exchanges, stocks);
grpds = group input2 by stocks;
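
Because group and aggregation are separate steps in Pig Latin, an aggregate is applied in a later foreach, if at all. A minimal sketch, reusing the 'daily' input from the example above with field types added as an assumption:

```pig
daily  = LOAD 'daily' AS (exchanges:chararray, stocks:chararray);
grpd   = GROUP daily BY stocks;
-- Each record of grpd has two fields: 'group' (the key) and a bag named
-- 'daily' holding every input record that shares that key.
counts = FOREACH grpd GENERATE group AS stocks, COUNT(daily) AS n;
DUMP counts;
```

The grouped relation could just as well be stored or filtered without any aggregate function being applied to it.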

18. Why should we use the ‘order by’ keyword in Pig scripts?
The order statement sorts your data for you, producing a total order of your output data. The syntax of order is similar to group: you indicate a key or set of keys by which you wish to order your data.
input2 = load 'daily' as (exchanges, stocks);
grpds = order input2 by exchanges;

19. Why should we use the ‘distinct’ keyword in Pig scripts?
The distinct statement is very simple. It removes duplicate records. It works only on entire records, not on individual fields:
input2 = load 'daily' as (exchanges, stocks);
grpds = distinct input2;

20. Is it possible to join on multiple fields in Pig scripts?
Yes. Join selects records from one input and joins them with records from another input. This is done by indicating keys for each input. When those keys are equal, the two rows are joined.
input2 = load 'daily' as (exchanges, stocks);
input3 = load 'week' as (exchanges, stocks);
grpds = join input2 by stocks, input3 by stocks;

We can also join on multiple keys.
Example:
input2 = load 'daily' as (exchanges, stocks);
input3 = load 'week' as (exchanges, stocks);
grpds = join input2 by (exchanges, stocks), input3 by (exchanges, stocks);

21. Is it possible to display a limited number of results?
Yes. Sometimes you want to see only a limited number of results. ‘limit’ allows you to do this:
input2 = load 'daily' as (exchanges, stocks);
first10 = limit input2 10;

22. What are the different modes available in Pig?

Two modes are available in Pig.

  1. Local mode (runs on a single machine using the local file system)
  2. MapReduce mode (runs on a Hadoop cluster)

23. What does FOREACH do?

FOREACH is used to apply transformations to the data and to generate new data items. As the name indicates, the given action is performed for each element of a data bag.

Syntax: FOREACH bagname GENERATE expression1, expression2, ...

The meaning of this statement is that the expressions listed after GENERATE are applied to the current record of the data bag.

24. Why should we use ‘filter’ in Pig scripts?

A filter is similar to the where clause in SQL. A filter contains a predicate. If that predicate evaluates to true for a given record, that record is passed down the pipeline; otherwise, it is not. Predicates can contain comparison operators such as ==, !=, >=, and <=. Of these, == and != can also be applied to maps and tuples.

A = load 'inputs' as (name, address);

B = filter A by name matches 'CM.*';

25. What is bag?

A bag is one of the data types in Pig's data model. It is an unordered collection of tuples, with possible duplicates. Bags are used to store collections while grouping. A bag does not have to fit completely into memory: when it grows too large, Pig spills it to local disk and keeps only part of it in memory. The size of a bag is therefore limited by the size of the local disk. We represent bags with “{}”.

26. Why should we use the ‘order by’ keyword in Pig scripts?

The order statement sorts your data for you, producing a total order of your output data. The syntax of order is similar to group: you indicate a key or set of keys by which you wish to order your data.

input2 = load 'daily' as (exchanges, stocks);

grpds = order input2 by exchanges;

27. What are the features of Pig?

i) Data Flow Language

The user specifies a sequence of steps, where each step specifies only a single high-level data transformation.

ii) User Defined Functions (UDF)

iii) Debugging Environment

iv) Nested Data Model

28. What are the advantages of pig language?

  • Pig is easy to learn: Pig removes much of the need to write complex MapReduce programs. Pig works in a step-by-step manner, so it is easy to write and, even better, easy to read.
  • It can handle heterogeneous data: Pig can handle all types of data, whether structured, semi-structured, or unstructured.
  • Pig is faster: Pig's multi-query approach combines certain types of operations together in a single pipeline, reducing the number of times data is scanned.
  • Pig does more with less: Pig provides the common data operations (filters, joins, ordering, etc.) and nested data types (e.g. tuples, bags, and maps) that can be used in processing data.
  • Pig is extensible: Pig is easily extended with UDFs written in languages including Python, Java, JavaScript, and Ruby, which you can use to load, aggregate, and analyze data. Pig also insulates your code from changes to the Hadoop Java API.

29. What is the Physical plan in pig architecture?

The physical plan describes the physical form in which a Pig script will be executed. At this stage, the operators of the logical plan are converted into physical operators.

30. What is the difference between MapReduce and Pig?

  • In MapReduce, you need to write the entire logic for operations like join, group, filter, and sum.
  • In Pig, built-in functions are available for these operations.
  • In MapReduce, the number of lines of code required is large, even for simple functionality.
  • In Pig, about 10 lines of Pig Latin are equal to about 200 lines of Java.
  • In MapReduce, the time and effort spent on coding is high.
  • In Pig, what took 4 hours to write in Java takes about 15 minutes in Pig Latin (approximately).
  • In MapReduce, productivity is lower.
  • In Pig, productivity is higher.

31. What are the relational operators available related to Grouping and joining in pig language?

Grouping and joining operators are the most powerful operators in Pig Latin, because grouping and join logic is difficult to write by hand in low-level MapReduce code.

JOIN

GROUP

COGROUP

CROSS

JOIN is used to join two or more relations. GROUP is used for aggregation of a single relation. COGROUP is used for the aggregation of multiple relations. CROSS is used to create a cartesian product of two or more relations.
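
The four operators can be compared side by side in a short sketch. The two input relations and their fields are hypothetical.

```pig
-- Hypothetical inputs
orders    = LOAD 'orders' AS (order_id:int, cust_id:int, amount:double);
customers = LOAD 'customers' AS (cust_id:int, name:chararray);

-- JOIN: one output record per matching (order, customer) pair
joined    = JOIN orders BY cust_id, customers BY cust_id;
-- GROUP: one record per cust_id, with a bag of that customer's orders
grouped   = GROUP orders BY cust_id;
-- COGROUP: one record per cust_id, with one bag per input relation
cogrouped = COGROUP orders BY cust_id, customers BY cust_id;
-- CROSS: every order paired with every customer (can be very large)
crossed   = CROSS orders, customers;
```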

32. Why do we need Pig?

Pig is a high-level scripting language that is used with Apache Hadoop. Pig excels at describing data analysis problems as data flows. Pig is complete in that you can do all the required data manipulations in Apache Hadoop with Pig. In addition, through the User Defined Functions (UDF) facility, Pig can invoke code written in many languages, such as JRuby, Jython, and Java. Conversely, you can execute Pig scripts from other languages. The result is that you can use Pig as a component to build larger and more complex applications that tackle real business problems.

33. What are the different String functions available in pig?

Below are the most commonly used Pig string functions:

UPPER

LOWER

TRIM

SUBSTRING

INDEXOF

STRSPLIT

LAST_INDEX_OF
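
As a rough illustration, the functions listed above could be applied in a single foreach. The input 'names' and its one chararray field are assumptions.

```pig
-- Hypothetical input with one string field per record
names = LOAD 'names' AS (name:chararray);
out = FOREACH names GENERATE
    UPPER(name),               -- upper-case copy
    LOWER(name),               -- lower-case copy
    TRIM(name),                -- leading and trailing whitespace removed
    SUBSTRING(name, 0, 3),     -- characters from index 0 up to, not including, 3
    INDEXOF(name, 'a', 0),     -- first index of 'a', searching from index 0
    LAST_INDEX_OF(name, 'a'),  -- last index of 'a'
    STRSPLIT(name, ' ', 2);    -- tuple of at most 2 parts split on a space
DUMP out;
```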

34. What is a relation in Pig?

A Pig relation is a bag of tuples. A Pig relation is similar to a table in a relational database, where the tuples in the bag correspond to the rows in a table. Unlike a relational table, however, Pig relations don't require that every tuple contain the same number of fields or that the fields in the same position (column) have the same type.

35. What is a tuple?

A tuple is an ordered set of fields, and a field is a piece of data.

36. What is the MapReduce plan in pig architecture?

In the MapReduce plan stage, the output of the physical plan is converted into actual MapReduce jobs, which are then executed across the Hadoop cluster.

37. What is the logical plan in pig architecture?

In the logical plan stage of Pig, statements are parsed for syntax errors. The input files and the data structure of the files are also validated. A DAG (Directed Acyclic Graph) with operators as nodes and data flows as edges is then created. Optimizations of the Pig script are also applied to the logical plan.

38. What is UDF in Pig?

Pig has a wide range of built-in functions, but occasionally we need to write complex business logic that cannot be implemented using the built-in functions. For this, Pig supports User Defined Functions (UDFs) as a way to specify custom processing.

Pig UDFs can currently be implemented in Java, Python, JavaScript, Ruby, and Groovy. The most extensive support is provided for Java functions. You can customize all parts of the processing, including data load/store, column transformation, and aggregation. Java functions are also more efficient because they are implemented in the same language as Pig and because additional interfaces are supported, such as the Algebraic interface and the Accumulator interface. Limited support is provided for Python, JavaScript, Ruby, and Groovy functions.
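
On the Pig Latin side, using a Java UDF typically means registering the jar that contains it and then calling the function by name. The sketch below assumes a hypothetical jar, myudfs.jar, containing a hypothetical class com.example.pig.Normalize; neither comes from the original text.

```pig
-- Hypothetical jar that contains the compiled UDF class
REGISTER myudfs.jar;
-- Optional: give the fully qualified (hypothetical) class a short alias
DEFINE Normalize com.example.pig.Normalize();

users   = LOAD 'users' AS (name:chararray);
-- Apply the UDF to every record like any built-in function
cleaned = FOREACH users GENERATE Normalize(name);
DUMP cleaned;
```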

39. What are the primitive data types in pig?

The following are the primitive data types in Pig:

int

long

float

double

chararray

bytearray

40. What is bag data type in pig?

The bag data type works as a container for tuples, whose fields may themselves hold other bags. It is a complex data type in the Pig Latin language.

41. Why should we use the ‘distinct’ keyword in Pig scripts?

The distinct statement is very simple. It removes duplicate records. It works only on entire records, not on individual fields:

input2 = load 'daily' as (exchanges, stocks);

grpds = distinct input2;

42. What are the different math functions available in pig?

Below are the most commonly used Pig math functions:

ABS

ACOS

EXP

LOG

ROUND

CBRT

RANDOM

SQRT
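
A short sketch of these functions applied to a numeric field follows. The input 'numbers' and its single double field are assumptions.

```pig
-- Hypothetical input with one numeric field per record
nums = LOAD 'numbers' AS (x:double);
out = FOREACH nums GENERATE
    ABS(x),     -- absolute value
    ACOS(x),    -- arc cosine
    EXP(x),     -- e raised to the power x
    LOG(x),     -- natural logarithm
    ROUND(x),   -- rounded to the nearest whole number
    CBRT(x),    -- cube root
    SQRT(x),    -- square root
    RANDOM();   -- pseudo-random double, takes no argument
DUMP out;
```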

43. What are the different Eval functions available in pig?

Below are the most commonly used Pig eval functions:

AVG

CONCAT

MAX

MIN

SUM

SIZE

COUNT

COUNT_STAR

DIFF

TOKENIZE

IsEmpty
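
Most of these functions operate on a bag, so they are usually applied after a group. A minimal sketch, with hypothetical inputs 'sales' and 'people':

```pig
-- Hypothetical input: one sale per record
sales  = LOAD 'sales' AS (region:chararray, amount:double);
by_reg = GROUP sales BY region;
-- Aggregate functions take the bag produced by GROUP
stats  = FOREACH by_reg GENERATE
    group AS region,
    COUNT(sales),        -- counts tuples, skipping those whose first field is null
    COUNT_STAR(sales),   -- counts all tuples, including nulls
    SUM(sales.amount),
    AVG(sales.amount),
    MIN(sales.amount),
    MAX(sales.amount),
    IsEmpty(sales);      -- true when the bag has no tuples

-- Functions that work on individual fields rather than grouped bags
people = LOAD 'people' AS (first:chararray, last:chararray);
words  = FOREACH people GENERATE CONCAT(first, last), SIZE(first), TOKENIZE(first);
```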

44. What are the relational operators available related to loading and storing in pig language?

For loading data and storing it into HDFS, Pig uses the following operators:

LOAD

STORE

LOAD loads data from the file system. STORE stores data in the file system.

45. Explain about co-group in Pig.

The COGROUP operator in Pig groups tuples from multiple relations. It is applied to statements that involve two or more relations, and it can be applied to up to 127 relations at a time. When COGROUP is used on two relations at once, Pig first groups both relations on the key and then joins the two sets of groups on the grouped columns.
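
The shape of a COGROUP result is easiest to see on two small relations. The inputs and field names below are hypothetical.

```pig
-- Hypothetical inputs that share an owner field
owners = LOAD 'owners' AS (owner:chararray, pet:chararray);
visits = LOAD 'visits' AS (owner:chararray, clinic:chararray);

-- One output record per distinct owner value found in either input
both = COGROUP owners BY owner, visits BY owner;
-- Each record holds the key plus one bag per input relation; a bag is
-- empty when that input has no tuples for the key.
DESCRIBE both;

-- Keeping only keys present in both inputs resembles an inner join on the key
matched = FILTER both BY NOT IsEmpty(owners) AND NOT IsEmpty(visits);
```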

46. What are the relational operators available related to combining and splitting in pig language?

UNION and SPLIT are used for combining and splitting relations in Pig.
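
A minimal sketch of both operators, assuming two hypothetical inputs with the same schema:

```pig
-- Hypothetical inputs with identical schemas
a = LOAD 'part1' AS (id:int, score:int);
b = LOAD 'part2' AS (id:int, score:int);

-- UNION concatenates the two relations into one (order is not preserved)
all_rows = UNION a, b;

-- SPLIT routes each record to every output relation whose condition it satisfies
SPLIT all_rows INTO low IF score < 50, high IF score >= 50;
```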

47. What are different modes of execution in Apache Pig?

Apache Pig runs in two modes: local mode and Hadoop MapReduce mode. Local mode requires access to only a single machine, where all files are installed and executed on the local host, whereas MapReduce mode requires access to a Hadoop cluster.

48. Does Pig support multi-line commands?

Yes. A Pig Latin statement can span multiple lines; it ends at the terminating semicolon.
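
For example, the following sketch (with a hypothetical input) spreads each statement over several lines:

```pig
/* A block comment
   can also span several lines. */
A = LOAD 'input'
    AS (user:chararray,
        id:long);
B = FOREACH A GENERATE
        user,
        id;
```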

49. How would you diagnose problems or handle exceptions in Pig?

To diagnose problems in a Pig script, we can use the following operators:

DUMP

DESCRIBE

ILLUSTRATE

EXPLAIN

DUMP displays the results on screen. DESCRIBE displays the schema of a particular relation. ILLUSTRATE displays the step-by-step execution of a sequence of Pig statements. EXPLAIN displays the execution plan for Pig Latin statements.
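
A sketch of how the four operators might be used while debugging a small script; the input and its fields are hypothetical.

```pig
A = LOAD 'input' AS (user:chararray, id:long);
B = FILTER A BY id > 100L;

DESCRIBE B;    -- print the schema of B
EXPLAIN B;     -- print the logical, physical, and MapReduce plans for B
ILLUSTRATE B;  -- run the statements on a small sample and show each step
DUMP B;        -- execute the script and print the records of B
```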

50. What is the difference between store and dumps commands?

The dump command displays the processed data on the terminal, but the data is not stored anywhere. The store command, by contrast, writes the output to a folder in the local file system or HDFS. In a production environment, Hadoop developers most often use the ‘store’ command to store data in HDFS.

51. Can we say cogroup is a group of more than 1 data set?

When applied to one data set, cogroup behaves like group. When applied to more than one data set, cogroup groups all of the data sets and joins them on the common field. Hence, we can say that cogroup is both a group of more than one data set and a join of those data sets.
