300+ TOP Apache Spark Interview Questions and Answers

Apache Spark Interview Questions for Freshers and Experienced Candidates:

1. What is Apache Spark?

Apache Spark is an easy-to-use and flexible data processing framework. Spark can run on Hadoop, standalone, or in the cloud. It is capable of accessing diverse data sources, including HDFS, Cassandra, and others.

2. Explain DStream with reference to Apache Spark

A DStream (Discretized Stream) is a sequence of Resilient Distributed Datasets (RDDs) that represents a stream of data. You can create a DStream from various sources such as HDFS, Apache Flume, and Apache Kafka.

3. Name three data sources available in Spark SQL

Three data sources available in Spark SQL are:

  • JSON Datasets
  • Hive tables
  • Parquet files

4. Name some internal daemons used in Spark.

Important daemons used in Spark are the BlockManager, MemoryStore, DAGScheduler, driver, worker, executor, tasks, etc.

5. Define the term ‘Sparse Vector.’

A sparse vector is a vector backed by two parallel arrays, one for indices and one for values, used for storing only the non-zero entries to save space.
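
As an illustration, here is a minimal Scala sketch using Spark's MLlib Vectors factory in a spark-shell session; the size, indices, and values are made-up examples:

    import org.apache.spark.ml.linalg.Vectors

    // A vector of size 6 whose only non-zero entries sit at indices 1 and 4.
    // Internally it keeps two parallel arrays: indices = [1, 4], values = [3.0, 7.0].
    val sv = Vectors.sparse(6, Array(1, 4), Array(3.0, 7.0))
    println(sv)   // prints (6,[1,4],[3.0,7.0])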

6. Name the languages supported by Apache Spark for developing big data applications

Important languages used for developing big data applications with Apache Spark are:

  • Java
  • Python
  • R
  • Clojure
  • Scala

7. What are the methods to create a DataFrame?

In Apache Spark, a DataFrame can be created from Hive tables and from structured data files.
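
A minimal Scala sketch of both approaches, assuming a spark-shell session where `spark` is the SparkSession and the table and file names are placeholders:

    // From a structured data file (JSON in this example; Parquet, CSV, etc. work the same way).
    val fromFile = spark.read.json("hdfs:///data/people.json")

    // From a Hive table (requires a SparkSession built with Hive support).
    val fromHive = spark.sql("SELECT * FROM default.people")

    // For quick tests, a DataFrame can also be built from a local collection.
    val fromSeq = spark.createDataFrame(Seq((1, "Alice"), (2, "Bob"))).toDF("id", "name")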

8. Explain SchemaRDD

An RDD that consists of row objects, with schema information about the type of data in each column, is called a SchemaRDD.

9. What are accumulators?

Accumulators are write-only variables from the workers' point of view. They are initialized once on the driver and sent to the workers, which update them based on the logic written in the tasks; the accumulated value is then sent back to the driver.
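
A minimal Scala sketch, assuming a spark-shell session where `sc` is the SparkContext and the input path is a placeholder:

    // Created on the driver; workers can only add to it, not read it.
    val badRecords = sc.longAccumulator("badRecords")

    sc.textFile("hdfs:///data/input.txt").foreach { line =>
      if (line.trim.isEmpty) badRecords.add(1)   // updated inside tasks on the workers
    }

    // The accumulated value is only reliably readable back on the driver.
    println(s"Empty lines: ${badRecords.value}")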

10. What are the components of Spark Ecosystem?

The important components of the Spark ecosystem are:

  • Spark Core: The base engine for large-scale parallel and distributed data processing
  • Spark Streaming: The component used for real-time data streaming
  • Spark SQL: Integrates relational processing with Spark’s functional programming API
  • GraphX: Enables graphs and graph-parallel computation
  • MLlib: Allows you to perform machine learning in Apache Spark

11. Name three features of using Apache Spark

Three of the most important features of Apache Spark are:

  • Support for sophisticated analytics
  • Integration with Hadoop and existing Hadoop data
  • The ability to run an application in a Hadoop cluster up to 100 times faster in memory and ten times faster on disk

12. Explain the default level of parallelism in Apache Spark

If the user does not explicitly specify it, the number of partitions is taken as the default level of parallelism in Apache Spark.

13. Name three companies which use Spark Streaming services

Three known companies using Spark Streaming services are:

  1. Uber
  2. Netflix
  3. Pinterest

14. What is Spark SQL?

Spark SQL is a Spark module for structured data processing that lets you run SQL queries against that data.

15. What is a Parquet file?

Parquet is a columnar file format supported by many other data processing systems. Spark SQL allows you to perform both read and write operations on Parquet files.
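
A short Scala sketch of reading and writing Parquet with Spark SQL, assuming `spark` is the SparkSession of a spark-shell session and the paths are placeholders:

    // Write an existing DataFrame out as Parquet.
    val events = spark.read.json("hdfs:///data/events.json")
    events.write.mode("overwrite").parquet("hdfs:///data/events.parquet")

    // Read the Parquet files back; the schema is stored in the files themselves.
    val parquetDF = spark.read.parquet("hdfs:///data/events.parquet")
    parquetDF.printSchema()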

16. What is the Spark driver?

The Spark driver is the program that runs on the master node of the machine and declares transformations and actions on data RDDs.

17. How can you store the data in spark?

Spark is a processing engine that does not have a storage engine of its own. It retrieves data from external storage systems such as HDFS and S3.

18. Explain the use of File system API in Apache Spark

The file system API allows you to read data from various storage systems such as HDFS, S3, or the local filesystem.

19. What is the task of the Spark engine?

The Spark engine is responsible for scheduling, distributing, and monitoring the data application across the cluster.

20. What is the use of SparkContext?

SparkContext is the entry point to Spark. It allows you to create RDDs, which provide various ways of processing data.

21. How can you implement machine learning in Spark?

MLlib is the versatile machine learning library provided by Spark.

22. Can you do real-time processing with Spark SQL?

Real-time data processing is not possible directly with Spark SQL. However, it is possible by registering an existing RDD as a SQL table and triggering SQL queries on it.
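
A minimal Scala sketch of that registration step in a spark-shell session; the `Sale` case class and the view name are made up for illustration:

    import spark.implicits._

    case class Sale(id: Int, amount: Double)

    // Turn a local collection into a DataFrame and expose it to SQL as a temporary view.
    val sales = Seq(Sale(1, 10.0), Sale(2, 25.5)).toDF()
    sales.createOrReplaceTempView("sales")

    // SQL queries can now be triggered against the registered view.
    spark.sql("SELECT SUM(amount) AS total FROM sales").show()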

23. State the difference between Spark SQL and HQL

Spark SQL is an essential component built on the Spark Core engine. It supports both SQL and Hive Query Language (HQL) without altering their syntax.

24. Can you run Apache Spark On Apache Mesos?

Yes, you can run Apache Spark on the hardware clusters managed by Mesos.

25. Explain partitions

A partition is a smaller, logical division of the data. Partitioning is the method of deriving logical units of data so that processing can be sped up by working on them in parallel.

26. Define the term ‘Lazy Evaluation’ with reference to Apache Spark

Apache Spark delays its evaluation until a result is actually needed. Spark adds transformations to a DAG of computation, and the DAG is executed only when the driver requests some data.

27. Explain the use of broadcast variables

The most common uses of broadcast variables are:

  • Broadcast variables help the programmer keep a read-only variable cached on each machine instead of shipping a copy of it with tasks.
  • You can also use them to give every node a copy of a large input dataset in an efficient manner.
  • Efficient broadcast algorithms also help you reduce communication cost.
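
A minimal Scala sketch of a broadcast lookup table, assuming `sc` from a spark-shell session (the data is made up):

    // Shipped to each executor once and cached there as a read-only value.
    val countryNames = sc.broadcast(Map("US" -> "United States", "IN" -> "India"))

    val users = sc.parallelize(Seq(("alice", "US"), ("bob", "IN")))
    val resolved = users.map { case (name, code) =>
      (name, countryNames.value.getOrElse(code, "Unknown"))   // local lookup on the worker
    }
    resolved.collect().foreach(println)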

28. How can you use Akka with Spark?

Spark uses Akka for scheduling. It also uses Akka for messaging between the workers and the master.

29. Which is the fundamental data structure of Spark?

The RDD (Resilient Distributed Dataset) is the fundamental data structure of Spark.

30. Can you use Spark for ETL process?

Yes, you can use Spark for ETL processes.

31. What is the use of map transformation?

A map transformation on an RDD produces another RDD by transforming each element: it applies the function provided by the user to every element of the source RDD.
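
For example, a short spark-shell sketch in Scala (sample data made up):

    val words = sc.parallelize(Seq("spark", "hadoop", "flink"))

    // map() applies the user-provided function to every element, yielding a new RDD
    // with exactly one output element per input element.
    val lengths = words.map(word => (word, word.length))

    lengths.collect().foreach(println)   // (spark,5), (hadoop,6), (flink,5)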

32. What are the disadvantages of using Spark?

The following are some of the disadvantages of using Spark:

  1. Spark consumes a large amount of memory compared with Hadoop.
  2. You can’t run everything on a single node; the work must be distributed over multiple nodes of the cluster.
  3. Developers need to take extra care while running their applications in Spark.
  4. Spark Streaming does not provide support for record-based window criteria.

33. What are common uses of Apache Spark?

Apache Spark is used for:

  • Interactive machine learning
  • Stream processing
  • Data analytics and processing
  • Sensor data processing

34. State the difference between persist() and cache() functions.

The persist() function allows the user to specify the storage level, whereas cache() uses the default storage level.
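
A brief Scala sketch of the difference, assuming `sc` from a spark-shell session and placeholder paths:

    import org.apache.spark.storage.StorageLevel

    // cache() always uses the default storage level, MEMORY_ONLY.
    val a = sc.textFile("hdfs:///data/a.txt")
    a.cache()

    // persist() lets the caller pick the storage level explicitly.
    val b = sc.textFile("hdfs:///data/b.txt")
    b.persist(StorageLevel.MEMORY_AND_DISK)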

35. Name the Spark Library which allows reliable file sharing at memory speed across different cluster frameworks.

Tachyon is the Spark-related library that allows reliable file sharing at memory speed across various cluster frameworks.

36. Apache Spark is a good fit for which type of machine learning techniques?

Apache Spark is ideal for simple machine learning algorithms like clustering, regression, and classification.

37. How can you remove the elements with a key present in any other RDD in Apache Spark?

In order to remove the elements with a key present in any other RDD, you need to use the subtractByKey() function.
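
A short spark-shell sketch in Scala (sample pairs made up):

    val left  = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
    val right = sc.parallelize(Seq(("b", 99)))

    // Keeps only the pairs from `left` whose key does not appear in `right`.
    left.subtractByKey(right).collect()   // Array((a,1), (c,3))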

38. What is the use of checkpoints in spark?

Checkpoints allow the application to run around the clock. Moreover, they help make it resilient to failures that are unrelated to the application logic.

39. Explain lineage graph

The lineage graph holds the information needed to compute each RDD on demand. Therefore, whenever a partition of a persisted RDD is lost, that data can be recovered using the lineage graph information.

40. What are the file formats supported by spark?

Spark supports file formats such as JSON, TSV, ORC, and RC, as well as compression codecs such as Snappy.

41. What are Actions?

An action brings data back from an RDD to the driver on the local machine. Its execution triggers the evaluation of all previously created transformations.

42. What is Yarn?

YARN is Hadoop's cluster resource management layer and one of the cluster managers on which Spark can run. Running Spark on YARN requires a binary distribution of Spark that is built with YARN support.

43. Explain Spark Executor

An executor is a Spark process that runs computations and stores the data on a worker node. The final tasks created by SparkContext are transferred to executors for execution.

44. Is it necessary to install Spark on all nodes while running a Spark application on YARN?

No, you do not necessarily need to install Spark on all nodes, because Spark runs on top of YARN.

45. What is a worker node in Apache Spark?

A worker node is any node which can run the application code in a cluster.

46. How can you launch Spark jobs inside Hadoop MapReduce?

SIMR (Spark In MapReduce) allows users to run any kind of Spark job inside MapReduce without needing to obtain admin rights for that application.

47. Explain the process to trigger automatic clean-up in Spark to manage accumulated metadata.

You can trigger automatic clean-ups by setting the parameter ‘spark.cleaner.ttl’ or by separating the long-running jobs into various batches and writing the intermediate results to disk.

48. What is the use of BlinkDB?

BlinkDB is a query engine that allows you to execute SQL queries on huge volumes of data and renders the query results with meaningful error bars.

49. How does Spark handle monitoring and logging in standalone mode?

Spark can handle monitoring and logging in standalone mode because it has a web-based user interface.

50. How can you identify whether a given operation is Transformation or Action?

You can identify the operation based on its return type. If the return type is not an RDD, then the operation is an action. However, if the return type is an RDD, then the operation is a transformation.
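
A short Scala illustration of that rule of thumb, assuming `sc` from a spark-shell session:

    import org.apache.spark.rdd.RDD

    val nums = sc.parallelize(1 to 10)

    // map() returns another RDD, so it is a transformation (and is evaluated lazily).
    val doubled: RDD[Int] = nums.map(_ * 2)

    // reduce() returns a plain value to the driver, so it is an action.
    val total: Int = doubled.reduce(_ + _)   // 110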

51. Can You Use Apache Spark To Analyze and Access Data Stored In Cassandra Databases?

Yes, you can use the Spark Cassandra Connector, which allows you to access and analyze data stored in Cassandra databases.

52. How is Spark SQL different from HQL and SQL?

Spark SQL is a special component on the Spark Core engine that supports SQL and Hive Query Language without changing any syntax. It is possible to join SQL tables and HQL tables in Spark SQL.

53. When running Spark applications, is it necessary to install Spark on all the nodes of YARN cluster?

Spark need not be installed when running a job under YARN or Mesos, because Spark can execute on top of YARN or Mesos clusters without requiring any change to the cluster.

54. What are the various data sources available in Spark SQL?

Parquet file, JSON datasets and Hive tables are the data sources available in Spark SQL.

55. How can you trigger automatic clean-ups in Spark to handle accumulated metadata?

You can trigger the clean-ups by setting the parameter ‘spark.cleaner.ttl’ or by dividing the long running jobs into different batches and writing the intermediary results to the disk.


300+ [LATEST] Apache Spark Interview Questions and Answers

Q1. Is It Possible To Run Spark And Mesos Along With Hadoop?

Yes, it is possible to run Spark and Mesos with Hadoop by launching each of these as a separate service on the machines. Mesos acts as a unified scheduler that assigns tasks to either Spark or Hadoop.

Q2. List Some Use Cases Where Spark Outperforms Hadoop In Processing.?

  1. Sensor data processing – Apache Spark’s in-memory computing works best here, as data is retrieved and combined from different sources.
  2. Real-time querying of data – Spark is preferred over Hadoop for real-time querying of data.
  3. Stream processing – For processing logs and detecting fraud in live streams for alerts, Apache Spark is the best solution.

Q3. What Do You Understand By Lazy Evaluation?

Spark is intelligent in the manner in which it operates on data. When you tell Spark to operate on a given dataset, it heeds the instructions and makes a note of them, so that it does not forget – but it does nothing unless asked for the final result.

When a transformation like map() is called on an RDD, the operation is not performed immediately. Transformations in Spark are not evaluated until you perform an action. This helps optimize the overall data processing workflow.

Q4. How Can You Remove The Elements With A Key Present In Any Other Rdd?

Use the subtractByKey() function.

Q5. What Do You Understand By Schemardd?

An RDD that consists of row objects (wrappers around basic string or integer arrays) with schema information about the type of data in each column.

Q6. What Are The Common Mistakes Developers Make When Running Spark Applications?

Developers often make the mistakes of:

  1. Hitting the web service several times by using multiple clusters.
  2. Running everything on the local node instead of distributing it.
  3. Not managing memory carefully, as Spark makes heavy use of memory for processing.

Q7. What Is The Difference Between Persist() And Cache()?

persist() allows the user to specify the storage level, whereas cache() uses the default storage level.

Q8. How Can You Minimize Data Transfers When Working With Spark?

Minimizing data transfers and avoiding shuffling helps write Spark programs that run in a fast and reliable manner.

The various ways in which data transfers can be minimized when working with Apache Spark are:

  1. Using broadcast variables – Broadcast variables enhance the efficiency of joins between small and large RDDs.
  2. Using accumulators – Accumulators help update the values of variables in parallel while executing.
  3. Avoiding ByKey operations, repartition, or any other operations which trigger shuffles, wherever possible.

Q9. Explain About The Major Libraries That Constitute The Spark Ecosystem?

Spark MLlib – Machine learning library in Spark for commonly used learning algorithms like clustering, regression, classification, etc.

Spark Streaming – This library is used to process real time streaming data.

Spark GraphX – Spark API for graph parallel computations with basic operators like joinVertices, subgraph, aggregateMessages, etc.

Spark SQL – Helps execute SQL-like queries on Spark data using standard visualization or BI tools.

Q10. Explain About The Different Types Of Transformations On DStreams?

Stateless transformations: Processing of a batch does not depend on the output of the previous batch.

Examples: map(), reduceByKey(), filter().

Stateful transformations: Processing of a batch depends on the intermediary results of the previous batch.

Examples: transformations that depend on sliding windows.

Q11. What Is The Significance Of Sliding Window Operation?

The Spark Streaming library provides windowed computations in which the transformations on RDDs are applied over a sliding window of data. Whenever the window slides, the RDDs that fall within the particular window are combined and operated upon to produce new RDDs of the windowed DStream.
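
A hedged Scala sketch of a windowed word count; the socket host/port, batch interval, and window/slide durations are placeholder choices:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("WindowedWordCount").setMaster("local[2]")  // local mode for the sketch
    val ssc  = new StreamingContext(conf, Seconds(5))                                 // 5-second batches

    val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))

    // Counts are computed over the last 30 seconds of data, recomputed every 10 seconds.
    val counts = words.map((_, 1))
      .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))

    counts.print()
    ssc.start()
    ssc.awaitTermination()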

Q12. What Is A Dstream?

A Discretized Stream (DStream) is a sequence of Resilient Distributed Datasets that represents a stream of data. DStreams can be created from various sources like Apache Kafka, HDFS, and Apache Flume.

DStreams support two kinds of operations:

  1. Transformations, which produce a new DStream.
  2. Output operations, which write data to an external system.

Q13. Does Apache Spark Provide Check Pointing?

Lineage graphs are always useful to recover RDDs from a failure, but this is generally time-consuming if the RDDs have long lineage chains. Spark has an API for checkpointing (i.e., a REPLICATE flag to persist). However, the decision about which data to checkpoint is made by the user. Checkpoints are useful when the lineage graphs are long and have wide dependencies.
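
A minimal Scala sketch of RDD checkpointing in a spark-shell session (the checkpoint directory is a placeholder):

    // Checkpointed data is written to reliable storage, truncating the lineage graph.
    sc.setCheckpointDir("hdfs:///tmp/checkpoints")

    var rdd = sc.parallelize(1 to 1000)
    for (_ <- 1 to 100) rdd = rdd.map(_ + 1)   // builds up a long lineage chain

    rdd.checkpoint()   // only marks the RDD; it is written out on the next action
    rdd.count()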

Q14. What Is Spark Core?

It has all the basic functionalities of Spark, like – memory management, fault recovery, interacting with storage systems, scheduling tasks, etc.

Q15. What Are The Benefits Of Using Spark With Apache Mesos?

It renders scalable partitioning among various Spark instances and dynamic partitioning between Spark and other big data frameworks.

Q16. Is It Possible To Run Apache Spark On Apache Mesos?

Yes, Apache Spark can be run on the hardware clusters managed by Mesos.

Q17. Why Is There A Need For Broadcast Variables When Working With Apache Spark?

These are read-only variables, kept in an in-memory cache on every machine. When working with Spark, the use of broadcast variables eliminates the necessity to ship copies of a variable for every task, so data can be processed faster. Broadcast variables help in storing a lookup table inside the memory, which enhances retrieval efficiency when compared to an RDD lookup().

Q18. Can You Use Spark To Access And Analyse Data Stored In Cassandra Databases?

Yes, it is possible if you use Spark Cassandra Connector.

Q19. What Does The Spark Engine Do?

Spark engine schedules, distributes and monitors the data application across the spark cluster.

Q20. Explain About Transformations And Actions In The Context Of RDDs?

Transformations are functions executed on demand to produce a new RDD. All transformations are followed by actions. Some examples of transformations include map, filter and reduceByKey.

Actions are the results of RDD computations or transformations. After an action is performed, the data from the RDD moves back to the local machine. Some examples of actions include reduce, collect, first, and take.

Q21. Name A Few Companies That Use Apache Spark In Production.?

Pinterest, Conviva, Shopify, OpenTable

Q22. What Are The Various Levels Of Persistence In Apache Spark?

Apache Spark automatically persists the intermediary data from various shuffle operations; however, it is often suggested that users call the persist() method on an RDD if they plan to reuse it. Spark has various persistence levels to store the RDDs on disk, in memory, or as a combination of both, with different replication levels.

The various storage/persistence levels in Spark are:

MEMORY_ONLY
MEMORY_ONLY_SER
MEMORY_AND_DISK
MEMORY_AND_DISK_SER
DISK_ONLY
OFF_HEAP

Q23. Is Apache Spark A Good Fit For Reinforcement Learning?

No. Apache Spark works well only for simple machine learning algorithms like clustering, regression, classification.

Q24. Is It Necessary To Install Spark On All The Nodes Of A Yarn Cluster While Running Apache Spark On Yarn ?

No, it is not necessary because Apache Spark runs on top of YARN.

Q25. How Can You Launch Spark Jobs Inside Hadoop Mapreduce?

Using SIMR (Spark in MapReduce) users can run any spark job inside MapReduce without requiring any admin rights.

Q26. Which Spark Library Allows Reliable File Sharing At Memory Speed Across Different Cluster Frameworks?

Tachyon

Q27. How Does Spark Use Akka?

Spark uses Akka basically for scheduling. All the workers request a task from the master after registering. The master just assigns the task. Here Spark uses Akka for messaging between the workers and the master.

Q28. How Does Spark Handle Monitoring And Logging In Standalone Mode?

Spark has a web-based user interface for monitoring the cluster in standalone mode that shows the cluster and job statistics. The log output for each job is written to the work directory of the slave nodes.

Q29. What Is Rdd?

RDDs (Resilient Distributed Datasets) are the basic abstraction in Apache Spark that represents the data coming into the system in object format. RDDs are used for in-memory computations on large clusters in a fault-tolerant manner. RDDs are read-only, partitioned collections of records that are:

Immutable – RDDs cannot be altered.

Resilient – If a node holding a partition fails, another node takes over the data.

Q30. What Are The Various Data Sources Available In Sparksql?

  1. Parquet file
  2. JSON Datasets
  3. Hive tables

Q31. What Do You Understand By Pair Rdd?

Special operations can be performed on RDDs in Spark using key/value pairs; such RDDs are referred to as Pair RDDs. Pair RDDs allow users to access each key in parallel. They have a reduceByKey() method that collects data based on each key and a join() method that combines different RDDs together based on the elements having the same key.
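
A brief spark-shell sketch in Scala of both methods (sample data made up):

    val sales  = sc.parallelize(Seq(("apples", 3), ("pears", 2), ("apples", 5)))
    val prices = sc.parallelize(Seq(("apples", 0.5), ("pears", 0.8)))

    // reduceByKey() aggregates the values of each key.
    val totals = sales.reduceByKey(_ + _)      // (apples,8), (pears,2)

    // join() combines two pair RDDs on their matching keys.
    val joined = totals.join(prices)           // (apples,(8,0.5)), (pears,(2,0.8))
    joined.collect().foreach(println)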

Q32. How Can You Achieve High Availability In Apache Spark?

  • Implementing single node recovery with local file system
  • Using StandBy Masters with Apache ZooKeeper.

Q33. What Makes Apache Spark Good At Low-latency Workloads Like Graph Processing And Machine Learning?

Apache Spark stores data in-memory for faster model building and training. Machine learning algorithms require multiple iterations to generate a resulting optimal model and similarly graph algorithms traverse all the nodes and edges.

Running these low-latency workloads that need multiple iterations in memory leads to increased performance. Less disk access and controlled network traffic make a huge difference when there is a lot of data to be processed.

Q34. How Can Spark Be Connected To Apache Mesos?

To connect Spark with Mesos:

  1. Configure the Spark driver program to connect to Mesos. The Spark binary package should be in a location accessible by Mesos; or
  2. Install Apache Spark in the same location as Apache Mesos and configure the property ‘spark.mesos.executor.home’ to point to the location where it is installed (a configuration sketch of this second option follows the list).
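
A hedged Scala sketch of the second approach; the Mesos master URL and the install path are placeholders:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("SparkOnMesos")
      .master("mesos://mesos-master.example.com:5050")          // Mesos master URL
      .config("spark.mesos.executor.home", "/opt/spark")        // Spark location on the Mesos agents
      .getOrCreate()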

Q35. What Is Catalyst Framework?

The Catalyst framework is a new optimization framework present in Spark SQL. It allows Spark to automatically transform SQL queries by adding new optimizations to build a faster processing system.

Q36. What Is Shark?

Most data users know only SQL and are not good at programming. Shark is a tool developed for people who come from a database background – to access Scala MLlib capabilities through a Hive-like SQL interface. The Shark tool helps data users run Hive on Spark – offering compatibility with the Hive metastore, queries and data.

Q37. Why Is Blinkdb Used?

BlinkDB is a query engine for executing interactive SQL queries on huge volumes of data and renders query results marked with meaningful error bars. BlinkDB helps users balance ‘query accuracy’ with response time.

Q38. What Are The Disadvantages Of Using Apache Spark Over Hadoop Mapreduce?

Apache Spark does not scale well for compute-intensive jobs and consumes a large number of system resources. Apache Spark’s in-memory capability at times becomes a major roadblock for the cost-efficient processing of big data. Also, Spark does not have its own file management system and hence needs to be integrated with other cloud-based data platforms or Apache Hadoop.

Q39. How Can You Trigger Automatic Clean-ups In Spark To Handle Accumulated Metadata?

You can trigger the clean-ups by setting the parameter ‘spark.cleaner.ttl’ or by dividing the long running jobs into different batches and writing the intermediary results to the disk.

Q40. Explain About The Different Cluster Managers In Apache Spark?

The three different cluster managers supported in Apache Spark are:

  1. YARN
  2. Apache Mesos – Has rich resource scheduling capabilities and is well suited to run Spark along with other applications. It is advantageous when several users run interactive shells because it scales down the CPU allocation between commands.
  3. Standalone deployments – Well suited for new deployments which only run Spark and are easy to set up.

Q41. Explain About The Core Components Of A Distributed Spark Application.?

Driver: The process that runs the main() method of the program to create RDDs and perform transformations and actions on them.

Executor: The worker processes that run the individual tasks of a Spark job.

Cluster Manager: A pluggable component in Spark used to launch executors and drivers. The cluster manager allows Spark to run on top of other external managers like Apache Mesos or YARN.

Q42. Explain About The Popular Use Cases Of Apache Spark?

Apache Spark is mainly used for:

  • Iterative machine learning.
  • Interactive data analytics and processing.
  • Stream processing
  • Sensor data processing

Q43. What Are The Languages Supported By Apache Spark For Developing Big Data Applications?

Scala, Java, Python, R and Clojure

Q44. Hadoop Uses Replication To Achieve Fault Tolerance. How Is This Achieved In Apache Spark?

The data storage model in Apache Spark is based on RDDs. RDDs help achieve fault tolerance through lineage. An RDD always has the information on how it was built from other datasets. If any partition of an RDD is lost due to failure, lineage helps rebuild only that particular lost partition.

Q45. Define A Worker Node.?

A node that can run the Spark application code in a cluster can be called a worker node. A worker node can have more than one worker, which is configured by setting the SPARK_WORKER_INSTANCES property in the spark-env.sh file. Only one worker is started if the SPARK_WORKER_INSTANCES property is not defined.

Q46. What Do You Understand By Executor Memory In A Spark Application?

Every Spark application has the same fixed heap size and fixed number of cores for each Spark executor. The heap size is what is referred to as the Spark executor memory, which is controlled with the spark.executor.memory property or the --executor-memory flag.

Every Spark application will have one executor on each worker node. The executor memory is basically a measure of how much of the worker node’s memory the application will utilize.
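
A minimal Scala sketch of setting the executor memory programmatically; the 4g value is just an example, and the same effect can be achieved with the --executor-memory flag of spark-submit:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("ExecutorMemoryExample")
      .config("spark.executor.memory", "4g")   // heap size for each executor
      .getOrCreate()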

Q47. What Is A Sparse Vector?

A sparse vector has two parallel arrays –one for indices and the other for values. These vectors are used for storing non-zero entries to save space.

Q48. When Running Spark Applications, Is It Necessary To Install Spark On All The Nodes Of Yarn Cluster?

Spark need not be installed when running a job under YARN or Mesos, because Spark can execute on top of YARN or Mesos clusters without requiring any change to the cluster.

Q49. How Can You Compare Hadoop And Spark In Terms Of Ease Of Use?

Hadoop MapReduce requires programming in Java which is difficult, though Pig and Hive make it considerably easier. Learning Pig and Hive syntax takes time. Spark has interactive APIs for different languages like Java, Python or Scala and also includes Shark i.e. Spark SQL for SQL lovers – making it comparatively easier to use than Hadoop.

Q50. What Are The Key Features Of Apache Spark That You Like?

  • Spark provides advanced analytic options like graph algorithms, machine learning, streaming data, etc
  • It has built-in APIs in multiple languages like Java, Scala, Python and R
  • It has good performance gains, as it helps run an application in the Hadoop cluster ten times faster on disk and 100 times faster in memory.