Impetus Technologies
AstaGuru Auction House Interview Questions and Answers
Q1. Difference between partitioning and bucketing; types of joins in Spark; optimization techniques in Spark; broadcast variables and broadcast joins; ORC vs Parquet; RDD vs DataFrame; project architecture and responsibilities for a Big Data Engineer role.
Explaining partitioning, bucketing, joins, optimization, broadcast variables, ORC vs Parquet, RDD vs DataFrame, and project architecture and responsibilities for a Big Data Engineer role.
Partitioning is dividing data into smaller chunks for parallel processing, while bucketing is organizing data into buckets based on a hash function.
Types of joins in Spark include inner join, outer join, left join, right join, and full outer join.
Optimization techniques in Spark include caching, broadcast joins, and reducing shuffles through sensible partitioning.
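The broadcast-join idea mentioned above can be sketched in plain Python (this is an illustration of the mechanism, not the Spark API; the table data and names are made up): the small table is copied to every partition so the join happens locally, with no shuffle of the large side.

```python
# Minimal sketch of a broadcast (map-side) join.
# In Spark the small table is shipped to every executor;
# here each "partition" joins against an in-memory copy of it.

small_table = {1: "Electronics", 2: "Books"}  # broadcast side: dept_id -> name

# large side, pre-split into partitions: (order_id, dept_id)
partitions = [
    [(101, 1), (102, 2)],
    [(103, 1), (104, 3)],  # dept_id 3 has no match
]

def join_partition(rows, broadcast):
    # inner join: keep only rows whose dept_id exists in the broadcast copy
    return [(oid, did, broadcast[did]) for oid, did in rows if did in broadcast]

result = [row for part in partitions for row in join_partition(part, small_table)]
print(result)
# the unmatched row (104, 3) is dropped by the inner join
```

In real Spark code the same effect comes from `broadcast(small_df)` in a join, or automatically when the small side is below `spark.sql.autoBroadcastJoinThreshold`.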
Q2. (Second round) How to handle upserts in Spark?
Core Spark has no built-in upsert; the usual approach is Delta Lake's MERGE (or an equivalent join-and-overwrite pattern).
Use Delta Lake's MERGE INTO (SQL) or DeltaTable.merge() (DataFrame API) to handle upserts.
Match target and source rows on the primary key column(s).
whenMatchedUpdate clauses update existing rows; whenNotMatchedInsert clauses insert new rows.
Without Delta Lake, emulate an upsert by joining the new data with the existing data and rewriting the output.
Example: deltaTable.alias("t").merge(updates.alias("s"), "t.id = s.id").whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()
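The matched-update / not-matched-insert semantics can be shown in plain Python (a sketch of the logic only; in Spark this is typically Delta Lake's MERGE, and the row data here is invented):

```python
# Upsert (merge) semantics: rows whose key exists in the target are
# updated; rows with a new key are inserted.

target = {1: {"id": 1, "name": "alice"}, 2: {"id": 2, "name": "bob"}}
updates = [{"id": 2, "name": "bobby"}, {"id": 3, "name": "carol"}]

def upsert(target, updates, key="id"):
    for row in updates:
        # whenMatched -> update, whenNotMatched -> insert
        target[row[key]] = row
    return target

merged = upsert(target, updates)
print(sorted(merged))  # keys after the merge: [1, 2, 3]
```

In Delta Lake the equivalent is `merge(source, "t.id = s.id")` followed by `whenMatchedUpdateAll()` and `whenNotMatchedInsertAll()`.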
Q3. SQL question Remove duplicate records 5th highest salary department wise
Remove duplicate records and find the 5th highest salary per department using SQL.
Use the DISTINCT keyword (or ROW_NUMBER() = 1) to remove duplicate records.
Use DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) to rank salaries within each department.
Filter for rank = 5 to get the 5th highest salary in each department.
Note that ORDER BY with LIMIT/OFFSET only works for a single group; a window function is needed for a per-department answer.
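The pattern above can be demonstrated with Python's built-in sqlite3 (SQLite supports window functions from 3.25; the table and column names are illustrative):

```python
# Dedupe with DISTINCT, then DENSE_RANK() per department to pick the
# Nth highest salary. HR has fewer than 5 distinct salaries, so only
# IT appears in the result.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (dept TEXT, salary INTEGER)")
rows = [("IT", s) for s in (90, 80, 80, 70, 60, 50, 40)] + [("HR", 55), ("HR", 45)]
conn.executemany("INSERT INTO emp VALUES (?, ?)", rows)

query = """
WITH dedup AS (SELECT DISTINCT dept, salary FROM emp),
ranked AS (
    SELECT dept, salary,
           DENSE_RANK() OVER (PARTITION BY dept ORDER BY salary DESC) AS rk
    FROM dedup
)
SELECT dept, salary FROM ranked WHERE rk = 5
"""
print(conn.execute(query).fetchall())
# IT's distinct salaries descending are 90, 80, 70, 60, 50, 40 -> 5th is 50
```

Swapping `DENSE_RANK` for `ROW_NUMBER` would change the behaviour when ties remain after deduplication.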
Q4. Spark memory optimisation techniques
Reduce Spark memory pressure through caching choices, partitioning, off-heap storage, and memory configuration.
Use broadcast variables to reduce memory usage
Use persist() or cache() to store RDDs in memory
Use partitioning to reduce shuffling and memory usage
Use off-heap memory to avoid garbage collection overhead
Tune memory settings such as spark.driver.memory and spark.executor.memory
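The tuning settings listed above usually live in spark-defaults.conf or are passed via spark-submit; a fragment might look like this (the values are placeholders to be tuned per cluster, not recommendations):

```properties
# Illustrative spark-defaults.conf fragment
spark.driver.memory              4g
spark.executor.memory            8g
spark.memory.fraction            0.6
spark.memory.offHeap.enabled     true
spark.memory.offHeap.size        2g
spark.sql.shuffle.partitions     200
```

`spark.memory.fraction` controls the share of heap used for execution and storage, and the off-heap settings move cached data outside the JVM heap to reduce GC pressure.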
Q5. Hadoop serialisation techniques.
Hadoop serialisation techniques are used to convert data into a format that can be stored and processed in Hadoop.
Hadoop uses Writable interface for serialisation and deserialisation of data
Avro, Thrift, and Protocol Buffers are popular serialisation frameworks used in Hadoop
Serialisation can be customised using custom Writable classes or external libraries
Serialisation plays a crucial role in Hadoop performance and efficiency
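For the Avro framework mentioned above, serialisation is driven by a JSON schema; a minimal example record schema might look like this (the record and field names are invented for illustration):

```json
{
  "type": "record",
  "name": "ClickEvent",
  "fields": [
    {"name": "userId", "type": "long"},
    {"name": "url", "type": "string"},
    {"name": "ts", "type": "long"}
  ]
}
```

Because the schema travels with the data, Avro files remain self-describing and support schema evolution, which is one reason it is popular in Hadoop pipelines.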
Q6. Java collection vs collections
Collection is an interface, while Collections is a utility class.
The Collection interface is the root of the hierarchy for storing and manipulating groups of objects.
Collections is a utility class in java.util that provides static helper methods for working with collections.
Both belong to the Java Collections Framework: Collection is its root interface, Collections its companion utility class.
Examples of Collection subtypes include List and Set (Map is a separate hierarchy in the framework); examples of Collections methods include sort(), reverse(), and unmodifiableList().