10+ MResult Services Interview Questions and Answers
Q1. Write code for printing duplicate numbers in a list.
Code to print duplicate numbers in a list.
Iterate through the list and keep track of the count of each number using a dictionary.
Print the numbers that have a count greater than 1.
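A minimal Scala sketch of this approach, assuming a hard-coded sample list as the input:

    import scala.collection.mutable

    val numbers = List(1, 2, 3, 2, 4, 5, 1, 2)                          // sample input (assumption)
    val counts = mutable.Map.empty[Int, Int]
    for (n <- numbers) counts(n) = counts.getOrElse(n, 0) + 1           // track the count of each number
    counts.collect { case (num, c) if c > 1 => num }.foreach(println)   // prints the duplicates 1 and 2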
Q2. Write code to print reverse of a sentence word by word.
Code to print reverse of a sentence word by word.
Split the sentence into words using space as delimiter
Store the words in an array
Print the words in reverse order
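A short Scala sketch of these steps, using a sample sentence as the assumed input:

    val sentence = "Spark makes data processing simple"   // sample input (assumption)
    val words = sentence.split(" ")                        // split on spaces into an array of words
    println(words.reverse.mkString(" "))                   // prints "simple processing data makes Spark"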
Q3. Difference between cache and persist, repartition and coalesce.
Cache and persist are used to store data in memory. Repartition and coalesce are used to change the number of partitions.
Cache stores the data in memory for faster access while persist allows the user to choose the storage level.
Repartition can either increase or decrease the number of partitions and always performs a full shuffle, while coalesce only decreases the number of partitions and avoids a full shuffle.
Cache, persist, repartition, and coalesce are all lazy; none of them is an action, and they only take effect once an action is executed.
Cache and persist are useful for iterative algorithms that reuse the same data, while repartition and coalesce are used to tune parallelism and partition sizes (see the sketch below).
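A hedged Spark Scala sketch of the difference, assuming an existing SparkSession named spark:

    import org.apache.spark.storage.StorageLevel

    val df = spark.range(0, 1000000).toDF("id")
    df.cache()                                 // default storage level (MEMORY_AND_DISK for DataFrames)

    val other = spark.range(0, 1000).toDF("id")
    other.persist(StorageLevel.DISK_ONLY)      // persist lets you choose the storage level explicitly

    val widened  = df.repartition(200)         // transformation: full shuffle, can raise or lower the partition count
    val narrowed = df.coalesce(10)             // transformation: merges partitions without a full shuffle

    widened.count()                            // all of the above are lazy until an action like count() runs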
Q4. Elaboration of Spark optimization techniques. Types of transformations, shuffling.
Spark optimization techniques include partitioning, caching, and using appropriate transformations.
Partitioning data can improve performance by reducing shuffling.
Caching frequently used data can reduce the need for recomputation.
Transformations like filter, map, and reduceByKey can be used to optimize data processing.
Shuffling can be minimized by using operations like reduceByKey instead of groupByKey.
Broadcasting small datasets can improve performance by reducing network traffic during joins (see the sketch below).
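A small Scala sketch of two of these techniques, assuming an existing SparkSession named spark; the data is made up for illustration:

    import org.apache.spark.sql.functions.broadcast

    val sc = spark.sparkContext
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

    val sums = pairs.reduceByKey(_ + _)               // combines values within each partition before the shuffle
    val grouped = pairs.groupByKey().mapValues(_.sum) // ships every value across the network before aggregating

    val large = spark.range(0, 1000000).toDF("id")
    val small = spark.createDataFrame(Seq((0L, "zero"), (1L, "one"))).toDF("id", "label")
    val joined = large.join(broadcast(small), "id")   // broadcasts the small side, so the large side is not shuffled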
Q5. Hive types of tables and difference between them
Hive has two types of tables - Managed and External. Managed tables are managed by Hive, while External tables are managed outside of Hive.
Managed tables are created using 'CREATE TABLE' command and data is stored in Hive's warehouse directory
External tables are created using 'CREATE EXTERNAL TABLE' command and data is stored outside of Hive's warehouse directory
When a Managed table is dropped, both the metadata and the underlying data are deleted; when an External table is dropped, only the metadata is removed and the data stays in place
Hive has full control over the lifecycle of Managed tables, while External tables are typically used for data shared with other tools (see the sketch below)
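A hedged sketch in Spark SQL (Scala), assuming a Hive-enabled SparkSession named spark; the table names and location path are placeholders:

    spark.sql("""
      CREATE TABLE IF NOT EXISTS sales_managed (id INT, amount DOUBLE)
      STORED AS PARQUET
    """)                                      // data is stored under Hive's warehouse directory

    spark.sql("""
      CREATE EXTERNAL TABLE IF NOT EXISTS sales_external (id INT, amount DOUBLE)
      STORED AS PARQUET
      LOCATION '/data/external/sales'
    """)                                      // data stays at the external location

    spark.sql("DROP TABLE sales_managed")     // removes the metadata and deletes the files
    spark.sql("DROP TABLE sales_external")    // removes only the metadata; the files remain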
Q6. Difference between RDD, Dataframe, Dataset.
RDD, Dataframe, and Dataset are data structures in Apache Spark with different characteristics and functionalities.
RDD (Resilient Distributed Datasets) is a fundamental data structure in Spark that represents an immutable distributed collection of objects. It provides low-level APIs for distributed data processing and fault tolerance.
Dataframe is a distributed collection of data organized into named columns. It is similar to a table in a relational database and provides a higher-level API that is optimized by the Catalyst query planner.
Dataset combines the type safety of RDDs with the optimizations of Dataframes; it is a typed API available in Scala and Java (see the sketch below).
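A brief Scala sketch of the three APIs, assuming an existing SparkSession named spark:

    import spark.implicits._

    case class Person(name: String, age: Int)

    val rdd = spark.sparkContext.parallelize(Seq(Person("Ana", 30), Person("Raj", 25)))  // RDD: low-level, object-based
    val df  = rdd.toDF()                                  // DataFrame: named columns, optimized by Catalyst
    df.filter($"age" > 26).show()
    val ds  = rdd.toDS()                                  // Dataset: typed API with compile-time safety
    ds.filter(_.age > 26).show()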
Q7. Connecting Spark to Azure SQL Database.
Spark can connect to Azure SQL Database using JDBC driver.
Download and install the JDBC driver for Azure SQL Database.
Set up the connection string with the appropriate credentials.
Use the JDBC API to connect Spark to Azure SQL Database.
Example: val df = spark.read.jdbc(jdbcUrl, tableName, connectionProperties)
Ensure that the firewall rules for the Azure SQL Database allow access from the Spark cluster.
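A fuller version of the snippet above as a hedged sketch; the server, database, table names, and credentials are placeholders:

    val jdbcUrl = "jdbc:sqlserver://<server-name>.database.windows.net:1433;database=<db-name>"

    val connectionProperties = new java.util.Properties()
    connectionProperties.put("user", "<username>")
    connectionProperties.put("password", "<password>")
    connectionProperties.put("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")

    val df = spark.read.jdbc(jdbcUrl, "<tableName>", connectionProperties)        // read a table into a DataFrame
    df.show(5)

    df.write.mode("append").jdbc(jdbcUrl, "<targetTable>", connectionProperties)  // writing back works the same way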
Q8. Discuss project and its architecture.
Developed a data pipeline to process and analyze customer behavior data.
Used Apache Kafka for real-time data streaming
Implemented data processing using Apache Spark
Stored data in Hadoop Distributed File System (HDFS)
Used Tableau for data visualization
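A hypothetical Scala sketch of how such a pipeline could be wired with Spark Structured Streaming; the Kafka broker, topic, and HDFS paths are placeholders, not details from the original project:

    // assumes an existing SparkSession named spark and the spark-sql-kafka connector on the classpath
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "customer-events")
      .load()
      .selectExpr("CAST(value AS STRING) AS event")

    val query = events.writeStream
      .format("parquet")                                           // land the stream in HDFS for downstream analysis
      .option("path", "hdfs:///data/customer_events")
      .option("checkpointLocation", "hdfs:///checkpoints/customer_events")
      .start()

    query.awaitTermination()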
Q9. How will you handle data skewness in spark
Data skewness can be handled in Spark by using techniques like partitioning, bucketing, and broadcasting.
Partitioning the data based on a key column can distribute the data evenly across the cluster.
Bucketing can further divide the data into smaller buckets based on a hash function.
Broadcasting small tables can reduce the amount of data shuffled across the network.
Enabling dynamic allocation can also help by giving the application more executors when some tasks take much longer than others (see the sketch below).
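A hedged Scala sketch of two of these ideas (broadcast join and key salting), using made-up data and assuming an existing SparkSession named spark:

    import org.apache.spark.sql.functions._
    import spark.implicits._

    // a skewed fact table (most rows share a few customer_id values) and a small dimension table
    val largeDf = spark.range(0, 1000000).withColumn("customer_id", ($"id" % 3).cast("string"))
    val smallDf = Seq(("0", "Bronze"), ("1", "Silver"), ("2", "Gold")).toDF("customer_id", "tier")

    // broadcasting the small side avoids shuffling the skewed large side
    val joined = largeDf.join(broadcast(smallDf), "customer_id")

    // salting spreads a hot key over several partitions by appending a random suffix
    val salted = largeDf.withColumn(
      "salted_key", concat($"customer_id", lit("_"), (rand() * 10).cast("int").cast("string"))
    ).repartition($"salted_key")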
Q10. What Azure solutions have you worked with?
I have worked with Azure Data Factory, Azure Databricks, and Azure SQL Database.
Azure Data Factory for data integration and orchestration
Azure Databricks for big data processing and analytics
Azure SQL Database for relational database management
Q11. What tech stack are used
The tech stack used includes Python, SQL, Apache Spark, Hadoop, AWS, and Docker.
Python for data processing and analysis
SQL for database querying
Apache Spark for big data processing
Hadoop for distributed storage and processing
AWS for cloud services
Docker for containerization
Q12. types of Variables in Scala
Scala has two types of variables - mutable and immutable.
Scala has mutable variables that can be reassigned using the var keyword.
Scala also has immutable variables that cannot be reassigned once they are initialized using the val keyword.
Example: var mutableVariable = 10; val immutableVariable = 20;
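A short Scala sketch of the difference:

    var mutableVariable = 10
    mutableVariable = 15          // allowed: a var can be reassigned

    val immutableVariable = 20
    // immutableVariable = 25     // compile error: reassignment to val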