10+ Moringa Techsolv Interview Questions and Answers
Q1. How do you handle a changing schema from the source? What are the common issues faced in Hadoop, and how do you resolve them?
Handling a changing schema from the source in Hadoop
Use file formats that support schema evolution, such as Avro or Parquet, to handle schema changes
Implement a flexible ETL pipeline that can handle schema changes
Use tools like Apache NiFi to dynamically adjust schema during ingestion
Common issues include data loss, data corruption, and performance degradation
Resolve issues by implementing proper testing, monitoring, and backup strategies
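A minimal PySpark sketch of reading Parquet data whose schema has drifted (the path is hypothetical; mergeSchema reconciles differing column sets across files):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema_evolution_demo").getOrCreate()

# Merge column sets across Parquet files whose schemas changed over time
df = spark.read.option("mergeSchema", "true").parquet("/data/events/")
df.printSchema()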
Q2. Write PySpark code to read a CSV file and show the top 10 records.
PySpark code to read a CSV file and show the top 10 records.
Import the necessary libraries
Create a SparkSession
Read the CSV file using the SparkSession
Display the top 10 records using the show() method
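A minimal runnable sketch of these steps (the file path is hypothetical):

from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("read_csv_demo").getOrCreate()

# Read the CSV file with a header row and inferred column types
df = spark.read.csv("/data/input.csv", header=True, inferSchema=True)

# Display the top 10 records
df.show(10)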
Q3. What optimization techniques are applied in PySpark code?
Optimization techniques in PySpark code include partitioning, caching, and using broadcast variables.
Partitioning data based on key columns to optimize join operations
Caching frequently accessed data in memory to avoid recomputation
Using broadcast variables to efficiently share small data across nodes
Using appropriate data types and avoiding unnecessary type conversions
Avoiding shuffling of data by using appropriate transformations and actions
Using appropriate data structures...
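A minimal sketch of two of these techniques, broadcast joins and caching (paths, table contents, and the join key are hypothetical):

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("optimization_demo").getOrCreate()

large_df = spark.read.parquet("/data/transactions/")    # large fact table
small_df = spark.read.parquet("/data/country_codes/")   # small lookup table

# Broadcast the small lookup table so the join avoids shuffling the large table
joined = large_df.join(broadcast(small_df), on="country_code", how="left")

# Cache a frequently reused intermediate result to avoid recomputation
joined.cache()
joined.count()  # first action materializes the cache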
Q4. Write PySpark code to change a column name and divide one column by another.
PySpark code to change a column name and divide one column by another.
Use 'withColumnRenamed' method to change column name
Use 'withColumn' method to divide one column by another column
Example: df = df.withColumnRenamed('old_name', 'new_name'), then df = df.withColumn('ratio', col('col1') / col('col2')); see the sketch below
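A fuller runnable sketch (the column names and sample data are hypothetical):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("rename_divide_demo").getOrCreate()
df = spark.createDataFrame([(10, 2), (30, 5)], ["col1", "col2"])

# Change a column name
df = df.withColumnRenamed("col1", "numerator")

# Divide one column by another into a new column
df = df.withColumn("ratio", col("numerator") / col("col2"))
df.show()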
Q5. What are columnar storage, Parquet, and Delta? Why are they used?
Columnar storage is a data storage format that stores data in columns rather than rows, improving query performance.
Columnar storage stores data in a column-wise manner instead of row-wise.
It improves query performance by reducing the amount of data that needs to be read from disk.
Parquet is a columnar storage file format that is optimized for big data workloads.
It is used in Apache Spark and other big data processing frameworks.
Delta is an open-source storage layer that provides ACID transactions and schema enforcement on top of Parquet-based data lakes.
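A minimal sketch of writing the same DataFrame as Parquet and as Delta (paths are hypothetical, and the Delta write assumes the delta-spark package is available on the cluster):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("storage_demo").getOrCreate()
df = spark.range(100)  # placeholder data

# Parquet: columnar files with compression and predicate pushdown
df.write.mode("overwrite").parquet("/tmp/demo_parquet")

# Delta: Parquet files plus a transaction log that enables ACID operations
# (requires the delta-spark package configured on the cluster)
df.write.format("delta").mode("overwrite").save("/tmp/demo_delta")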
Q6. Given a dictionary, find the greatest number for the same key in Python.
Find the greatest number for the same key in a Python dictionary.
Use the max() function to find the maximum value associated with each key in the dictionary.
Iterate through the dictionary and apply max() to each key's values, as sketched below.
If the dictionary is nested, use recursion to iterate through all the keys.
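A minimal sketch, assuming the dictionary maps each key to a list of numbers (the sample data is hypothetical):

scores = {"a": [3, 7, 1], "b": [10, 2], "c": [5]}

# Greatest number for each key
greatest_per_key = {key: max(values) for key, values in scores.items()}
print(greatest_per_key)  # {'a': 7, 'b': 10, 'c': 5}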
Q7. RDDs vs DataFrames: which is better and why?
DataFrames are better than RDDs due to their optimized performance and ease of use.
DataFrames are optimized for better performance than RDDs.
DataFrames have a schema, making it easier to work with structured data.
DataFrames support SQL queries and can be used with Spark SQL.
RDDs are more low-level and require more manual optimization.
RDDs are useful for unstructured data or when fine-grained control is needed.
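A minimal sketch contrasting the two APIs on the same aggregation (the sample data is hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd_vs_df_demo").getOrCreate()
pairs = [("a", 1), ("b", 2), ("a", 3)]

# RDD API: low-level, manual control over each transformation
rdd_result = (spark.sparkContext.parallelize(pairs)
              .reduceByKey(lambda x, y: x + y)
              .collect())

# DataFrame API: declarative, optimized by the Catalyst query planner
df_result = (spark.createDataFrame(pairs, ["key", "value"])
             .groupBy("key")
             .sum("value")
             .collect())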
Q8. Write a function to check if a number is an Armstrong Number.
Function to check if a number is an Armstrong Number
An Armstrong Number is a number that is equal to the sum of its own digits raised to the power of the number of digits
To check if a number is an Armstrong Number, we need to calculate the sum of each digit raised to the power of the number of digits
If the sum is equal to the original number, then it is an Armstrong Number
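A minimal sketch of such a function (the function name is arbitrary):

def is_armstrong(n: int) -> bool:
    digits = str(n)
    power = len(digits)
    # Sum each digit raised to the power of the number of digits
    return n == sum(int(d) ** power for d in digits)

print(is_armstrong(153))  # True: 1**3 + 5**3 + 3**3 == 153
print(is_armstrong(10))   # False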
Q9. How do you connect SQL Server to Databricks?
To connect SQL Server to Databricks, use JDBC/ODBC drivers and configure the connection settings.
Install the appropriate JDBC/ODBC driver for SQL Server
Configure the connection settings in Databricks
Use the JDBC/ODBC driver to establish the connection
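A minimal sketch of a JDBC read in a Databricks notebook (server, database, table, and credentials are hypothetical; the Microsoft SQL Server JDBC driver must be available on the cluster):

# 'spark' is the SparkSession that Databricks notebooks provide by default
jdbc_url = "jdbc:sqlserver://myserver.database.windows.net:1433;databaseName=mydb"

df = (spark.read
      .format("jdbc")
      .option("url", jdbc_url)
      .option("dbtable", "dbo.customers")
      .option("user", "my_user")
      .option("password", "my_password")
      .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
      .load())
df.show(5)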
Q10. How do you copy data from on-premise to Azure cloud?
Data can be copied from on-premise to Azure cloud using various methods like Azure Data Factory, Azure Storage Explorer, Azure Data Migration Service, etc.
Use Azure Data Factory to create data pipelines for moving data from on-premise to Azure cloud
Utilize Azure Storage Explorer to manually copy data from on-premise to Azure Blob Storage
Leverage Azure Data Migration Service for migrating large volumes of data from on-premise databases to Azure SQL Database
Consider using Azure...
Q11. How to initiate SparkContext?
To initiate SparkContext, create a SparkConf object and pass it to the SparkContext constructor.
Create a SparkConf object with app name and master URL
Pass the SparkConf object to SparkContext constructor
Example: conf = SparkConf().setAppName('myApp').setMaster('local[*]'), then sc = SparkContext(conf=conf); see the sketch below
Stop SparkContext using sc.stop()
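A minimal runnable sketch (the app name and master URL are placeholders):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("myApp").setMaster("local[*]")
sc = SparkContext(conf=conf)

# ... run RDD operations here ...

sc.stop()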
Q12. Explain the project architecture in detail.
The project architecture involves the design and organization of data pipelines and systems for efficient data processing and storage.
The architecture includes components such as data sources, data processing frameworks, storage systems, and data delivery mechanisms.
It focuses on scalability, reliability, and performance to handle large volumes of data.
Example: A project architecture may involve using Apache Kafka for real-time data ingestion, Apache Spark for data processing...
Q13. What is integration runtime in ADF?
Integration runtime in ADF is the compute infrastructure used to run activities in Azure Data Factory pipelines.
Integration runtime is a managed compute infrastructure in Azure Data Factory.
It is used to run activities within pipelines, such as data movement or data transformation tasks.
Integration runtime can be auto-scaled based on workload requirements.
It supports various data integration scenarios, including batch processing and real-time data processing.
Examples of integration runtime types include the Azure integration runtime, the self-hosted integration runtime, and the Azure-SSIS integration runtime.
Q14. Optimisation techniques used
Optimisation techniques used in data engineering
Partitioning data to improve query performance
Using indexing to speed up data retrieval
Implementing caching mechanisms to reduce data access time
Optimizing data storage formats for efficient storage and processing
Parallel processing and distributed computing for faster data processing
Using compression techniques to reduce storage space and improve data transfer
Applying query optimization techniques like query rewriting and query...
Q15. Optimising technique that you have used
I have used partitioning and indexing to optimize query performance.
Implemented partitioning on large tables to improve query performance by limiting the data scanned
Created indexes on frequently queried columns to speed up data retrieval
Utilized clustering keys to physically organize data on disk for faster access
Q16. Spark optimization techniques
Spark optimization techniques involve partitioning, caching, and tuning resources for efficient data processing.
Partitioning data to distribute workload evenly
Caching frequently accessed data to avoid recomputation
Tuning resources like memory allocation and parallelism
Using broadcast variables for small lookup tables
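A minimal sketch of tuning resources at session creation (the values are illustrative, not recommendations):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("tuned_job")
         .config("spark.sql.shuffle.partitions", "200")               # shuffle parallelism
         .config("spark.executor.memory", "4g")                       # executor memory (cluster-dependent)
         .config("spark.sql.autoBroadcastJoinThreshold", "10485760")  # broadcast tables under ~10 MB
         .getOrCreate())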