10+ Greenlight Financial Technology Interview Questions and Answers
Q1. Write code to print a sentence reversed word by word.
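A minimal Python sketch of one way to do this, assuming words are separated by whitespace:

    def reverse_words(sentence):
        # Split on whitespace, reverse the word order, and rejoin.
        return " ".join(sentence.split()[::-1])

    print(reverse_words("hello from the interview"))  # interview the from hello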
Q2. Write code for printing duplicate numbers in a list.
This code identifies and prints duplicate numbers from a given list using a dictionary to track occurrences.
Use a dictionary to count occurrences of each number.
Iterate through the list and update the count in the dictionary.
Print numbers that have a count greater than 1.
Example: For the list [1, 2, 3, 2, 4, 3], the output should be 2 and 3.
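A short Python sketch of the dictionary approach described above:

    def print_duplicates(numbers):
        counts = {}
        # Count how many times each number appears.
        for n in numbers:
            counts[n] = counts.get(n, 0) + 1
        # Print every number seen more than once.
        for n, count in counts.items():
            if count > 1:
                print(n)

    print_duplicates([1, 2, 3, 2, 4, 3])  # prints 2 and 3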
Q3. Difference between cache and persist, repartition and coalesce.
Cache stores data at a default storage level, while persist lets you choose one. Repartition redistributes data with a full shuffle; coalesce reduces partitions without one.
Cache: Stores the DataFrame at the default storage level for faster access in subsequent operations.
Persist: Like cache, but accepts an explicit storage level such as MEMORY_ONLY or DISK_ONLY, trading memory use against recomputation cost.
Repartition: Increases or decreases the number of partitions, shuffling data across nodes.
Coalesce: Reduces the number of partitions without a full shuffle, making it cheaper than repartition when scaling down.
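A PySpark sketch of the four operations; the DataFrames here are toy examples:

    from pyspark import StorageLevel
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("cache-vs-persist").getOrCreate()

    df1 = spark.range(1_000_000)
    df1.cache()  # shorthand for persist() with the default storage level

    df2 = spark.range(1_000_000)
    df2.persist(StorageLevel.DISK_ONLY)  # persist lets you pick the level

    wide = df1.repartition(200)   # full shuffle into 200 partitions
    narrow = wide.coalesce(10)    # merge down to 10 partitions, no full shuffle
    print(narrow.rdd.getNumPartitions())  # 10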
Q4. Elaborate on Spark optimization techniques, types of transformations, and shuffling.
Spark optimization techniques enhance performance through efficient data processing and resource management.
Use DataFrames and Datasets for optimized execution plans.
Leverage lazy evaluation to minimize unnecessary computations.
Apply partitioning to distribute data evenly across nodes, e.g., using 'repartition' or 'coalesce'.
Minimize shuffling by using narrow transformations like 'map' and 'filter' instead of wide transformations like 'groupBy'.
Broadcast smaller datasets to all executors to avoid shuffling them during joins.
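A PySpark sketch of a broadcast join, one of the techniques listed above; the paths and column name are hypothetical:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.appName("broadcast-join").getOrCreate()
    orders = spark.read.parquet("/data/orders")        # large fact table (hypothetical)
    countries = spark.read.parquet("/data/countries")  # small lookup table (hypothetical)

    # Shipping the small table to every executor avoids shuffling the large side.
    joined = orders.join(broadcast(countries), on="country_code", how="left")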
Q5. How will you handle data skewness in Spark?
Data skewness can be handled in Spark by using techniques like partitioning, bucketing, and broadcasting.
Partitioning the data based on a key column can distribute the data evenly across the cluster.
Bucketing can further divide the data into smaller buckets based on a hash function.
Broadcasting small tables can reduce the amount of data shuffled across the network.
Using dynamic allocation can also help in handling data skewness by allocating more resources to tasks that are taking longer.
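Beyond the techniques listed above, salting is another widely used fix for a single hot key: append a random suffix so the key's rows spread across partitions. A minimal PySpark sketch with a toy dataset (the events DataFrame and user_id column are hypothetical):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("skew-salting").getOrCreate()
    events = spark.createDataFrame(
        [("u1",), ("u1",), ("u1",), ("u2",)], ["user_id"]
    )  # toy data where u1 is the hot key

    # Append a random suffix so rows for a hot key spread across partitions.
    salt = (F.rand() * 10).cast("int").cast("string")
    salted = events.withColumn("salted_key", F.concat_ws("_", F.col("user_id"), salt))

    # Aggregate on the salted key first, then combine the partial results.
    partial = salted.groupBy("salted_key", "user_id").count()
    final = partial.groupBy("user_id").agg(F.sum("count").alias("count"))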
Q6. Types of tables in Hive and the differences between them.
Hive supports two types of tables: Managed and External, each with distinct data management and storage characteristics.
Managed Tables: Hive manages both the schema and the data. Dropping the table deletes the data.
External Tables: Hive manages only the schema. Dropping the table does not delete the data, which remains in the external storage.
Use Managed Tables for temporary data that can be recreated easily.
Use External Tables for data that is shared with other applications or that must survive the table being dropped.
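A PySpark sketch of both table types via Hive DDL; the table names and location are hypothetical:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hive-tables").enableHiveSupport().getOrCreate()

    # Managed table: Hive owns data and metadata; DROP TABLE deletes the files.
    spark.sql("CREATE TABLE staging_events (id INT, payload STRING)")

    # External table: Hive owns only the metadata; DROP TABLE leaves the
    # files at the given location untouched.
    spark.sql("""
        CREATE EXTERNAL TABLE shared_events (id INT, payload STRING)
        LOCATION '/data/shared/events'
    """)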
Q7. What Azure solutions have you worked with?
I have worked with Azure Data Factory, Azure Databricks, and Azure SQL Database.
Azure Data Factory for data integration and orchestration
Azure Databricks for big data processing and analytics
Azure SQL Database for relational database management
Q8. Difference between RDD, DataFrame, and Dataset.
RDD, DataFrame, and Dataset are core abstractions in Apache Spark for handling distributed data processing.
RDD (Resilient Distributed Dataset) is the fundamental data structure in Spark, representing an immutable distributed collection of objects.
DataFrames are similar to RDDs but are optimized for performance and allow for schema-based operations, making them easier to use.
Datasets combine the benefits of RDDs and DataFrames, providing type safety and the ability to use both functional transformations and relational queries (available in Scala and Java).
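A PySpark sketch contrasting RDDs and DataFrames (Datasets are a Scala/Java API, so they have no direct PySpark equivalent):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("abstractions").getOrCreate()

    # RDD: an untyped distributed collection of Python objects, no schema.
    rdd = spark.sparkContext.parallelize([("alice", 30), ("bob", 17)])
    adults_rdd = rdd.filter(lambda pair: pair[1] >= 18)

    # DataFrame: the same data with a schema, optimized by Catalyst.
    df = spark.createDataFrame(rdd, ["name", "age"])
    df.filter(df.age >= 18).show()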
Q9. Connecting Spark to Azure SQL Database.
Connecting Spark to Azure SQL Database involves configuring JDBC and using Spark's DataFrame API for data operations.
Use the JDBC driver for Azure SQL Database to establish a connection.
Example connection string: 'jdbc:sqlserver://<server>.database.windows.net:1433;database=<database>;user=<user>@<server>;password=<password>'
Utilize Spark's DataFrame API to read and write data, e.g., df.write.jdbc(url, table, properties=properties).
Ensure that the Azure SQL Database firewall allows access from the Spark cluster's IP addresses.
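A PySpark sketch of the read/write flow; the server, database, credentials, and table names are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("azure-sql").getOrCreate()

    # Requires the Microsoft SQL Server JDBC driver jar on the Spark classpath.
    url = "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb"
    properties = {
        "user": "myuser@myserver",
        "password": "mypassword",
        "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    }

    # Read a table into a DataFrame, transform, and write the result back.
    df = spark.read.jdbc(url=url, table="dbo.customers", properties=properties)
    df.write.jdbc(url=url, table="dbo.customers_copy", mode="append", properties=properties)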
Q10. Discuss a project and its architecture.
A data engineering project focused on building a scalable ETL pipeline for healthcare data analytics.
Architecture includes data ingestion, processing, and storage layers.
Used Apache Kafka for real-time data streaming from various sources.
Implemented Apache Spark for batch processing and data transformation.
Stored processed data in Amazon Redshift for analytics and reporting.
Utilized Airflow for orchestrating ETL workflows and scheduling tasks.
Q11. What tech stack was used?
The tech stack used includes Python, SQL, Apache Spark, Hadoop, AWS, and Docker.
Python for data processing and analysis
SQL for database querying
Apache Spark for big data processing
Hadoop for distributed storage and processing
AWS for cloud services
Docker for containerization
Q12. Types of variables in Scala
Scala has two kinds of variables: mutable and immutable.
Mutable variables are declared with the var keyword and can be reassigned.
Immutable variables are declared with the val keyword and cannot be reassigned once initialized.
Example: var mutableVariable = 10; val immutableVariable = 20;