20+ Street Surge Technologies Interview Questions and Answers
Q1. What issues did you face in your project? What is a global parameter? Why do we need parameters in ADF? What are the APIs in Spark?
A combined answer covering project issues, parameters, and the Spark APIs.
Issues faced in project: data quality, scalability, performance, integration
Global parameter: a parameter that can be accessed across multiple components in a system
Parameters in ADF: used to pass values between activities in a pipeline
APIs in Spark: Spark SQL, Spark Streaming, MLlib, GraphX
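A minimal PySpark sketch touching these APIs, assuming a local pyspark installation; later sketches on this page assume an existing SparkSession named spark:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("api-demo").getOrCreate()

    # DataFrame / Spark SQL API
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
    df.createOrReplaceTempView("t")
    spark.sql("SELECT id, label FROM t WHERE id > 1").show()

    # The lower-level RDD API is still reachable from a DataFrame
    print(df.rdd.map(lambda row: row.id * 10).collect())  # [10, 20]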
Q2. What is the difference between supervised and unsupervised learning?
Supervised learning uses labeled data to train the model, while unsupervised learning uses unlabeled data.
Supervised learning requires a target variable to be predicted, while unsupervised learning does not.
In supervised learning, the model learns from labeled training data, whereas in unsupervised learning, the model finds patterns in unlabeled data.
Examples of supervised learning include regression and classification tasks, while clustering is a common unsupervised learning task.
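A hedged illustration of the contrast, assuming scikit-learn is installed; the toy data and model choices are illustrative, not from the interview:

    from sklearn.linear_model import LogisticRegression
    from sklearn.cluster import KMeans
    import numpy as np

    X = np.array([[1.0], [2.0], [8.0], [9.0]])
    y = np.array([0, 0, 1, 1])                 # labels present -> supervised

    clf = LogisticRegression().fit(X, y)       # learns the label mapping
    print(clf.predict([[1.5], [8.5]]))         # expected: [0 1]

    km = KMeans(n_clusters=2, n_init=10).fit(X)  # no labels -> unsupervised
    print(km.labels_)                            # clusters found from X alone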
Q3. How to find delta between two tables in SQL?
To find delta between two tables in SQL, use the EXCEPT or MINUS operator.
Use the EXCEPT operator in SQL to return rows from the first table that do not exist in the second table.
Use the MINUS operator in SQL to return distinct rows from the first table that do not exist in the second table.
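A minimal sketch in Spark SQL, with hypothetical table names table_a and table_b (MINUS would replace EXCEPT on Oracle):

    # Rows present in table_a but absent from table_b
    delta = spark.sql("""
        SELECT * FROM table_a
        EXCEPT
        SELECT * FROM table_b
    """)
    delta.show()

    # DataFrame equivalent: exceptAll also keeps duplicate rows
    # (df_a and df_b are hypothetical DataFrames)
    delta_df = df_a.exceptAll(df_b)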
Q4. What is ADLS? How can we pass a parameter from ADF to Databricks?
ADLS is Azure Data Lake Storage, a scalable and secure data lake solution in Azure.
ADLS is designed for big data analytics workloads
It supports Hadoop Distributed File System (HDFS) and Blob storage APIs
It provides enterprise-grade security and compliance features
To pass parameters from ADF to Databricks, set base parameters on the Databricks Notebook activity in ADF and read them in the notebook as widgets, as sketched below
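A hedged sketch of the notebook side; the parameter name run_date is hypothetical and must match the base parameter configured on the ADF activity:

    # Read a value that ADF passed as a base parameter to this notebook
    run_date = dbutils.widgets.get("run_date")   # hypothetical parameter name
    print(f"Processing data for {run_date}")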
Q5. How did you overcome out-of-memory issues?
I optimized code, increased memory allocation, used efficient data structures, and implemented data partitioning.
Optimized code by identifying and fixing memory leaks
Increased memory allocation for the application
Used efficient data structures like arrays, hashmaps, and trees
Implemented data partitioning to distribute data across multiple nodes
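As one concrete flavor of the partitioning idea, a hedged pandas sketch that streams a file in chunks instead of loading it whole (file and column names are hypothetical):

    import pandas as pd

    # Aggregate a large CSV in fixed-size chunks to bound memory use
    total = 0
    for chunk in pd.read_csv("big_file.csv", chunksize=100_000):
        total += chunk["amount"].sum()   # hypothetical column
    print(total)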
Q6. What are the optimizations you can do in Spark?
Optimizations in Spark include partitioning, caching, broadcast variables, and using appropriate data structures.
Partitioning data based on key can improve performance by reducing data shuffling
Caching frequently accessed data in memory can avoid recomputation
Using broadcast variables can reduce data transfer between nodes
Choosing appropriate data structures like DataFrames or Datasets can optimize query execution
Using column pruning and predicate pushdown can reduce the amount of data scanned, as shown in the sketch below
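A hedged PySpark sketch of caching, broadcast joins, and early pruning/filtering (paths and columns are hypothetical):

    from pyspark.sql.functions import broadcast

    orders = spark.read.parquet("/data/orders")          # hypothetical path
    orders.cache()                                       # reuse without recomputation

    small_dim = spark.read.parquet("/data/dim_country")  # small lookup table
    joined = orders.join(broadcast(small_dim), "country_code")  # avoids a shuffle

    # Column pruning + predicate pushdown: select and filter as early as possible
    result = joined.select("order_id", "amount").filter("amount > 100")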
Q7. In Hadoop, what happens if the NameNode fails?
If the NameNode fails in Hadoop and high availability is not configured, the cluster becomes unavailable.
The NameNode is responsible for managing the metadata of the Hadoop file system (HDFS).
If the NameNode fails, clients cannot locate or access any data.
To handle NameNode failures, Hadoop provides mechanisms like high availability and automatic failover.
In high-availability mode, an active and a standby NameNode run in the cluster, and if the active fails, the standby takes over.
Automatic failover promotes the standby automatically, keeping access to the file system uninterrupted without manual intervention.
Q8. Elaborate concepts of Object Oriented Programming in Python.
Object Oriented Programming in Python focuses on creating classes and objects to organize code and data.
Python supports classes, objects, inheritance, polymorphism, and encapsulation.
Classes are blueprints for creating objects, which are instances of classes.
Inheritance allows a class to inherit attributes and methods from another class.
Polymorphism enables objects to be treated as instances of their parent class.
Encapsulation restricts access to certain components of an object, conventionally signaled in Python with a leading underscore.
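A short sketch tying the four concepts together:

    class Animal:                          # a class is a blueprint for objects
        def __init__(self, name):
            self._name = name              # underscore: encapsulation by convention

        def speak(self):
            return f"{self._name} makes a sound"

    class Dog(Animal):                     # inheritance
        def speak(self):                   # overriding enables polymorphism
            return f"{self._name} barks"

    for pet in (Animal("Generic"), Dog("Rex")):  # treated uniformly via the parent type
        print(pet.speak())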
Q9. How do you connect to S3 from Databricks?
To connect to S3 from Databricks, configure AWS credentials and use the built-in S3A connector.
Provide the necessary AWS credentials (instance profile, secret scope, or access keys) and the S3 bucket details.
S3 data can then be accessed through the s3a:// file system scheme in Spark, as sketched below.
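A hedged Databricks sketch using a secret scope for the keys; the scope, key, and bucket names are hypothetical, and instance profiles avoid this setup entirely:

    access_key = dbutils.secrets.get(scope="aws", key="access_key")
    secret_key = dbutils.secrets.get(scope="aws", key="secret_key")

    # Hand the credentials to the S3A file system
    hconf = spark.sparkContext._jsc.hadoopConfiguration()
    hconf.set("fs.s3a.access.key", access_key)
    hconf.set("fs.s3a.secret.key", secret_key)

    df = spark.read.json("s3a://my-bucket/events/")   # hypothetical bucket/path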
Q10. Word count in Spark; flatMap vs map difference
Word count in Spark illustrates the difference between flatMap and map.
Spark is a distributed computing framework for big data processing
flatMap is used to split each input string into words
map is used to transform each word into a key-value pair for counting
The difference lies in cardinality: map produces exactly one output per input, while flatMap can produce zero or more outputs and flattens the result (see the sketch below)
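The classic word-count pipeline makes the flatMap/map split visible (assumes an existing SparkSession spark):

    lines = spark.sparkContext.parallelize(["to be or", "not to be"])

    words = lines.flatMap(lambda line: line.split())   # one line -> many words
    pairs = words.map(lambda w: (w, 1))                # one word -> one (word, 1)
    counts = pairs.reduceByKey(lambda a, b: a + b)

    print(counts.collect())   # e.g. [('to', 2), ('be', 2), ('or', 1), ('not', 1)]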
Q11. Explain Spark cluster mode vs client mode
Cluster mode runs the Spark driver on one of the worker nodes, while client mode runs the driver on the client machine.
In cluster mode, the driver runs on one of the worker nodes in the cluster, while in client mode, the driver runs on the machine where the Spark application is submitted.
Cluster mode is suitable for production environments where fault tolerance and scalability are important, while client mode is more commonly used for development and testing purposes.
In cluster mode, the driver keeps running even if the submitting machine disconnects, whereas in client mode the application fails if the client machine goes down.
Q12. What do you mean by CDC?
CDC stands for Change Data Capture, a process of identifying and capturing changes made to data in a database.
CDC is used to track changes in data over time, allowing for real-time data integration and analysis.
It captures inserts, updates, and deletes made to data, providing a historical record of changes.
CDC is commonly used in data warehousing, data replication, and data integration processes.
Examples of CDC tools include Oracle GoldenGate, Attunity (Qlik) Replicate, and Informatica.
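A simplistic high-watermark flavor of CDC as a hedged sketch; the table and column names are hypothetical, and log-based tools like GoldenGate capture changes from the database transaction log instead of querying:

    # Pull only rows changed since the last successful run
    last_run = "2024-01-01 00:00:00"   # normally persisted between runs
    changes = spark.sql(f"""
        SELECT * FROM source.orders
        WHERE updated_at > '{last_run}'
    """)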
Q13. Give an example of decorators in Python?
Decorators in Python are functions that modify the behavior of other functions.
Decorators are defined using the @decorator_name syntax before the function definition.
They can be used for logging, timing, authentication, etc.
Example: @staticmethod decorator in Python makes a method static.
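A small timing decorator as a concrete example:

    import time
    from functools import wraps

    def timed(func):                       # a decorator takes and returns a function
        @wraps(func)                       # keep the wrapped function's name/docstring
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            result = func(*args, **kwargs)
            print(f"{func.__name__} took {time.perf_counter() - start:.4f}s")
            return result
        return wrapper

    @timed                                 # same as: slow_add = timed(slow_add)
    def slow_add(a, b):
        time.sleep(0.1)
        return a + b

    print(slow_add(2, 3))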
Q14. Spark optimization used in our project
Spark optimization techniques used in project
Partitioning data to optimize parallel processing
Caching frequently accessed data to reduce computation time
Using broadcast variables for efficient data sharing across nodes
Optimizing shuffle operations to minimize data movement
Tuning memory and CPU settings for better performance
Q15. Difference between coalesce and repartition?
Coalesce reduces the number of partitions in a DataFrame, while repartition increases the number of partitions.
Coalesce is used to reduce the number of partitions in a DataFrame without shuffling data
Repartition is used to increase the number of partitions in a DataFrame and can involve shuffling data
Coalesce is more efficient for reducing partitions when no data movement is required
Repartition is typically used for evenly distributing data across a larger number of partitions (see the sketch below)
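A minimal sketch of the two calls:

    df = spark.range(1_000_000)

    wide = df.repartition(200)               # full shuffle; can grow partition count
    print(wide.rdd.getNumPartitions())       # 200

    narrow = wide.coalesce(10)               # merges partitions without a full shuffle
    print(narrow.rdd.getNumPartitions())     # 10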
Q16. What is XCom in Airflow?
XCom in Airflow is a way for tasks to exchange messages or small amounts of data.
XCom allows tasks to communicate with each other by passing small pieces of data
It can be used to share information between tasks in a DAG
XCom can be used to pass information like task status, results, or any other data
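A hedged Airflow 2.x TaskFlow sketch; return values travel between tasks as XComs under the hood:

    from airflow.decorators import dag, task
    import pendulum

    @dag(schedule=None, start_date=pendulum.datetime(2024, 1, 1), catchup=False)
    def xcom_demo():
        @task
        def extract():
            return 42                  # pushed to XCom automatically

        @task
        def load(value):
            print(f"got {value} via XCom")

        load(extract())                # wiring the calls implies the XCom pull

    xcom_demo()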
Q17. Different types of joins in SQL
Different types of joins in SQL include inner join, left join, right join, and full outer join.
Inner join: Returns rows when there is a match in both tables.
Left join: Returns all rows from the left table and the matched rows from the right table.
Right join: Returns all rows from the right table and the matched rows from the left table.
Full outer join: Returns all rows from both tables, with NULLs where there is no match.
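A sketch against hypothetical tables emp(dept_id, name) and dept(dept_id, dept_name), run here through Spark SQL:

    spark.sql("""
        SELECT e.name, d.dept_name
        FROM emp e
        LEFT JOIN dept d ON e.dept_id = d.dept_id
    """).show()
    # Swap LEFT JOIN for INNER JOIN, RIGHT JOIN, or FULL OUTER JOIN as needed.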
Q18. Different types of joins in Spark
Different types of joins in Spark include inner, left, right, full outer, left semi, and left anti joins.
Inner join: Returns only the rows that have matching values in both datasets.
Left join: Returns all rows from the left dataset and the matched rows from the right dataset.
Right join: Returns all rows from the right dataset and the matched rows from the left dataset.
Full (outer) join: Returns all rows from both datasets, filling unmatched sides with nulls.
Left semi and left anti joins: Return left-side rows that do or do not have a match on the right, respectively.
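The same contrasts in the DataFrame API, on tiny inline data:

    emp = spark.createDataFrame([(1, "Ann"), (2, "Bob")], ["dept_id", "name"])
    dept = spark.createDataFrame([(1, "Sales"), (3, "HR")], ["dept_id", "dept_name"])

    emp.join(dept, "dept_id", "inner").show()   # only dept_id 1 survives
    emp.join(dept, "dept_id", "left").show()    # Ann and Bob; HR dropped
    emp.join(dept, "dept_id", "full").show()    # all rows, nulls where unmatched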
Q19. explain the architecture of delta lake
Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark and big data workloads.
Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing.
It stores data in Parquet format and uses Apache Spark for processing.
Delta Lake ensures data reliability and data quality by providing schema enforcement and data versioning.
It supports time travel queries, allowing users to access previous versions of the data (sketched below).
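A hedged sketch of versioning and time travel, assuming the delta-spark package and a Delta-enabled session; df and the path are hypothetical:

    path = "/tmp/delta/events"

    df.write.format("delta").mode("overwrite").save(path)   # creates version 0
    df.write.format("delta").mode("append").save(path)      # creates version 1

    # Time travel: read the table exactly as it was at version 0
    old = spark.read.format("delta").option("versionAsOf", 0).load(path)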
Q20. Illustrate exception handling in Python.
Exception handling in Python allows for graceful handling of errors and preventing program crashes.
Use try-except blocks to catch and handle exceptions.
Multiple except blocks can be used to handle different types of exceptions.
Finally block can be used to execute code regardless of whether an exception was raised or not.
Custom exceptions can be defined by creating a new class that inherits from the built-in Exception class.
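A compact example covering try/except, multiple handlers, finally, and a custom exception:

    class InvalidAgeError(Exception):          # custom exception
        pass

    def set_age(age):
        if age < 0:
            raise InvalidAgeError(f"age cannot be negative: {age}")
        return age

    try:
        set_age(-1)
    except InvalidAgeError as e:               # specific handler first
        print(f"bad input: {e}")
    except Exception as e:                     # broader fallback
        print(f"unexpected: {e}")
    finally:
        print("runs whether or not an exception was raised")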
Q21. Partitioning and bucketing difference
Partitioning is dividing data into smaller chunks for better organization and performance, while bucketing is grouping data based on a specific criteria.
Partitioning is dividing data into smaller subsets based on a column or key.
Bucketing is grouping data based on a specific number of buckets or ranges.
Partitioning is commonly used in distributed systems for better data organization and query performance.
Bucketing is often used for handling data skew and optimizing join and query performance (see the sketch below)
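The two in Spark, as a hedged sketch with hypothetical columns and paths:

    # Partitioning: one directory per distinct value of the column
    df.write.partitionBy("country").parquet("/data/out")

    # Bucketing: a fixed number of hash buckets; requires saving as a table
    df.write.bucketBy(8, "user_id").sortBy("user_id").saveAsTable("events_bucketed")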
Q22. Optimizations used in present project
Various optimizations such as indexing, caching, and parallel processing were used in the project.
Implemented indexing on frequently queried columns to improve query performance
Utilized caching mechanisms to store frequently accessed data and reduce database load
Implemented parallel processing to speed up data processing tasks
Optimized algorithms and data structures for efficient data retrieval and manipulation
Q23. What is a list in Python?
A list in Python is a collection of items that are ordered and mutable.
Lists are created using square brackets []
Items in a list can be of different data types
Lists can be modified by adding, removing, or changing items
Example: my_list = [1, 'apple', True]
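Continuing that example:

    my_list = [1, "apple", True]      # mixed types are allowed
    my_list.append(3.14)              # add an item
    my_list.remove("apple")           # remove by value
    my_list[0] = 99                   # mutable: change in place
    print(my_list, my_list[-1])       # [99, True, 3.14] 3.14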
Q24. Tuning operations in Databricks
Tuning operations in Databricks involves optimizing performance and efficiency of data processing tasks.
Use cluster configuration settings to allocate resources efficiently
Optimize code by minimizing data shuffling and reducing unnecessary operations
Leverage Databricks Auto Optimize to automatically tune performance
Monitor job performance using the Spark UI and cluster metrics
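A few session-level knobs as a hedged sketch; the values are illustrative, not recommendations:

    spark.conf.set("spark.sql.shuffle.partitions", "200")       # shuffle width
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold",
                   str(50 * 1024 * 1024))                       # broadcast cutoff
    spark.conf.set("spark.sql.adaptive.enabled", "true")        # adaptive execution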
Q25. flatMap and map difference
flatMap flattens nested results into a single collection, while map transforms each element one-to-one.
flatMap applies a function that may return several values per element and flattens the combined output.
map applies a given function to each element and returns exactly one result per input.
Both are higher-order functions common in functional settings such as Spark, Scala, and JavaScript.
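The cardinality difference in two lines (beyond the word-count sketch under Q10):

    rdd = spark.sparkContext.parallelize(["a b", "c"])
    print(rdd.map(lambda s: s.split()).collect())      # [['a', 'b'], ['c']] (nested)
    print(rdd.flatMap(lambda s: s.split()).collect())  # ['a', 'b', 'c'] (flat)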
Q26. Write the binary sort program
The procedure described here is binary search over a sorted array (often what "binary sort" questions are after): repeatedly halve the array and compare against the target.
Divide the sorted array into two halves around the middle element.
Compare the middle element with the target value.
Repeat on the half where the target may be located until it is found or the range is empty.
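A minimal Python implementation of the search described above:

    def binary_search(arr, target):
        """Return the index of target in sorted arr, or -1 if absent."""
        lo, hi = 0, len(arr) - 1
        while lo <= hi:
            mid = (lo + hi) // 2
            if arr[mid] == target:
                return mid
            elif arr[mid] < target:
                lo = mid + 1            # target can only be in the right half
            else:
                hi = mid - 1            # target can only be in the left half
        return -1

    print(binary_search([1, 3, 5, 7, 9], 7))   # 3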
Q27. Spark optimization techniques
Spark optimization techniques involve partitioning, caching, and tuning configurations.
Partitioning data to distribute workload evenly
Caching frequently accessed data to avoid recomputation
Tuning configurations like memory allocation and parallelism
Using broadcast joins for small tables
Avoiding shuffling operations whenever possible
Q28. Spark optimization techniques
Optimization techniques in Spark improve performance and efficiency of data processing.
Partitioning data to distribute workload evenly
Caching frequently accessed data in memory
Using broadcast variables for small lookup tables
Avoiding shuffling operations whenever possible