Spark is a distributed computing framework designed for big data processing.
Spark is built around the concept of Resilient Distributed Datasets (RDDs) which allow for fault-tolerant parallel processing of data.
It provides high-level APIs in Java, Scala, Python, and R for ease of use.
Spark can run on top of Hadoop, Mesos, Kubernetes, or in standalone mode.
It includes modules for SQL, streaming, machine learning (MLlib), and graph processing (GraphX).
MapReduce is a programming model and processing technique for parallel and distributed computing.
MapReduce is used to process large datasets in parallel across a distributed cluster of computers.
It consists of two main functions - Map function for processing key/value pairs and Reduce function for aggregating the results.
Popularly used in big data processing frameworks like Hadoop for tasks such as sorting and searching large datasets (a minimal sketch of the model follows below).
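To make the model concrete, here is a minimal word-count sketch of the map and reduce phases in plain Python; the function names and sample lines are illustrative, not part of any Hadoop API:

```python
from collections import defaultdict

def map_fn(line):
    # Map: emit a (word, 1) pair for every word in the input line
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):
    # Reduce: aggregate all counts emitted for one key
    return word, sum(counts)

lines = ["big data big cluster", "data pipeline"]
grouped = defaultdict(list)
for line in lines:
    for word, one in map_fn(line):   # map phase
        grouped[word].append(one)    # shuffle: group values by key
result = [reduce_fn(word, counts) for word, counts in grouped.items()]
print(result)  # e.g. [('big', 2), ('data', 2), ('cluster', 1), ('pipeline', 1)]
```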
Lazy evaluation in Spark delays the execution of transformations until an action is called.
Lazy evaluation allows Spark to optimize the execution plan by combining multiple transformations into a single stage.
Transformations are not executed immediately, but are stored as a directed acyclic graph (DAG) of operations.
Actions trigger the execution of the DAG and produce results.
Example: map() and filter() are transformations; actions such as count() or collect() trigger the actual computation (see the sketch below).
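A minimal PySpark sketch of this behaviour, assuming an existing SparkSession named spark:

```python
rdd = spark.sparkContext.parallelize(range(10))
doubled = rdd.map(lambda x: x * 2)            # transformation: recorded in the DAG, not run
evens = doubled.filter(lambda x: x % 4 == 0)  # another transformation: still nothing executes
print(evens.count())                          # action: triggers execution of the whole DAG
```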
Imputer function in PySpark is used to replace missing values in a DataFrame.
Imputer is an estimator in the PySpark ML library; fitting it produces an ImputerModel that transforms the DataFrame.
It replaces missing values in a DataFrame with the mean, median, or mode of the column.
It operates on numeric columns only; categorical features are not supported by Imputer and must be handled separately (for example with fillna()).
Example: imputer = Imputer(inputCols=['col1', 'col2'], outputCols=['col1_imputed', 'col2_imputed'], strategy='mean')
Example: imputed_df = imputer.fit(df).transform(df)
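Putting the example together as a runnable sketch (the DataFrame and column names are hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import Imputer

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1.0, None), (None, 4.0), (3.0, 6.0)], ["col1", "col2"]
)
imputer = Imputer(
    inputCols=["col1", "col2"],
    outputCols=["col1_imputed", "col2_imputed"],
    strategy="mean",  # "median" and "mode" are also supported
)
imputed_df = imputer.fit(df).transform(df)  # fit computes the statistics, transform fills the nulls
imputed_df.show()
```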
To delete duplicate rows from a table, either select only the unique rows into a new table or delete the redundant copies in place.
The DISTINCT keyword selects the unique rows, but on its own it does not modify the table.
A GROUP BY over all columns likewise collapses duplicates to one row per group.
A DELETE statement with a subquery (for example, keeping only the minimum id per duplicate group) removes the duplicates in place.
Alternatively, create a new table from the unique rows and drop the old one, as sketched below.
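A minimal sketch of the "new table from unique rows" approach in Spark SQL, assuming an existing SparkSession spark and a hypothetical employees table; the commented DELETE shows the in-place pattern for engines that support it:

```python
spark.sql("""
    CREATE TABLE employees_dedup AS
    SELECT DISTINCT * FROM employees
""")

# In-place alternative on databases that support DELETE with a subquery,
# keeping the lowest id per (name, department) group:
# DELETE FROM employees
# WHERE id NOT IN (SELECT MIN(id) FROM employees GROUP BY name, department);
```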
Null values in PySpark are handled using functions such as dropna(), fillna(), and replace().
dropna() function is used to drop rows or columns with null values
fillna() function is used to fill null values with a specified value or method
replace() function is used to substitute specified values with new ones (nulls themselves are normally filled with fillna())
coalesce() function is used to replace null values with the first non-null value in a list of columns
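A short sketch of these functions, assuming an existing SparkSession spark and illustrative column names:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("a", None, 1.0), (None, "x", None)], ["c1", "c2", "c3"]
)

df.dropna(subset=["c1"]).show()                  # drop rows where c1 is null
df.fillna({"c2": "unknown", "c3": 0.0}).show()   # fill nulls column by column
df.na.replace("x", "y", subset=["c2"]).show()    # swap specific (non-null) values
df.withColumn("first_non_null", F.coalesce("c1", "c2")).show()  # first non-null across columns
```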
SQL query to retrieve the second highest salary from each department
Use the RANK() function to assign a rank to each salary within each department
Filter the results to only include rows with a rank of 2
No separate GROUP BY is needed: the PARTITION BY department in the window already computes the ranking per department, so the filter yields one row per department (see the sketch below)
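A sketch of the query, run here through spark.sql against a hypothetical employees table; DENSE_RANK() is often preferred over RANK() so that salary ties do not cause rank 2 to be skipped:

```python
spark.sql("""
    SELECT department, salary
    FROM (
        SELECT department, salary,
               DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS rnk
        FROM employees
    ) ranked
    WHERE rnk = 2
""").show()
```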
Window functions are used to perform calculations across a set of rows that are related to the current row.
Commonly used window functions include ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD, FIRST_VALUE, LAST_VALUE, and NTILE.
Window functions are used in conjunction with the OVER clause to define the window or set of rows to perform the calculation on.
Window functions can be used to calculate running totals, moving averages, and per-group rankings without collapsing rows the way GROUP BY does (a PySpark sketch follows below).
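A PySpark sketch over a hypothetical df with department, hire_date, and salary columns:

```python
from pyspark.sql import Window, functions as F

w = Window.partitionBy("department").orderBy("hire_date")
df = (
    df.withColumn("running_total", F.sum("salary").over(w))   # running total per department
      .withColumn("prev_salary", F.lag("salary", 1).over(w))  # previous row's salary (LAG)
      .withColumn("row_num", F.row_number().over(w))          # sequential numbering (ROW_NUMBER)
)
```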
Merge two unsorted lists into a sorted list using inbuilt sorting functions.
Use inbuilt sorting functions to sort the input lists
Merge the sorted lists using a merge algorithm
Return the merged and sorted list
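A minimal Python sketch of this approach; heapq.merge could replace the manual merge loop:

```python
def merge_sorted(a, b):
    a, b = sorted(a), sorted(b)  # built-in sort on each input list
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):  # standard two-pointer merge
        if a[i] <= b[j]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    out.extend(a[i:])  # append whatever remains of either list
    out.extend(b[j:])
    return out

print(merge_sorted([5, 1, 3], [4, 2]))  # [1, 2, 3, 4, 5]
```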
Skewness is a measure of asymmetry in a distribution. Skewed tables are tables with imbalanced data distribution.
Skewness is a statistical measure that describes the asymmetry of the data distribution around the mean.
Positive skewness indicates a longer tail on the right side of the distribution, while negative skewness indicates a longer tail on the left side.
Skewed tables in data engineering refer to tables with imbalanced data distribution across keys or partitions, which can cause uneven load and straggler tasks in distributed jobs (see the checks sketched below).
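Two quick PySpark checks for this, assuming a hypothetical df with a numeric amount column and a join key:

```python
from pyspark.sql import functions as F

df.select(F.skewness("amount")).show()  # > 0 suggests a long right tail, < 0 a long left tail
df.groupBy("join_key").count().orderBy(F.desc("count")).show(5)  # hot keys indicate a skewed table
```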
I applied via Walk-in and was interviewed in Apr 2024. There were 3 interview rounds.
I applied via Naukri.com and was interviewed in Mar 2024. There was 1 interview round.
I applied via Job Portal and was interviewed in Jul 2024. There was 1 interview round.
English, aptitude test, reasoning
I applied via Naukri.com and was interviewed in Apr 2022. There were 4 interview rounds.
UDF stands for User-Defined Function in Spark. It allows users to define their own functions to process data.
UDFs can be written in different programming languages like Python, Scala, and Java.
UDFs can be used to perform complex operations on data that are not available in built-in functions.
PySpark code to check the validity of a mobile_number column can be written using regular expressions and the `regexp_extract` function, or as a UDF; both are sketched below.
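A sketch of both routes, assuming a 10-digit Indian mobile format starting with 6-9 and a hypothetical df with a mobile_number column:

```python
import re
from pyspark.sql import functions as F
from pyspark.sql.types import BooleanType

# Built-in route: regexp_extract returns "" when the pattern does not match
df = df.withColumn(
    "is_valid_mobile",
    F.regexp_extract("mobile_number", r"^[6-9]\d{9}$", 0) != ""
)

# UDF route: the same check expressed as a user-defined function
@F.udf(returnType=BooleanType())
def valid_mobile(s):
    return s is not None and re.fullmatch(r"[6-9]\d{9}", s) is not None

df = df.withColumn("is_valid_udf", valid_mobile("mobile_number"))
```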
I applied via Recruitment Consultant and was interviewed before Feb 2023. There were 3 interview rounds.
This round was based on behavioural and cognitive questions.
I applied via Campus Placement
I appeared for an interview in Oct 2016.
I cannot provide investment advice, but here are five companies that have shown strong financial performance in recent years.
Apple - consistently high revenue and profit margins
Amazon - dominant player in e-commerce and cloud computing
Microsoft - strong growth in cloud computing and enterprise software
Alphabet (Google) - diversified revenue streams and strong advertising business
Visa - dominant player in the payments industry
The Brexit vote could have both positive and negative effects on the Indian economy.
Positive effects: Increased trade opportunities with the UK, potential for attracting foreign investments from companies relocating from the UK.
Negative effects: Uncertainty in global markets leading to volatility in exchange rates, potential decline in exports to the UK.
Example: Indian IT companies may face challenges due to stricter immigration and visa rules in the UK.
Designation | Salaries reported | Salary range
Assistant Manager | 2.8k | ₹6 L/yr - ₹13.5 L/yr
Manager | 2.2k | ₹13.9 L/yr - ₹24 L/yr
Senior Software Engineer | 1.7k | ₹13 L/yr - ₹23.7 L/yr
Assistant Vice President | 1.7k | ₹25 L/yr - ₹42.5 L/yr
Software Engineer | 1.5k | ₹7.8 L/yr - ₹14 L/yr