I applied via LinkedIn and was interviewed in Apr 2024. There was 1 interview round.
I applied via Naukri.com and was interviewed in Dec 2024. There was 1 interview round.
I applied via Recruitment Consultant and was interviewed in Nov 2024. There were 2 interview rounds.
Different types of joins available in Databricks include inner join, outer join, left join, right join, and cross join.
Inner join: Returns only the rows that have matching values in both tables.
Outer join (full outer join): Returns all rows from both tables, with nulls filled in where the other table has no match.
Left join: Returns all rows from the left table and the matched rows from the right table.
Right join: Returns all rows from the right table and the matched rows ...
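The join semantics above can be illustrated with a minimal pure-Python sketch. This is not Databricks or Spark code: the `join` helper and the sample `employees` and `departments` data are hypothetical, and `None` stands in for SQL NULL.

```python
def join(left, right, how="inner"):
    """Join two lists of (key, value) pairs; None marks a missing side."""
    left_keys = {k for k, _ in left}
    rows = []
    for lk, lv in left:
        matches = [rv for rk, rv in right if rk == lk]
        if matches:
            rows.extend((lk, lv, rv) for rv in matches)
        elif how in ("left", "full"):
            rows.append((lk, lv, None))  # keep unmatched left rows
    if how in ("right", "full"):
        # keep right rows whose key never appeared on the left
        rows.extend((rk, None, rv) for rk, rv in right if rk not in left_keys)
    return rows

employees = [(1, "Asha"), (2, "Ben"), (3, "Chen")]
departments = [(2, "Sales"), (3, "Data"), (4, "Ops")]

inner = join(employees, departments, "inner")
left = join(employees, departments, "left")
right = join(employees, departments, "right")
full = join(employees, departments, "full")
# A cross join pairs every left row with every right row, ignoring keys.
cross = [(lv, rv) for _, lv in employees for _, rv in departments]
```

Here the inner join keeps only keys 2 and 3, the left join adds key 1 with a missing department, the right join adds key 4 with a missing employee, and the full join keeps all four keys.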
Implementing fault tolerance in a data pipeline involves redundancy, monitoring, and error handling.
Use redundant components to ensure continuous data flow
Implement monitoring tools to detect failures and bottlenecks
Set up automated alerts for immediate response to issues
Design error handling mechanisms to gracefully handle failures
Use checkpoints and retries to ensure data integrity
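The retry and checkpoint ideas above could be sketched roughly as follows. This is a simplified illustration rather than a production design: `run_with_retries`, `process_batches`, and `flaky_handler` are hypothetical names, and the checkpoint is an in-memory set where a real pipeline would persist it to durable storage.

```python
import time

def run_with_retries(task, max_attempts=3, base_delay=0.01):
    """Call task(); on failure wait with exponential backoff and retry."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_attempts:
                raise  # give up so monitoring and alerting can see the failure
            time.sleep(base_delay * 2 ** (attempt - 1))

def process_batches(batches, handler, checkpoint):
    """Process batches in order, skipping ids already recorded in checkpoint."""
    for batch_id, records in batches:
        if batch_id in checkpoint:
            continue  # already processed in an earlier run
        run_with_retries(lambda: handler(records))
        checkpoint.add(batch_id)  # record progress only after success

attempts = {"count": 0}

def flaky_handler(records):
    """Hypothetical handler that fails once to simulate a transient error."""
    attempts["count"] += 1
    if attempts["count"] == 1:
        raise RuntimeError("transient failure")
    return len(records)

done = {0}  # batch 0 was completed by a previous run
process_batches([(0, [1]), (1, [2, 3]), (2, [4])], flaky_handler, done)
```

Batch 0 is skipped because of the checkpoint, batch 1 succeeds on its second attempt, and batch 2 succeeds on its first.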
Auto Loader is a Databricks feature that incrementally ingests new files as they arrive in cloud storage.
It is used through the cloudFiles Structured Streaming source
Tracks which files have already been processed in a checkpoint, so each file is ingested only once
Supports schema inference and schema evolution
Can run continuously or be triggered on a schedule
Works with cloud object storage such as ADLS, Amazon S3, and Google Cloud Storage
To connect to different services in Azure, you can use Azure SDKs, REST APIs, Azure Portal, Azure CLI, and Azure PowerShell.
Use Azure SDKs for programming languages like Python, Java, C#, etc.
Utilize REST APIs to interact with Azure services programmatically.
Access and manage services through the Azure Portal.
Leverage Azure CLI for command-line interface interactions.
Automate tasks using Azure PowerShell scripts.
Linked Services are connections to external data sources or destinations in Azure Data Factory.
Linked Services define the connection information needed to connect to external data sources or destinations.
They can be used in Data Factory pipelines to read from or write to external systems.
Examples of Linked Services include Azure Blob Storage, Azure SQL Database, and Amazon S3.
ADF questions refer to Azure Data Factory questions which are related to data integration and data transformation processes.
ADF questions are related to Azure Data Factory, a cloud-based data integration service.
These questions may involve data pipelines, data flows, activities, triggers, and data movement.
Candidates may be asked about their experience with designing, monitoring, and managing data pipelines in ADF.
Exam...
The aptitude test covered quantitative aptitude, logical reasoning, and reading comprehension.
I have strong skills in data processing, ETL, data modeling, and programming languages like Python and SQL.
Proficient in data processing and ETL techniques
Strong knowledge of data modeling and database design
Experience with programming languages like Python and SQL
Familiarity with big data technologies such as Hadoop and Spark
Yes, I am open to relocating for the right opportunity.
I am willing to relocate for the right job opportunity.
I have experience moving for previous roles.
I am flexible and adaptable to new locations.
I am excited about the possibility of exploring a new city or country.
I can join within two weeks of receiving an offer.
I can start within two weeks of receiving an offer.
I need to give notice at my current job before starting.
I have some personal commitments that I need to wrap up before joining.
I applied via Job Portal and was interviewed in Aug 2024. There were 3 interview rounds.
It's a mandatory test, even for experienced candidates.
I applied via Recruitment Consultant and was interviewed in Jun 2024. There was 1 interview round.
I would rate myself 4 in PySpark, 5 in Python, and 4 in SQL.
Strong proficiency in the Python programming language
Experience working with PySpark for big data processing
Proficient in writing complex SQL queries for data manipulation
Familiarity with optimizing queries for performance
Hands-on experience in data engineering projects
Use Python's built-in data structures like sets or dictionaries to handle duplicates.
Use a set to remove duplicates from a list: unique_list = list(set(original_list))
Use a dictionary to remove duplicates from a list while preserving order: unique_list = list(dict.fromkeys(original_list))
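A short sketch of both approaches, plus one way to handle items that cannot go into a set. The `dedupe_by` helper and the sample `records` with an `id` field are hypothetical, added only for illustration.

```python
original_list = [3, 1, 3, 2, 1]

unordered_unique = list(set(original_list))          # order is not guaranteed
ordered_unique = list(dict.fromkeys(original_list))  # keeps first occurrences

def dedupe_by(items, key):
    """Keep the first item seen for each key; works for unhashable items."""
    seen = set()
    result = []
    for item in items:
        k = key(item)
        if k not in seen:
            seen.add(k)
            result.append(item)
    return result

# Dicts are unhashable, so deduplicate on a chosen field instead.
records = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 1, "v": "c"}]
unique_records = dedupe_by(records, key=lambda r: r["id"])
```

The set version is the shortest but loses the original order, so `dict.fromkeys` is usually the safer choice when order matters.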
To migrate Hive metastore tables to Unity Catalog, use Databricks' upgrade tooling, such as the SYNC command, the Catalog Explorer upgrade wizard, or the UCX migration toolkit from Databricks Labs.
Create the target catalog and schema in Unity Catalog, along with the storage credentials and external locations the tables need.
Upgrade external tables with SYNC SCHEMA or SYNC TABLE, which registers them in Unity Catalog without copying the data.
Copy managed tables into Unity Catalog with CREATE TABLE AS SELECT or DEEP CLONE.
Validate the migration by chec...
To read a CSV file from an ADLS path, you can use libraries like pandas or pyspark.
Use pandas library in Python to read a CSV file from ADLS path
Use pyspark library in Python to read a CSV file from ADLS path
Ensure you have the necessary permissions to access the ADLS path
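As a rough sketch of the PySpark route, the snippet below builds an ADLS Gen2 path in the `abfss://` scheme and wraps the read call. The helper names, the container `raw`, the account `mystorageacct`, and the file path are all hypothetical. It assumes an existing SparkSession is passed in and that authentication to the storage account is already configured on the cluster.

```python
def adls_uri(container, account, path):
    """Build an abfss:// URI for a file in ADLS Gen2."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"

def read_csv_from_adls(spark, uri):
    """Read a CSV with a header row using an existing SparkSession."""
    return spark.read.option("header", "true").csv(uri)

# Hypothetical container, storage account, and file path.
uri = adls_uri("raw", "mystorageacct", "/sales/2024/orders.csv")
```

In a Databricks notebook, where `spark` is predefined, the call would be `df = read_csv_from_adls(spark, uri)`.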
The number of stages Spark creates depends on how many shuffle boundaries the code contains.
Spark splits a job into stages at each wide transformation (for example groupBy, join, or repartition), because these require a shuffle.
Consecutive narrow transformations such as map and filter are pipelined into a single stage.
Count the shuffle boundaries in the code, or check the DAG in the Spark UI, to determine the total number.
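For Spark specifically, a common rule of thumb is that a job has one stage plus one more for every shuffle. The sketch below applies that rule to a list of operation names. It is a deliberate simplification: `estimate_stages` and the `WIDE` set are hypothetical, the lineage is assumed to be linear within a single job, and effects such as broadcast joins or skipped stages are ignored.

```python
# Operations assumed to cause a shuffle in this simplified model.
WIDE = {"groupByKey", "reduceByKey", "join", "distinct", "repartition"}

def estimate_stages(transformations):
    """One initial stage, plus one new stage per shuffle-causing operation."""
    return 1 + sum(1 for op in transformations if op in WIDE)

stage_count = estimate_stages(["map", "filter", "groupByKey", "map", "repartition"])
```

For this lineage the estimate is 3: `map` and `filter` share the first stage, `groupByKey` starts a second, and `repartition` starts a third. The Spark UI remains the reliable way to confirm the real count.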
In a narrow transformation each output partition depends on a single input partition, while in a wide transformation an output partition depends on data from multiple input partitions.
Narrow transformations need no data movement between partitions, so they can be pipelined and run in parallel within one stage.
Wide transformations require a shuffle across the cluster, which adds network and disk I/O and starts a new stage.
Examples of narrow transformations include map and filter operations, while examples of wide transfor
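The difference can be simulated in plain Python, with lists standing in for partitions. This is an illustration of the data movement only, not Spark code: `shuffle_by_key`, `sum_by_key`, and the toy partitioner based on `ord` are hypothetical.

```python
partitions = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]

# Narrow: each output partition is computed from exactly one input partition.
doubled = [[(k, v * 2) for k, v in part] for part in partitions]

# Wide: rows with the same key must be brought together, so rows move
# between partitions (a shuffle) before they can be combined.
def shuffle_by_key(parts, num_partitions):
    out = [[] for _ in range(num_partitions)]
    for part in parts:
        for k, v in part:
            out[ord(k) % num_partitions].append((k, v))  # toy partitioner
    return out

def sum_by_key(parts, num_partitions=2):
    result = []
    for part in shuffle_by_key(parts, num_partitions):
        totals = {}
        for k, v in part:
            totals[k] = totals.get(k, 0) + v
        result.append(totals)
    return result

totals = sum_by_key(doubled)
```

The doubling step never looks outside its own partition, whereas the two `"a"` rows start in different partitions and have to be moved to the same one before they can be summed.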
Transformations and actions are the two kinds of operations in Spark: transformations describe a new dataset lazily, and actions trigger the actual computation.
Actions trigger execution of the accumulated transformations and either return a result to the driver or write output to storage.
Transformations take an input dataset and produce a new one, often by filtering, aggregating, or joining data; they are evaluated lazily and run only when an action is called.
Examples of actions include 'saveAsTextFile'...
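Python generators offer a loose analogy for how transformations are deferred until an action runs. This is only an analogy, not Spark code: the `transform` function and the `log` list are hypothetical and exist to show when work actually happens.

```python
log = []

def transform(rows):
    """Lazy step: nothing here runs until the result is consumed."""
    for row in rows:
        log.append(row)  # records when a row is actually processed
        yield row * 10

# Building the pipeline does no work yet, like chaining transformations.
pipeline = transform(x for x in range(5) if x % 2 == 0)
ran_before_action = len(log)

# Consuming the pipeline plays the role of an action and forces evaluation.
result = list(pipeline)
```

Before `list(pipeline)` is called the log is empty. Afterwards every even number has been processed and multiplied by 10.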
Enforcing the schema ensures data consistency and validation, while manually defining the schema in code allows for more flexibility and customization.
Enforcing the schema ensures that all data conforms to a predefined structure and format, preventing errors and inconsistencies.
Manually defining the schema in code allows for more flexibility in handling different data types and structures.
Enforcing the schema can be do...
Optimizations like partitioning, caching, and using efficient file formats can reduce overhead in reading large datasets in Spark.
Partitioning data based on key can reduce the amount of data shuffled during joins and aggregations
Caching frequently accessed datasets in memory can avoid recomputation
Using efficient file formats like Parquet or ORC can reduce disk I/O and improve read performance
SQL query to find the name of the person who logged in most recently in each country from the Person table
Use a subquery to find the max login time for each country
Join the Person table with the subquery on country and login time to get the name of the person
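The subquery-and-join approach could look like the sketch below, run here against an in-memory SQLite database so that it is self-contained. The column names `name`, `country`, and `login_time` and the sample rows are assumptions, since the original table definition is not given.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Person (name TEXT, country TEXT, login_time TEXT);
    INSERT INTO Person VALUES
        ('Asha', 'India', '2024-06-01 09:00:00'),
        ('Ravi', 'India', '2024-06-02 18:30:00'),
        ('Emma', 'UK',    '2024-06-01 12:00:00'),
        ('Liam', 'UK',    '2024-05-30 08:15:00');
""")

query = """
    SELECT p.country, p.name
    FROM Person AS p
    JOIN (
        SELECT country, MAX(login_time) AS last_login
        FROM Person
        GROUP BY country
    ) AS latest
      ON p.country = latest.country AND p.login_time = latest.last_login
    ORDER BY p.country
"""
last_logins = conn.execute(query).fetchall()
```

If two people in a country share the same latest login time, this query returns both. A window function such as ROW_NUMBER is an alternative when exactly one row per country is required.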
List is mutable, Tuple is immutable in Python.
List can be modified after creation, Tuple cannot be modified.
List is defined using square brackets [], Tuple is defined using parentheses ().
Example: list_example = [1, 2, 3], tuple_example = (4, 5, 6)
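A brief sketch of what mutability means in practice, reusing the names from the example above. The `coordinates` dictionary is a hypothetical illustration.

```python
list_example = [1, 2, 3]
tuple_example = (4, 5, 6)

list_example.append(4)   # lists can grow in place
list_example[0] = 10     # and items can be reassigned

try:
    tuple_example[0] = 40  # tuples reject item assignment
    tuple_error = None
except TypeError as exc:
    tuple_error = type(exc).__name__

# A tuple of hashable items is itself hashable, so it can be a dict key.
coordinates = {(12.97, 77.59): "Bengaluru"}
```

The list ends up modified, while the attempted tuple assignment raises a `TypeError` and leaves the tuple unchanged.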
Rank gives tied rows the same rank and leaves gaps after ties, Dense Rank gives tied rows the same rank with no gaps, and Row Number assigns a unique sequential number to each row.
Rank assigns the same rank to rows with the same value, leaving gaps in the ranking if there are ties.
Dense Rank assigns the same rank to rows with the same value, leaving no gaps in the ranking.
Row Number assigns a unique number to each row, without any regard for the values in the...
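The three functions can be compared side by side on a small table with a tie. The sketch uses an in-memory SQLite database and assumes a SQLite build of 3.25 or later, which is where window functions were added. The `scores` table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE scores (name TEXT, score INTEGER);
    INSERT INTO scores VALUES ('A', 90), ('B', 90), ('C', 80), ('D', 70);
""")

rows = conn.execute("""
    SELECT name,
           RANK()       OVER (ORDER BY score DESC)       AS rnk,
           DENSE_RANK() OVER (ORDER BY score DESC)       AS dense_rnk,
           ROW_NUMBER() OVER (ORDER BY score DESC, name) AS row_num
    FROM scores
    ORDER BY score DESC, name
""").fetchall()
```

A and B tie on 90, so both get rank 1. C then gets rank 3 but dense rank 2, while the row numbers run 1 to 4 regardless of ties.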
List comprehension is a concise way to create lists in Python by applying an expression to each item in an iterable.
Syntax: [expression for item in iterable]
Can include conditions: [expression for item in iterable if condition]
Example: squares = [x**2 for x in range(10)]
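A few variations on the syntax above, shown as a small sketch. The variable names `even_squares` and `pairs` are illustrative only.

```python
squares = [x**2 for x in range(10)]

# With a condition: keep only the squares of even numbers.
even_squares = [x**2 for x in range(10) if x % 2 == 0]

# With two for clauses: the rightmost loop varies fastest.
pairs = [(x, y) for x in range(2) for y in "ab"]
```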
Interactive clusters allow for real-time interaction and exploration, while job clusters are used for running batch jobs.
Interactive clusters are used for real-time data exploration and analysis.
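Job clusters are created by the job scheduler for a specific run and terminate when the job finishes.
Interactive (all-purpose) clusters are created manually, can be shared by several users, and keep running until they are terminated or auto-terminate after inactivity.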
Job clusters are usually larger and more powerful to handle heavy w...
To add a column in a dataframe, use the 'withColumn' method. To rename a column, use the 'withColumnRenamed' method.
To add a column, use the 'withColumn' method with the new column name and the expression to compute the values for that column.
Example: df.withColumn('new_column', df['existing_column'] * 2)
To rename a column, use the 'withColumnRenamed' method with the current column name and the new column name.
Example:...
Coalesce reduces the number of partitions without a full shuffle, while Repartition can increase or decrease the number of partitions and always performs a full shuffle.
Coalesce reduces the number of partitions in a DataFrame by merging existing partitions, which avoids a full shuffle.
Repartition increases or decreases the number of partitions in a DataFrame by shuffling the data across partitions.
Coalesce is more efficient than Repartit...