I applied via LinkedIn and was interviewed in Feb 2024. There was 1 interview round.
Optimizing PySpark jobs involves tuning configurations, partitioning data, caching, and using efficient transformations.
Tune configurations such as executor memory, number of executors, and parallelism to optimize performance.
Partition data properly to distribute the workload evenly and minimize shuffling.
Cache intermediate results to avoid recomputation.
Use efficient transformations like map, filter, and reduceByKey instead ...
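The tuning, partitioning, and caching points above can be sketched roughly as follows. This is a minimal sketch, not a recommended configuration: the values in TUNING are illustrative assumptions, and apply_tuning and prepare are hypothetical helper names. The functions take the SparkSession builder and the DataFrame as arguments, so the block itself does not import PySpark.

```python
# Illustrative values only (assumptions); suitable settings depend on
# cluster size and data volume.
TUNING = {
    "spark.executor.memory": "8g",
    "spark.executor.instances": "4",
    "spark.executor.cores": "4",
    "spark.sql.shuffle.partitions": "200",
}


def apply_tuning(builder, settings=TUNING):
    """Apply each setting to a SparkSession builder and return the builder."""
    for key, value in settings.items():
        builder = builder.config(key, value)
    return builder


def prepare(df, key_column, num_partitions):
    """Repartition by the key used downstream, then cache the result for reuse."""
    return df.repartition(num_partitions, key_column).cache()
```

In a real job the builder would typically be SparkSession.builder, for example apply_tuning(SparkSession.builder.appName("job")).getOrCreate(), and prepare would be called on a DataFrame that several later steps read.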
Stored-procedure-style logic in Databricks can be written in notebooks using SQL or Python.
Use the %sql magic command to write the logic as SQL in a notebook cell.
Use the %python magic command to write the logic as Python in a notebook cell.
The notebook can then be saved and executed in Databricks.
I applied via Recruitment Consultant and was interviewed in Aug 2024. There were 3 interview rounds.
The output of an inner join of table 1 and table 2 will be 2, 3, 5.
An inner join only includes rows that have matching values in both tables.
Values 2, 3, and 5 are present in both tables, so they will be included in the output.
NULL values never match each other in an inner join, because NULL = NULL does not evaluate to true.
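The join behaviour can be checked with a small in-memory SQLite database. The table contents below are hypothetical, since the original tables are not shown; they are chosen so that 2, 3, and 5 appear in both tables and each table also holds a NULL.

```python
import sqlite3

# Hypothetical sample data: the tables from the interview question are not shown.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (id INTEGER)")
conn.execute("CREATE TABLE table2 (id INTEGER)")
conn.executemany("INSERT INTO table1 VALUES (?)", [(1,), (2,), (3,), (5,), (None,)])
conn.executemany("INSERT INTO table2 VALUES (?)", [(2,), (3,), (5,), (6,), (None,)])

rows = conn.execute(
    "SELECT t1.id FROM table1 AS t1 "
    "INNER JOIN table2 AS t2 ON t1.id = t2.id "
    "ORDER BY t1.id"
).fetchall()

# The NULL rows drop out because NULL = NULL is not true.
inner_join_ids = [row[0] for row in rows]
print(inner_join_ids)  # [2, 3, 5]
```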
The project involves building a data pipeline to ingest, process, and analyze large volumes of data from various sources in Azure.
Utilizing Azure Data Factory for data ingestion and orchestration
Implementing Azure Databricks for data processing and transformation
Storing processed data in Azure Data Lake Storage
Using Azure Synapse Analytics for data warehousing and analytics
Leveraging Azure DevOps for CI/CD pipeline aut
Designing an effective ADF pipeline involves considering various metrics and factors.
Understand the data sources and destinations
Identify the dependencies between activities
Optimize data movement and processing for performance
Monitor and track pipeline execution for troubleshooting
Consider security and compliance requirements
Use parameterization and dynamic content for flexibility
Implement error handling and retries fo
I applied via Naukri.com and was interviewed in Oct 2024. There was 1 interview round.
Activities in Azure Data Factory (ADF) are the building blocks of a pipeline and perform various tasks like data movement, data transformation, and data orchestration.
Activities can be used to copy data from one location to another (Copy Activity)
Activities can be used to transform data using mapping data flows (Data Flow Activity)
Activities can be used to run custom code or scripts (Custom Activity)
Activities can be u...
DataFrames in PySpark are distributed collections of data organized into named columns.
DataFrames are similar to tables in a relational database, with rows and columns.
They can be created from various data sources like CSV, JSON, Parquet, etc.
DataFrames support SQL queries and transformations using PySpark functions.
Example: df = spark.read.csv('file.csv', header=True, inferSchema=True)
I applied via Naukri.com
Use DISTINCT keyword in SQL to remove duplicates from a dataset.
Use SELECT DISTINCT column_name FROM table_name to retrieve unique values from a specific column.
Use SELECT DISTINCT * FROM table_name to retrieve unique rows from the entire table.
Use a GROUP BY clause with COUNT() to find duplicates based on specific columns; groups with a count above 1 contain duplicated rows.
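A short sketch of these queries against an in-memory SQLite database. The customers table and its rows are hypothetical, made up only to show one exact duplicate row.

```python
import sqlite3

# Hypothetical table with one exact duplicate row ("Asha", "Pune").
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("Asha", "Pune"), ("Asha", "Pune"), ("Ravi", "Delhi"), ("Asha", "Mumbai")],
)

# Unique values of a single column.
unique_names = conn.execute(
    "SELECT DISTINCT name FROM customers ORDER BY name"
).fetchall()

# Unique rows across all columns.
unique_rows = conn.execute(
    "SELECT DISTINCT * FROM customers ORDER BY name, city"
).fetchall()

# GROUP BY with COUNT(*) reports which rows occur more than once.
duplicates = conn.execute(
    "SELECT name, city, COUNT(*) FROM customers "
    "GROUP BY name, city HAVING COUNT(*) > 1"
).fetchall()
```

Note that GROUP BY with COUNT() only identifies the duplicated groups; removing them still requires selecting one row per group or deleting the extras.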
I applied via Recruitment Consultant and was interviewed in Mar 2024. There was 1 interview round.
I connect an on-premises network to Azure using Azure ExpressRoute or a VPN Gateway.
Use Azure ExpressRoute for private connection through a dedicated connection.
Set up a VPN Gateway for secure connection over the internet.
Ensure proper network configurations and security settings.
Use an Azure virtual network gateway to establish the connection.
Consider using a Site-to-Site VPN to connect the on-premises network to an Azure Virtual Network.
Auto Loader in Databricks is a feature that incrementally loads new data files as they arrive in a specified cloud storage directory.
Auto Loader monitors the directory for new data files and loads them into a Databricks table.
It supports file formats such as CSV, JSON, Parquet, Avro, and ORC.
Auto Loader simplifies ingesting streaming file data into Databricks without manual intervention.
It can be ...
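On Databricks, this kind of ingestion is read through the cloudFiles streaming source. The following is a minimal sketch under stated assumptions: build_autoloader_stream is a hypothetical helper name, the paths are placeholders, and the function expects an existing SparkSession from a Databricks runtime to be passed in as spark.

```python
def build_autoloader_stream(spark, source_path, schema_path, file_format="json"):
    """Return a streaming DataFrame that picks up new files under source_path.

    spark is assumed to be a SparkSession on a Databricks runtime, where the
    "cloudFiles" source is available.
    """
    return (
        spark.readStream.format("cloudFiles")
        # Format of the incoming files (csv, json, parquet, avro, orc, ...).
        .option("cloudFiles.format", file_format)
        # Location where the inferred schema is tracked.
        .option("cloudFiles.schemaLocation", schema_path)
        .load(source_path)
    )
```

The returned DataFrame would usually be written out with writeStream and a checkpoint location, so that files already processed are not loaded again.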
JSON data normalization involves structuring data to eliminate redundancy and improve efficiency.
Identify repeating groups of data
Create separate tables for each group
Establish relationships between tables using foreign keys
Eliminate redundant data by referencing shared values
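The steps above can be sketched in plain Python. The order documents and field names are hypothetical sample data; the nested customer object and the repeating items array are split into separate tables linked by customer_id and order_id.

```python
import json

# Hypothetical nested input: the customer is repeated in every order,
# and each order carries a repeating group of items.
raw = json.loads("""
[
  {"order_id": 1, "customer": {"id": 10, "name": "Asha"},
   "items": [{"sku": "A", "qty": 2}, {"sku": "B", "qty": 1}]},
  {"order_id": 2, "customer": {"id": 10, "name": "Asha"},
   "items": [{"sku": "A", "qty": 5}]}
]
""")


def normalize(orders):
    """Split nested orders into customers, orders, and items tables."""
    customers = {}  # keyed by customer id, so each customer is stored once
    order_rows = []
    item_rows = []
    for order in orders:
        cust = order["customer"]
        customers[cust["id"]] = {"customer_id": cust["id"], "name": cust["name"]}
        # The order references the customer by key instead of embedding it.
        order_rows.append({"order_id": order["order_id"], "customer_id": cust["id"]})
        for item in order["items"]:
            item_rows.append(
                {"order_id": order["order_id"], "sku": item["sku"], "qty": item["qty"]}
            )
    return list(customers.values()), order_rows, item_rows


customers, orders, items = normalize(raw)
```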
I applied via Company Website and was interviewed in Apr 2024. There was 1 interview round.
Challenges in production deployment include scalability, data consistency, and monitoring.
Ensuring scalability to handle increasing data volumes and user loads
Maintaining data consistency across different databases and systems
Implementing effective monitoring and alerting to quickly identify and resolve issues
Developed a data pipeline to ingest, process, and analyze customer feedback data for a retail company.
Used Azure Data Factory to orchestrate data flow
Implemented Azure Databricks for data processing and analysis
Utilized Azure Synapse Analytics for data warehousing
Generated visualizations using Power BI for insights
Implemented machine learning models for sentiment analysis
I applied via Referral and was interviewed in May 2024. There was 1 interview round.
PolyBase is a feature of Azure SQL Data Warehouse (now the dedicated SQL pool in Azure Synapse Analytics) that lets users query data stored in Hadoop or Azure Blob Storage.
PolyBase lets users query external data sources with T-SQL without first moving the data into the database.
It provides a virtualization layer, so SQL queries can combine database tables with data stored in Hadoop or Azure Blob Storage.
Polybase can significantly improve query performance by levera...
I applied via Naukri.com and was interviewed in May 2024. There were 2 interview rounds.
The project architecture includes Spark transformations for processing large volumes of data.
Spark transformations are used to manipulate data in distributed computing environments.
Examples of Spark transformations include map, filter, reduceByKey, join, etc.
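To show what these transformations compute, here is a plain-Python imitation on a small list. It is not Spark code: reduce_by_key is a hypothetical helper that mimics the per-key aggregation of reduceByKey, and the sales pairs are made-up data. In PySpark the same chain would be written on an RDD as filter(...), map(...), and reduceByKey(...).

```python
def reduce_by_key(pairs, func):
    """Combine values that share a key, as reduceByKey does for (key, value) pairs."""
    result = {}
    for key, value in pairs:
        result[key] = func(result[key], value) if key in result else value
    return result


# Hypothetical (city, amount) records.
sales = [("Pune", 120), ("Delhi", 80), ("Pune", -5), ("Delhi", 40)]

valid = filter(lambda kv: kv[1] > 0, sales)          # filter: drop non-positive amounts
doubled = map(lambda kv: (kv[0], kv[1] * 2), valid)  # map: transform each record
totals = reduce_by_key(doubled, lambda a, b: a + b)  # reduceByKey: aggregate per key
```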
Use a window function such as ROW_NUMBER() to find the highest sale in each city in SQL.
Use the PARTITION BY clause in ROW_NUMBER() to partition the data by city.
Order the data by sales in descending order.
Filter the results to only include rows with row number 1.
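A minimal sketch of the query, run against an in-memory SQLite database (window functions need SQLite 3.25 or later). The sales table and its rows are hypothetical.

```python
import sqlite3

# Hypothetical sales data with two orders per city.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (city TEXT, order_id INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("Pune", 1, 500), ("Pune", 2, 900), ("Delhi", 3, 700), ("Delhi", 4, 300)],
)

query = """
SELECT city, order_id, amount
FROM (
    SELECT city, order_id, amount,
           ROW_NUMBER() OVER (PARTITION BY city ORDER BY amount DESC) AS rn
    FROM sales
) AS ranked
WHERE rn = 1
ORDER BY city
"""

top_sales = conn.execute(query).fetchall()
print(top_sales)  # [('Delhi', 3, 700), ('Pune', 2, 900)]
```

ROW_NUMBER() returns exactly one row per city even when two sales tie for the top amount; RANK() or DENSE_RANK() would keep all tied rows.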
Storage is mounted in Databricks from a notebook using dbutils.fs.mount.
Call dbutils.fs.mount with the storage source URI, a mount point such as a path under /mnt, and the credentials configuration (extra_configs).
The Databricks CLI has no 'databricks fs mount' command; its 'databricks fs' commands list, copy, and remove files.
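In a notebook, storage is commonly mounted through dbutils.fs.mount. The following is a minimal sketch for an Azure Blob Storage container authenticated with an account key. mount_blob_container is a hypothetical helper name, all argument values are placeholders, and dbutils is assumed to be the object Databricks provides inside a notebook.

```python
def mount_blob_container(dbutils, account, container, mount_point, account_key):
    """Mount an Azure Blob Storage container unless the mount point already exists.

    Returns True if a new mount was created, False if it was already mounted.
    """
    source = f"wasbs://{container}@{account}.blob.core.windows.net"
    conf_key = f"fs.azure.account.key.{account}.blob.core.windows.net"

    # Skip the call if something is already mounted at this path.
    if any(m.mountPoint == mount_point for m in dbutils.fs.mounts()):
        return False

    dbutils.fs.mount(
        source=source,
        mount_point=mount_point,
        extra_configs={conf_key: account_key},
    )
    return True
```

In practice the account key would be read from a secret scope with dbutils.secrets.get rather than passed around as plain text.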
| Role | Salaries reported | Salary range |
| --- | --- | --- |
| Senior Software Engineer | 10 | ₹7.2 L/yr - ₹17.1 L/yr |
| Associate Trainee | 7 | ₹4.1 L/yr - ₹5 L/yr |
| Senior Developer | 7 | ₹8.4 L/yr - ₹13 L/yr |
| Devops Engineer | 6 | ₹4 L/yr - ₹8.9 L/yr |
| Cloud Engineer | 6 | ₹6.8 L/yr - ₹9 L/yr |