Diggibyte Technologies
I applied via Naukri.com and was interviewed in May 2022. There were 2 interview rounds.
Spark architecture is a distributed computing framework that processes large datasets in parallel across a cluster of nodes.
Spark has a master-slave architecture with a driver program that communicates with the cluster manager to allocate resources and tasks to worker nodes.
Worker nodes execute tasks in parallel and store data in memory or disk.
Spark supports various data sources and APIs for batch processing, streaming, SQL, and machine learning (a minimal sketch follows below).
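As an illustration of this driver/executor split, here is a minimal PySpark sketch; the app name and partition count are arbitrary placeholders, not anything from the original answer:

from pyspark.sql import SparkSession

# The driver program builds a SparkSession; the cluster manager allocates executors.
spark = SparkSession.builder.appName("example-app").getOrCreate()

# Work is split into partitions that executors process in parallel.
rdd = spark.sparkContext.parallelize(range(1_000_000), numSlices=8)
print(rdd.map(lambda x: x * x).sum())

spark.stop()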
DAG stands for Directed Acyclic Graph and is a way to represent dependencies between tasks. RDD stands for Resilient Distributed Datasets and is a fundamental data structure in Apache Spark.
DAG is used to represent a series of tasks or operations where each task depends on the output of the previous task.
RDD is a distributed collection of data that can be processed in parallel across multiple nodes in a cluster.
RDDs are immutable and fault-tolerant, and can be recomputed from their lineage if a partition is lost.
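A small PySpark sketch of how lazy transformations build up a DAG over an RDD, with an action triggering execution; the sample values are illustrative:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dag-rdd-demo").getOrCreate()

# Transformations are lazy: they only extend the DAG of pending operations.
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
squared = rdd.map(lambda x: x * x)             # a node in the DAG
evens = squared.filter(lambda x: x % 2 == 0)   # depends on the previous node

# An action triggers Spark to execute the whole DAG across the cluster.
print(evens.collect())  # [4, 16]

spark.stop()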
Serialization is the process of converting an object into a stream of bytes for storage or transmission.
Serialization is used to transfer objects between different applications or systems.
It allows objects to be stored in a file or database.
Serialization can be used for caching and improving performance.
Examples of serialization formats include JSON, XML, and binary formats like Protocol Buffers and Apache Avro.
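A quick Python illustration using the standard json module, serializing an object to a byte stream and restoring it; the record fields are invented for the example:

import json

# Serialize an in-memory object to a byte stream (here via a JSON string).
record = {"id": 1, "name": "spark-job", "retries": 3}
payload = json.dumps(record).encode("utf-8")

# Deserialize the bytes back into an equivalent object on the receiving side.
restored = json.loads(payload.decode("utf-8"))
assert restored == record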
Accumulators are variables used for aggregating data in Spark. GroupByKey and ReduceByKey are operations used for data transformation.
Accumulators are used to accumulate values across multiple tasks in a distributed environment.
GroupByKey is used to group all values that share a key into a single key-and-collection-of-values pair.
ReduceByKey is used to aggregate data based on a key and reduce the data to a single value.
GroupByKey is less efficient than ReduceByKey because it shuffles every value across the network before grouping, whereas ReduceByKey combines values per partition first (see the sketch below).
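A hedged PySpark sketch contrasting an accumulator, reduceByKey and groupByKey; the sample key-value pairs and the "bad record" rule are invented for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("agg-demo").getOrCreate()
sc = spark.sparkContext

# Accumulator: tasks on the workers add to it, the driver reads the final value.
bad_records = sc.accumulator(0)

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("b", -1)])

def checked(kv):
    if kv[1] < 0:
        bad_records.add(1)
    return kv

# reduceByKey combines values per key on each partition before shuffling.
totals = pairs.map(checked).reduceByKey(lambda x, y: x + y).collect()
print(totals)              # e.g. [('a', 4), ('b', 1)]

# groupByKey shuffles every value, then groups them per key.
grouped = pairs.groupByKey().mapValues(list).collect()
print(grouped)             # e.g. [('a', [1, 3]), ('b', [2, -1])]

print(bad_records.value)   # 1, read on the driver after the action has run
spark.stop()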
Choose a cluster based on data size, complexity, and processing requirements.
Consider the size and complexity of the data to be processed.
Determine the processing requirements, such as batch or real-time processing.
Choose a cluster with appropriate resources, such as CPU, memory, and storage.
Examples of Azure clusters include HDInsight, Databricks, and Synapse Analytics.
Mount points for ADLS are typically created from Azure Databricks using dbutils.fs.mount; data can then be loaded using Azure Data Factory or Azure Databricks.
In Databricks, dbutils.fs.mount attaches an ADLS container to the workspace file system (see the sketch after this list); the storage account itself can be managed through Azure Storage Explorer or the Azure Portal
To load data, use Azure Data Factory or Azure Databricks
Mount points allow you to access data in ADLS as if it were a local file system
Data can be loaded into ADLS using various tools such as Azure Data Factory or Azure Databricks
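For reference, a minimal Databricks-notebook sketch of mounting an ADLS Gen2 container with dbutils.fs.mount; the storage account, container, tenant, secret scope and paths are all placeholders, and the OAuth settings assume a service principal has been granted access:

# Runs inside a Databricks notebook, where dbutils is available.
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<application-id>",
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="my-scope", key="sp-secret"),
    "fs.azure.account.oauth2.client.endpoint":
        "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

dbutils.fs.mount(
    source="abfss://<container>@<storage-account>.dfs.core.windows.net/",
    mount_point="/mnt/raw",
    extra_configs=configs,
)

# Once mounted, the lake can be read like a local path.
df = spark.read.parquet("/mnt/raw/sales/")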
I applied via Naukri.com and was interviewed in Nov 2024. There were 2 interview rounds.
I applied via Recruitment Consultant and was interviewed in Aug 2024. There were 3 interview rounds.
The output after inner join of table 1 and table 2 will be 2,3,5.
Inner join only includes rows that have matching values in both tables.
Values 2, 3, and 5 are present in both tables, so they will be included in the output.
Null values are not considered as matching values in inner join.
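The exact tables from the question are not reproduced above, so the sketch below uses assumed tables whose only common non-null ids are 2, 3 and 5:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("join-demo").getOrCreate()

# Illustrative tables: matching ids 2, 3, 5 plus non-matching and NULL ids.
t1 = spark.createDataFrame([(1,), (2,), (3,), (5,), (None,)], ["id"])
t2 = spark.createDataFrame([(2,), (3,), (5,), (6,), (None,)], ["id"])

# Inner join keeps only ids present in both tables; NULLs never match each other.
t1.join(t2, on="id", how="inner").show()   # rows with id 2, 3, 5

spark.stop()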
The project involves building a data pipeline to ingest, process, and analyze large volumes of data from various sources in Azure.
Utilizing Azure Data Factory for data ingestion and orchestration
Implementing Azure Databricks for data processing and transformation
Storing processed data in Azure Data Lake Storage
Using Azure Synapse Analytics for data warehousing and analytics
Leveraging Azure DevOps for CI/CD pipeline automation
Designing an effective ADF pipeline involves considering various metrics and factors.
Understand the data sources and destinations
Identify the dependencies between activities
Optimize data movement and processing for performance
Monitor and track pipeline execution for troubleshooting
Consider security and compliance requirements
Use parameterization and dynamic content for flexibility
Implement error handling and retries for reliability
I was interviewed in Dec 2024.
I applied via Company Website and was interviewed in Dec 2024. There was 1 interview round.
I applied via Naukri.com and was interviewed in Oct 2024. There was 1 interview round.
Activities in Azure Data Factory (ADF) are the building blocks of a pipeline and perform various tasks like data movement, data transformation, and data orchestration.
Activities can be used to copy data from one location to another (Copy Activity)
Activities can be used to transform data using mapping data flows (Data Flow Activity)
Activities can be used to run custom code or scripts (Custom Activity)
Activities can be used to control flow, for example with ForEach, If Condition, and Execute Pipeline activities
Dataframes in pyspark are distributed collections of data organized into named columns.
Dataframes are similar to tables in a relational database, with rows and columns.
They can be created from various data sources like CSV, JSON, Parquet, etc.
Dataframes support SQL queries and transformations using PySpark functions.
Example: df = spark.read.csv('file.csv')
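Extending that example slightly: a sketch that reads a CSV (placeholder path), applies DataFrame transformations, and runs SQL over a temporary view; the 'amount' and 'country' columns are assumed purely for illustration:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dataframe-demo").getOrCreate()

# 'file.csv' is a placeholder path; header/schema options depend on the file.
df = spark.read.csv("file.csv", header=True, inferSchema=True)

# DataFrame transformations...
df.filter(F.col("amount") > 100).groupBy("country").count().show()

# ...or plain SQL over the same data via a temporary view.
df.createOrReplaceTempView("sales")
spark.sql("SELECT country, COUNT(*) AS n FROM sales GROUP BY country").show()

spark.stop()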
I applied via Recruitment Consultant and was interviewed in Mar 2024. There was 1 interview round.
I connect onPrem to Azure using Azure ExpressRoute or VPN Gateway.
Use Azure ExpressRoute for a private, dedicated connection that does not traverse the public internet.
Set up a VPN Gateway for secure connection over the internet.
Ensure proper network configurations and security settings.
Use Azure Virtual Network Gateway to establish the connection.
Consider using Azure Site-to-Site VPN for connecting onPremises network to Azure Virtual Network.
Autoloader in Databricks is a feature that automatically loads new data files as they arrive in a specified directory.
Autoloader monitors a specified directory for new data files and loads them into a Databricks table.
It supports various file formats such as CSV, JSON, Parquet, Avro, and ORC.
Autoloader simplifies the process of ingesting streaming data into Databricks without the need for manual intervention.
It can be configured to infer and evolve the schema of incoming files (see the sketch below).
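A minimal Auto Loader sketch, assuming a recent Databricks runtime; the source directory, schema/checkpoint locations and target table name are placeholders:

# Runs on Databricks; paths and table names below are illustrative.
stream = (
    spark.readStream
    .format("cloudFiles")                        # Auto Loader source
    .option("cloudFiles.format", "json")         # format of the incoming files
    .option("cloudFiles.schemaLocation", "/mnt/checkpoints/orders_schema")
    .load("/mnt/raw/orders/")                    # directory being monitored
)

(
    stream.writeStream
    .option("checkpointLocation", "/mnt/checkpoints/orders")
    .trigger(availableNow=True)                  # process newly arrived files, then stop
    .toTable("bronze.orders")
)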
Json data normalization involves structuring data to eliminate redundancy and improve efficiency.
Identify repeating groups of data
Create separate tables for each group
Establish relationships between tables using foreign keys
Eliminate redundant data by referencing shared values
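A small plain-Python sketch of normalizing one nested JSON order into a header table and a line-items table linked by a foreign key; the field names are invented for illustration:

# Nested JSON document with a repeating group ("lines").
order = {
    "order_id": 101,
    "customer": "Acme",
    "lines": [
        {"sku": "A-1", "qty": 2},
        {"sku": "B-7", "qty": 1},
    ],
}

# Header table: one row per order, no repeated data.
orders_table = [{"order_id": order["order_id"], "customer": order["customer"]}]

# Line-items table: one row per line, referencing the order via a foreign key.
order_lines_table = [
    {"order_id": order["order_id"], **line}
    for line in order["lines"]
]

print(orders_table)
print(order_lines_table)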
I applied via Referral and was interviewed in May 2024. There was 1 interview round.
Polybase is a feature in Azure SQL Data Warehouse that allows users to query data stored in Hadoop or Azure Blob Storage.
Polybase enables users to access and query external data sources without moving the data into the database.
It provides a virtualization layer that allows SQL queries to seamlessly integrate with data stored in Hadoop or Azure Blob Storage.
Polybase can significantly improve query performance by leveraging parallel processing when reading external data.
Use DISTINCT keyword in SQL to remove duplicates from a dataset.
Use SELECT DISTINCT column_name FROM table_name to retrieve unique values from a specific column.
Use SELECT DISTINCT * FROM table_name to retrieve unique rows from the entire table.
Use GROUP BY with COUNT() (for example, HAVING COUNT(*) > 1) to identify duplicate rows based on specific columns.
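The same ideas expressed in PySpark, using distinct()/dropDuplicates() alongside a SQL DISTINCT over a temporary view; the sample rows are made up:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dedup-demo").getOrCreate()

df = spark.createDataFrame([(1, "a"), (1, "a"), (2, "b")], ["id", "val"])

df.distinct().show()                # unique rows, like SELECT DISTINCT *
df.dropDuplicates(["id"]).show()    # unique rows keyed on specific column(s)

# The equivalent DISTINCT query written as SQL over a temporary view.
df.createOrReplaceTempView("t")
spark.sql("SELECT DISTINCT val FROM t").show()

spark.stop()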
Data Engineer | 27 salaries | ₹3 L/yr - ₹10 L/yr
Scrum Master | 4 salaries | ₹11 L/yr - ₹19 L/yr
Front end Developer | 4 salaries | ₹3 L/yr - ₹12.5 L/yr
Qliksense Developer | 4 salaries | ₹5 L/yr - ₹7.7 L/yr
Data Scientist | 3 salaries | ₹3.7 L/yr - ₹10 L/yr