I was approached by the company and was interviewed in Feb 2024. There was 1 interview round.
I applied via Naukri.com and was interviewed in Nov 2024. There was 1 interview round.
I am a Senior Data Engineer with experience in building scalable data pipelines and optimizing data processing workflows.
Experience in designing and implementing ETL processes using tools like Apache Spark and Airflow
Proficient in working with large datasets and optimizing query performance
Strong background in data modeling and database design
Worked on projects involving real-time data processing and streaming analytics
Decorators in Python are functions that modify the behavior of other functions or methods.
Decorators are defined using the @decorator_name syntax before a function definition.
They can be used to add functionality to existing functions without modifying their code.
Decorators can be used for logging, timing, authentication, and more.
Example: @staticmethod decorator in Python is used to define a static method in a class.
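A minimal sketch of a custom decorator (the timing example below is illustrative, not from the interview):

```python
import functools
import time

def timed(func):
    """Decorator that reports how long the wrapped function takes."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"{func.__name__} took {elapsed:.4f}s")
        return result
    return wrapper

@timed
def load_rows(n):
    return list(range(n))

load_rows(1_000_000)  # prints the elapsed time, then returns the list
```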
SQL query to group by employee ID and combine first name and last name with a space
Use the GROUP BY clause to group by employee ID
Use the CONCAT function to combine first name and last name with a space
Select employee ID, CONCAT(first_name, ' ', last_name) AS full_name
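One way the query could look, here run through Spark SQL against an illustrative employees table (the table and column names are assumptions):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("concat-example").getOrCreate()

# Illustrative data; table and column names are made up.
spark.createDataFrame(
    [(1, "Alice", "Smith"), (2, "Bob", "Jones")],
    ["employee_id", "first_name", "last_name"],
).createOrReplaceTempView("employees")

# Non-key columns are included in GROUP BY so the query is valid under strict SQL modes.
spark.sql("""
    SELECT employee_id,
           CONCAT(first_name, ' ', last_name) AS full_name
    FROM employees
    GROUP BY employee_id, first_name, last_name
""").show()
```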
Constructors in Python are special methods used for initializing objects. They are called automatically when a new instance of a class is created.
Constructors are defined using the __init__() method in a class.
They are used to initialize instance variables of a class.
Example:
    class Person:
        def __init__(self, name, age):
            self.name = name
            self.age = age

    person1 = Person('Alice', 30)
Indexing in SQL is a technique used to improve the performance of queries by creating a data structure that allows for faster retrieval of data.
Indexes are created on columns in a database table to speed up the retrieval of rows that match a certain condition in a WHERE clause.
Indexes can be created using CREATE INDEX statement in SQL.
Types of indexes include clustered indexes, non-clustered indexes, unique indexes, an...
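A small runnable sketch of creating and using an index, using SQLite purely for illustration (the CREATE INDEX syntax is similar across engines; table and column names are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER, last_name TEXT, dept TEXT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, "Smith", "Finance"), (2, "Jones", "Data"), (3, "Lee", "Data")],
)

# Non-clustered index on the column used in WHERE clauses.
conn.execute("CREATE INDEX idx_employees_dept ON employees (dept)")

# The query planner can now use the index for this lookup.
for row in conn.execute("SELECT id, last_name FROM employees WHERE dept = 'Data'"):
    print(row)
```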
Spark works well with Parquet files due to its columnar storage format, efficient compression, and ability to push down filters.
Parquet files are columnar storage format, which aligns well with Spark's processing model of working on columns rather than rows.
Parquet files support efficient compression, reducing storage space and improving read performance in Spark.
Spark can push down filters to Parquet files, allowing row groups that do not match the predicate to be skipped so less data is read (see the sketch below).
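A minimal PySpark sketch of column pruning and filter pushdown with Parquet (the path and column names are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("parquet-example").getOrCreate()

# Write a small DataFrame out as Parquet (columnar, compressed by default).
df = spark.range(1_000_000).withColumn("category", F.col("id") % 10)
df.write.mode("overwrite").parquet("/tmp/events_parquet")

# Reading back: only the referenced columns are scanned, and the filter
# can be pushed down so non-matching row groups are skipped.
events = spark.read.parquet("/tmp/events_parquet")
events.filter(F.col("category") == 3).select("id").explain()  # plan shows PushedFilters
```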
I was interviewed in Sep 2024.
Pyspark is a Python library for big data processing using Spark framework.
Pyspark is used for processing large datasets in parallel.
It provides APIs for data manipulation, querying, and analysis.
Example: Using pyspark to read a CSV file and perform data transformations.
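A sketch of what such a job could look like (the file path and column names are assumptions):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("csv-example").getOrCreate()

# Read a CSV with a header row, inferring column types.
orders = spark.read.csv("/tmp/orders.csv", header=True, inferSchema=True)

# Simple transformations: filter, derive a column, aggregate.
summary = (
    orders
    .filter(F.col("amount") > 0)
    .withColumn("order_date", F.to_date("created_at"))
    .groupBy("order_date")
    .agg(F.sum("amount").alias("daily_revenue"))
)
summary.show()
```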
Databricks optimisation techniques improve performance and efficiency of data processing on the Databricks platform.
Use cluster sizing and autoscaling to optimize resource allocation based on workload
Leverage Databricks Delta for optimized data storage and processing
Utilize caching and persisting data to reduce computation time
Optimize queries by using appropriate indexing and partitioning strategies
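Two of those techniques sketched in PySpark; this assumes a Databricks/Delta runtime, and the database, table, and column names are made up:

```python
from pyspark.sql import functions as F

# `spark` is the session Databricks provides in notebooks; the tables are illustrative.
events = spark.table("raw.events").filter(F.col("event_date") >= "2024-01-01")

# Cache a DataFrame that several downstream queries reuse.
events.cache()

# Write a partitioned Delta table so queries can prune partitions.
(events.write
       .format("delta")
       .mode("overwrite")
       .partitionBy("event_date")
       .saveAsTable("analytics.events_daily"))

# Compact small files and co-locate rows on a frequently filtered column.
spark.sql("OPTIMIZE analytics.events_daily ZORDER BY (user_id)")
```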
Databricks is a unified data analytics platform that provides a collaborative environment for data engineers.
Databricks is built on top of Apache Spark and provides a workspace for data engineering tasks.
It allows for easy integration with various data sources and tools for data processing.
Databricks provides features like notebooks, clusters, and libraries for efficient data engineering workflows.
posted on 25 Jul 2024
Developed a real-time data processing system for analyzing customer behavior
Designed and implemented data pipelines using Apache Kafka and Spark
Optimized data processing algorithms to handle large volumes of streaming data
Collaborated with data scientists to integrate machine learning models into the system
I applied via Naukri.com and was interviewed in Jun 2024. There was 1 interview round.
Extract only comments from Twitter
I applied via Naukri.com and was interviewed in Aug 2024. There were 2 interview rounds.
Developed a real-time data processing system for analyzing customer behavior on an e-commerce platform.
Used Apache Kafka for real-time data streaming
Implemented data pipelines using Apache Spark for processing large volumes of data
Designed and optimized data models in PostgreSQL for storing and querying customer data
Types of SCD (Slowly Changing Dimension) include Type 1, Type 2, and Type 3.
Type 1 SCD: Overwrites old data with new data, no history is maintained.
Type 2 SCD: Maintains historical data by creating new records for changes.
Type 3 SCD: Creates separate columns to store historical and current data.
Examples: Type 1 - Employee address updates overwrite the old address. Type 2 - Employee salary changes create a new record with an effective date. Type 3 - Employee department changes keep the previous department in a separate column. (A Type 2 merge is sketched below.)
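A simplified sketch of a Type 2 upsert using Delta Lake SQL, assuming an existing Spark session and illustrative dim_employee and staging_employee tables with employee_id, salary, effective_date, end_date, and is_current columns:

```python
# Step 1: close out the current row for any employee whose salary changed.
spark.sql("""
    MERGE INTO dim_employee AS tgt
    USING staging_employee AS src
      ON tgt.employee_id = src.employee_id AND tgt.is_current = true
    WHEN MATCHED AND tgt.salary <> src.salary THEN
      UPDATE SET tgt.is_current = false, tgt.end_date = current_date()
""")

# Step 2: insert a fresh current row for new employees and for changed ones.
spark.sql("""
    INSERT INTO dim_employee
    SELECT src.employee_id, src.salary, current_date() AS effective_date,
           CAST(NULL AS DATE) AS end_date, true AS is_current
    FROM staging_employee AS src
    LEFT JOIN dim_employee AS tgt
      ON tgt.employee_id = src.employee_id AND tgt.is_current = true
    WHERE tgt.employee_id IS NULL OR tgt.salary <> src.salary
""")
```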
Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark and big data workloads.
Delta Lake provides ACID transactions, schema enforcement, and data versioning on top of data lakes.
Data Lake is a centralized repository that allows you to store all your structured and unstructured data at any scale.
Delta Lake is optimized for big data workloads and is built on top of Apache Spark.
Data Lake...
To write a file in a delta table, you can use the Delta Lake API or Spark SQL commands.
Use Delta Lake API to write data to a delta table
Use Spark SQL commands like INSERT INTO to write data to a delta table
Ensure that the data being written is in the correct format and schema
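For example, in PySpark (this assumes a session with the Delta Lake extensions available, as on Databricks; the paths and table names are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-write").getOrCreate()

df = spark.createDataFrame([(1, "laptop"), (2, "phone")], ["order_id", "product"])

# DataFrame API: create or append to a Delta table stored at a path.
df.write.format("delta").mode("append").save("/tmp/delta/orders")

# Spark SQL: register a named Delta table, then insert into it.
df.write.format("delta").mode("overwrite").saveAsTable("orders_delta")
df.createOrReplaceTempView("new_orders")
spark.sql("INSERT INTO orders_delta SELECT order_id, product FROM new_orders")
```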
Optimisation techniques used in the project include indexing, query optimization, caching, and parallel processing.
Indexing: Creating indexes on frequently queried columns to improve search performance.
Query optimization: Rewriting queries to make them more efficient and reduce execution time.
Caching: Storing frequently accessed data in memory to reduce the need for repeated database queries.
Parallel processing: Distributing work across multiple cores or nodes so that large jobs complete faster.
Tasks and stages are components of the execution plan in Spark UI.
Tasks are the smallest unit of work in Spark, representing a single operation on a partition of data.
Stages are groups of tasks that are executed together as part of a larger computation.
Tasks within a stage can be executed in parallel, while stages are executed sequentially.
Tasks are created based on the transformations and actions in the Spark application.
DAG (Directed Acyclic Graph) in Apache Spark is used to represent a series of data processing steps and their dependencies.
DAG in Spark helps optimize the execution of tasks by determining the order in which they should be executed based on dependencies.
It breaks down a Spark job into smaller tasks and organizes them in a way that minimizes unnecessary computations.
DAGs are built automatically by Spark when an action is triggered on an RDD or DataFrame.
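A small illustration: the transformations below only extend the DAG, and nothing executes until the action at the end triggers it.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dag-example").getOrCreate()

df = spark.range(1_000_000)                          # transformation: no work yet
doubled = df.withColumn("double", F.col("id") * 2)   # transformation: extends the DAG
filtered = doubled.filter(F.col("double") % 3 == 0)  # transformation: extends the DAG

filtered.count()    # action: Spark now derives stages/tasks from the DAG and runs them
filtered.explain()  # shows the physical plan derived from that DAG
```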
I have used various transformation techniques such as data cleaning, normalization, aggregation, and feature engineering in my projects.
Data cleaning to remove missing values and outliers
Normalization to scale numerical features
Aggregation to summarize data at different levels
Feature engineering to create new relevant features
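A compressed PySpark sketch of those steps (the data and column names are invented for illustration):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("transform-example").getOrCreate()

df = spark.createDataFrame(
    [("A", 30, 50000.0), ("A", 41, 82000.0), ("B", None, 61000.0), ("B", 36, 95000.0)],
    "segment string, age int, income double",
)

# Cleaning: drop rows with missing values.
clean = df.dropna(subset=["age", "income"])

# Normalization: min-max scale income to [0, 1].
stats = clean.agg(F.min("income").alias("lo"), F.max("income").alias("hi")).first()
scaled = clean.withColumn(
    "income_scaled", (F.col("income") - stats["lo"]) / (stats["hi"] - stats["lo"])
)

# Aggregation plus a simple engineered feature per segment.
scaled.groupBy("segment").agg(
    F.avg("income_scaled").alias("avg_income_scaled"),
    F.count("*").alias("customers"),
).show()
```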
I keep myself updated by regularly attending conferences, workshops, online courses, and reading industry blogs.
Attend conferences and workshops related to data engineering
Take online courses on platforms like Coursera, Udemy, or DataCamp
Read industry blogs and follow thought leaders on social media
Join online communities and forums to discuss latest technologies and trends
I applied via Recruitment Consultant and was interviewed in Jul 2024. There were 2 interview rounds.
I have successfully led the development of a real-time data processing system, resulting in a 30% increase in efficiency.
Led the development of a real-time data processing system
Achieved a 30% increase in efficiency
Implemented data quality checks to ensure accuracy
I have faced challenges in optimizing data pipelines, handling large volumes of data, and ensuring data quality.
Optimizing data pipelines to improve efficiency and performance
Handling large volumes of data to prevent bottlenecks and ensure scalability
Ensuring data quality by implementing data validation processes and error handling mechanisms
posted on 16 Nov 2023
I applied via Recruitment Consultant and was interviewed before Nov 2022. There were 4 interview rounds.
I applied via Naukri.com and was interviewed in Sep 2022. There were 2 interview rounds.
Vertices are nodes and edges are connections between nodes in a directed acyclic graph (DAG).
Vertices represent the tasks or operations in a DAG.
Edges represent the dependencies between tasks or operations.
Vertices can have multiple incoming edges and outgoing edges.
Edges can be weighted to represent the cost or time required to complete a task.
Examples of DAGs include data processing pipelines and task scheduling systems.
Calculating executor resources from the available cores and memory, accounting for overhead and driver memory
Calculate the total memory available by multiplying the number of cores with memory per core
Deduct the overhead memory required for the operating system and other processes
Allocate driver memory separately from executor memory based on the workload
Consider the memory requirements for other services like Hadoop, Spark, etc.
Example: For 16 cores with ...
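A rough worked sketch using common rules of thumb; the node size, the 5-cores-per-executor figure, and the ~10% memory overhead below are assumptions, not values from the interview:

```python
# Assumed node size: 16 cores and 64 GB RAM (illustrative).
node_cores, node_mem_gb = 16, 64

# Reserve roughly 1 core and 1 GB for the OS and Hadoop/YARN daemons.
usable_cores = node_cores - 1        # 15
usable_mem_gb = node_mem_gb - 1      # 63

# Common rule of thumb: about 5 cores per executor.
cores_per_executor = 5
executors_per_node = usable_cores // cores_per_executor   # 3

# Split memory evenly, then carve out ~10% (min 384 MB) as executor memory overhead.
mem_per_executor = usable_mem_gb / executors_per_node     # 21 GB
overhead_gb = max(0.10 * mem_per_executor, 0.384)
executor_memory_gb = mem_per_executor - overhead_gb       # ~18.9 GB

# One executor slot is typically kept aside cluster-wide for the driver / ApplicationMaster.
print(f"--executor-cores {cores_per_executor} "
      f"--num-executors {executors_per_node} (per node) "
      f"--executor-memory {executor_memory_gb:.0f}g")
```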
I applied via Recruitment Consultant and was interviewed in May 2024. There were 2 interview rounds.
Questions related to GCP and cloud functions
GCP architecture refers to the structure and components of Google Cloud Platform for building and managing applications and services.
GCP architecture is based on a global network of data centers that provide secure, scalable infrastructure for cloud services.
Key components include Compute Engine for virtual machines, Cloud Storage for object storage, and BigQuery for data analytics.
GCP architecture also includes networking and security services.
Optimizing queries in BigQuery involves using partitioned tables, clustering, and optimizing joins.
Partition tables by date or another relevant column to reduce the amount of data scanned
Use clustering to group related rows together, reducing the amount of data scanned for queries
Avoid unnecessary joins and denormalize data where possible to reduce query complexity
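A sketch of a partitioned and clustered table plus a query that benefits from it, submitted through the BigQuery Python client (the dataset, table, and columns are illustrative, and GCP credentials are assumed to be configured):

```python
from google.cloud import bigquery

client = bigquery.Client()  # assumes application default credentials are set up

ddl = """
CREATE TABLE IF NOT EXISTS analytics.page_views
(
  view_date DATE,
  user_id   STRING,
  page      STRING
)
PARTITION BY view_date
CLUSTER BY user_id
"""
client.query(ddl).result()

# Queries that filter on the partition and cluster columns scan far less data.
job = client.query("""
    SELECT page, COUNT(*) AS views
    FROM analytics.page_views
    WHERE view_date BETWEEN '2024-01-01' AND '2024-01-31'
      AND user_id = 'u123'
    GROUP BY page
""")
for row in job:
    print(row.page, row.views)
```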
Yes, I have experience in Dataflow, Dataproc, and cloud composer.
I have worked with Dataflow to process and analyze large datasets in real-time.
I have used Dataproc to create and manage Apache Spark and Hadoop clusters for big data processing.
I have experience with cloud composer for orchestrating workflows and managing data pipelines.
Different types of joins in SQL include inner join, left join, right join, and full outer join.
Inner join: Returns rows when there is a match in both tables.
Left join: Returns all rows from the left table and the matched rows from the right table.
Right join: Returns all rows from the right table and the matched rows from the left table.
Full outer join: Returns all rows from both tables, matching them where possible and filling NULLs where there is no match.
Example: SELECT ...
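A hedged example of a LEFT JOIN run through Spark SQL (the tables and columns are made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("join-example").getOrCreate()

spark.createDataFrame(
    [(1, "Alice"), (2, "Bob"), (3, "Carol")], ["emp_id", "name"]
).createOrReplaceTempView("employees")

spark.createDataFrame(
    [(1, "Finance"), (3, "Data")], ["emp_id", "dept"]
).createOrReplaceTempView("departments")

# LEFT JOIN: every employee appears; dept is NULL when there is no match.
spark.sql("""
    SELECT e.emp_id, e.name, d.dept
    FROM employees e
    LEFT JOIN departments d ON e.emp_id = d.emp_id
""").show()
```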
Cloud functions like Cloud Build and Cloud Run in GCP are serverless computing services for building and running applications in the cloud.
Cloud Build is a service that executes your builds on Google Cloud Platform infrastructure. It automatically builds and tests your code in the cloud.
Cloud Run is a managed compute platform that enables you to run stateless containers that are invocable via HTTP requests. It automatically scales container instances with request traffic, including down to zero.
Salaries reported by role:

Role | Salaries reported | Salary range
GL Accountant | 191 | ₹3.6 L/yr - ₹10 L/yr
Financial Analyst | 126 | ₹3.6 L/yr - ₹9.5 L/yr
Financial Associate | 90 | ₹3 L/yr - ₹6.5 L/yr
Data Engineer | 74 | ₹8.9 L/yr - ₹37 L/yr
Software Engineer | 55 | ₹6 L/yr - ₹22 L/yr