Premium Employer

i

This company page is being actively managed by Infosys Team. If you also belong to the team, you can get access from here

Infosys Verified Tick

Compare button icon Compare button icon Compare
3.7

based on 37.1k Reviews

Filter interviews by

Infosys Big Data Developer Interview Questions and Answers

Updated 15 May 2024

Infosys Big Data Developer Interview Experiences

Interview questions from similar companies

Interview experience
1
Bad
Difficulty level
-
Process Duration
-
Result
-
Round 1 - Technical 

(2 Questions)

  • Q1. Related to bigdata
  • Q2. Spark, hive and scala questins
Round 2 - Technical 

(1 Question)

  • Q1. Related to bigdata

Interview Preparation Tips

Interview preparation tips for other job seekers - Don't attend interviews in TCS, it's waste of time. Received tentative salary breakup email post salary discussion. Upon multiple follow ups with the HR for offer letter, they told my profile is on hold from business team and didn't receive offer letter. So please don't waste your time by attending interviews for TCS

Big Data Developer Interview Questions & Answers

Wipro user image Dhanendra Dwivedi

posted on 27 May 2022

I applied via Naukri.com and was interviewed before May 2021. There were 3 interview rounds.

Round 1 - Technical 

(1 Question)

  • Q1. Complete technical evaluations/questions related to the Tools i.e. Hadoop, Hive, Impala, Spark, Sqoop, SQL, Scala, Python.
Round 2 - Group Discussion 

2nd round is technical + Manager both discussions

Round 3 - HR 

(1 Question)

  • Q1. HR discussion related Role, Salary, location etc.

Interview Preparation Tips

Topics to prepare for Wipro Big Data Developer interview:
  • Hadoop
  • Spark
  • Agile Methodology
  • SCALA
  • Python
  • SQL
  • impala
  • Cloud
  • Hive
Interview preparation tips for other job seekers - Prepare well before attending the interview and your technical skills and knowledge should be enough.
Interview experience
5
Excellent
Difficulty level
Moderate
Process Duration
Less than 2 weeks
Result
-

I applied via Recruitment Consulltant

Round 1 - Technical 

(5 Questions)

  • Q1. Explain ETL pipeline ecosystem in Azure Databricks?
  • Q2. Star vs Snowflake schema, when to use?
  • Q3. Find Salary higher than Average department salary
  • Q4. Implementation of SCD2 table
  • Q5. How incremental loading is done
Interview experience
5
Excellent
Difficulty level
Moderate
Process Duration
Less than 2 weeks
Result
Not Selected

I applied via Naukri.com and was interviewed in Dec 2024. There was 1 interview round.

Round 1 - Technical 

(5 Questions)

  • Q1. Scenario based questions on Azure data factory and pipelines
  • Q2. Optimisation technic to improve the performance of databricks
  • Q3. What is Autoloader
  • Q4. What is unity catalog
  • Q5. How you do the alerting mechanism in adf for failed pipelines
Interview experience
3
Average
Difficulty level
Moderate
Process Duration
Less than 2 weeks
Result
No response

I applied via Naukri.com and was interviewed in Oct 2024. There were 2 interview rounds.

Round 1 - Technical 

(7 Questions)

  • Q1. How do you optimize SQL queries?
  • Ans. 

    Optimizing SQL queries involves using indexes, avoiding unnecessary joins, and optimizing the query structure.

    • Use indexes on columns frequently used in WHERE clauses

    • Avoid using SELECT * and only retrieve necessary columns

    • Optimize joins by using INNER JOIN instead of OUTER JOIN when possible

    • Use EXPLAIN to analyze query performance and make necessary adjustments

  • Answered by AI
  • Q2. How do you do performance optimization in Spark. Tell how you did it in you project.
  • Ans. 

    Performance optimization in Spark involves tuning configurations, optimizing code, and utilizing caching.

    • Tune Spark configurations such as executor memory, number of executors, and shuffle partitions.

    • Optimize code by reducing unnecessary shuffles, using efficient transformations, and avoiding unnecessary data movements.

    • Utilize caching to store intermediate results in memory and avoid recomputation.

    • Example: In my projec...

  • Answered by AI
  • Q3. What is SparkContext and SparkSession?
  • Ans. 

    SparkContext is the main entry point for Spark functionality, while SparkSession is the entry point for Spark SQL.

    • SparkContext is the entry point for low-level API functionality in Spark.

    • SparkSession is the entry point for Spark SQL functionality.

    • SparkContext is used to create RDDs (Resilient Distributed Datasets) in Spark.

    • SparkSession provides a unified entry point for reading data from various sources and performing

  • Answered by AI
  • Q4. When a spark job is submitted, what happens at backend. Explain the flow.
  • Ans. 

    When a spark job is submitted, various steps are executed at the backend to process the job.

    • The job is submitted to the Spark driver program.

    • The driver program communicates with the cluster manager to request resources.

    • The cluster manager allocates resources (CPU, memory) to the job.

    • The driver program creates DAG (Directed Acyclic Graph) of the job stages and tasks.

    • Tasks are then scheduled and executed on worker nodes ...

  • Answered by AI
  • Q5. Calculate second highest salary using SQL as well as pyspark.
  • Ans. 

    Calculate second highest salary using SQL and pyspark

    • Use SQL query with ORDER BY and LIMIT to get the second highest salary

    • In pyspark, use orderBy() and take() functions to achieve the same result

  • Answered by AI
  • Q6. 2 types of modes for Spark architecture ?
  • Ans. 

    The two types of modes for Spark architecture are standalone mode and cluster mode.

    • Standalone mode: Spark runs on a single machine with a single JVM and is suitable for development and testing.

    • Cluster mode: Spark runs on a cluster of machines managed by a cluster manager like YARN or Mesos for production workloads.

  • Answered by AI
  • Q7. If you want very less latency - which is better standalone or client mode?
  • Ans. 

    Client mode is better for very less latency due to direct communication with the cluster.

    • Client mode allows direct communication with the cluster, reducing latency.

    • Standalone mode requires an additional layer of communication, increasing latency.

    • Client mode is preferred for real-time applications where low latency is crucial.

  • Answered by AI
Round 2 - Technical 

(2 Questions)

  • Q1. Scenario based. Write SQL and pyspark code for a dataset.
  • Q2. If you have to find latest record based on latest timestamp in a table for a particular customer(table is having history) , how will you do it. Self join and nested query will be expensive. Optimized query...

Interview Preparation Tips

Topics to prepare for LTIMindtree Data Engineer interview:
  • SQL
  • pyspark
  • ETL
Interview preparation tips for other job seekers - L2 was scheduled next day to L1 so the process is fast. Brush up your practical knowledge more.

Skills evaluated in this interview

Data Engineer Interview Questions & Answers

Genpact user image Sashikanta Parida

posted on 17 Dec 2024

Interview experience
5
Excellent
Difficulty level
Moderate
Process Duration
Less than 2 weeks
Result
Not Selected

I applied via Recruitment Consulltant and was interviewed in Nov 2024. There were 2 interview rounds.

Round 1 - Technical 

(3 Questions)

  • Q1. What are different type of joins available in Databricks?
  • Ans. 

    Different types of joins available in Databricks include inner join, outer join, left join, right join, and cross join.

    • Inner join: Returns only the rows that have matching values in both tables.

    • Outer join: Returns all rows when there is a match in either table.

    • Left join: Returns all rows from the left table and the matched rows from the right table.

    • Right join: Returns all rows from the right table and the matched rows ...

  • Answered by AI
  • Q2. How do you make your data pipeline fault tolerant?
  • Ans. 

    Implementing fault tolerance in a data pipeline involves redundancy, monitoring, and error handling.

    • Use redundant components to ensure continuous data flow

    • Implement monitoring tools to detect failures and bottlenecks

    • Set up automated alerts for immediate response to issues

    • Design error handling mechanisms to gracefully handle failures

    • Use checkpoints and retries to ensure data integrity

  • Answered by AI
  • Q3. What is AutoLoader?
  • Ans. 

    AutoLoader is a feature in data engineering that automatically loads data from various sources into a data warehouse or database.

    • Automates the process of loading data from different sources

    • Reduces manual effort and human error

    • Can be scheduled to run at specific intervals

    • Examples: Apache Nifi, AWS Glue

  • Answered by AI
Round 2 - Technical 

(2 Questions)

  • Q1. How do you connect to different services in Azure?
  • Ans. 

    To connect to different services in Azure, you can use Azure SDKs, REST APIs, Azure Portal, Azure CLI, and Azure PowerShell.

    • Use Azure SDKs for programming languages like Python, Java, C#, etc.

    • Utilize REST APIs to interact with Azure services programmatically.

    • Access and manage services through the Azure Portal.

    • Leverage Azure CLI for command-line interface interactions.

    • Automate tasks using Azure PowerShell scripts.

  • Answered by AI
  • Q2. What are linked Services?
  • Ans. 

    Linked Services are connections to external data sources or destinations in Azure Data Factory.

    • Linked Services define the connection information needed to connect to external data sources or destinations.

    • They can be used in Data Factory pipelines to read from or write to external systems.

    • Examples of Linked Services include Azure Blob Storage, Azure SQL Database, and Amazon S3.

  • Answered by AI
Interview experience
4
Good
Difficulty level
Easy
Process Duration
-
Result
-

I applied via Recruitment Consulltant and was interviewed in Nov 2024. There was 1 interview round.

Round 1 - Technical 

(7 Questions)

  • Q1. Difference between bigtable and bigquery.
  • Ans. 

    Bigtable is a NoSQL database for real-time analytics, while BigQuery is a fully managed data warehouse for running SQL queries.

    • Bigtable is a NoSQL database designed for real-time analytics and high throughput, while BigQuery is a fully managed data warehouse for running SQL queries.

    • Bigtable is used for storing large amounts of semi-structured data, while BigQuery is used for analyzing structured data using SQL queries.

    • ...

  • Answered by AI
  • Q2. How to remove duplicate rows from bigquery? find the month of a given date in bigquery.
  • Ans. 

    To remove duplicate rows from BigQuery, use the DISTINCT keyword. To find the month of a given date, use the EXTRACT function.

    • To remove duplicate rows, use SELECT DISTINCT * FROM table_name;

    • To find the month of a given date, use SELECT EXTRACT(MONTH FROM date_column) AS month_name FROM table_name;

    • Make sure to replace 'table_name' and 'date_column' with the appropriate values in your query.

  • Answered by AI
  • Q3. What operator is used in composer to move data from gcs to bq
  • Ans. 

    The operator used in Composer to move data from GCS to BigQuery is the GCS to BigQuery operator.

    • The GCS to BigQuery operator is used in Apache Airflow, which is the underlying technology of Composer.

    • This operator allows you to transfer data from Google Cloud Storage (GCS) to BigQuery.

    • You can specify the source and destination parameters in the operator to define the data transfer process.

  • Answered by AI
  • Q4. Write a code for this - input = [1,2,3,4] output = [1,4,9,16]
  • Ans. 

    Code to square each element in the input array.

    • Iterate through the input array and square each element.

    • Store the squared values in a new array to get the desired output.

  • Answered by AI
  • Q5. Dataflow vs dataproc.
  • Ans. 

    Dataflow is a fully managed stream and batch processing service, while Dataproc is a managed Apache Spark and Hadoop service.

    • Dataflow is a serverless data processing service that automatically scales to handle your data, while Dataproc is a managed Spark and Hadoop service that requires you to provision and manage clusters.

    • Dataflow is designed for both batch and stream processing, allowing you to process data in real-t...

  • Answered by AI
  • Q6. Architecture of bq. Query optimization techniques in bigquery.
  • Ans. 

    BigQuery architecture includes storage, execution, and optimization components for efficient query processing.

    • BigQuery stores data in Capacitor storage system for fast access.

    • Query execution is distributed across multiple nodes for parallel processing.

    • Query optimization techniques include partitioning tables, clustering tables, and using query cache.

    • Using partitioned tables can help eliminate scanning unnecessary data.

    • ...

  • Answered by AI
  • Q7. RDD vs dataframe vs dataset in pyspark
  • Ans. 

    RDD vs dataframe vs dataset in PySpark

    • RDD (Resilient Distributed Dataset) is the basic abstraction in PySpark, representing a distributed collection of objects

    • Dataframe is a distributed collection of data organized into named columns, similar to a table in a relational database

    • Dataset is a distributed collection of data with the ability to use custom classes for type safety and user-defined functions

    • Dataframes and Data...

  • Answered by AI
Interview experience
4
Good
Difficulty level
-
Process Duration
-
Result
-
Round 1 - Technical 

(3 Questions)

  • Q1. What are the optimization techniques used in Apache Spark?
  • Q2. 2 SQL queries , 1 PySpark code and 1 Python Code .
  • Q3. 2-3 Scenario Based questions from ADF and databricks .
Interview experience
5
Excellent
Difficulty level
Moderate
Process Duration
6-8 weeks
Result
Selected Selected
Round 1 - Technical 

(1 Question)

  • Q1. More on Technical area
Round 2 - Technical 

(1 Question)

  • Q1. More on Technical area
Round 3 - One-on-one 

(1 Question)

  • Q1. Technical + Behaviour
Round 4 - One-on-one 

(1 Question)

  • Q1. Technical + Behaviour
Round 5 - HR 

(1 Question)

  • Q1. Expectation and Genaral

Infosys Interview FAQs

How many rounds are there in Infosys Big Data Developer interview?
Infosys interview process usually has 1 rounds. The most common rounds in the Infosys interview process are One-on-one Round.
How to prepare for Infosys Big Data Developer interview?
Go through your CV in detail and study all the technologies mentioned in your CV. Prepare at least two technologies or languages in depth if you are appearing for a technical interview at Infosys. The most common topics and skills that interviewers at Infosys expect are Big Data, Spark, Hadoop, SCALA and Python.

Tell us how to improve this page.

Join Infosys Creating the next opportunity for people, businesses & communities

Interview Questions from Similar Companies

TCS Interview Questions
3.7
 • 10.3k Interviews
Accenture Interview Questions
3.9
 • 8k Interviews
Wipro Interview Questions
3.7
 • 5.5k Interviews
Cognizant Interview Questions
3.8
 • 5.5k Interviews
Capgemini Interview Questions
3.8
 • 4.8k Interviews
Tech Mahindra Interview Questions
3.6
 • 3.8k Interviews
HCLTech Interview Questions
3.5
 • 3.7k Interviews
Genpact Interview Questions
3.9
 • 3k Interviews
LTIMindtree Interview Questions
3.9
 • 2.9k Interviews
IBM Interview Questions
4.1
 • 2.4k Interviews
View all
Infosys Big Data Developer Salary
based on 130 salaries
₹2.8 L/yr - ₹10.2 L/yr
41% less than the average Big Data Developer Salary in India
View more details

Infosys Big Data Developer Reviews and Ratings

based on 5 reviews

4.9/5

Rating in categories

4.9

Skill development

4.9

Work-Life balance

4.3

Salary & Benefits

4.6

Job Security

4.9

Company culture

4.6

Promotions/Appraisal

4.6

Work Satisfaction

Explore 5 Reviews and Ratings
Technology Analyst
56k salaries
unlock blur

₹3 L/yr - ₹11 L/yr

Senior Systems Engineer
49.3k salaries
unlock blur

₹2.8 L/yr - ₹9.2 L/yr

System Engineer
38.8k salaries
unlock blur

₹2.5 L/yr - ₹5.5 L/yr

Technical Lead
30.6k salaries
unlock blur

₹5.2 L/yr - ₹19.5 L/yr

Senior Associate Consultant
27k salaries
unlock blur

₹6.2 L/yr - ₹16.8 L/yr

Explore more salaries
Compare Infosys with

TCS

3.7
Compare

Wipro

3.7
Compare

Cognizant

3.8
Compare

Accenture

3.9
Compare

Calculate your in-hand salary

Confused about how your in-hand salary is calculated? Enter your annual salary (CTC) and get your in-hand salary
Did you find this page helpful?
Yes No
write
Share an Interview