Persistent Systems
Repartition can increase or decrease the number of partitions in a DataFrame, while coalesce can only decrease it.
Repartition always performs a full shuffle of data across the cluster, which tends to produce evenly sized partitions.
Coalesce merges existing partitions without a full shuffle, so it is usually cheaper than repartition when reducing the partition count, though the resulting partitions can be uneven.
Repartition is typically...
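The difference can be illustrated with a minimal plain-Python sketch, where a list of lists stands in for a DataFrame's partitions. The helper names `coalesce` and `repartition` below are hypothetical and only mimic the idea; Spark's real implementations also consider data locality and partitioning strategy.

```python
def coalesce(partitions, n):
    """Merge whole partitions into at most n groups; no row is redistributed individually."""
    n = min(n, len(partitions))  # coalesce never increases the partition count
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i % n].extend(part)
    return merged


def repartition(partitions, n):
    """Redistribute every row round-robin into n partitions (models a full shuffle)."""
    out = [[] for _ in range(n)]
    rows = [row for part in partitions for row in part]
    for i, row in enumerate(rows):
        out[i % n].append(row)
    return out


parts = [[1, 2], [3], [4, 5, 6], [7]]
coalesced = coalesce(parts, 2)
reshuffled = repartition(parts, 3)
```

In this toy run the coalesced partitions end up uneven, while the repartitioned ones are balanced, which is the trade-off described above.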
DAGs handle fault tolerance by rerunning failed tasks and maintaining task dependencies.
DAGs rerun failed tasks automatically to ensure completion.
DAGs maintain task dependencies to ensure proper sequencing.
DAGs can be configured to retry failed tasks a certain number of times before marking them as failed.
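A minimal sketch of the retry behaviour described above, assuming tasks are plain Python callables and dependencies are given as a dict of upstream task names. `run_dag`, `max_retries`, and the task names are hypothetical and do not correspond to any particular scheduler's API.

```python
from graphlib import TopologicalSorter


def run_dag(tasks, deps, max_retries=2):
    """Run callables in dependency order, retrying each failed task.

    tasks: dict of task name -> zero-argument callable
    deps:  dict of task name -> set of upstream task names
    Returns a dict of task name -> number of attempts used.
    """
    attempts = {}
    # static_order() yields each task only after all of its upstream tasks.
    for name in TopologicalSorter(deps).static_order():
        for attempt in range(1, max_retries + 2):
            attempts[name] = attempt
            try:
                tasks[name]()
                break
            except Exception as exc:
                if attempt == max_retries + 1:
                    # Retries exhausted: mark the task failed and stop downstream work.
                    raise RuntimeError(f"task {name!r} failed after {attempt} attempts") from exc
    return attempts


calls = {"n": 0}


def flaky_transform():
    # Fails on the first call only, to simulate a transient error.
    calls["n"] += 1
    if calls["n"] < 2:
        raise ValueError("transient failure")


tasks = {"extract": lambda: None, "transform": flaky_transform, "load": lambda: None}
deps = {"extract": set(), "transform": {"extract"}, "load": {"transform"}}
result = run_dag(tasks, deps)
```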
Use Spark and SQL to find the top 5 countries with the highest population.
Use Spark to load the data and perform data processing.
Use SQL queries to group by country and sum the population.
Order the results in descending order and limit to top 5.
Example: SELECT country, SUM(population) AS total_population FROM table_name GROUP BY country ORDER BY total_population DESC LIMIT 5
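The query above can be tried end to end with a small sketch. Here Python's built-in sqlite3 module stands in for Spark SQL so the example runs without a cluster, and the table contents are made-up sample figures, not real population data. In Spark the same statement would typically be passed to `spark.sql(...)` after registering the DataFrame as a temporary view.

```python
import sqlite3

# Made-up sample rows: (country, city, population).
rows = [
    ("A", "a1", 50), ("A", "a2", 30),
    ("B", "b1", 70),
    ("C", "c1", 10), ("C", "c2", 15),
    ("D", "d1", 60),
    ("E", "e1", 5),
    ("F", "f1", 40),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_name (country TEXT, city TEXT, population INTEGER)")
conn.executemany("INSERT INTO table_name VALUES (?, ?, ?)", rows)

# Group by country, sum the population, sort descending, keep the top 5.
top5 = conn.execute("""
    SELECT country, SUM(population) AS total_population
    FROM table_name
    GROUP BY country
    ORDER BY total_population DESC
    LIMIT 5
""").fetchall()
conn.close()
```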
Cores and worker nodes are decided based on the workload requirements and scalability needs of the data processing system.
Consider the size and complexity of the data being processed
Evaluate the processing speed and memory requirements of the tasks
Take into account the parallelism and concurrency needed for efficient data processing
Monitor the system performance and adjust cores and worker nodes as needed
Enforcing schema ensures that data conforms to a predefined structure and rules.
Ensures data integrity by validating incoming data against predefined schema
Helps in maintaining consistency and accuracy of data
Prevents data corruption and errors in data processing
Can lead to rejection of data that does not adhere to the schema
Compare the records returned by different join types on two tables.
Use SQL queries to perform the different joins: INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL JOIN.
Identify the key columns in both tables to join on.
Select the columns from both tables and use a WHERE clause (for example, checking that the other table's key IS NULL) to isolate the records that differ between joins.
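A minimal sketch of the comparison, using Python's built-in sqlite3 module and two hypothetical tables `a` and `b` joined on `id`. Only INNER JOIN and LEFT JOIN are shown because older SQLite versions lack RIGHT and FULL JOIN; the rows unique to `b` are found by swapping the table order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER, val TEXT);
    CREATE TABLE b (id INTEGER, val TEXT);
    INSERT INTO a VALUES (1, 'a1'), (2, 'a2'), (3, 'a3');
    INSERT INTO b VALUES (2, 'b2'), (3, 'b3'), (4, 'b4');
""")

# INNER JOIN keeps only ids present in both tables.
inner = conn.execute(
    "SELECT a.id, a.val, b.val FROM a INNER JOIN b ON a.id = b.id ORDER BY a.id"
).fetchall()

# LEFT JOIN keeps every row of a, with NULLs where b has no match.
left = conn.execute(
    "SELECT a.id, a.val, b.val FROM a LEFT JOIN b ON a.id = b.id ORDER BY a.id"
).fetchall()

# Filtering on IS NULL isolates the rows that exist on one side only.
only_in_a = conn.execute(
    "SELECT a.id FROM a LEFT JOIN b ON a.id = b.id WHERE b.id IS NULL"
).fetchall()
only_in_b = conn.execute(
    "SELECT b.id FROM b LEFT JOIN a ON a.id = b.id WHERE a.id IS NULL"
).fetchall()
conn.close()
```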
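Use the len() function to check the length of a pandas data frame; a PySpark DataFrame does not support len().
For pandas, len(df) returns the number of rows in the data frame.
If the length is 0, then the data frame is empty.
Example: if len(df) == 0: print('Data frame is empty')
For PySpark, use df.isEmpty() (Spark 3.3+) or check len(df.head(1)) == 0.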
SCD stands for Slowly Changing Dimension, a concept in data warehousing to track changes in data over time.
SCD is used to maintain historical data in a data warehouse.
The three most common types of SCD are Type 1, Type 2, and Type 3.
Type 1 SCD overwrites old data with new data.
Type 2 SCD creates a new record for each change, preserving history.
Type 3 SCD keeps the previous value in an extra column alongside the current value, so only limited history is retained.
SCD is important for...
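Type 2 is the variant most often asked about, so here is a minimal plain-Python sketch of it. The dimension is a list of dicts, and the field names (`valid_from`, `valid_to`, `is_current`) and the helper `apply_scd2` are assumptions chosen for illustration, not a fixed standard.

```python
from datetime import date


def apply_scd2(dimension, key, new_attrs, effective):
    """Apply one change to a Type 2 dimension held as a list of dicts.

    Each row has: key, attrs, valid_from, valid_to (None = still open), is_current.
    """
    current = next(
        (r for r in dimension if r["key"] == key and r["is_current"]), None
    )
    if current is not None:
        if current["attrs"] == new_attrs:
            return dimension  # nothing changed, so history stays as is
        # Close the current version instead of overwriting it.
        current["valid_to"] = effective
        current["is_current"] = False
    dimension.append({
        "key": key,
        "attrs": dict(new_attrs),
        "valid_from": effective,
        "valid_to": None,
        "is_current": True,
    })
    return dimension


dim = []
apply_scd2(dim, 101, {"city": "Pune"}, date(2023, 1, 1))
apply_scd2(dim, 101, {"city": "Nagpur"}, date(2024, 6, 1))
apply_scd2(dim, 101, {"city": "Nagpur"}, date(2024, 7, 1))  # unchanged, no new row
```

A Type 1 update would instead replace `attrs` on the existing row, losing the earlier value.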
Merging two schemas in PySpark involves combining DataFrames with different structures into a unified format.
Use the `unionByName()` method to merge DataFrames whose columns are in a different order or only partly overlap; it matches columns by name rather than by position.
Example: df1.unionByName(df2, allowMissingColumns=True) merges df1 and df2, filling missing columns with nulls.
For schema evolution, use the `mergeSchema` option when reading Parquet files.
Example: spark.read.option('mergeSchema', 't...
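To make the union-by-name semantics concrete without a Spark session, here is a plain-Python sketch over lists of dict rows. The function `union_by_name` is a hypothetical stand-in that only mimics the null-filling behaviour described for `allowMissingColumns=True`; it is not PySpark code.

```python
def union_by_name(rows1, rows2):
    """Stack two lists of dict rows, aligning values by column name.

    Columns missing from a row are filled with None.
    """
    columns = []
    for row in rows1 + rows2:
        for col in row:
            if col not in columns:
                columns.append(col)  # keep first-seen column order
    return [{col: row.get(col) for col in columns} for row in rows1 + rows2]


df1_rows = [{"id": 1, "name": "x"}]
df2_rows = [{"id": 2, "age": 30}]
merged = union_by_name(df1_rows, df2_rows)
```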
Reverse a string using SQL and Python.
In SQL, use the REVERSE function to reverse a string.
In Python, use slicing with a step of -1 to reverse a string.
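The Python half can be sketched in a couple of lines; the function name `reverse_string` is just an illustrative choice.

```python
def reverse_string(text):
    # A slice with step -1 walks the string from the last character to the first.
    return text[::-1]


reversed_word = reverse_string("Spark")
```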
I applied via Naukri.com and was interviewed in Aug 2024. There were 2 interview rounds.
I am a Senior Data Engineer with experience in developing data pipelines and optimizing data storage for various projects.
Developed data pipelines using Apache Spark for real-time data processing
Optimized data storage using technologies like Hadoop and AWS S3
Worked on a project to analyze customer behavior and improve marketing strategies
My day-to-day job in the project involved designing and implementing data pipelines, optimizing data workflows, and collaborating with cross-functional teams.
Designing and implementing data pipelines to extract, transform, and load data from various sources
Optimizing data workflows to improve efficiency and performance
Collaborating with cross-functional teams including data scientists, analysts, and business stakeholde...
Shuffling is the process of redistributing data across partitions in a distributed computing environment.
Shuffling is necessary when data needs to be grouped or aggregated across different partitions.
It can be handled efficiently by minimizing the amount of data being shuffled and optimizing the partitioning strategy.
Techniques like partitioning, combiners, and reducers can help reduce the amount of shuffling in MapRed...
Incremental data is handled by identifying new data since the last update and merging it with existing data.
Identify new data since last update
Merge new data with existing data
Update data warehouse or database with incremental changes
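One common way to carry out these steps is a watermark on a last-updated column. The sketch below assumes each source row has an `id` and an `updated_at` value (ISO date strings here, which compare correctly as text); the function `incremental_merge` and the field names are hypothetical.

```python
def incremental_merge(target, source_rows, last_watermark):
    """Upsert rows changed after last_watermark into target (a dict keyed by id).

    Returns the new watermark to store for the next run.
    """
    # Step 1: identify new or changed rows since the last update.
    new_rows = [r for r in source_rows if r["updated_at"] > last_watermark]
    # Step 2: merge them into the existing data (insert new ids, overwrite changed ones).
    for row in new_rows:
        target[row["id"]] = row
    # Step 3: advance the watermark only as far as the data actually seen.
    return max((r["updated_at"] for r in new_rows), default=last_watermark)


target = {1: {"id": 1, "value": "old", "updated_at": "2024-01-01"}}
source = [
    {"id": 1, "value": "new", "updated_at": "2024-01-03"},
    {"id": 2, "value": "fresh", "updated_at": "2024-01-02"},
    {"id": 3, "value": "stale", "updated_at": "2023-12-31"},
]
watermark = incremental_merge(target, source, "2024-01-01")
```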
The Catalyst optimizer is the query optimization framework in Spark SQL; it improves performance by turning a query into an efficient execution plan.
Catalyst is primarily rule-based, and it can also use cost-based optimization when table statistics are available.
It applies rules to transform the logical query plan into an optimized logical plan and then selects a physical plan.
The optimizer applies techniques such as predicate pushdown, constant folding, and join reordering.
By o...
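Constant folding is the easiest of these techniques to show in isolation. The sketch below is a toy rule over a hypothetical tuple-based expression tree, written only to illustrate what a rule-based rewrite does; it is not how Catalyst itself is implemented.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def fold_constants(expr):
    """Fold constant sub-expressions in a tiny expression tree.

    expr is a number, a column name (str), or a tuple (op, left, right).
    """
    if not isinstance(expr, tuple):
        return expr  # a literal or a column reference: nothing to fold
    op, left, right = expr
    left, right = fold_constants(left), fold_constants(right)
    if isinstance(left, (int, float)) and isinstance(right, (int, float)):
        return OPS[op](left, right)  # both sides known: evaluate once at plan time
    return (op, left, right)


# price + 2 * (3 + 4): the constant part can be computed before any row is read.
plan = ("+", "price", ("*", 2, ("+", 3, 4)))
optimized = fold_constants(plan)
```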
Used query optimization techniques to improve performance in database queries.
Utilized indexing to speed up search queries.
Implemented query caching to reduce redundant database calls.
Optimized SQL queries by restructuring joins and subqueries.
Utilized database partitioning to improve query performance.
Used query profiling tools to identify and optimize slow queries.
I applied via Naukri.com and was interviewed before Jun 2023. There were 3 interview rounds.
It’s just reasoning type questions.
SSIS stands for SQL Server Integration Services, a tool provided by Microsoft for data integration and workflow applications.
SSIS is a platform for building high-performance data integration and workflow solutions.
It allows you to create packages that move data from various sources to destinations.
SSIS includes a visual design interface for creating, monitoring, and managing data integration processes.
You can use SSIS ...
SSIS packages are used for ETL processes in SQL Server. Union combines datasets vertically; Merge also stacks rows but needs two sorted inputs, and Merge Join is what combines datasets horizontally.
SSIS packages are used for Extract, Transform, Load (ETL) processes in SQL Server.
Union in SSIS combines datasets vertically, stacking rows on top of each other.
Merge in SSIS combines two sorted datasets into one sorted output; Merge Join combines datasets horizontally, matching rows based on specified columns.
Union All in SSIS combines datasets vertica...
posted on 26 Feb 2021
I applied via Company Website and was interviewed before Feb 2020. There were 4 interview rounds.
Handled high pressure from client by prioritizing tasks and communicating effectively.
Identified critical issues and addressed them first
Communicated regularly with the client to provide updates and manage expectations
Collaborated with team members to delegate tasks and ensure timely delivery
Maintained a calm and professional demeanor to avoid escalating the situation
Release management is the process of planning, scheduling, coordinating, and deploying software releases.
It involves identifying the scope of the release and the features to be included
Creating a release plan and schedule
Coordinating with different teams involved in the release process
Testing the release to ensure it meets quality standards
Deploying the release to production
Monitoring the release to ensure it is stable...
posted on 29 Jan 2021
I applied via Naukri.com and was interviewed in Jul 2020. There were 4 interview rounds.
posted on 23 Jan 2022
I applied via Naukri.com and was interviewed in Jul 2021. There were 3 interview rounds.
I applied via Naukri.com and was interviewed before Jan 2021. There were 3 interview rounds.
I applied via Referral and was interviewed in Jun 2021. There were 3 interview rounds.
I applied via Internshala and was interviewed in May 2021. There were 3 interview rounds.
Discussing my resume highlights my skills, experiences, and projects relevant to the software engineering role.
Experience with Java and Python in developing web applications.
Led a team project that improved application performance by 30%.
Contributed to open-source projects, enhancing my coding skills and collaboration.
Completed an internship at XYZ Corp, where I developed a feature that increased user engagement.
I applied via Campus Placement and was interviewed before Aug 2021. There were 3 interview rounds.
The first round was an aptitude test with questions ranging from basic mathematical concepts to logical/analytical questions. English was also included in the test. The difficulty was medium and I was able to solve 70-80% of the questions.
Two coding questions were part of the test. I was supposed to solve both and pass all the test cases. The questions tested my knowledge of arrays, loops, and pointers. I was able to solve one and partially solve the other.
I applied via Campus Placement and was interviewed before Sep 2021. There were 4 interview rounds.
Prepare the usual aptitude topics: maths, quantitative, and analytical reasoning.
My GD topic was "Is the internet good for students or not?"
I didn't attempt this, as I was a noob back in my third year of engineering.
Software Engineer (4.6k salaries): ₹4.7 L/yr - ₹11.1 L/yr
Senior Software Engineer (4.6k salaries): ₹6.8 L/yr - ₹18.7 L/yr
Lead Software Engineer (3.6k salaries): ₹8.4 L/yr - ₹17.4 L/yr
Lead Engineer (3.5k salaries): ₹13.7 L/yr - ₹25 L/yr
Project Lead (2.1k salaries): ₹21.2 L/yr - ₹36 L/yr