EPAM Systems
10+ Fresenius Kabi Interview Questions and Answers
Q1. How to migrate 1000s of tables using Spark (Databricks) notebooks?
Migrate thousands of tables efficiently by combining Spark's parallel processing with automated, metadata-driven Databricks notebook runs.
Utilize Spark's parallel processing capabilities to handle large volumes of data
Leverage Databricks notebooks for interactive data exploration and transformation
Automate the migration process using scripts or workflows, as in the sketch after this list
Optimize performance by tuning Spark configurations and cluster settings
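A minimal sketch of that automation, assuming a JDBC source and Delta targets; the connection string, table list, and thread count below are hypothetical placeholders, and in practice the table list would come from a metadata table:

```python
# Minimal sketch: migrate a list of tables in parallel from a JDBC source into
# Delta tables. The JDBC URL, table names, and thread count are placeholders.
from concurrent.futures import ThreadPoolExecutor

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bulk-table-migration").getOrCreate()

JDBC_URL = "jdbc:sqlserver://source-host:1433;database=legacy"   # placeholder
TABLES = ["dbo.orders", "dbo.customers", "dbo.invoices"]         # placeholder list

def migrate_table(table: str) -> str:
    # Read the source table over JDBC and rewrite it as a managed Delta table.
    df = (
        spark.read.format("jdbc")
        .option("url", JDBC_URL)
        .option("dbtable", table)
        .option("fetchsize", 10000)
        .load()
    )
    target = "bronze." + table.split(".")[-1]
    df.write.format("delta").mode("overwrite").saveAsTable(target)
    return f"{table} -> {target}"

# Fan out several migrations at once; Spark schedules the underlying jobs on
# the cluster, the thread pool only controls driver-side concurrency.
with ThreadPoolExecutor(max_workers=8) as pool:
    for result in pool.map(migrate_table, TABLES):
        print(result)
```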
Q2. What is the process for finding the missing number from a list?
To find the missing number from a list, calculate the sum of all numbers in the list and subtract it from the expected sum of the list.
Calculate the sum of all numbers in the list using a loop or a built-in function.
Calculate the expected sum using the formula n*(n+1)/2, where n is the largest number in the complete range (one more than the length of the list when the numbers run from 1 to n).
Subtract the sum of the list from the expected sum to find the missing number, as in the example below.
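A small Python example of the sum-difference approach, assuming the list should contain the consecutive integers 1..n with exactly one value missing:

```python
def find_missing_number(nums: list[int]) -> int:
    """Return the value missing from a list that should contain 1..n."""
    n = len(nums) + 1                # the complete range has one more element
    expected_sum = n * (n + 1) // 2  # sum of 1..n
    return expected_sum - sum(nums)

# Example: 4 is missing from the range 1..6
assert find_missing_number([1, 2, 3, 5, 6]) == 4
```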
Q3. Dataflow vs Dataproc, layering processing and curated environments in GCP, data cleaning
Dataflow and Dataproc are both processing services in GCP, but with different approaches and use cases.
Dataflow is a fully managed service for executing batch and streaming data processing pipelines.
Dataproc is a managed Spark and Hadoop service for running big data processing and analytics workloads.
Dataflow provides a serverless and auto-scaling environment, while Dataproc offers more control and flexibility.
Dataflow is suitable for real-time streaming and complex data transformation pipelines, while Dataproc suits teams migrating existing Spark/Hadoop jobs; a small cleaning step that could run on Dataflow is sketched below.
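For comparison, Dataflow pipelines are written against the Apache Beam SDK. The minimal cleaning sketch below uses hypothetical file paths and a hypothetical three-column record layout; it runs locally with the DirectRunner by default and targets Dataflow once GCP pipeline options are supplied:

```python
# Minimal Apache Beam sketch of a cleaning step that could run on Dataflow.
# Paths and the three-column record layout are hypothetical placeholders.
import apache_beam as beam

def parse_and_clean(line: str):
    # Drop malformed rows and normalise the country code.
    parts = [p.strip() for p in line.split(",")]
    if len(parts) != 3 or not parts[0]:
        return  # skip bad records
    customer_id, country, amount = parts
    yield f"{customer_id},{country.upper()},{float(amount):.2f}"

with beam.Pipeline() as pipeline:  # DirectRunner by default; pass DataflowRunner options for GCP
    (
        pipeline
        | "Read raw" >> beam.io.ReadFromText("raw/customers.csv", skip_header_lines=1)
        | "Clean" >> beam.FlatMap(parse_and_clean)
        | "Write curated" >> beam.io.WriteToText("curated/customers", file_name_suffix=".csv")
    )
```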
Q4. What are some methods for optimizing Spark performance?
Optimizing Spark performance involves tuning configurations, partitioning data, caching, and using efficient transformations.
Tune Spark configurations for memory allocation, parallelism, and resource management.
Partition data properly to distribute work evenly across nodes and minimize shuffling.
Cache intermediate results in memory to avoid recomputation.
Use efficient transformations like map, filter, and reduceByKey instead of costly operations like groupByKey.
Opt for columnar file formats such as Parquet to reduce I/O; several of these points are illustrated in the sketch below.
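A short PySpark sketch of those levers, with illustrative (not prescriptive) config values and placeholder paths and column names:

```python
# Illustrative PySpark snippets for the tuning points above; paths, column
# names, and config values are placeholders rather than recommendations.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("spark-tuning-demo")
    .config("spark.sql.shuffle.partitions", "200")  # match shuffle parallelism to the cluster
    .config("spark.sql.adaptive.enabled", "true")   # let AQE coalesce and skew-split shuffles
    .getOrCreate()
)

sales = spark.read.parquet("/data/sales")           # columnar input keeps I/O low

# Repartition on the aggregation key so work is spread evenly across nodes.
sales = sales.repartition(200, "region")

# Cache a DataFrame that several downstream queries reuse.
sales.cache()

# DataFrame aggregations pre-combine per partition before the shuffle.
totals = sales.groupBy("region").agg(F.sum("amount").alias("total_amount"))

# On the RDD API, reduceByKey combines locally before the shuffle, unlike groupByKey.
pair_totals = (
    sales.select("region", "amount").rdd
    .map(lambda row: (row["region"], row["amount"]))
    .reduceByKey(lambda a, b: a + b)
)
```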
Q5. End-to-end project architecture
The end-to-end project architecture involves designing and implementing the entire data pipeline from data ingestion to data visualization.
Data ingestion: Collecting data from various sources such as databases, APIs, and files.
Data processing: Cleaning, transforming, and aggregating the data using tools like Apache Spark or Hadoop.
Data storage: Storing the processed data in data warehouses or data lakes like Amazon S3 or Google BigQuery.
Data analysis: Performing analysis on the stored data and presenting results through dashboards or BI tools (a compressed pipeline skeleton follows).
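A compressed PySpark skeleton of that flow; the bucket, column names, and table names are hypothetical:

```python
# Minimal end-to-end sketch: ingest raw files, clean and aggregate, store the
# curated result, and expose it for analysis. All names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("orders-pipeline").getOrCreate()

# 1. Ingestion: raw CSV landed by an upstream extract (API, database dump, files).
raw = spark.read.option("header", True).csv("s3://my-bucket/landing/orders/")

# 2. Processing: fix types, drop bad rows, aggregate to daily revenue.
clean = (
    raw.withColumn("amount", F.col("amount").cast("double"))
       .filter(F.col("order_id").isNotNull())
)
daily = clean.groupBy("order_date").agg(F.sum("amount").alias("daily_revenue"))

# 3. Storage: write curated data partitioned by date for cheap downstream reads.
daily.write.mode("overwrite").partitionBy("order_date").parquet(
    "s3://my-bucket/curated/daily_revenue/"
)

# 4. Analysis/visualization: expose the result as a table for SQL and BI tools.
daily.createOrReplaceTempView("daily_revenue")
spark.sql("SELECT * FROM daily_revenue ORDER BY order_date DESC LIMIT 10").show()
```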
Q6. Delete duplicates from a table in Spark and SQL
To delete duplicates from a table in Spark and SQL, you can use the DISTINCT keyword or the dropDuplicates() function.
In SQL, you can use the DISTINCT keyword in a SELECT statement to retrieve unique rows from a table.
In Spark, you can use the dropDuplicates() function on a DataFrame to remove duplicate rows.
Both methods compare all columns by default, but you can restrict the comparison to specific columns.
In SQL, a ROW_NUMBER() window function partitioned by the key columns lets you keep one row per group and discard the rest, as in the sketch below.
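A short illustration of both routes, assuming an orders table keyed by order_id with an updated_at column; the CREATE OR REPLACE TABLE statement assumes Delta/Databricks SQL, and all names are placeholders:

```python
# Deduplication sketch: dropDuplicates() on a DataFrame and ROW_NUMBER() in SQL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dedup-demo").getOrCreate()
orders = spark.table("staging.orders")  # placeholder source table

# Spark: keep one row per order_id (compares all columns if no subset is given).
deduped = orders.dropDuplicates(["order_id"])
deduped.write.mode("overwrite").saveAsTable("curated.orders")

# SQL: number the rows within each duplicate group and keep only the latest.
spark.sql("""
    CREATE OR REPLACE TABLE curated.orders_sql AS
    SELECT order_id, customer_id, amount, updated_at
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY updated_at DESC) AS rn
        FROM staging.orders
    )
    WHERE rn = 1
""")
```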
Q7. Expectations from EPAM
I expect EPAM to provide challenging projects, opportunities for growth, a collaborative work environment, and support for continuous learning.
Challenging projects that allow me to utilize my skills and knowledge
Opportunities for professional growth and advancement within the company
A collaborative work environment where teamwork is valued
Support for continuous learning through training programs and resources
Q8. Types of transformations, number of jobs, tasks, and actions
In Spark, transformations are lazy and are classified as narrow or wide; actions trigger execution, and each action produces a job that runs as stages and tasks.
Types of transformations: narrow (map, filter, withColumn) need no shuffle; wide (groupByKey, reduceByKey, join) force a shuffle and a new stage.
Number of jobs: one job per action called on a DataFrame or RDD.
Number of tasks: one task per partition in each stage, so it depends on the partition count.
Actions: collect(), count(), show(), take(), and write operations trigger the actual computation (illustrated below).
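A tiny PySpark illustration of where jobs, stages, and tasks come from:

```python
# Narrow vs wide transformations, and how actions map to jobs.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lazy-eval-demo").getOrCreate()

events = spark.range(1_000_000).withColumn("key", F.col("id") % 10)

# Narrow transformation: each output partition depends on a single input
# partition, so no shuffle is needed (filter, withColumn, map-like ops).
filtered = events.filter(F.col("id") % 2 == 0)

# Wide transformation: groupBy requires a shuffle, which adds a stage boundary.
counts = filtered.groupBy("key").count()

# Nothing has run yet - transformations are lazy. Each action below triggers one
# job, and every stage of that job runs one task per partition.
counts.count()   # action -> job 1
counts.show(5)   # action -> job 2
```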
Q9. Optimization in Spark, SQL, BigQuery, and Airflow
Optimization techniques in Spark, SQL, BigQuery, and Airflow.
Use partitioning and bucketing in Spark to optimize data processing.
Optimize SQL queries by using indexes, query rewriting, and query optimization techniques.
In BigQuery, use partitioning and clustering to improve query performance.
Leverage Airflow's task parallelism and resource allocation settings to optimize workflow execution; a few compressed examples follow.
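A few compressed examples of these levers; the table names, keys, and bucket count are hypothetical, and the BigQuery DDL is included as a string for context only:

```python
# Spark: partition and bucket the curated table so joins and date filters prune work.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("optimisation-demo").getOrCreate()
orders = spark.table("staging.orders")  # placeholder source table

(
    orders.write
    .partitionBy("order_date")
    .bucketBy(32, "customer_id")
    .sortBy("customer_id")
    .mode("overwrite")
    .saveAsTable("curated.orders")
)

# BigQuery: partitioning and clustering serve the same pruning purpose there
# (assumes order_ts is a TIMESTAMP column).
BIGQUERY_DDL = """
CREATE TABLE analytics.orders
PARTITION BY DATE(order_ts)
CLUSTER BY customer_id AS
SELECT * FROM staging.orders
"""

# Airflow: limit concurrent task instances per DAG with max_active_tasks on the
# DAG object, and size overall worker capacity with the parallelism setting in
# airflow.cfg.
```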
Q10. Architecture of Spark, Airflow, and BigQuery
Spark is a distributed processing engine, Airflow is a workflow management system, and BigQuery is a fully managed data warehouse.
Spark is designed for big data processing and provides in-memory computation capabilities.
Airflow is used for orchestrating and scheduling data pipelines.
BigQuery is a serverless data warehouse that allows for fast and scalable analytics.
Spark can be integrated with Airflow to schedule and monitor Spark jobs.
BigQuery can be used as a data source or sink for pipelines orchestrated by Airflow; a minimal DAG sketch follows.
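A minimal Airflow 2-style DAG sketch of how these pieces fit together, using BashOperator so no provider packages are assumed; the script path, bucket, and dataset names are hypothetical:

```python
# Orchestration sketch: Airflow schedules a Spark transformation, then loads the
# output into BigQuery with the bq CLI. All paths and names are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="spark_to_bigquery_daily",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Run the Spark job (spark-submit must be available on the worker).
    transform = BashOperator(
        task_id="spark_transform",
        bash_command="spark-submit /jobs/transform_events.py --date {{ ds }}",
    )

    # Load the Parquet output into BigQuery via the bq CLI.
    load = BashOperator(
        task_id="load_to_bigquery",
        bash_command=(
            "bq load --source_format=PARQUET "
            "analytics.events_{{ ds_nodash }} "
            "gs://my-bucket/events/{{ ds }}/*.parquet"
        ),
    )

    transform >> load
```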