Ingram Micro
Spark performance tuning methods involve optimizing resource allocation, data partitioning, and caching.
Optimize resource allocation by adjusting memory and CPU settings in Spark configurations.
Partition data effectively to distribute work evenly across nodes.
Utilize caching to store intermediate results in memory for faster access.
Use broadcast variables for small lookup tables to reduce shuffle operations.
Monitor and...
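A minimal PySpark sketch of these tuning levers; the memory, core, and partition values and the file paths are illustrative assumptions, not recommendations:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

# Resource allocation: executor memory/cores are example values only
spark = (
    SparkSession.builder
    .appName("tuning-sketch")
    .config("spark.executor.memory", "4g")
    .config("spark.executor.cores", "4")
    .config("spark.sql.shuffle.partitions", "200")  # controls shuffle partitioning
    .getOrCreate()
)

orders = spark.read.parquet("/data/orders")         # hypothetical large table
countries = spark.read.parquet("/data/countries")   # hypothetical small lookup table

# Repartition on the join key to spread work evenly, cache a reused intermediate result
orders = orders.repartition(200, "country_id").cache()

# Broadcast the small lookup table to avoid a shuffle join
joined = orders.join(broadcast(countries), "country_id")
joined.show()
```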
Use Pyspark to remove regex characters from column values
Use the regexp_replace function in Pyspark to remove regex characters from column values
Specify the regex pattern to match and the replacement string
Apply the regexp_replace function to the desired column in the DataFrame
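A short sketch of this with regexp_replace; the DataFrame and the pattern (strip everything that is not a letter or digit) are made up for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_replace

spark = SparkSession.builder.appName("regex-clean").getOrCreate()

df = spark.createDataFrame([("AB#12!",), ("C@D-34",)], ["code"])

# Replace every character that is not a letter or digit with an empty string
cleaned = df.withColumn("code_clean", regexp_replace("code", r"[^A-Za-z0-9]", ""))
cleaned.show()
```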
I have experience working as a Data Engineer at XYZ Company for 2 years.
Developed ETL pipelines to extract, transform, and load data from various sources
Optimized database performance and implemented data quality checks
Collaborated with cross-functional teams to design and implement data solutions
I applied via Naukri.com and was interviewed in Nov 2024. There was 1 interview round.
posted on 25 Sep 2024
I applied via Walk-in and was interviewed in Aug 2024. There were 5 interview rounds.
Maths, grammar & communication
Why do you like this job opportunity?
posted on 29 Jul 2024
Handling imbalanced data involves techniques like resampling, using different algorithms, and adjusting class weights.
Use resampling techniques like oversampling or undersampling to balance the data
Utilize algorithms that are robust to imbalanced data, such as Random Forest or XGBoost
Adjust class weights in the model to give more importance to minority class
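A minimal sketch of the class-weight approach using scikit-learn on synthetic imbalanced data; the dataset and the 95/5 split are assumptions for illustration only:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Synthetic imbalanced data: roughly 95% majority class, 5% minority class
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# class_weight="balanced" gives the minority class more weight in the loss
clf = RandomForestClassifier(class_weight="balanced", random_state=42)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```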
Python code to calculate correlation between two features
Import pandas library
Use Series.corr() to compute the correlation between two specific columns, e.g. df['a'].corr(df['b'])
Use df.corr() to get the full correlation matrix across all numeric columns
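A small sketch of both approaches; the column names and values are made-up sample data:

```python
import pandas as pd

df = pd.DataFrame({
    "height_cm": [150, 160, 165, 172, 180],
    "weight_kg": [50, 58, 63, 70, 80],
})

# Correlation between two specific columns (Pearson by default)
print(df["height_cm"].corr(df["weight_kg"]))

# Full correlation matrix for the selected columns
print(df[["height_cm", "weight_kg"]].corr())
```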
Outliers can be handled by removing, transforming, or imputing them based on the context of the data.
Identify outliers using statistical methods like Z-score, IQR, or visualization techniques.
Remove outliers if they are due to data entry errors or measurement errors.
Transform skewed data using log transformation or winsorization to reduce the impact of outliers.
Impute outliers with the median or mean if they are valid ...
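A compact sketch of the IQR-based options described above; the salary column and its values are invented for illustration:

```python
import pandas as pd

df = pd.DataFrame({"salary": [32, 35, 36, 38, 40, 41, 43, 300]})  # 300 is an outlier

# Identify outliers with the 1.5 * IQR rule
q1, q3 = df["salary"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = df[(df["salary"] < lower) | (df["salary"] > upper)]

# Option 1: remove the outlying rows
cleaned = df[df["salary"].between(lower, upper)]

# Option 2: winsorize (clip) values to the IQR bounds
clipped = df["salary"].clip(lower, upper)

# Option 3: impute outliers with the median
imputed = df["salary"].mask(
    (df["salary"] < lower) | (df["salary"] > upper), df["salary"].median()
)
print(outliers, cleaned, clipped.tolist(), imputed.tolist(), sep="\n")
```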
I would communicate openly with the client, provide updates on the progress, and discuss potential solutions to meet the deadline.
Communicate proactively with the client about the delay
Provide regular updates on the progress of the task
Discuss potential solutions to meet the deadline, such as reallocating resources or extending the timeline
Apologize for the delay and take responsibility for the situation
Ensure that the...
I applied via LinkedIn and was interviewed in Jul 2024. There were 2 interview rounds.
It was a pair programming round where we had to attempt a couple of Spark scenarios along with the interviewer. You are given boilerplate code with some functionality to be filled in, and you are assessed on writing clean, extensible code and test cases.
Types of clusters in Databricks include Standard, High Concurrency, and Single Node clusters.
Standard clusters are used for general-purpose workloads
High Concurrency clusters are optimized for concurrent workloads
Single Node clusters are used for development and testing purposes
Catalyst optimizer is a query optimizer in Apache Spark that leverages advanced techniques to optimize and improve the performance of Spark SQL queries.
Catalyst optimizer uses a rule-based and cost-based optimization approach to generate an optimized query plan.
It performs various optimizations such as constant folding, predicate pushdown, and projection pruning to improve query performance.
Catalyst optimizer also leve...
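A quick way to see Catalyst's work is to print a query's plans with explain(True); the DataFrame below is a toy example:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("catalyst-demo").getOrCreate()

df = spark.range(1_000_000).withColumn("doubled", col("id") * 2)

# explain(True) prints the parsed, analyzed, optimized logical and physical plans,
# showing optimizations such as predicate pushdown and projection pruning
result = df.filter(col("id") > 500).select("id")
result.explain(True)
```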
The explode function is used in Apache Spark to split an array into multiple rows.
Creates a new row for each element in the array
Commonly used in data processing and transformation tasks
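A minimal sketch of explode on a made-up DataFrame with an array column:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode

spark = SparkSession.builder.appName("explode-demo").getOrCreate()

df = spark.createDataFrame(
    [("alice", ["spark", "sql"]), ("bob", ["python"])],
    ["name", "skills"],
)

# explode creates one output row per element of the skills array
df.select("name", explode("skills").alias("skill")).show()
```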
Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark and big data workloads.
Delta Lake provides ACID transactions, schema enforcement, and time travel capabilities on top of data lakes.
Data lakes are a storage repository that holds a vast amount of raw data in its native format until it is needed.
Delta Lake is optimized for big data workloads and provides reliability and performance ...
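A brief sketch, assuming a Spark environment where Delta Lake is available (for example Databricks, or open-source Spark configured with the delta-spark package); the table path and data are illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-demo").getOrCreate()

events = spark.createDataFrame([(1, "click"), (2, "view")], ["id", "event"])

# Write a Delta table: writes are ACID and the schema is enforced on later appends
events.write.format("delta").mode("overwrite").save("/tmp/events_delta")

# Time travel: read an earlier version of the table by version number
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/events_delta")
v0.show()
```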
RDD stands for Resilient Distributed Dataset, a fundamental data structure in Apache Spark.
RDD is a fault-tolerant collection of elements that can be operated on in parallel.
RDDs are immutable, meaning they cannot be changed once created.
RDDs support two types of operations: transformations (creating a new RDD from an existing one) and actions (returning a value to the driver program).
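A small sketch showing the transformation/action distinction on a toy RDD:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize([1, 2, 3, 4, 5])

# Transformation: lazily builds a new RDD; the original RDD is immutable
squared = rdd.map(lambda x: x * x)

# Actions: trigger computation and return values to the driver program
print(squared.collect())                    # [1, 4, 9, 16, 25]
print(squared.reduce(lambda a, b: a + b))   # 55
```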
Ingram Micro salaries by role:

| Role | Salaries reported | Salary range |
| Software Engineer | 126 | ₹3 L/yr - ₹14.2 L/yr |
| Senior Software Engineer | 125 | ₹6 L/yr - ₹20 L/yr |
| DEP Manager, Sales | 103 | ₹5 L/yr - ₹13 L/yr |
| Product Manager | 77 | ₹6.7 L/yr - ₹25 L/yr |
| Senior Associate | 55 | ₹2.5 L/yr - ₹7.2 L/yr |