Affine
Optimize ETL processes and Spark jobs through efficient design, resource management, and performance tuning.
Use partitioning to improve data processing speed. For example, partitioning large datasets by date can speed up queries.
Implement data caching in Spark to store intermediate results, reducing the need for repeated computations.
Optimize data formats by using columnar storage formats like Parquet or ORC, which improve compression and scan performance; see the sketch below.
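A minimal PySpark sketch of these three ideas (the paths and column names are hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("etl_tuning").getOrCreate()

# Read raw data, then write it partitioned by date so date-filtered queries scan less data
orders = spark.read.json("/data/raw/orders")  # hypothetical source path
orders.write.mode("overwrite").partitionBy("order_date").parquet("/data/curated/orders")

# Cache an intermediate result that several downstream aggregations reuse
curated = spark.read.parquet("/data/curated/orders")
completed = curated.filter("status = 'COMPLETED'").cache()
completed.groupBy("order_date").count().show()
completed.groupBy("order_date").sum("amount").show()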
Joins in SQL are used to combine rows from two or more tables based on a related column between them.
Types of joins include INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL JOIN
INNER JOIN returns rows when there is at least one match in both tables
LEFT JOIN returns all rows from the left table and the matched rows from the right table
RIGHT JOIN returns all rows from the right table and the matched rows from the left table, while FULL JOIN returns rows that match in either table (see the pandas sketch below)
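The same join types can be illustrated with pandas merge(), whose how= argument mirrors the SQL semantics (toy data for illustration):

import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "name": ["A", "B", "C"]})
right = pd.DataFrame({"id": [2, 3, 4], "dept": ["HR", "IT", "Fin"]})

inner = left.merge(right, on="id", how="inner")    # INNER JOIN: ids 2, 3
left_j = left.merge(right, on="id", how="left")    # LEFT JOIN: ids 1, 2, 3
right_j = left.merge(right, on="id", how="right")  # RIGHT JOIN: ids 2, 3, 4
full = left.merge(right, on="id", how="outer")     # FULL JOIN: ids 1, 2, 3, 4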
I applied via Naukri.com and was interviewed in Feb 2022. There were 4 interview rounds.
The test had a mix of questions on Statistics, Probability, Machine Learning, SQL, and Python.
To retain special characters when reading data into pandas, use the encoding parameter.
Use the encoding parameter when reading the data in pandas
Specify the encoding type of the data file
Example: pd.read_csv('filename.csv', encoding='utf-8')
Use pandas' read_csv() method with appropriate parameters to read large .csv files quickly.
Use the chunksize parameter to read the file in smaller chunks
Use the low_memory parameter to optimize memory usage
Use the dtype parameter to specify data types for columns
Use the usecols parameter to read only necessary columns
Use the skiprows parameter to skip unnecessary rows
Use the nrows parameter to read only a specific number of rows for a quick preview
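A minimal sketch combining several of these parameters (the file and column names are hypothetical):

import pandas as pd

totals = {}
# Stream the file in 100k-row chunks, keeping only the needed columns with compact dtypes
for chunk in pd.read_csv(
    "large_file.csv",                # hypothetical file
    usecols=["category", "amount"],  # hypothetical columns
    dtype={"amount": "float32"},
    chunksize=100_000,
):
    for cat, subtotal in chunk.groupby("category")["amount"].sum().items():
        totals[cat] = totals.get(cat, 0.0) + float(subtotal)
print(totals)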
Use vectorized operations, avoid loops, and optimize memory usage.
Prefer vectorized column-level operations (whole-Series arithmetic, NumPy functions) over explicit Python loops; apply(), map(), and applymap() are convenient but still evaluate element by element.
Avoid using iterrows() and itertuples() as they are slower than vectorized operations.
Optimize memory usage by using appropriate data types and dropping unnecessary columns.
Use inplace=True parameter to modify the DataFrame in place instead of creating a copy.
Use pd.eval() or DataFrame.eval() to evaluate long arithmetic expressions efficiently without building intermediate copies.
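A small pandas sketch contrasting a row loop with vectorized operations (the columns are hypothetical):

import numpy as np
import pandas as pd

df = pd.DataFrame({"price": np.random.rand(1_000_000),
                   "qty": np.random.randint(1, 10, 1_000_000)})

# Slow: Python-level loop over rows
# totals = [row.price * row.qty for row in df.itertuples()]

# Fast: vectorized column arithmetic operates on whole arrays at once
df["total"] = df["price"] * df["qty"]

# eval() avoids materializing intermediate results for longer expressions
df["discounted"] = df.eval("price * qty * 0.9")

# Reduce memory by downcasting and dropping columns that are no longer needed
df["qty"] = df["qty"].astype("int8")
df = df.drop(columns=["discounted"])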
Generators are functions that allow you to iterate over a sequence of values without creating the entire sequence in memory. Decorators are functions that modify the behavior of other functions.
Generators use the yield keyword to return values one at a time
Generators are memory efficient and can handle large datasets
Decorators are functions that take another function as input and return a modified version of that function
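A minimal sketch of both ideas (the file path is hypothetical):

import functools
import time

def read_numbers(path):
    """Generator: yields one parsed line at a time instead of loading the whole file."""
    with open(path) as f:
        for line in f:
            yield int(line)

def timed(func):
    """Decorator: wraps a function and reports how long it took."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        print(f"{func.__name__} took {time.perf_counter() - start:.4f}s")
        return result
    return wrapper

@timed
def total(path):
    return sum(read_numbers(path))  # the generator is consumed lazily, value by value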
my_list[5] retrieves the 6th element of the list.
Indexing starts from 0 in Python.
The integer inside the square brackets is the index of the element to retrieve.
If the index is out of range, an IndexError is raised.
To collect multiple values under the same key in a Python dictionary, use defaultdict from the collections module.
Import the collections module
Create a defaultdict object
Add key-value pairs to the dictionary using the same key multiple times
Access the values using the key
Example: from collections import defaultdict; d = defaultdict(list); d['key'].append('value1'); d['key'].append('value2')
Lambda functions are anonymous functions used for short and simple operations. They are different from regular functions in their syntax and usage.
Lambda functions are defined without a name and keyword 'lambda' is used to define them.
They can take any number of arguments but can only have one expression.
They are commonly used in functional programming and as arguments to higher-order functions.
Lambda functions are often passed as arguments to built-in functions such as map(), filter(), and sorted().
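A few common lambda uses (toy list for illustration):

square = lambda x: x ** 2      # same result as a one-line def, just anonymous syntax
nums = [3, 1, 4, 1, 5]
print(sorted(nums, key=lambda x: -x))            # as a key function: [5, 4, 3, 1, 1]
print(list(map(lambda x: x * 2, nums)))          # with map(): [6, 2, 8, 2, 10]
print(list(filter(lambda x: x % 2 == 0, nums)))  # with filter(): [4]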
Merge and join are used to combine dataframes in pandas.
Merge is used to combine dataframes based on a common column or index.
Join is used to combine dataframes based on their index.
Merge can join on columns with different names via left_on/right_on, while join matches against the other DataFrame's index.
Both support inner, outer, left, and right joins; merge defaults to an inner join, while join defaults to a left join.
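A small pandas sketch of the default behaviours (toy data):

import pandas as pd

emp = pd.DataFrame({"emp_id": [1, 2], "dept_id": [10, 20]})
dept = pd.DataFrame({"dept_id": [10, 30], "dept": ["HR", "IT"]})

# merge: joins on columns, inner by default -> only dept_id 10 survives
m = emp.merge(dept, on="dept_id", how="inner")

# join: joins on the other frame's index, left by default -> dept_id 20 kept with NaN dept
j = emp.set_index("dept_id").join(dept.set_index("dept_id"), how="left")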
The resultant table will have all the columns from both tables and the rows will be a combination of matching rows.
The resultant table will have all the columns from both tables
The rows in the resultant table will be a combination of matching rows
If the second table has repeated keys, there will be multiple rows with the same key in the resultant table
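A small pandas sketch showing how a repeated key multiplies rows (toy data):

import pandas as pd

a = pd.DataFrame({"key": [1, 2], "x": ["a", "b"]})
b = pd.DataFrame({"key": [2, 2, 3], "y": ["p", "q", "r"]})

# key 2 appears twice in b, so the single matching row of a is repeated twice
print(a.merge(b, on="key", how="inner"))
#    key  x  y
# 0    2  b  p
# 1    2  b  q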
Eigenvalues and eigenvectors are linear algebra concepts used in machine learning for dimensionality reduction and feature extraction.
Eigenvalues represent the scaling factor of the eigenvectors.
Eigenvectors are the directions along which a linear transformation acts by stretching or compressing.
In machine learning, eigenvectors are used for principal component analysis (PCA) to reduce the dimensionality of data.
Eigenvalues also indicate how much variance each principal component captures in PCA.
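A small NumPy sketch of the eigenvalue/eigenvector relationship:

import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])

values, vectors = np.linalg.eig(A)  # eigenvalues and column eigenvectors
v = vectors[:, 0]
# A acts on its eigenvector as pure scaling: A @ v equals lambda * v
print(np.allclose(A @ v, values[0] * v))  # True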
PCA is a dimensionality reduction technique used to transform high-dimensional data into a lower-dimensional space.
PCA can be used for feature extraction, data visualization, and noise reduction.
PCA cannot be used for causal inference or to handle missing data.
PCA assumes linear relationships between variables and may not work well with non-linear data.
PCA can be applied to various fields such as finance and image processing.
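A minimal scikit-learn sketch with simulated data:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=200)  # a nearly redundant feature

pca = PCA(n_components=2)
X2 = pca.fit_transform(X)              # 200 x 2 projection onto the top components
print(pca.explained_variance_ratio_)   # share of variance captured by each component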
VIF stands for Variance Inflation Factor, a measure of multicollinearity in regression analysis.
VIF is calculated for each predictor variable in a regression model.
It measures how much the variance of the estimated regression coefficient is increased due to multicollinearity.
A VIF of 1 indicates no multicollinearity, while a VIF greater than 1 indicates increasing levels of multicollinearity.
VIF is calculated as 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors.
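A minimal sketch, assuming the statsmodels package (toy data with one nearly collinear column; the constant's VIF can be ignored):

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "x2": [2.0, 4.1, 6.0, 7.9, 10.0, 12.2],  # almost exactly 2 * x1, so highly collinear
    "x3": [5.0, 3.0, 6.0, 2.0, 7.0, 1.0],
})
X = sm.add_constant(df)
for i, col in enumerate(X.columns):
    print(col, variance_inflation_factor(X.values, i))  # x1 and x2 show very large VIFs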
AIC & BIC are statistical measures used to evaluate the goodness of fit of a linear regression model.
AIC stands for Akaike Information Criterion and BIC stands for Bayesian Information Criterion.
Both AIC and BIC are used to compare different models and select the best one.
AIC penalizes complex models less severely than BIC.
Lower AIC/BIC values indicate a better fit of the model to the data.
AIC and BIC can be calculated from the maximized log-likelihood: AIC = 2k - 2·ln(L) and BIC = k·ln(n) - 2·ln(L), where k is the number of parameters and n is the sample size.
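A minimal sketch comparing AIC/BIC of two models, assuming the statsmodels package (data simulated for illustration):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 3))
y = 2 * x[:, 0] + rng.normal(size=100)   # only the first feature actually matters

small_model = sm.OLS(y, sm.add_constant(x[:, :1])).fit()
full_model = sm.OLS(y, sm.add_constant(x)).fit()
print(small_model.aic, small_model.bic)  # lower values indicate a better fit/complexity trade-off
print(full_model.aic, full_model.bic)    # the extra, useless parameters are penalized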
We minimize the loss in logistic regression.
The goal of logistic regression is to minimize the loss function.
The loss function measures the difference between predicted and actual values.
The optimization algorithm tries to find the values of coefficients that minimize the loss function.
Minimizing the loss function leads to better model performance.
An example of a loss function used in logistic regression is cross-entropy (log loss).
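A small sketch of the binary cross-entropy (log loss) computation with hypothetical predictions:

import numpy as np

def log_loss(y_true, p):
    """Binary cross-entropy: -mean(y*log(p) + (1-y)*log(1-p))."""
    p = np.clip(p, 1e-15, 1 - 1e-15)  # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y = np.array([1, 0, 1, 1])
good = np.array([0.9, 0.1, 0.8, 0.7])  # confident and correct -> small loss
bad = np.array([0.2, 0.8, 0.3, 0.4])   # mostly wrong -> large loss
print(log_loss(y, good), log_loss(y, bad))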
One vs Rest is a technique used to extend binary classification to multi-class problems in logistic regression.
It involves training multiple binary classifiers, one for each class.
In each classifier, one class is treated as the positive class and the rest as negative.
The class with the highest probability is predicted as the final output.
It is also known as one vs all or one vs others.
Example: In a 3-class problem, we train 3 binary classifiers, one per class.
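A minimal scikit-learn sketch of one-vs-rest (the iris dataset is used only for illustration):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)  # 3 classes
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print(len(ovr.estimators_))        # 3 binary classifiers, one per class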
One vs one classification is a binary classification method where multiple models are trained to classify each pair of classes.
It is used when there are more than two classes in the dataset.
It involves training multiple binary classifiers for each pair of classes.
The final prediction is made by combining the results of all the binary classifiers.
Example: In a dataset with 5 classes, 10 binary classifiers will be trained, one for each pair of classes.
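A minimal scikit-learn sketch of one-vs-one, matching the 5-class example above (dataset chosen only for illustration):

from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier

X, y = load_digits(n_class=5, return_X_y=True)  # keep only 5 digit classes
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print(len(ovo.estimators_))                      # 5 * 4 / 2 = 10 pairwise classifiers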
Estimate the number of white cars using surveys, traffic data, and image recognition techniques.
Conduct surveys: Ask residents about car colors in their neighborhoods.
Use traffic cameras: Analyze footage to count white cars during peak hours.
Leverage social media: Analyze posts or images of cars in the city.
Utilize machine learning: Train a model on images of cars to identify white ones.
Collaborate with local authorities for vehicle registration or traffic data.
I applied via Naukri.com and was interviewed in Dec 2024. There was 1 interview round.
Identify employees with the same salary within the same department using SQL and PySpark.
Use SQL's GROUP BY clause to group employees by department and salary.
Example SQL query: SELECT department, salary FROM employees GROUP BY department, salary HAVING COUNT(*) > 1;
In PySpark, use DataFrame operations to group by department and salary.
Example PySpark code: df.groupBy('department', 'salary').count().filter('count > 1')
To create a pipeline in Databricks, you can use Databricks Jobs or Apache Airflow for orchestration.
Use Databricks Jobs to create a pipeline by scheduling notebooks or Spark jobs.
Utilize Apache Airflow for more complex pipeline orchestration with dependencies and monitoring.
Leverage Databricks Delta for managing data pipelines with ACID transactions and versioning.
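A minimal orchestration sketch, assuming the apache-airflow-providers-databricks package and a configured databricks_default connection; the DAG id, cluster spec, and notebook paths are hypothetical:

from datetime import datetime
from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

cluster = {"spark_version": "13.3.x-scala2.12", "node_type_id": "i3.xlarge", "num_workers": 2}

with DAG(dag_id="daily_etl", start_date=datetime(2024, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:
    ingest = DatabricksSubmitRunOperator(
        task_id="ingest",
        databricks_conn_id="databricks_default",
        new_cluster=cluster,
        notebook_task={"notebook_path": "/Repos/etl/ingest"},     # hypothetical notebook
    )
    transform = DatabricksSubmitRunOperator(
        task_id="transform",
        databricks_conn_id="databricks_default",
        new_cluster=cluster,
        notebook_task={"notebook_path": "/Repos/etl/transform"},  # hypothetical notebook
    )
    ingest >> transform  # transform runs only after ingest succeeds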
I appeared for an interview in Sep 2024.
A coding test covered several machine learning problem statements.
ML metrics help evaluate model performance, each with trade-offs affecting accuracy, interpretability, and application.
Accuracy vs. Precision: High accuracy may come with low precision in imbalanced datasets. Example: Classifying rare diseases.
Recall vs. F1 Score: High recall may lower F1 score, impacting balance in precision and recall. Example: Fraud detection.
ROC-AUC vs. PR-AUC: ROC-AUC can look optimistic on imbalanced data, while PR-AUC focuses on performance for the positive (minority) class.
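A small scikit-learn sketch of these trade-offs on a hypothetical imbalanced dataset:

from sklearn.metrics import (accuracy_score, recall_score, f1_score,
                             roc_auc_score, average_precision_score)

# 90 negatives, 10 positives; a model that always predicts "negative"
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100
scores = [0.1] * 90 + [0.4] * 10  # hypothetical predicted probabilities

print(accuracy_score(y_true, y_pred))              # 0.9 despite missing every positive
print(recall_score(y_true, y_pred))                # 0.0 - reveals the failure
print(f1_score(y_true, y_pred, zero_division=0))   # 0.0
print(roc_auc_score(y_true, scores))               # 1.0 here, since scores separate the classes
print(average_precision_score(y_true, scores))     # PR-AUC, focused on the positive class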
XGBoost is an efficient implementation of gradient boosting that optimizes performance and accuracy through ensemble learning.
1. **Gradient Boosting Framework**: XGBoost builds models in a sequential manner, where each new model corrects errors made by the previous ones.
2. **Decision Trees**: It primarily uses decision trees as base learners, where each tree is built to minimize the loss function.
3. **Regularization**: L1 and L2 penalty terms in the objective control model complexity and help prevent overfitting.
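A minimal sketch, assuming the xgboost and scikit-learn packages (dataset and hyperparameters chosen only for illustration):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = XGBClassifier(
    n_estimators=200,   # number of boosting rounds (trees built sequentially)
    max_depth=3,        # depth of each base-learner tree
    learning_rate=0.1,  # shrinks each tree's contribution
    reg_lambda=1.0,     # L2 regularization on leaf weights
)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))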
This program reverses an array containing both integers and strings.
Use a loop to iterate through the array from the last index to the first.
Create a new array to store the reversed elements.
Example: For input ['apple', 1, 'banana', 2], output should be [2, 'banana', 1, 'apple'].
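A minimal Python sketch of the approach described above:

def reverse_array(items):
    """Builds a new list with the elements in reverse order."""
    reversed_items = []
    for i in range(len(items) - 1, -1, -1):  # walk from the last index to the first
        reversed_items.append(items[i])
    return reversed_items

print(reverse_array(['apple', 1, 'banana', 2]))  # [2, 'banana', 1, 'apple']
# Equivalent one-liner: ['apple', 1, 'banana', 2][::-1]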
This program counts the most frequently occurring integer in an array, identifying the maximum repetitive integer efficiently.
Use a dictionary to store the count of each integer in the array. For example, for the array [1, 2, 2, 3], the counts would be {1: 1, 2: 2, 3: 1}.
Iterate through the array and update the count in the dictionary for each integer encountered.
After counting, find the integer with the maximum count and return it.
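A minimal Python sketch of the dictionary-counting approach:

def most_frequent(nums):
    counts = {}
    for n in nums:                       # one pass to build the counts
        counts[n] = counts.get(n, 0) + 1
    return max(counts, key=counts.get)   # key with the highest count

print(most_frequent([1, 2, 2, 3]))       # 2
# collections.Counter([1, 2, 2, 3]).most_common(1) gives the same answer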
Linear programming optimizes a linear objective function subject to linear constraints, finding the best outcome in a feasible region.
Linear programming involves maximizing or minimizing a linear objective function.
Constraints are linear inequalities that define the feasible region.
The feasible region is a convex polytope (a convex polygon in two dimensions).
The Simplex method is a popular algorithm used to solve linear programming problems.
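A minimal sketch, assuming SciPy's linprog with the HiGHS solver, on a toy two-variable problem:

from scipy.optimize import linprog

# Maximize 3x + 2y subject to x + y <= 4, x <= 3, x >= 0, y >= 0.
# linprog minimizes, so the objective is negated.
result = linprog(
    c=[-3, -2],
    A_ub=[[1, 1], [1, 0]],
    b_ub=[4, 3],
    bounds=[(0, None), (0, None)],
    method="highs",
)
print(result.x, -result.fun)  # optimal vertex [3, 1] and maximized objective 11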
I appeared for an interview in Apr 2025, where I was asked the following questions.
Developed ETL pipeline to ingest, clean, and analyze customer data for personalized marketing campaigns
Gathered requirements from stakeholders to understand data sources and business objectives
Designed data model to store customer information and campaign performance metrics
Implemented ETL process using Python and Apache Spark to extract, transform, and load data
Performed data quality checks and created visualizations ...
I have used various transformations such as filtering, joining, aggregating, and pivoting in my data engineering projects.
Filtering data based on certain conditions
Joining multiple datasets together
Aggregating data to summarize information
Pivoting data from rows to columns or vice versa
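A small PySpark sketch of the four transformations (toy data, hypothetical column names):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("transform_demo").getOrCreate()

sales = spark.createDataFrame(
    [("2024-01", "A", 100), ("2024-01", "B", 80), ("2024-02", "A", 120)],
    ["month", "product", "amount"],
)
products = spark.createDataFrame([("A", "Widget"), ("B", "Gadget")], ["product", "name"])

filtered = sales.filter(F.col("amount") > 90)                            # filtering
joined = filtered.join(products, on="product", how="left")               # joining
summed = joined.groupBy("product").agg(F.sum("amount").alias("total"))   # aggregating
pivoted = sales.groupBy("product").pivot("month").sum("amount")          # pivoting rows to columns
pivoted.show()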
I applied via Naukri.com and was interviewed in May 2024. There was 1 interview round.
Use conditional formatting to highlight cells containing odd values in Excel
Select the range of cells you want to highlight
Go to the 'Home' tab and click on 'Conditional Formatting'
Choose 'New Rule' and select 'Use a formula to determine which cells to format'
Enter the formula '=MOD(A1,2)=1' (assuming A1 is the top-left cell of your selected range)
Choose the formatting style you want for the odd cells
I have worked on various types of data sets including sales data, customer data, financial data, and social media data.
Sales data
Customer data
Financial data
Social media data
Calculate delta sales growth using DAX formula
Use the following DAX formula: Delta Sales Growth = (SUM(Sales[SalesAmount]) - CALCULATE(SUM(Sales[SalesAmount]), PREVIOUSMONTH('Date'[DateKey]))) / CALCULATE(SUM(Sales[SalesAmount]), PREVIOUSMONTH('Date'[DateKey]))
Make sure to replace 'Sales[SalesAmount]' with the actual column name in your dataset
Ensure that 'Date'[DateKey] is the date column in your dataset
I applied via Referral and was interviewed in Feb 2024. There were 3 interview rounds.
The first round was a combination of MCQs and SQL Coding test. It consisted of 23 MCQs on SQL, 10 MCQs on Power BI and 5 SQL Coding questions.
Data Validation in Excel ensures that data entered in a cell meets certain criteria or conditions.
Data Validation allows you to set rules for what can be entered in a cell, such as a range of values, a list of items, or a custom formula.
Examples of Data Validation include setting a drop-down list of options for a cell, restricting input to a certain number range, or ensuring dates are entered in a specific format.
Data Validation can also show an input message or an error alert when invalid data is entered.
The order of execution of an SQL query involves multiple steps to retrieve data from a database.
1. Parsing: The SQL query is first parsed to check for syntax errors.
2. Optimization: The query optimizer creates an execution plan to determine the most efficient way to retrieve data.
3. Compilation: The optimized query is compiled into an executable form.
4. Execution: The compiled plan is executed by the database engine to retrieve and return the results.
Logically, the clauses of a SELECT statement are evaluated roughly in the order FROM, WHERE, GROUP BY, HAVING, SELECT, ORDER BY, LIMIT.
Tree Map visualizes hierarchical data using nested rectangles, while Heatmap displays data values using color gradients.
Tree Map displays data hierarchically with nested rectangles, where the size and color represent different measures.
Heatmap visualizes data values using color gradients, with darker colors indicating higher values.
Tree Map is useful for showing hierarchical data structures, while Heatmap is effective for comparing values across a matrix of categories.
Extract Data saves a snapshot of data in Tableau workbook, while Live Connection directly connects to data source.
Extract Data creates a static copy of data in Tableau workbook, while Live Connection directly queries data source in real-time.
Extract Data is useful for working offline or with small datasets, while Live Connection is ideal for large datasets or when data is frequently updated.
Extract Data can improve performance because queries run against the saved snapshot rather than the live source.
Dual mode in Power BI allows users to switch between DirectQuery and Import modes for data sources.
Dual mode allows users to combine the benefits of both DirectQuery and Import modes in Power BI.
Users can switch between DirectQuery and Import modes for different data sources within the same report.
DirectQuery mode connects directly to the data source for real-time data retrieval, while Import mode loads data into Power BI's in-memory model for faster report performance.
I applied via Company Website and was interviewed in Mar 2024. There were 3 interview rounds.
Self-join is a SQL query that joins a table to itself.
Self-join is used when a table needs to be joined with itself to compare rows within the same table.
It is achieved by using table aliases to differentiate between the two instances of the same table.
Commonly used in hierarchical data structures or when comparing related records within the same table.
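A small sketch of a self-join using Python's built-in sqlite3 (the employees/manager table is hypothetical):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER, name TEXT, manager_id INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, "Asha", None), (2, "Ravi", 1), (3, "Meena", 1)],
)
# Join the table to itself: alias e is the employee row, alias m is the manager row
rows = conn.execute(
    """
    SELECT e.name AS employee, m.name AS manager
    FROM employees e
    LEFT JOIN employees m ON e.manager_id = m.id
    """
).fetchall()
print(rows)  # [('Asha', None), ('Ravi', 'Asha'), ('Meena', 'Asha')]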
A stored procedure is a set of SQL statements that are stored in a database and can be called by other programs or scripts.
Stored procedures can improve performance by reducing network traffic and executing complex operations on the database server.
They can be used to encapsulate business logic and enforce security measures.
Example: CREATE PROCEDURE GetCustomerOrders @CustomerID INT AS SELECT * FROM Orders WHERE CustomerID = @CustomerID;
DROP deletes the table structure and data, while TRUNCATE deletes only the data.
DROP statement removes the table from the database, including all data and structure.
TRUNCATE statement removes all data from the table, but keeps the table structure intact.
Both DROP and TRUNCATE are DDL (Data Definition Language) commands; DELETE is the DML (Data Manipulation Language) statement used to remove rows.
Based on 41 interview experiences, the Affine interview process typically takes less than 2 weeks to complete.
| Designation | Salaries reported | Salary range |
| Senior Associate | 139 | ₹10 L/yr - ₹18.3 L/yr |
| Business Analyst | 104 | ₹5.8 L/yr - ₹12.6 L/yr |
| Consultant | 91 | ₹12 L/yr - ₹31.4 L/yr |
| Senior Business Analyst | 87 | ₹8 L/yr - ₹18 L/yr |
| Data Engineer | 47 | ₹6 L/yr - ₹15 L/yr |