Data Scientist

800+ Data Scientist Interview Questions and Answers

Updated 28 Feb 2025

Q51. What type of statistics were used in your earlier organization to analyse and build models?

Ans.

The organization used descriptive and inferential statistics to analyze and build models.

  • Descriptive statistics were used to summarize and describe the data, such as mean, median, and standard deviation.

  • Inferential statistics were used to make predictions and draw conclusions about the population based on the sample data, such as hypothesis testing and regression analysis.

  • The organization may have also used time series analysis, clustering, and classification models.

  • Examples are sketched below.
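
A minimal sketch of both flavours in Python; the sales figures and the hypothesised mean of 100 are invented for illustration:

    import pandas as pd
    from scipy import stats

    # Descriptive statistics: summarize the sample
    sales = pd.Series([120, 95, 130, 110, 250, 105, 115])
    print(sales.mean(), sales.median(), sales.std())

    # Inferential statistics: test whether the true mean differs from 100
    t_stat, p_value = stats.ttest_1samp(sales, popmean=100)
    print(p_value)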

Q52. Can you write code to identify prime numbers between two numbers?

Ans.

Code to identify prime numbers between two given numbers.

  • Create a function that takes two numbers as input.

  • Loop through the range of numbers between the two inputs.

  • Check if each number is divisible by any number other than 1 and itself.

  • If not, add it to a list of prime numbers.

  • Return the list of prime numbers.
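
A straightforward trial-division sketch of the steps above:

    def primes_between(lo, hi):
        """Return all prime numbers in the inclusive range [lo, hi]."""
        primes = []
        for n in range(max(lo, 2), hi + 1):
            # A composite number must have a divisor no larger than sqrt(n)
            if all(n % d != 0 for d in range(2, int(n ** 0.5) + 1)):
                primes.append(n)
        return primes

    print(primes_between(10, 30))  # [11, 13, 17, 19, 23, 29]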

Q53. How do you perform manipulations faster in pandas?

Ans.

Use vectorized operations, avoid loops, and optimize memory usage.

  • Prefer truly vectorized column operations; apply(), map(), and applymap() are row-wise conveniences, though still faster than explicit Python loops.

  • Avoid using iterrows() and itertuples() as they are slower than vectorized operations.

  • Optimize memory usage by using appropriate data types and dropping unnecessary columns.

  • Use inplace=True parameter to modify the DataFrame in place instead of creating a copy.

  • Use the pd.eval() function to perform arithmetic operations on large DataFrames efficiently.
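
A short sketch contrasting a row-wise loop with vectorized arithmetic; the column names and sizes are arbitrary:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"a": np.random.rand(1_000_000),
                       "b": np.random.rand(1_000_000)})

    # Slow: an explicit Python loop over rows
    # total = sum(row.a * row.b for row in df.itertuples())

    # Fast: vectorized column arithmetic runs in compiled code
    df["c"] = df["a"] * df["b"]

    # Save memory with a narrower dtype when precision allows
    df["c"] = df["c"].astype("float32")

    # pd.eval() can evaluate large arithmetic expressions efficiently
    df["d"] = pd.eval("df.a + df.b")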

Q54. Is there any correlation between algorithms and law?

Ans.

Algorithms and law can be correlated through the use of algorithms in legal processes and decision-making.

  • Algorithms can be used in legal research to analyze large amounts of data and identify patterns or trends.

  • Predictive algorithms can be used in legal cases to assess the likelihood of success or failure.

  • Algorithmic tools can help in legal document review and contract analysis.

  • However, there are concerns about bias in algorithms used in law, as they can reflect and perpetuate existing biases.

Q55. Make two lists, a=[1,2,3,4] and b=[9,8,5,5,2,3,3,4,1,1,10,9,2,3,4,10,10,9,7,7,8]. Write a program to remove duplicates from b, keep only those elements of b which are not present in a, and return the final list sorted in ascending order.

Ans.

Remove duplicates from list b, keep elements not in list a, and sort in ascending order.

  • Create a set from list b to remove duplicates

  • Use list comprehension to keep elements not in list a

  • Sort the final list in ascending order
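
A direct translation of these steps into Python:

    a = [1, 2, 3, 4]
    b = [9, 8, 5, 5, 2, 3, 3, 4, 1, 1, 10, 9, 2, 3, 4, 10, 10, 9, 7, 7, 8]

    a_set = set(a)                 # fast membership tests
    # set(b) removes duplicates; the filter drops elements present in a
    result = sorted(x for x in set(b) if x not in a_set)
    print(result)                  # [5, 7, 8, 9, 10]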

Q56. XgBoost algorithm has 10-20 features. How are the splits decided, on which feature are they going to be divided?

Ans.

XgBoost decides splits greedily, choosing the feature and threshold that give the highest gain.

  • For each candidate split, XgBoost computes the gain of its regularized objective from gradient and hessian statistics of the loss, rather than classic entropy-based information gain.

  • The feature and split point with the highest gain are chosen for the node.

  • This process is repeated recursively for each node in the tree.

  • Features can be split based on numerical values or categories.

  • Example: If a feature like 'age' yields the highest gain, the data will be split based on different age thresholds.
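
A hedged sketch of inspecting split gains, assuming the xgboost package is installed; the data here is synthetic:

    import numpy as np
    from xgboost import XGBClassifier

    X = np.random.rand(500, 15)               # 15 synthetic features
    y = (X[:, 0] + X[:, 3] > 1).astype(int)   # only two features matter

    model = XGBClassifier(n_estimators=50, max_depth=3)
    model.fit(X, y)

    # Average gain contributed by each feature when it was chosen for a split
    print(model.get_booster().get_score(importance_type="gain"))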

Share interview questions and help millions of jobseekers 🌟

man-with-laptop

Q57. How are LSTMs better than vanilla RNNs, and what makes them better at what they do?

Ans.

LSTMs are better than RNNs due to their ability to handle long-term dependencies.

  • LSTMs have a memory cell that can store information for long periods of time.

  • They have gates that control the flow of information into and out of the cell.

  • This allows them to selectively remember or forget information.

  • Vanilla RNNs suffer from the vanishing gradient problem, which limits their ability to handle long-term dependencies.

  • LSTMs can be used in applications such as speech recognition, language modelling, and machine translation.

Q58. Explain PCA briefly. What can it be used for, and what can it not be used for?

Ans.

PCA is a dimensionality reduction technique used to transform high-dimensional data into a lower-dimensional space.

  • PCA can be used for feature extraction, data visualization, and noise reduction.

  • PCA cannot be used for causal inference or to handle missing data.

  • PCA assumes linear relationships between variables and may not work well with non-linear data.

  • PCA can be applied to various fields such as finance, image processing, and genetics.
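
A minimal scikit-learn sketch on synthetic data:

    import numpy as np
    from sklearn.decomposition import PCA

    X = np.random.rand(200, 10)    # high-dimensional synthetic data

    pca = PCA(n_components=2)      # keep the two strongest directions
    X_2d = pca.fit_transform(X)

    # Fraction of the total variance each component explains
    print(pca.explained_variance_ratio_)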

Q59. What is the BLUE score in regression?

Ans.

"Blue score" is not a standard metric in regression analysis.

  • In regression, BLUE usually refers to the Best Linear Unbiased Estimator: under the Gauss-Markov assumptions, the OLS coefficient estimates are BLUE.

  • The interviewer may also have meant the BLEU score from NLP, or a standard regression metric such as R-squared or mean squared error.

  • Without further context, it is worth asking the interviewer to clarify which one they mean.

Q60. How do you check multicollinearity in logistic regression?

Ans.

Multicollinearity in logistic regression can be checked using correlation matrix and variance inflation factor (VIF).

  • Calculate the correlation matrix of the independent variables and check for high correlation coefficients.

  • Calculate the VIF for each independent variable and check for values greater than 5 or 10.

  • Consider removing one of the highly correlated variables or variables with high VIF to address multicollinearity.

  • Example: If variables A and B have a correlation coefficient close to 1, they are collinear and one of them should be dropped.
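
A sketch of the VIF check using statsmodels; the data is synthetic, with x2 deliberately built from x1 so it shows a high VIF:

    import numpy as np
    import pandas as pd
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    rng = np.random.default_rng(0)
    x1 = rng.normal(size=200)
    X = pd.DataFrame({"x1": x1,
                      "x2": 0.9 * x1 + rng.normal(scale=0.1, size=200),
                      "x3": rng.normal(size=200)})

    # VIF above roughly 5-10 flags a collinear predictor
    for i, col in enumerate(X.columns):
        print(col, variance_inflation_factor(X.values, i))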

Q61. What are the details of your research topics, including aspects such as scalability and the reasoning behind choosing specific models?

Ans.

My research topics focus on developing scalable machine learning models for predictive analytics in finance.

  • I have researched and implemented various machine learning algorithms such as random forests, gradient boosting, and neural networks.

  • I have explored techniques for feature engineering and model optimization to improve scalability and performance.

  • I have chosen specific models based on their ability to handle large datasets and complex relationships within financial data.

Q62. How would you figure out how many WhatsApp users there are in the world?

Ans.

Estimating the number of WhatsApp users worldwide requires a combination of data sources and statistical methods.

  • Collect data from WhatsApp's official reports and announcements

  • Use third-party analytics tools to estimate user numbers

  • Analyze demographic and geographic trends to extrapolate global user numbers

  • Consider factors such as population growth and smartphone adoption rates

  • Compare with similar messaging apps to validate estimates

Q63. 1. Describe one of your projects in detail. 2. Explain Random Forest and other ML models. 3. Statistics.

Ans.

Developed a predictive model for customer churn using Random Forest algorithm.

  • Used Python and scikit-learn library for model development

  • Performed data cleaning, feature engineering, and exploratory data analysis

  • Tuned hyperparameters using GridSearchCV and evaluated model performance using cross-validation

  • Random Forest is an ensemble learning method that builds multiple decision trees and combines their predictions

  • Other ML models include logistic regression and support vector machines.

Q64. What is Logistic Regression and what are the assumptions of linear regression?

Ans.

Logistic Regression is a statistical method used to model the probability of a binary outcome.

  • Logistic Regression is used when the dependent variable is binary (e.g., 0 or 1, Yes or No).

  • It estimates the probability that a given input belongs to a certain category.

  • Assumptions of linear regression include linearity, independence of errors, homoscedasticity, and normality of errors.

Q65. Find all the numbers that appear at least three times consecutively; return the result table in any order.

Ans.

Find numbers that appear at least three times in a row; the result table may be returned in any order.

  • Use a window function to track consecutive numbers

  • Filter the result to only include numbers that appear at least three times consecutively

  • Return the result table in any order
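
One possible solution, sketched with a self join in SQLite; the Logs schema (consecutive ids) and the sample values are assumed:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Logs (id INTEGER, num INTEGER)")
    conn.executemany("INSERT INTO Logs VALUES (?, ?)",
                     [(1, 1), (2, 1), (3, 1), (4, 2), (5, 1), (6, 2), (7, 2)])

    # Join each row to the next two rows and keep numbers that repeat
    rows = conn.execute("""
        SELECT DISTINCT l1.num AS ConsecutiveNums
        FROM Logs l1
        JOIN Logs l2 ON l2.id = l1.id + 1 AND l2.num = l1.num
        JOIN Logs l3 ON l3.id = l2.id + 1 AND l3.num = l2.num
    """).fetchall()
    print(rows)  # [(1,)]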

Q66. Given a string s and an integer k, return the maximum number of vowel letters in any substring of s with length k. Vowel letters in English are 'a', 'e', 'i', 'o', 'u'.

Ans.

Find the maximum number of vowels in any substring of length k in a given string.

  • Iterate through the string with a sliding window of size k, counting vowels in each substring.

  • Keep track of the maximum vowel count encountered.

  • Return the maximum vowel count found.
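
A sliding-window sketch of the approach:

    def max_vowels(s: str, k: int) -> int:
        vowels = set("aeiou")
        # Count vowels in the first window of length k
        count = sum(ch in vowels for ch in s[:k])
        best = count
        # Slide the window: add the entering character, drop the leaving one
        for i in range(k, len(s)):
            count += (s[i] in vowels) - (s[i - k] in vowels)
            best = max(best, count)
        return best

    print(max_vowels("abciiidef", 3))  # 3, from the substring "iii"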

Q67. What is the difference between GROUP BY and window functions in SQL?

Ans.

Group by is used to group data based on a column while window function is used to perform calculations on a specific window of data.

  • Group by is used to aggregate data based on a specific column

  • Window function is used to perform calculations on a specific window of data

  • Group by is used with aggregate functions like sum, count, avg, etc.

  • Window function is used with analytical functions like rank, lead, lag, etc.

  • Group by collapses rows into one row per group, while window functions keep every row and attach the computed value to each.

Q68. How do you test for a trend break in a time series?

Ans.

To test for a trend break in a time series, statistical tests such as the Augmented Dickey-Fuller (ADF) test and change point detection can be used.

  • Augmented Dickey-Fuller test can be used to check if a time series is stationary or not.

  • If the time series is not stationary, we can use differencing to make it stationary.

  • After differencing, we can again perform the Augmented Dickey-Fuller test to check for stationarity.

  • If there is a significant change in the mean or variance of the time series, we can use change point detection methods (or a structural break test such as the Chow test) to locate the break.
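
A minimal statsmodels sketch on a synthetic random walk:

    import numpy as np
    from statsmodels.tsa.stattools import adfuller

    rng = np.random.default_rng(1)
    series = np.cumsum(rng.normal(size=300))   # random walk: non-stationary

    stat, p_value, *rest = adfuller(series)
    print(p_value)        # large: cannot reject a unit root

    stat, p_value, *rest = adfuller(np.diff(series))
    print(p_value)        # small: the differenced series is stationary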

Q69. How is object detection done using CNN?

Ans.

Object detection using CNN involves training a neural network to identify and locate objects within an image.

  • CNNs use convolutional layers to extract features from images

  • These features are then passed through fully connected layers to classify and locate objects

  • Common architectures for object detection include YOLO, SSD, and Faster R-CNN

Q70. What are joins? If we have two tables and we perform an inner join, does the result contain NULL or blank (non-matching) values?

Ans.

Inner join combines rows from two tables based on a related column between them.

  • Inner join returns only the rows where there is a match between the columns in both tables

  • Rows with NULL in the join column are excluded, because NULL never compares equal to anything in SQL.

  • Blank values or non-matching values will not be included in the inner join result

Q71. Machine Learning: 2 dependent and 6 independent variables are available; which algorithm should we use?

Ans.

Use a regression algorithm like linear regression or decision tree regression.

  • Consider using linear regression if the relationship between variables is linear.

  • Decision tree regression can handle non-linear relationships between variables.

  • Evaluate the performance of different algorithms using cross-validation.

  • Consider the interpretability of the model when choosing an algorithm.

Q72. Can we use a confusion matrix in Linear Regression?

Ans.

No, confusion matrix is not used in Linear Regression.

  • Confusion matrix is used to evaluate classification models.

  • Linear Regression is a regression model, not a classification model.

  • Evaluation metrics for Linear Regression include R-squared, Mean Squared Error, etc.

Q73. Do we minimize or maximize the loss in logistic regression?

Ans.

We minimize the loss in logistic regression.

  • The goal of logistic regression is to minimize the loss function.

  • The loss function measures the difference between predicted and actual values.

  • The optimization algorithm tries to find the values of coefficients that minimize the loss function.

  • Minimizing the loss function leads to better model performance.

  • The loss function minimized in logistic regression is the binary cross-entropy, also known as log loss.
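
A small hand-written sketch of the cross-entropy loss being minimized:

    import numpy as np

    def log_loss(y_true, y_pred):
        """Binary cross-entropy: lower is better."""
        eps = 1e-15
        y_pred = np.clip(y_pred, eps, 1 - eps)   # avoid log(0)
        return -np.mean(y_true * np.log(y_pred)
                        + (1 - y_true) * np.log(1 - y_pred))

    y_true = np.array([1, 0, 1, 1])
    print(log_loss(y_true, np.array([0.9, 0.1, 0.8, 0.7])))  # small loss
    print(log_loss(y_true, np.array([0.1, 0.9, 0.2, 0.3])))  # large loss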

Q74. How do you do time series classification?

Ans.

Time series classification involves using machine learning algorithms to classify time series data based on patterns and trends.

  • Preprocess the time series data by removing noise and outliers

  • Extract features from the time series data using techniques such as Fourier transforms or wavelet transforms

  • Train a machine learning algorithm such as a decision tree or neural network on the extracted features

  • Evaluate the performance of the algorithm using metrics such as accuracy or F1 score.

Q75. What are p-values? Explain them in plain English without bringing up machine learning.

Ans.

A p-value is the probability of obtaining a result at least as extreme as the one observed, assuming the null hypothesis is true.

  • P-values range from 0 to 1, with a smaller value indicating stronger evidence against the null hypothesis.

  • A p-value of 0.05 or less is typically considered statistically significant.

  • P-values are commonly used in hypothesis testing to determine if a result is statistically significant or not.
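
A quick illustration with scipy's one-sample t-test; the sample values and the baseline of 100 are invented:

    from scipy import stats

    # Daily sales after a promotion: is the true mean really above 100?
    sales = [104, 98, 110, 112, 95, 107, 103, 109]

    t_stat, p_value = stats.ttest_1samp(sales, popmean=100)
    print(p_value)   # small (e.g. < 0.05) suggests the jump is unlikely to be chance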

Q76. What is a logarithm (in linear algebra)? What is its significance and what purpose does it serve?

Ans.

A logarithm is the inverse of exponentiation: log base b of x is the power to which b must be raised to produce x.

  • Logarithms are used to simplify complex calculations involving large numbers.

  • They are used in linear algebra to transform multiplicative relationships into additive ones.

  • Logarithms are also used in data analysis to transform skewed data into a more normal distribution.

  • Common logarithms use base 10, while natural logarithms use base e (approximately 2.718).
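
A short illustration of the multiplication-to-addition property:

    import math

    a, b = 1000, 100
    # log(a * b) == log(a) + log(b)
    print(math.log10(a * b), math.log10(a) + math.log10(b))   # 5.0 5.0

    # The natural log uses base e
    print(math.log(math.e ** 3))                              # 3.0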

Q77. Given a weather table, write a SQL query to find all dates' ids with a higher temperature compared to the previous date.

Ans.

SQL query to find the ids of dates with a higher temperature than the previous date.

  • Use a self join to pair each date with the previous calendar date.

  • Join on the date column itself (not on row order), so the comparison is always against the actual previous day.

  • Select the ids where the temperature is higher than on the previous date.
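
One way to write the query, sketched in SQLite with the commonly assumed Weather schema:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Weather (id INTEGER, recordDate TEXT, temperature INTEGER)")
    conn.executemany("INSERT INTO Weather VALUES (?, ?, ?)",
                     [(1, "2015-01-01", 10), (2, "2015-01-02", 25),
                      (3, "2015-01-03", 20), (4, "2015-01-04", 30)])

    # Self join each day to the previous calendar day
    rows = conn.execute("""
        SELECT w1.id
        FROM Weather w1
        JOIN Weather w2 ON date(w1.recordDate) = date(w2.recordDate, '+1 day')
        WHERE w1.temperature > w2.temperature
    """).fetchall()
    print(rows)  # [(2,), (4,)]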

Q78. How familiar are you with Python, SQL, and ML models?

Ans.

I am very familiar with Python, SQL, and ML models.

  • I have extensive experience using Python for data analysis and machine learning tasks.

  • I am proficient in writing SQL queries to extract data from databases.

  • I have worked with a variety of ML models, including regression, classification, and clustering.

  • I am familiar with popular ML libraries such as scikit-learn, TensorFlow, and Keras.

  • I have experience with data preprocessing, feature engineering, and model evaluation.

Q79. What approach did you use, and why?

Ans.

I used a combination of supervised and unsupervised learning approaches to analyze the data.

  • I used supervised learning to train models for classification and regression tasks.

  • I used unsupervised learning to identify patterns and relationships in the data.

  • I also used feature engineering to extract relevant features from the data.

  • I chose this approach because it allowed me to gain insights from the data and make predictions based on it.

Q80. What are the disadvantages of logistic regression?
Ans.

Disadvantages of logistic regression

  • Assumes linearity between independent variables and log odds of the dependent variable

  • Prone to overfitting with large number of features

  • Not suitable for complex relationships or non-linear data

  • Can't handle missing values well

Q81. What are overfitting and underfitting, and what are the related solutions?

Ans.

Overfitting and underfitting are common problems in machine learning where the model either learns the noise in the training data or fails to capture the underlying patterns.

  • Overfitting occurs when a model learns the training data too well, including the noise and outliers, leading to poor generalization on new data.

  • Underfitting happens when a model is too simple to capture the underlying patterns in the data, resulting in high bias and low variance.

  • Solutions to overfitting include regularization, cross-validation, early stopping, dropout, and more training data; underfitting is addressed with more complex models or better features.

Q82. How do you increase the receptive field without increasing computational cost?

Ans.

Use dilated convolutions to increase receptive field without increasing computational cost.

  • Utilize dilated convolutions to expand the receptive field without adding extra parameters or computations.

  • Increase the dilation rate in convolutional layers to capture information from a larger area without increasing the number of parameters.

  • Dilated convolutions allow for larger receptive fields while maintaining the same computational cost.

  • Example: In image segmentation tasks, dilated convolutions let the network see wider context without extra parameters.
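
A PyTorch sketch showing that dilation widens the receptive field at an identical parameter count; the layer sizes are arbitrary:

    import torch
    import torch.nn as nn

    x = torch.randn(1, 8, 64, 64)   # batch, channels, height, width

    # Standard 3x3 convolution: receptive field 3
    conv = nn.Conv2d(8, 8, kernel_size=3, padding=1)

    # Dilated 3x3 convolution: the same 9 weights spread out,
    # giving an effective receptive field of 5 at the same cost
    dilated = nn.Conv2d(8, 8, kernel_size=3, dilation=2, padding=2)

    print(conv(x).shape, dilated(x).shape)   # both torch.Size([1, 8, 64, 64])
    print(sum(p.numel() for p in conv.parameters()),
          sum(p.numel() for p in dilated.parameters()))   # identical counts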

Q83. Introduce yourself. What is the difference between NumPy arrays and lists? Explain Gradient Boosting. Write a Python program to find the count of letters in a string. Explain your capstone project. Why did you choose data science?

Ans.

Data Scientist interview questions

  • Introduced myself and my background

  • Explained the difference between numpy and list

  • Described Gradient Boosting and its applications

  • Wrote a Python program to count letters in a string

  • Explained my capstone project and its significance

  • Discussed why I chose data science as a career
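
A possible version of the letter-counting program mentioned above:

    from collections import Counter

    def letter_counts(s: str) -> Counter:
        """Count each letter, ignoring case and non-letter characters."""
        return Counter(ch for ch in s.lower() if ch.isalpha())

    print(letter_counts("Data Science"))
    # Counter({'a': 2, 'c': 2, 'e': 2, 'd': 1, 't': 1, 's': 1, 'i': 1, 'n': 1})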

Q84. In cellular network technology, how are network errors resolved?

Ans.

Network errors in cellular technology are resolved through error detection, correction codes, retransmission, and handover techniques.

  • Error detection techniques like CRC (Cyclic Redundancy Check) are used to identify errors in data transmission.

  • Error correction codes like Reed-Solomon codes are employed to correct errors in the received data.

  • Retransmission of data packets is done when errors are detected, ensuring accurate delivery.

  • Handover techniques allow seamless transfer of a connection between cells as the user moves, preventing drops.

Q85. How does the decision tree algorithm work? What is cross entropy?

Ans.

Decision tree algorithm is a tree-like model used for classification and regression. Cross entropy is a measure of the difference between two probability distributions.

  • Decision tree algorithm recursively splits the data into subsets based on the most significant attribute until a stopping criterion is met.

  • It is a popular algorithm for both classification and regression tasks.

  • Cross entropy is used as a loss function in machine learning to measure the difference between predicted probabilities and the true labels.

Q86. What is Machine Learning? What is the difference among AI, ML, and DL?

Ans.

Machine Learning is a subset of Artificial Intelligence that uses algorithms to learn from data and make predictions.

  • AI (Artificial Intelligence) is the broader concept of machines being able to carry out tasks in a way that we would consider 'smart'.

  • ML (Machine Learning) is a subset of AI that focuses on the development of computer programs that can access data and use it to learn for themselves.

  • DL (Deep Learning) is a subset of ML that uses neural networks with many layers to learn hierarchical representations from data.

Q87. Difference between Ridge and LASSO and their geometric interpretation.

Ans.

Ridge and LASSO are regularization techniques used in linear regression to prevent overfitting.

  • Ridge adds a penalty term to the sum of squared errors, which shrinks the coefficients towards zero but doesn't set them exactly to zero.

  • LASSO adds a penalty term to the absolute value of the coefficients, which can set some of them exactly to zero.

  • The geometric interpretation of Ridge is that it adds a circular (L2) constraint on the size of the coefficients, which shrinks them towards the origin; LASSO's diamond-shaped (L1) constraint has corners on the axes, which is why some coefficients become exactly zero.
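
A small scikit-learn sketch that makes the contrast visible; the data is synthetic, with only the first feature truly relevant:

    import numpy as np
    from sklearn.linear_model import Ridge, Lasso

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))
    y = 3 * X[:, 0] + rng.normal(scale=0.1, size=100)

    ridge = Ridge(alpha=1.0).fit(X, y)
    lasso = Lasso(alpha=0.1).fit(X, y)

    print(ridge.coef_)   # all coefficients shrunk, none exactly zero
    print(lasso.coef_)   # irrelevant coefficients driven exactly to zero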

Q88. How does backpropagation in neural networks work?

Ans.

Backpropagation is a supervised learning algorithm used to train neural networks by adjusting weights to minimize error.

  • It involves propagating the error backwards through the network to adjust the weights of the connections between neurons.

  • The algorithm uses the chain rule of calculus to calculate the gradient of the error with respect to each weight.

  • The weights are then updated using a learning rate and the calculated gradient.

  • This process is repeated for multiple iterations (epochs) until the loss converges.
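
A compact NumPy sketch: one hidden layer trained by hand-written backpropagation on toy data:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(64, 2))
    y = (X[:, :1] + X[:, 1:] > 0).astype(float)     # toy labels, shape (64, 1)

    W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)   # hidden layer
    W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)   # output layer
    lr = 0.1

    for _ in range(500):
        # Forward pass
        h = np.tanh(X @ W1 + b1)
        p = 1 / (1 + np.exp(-(h @ W2 + b2)))        # sigmoid output

        # Backward pass: chain rule, layer by layer
        dz2 = (p - y) / len(X)                      # sigmoid + cross-entropy
        dW2, db2 = h.T @ dz2, dz2.sum(0)
        dz1 = (dz2 @ W2.T) * (1 - h ** 2)           # tanh derivative
        dW1, db1 = X.T @ dz1, dz1.sum(0)

        # Gradient descent update with a fixed learning rate
        W1 -= lr * dW1; b1 -= lr * db1
        W2 -= lr * dW2; b2 -= lr * db2

    print(((p > 0.5) == y).mean())   # training accuracy after learning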

Q89. DSA question - Get the longest common prefix string from a list of strings

Ans.

Find the longest common prefix string from a list of strings.

  • Iterate through the characters of the first string and compare with corresponding characters of other strings

  • Stop when a mismatch is found or when reaching the end of any string

  • Return the prefix found so far
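
A direct implementation of the shrinking-prefix idea:

    def longest_common_prefix(strs):
        if not strs:
            return ""
        prefix = strs[0]
        for s in strs[1:]:
            # Shrink the prefix until it matches the start of s
            while not s.startswith(prefix):
                prefix = prefix[:-1]
                if not prefix:
                    return ""
        return prefix

    print(longest_common_prefix(["flower", "flow", "flight"]))  # "fl"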

Q90. What is PCA and where and how it is used?

Ans.

PCA stands for Principal Component Analysis. It is a statistical technique used for dimensionality reduction.

  • PCA is used to reduce the number of variables in a dataset while retaining the maximum amount of information.

  • It is commonly used in data preprocessing and exploratory data analysis.

  • PCA is also used in image processing, speech recognition, and finance.

  • It works by transforming the original variables into a new set of uncorrelated variables called principal components.

Q91. What are the best practices for handling large data sets?

Ans.

Best practices for handling large data sets include data preprocessing, using distributed computing frameworks, and optimizing storage and retrieval methods.

  • Perform data preprocessing to clean and transform data before analysis.

  • Utilize distributed computing frameworks like Hadoop or Spark for parallel processing.

  • Optimize storage and retrieval methods by using efficient data structures and indexing.

  • Consider using cloud services for scalable storage and processing capabilities.

Q92. What is the purpose of a confusion matrix in data science?

Ans.

A confusion matrix is a table that is used to describe the performance of a classification model.

  • It shows the number of true positives, true negatives, false positives, and false negatives.

  • It helps in evaluating the performance of a machine learning model by providing insights into the model's accuracy, precision, recall, and F1 score.

  • It is particularly useful in scenarios where class imbalance exists or when different misclassification costs are involved.

  • Example: In a binary classification task, the matrix has four cells: true positives, false positives, false negatives, and true negatives.
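
A minimal scikit-learn sketch; the labels are invented for illustration:

    from sklearn.metrics import confusion_matrix, classification_report

    y_true = [1, 0, 1, 1, 0, 1, 0, 0]
    y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

    # Rows are actual classes, columns are predicted classes
    print(confusion_matrix(y_true, y_pred))
    # [[3 1]
    #  [1 3]]

    # Precision, recall, and F1 derived from the same counts
    print(classification_report(y_true, y_pred))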

Q93. Again, why a low CPI? Just to test your mettle.

Ans.

Low CPI may indicate inefficiencies in marketing strategies or poor product quality.

  • Low CPI can be caused by ineffective targeting of ads to the right audience.

  • It can also be due to poor product quality or lack of demand for the product.

  • Low CPI may require a reassessment of marketing strategies and product offerings.

  • Examples of low CPI include unsuccessful ad campaigns or low sales numbers.

  • Improving CPI can lead to increased revenue and profitability for the company.

Q94. Tell me how you would tackle crude (raw) data for data analysis.

Ans.

I will start by understanding the data source and its quality, then clean and preprocess the data before performing exploratory data analysis.

  • Understand the data source and its quality

  • Clean and preprocess the data

  • Perform exploratory data analysis

  • Identify patterns and trends in the data

  • Use statistical methods to analyze the data

  • Visualize the data using graphs and charts

  • Iterate and refine the analysis as needed

Q95. What is the Central Limit Theorem?

Ans.

The Central Limit Theorem states that the distribution of the sample mean approaches a normal distribution as the sample size grows, regardless of the shape of the population distribution (provided the variance is finite).

  • The theorem applies to large sample sizes.

  • It is a fundamental concept in statistics.

  • It is used to estimate population parameters from sample statistics.

  • It is important in hypothesis testing and confidence intervals.

  • Example: If we take a large number of samples of the same size from a population, the distribution of the sample means will be approximately normal, even if the population itself is skewed.
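
A quick simulation sketch: means of samples drawn from a heavily skewed population still form a bell-shaped distribution:

    import numpy as np

    rng = np.random.default_rng(42)
    population = rng.exponential(scale=2.0, size=100_000)   # skewed, not normal

    # Means of 10,000 samples of size 50
    sample_means = [rng.choice(population, size=50).mean()
                    for _ in range(10_000)]

    # Means cluster near the population mean (2.0), spread ~ 2.0/sqrt(50)
    print(np.mean(sample_means), np.std(sample_means))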

Q96. While dynamically updating information in a dataset, what measures will you use to maintain the integrity and generalization of the data?

Ans.

To maintain data integrity and generalization, use techniques like data cleaning, normalization, and feature engineering.

  • Perform data cleaning to remove errors, duplicates, and inconsistencies.

  • Normalize data to ensure consistency and comparability.

  • Utilize feature engineering to create new features or transform existing ones for better model performance.

Q97. What are optimizers in Deep Learning Models?

Ans.

Optimizers in Deep Learning Models are algorithms used to minimize the loss function by adjusting the weights of the neural network.

  • Optimizers help in updating the weights of the neural network during training to minimize the loss function.

  • Popular optimizers include Adam, SGD, RMSprop, and Adagrad.

  • Each optimizer has its own way of updating the weights based on gradients and learning rate.

  • Choosing the right optimizer can significantly impact the training process and model performance.
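
A PyTorch sketch; swapping the optimizer is a one-line change, and the model and data here are arbitrary:

    import torch
    import torch.nn as nn

    model = nn.Linear(10, 1)
    loss_fn = nn.MSELoss()

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    # optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)

    x, y = torch.randn(32, 10), torch.randn(32, 1)
    for _ in range(100):
        optimizer.zero_grad()          # clear old gradients
        loss = loss_fn(model(x), y)
        loss.backward()                # compute gradients
        optimizer.step()               # apply the optimizer's update rule
    print(loss.item())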

Q98. What are your research experiences, and how would you approach the problem in specific use cases?

Ans.

I have conducted research in machine learning and natural language processing, and I would approach problems by first understanding the data and then applying appropriate algorithms.

  • Conducted research in machine learning and natural language processing

  • Approach problems by understanding the data first

  • Apply appropriate algorithms based on the problem

  • Utilize data visualization techniques to gain insights

Q99. 1) How does a decision tree work? 2) What parameters are used in OpenCV?

Ans.

Decision tree is a tree-like model used for classification and regression. OpenCV parameters include image processing and feature detection.

  • Decision tree is a supervised learning algorithm that recursively splits the data into subsets based on the most significant attribute.

  • It is used for both classification and regression tasks.

  • OpenCV parameters include image processing techniques like smoothing, thresholding, and morphological operations.

  • Feature detection parameters include settings for detectors such as Harris corners, SIFT, and ORB.

Q100. What are the different measures used to check the performance of a classification model?

Ans.

Measures to check performance of classification model

  • Accuracy

  • Precision

  • Recall

  • F1 Score

  • ROC Curve

  • Confusion Matrix
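
A scikit-learn sketch computing these measures on invented labels:

    from sklearn.metrics import (accuracy_score, precision_score,
                                 recall_score, f1_score, roc_auc_score)

    y_true  = [1, 0, 1, 1, 0, 1, 0, 0]
    y_pred  = [1, 0, 1, 0, 0, 1, 1, 0]                    # hard labels
    y_score = [0.9, 0.2, 0.8, 0.4, 0.1, 0.7, 0.6, 0.3]    # probabilities

    print(accuracy_score(y_true, y_pred))
    print(precision_score(y_true, y_pred))
    print(recall_score(y_true, y_pred))
    print(f1_score(y_true, y_pred))
    print(roc_auc_score(y_true, y_score))   # ROC AUC needs scores, not labels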
