Data Scientist

1000+ Data Scientist Interview Questions and Answers

Updated 7 Jul 2025

Asked in UnitedHealth

1w ago

Q. 1) Model building process of one of my previous projects 2) Random forest hyperparameters 3) ROC curve, using the ROC curve to set probability cutoffs in classification models 4) Gradient boosting techniques like...

Ans.

Data Scientist interview questions on model building, random forest, ROC curve, gradient boosting, and real estate valuation

  • For model building, I followed the CRISP-DM process and used various algorithms like logistic regression, decision trees, and random forest

  • Random forest hyperparameters include number of trees, maximum depth, minimum samples split, and minimum samples leaf

  • ROC curve is a graphical representation of the trade-off between true positive rate and false positi...read more

Asked in Citicorp

1w ago

Q. What is R-squared, and how does R-squared differ from Adjusted R-squared?

Ans.

R-squared is a statistical measure that represents the proportion of the variance in the dependent variable explained by the independent variables.

  • R-squared typically lies between 0 and 1, where 0 indicates that the independent variables do not explain any of the variance in the dependent variable, and 1 indicates that they explain all of it.

  • It is used to evaluate the goodness of fit of a regression model.

  • Adjusted R-squared takes into account the number of predictors in the model an...read more
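
As a rough illustration of the difference, the sketch below computes both measures in plain Python. The sample values and the function names are made up for this example; the adjusted formula used is the standard 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors.

```python
def r_squared(y_true, y_pred):
    """Proportion of variance in y_true explained by the predictions."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n, p):
    """Penalize R-squared for the number of predictors p given n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# Hypothetical observations and predictions from a one-predictor model.
y_true = [3, 5, 7, 9, 11]
y_pred = [2.8, 5.1, 7.2, 8.9, 11.0]

r2 = r_squared(y_true, y_pred)
adj_r2 = adjusted_r_squared(r2, n=len(y_true), p=1)
print(round(r2, 4), round(adj_r2, 4))
```

Adding a predictor that does not improve the fit leaves R-squared unchanged or slightly higher, while the adjusted value drops because p grows.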

Q. 1. Explain why decorators are used; why can't the functions simply be modified? 2. Logistic regression has "regression" in its name, so why is it a classification method and not a regression method? 3. Explain random forest like...

Ans.

This response covers decorators, logistic regression, random forests, handling nulls and outliers, and database concepts like DML and DDL.

  • Decorators: They are functions that modify the behavior of another function, allowing for reusable code enhancements without changing the original function.

  • Example of Decorators: A logging decorator can wrap a function to log its execution time without altering the function's core logic.

  • Logistic Regression: Despite its name, it predicts pro...read more
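
The logging decorator mentioned above could look roughly like the following sketch. The names `log_time` and `add` are hypothetical and chosen only for this illustration.

```python
import functools
import time


def log_time(func):
    """Wrap func so each call reports how long it took."""
    @functools.wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"{func.__name__} took {elapsed:.6f}s")
        return result
    return wrapper


@log_time
def add(a, b):
    """Core logic stays untouched; timing is added from outside."""
    return a + b


total = add(2, 3)
```

The same `@log_time` line can be placed on any other function, which is the reuse argument for decorators over editing each function body.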

Asked in Capgemini

1w ago

Q. Can you write an SQL query to find the unique values of a column from a table and get the mean value against each category?

Ans.

SQL query to find unique values and mean of a column grouped by categories.

  • Use the SELECT statement to specify the columns you want to retrieve.

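  • The DISTINCT keyword returns the unique values of a column on its own; when aggregating per category, GROUP BY already yields one row per unique value, so DISTINCT is not needed in the same query.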

  • Employ the AVG() function to calculate the mean for each category.

  • Group the results using the GROUP BY clause to categorize the data.

  • Example: SELECT category, AVG(value) FROM table_name GROUP BY category;
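
To try the query end to end, here is a small sketch that runs it against an in-memory SQLite database from Python. The table name `sales` and its rows are invented for the example.

```python
import sqlite3

# In-memory database with a hypothetical table of (category, value) rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (category TEXT, value REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("a", 10.0), ("a", 20.0), ("b", 5.0)],
)

# GROUP BY yields one row per unique category together with its mean value.
rows = conn.execute(
    "SELECT category, AVG(value) AS mean_value "
    "FROM sales GROUP BY category ORDER BY category"
).fetchall()
conn.close()

print(rows)
```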


Asked in Ericsson

2w ago
Q. What are the assumptions of a linear regression model?
Ans.

Assumptions of a linear regression model include linearity, independence, homoscedasticity, and normality.

  • Linearity: The relationship between the independent and dependent variables is linear.

  • Independence: The residuals are independent of each other.

  • Homoscedasticity: The variance of the residuals is constant across all levels of the independent variables.

  • Normality: The residuals are normally distributed.

  • No multicollinearity: The independent variables are not highly correlated...read more

Asked in Ericsson

2w ago
Q. What problems does multicollinearity cause in regression analysis?
Ans.

Multicollinearity in regression analysis causes issues like inflated standard errors, unstable coefficients, and difficulty in interpreting the importance of predictors.

  • Multicollinearity leads to inflated standard errors, making it difficult to determine the significance of predictors.

  • It causes unstable coefficients, as small changes in the data can result in large changes in the coefficients.

  • Interpreting the importance of predictors becomes challenging, as multicollinearity ...read more

Asked in Affine

2w ago

Q. How can you efficiently read large .csv files using pandas?

Ans.

Use pandas' read_csv() method with appropriate parameters to read large .csv files quickly.

  • Use the chunksize parameter to read the file in smaller chunks

  • Use the low_memory parameter to optimize memory usage

  • Use the dtype parameter to specify data types for columns

  • Use the usecols parameter to read only necessary columns

  • Use the skiprows parameter to skip unnecessary rows

  • Use the nrows parameter to read only a specific number of rows

  • Use the na_values parameter to specify values to...read more
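
A minimal sketch of combining several of these parameters is shown below. It assumes pandas is installed; the column names and the in-memory CSV are placeholders standing in for a large file on disk.

```python
import io

import pandas as pd

# Hypothetical CSV text; in practice this would be a path to a large file.
csv_text = "id,category,value,notes\n" + "\n".join(
    f"{i},{'a' if i % 2 else 'b'},{i * 1.5},unused" for i in range(10)
)


def sum_by_category(source, chunksize=4):
    """Stream the CSV in chunks and aggregate without loading it all at once."""
    totals = {}
    reader = pd.read_csv(
        source,
        usecols=["category", "value"],  # skip columns that are not needed
        dtype={"value": "float32"},     # smaller dtype than the float64 default
        chunksize=chunksize,            # returns an iterator of DataFrames
    )
    for chunk in reader:
        for cat, subtotal in chunk.groupby("category")["value"].sum().items():
            totals[cat] = totals.get(cat, 0.0) + float(subtotal)
    return totals


totals = sum_by_category(io.StringIO(csv_text))
print(totals)
```

Only one chunk is held in memory at a time, so peak memory depends on `chunksize` rather than on the file size.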

Asked in Walmart

1d ago

Q. How do you fit a time series model? State all the steps you would follow.

Ans.

Steps to fit a time series model

  • Identify the time series pattern

  • Choose a suitable model

  • Split data into training and testing sets

  • Fit the model to the training data

  • Evaluate model performance on testing data

  • Refine the model if necessary

  • Forecast future values using the model
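
The steps above can be sketched with a deliberately simple model. The example uses simple exponential smoothing written in plain Python as a stand-in for whatever model is actually chosen; the series values and the candidate `alpha` grid are made up.

```python
def ses_forecast(history, alpha):
    """One-step-ahead forecast from simple exponential smoothing."""
    level = history[0]
    for x in history[1:]:
        level = alpha * x + (1 - alpha) * level
    return level


# Hypothetical monthly series, split chronologically (no shuffling).
series = [112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118]
split = int(len(series) * 0.75)
train, test = series[:split], series[split:]


def holdout_mae(alpha):
    """Walk forward through the hold-out period and average absolute errors."""
    history = list(train)
    errors = []
    for actual in test:
        errors.append(abs(actual - ses_forecast(history, alpha)))
        history.append(actual)
    return sum(errors) / len(errors)


# Refine: keep the smoothing factor with the lowest hold-out error.
best_alpha = min([0.2, 0.5, 0.8], key=holdout_mae)

# Forecast the next value using all available data.
next_forecast = ses_forecast(series, best_alpha)
print(best_alpha, round(next_forecast, 2))
```

The key point carried over to real models is the chronological split: the hold-out period must come after the training period.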

Q. What is the size of your data and wh... Explain why decorators are employed. What prevents the modification of functions? Explain the rationale for the use of decorators. Why cannot functions be c...

Ans.

Decorators in Python enhance functions without modifying their structure, promoting code reusability and separation of concerns.

  • Function Enhancement: Decorators allow you to add functionality to existing functions, such as logging or access control, without changing their code.

  • Syntax: A decorator is applied using the '@decorator_name' syntax above the function definition, making it clear and concise.

  • Example: @app.route('/') in Flask is a decorator that maps a function to a UR...read more

Asked in Affine

3d ago

Q. How does lookup happen in a list when you do my_list[5]?

Ans.

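my_list[5] retrieves the 6th element of the list. In CPython a list is a dynamic array of references, so the lookup is a constant-time offset into that array rather than a scan.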

  • Indexing starts from 0 in Python.

  • The integer inside the square brackets is the index of the element to retrieve.

  • If the index is out of range, an IndexError is raised.

2w ago

Q. Why was this model/approach used instead of others?

Ans.

The model/approach was chosen based on its accuracy, interpretability, and scalability.

  • The chosen model/approach had the highest accuracy compared to others.

  • The chosen model/approach was more interpretable and easier to explain to stakeholders.

  • The chosen model/approach was more scalable and could handle larger datasets.

  • Other models/approaches were considered but did not meet the requirements or had limitations.

  • The chosen model/approach was also more suitable for the specific ...read more

2d ago

Q. Can you describe a specific instance in which you applied data-driven analysis to a business project and how you coordinated and persuaded team members to take specific actions to enhance the model's performanc...

Ans.

Led a data-driven project to improve customer retention using predictive modeling and team collaboration.

  • Identified key metrics for customer retention through exploratory data analysis.

  • Developed a predictive model using logistic regression to forecast churn.

  • Presented findings to the team, emphasizing the potential impact on revenue.

  • Facilitated workshops to gather feedback and refine the model based on team insights.

  • Used A/B testing to validate model predictions and adjust str...read more

Asked in Sainsburys

1w ago

Q. What is the bias-variance tradeoff? Explain p-values to a non-technical and a technical audience.

Ans.

The bias-variance tradeoff is the balance between underfitting (high bias) and overfitting (high variance). A p-value is the probability of seeing results at least as extreme as those observed if the null hypothesis were true.

  • The bias-variance tradeoff is the tradeoff between the model's ability to fit the training data and its ability to generalize to new data.

  • Overfitting occurs when the model is too complex and fits the training data too closely, resulting in poor performance on new data.

  • Underfitting occurs when the model is too simple and fails to capture the underlyi...read more

Asked in Prgx India

1w ago

Q. What are joins and what are their types?

Ans.

Joins are used in DBMS to combine rows from two or more tables based on a related column between them.

  • Types of joins include INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL JOIN.

  • INNER JOIN returns rows when there is at least one match in both tables.

  • LEFT JOIN returns all rows from the left table and the matched rows from the right table.

  • RIGHT JOIN returns all rows from the right table and the matched rows from the left table.

  • FULL JOIN returns rows when there is a match in one of ...read more
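
The difference between INNER JOIN and LEFT JOIN can be seen with a small sketch using Python's built-in sqlite3 module. The `customers` and `orders` tables are invented for the example; RIGHT and FULL joins are left out because older SQLite versions do not support them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER, name TEXT);
    CREATE TABLE orders (customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ben'), (3, 'Chen');
    INSERT INTO orders VALUES (1, 50.0), (1, 20.0), (3, 75.0);
    """
)

# INNER JOIN keeps only customers that have at least one matching order.
inner = conn.execute(
    "SELECT c.name, o.amount FROM customers c "
    "INNER JOIN orders o ON o.customer_id = c.id "
    "ORDER BY c.id, o.amount"
).fetchall()

# LEFT JOIN keeps every customer; unmatched ones get NULL (None) for amount.
left = conn.execute(
    "SELECT c.name, o.amount FROM customers c "
    "LEFT JOIN orders o ON o.customer_id = c.id "
    "ORDER BY c.id, o.amount"
).fetchall()
conn.close()

print(inner)
print(left)
```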

Asked in Nielsen

3d ago

Q. Given two lists a=[1,2,3,4] and b=[9,8,5,5,2,3,3,4,1,1,10,9,2,3,4,10,10,9,7,7,8], write a program to remove duplicates from list b, keep only the elements of b that are not present in a, and sort the final list...

Ans.

Remove duplicates from list b, keep elements not in list a, and sort in ascending order.

  • Create a set from list b to remove duplicates

  • Use list comprehension to keep elements not in list a

  • Sort the final list in ascending order
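
One possible way to write this, using the two lists from the question:

```python
a = [1, 2, 3, 4]
b = [9, 8, 5, 5, 2, 3, 3, 4, 1, 1, 10, 9, 2, 3, 4, 10, 10, 9, 7, 7, 8]

# A set for a gives constant-time membership checks.
exclude = set(a)

# set(b) drops duplicates, the comprehension drops values present in a,
# and sorted() returns the result in ascending order.
result = sorted(x for x in set(b) if x not in exclude)
print(result)  # [5, 7, 8, 9, 10]
```

An equivalent one-liner is `sorted(set(b) - set(a))`, which expresses the same steps as a set difference.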

Asked in ExxonMobil

2w ago

Q. How do you determine if a point is inside or outside of a regular polygon?

Ans.

To find if a point is inside or outside of a regular polygon, we can use the Ray Casting algorithm.

  • Draw a line from the point to a point outside the polygon

  • Count the number of times the line intersects with the polygon edges

  • If the number of intersections is odd, the point is inside the polygon

  • If the number of intersections is even, the point is outside the polygon
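
A compact sketch of the ray casting test follows. It casts a horizontal ray to the right of the point and toggles a flag at each edge crossing; the function name and the square used as the regular polygon are illustrative, and points lying exactly on an edge are not handled specially.

```python
def point_in_polygon(x, y, vertices):
    """Return True if (x, y) is strictly inside the polygon (ray casting)."""
    inside = False
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside  # odd number of crossings means inside
    return inside


# A square is a regular polygon with four sides.
square = [(0, 0), (4, 0), (4, 4), (0, 4)]
print(point_in_polygon(2, 2, square))  # True
print(point_in_polygon(5, 2, square))  # False
```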

Asked in Ericsson

1w ago
Q. What are the different measures used to evaluate the performance of a classification model?
Ans.

Different measures used to evaluate classification model performance

  • Accuracy: Overall correctness of the model's predictions

  • Precision: Proportion of true positive predictions among all positive predictions

  • Recall: Proportion of true positive predictions among all actual positives

  • F1 Score: Harmonic mean of precision and recall

  • Confusion Matrix: Summarizes the performance of a classification model
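
These measures all derive from the four confusion matrix counts, as the following plain-Python sketch shows. The labels and predictions are made up, and the function name is hypothetical.

```python
def classification_metrics(y_true, y_pred):
    """Compute confusion-matrix counts and the metrics derived from them."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "confusion": {"tp": tp, "tn": tn, "fp": fp, "fn": fn},
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical binary labels and predictions.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```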

Asked in ION Group

6d ago

Q. You have three 1GB memory chips and need to store 3GB of data. How would you store the data across these chips so that no data is lost even if one chip is corrupted?

Ans.

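Use RAID 5-style striping with parity across the three memory chips for fault tolerance. Note that parity consumes one chip's worth of capacity, so three 1GB chips protect 2GB of raw data; fitting the full 3GB would also require compression or an additional chip.

  • Distribute data blocks and parity blocks across all three memory chips.

  • If one memory chip is corrupted, its contents can be reconstructed by XOR-ing the corresponding blocks on the other two chips.

  • Example: Of the 3GB total capacity, 2GB holds data and 1GB holds parity, so each stripe consists of two data blocks and one parity block.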
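
The parity idea can be illustrated with XOR on small byte strings. This is a simplified sketch: the byte values are placeholders, parity is kept on a single chip rather than rotated across chips as real RAID 5 does, and the variable names are invented.

```python
def xor_bytes(left, right):
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(left, right))


# Two data chips and one parity chip (tiny stand-ins for 1GB each).
chip_a = b"DATA-ONE"
chip_b = b"DATA-TWO"
parity = xor_bytes(chip_a, chip_b)

# If chip_b is corrupted, XOR of the two surviving chips rebuilds it.
recovered_b = xor_bytes(chip_a, parity)

# The same works if chip_a is the one that fails.
recovered_a = xor_bytes(chip_b, parity)

print(recovered_a, recovered_b)
```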

6d ago

Q. What are the best practices for handling large data sets?

Ans.

Best practices for handling large data sets include data preprocessing, using distributed computing frameworks, and optimizing storage and retrieval methods.

  • Perform data preprocessing to clean and transform data before analysis.

  • Utilize distributed computing frameworks like Hadoop or Spark for parallel processing.

  • Optimize storage and retrieval methods by using efficient data structures and indexing.

  • Consider using cloud services for scalable storage and processing capabilities....read more

Asked in Infosys

2w ago

Q. With the XGBoost algorithm using 10-20 features, how are the splits decided, and on which feature will they be divided?

Ans.

XGBoost uses a greedy approach: at each node it evaluates candidate splits across the features and picks the one with the highest gain.

  • For each feature, XGBoost computes the gain of candidate splits from the gradient and Hessian statistics of the loss, which measures the reduction in the regularized objective.

  • The feature and threshold with the highest gain are chosen for the split.

  • This process is repeated recursively for each node in the tree.

  • Features can be split based on numerical values or categories.

  • Example: If a feature like 'age' has the highest gain, the data will be split based on di...read more
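
To make the split search concrete, here is a simplified single-feature sketch of an exact greedy scan. It uses the structure-score gain from the XGBoost formulation, 0.5 * (G_L²/(H_L + λ) + G_R²/(H_R + λ) - G²/(H + λ)) - γ, where G and H are sums of gradients and Hessians. The function name, the sample `ages`, and the gradient values are assumptions for illustration; the real library adds histogram approximations, missing-value handling, and more.

```python
def best_split(values, grads, hess, lam=1.0, gamma=0.0):
    """Scan sorted feature values and return (threshold, gain) of the best split."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    g_total, h_total = sum(grads), sum(hess)
    parent_score = g_total ** 2 / (h_total + lam)

    best_thr, best_gain = None, 0.0
    g_left = h_left = 0.0
    for pos in range(len(order) - 1):
        i, nxt = order[pos], order[pos + 1]
        g_left += grads[i]
        h_left += hess[i]
        if values[i] == values[nxt]:
            continue  # cannot split between identical values
        g_right, h_right = g_total - g_left, h_total - h_left
        gain = 0.5 * (
            g_left ** 2 / (h_left + lam)
            + g_right ** 2 / (h_right + lam)
            - parent_score
        ) - gamma
        if gain > best_gain:
            best_thr = (values[i] + values[nxt]) / 2
            best_gain = gain
    return best_thr, best_gain


# Hypothetical 'age' feature with squared-error gradients (Hessian = 1).
ages = [20, 25, 30, 50, 55, 60]
grads = [-1.0, -1.0, -1.0, 1.0, 1.0, 1.0]
hess = [1.0] * len(ages)

threshold, gain = best_split(ages, grads, hess)
print(threshold, gain)
```

With 10-20 features, the same scan would run per feature and the split with the largest gain overall would be kept.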

Asked in Affine

4d ago

Q. What is the purpose of a lambda function when regular functions exist? How are they different?

Ans.

Lambda functions are anonymous functions used for short and simple operations. They are different from regular functions in their syntax and usage.

  • Lambda functions are defined without a name, using the 'lambda' keyword.

  • They can take any number of arguments but can only have one expression.

  • They are commonly used in functional programming and as arguments to higher-order functions.

  • Lambda functions are often used for short and simple operations that do not requir...read more
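
A short comparison, using a made-up list of (name, age) tuples, shows the typical case where a lambda is passed to a higher-order function and the equivalent named function:

```python
people = [("Ravi", 31), ("Mei", 25), ("Sam", 28)]

# Lambda: a single expression defined inline where it is used.
by_age_lambda = sorted(people, key=lambda person: person[1])


# Regular function: has a name, can hold statements and a docstring.
def age_of(person):
    """Return the age field of a (name, age) tuple."""
    return person[1]


by_age_def = sorted(people, key=age_of)

print(by_age_lambda)
```

Both calls produce the same ordering; the named function becomes preferable once the logic needs more than one expression or is reused elsewhere.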

Asked in Chubb

2w ago

Q. In what scenarios would you advise me not to use ReLU in my hidden layers?

Ans.

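Avoid ReLU when many units receive negative pre-activations and risk going dead, or when the layer needs bounded or negative outputs.

  • When negative inputs matter, use Leaky ReLU or ELU instead, since they keep a nonzero gradient for negative values.

  • Tanh or sigmoid suit cases that need bounded outputs (for example, recurrent gates), but they saturate and are more prone to vanishing gradients than ReLU, so they are not a fix for that problem.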

  • In some cases, using ReLU in all layers can lead to dead neurons.

  • Consider the nature of your data and the problem you are trying to solve before choosing an activation function.
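
The dead-neuron issue comes from ReLU's zero gradient for negative inputs. The sketch below compares it with Leaky ReLU in plain Python; the slope of 0.01 is a common default chosen here as an assumption.

```python
def relu(x):
    return max(0.0, x)


def relu_grad(x):
    # Zero gradient for non-positive inputs: the unit stops learning there.
    return 1.0 if x > 0 else 0.0


def leaky_relu(x, slope=0.01):
    return x if x > 0 else slope * x


def leaky_relu_grad(x, slope=0.01):
    # A small nonzero gradient keeps the unit trainable for negative inputs.
    return 1.0 if x > 0 else slope


for value in (-2.0, 3.0):
    print(value, relu(value), relu_grad(value),
          leaky_relu(value), leaky_relu_grad(value))
```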

2w ago

Q. How was your project related to the business problem, and how did you solve it?

Ans.

Developed a predictive model to identify potential customer churn for a telecom company

  • Identified key factors contributing to customer churn through exploratory data analysis

  • Built a logistic regression model to predict customer churn with 85% accuracy

  • Provided actionable insights to the business team to reduce customer churn and improve customer retention

  • Implemented the model in production environment using Python and SQL

Asked in Walmart

1w ago
Q. What are outlier values and how do you treat them?
Ans.

Outlier values are data points that significantly differ from the rest of the data, potentially affecting the analysis.

  • Outliers can be identified using statistical methods like Z-score or IQR.

  • Treatment options include removing outliers, transforming the data, or using robust statistical methods.

  • Example: In a dataset of salaries, a value much higher or lower than the rest may be considered an outlier.
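
As a sketch of the IQR rule applied to the salary example, the following uses the standard library's `statistics.quantiles`. The salary figures are invented, and the 1.5 multiplier is the conventional choice rather than a fixed requirement.

```python
import statistics


def iqr_outliers(data, k=1.5):
    """Return values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < lower or x > upper]


# Hypothetical salaries in thousands; one value is far above the rest.
salaries = [40, 42, 45, 47, 50, 52, 55, 300]
outliers = iqr_outliers(salaries)
print(outliers)  # [300]

# One treatment option: drop the flagged values before further analysis.
cleaned = [x for x in salaries if x not in outliers]
```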

Q. What type of statistics did you use in your previous organization to analyze data and build models?

Ans.

The organization used descriptive and inferential statistics to analyze and build models.

  • Descriptive statistics were used to summarize and describe the data, such as mean, median, and standard deviation.

  • Inferential statistics were used to make predictions and draw conclusions about the population based on the sample data, such as hypothesis testing and regression analysis.

  • The organization may have also used time series analysis, clustering, and classification models.

  • Examples ...read more

Asked in Capgemini

4d ago

Q. Can you write code to identify prime numbers between two given numbers?

Ans.

Code to identify prime numbers between two given numbers.

  • Create a function that takes two numbers as input.

  • Loop through the range of numbers between the two inputs.

  • Check if each number is divisible by any number other than 1 and itself.

  • If not, add it to a list of prime numbers.

  • Return the list of prime numbers.
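
The steps above can be sketched as follows. The function name is arbitrary, both endpoints are treated as inclusive (an assumption, since the question does not say), and divisors are only checked up to the square root of each number.

```python
import math


def primes_between(low, high):
    """Return all primes n with low <= n <= high."""
    primes = []
    for n in range(max(low, 2), high + 1):
        # n is prime if no integer from 2 to sqrt(n) divides it.
        if all(n % d != 0 for d in range(2, math.isqrt(n) + 1)):
            primes.append(n)
    return primes


print(primes_between(10, 30))  # [11, 13, 17, 19, 23, 29]
```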

Asked in Affine

2w ago

Q. How can you create dictionaries in Python with repeated keys?

Ans.

Python dictionaries cannot hold repeated keys; to store several values under one key, use defaultdict(list) from the collections module.

  • Import defaultdict from the collections module

  • Create a defaultdict object with list as the default factory

  • Append values under the same key multiple times

  • Access the values using the key

  • Example: from collections import defaultdict; d = defaultdict(list); d['key'].append('value1'); d['key'].append('value2')

Asked in MasterCard

1d ago

Q. How do you choose the optimum probability threshold from an ROC curve?

Ans.

To choose optimum probability threshold from ROC, we need to balance between sensitivity and specificity.

  • Choose the threshold that maximizes the sum of sensitivity and specificity

  • Use Youden's J statistic to find the optimal threshold

  • Consider the cost of false positives and false negatives

  • Use cross-validation to evaluate the performance of different thresholds
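
A plain-Python sketch of picking the threshold with Youden's J statistic (sensitivity + specificity - 1, which equals TPR - FPR) is shown below. The labels and scores are made up, and a score at or above the threshold is treated as a positive prediction.

```python
def youden_threshold(y_true, scores):
    """Return (threshold, J) maximizing J = TPR - FPR over observed scores."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    best_thr, best_j = None, -1.0
    for thr in sorted(set(scores)):
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= thr)
        fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= thr)
        j = tp / positives - fp / negatives
        if j > best_j:
            best_thr, best_j = thr, j
    return best_thr, best_j


# Hypothetical labels and predicted probabilities.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.6, 0.4, 0.5, 0.7, 0.8, 0.9]

best_thr, best_j = youden_threshold(y_true, scores)
print(best_thr, best_j)
```

Youden's J weighs false positives and false negatives equally; when their costs differ, a cost-weighted criterion would move the chosen threshold.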

Asked in Affine

3d ago

Q. Explain eigenvectors and eigenvalues. What purpose do they serve in ML?

Ans.

Eigenvalues and eigenvectors are linear algebra concepts used in machine learning for dimensionality reduction and feature extraction.

  • Eigenvalues represent the scaling factor of the eigenvectors.

  • Eigenvectors are the directions along which a linear transformation acts by stretching or compressing.

  • In machine learning, eigenvectors are used for principal component analysis (PCA) to reduce the dimensionality of data.

  • Eigenvalues and eigenvectors are also used in image processing f...read more
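
To show what these quantities look like numerically, the sketch below uses power iteration, one simple way to find the dominant eigenvalue and eigenvector, on a small symmetric matrix standing in for a covariance matrix. The matrix values and function name are assumptions for illustration; PCA libraries typically use other decompositions.

```python
import math


def power_iteration(matrix, iterations=100):
    """Approximate the dominant eigenvalue and unit eigenvector of a matrix."""
    n = len(matrix)
    vec = [1.0] + [0.0] * (n - 1)
    for _ in range(iterations):
        # Apply the matrix, then renormalize to unit length.
        nxt = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    # Rayleigh quotient gives the scaling factor along that direction.
    applied = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
    eigenvalue = sum(vec[i] * applied[i] for i in range(n))
    return eigenvalue, vec


# Hypothetical 2x2 covariance-like matrix; its eigenvalues are 3 and 1.
cov = [[2.0, 1.0], [1.0, 2.0]]
eigenvalue, eigenvector = power_iteration(cov)
print(round(eigenvalue, 6), [round(v, 4) for v in eigenvector])
```

In PCA terms, the returned vector is the first principal direction and the eigenvalue is the variance captured along it.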

Asked in Turing

1w ago

Q. Given a string s and integer k, return the maximum number of vowel letters in any substring of s with length k. Vowel letters in English are 'a','e','i','o','u'.

Ans.

Find the maximum number of vowels in any substring of length k in a given string.

  • Iterate through the string with a sliding window of size k, counting vowels in each substring.

  • Keep track of the maximum vowel count encountered.

  • Return the maximum vowel count found.
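
A sliding-window sketch of this approach: count the vowels in the first window, then update the count by one character in and one character out as the window moves. The function name and sample strings are illustrative.

```python
def max_vowels(s, k):
    """Return the maximum number of vowels in any length-k substring of s."""
    vowels = set("aeiou")
    # Vowel count of the first window.
    count = sum(1 for ch in s[:k] if ch in vowels)
    best = count
    for i in range(k, len(s)):
        # Slide the window: add the entering character, drop the leaving one.
        count += (s[i] in vowels) - (s[i - k] in vowels)
        best = max(best, count)
    return best


print(max_vowels("abciiidef", 3))  # 3
print(max_vowels("leetcode", 3))   # 2
```

Each character is examined a constant number of times, so the running time is linear in the length of the string.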
