Data Scientist

800+ Data Scientist Interview Questions and Answers

Updated 15 Dec 2024

Q1. Special Sum of Array

You have been given an array/list ‘arr’ of length ‘N’, which contains single digit elements at every index. Your task is to return the sum of all elements of the array. But the final sum sho...read more

Q2. for a data with 1000 samples and 700 dimensions, how would you find a line that best fits the data, to be able to extrapolate? this is not a supervised ML problem, there's no target. and how would you do it, if...

Ans.

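With no target column, the line that best fits 1000 samples in 700 dimensions is the first principal component of the data, not an ordinary regression line.

  • For an unsupervised approach, use Principal Component Analysis (PCA): the first principal component is the direction of the line through the data mean that minimizes the total squared perpendicular distance to the points.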

  • For supervised ML approach, we need to select a target column. We can choose any of the 700 dimensions as the target and treat it as a regression problem.

  • Potential problems of treating this as a supervised problem include: lack of in...read more
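The unsupervised reading of this question can be sketched with NumPy. This is a minimal illustration, assuming synthetic data: the shape 1000 x 700 comes from the question, while the generated points and all variable names are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: points scattered around a line in 700-D.
n_samples, n_dims = 1000, 700
true_direction = rng.normal(size=n_dims)
true_direction /= np.linalg.norm(true_direction)
t = rng.normal(scale=10.0, size=(n_samples, 1))  # position along the line
X = t * true_direction + rng.normal(scale=0.1, size=(n_samples, n_dims))

# Center the data, then take the top right-singular vector,
# which is the first principal component.
mean = X.mean(axis=0)
_, _, vt = np.linalg.svd(X - mean, full_matrices=False)
direction = vt[0]

def point_on_line(s):
    """Point at signed distance s from the data mean along the fitted line."""
    return mean + s * direction

# Extrapolate: step beyond the largest projection seen in the data.
projections = (X - mean) @ direction
far_point = point_on_line(projections.max() * 1.5)
```

The sign of `direction` is arbitrary, so only the line itself (not its orientation) is meaningful.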


Q3. you have a pandas dataframe with three columns, filled with state names, city names and arbitrary numbers respectively. How to retrieve top 2 cities per state. (top according to the max number in the third colu...

Ans.

Retrieve top 2 cities per state based on max number in third column of pandas dataframe.

  • Group the dataframe by state column

  • Sort each group by the third column in descending order

  • Retrieve the top 2 rows of each group using head(2) function

  • groupby(...).head(2) already returns a single dataframe; pd.concat() is only needed if each group is processed separately
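The steps above might look like the following sketch. The column names `state`, `city` and `value` and the sample rows are assumptions, since the question does not name them.

```python
import pandas as pd

# Hypothetical data: three columns as described in the question.
df = pd.DataFrame({
    "state": ["CA", "CA", "CA", "TX", "TX", "TX"],
    "city": ["LA", "SF", "SD", "Austin", "Dallas", "Houston"],
    "value": [10, 30, 20, 5, 15, 25],
})

top2 = (
    df.sort_values("value", ascending=False)  # largest values first
      .groupby("state")
      .head(2)                                # first two rows of each state
      .sort_values(["state", "value"], ascending=[True, False])
      .reset_index(drop=True)
)
```

Sorting once before grouping avoids sorting each group separately, and `head(2)` on the grouped frame returns one combined dataframe.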

Q4. Clone a Linked List with random pointers

Given a linked list having two pointers in each node. The first one points to the next node of the list, however, the other pointer is random and can point to any node of...read more

Q5. Technical Question

Given an API reference, you had to make a POST request to the API with your personal details.

Q6. Coding question: find the indices of two numbers whose sum equals the target in a list, without using a nested for loop. l = [2, 15, 5, 7], t = 9, output: [0, 3]

Ans.

Finding index of 2 numbers having total equal to target in a list without nested for loop.

  • Use a dictionary that maps each value seen so far to its index.

  • Iterate through the list once and, for each element, check whether (target - element) is already in the dictionary.

  • Return the indices of the two elements that add up to target.
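A minimal sketch of the single-pass dictionary approach, using the list and target from the question; the function name `two_sum` is just a label chosen here.

```python
def two_sum(nums, target):
    """Return indices of two elements summing to target, or None if absent."""
    seen = {}  # value -> index of the values visited so far
    for i, x in enumerate(nums):
        complement = target - x
        if complement in seen:
            return [seen[complement], i]
        seen[x] = i
    return None

result = two_sum([2, 15, 5, 7], 9)  # [0, 3]
```

Each element is visited once and dictionary lookups are constant time on average, so the whole search is linear in the list length.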

Q7. Technical Question

How can you tune the hyperparameters of the XGBoost algorithm?

Q8. How would you measure model effectiveness without using any of confusion matrix metrics given the data is highly imbalanced

Ans.

One way to measure model effectiveness without using confusion matrix metrics is by using area under the receiver operating characteristic curve (AUC-ROC).

  • Calculate the AUC-ROC score to evaluate the model's ability to distinguish between positive and negative classes.

  • AUC-ROC considers the entire range of classification thresholds, but under heavy class imbalance it can look optimistic, so the area under the precision-recall curve is often reported alongside it.

  • Higher AUC-ROC score indicates better model performance.

  • Example: A model with an AUC-ROC score of 0.85 perform...read more
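To make the AUC-ROC idea concrete, here is a small pure-Python sketch that uses the rank interpretation: AUC equals the probability that a randomly chosen positive is scored above a randomly chosen negative (ties count as half). The labels and scores are invented for illustration, and the quadratic loop is only meant for small examples.

```python
def auc_roc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))

# Imbalanced toy data: 8 negatives, 2 positives.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.15, 0.3, 0.05, 0.4, 0.25, 0.35, 0.45, 0.3]
auc = auc_roc(y_true, scores)  # 13.5 correct pairs out of 16
```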

Q9. what is tokenization in NLP? and, to get raw tokens for a sentence with words separated by space, why use tokenizers from nltk instead of str.split()?

Ans.

Tokenization in NLP is the process of breaking down text into smaller units called tokens.

  • Tokenization is a fundamental step in NLP for text preprocessing.

  • Tokens can be words, phrases, or even individual characters.

  • Tokenization helps in preparing text data for further analysis or modeling.

  • NLTK tokenizers provide additional functionalities like handling contractions, punctuation, etc.

  • str.split() may not handle complex tokenization scenarios as effectively as NLTK tokenizers.

Q10. you have two different vectors with only small change in one of the dimensions. but, the predictions/output from the model is drastically different for each vector. can you explain why this can be the case? and...

Ans.

Small change in one dimension causing drastic difference in model output. Explanation and solution.

  • This is known as sensitivity to input

  • It can be caused by non-linearities in the model or overfitting

  • Regularization techniques can be used to reduce sensitivity

  • Cross-validation can help identify overfitting

  • Ensemble methods can help reduce sensitivity

  • It is generally a bad thing as it indicates instability in the model

Q11. In which direction does the fluid flow in a vertical pipe when the pressures at two vertical locations are given?

Ans.

The direction of flow in a vertical pipe depends on the piezometric head (p + ρgz) at the two locations, not on the two pressures alone.

  • Fluid flows from higher piezometric head to lower piezometric head (for a pipe of constant diameter the velocity head is the same at both points).

  • In a fluid at rest the lower point already has the higher pressure, by ρgh for a height difference h, so compare p_lower with p_upper + ρgh.

  • If p_lower is greater than p_upper + ρgh, the fluid flows upwards.

  • If p_lower is less than p_upper + ρgh, the fluid flows downwards.

  • The magnitude of the head difference determines the rate of fluid flow.

Q12. How can you tune the hyperparameters of the XGBoost, Random Forest and SVM algorithms?

Ans.

Hyperparameters of XGBoost, Random Forest, and SVM can be tuned using techniques like grid search, random search, and Bayesian optimization.

  • For XGBoost, important hyperparameters to tune include learning rate, maximum depth, and number of estimators.

  • For Random Forest, important hyperparameters to tune include number of trees, maximum depth, and minimum samples split.

  • For SVM, important hyperparameters to tune include kernel type, regularization parameter, and gamma value.

  • Grid ...read more

Q13. how will you get the embeddings of long sentences/paragraphs that transformer models like BERT truncate? how will you go about using BERT for such sentences? will you use sentence embeddings or word embeddings ...

Ans.

To embed long sentences/paragraphs that exceed BERT's input limit, split the text into segments, embed each segment, and combine the segment embeddings.

  • Mean or max pooling over the segment embeddings gives one fixed-size vector for the whole text.

  • We can also use sliding window approach to get embeddings of overlapping segments of the long input.

  • For using BERT on such long inputs, we can use sentence embeddings or word embeddings depending on the task.

  • Models like Longformer and Reformer can handle lon...read more
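The sliding-window idea can be sketched without any model dependency. In the sketch below, `embed_fn` is a hypothetical placeholder for whatever encoder is used (for example a BERT forward pass that returns one vector per segment); the window and stride values are illustrative defaults, not recommendations from the source.

```python
def sliding_windows(tokens, window=512, stride=256):
    """Split tokens into overlapping windows that together cover the input."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # the last window reached the end of the input
        start += stride
    return windows

def mean_pool(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def embed_long_text(tokens, embed_fn, window=512, stride=256):
    """Embed each window with embed_fn (hypothetical) and mean-pool the results."""
    chunk_vectors = [embed_fn(w) for w in sliding_windows(tokens, window, stride)]
    return mean_pool(chunk_vectors)

# Toy stand-in for an encoder: maps a window to a 2-D vector.
toy_embed = lambda w: [float(len(w)), float(sum(w))]
doc_vector = embed_long_text(list(range(10)), toy_embed, window=4, stride=2)
```

Overlap between windows keeps some context across segment boundaries, at the cost of embedding some tokens twice.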

Q14. What are the types of ML algorithms? Give an example of each.

Ans.

There are several types of ML algorithms, including supervised learning, unsupervised learning, and reinforcement learning.

  • Supervised learning: algorithms learn from labeled data to make predictions or classifications (e.g., linear regression, decision trees)

  • Unsupervised learning: algorithms find patterns or relationships in unlabeled data (e.g., clustering, dimensionality reduction)

  • Reinforcement learning: algorithms learn through trial and error by interacting with an enviro...read more

Q15. DBMS based questions

I was asked about joins, their types and data movements during join operations.
I was asked questions about Transactions and ACID properties,
Then he gave 2 tables and some SQL statements. I w...read more

Q16. Q2.) Difference between list and tuple? a = [1,2,3,4,5,6,7,8,9] print(a[-1:-5]) Without running this code in compiler, tell the output

Ans.

The code will output an empty list as a result of slicing from -1 to -5 in the list 'a'.

  • Slicing in Python allows you to access a subset of elements in a list or tuple.

  • When slicing, the start index is inclusive and the end index is exclusive.

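  • In this case, a[-1:-5] results in an empty list because, with the default step of +1, the start position (the last element) lies after the stop position, so no elements are selected. a[-1:-5:-1] would return [9, 8, 7, 6].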
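The slicing behaviour can be checked directly; the variable names below are chosen only for this illustration.

```python
a = [1, 2, 3, 4, 5, 6, 7, 8, 9]

forward = a[-1:-5]       # step +1 starting at the last element: nothing to take
backward = a[-1:-5:-1]   # step -1 walks from index 8 down to index 5
middle = a[-5:-1]        # indices 4..7, the stop index is excluded

print(forward)   # []
print(backward)  # [9, 8, 7, 6]
print(middle)    # [5, 6, 7, 8]
```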

Q17. Technical Question

How to fit a time series model? State all the steps you would follow.

Q18. Why do we use machine learning? Machine learning is used to analyse data and make predictions; combined with additional algorithms, it is mainly used for prediction and AI.

Ans.

Machine learning is used for data analysis and prediction, and it underpins many AI applications.

  • Machine learning is a subset of artificial intelligence that focuses on predicting outcomes from data.

  • It involves using algorithms to learn patterns and make predictions based on new data.

  • Examples include image recognition, natural language processing, and recommendation systems.

Q19. How to retain special characters (that pandas discards by default) in the data while reading it?

Ans.

To retain special characters in pandas data, use encoding parameter while reading the data.

  • Use encoding parameter while reading the data in pandas

  • Specify the encoding type of the data file

  • Example: pd.read_csv('filename.csv', encoding='utf-8')

Q20. Q1.) Given sample data in text, read it in python Solution: Take the text to notepad, save it as CSV and then read it in python Check the number of null values Check the number of unique values Make a new colum...

Ans.

Read sample data in text, check for null and unique values, create new column by multiplying two features

  • Save text data as CSV and read in Python using pandas

  • Use isnull() to check for null values

  • Use nunique() to check for unique values

  • Create a new column by multiplying two existing columns

  • Add the new column to the existing dataframe
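The steps above could look like this sketch. The sample text and the column names `price`, `quantity` and `revenue` are invented; `io.StringIO` stands in for the saved CSV file so the example is self-contained.

```python
import io
import pandas as pd

# Hypothetical sample text, as it might be pasted into a file.
raw_text = """price,quantity,region
10.0,2,north
,3,south
7.5,,north
4.0,5,east
"""

df = pd.read_csv(io.StringIO(raw_text))

null_counts = df.isnull().sum()   # missing values per column
unique_counts = df.nunique()      # distinct non-null values per column

# New column as the product of two features; rows with a missing input stay NaN.
df["revenue"] = df["price"] * df["quantity"]
```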

Q21. How did you prevent your model from overfitting ? What did you do when it was underfit ?

Ans.

To prevent overfitting, I used techniques like regularization, cross-validation, and early stopping. For underfitting, I tried increasing model complexity and adding more features.

  • Used regularization techniques like L1 and L2 regularization to penalize large weights

  • Used cross-validation to evaluate model performance on different subsets of data

  • Used early stopping to prevent the model from continuing to train when performance on validation set stops improving

  • For underfitting, ...read more

Q22. 4. What is the difference between Linear Regression and Logistic Regression?

Ans.

Linear Regression is used for predicting continuous numerical values, while Logistic Regression is used for predicting binary categorical values.

  • Linear Regression predicts a continuous output, while Logistic Regression predicts a binary output.

  • Linear Regression uses a linear equation to model the relationship between the independent and dependent variables, while Logistic Regression uses a logistic function.

  • Linear Regression assumes a linear relationship between the variables...read more

Q23. What do these hyper parameters in the above mentioned algorithms actually mean?

Ans.

Hyperparameters are settings that control the behavior of machine learning algorithms.

  • Hyperparameters are set before training the model.

  • They control the learning process and affect the model's performance.

  • Examples include learning rate, regularization strength, and number of hidden layers.

  • Optimizing hyperparameters is important for achieving better model accuracy.

Q24. Technical Question

What are outlier values and how do you treat them?

Q25. Discussion

No questions were asked.
I was asked if I could relocate to Pune. We discussed the work hours and days.
We discussed the compensation.
That's it.

Q26. Write pandas query to separate the names as first and last name from the full name. Drop the duplicate columns and also the missing values. Write output for the Python code. Write SQL query to retrieve the name...

Ans.

Answering questions related to data science concepts and techniques.

  • Recall is the ratio of correctly predicted positive observations to the total actual positives. Precision is the ratio of correctly predicted positive observations to the total predicted positives.

  • To reduce variance in an ensemble model, techniques like bagging, boosting, and stacking can be used. Bagging involves training multiple models on different subsets of the data and averaging their predictions. Boost...read more

Q27. 3. How do you deal with senior customer when you don't have enough data?

Ans.

Communicate transparently and offer alternative solutions.

  • Explain the limitations of the available data and the potential risks of making decisions based on incomplete information.

  • Offer alternative solutions that can be implemented with the available data.

  • Collaborate with the customer to identify additional data sources or explore other options to gather more data.

  • Provide regular updates on the progress of data collection and analysis.

  • Ensure that all decisions are based on so...read more

Q28. Which test is used in logistic regression to check the significance of the variable

Ans.

The Wald test is used in logistic regression to check the significance of the variable.

  • The Wald statistic is the ratio of the estimated coefficient to its standard error (z).

  • Under the null hypothesis that the coefficient is zero, z is approximately standard normal; equivalently, z squared follows a chi-square distribution with one degree of freedom.

  • A small p-value indicates that the variable is significant.

  • For example, in Python, the statsmodels library reports the z statistic and its p-value for each coefficient in the summary of a fitted logistic regression model.
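The arithmetic behind the Wald test is short enough to write by hand. The sketch below assumes a coefficient and standard error are already available from a fitted model; the numbers 0.8 and 0.4 are invented for illustration.

```python
import math

def wald_test(coef, std_err):
    """Return (z, chi-square statistic, two-sided p-value) for H0: coef = 0."""
    z = coef / std_err
    chi2 = z * z  # chi-square with 1 degree of freedom under H0
    # Two-sided standard normal tail, identical to the chi-square(1) upper tail.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, chi2, p_value

z, chi2, p_value = wald_test(0.8, 0.4)  # z = 2.0, p roughly 0.0455
```

With these illustrative numbers the p-value falls just below 0.05, so the variable would be called significant at the 5% level.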

Q29. how will the resultant table be, when you "merge" two tables that match at a column. and the second table has many of keys repeated.

Ans.

The resultant table will have all the columns from both tables and the rows will be a combination of matching rows.

  • The resultant table will have all the columns from both tables

  • The rows in the resultant table will be a combination of matching rows

  • If the second table has repeated keys, each matching row of the first table is paired with every row carrying that key in the second table, so the result can have more rows than the first table
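A small pandas sketch shows the row multiplication. The two tables and their column names are made up for the example.

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "left_val": [1, 2, 3]})
# The key "a" appears twice in the second table; "d" has no match on the left.
right = pd.DataFrame({"key": ["a", "a", "b", "d"], "right_val": [10, 11, 20, 40]})

inner = left.merge(right, on="key", how="inner")     # only matching keys
left_join = left.merge(right, on="key", how="left")  # keeps unmatched "c" with NaN
```

Here the single left row for key "a" appears twice in both results, once per matching right row.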

Q30. ML Question

What are the assumptions of linear regression model?

Q31. 1) Model building process of one of my previous projects 2) Random forest hyperparameters 3) ROC curve, using the ROC curve to set probability cutoffs in classification models 4) Gradient boosting techniques like...

Ans.

Data Scientist interview questions on model building, random forest, ROC curve, gradient boosting, and real estate valuation

  • For model building, I followed the CRISP-DM process and used various algorithms like logistic regression, decision trees, and random forest

  • Random forest hyperparameters include number of trees, maximum depth, minimum samples split, and minimum samples leaf

  • ROC curve is a graphical representation of the trade-off between true positive rate and false positi...read more

Q32. What is R square and how R square is different from Adjusted R square

Ans.

R square is a statistical measure that represents the proportion of the variance in the dependent variable explained by the independent variables.

  • R square is a value between 0 and 1, where 0 indicates that the independent variables do not explain any of the variance in the dependent variable, and 1 indicates that they explain all of it.

  • It is used to evaluate the goodness of fit of a regression model.

  • Adjusted R square takes into account the number of predictors in the model an...read more
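The two measures can be compared with a few lines of Python. This sketch uses the standard formulas R² = 1 - SS_res / SS_tot and adjusted R² = 1 - (1 - R²)(n - 1)/(n - p - 1); the data points are invented.

```python
def r_squared(y_true, y_pred):
    """Proportion of variance in y_true explained by the predictions."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n_samples, n_predictors):
    """Penalize R square for the number of predictors in the model."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)

y_true = [1, 2, 3, 4, 5]
y_pred = [1.1, 1.9, 3.2, 3.8, 5.0]

r2 = r_squared(y_true, y_pred)                 # about 0.99
adj_one = adjusted_r_squared(r2, 5, 1)         # about 0.9867
adj_two = adjusted_r_squared(r2, 5, 2)         # about 0.98
```

For the same R square, the adjusted value drops as predictors are added, which is why it is preferred when comparing models of different sizes.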

Q33. What is the neighbourhood in which superhosts have the biggest median price difference with respect to non superhosts?

Ans.

The neighbourhood with the biggest median price difference between superhosts and non superhosts is X.

  • Calculate the median price for superhosts and non superhosts in each neighbourhood

  • Find the neighbourhood with the largest difference in median prices between superhosts and non superhosts

  • Example: Neighbourhood X has a median price of $200 for superhosts and $150 for non superhosts, resulting in a $50 difference
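One way to compute this in pandas is sketched below. The column names `neighbourhood`, `host_type` and `price`, as well as the listings themselves, are assumptions, since the original dataset is not shown.

```python
import pandas as pd

# Hypothetical listings data.
listings = pd.DataFrame({
    "neighbourhood": ["X"] * 4 + ["Y"] * 4,
    "host_type": ["superhost", "superhost", "regular", "regular"] * 2,
    "price": [190, 210, 140, 160, 120, 130, 115, 125],
})

# Median price per neighbourhood and host type, one column per host type.
medians = (
    listings.groupby(["neighbourhood", "host_type"])["price"]
            .median()
            .unstack("host_type")
)

# Absolute gap between the two medians, then the neighbourhood with the largest gap.
gap = (medians["superhost"] - medians["regular"]).abs()
widest = gap.idxmax()
```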

Q34. ML Question

What problems does multicollinearity cause in regression analysis?

Q35. How to fit a time series model? State all the steps you would follow.

Ans.

Steps to fit a time series model

  • Identify the time series pattern

  • Choose a suitable model

  • Split data into training and testing sets

  • Fit the model to the training data

  • Evaluate model performance on testing data

  • Refine the model if necessary

  • Forecast future values using the model

Q36. How to read large .csv files in pandas quickly?

Ans.

Use pandas' read_csv() method with appropriate parameters to read large .csv files quickly.

  • Use the chunksize parameter to read the file in smaller chunks

  • Use the low_memory parameter to optimize memory usage

  • Use the dtype parameter to specify data types for columns

  • Use the usecols parameter to read only necessary columns

  • Use the skiprows parameter to skip unnecessary rows

  • Use the nrows parameter to read only a specific number of rows

  • Use the na_values parameter to specify values to...read more

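Q37. How does lookup happen in a list when you do my_list[5]?

Ans.

my_list[5] retrieves the 6th element of the list in constant time, because a CPython list is stored as a contiguous array of object references and indexing is a direct offset into that array.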

  • Indexing starts from 0 in Python.

  • The integer inside the square brackets is the index of the element to retrieve.

  • If the index is out of range, an IndexError is raised.

Q38. What is the bias-variance tradeoff? Explain P values to a non-technical and a technical audience.

Ans.

The bias-variance tradeoff is the balance between underfitting and overfitting. P values measure the significance of statistical results.

  • The bias-variance tradeoff is the tradeoff between the model's ability to fit the training data and its ability to generalize to new data.

  • Overfitting occurs when the model is too complex and fits the training data too closely, resulting in poor performance on new data.

  • Underfitting occurs when the model is too simple and fails to capture the underlyi...read more

Q39. ML Question

What are different measures used to check performance of classification model?

Q40. DBMS question - What are joins and what are their types?

Ans.

Joins are used in DBMS to combine rows from two or more tables based on a related column between them.

  • Types of joins include INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL JOIN.

  • INNER JOIN returns rows when there is at least one match in both tables.

  • LEFT JOIN returns all rows from the left table and the matched rows from the right table.

  • RIGHT JOIN returns all rows from the right table and the matched rows from the left table.

  • FULL JOIN returns rows when there is a match in one of ...read more

Q41. Why was this model/ approach used instead of others ?

Ans.

The model/approach was chosen based on its accuracy, interpretability, and scalability.

  • The chosen model/approach had the highest accuracy compared to others.

  • The chosen model/approach was more interpretable and easier to explain to stakeholders.

  • The chosen model/approach was more scalable and could handle larger datasets.

  • Other models/approaches were considered but did not meet the requirements or had limitations.

  • The chosen model/approach was also more suitable for the specific ...read more

Q42. what is the purpose of lambda functions when regular functions (of def) exist? how are they different?

Ans.

Lambda functions are anonymous functions used for short and simple operations. They are different from regular functions in their syntax and usage.

  • Lambda functions are defined without a name and keyword 'lambda' is used to define them.

  • They can take any number of arguments but can only have one expression.

  • They are commonly used in functional programming and as arguments to higher-order functions.

  • Lambda functions are often used for short and simple operations that do not requir...read more
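A short side-by-side comparison may help. The word list is invented; both versions do the same thing, and the lambda simply avoids naming a one-line function.

```python
words = ["banana", "kiwi", "apple"]

# Lambda: an anonymous, single-expression function passed inline.
by_length = sorted(words, key=lambda w: len(w))

# Equivalent regular function defined with def.
def word_length(w):
    return len(w)

by_length_def = sorted(words, key=word_length)
```

A `def` function is usually the better choice once the logic needs more than one expression, a docstring, or reuse elsewhere.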

Q43. How to find if a point is inside or outside of a regular polygon?

Ans.

To find if a point is inside or outside of a regular polygon, we can use the Ray Casting algorithm.

  • Draw a line from the point to a point outside the polygon

  • Count the number of times the line intersects with the polygon edges

  • If the number of intersections is odd, the point is inside the polygon

  • If the number of intersections is even, the point is outside the polygon
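The ray casting steps can be sketched as follows. The ray is cast horizontally to the right of the point, and the square used as the test polygon is an arbitrary example; points lying exactly on an edge are not treated specially in this sketch.

```python
def point_in_polygon(point, polygon):
    """Ray casting: count crossings of a rightward horizontal ray with the edges."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # The edge straddles the ray's height only if its endpoints are on opposite sides.
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses the horizontal line through the point.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x_cross > x:
                inside = not inside  # each crossing flips the parity
    return inside

square = [(0, 0), (4, 0), (4, 4), (0, 4)]
inside_result = point_in_polygon((2, 2), square)    # True
outside_result = point_in_polygon((5, 2), square)   # False
```

The same function works for any simple polygon given as an ordered list of vertices, not only regular ones.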

Q44. Pattern based - Three memory chips, each of 1GB. You have to store 3GB of data in these chips in such a way that even if one memory chip is corrupted, no data is lost.

Ans.

Use a RAID 5 style layout: stripe the data across the three memory chips together with parity so that any single chip failure is recoverable.

  • Distribute data blocks and parity blocks across all three memory chips.

  • If one memory chip is corrupted, its contents can be reconstructed by XOR-ing the corresponding blocks on the other two chips.

  • Note: parity consumes one chip's worth of space, so three 1GB chips hold 2GB of data plus 1GB of parity. Fitting 3GB of raw data with single-failure tolerance would require compressing it to 2GB or less.
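The parity idea can be demonstrated with XOR on two small byte strings. This is a simplified sketch with a dedicated parity block (real RAID 5 rotates parity across devices), and the block contents are invented.

```python
def xor_bytes(a, b):
    """Byte-wise XOR of two equally sized blocks."""
    return bytes(x ^ y for x, y in zip(a, b))

# Two data blocks on two chips, parity on the third.
block_a = b"DATA-ONE"
block_b = b"DATA-TWO"
parity = xor_bytes(block_a, block_b)

# Suppose the chip holding block_a is corrupted: rebuild it from the other two.
recovered_a = xor_bytes(parity, block_b)

# The same works if the chip holding block_b fails instead.
recovered_b = xor_bytes(parity, block_a)
```

Because XOR is its own inverse, any one of the three blocks can be rebuilt from the remaining two.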

Q45. in what scenarios would you advise me to not use ReLU in my hidden layers?

Ans.

Avoid plain ReLU when many units receive negative inputs and stop learning (dying ReLU), or when a layer needs bounded or negative outputs.

  • When negative inputs matter, use Leaky ReLU or ELU instead.

  • tanh or sigmoid suit layers that need bounded outputs, such as gates in recurrent networks, but they saturate and tend to worsen vanishing gradients rather than fix them.

  • In some cases, using ReLU in all layers can lead to dead neurons.

  • Consider the nature of your data and the problem you are trying to solve before choosing an activation function.

Q46. How is your project related to a business problem and how have you solved it?

Ans.

Developed a predictive model to identify potential customer churn for a telecom company

  • Identified key factors contributing to customer churn through exploratory data analysis

  • Built a logistic regression model to predict customer churn with 85% accuracy

  • Provided actionable insights to the business team to reduce customer churn and improve customer retention

  • Implemented the model in production environment using Python and SQL

Q47. how to create dictionaries in python with repeated keys?

Ans.

A Python dictionary cannot hold the same key twice; to keep several values under one key, map each key to a list, for example with defaultdict from the collections module.

  • Import the collections module

  • Create a defaultdict object

  • Add key-value pairs to the dictionary using the same key multiple times

  • Access the values using the key

  • Example: from collections import defaultdict; d = defaultdict(list); d['key'].append('value1'); d['key'].append('value2')

Q48. 1. How to choose optimum probability threshold from ROC?

Ans.

To choose optimum probability threshold from ROC, we need to balance between sensitivity and specificity.

  • Choose the threshold that maximizes the sum of sensitivity and specificity

  • Use Youden's J statistic to find the optimal threshold

  • Consider the cost of false positives and false negatives

  • Use cross-validation to evaluate the performance of different thresholds
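Youden's J statistic (sensitivity + specificity - 1, which equals TPR - FPR) can be computed by scanning candidate thresholds. The sketch below treats a score greater than or equal to the threshold as a positive prediction; the labels and scores are invented, and ties in J are resolved in favour of the lowest threshold.

```python
def best_threshold_youden(y_true, scores):
    """Return (threshold, J) maximizing J = TPR - FPR over the observed scores."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    best_t, best_j = None, float("-inf")
    for t in sorted(set(scores)):
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= t)
        fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= t)
        j = tp / n_pos - fp / n_neg
        if j > best_j:  # strict comparison keeps the first (lowest) best threshold
            best_t, best_j = t, j
    return best_t, best_j

y_true = [0, 0, 0, 1, 0, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.7, 0.9]
threshold, j_stat = best_threshold_youden(y_true, scores)
```

When false positives and false negatives have different costs, the cost-weighted choice may differ from the threshold that maximizes J.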

Q49. explain eigenvectors and eigenvalues? what purpose do they serve in ML?

Ans.

Eigenvalues and eigenvectors are linear algebra concepts used in machine learning for dimensionality reduction and feature extraction.

  • Eigenvalues represent the scaling factor of the eigenvectors.

  • Eigenvectors are the directions along which a linear transformation acts by stretching or compressing.

  • In machine learning, eigenvectors are used for principal component analysis (PCA) to reduce the dimensionality of data.

  • Eigenvalues and eigenvectors are also used in image processing f...read more
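A tiny NumPy example illustrates both the definition and the PCA use. The 2 x 2 matrix plays the role of a covariance matrix and is chosen only because its eigenvalues (1 and 3) are easy to verify by hand.

```python
import numpy as np

# Symmetric matrix standing in for a covariance matrix.
cov = np.array([[2.0, 1.0],
                [1.0, 2.0]])

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# The eigenvector with the largest eigenvalue is the direction of greatest variance.
top_direction = eigenvectors[:, -1]

# Defining property: multiplying by the matrix only rescales the eigenvector.
scaled = cov @ top_direction

# PCA-style reduction from 2-D to 1-D: project points onto the top direction.
points = np.array([[1.0, 1.0], [2.0, 0.0]])
projected = points @ top_direction
```

The sign of an eigenvector is arbitrary, so the projected coordinates may come out negated on another machine without changing their meaning.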

Q50. why do optimisers matter? what's their purpose? what do they do in addition to the weight updates that vanilla gradient descent and back-prop do?

Ans.

Optimizers are used to improve the efficiency and accuracy of the training process in machine learning models.

  • Optimizers help in finding the optimal set of weights for a given model by minimizing the loss function.

  • They use various techniques like momentum, learning rate decay, and adaptive learning rates to speed up the training process.

  • Techniques such as momentum can help the model move past shallow local minima and saddle points, although no optimizer guarantees this or guarantees better generalization to unseen data.

  • Examples of optimizers...read more
