Data Scientist
800+ Data Scientist Interview Questions and Answers
You have been given an array/list ‘arr’ of length ‘N’, which contains single digit elements at every index. Your task is to return the sum of all elements of the array. But the final sum sho...
Q2. for a data with 1000 samples and 700 dimensions, how would you find a line that best fits the data, to be able to extrapolate? this is not a supervised ML problem, there's no target. and how would you do it, if...
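To find a line that best fits 1000 samples in 700 dimensions when there is no target, use Principal Component Analysis (PCA) rather than linear regression, which needs a target.
For the unsupervised approach, center the data and take the first principal component: the line through the mean along that direction minimizes the summed squared perpendicular distances to the points, and it can be extended beyond the data range to extrapolate.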
For supervised ML approach, we need to select a target column. We can choose any of the 700 dimensions as the target and treat it as a regression problem.
Potential problems of treating this as a supervised problem include: lack of in...
Q3. you have a pandas dataframe with three columns, filled with state names, city names and arbitrary numbers respectively. How to retrieve top 2 cities per state. (top according to the max number in the third colu...
Retrieve the top 2 cities per state based on the max number in the third column of a pandas dataframe.
Group the dataframe by state column
Sort each group by the third column in descending order
Retrieve the top 2 rows of each group using head(2) function
Concatenate the resulting dataframes using pd.concat() function
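The steps above can be sketched as follows. This is a minimal illustration, assuming hypothetical column names (`state`, `city`, `value`) and made-up data; sorting once before grouping lets `groupby(...).head(2)` return the result directly, so no explicit `pd.concat()` is needed.

```python
import pandas as pd

# Hypothetical data; the column names and values are assumptions for illustration.
df = pd.DataFrame({
    "state": ["KA", "KA", "KA", "MH", "MH", "MH"],
    "city": ["Bengaluru", "Mysuru", "Hubli", "Mumbai", "Pune", "Nagpur"],
    "value": [90, 40, 70, 85, 95, 20],
})

top2 = (
    df.sort_values("value", ascending=False)   # largest numbers first
      .groupby("state", sort=False)            # group rows by state
      .head(2)                                 # keep the first 2 rows of each group
      .sort_values(["state", "value"], ascending=[True, False])
      .reset_index(drop=True)
)
print(top2)
```

The final `sort_values` call only tidies the output order and can be dropped if the order does not matter.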
Given a linked list having two pointers in each node. The first one points to the next node of the list, however, the other pointer is random and can point to any node of...
Given an API reference, you had to make a POST request to the API with your personal details.
Q6. Coding question: find the indices of 2 numbers whose total equals a target in a list, without using a nested for loop. l = [2, 15, 5, 7], t = 9, output: [0, 3]
Find the indices of 2 numbers whose total equals the target in a list without a nested for loop.
Use dictionary to store the difference between target and each element of list.
Iterate through list and check if element is in dictionary.
Return the indices of the two elements that add up to target.
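A minimal sketch of the dictionary approach, using the list and target from the question. The function name is an assumption; this variant stores each value already seen and looks up the complement, which is equivalent to storing the differences.

```python
def two_sum_indices(nums, target):
    """Return indices of two numbers adding up to target, or None."""
    seen = {}  # maps a value already visited to its index
    for i, x in enumerate(nums):
        need = target - x          # the partner value required for x
        if need in seen:
            return [seen[need], i]
        seen[x] = i
    return None


result = two_sum_indices([2, 15, 5, 7], 9)
print(result)
```

Each element is visited once and dictionary lookups are constant time on average, so the whole pass is linear in the list length.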
How can you tune the hyperparameters of the XGBoost algorithm?
Q8. How would you measure model effectiveness without using any of confusion matrix metrics given the data is highly imbalanced
One way to measure model effectiveness without using confusion matrix metrics is by using area under the receiver operating characteristic curve (AUC-ROC).
Calculate the AUC-ROC score to evaluate the model's ability to distinguish between positive and negative classes.
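AUC-ROC considers the entire range of classification thresholds, and its value does not depend on the class ratio; under heavy imbalance it can still look optimistic, so it is often reported alongside the precision-recall AUC or log loss.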
Higher AUC-ROC score indicates better model performance.
Example: A model with an AUC-ROC score of 0.85 perform...
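As a rough illustration of what the score means, AUC-ROC equals the probability that a randomly chosen positive gets a higher score than a randomly chosen negative. The sketch below computes it that way in plain Python; the function name and the toy data are assumptions, and in practice a library routine would normally be used instead.

```python
def roc_auc(labels, scores):
    """AUC-ROC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))


auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)
```

The pairwise loop is quadratic, so this form is only meant to show the definition on small data.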
Q9. What is tokenization in NLP? And, to get raw tokens for a sentence with words separated by spaces, why use tokenizers from nltk instead of str.split()?
Tokenization in NLP is the process of breaking down text into smaller units called tokens.
Tokenization is a fundamental step in NLP for text preprocessing.
Tokens can be words, phrases, or even individual characters.
Tokenization helps in preparing text data for further analysis or modeling.
NLTK tokenizers provide additional functionalities like handling contractions, punctuation, etc.
str.split() may not handle complex tokenization scenarios as effectively as NLTK tokenizers.
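To see what str.split() misses, the sketch below compares it with a small regex tokenizer. The regex is a stand-in written for illustration, not NLTK's actual rules; NLTK's tokenizers apply more elaborate handling, but the punctuation problem is the same.

```python
import re


def simple_tokenize(text):
    """Split into words (keeping internal apostrophes) and single punctuation marks."""
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", text)


sentence = "Don't panic, it's fine."
split_tokens = sentence.split()            # punctuation stays glued to words
regex_tokens = simple_tokenize(sentence)   # punctuation becomes separate tokens

print(split_tokens)
print(regex_tokens)
```

With str.split(), "panic," and "panic" would count as different tokens, which inflates the vocabulary.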
Q10. you have two different vectors with only small change in one of the dimensions. but, the predictions/output from the model is drastically different for each vector. can you explain why this can be the case? and...
Small change in one dimension causing drastic difference in model output. Explanation and solution.
This is known as sensitivity to input
It can be caused by non-linearities in the model or overfitting
Regularization techniques can be used to reduce sensitivity
Cross-validation can help identify overfitting
Ensemble methods can help reduce sensitivity
It is generally a bad thing as it indicates instability in the model
Q11. In which direction does the fluid flow in a vertical pipe when the pressures at two vertical locations are given?
The direction of fluid flow in a vertical pipe of constant diameter depends on the pressure difference between the two locations compared with the hydrostatic pressure of the fluid column between them.
Fluid flows from higher total head (pressure plus elevation, p + ρgz) to lower total head, not simply from high pressure to low pressure.
If the pressure at the lower location exceeds the pressure at the upper location by more than ρgh, where h is the height difference, the fluid flows upwards.
If it exceeds it by less than ρgh, or the pressure at the upper location is higher, the fluid flows downwards; if the difference equals ρgh exactly, there is no flow.
The size of the imbalance between the pressure difference and ρgh determines the rate of fluid flow.
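The comparison against the hydrostatic term can be written as a short check. This is a sketch under stated assumptions: an incompressible fluid, a pipe of constant diameter (so velocity head cancels), and water density as the default; the function name and parameters are hypothetical.

```python
def flow_direction(p_lower, p_upper, height, density=1000.0, g=9.81):
    """Pressures in Pa, height in m (upper point minus lower point), density in kg/m^3."""
    # Net driving pressure: pressure difference minus the weight of the fluid column.
    driving = p_lower - p_upper - density * g * height
    if driving > 0:
        return "upward"
    if driving < 0:
        return "downward"
    return "no flow"


print(flow_direction(200_000, 100_000, 5))  # difference larger than rho*g*h
print(flow_direction(120_000, 100_000, 5))  # difference smaller than rho*g*h
```

In the second call the lower pressure is still higher, yet the flow is downward because 20 kPa cannot support a 5 m water column (about 49 kPa).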
Q12. How can you tune the hyperparameters of the XGBoost, Random Forest, and SVM algorithms?
Hyperparameters of XGBoost, Random Forest, and SVM can be tuned using techniques like grid search, random search, and Bayesian optimization.
For XGBoost, important hyperparameters to tune include learning rate, maximum depth, and number of estimators.
For Random Forest, important hyperparameters to tune include number of trees, maximum depth, and minimum samples split.
For SVM, important hyperparameters to tune include kernel type, regularization parameter, and gamma value.
Grid ...
Q13. how will you get the embeddings of long sentences/paragraphs that transformer models like BERT truncate? how will you go about using BERT for such sentences? will you use sentence embeddings or word embeddings ...
To embed long sentences or paragraphs that BERT would truncate, split the input into segments that fit the model's maximum length (512 tokens for standard BERT), embed each segment, and combine the segment embeddings with mean or max pooling.
A sliding window with overlapping segments preserves context across segment boundaries.
For using BERT on such long inputs, we can use sentence embeddings or word embeddings depending on the task.
Models like Longformer and Reformer can handle lon...
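The sliding-window idea can be sketched without any model. In the example below, `sliding_windows` and `mean_pool` are hypothetical helper names, and the per-window vectors are made-up numbers standing in for real BERT outputs.

```python
def sliding_windows(tokens, window_size, stride):
    """Split a token list into overlapping windows of at most window_size tokens."""
    if window_size <= 0 or stride <= 0:
        raise ValueError("window_size and stride must be positive")
    if not tokens:
        return []
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + window_size])
        if start + window_size >= len(tokens):
            break                      # the last window reached the end
        start += stride
    return windows


def mean_pool(vectors):
    """Average a list of equal-length vectors element by element."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


windows = sliding_windows(list(range(10)), window_size=4, stride=2)
# Placeholder embeddings, one per window; a real pipeline would call the model here.
fake_embeddings = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
pooled = mean_pool(fake_embeddings)
print(windows)
print(pooled)
```

A stride smaller than the window size gives the overlap; a stride equal to the window size gives disjoint chunks.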
Q14. What are the types of ML algorithms? Give an example of each.
There are several types of ML algorithms, including supervised learning, unsupervised learning, and reinforcement learning.
Supervised learning: algorithms learn from labeled data to make predictions or classifications (e.g., linear regression, decision trees)
Unsupervised learning: algorithms find patterns or relationships in unlabeled data (e.g., clustering, dimensionality reduction)
Reinforcement learning: algorithms learn through trial and error by interacting with an enviro...
I was asked about joins, their types and data movements during join operations.
I was asked questions about Transactions and ACID properties,
Then he gave 2 tables and some SQL statements. I w...
Q16. Q2.) Difference between list and tuple? a = [1,2,3,4,5,6,7,8,9] print(a[-1:-5]) Without running this code in compiler, tell the output
The code will output an empty list as a result of slicing from -1 to -5 in the list 'a'.
Slicing in Python allows you to access a subset of elements in a list or tuple.
When slicing, the start index is inclusive and the end index is exclusive.
In this case, a[-1:-5] results in an empty list because, with the default step of 1, the start position (index 8) already lies past the stop position (index 4).
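A short check of the slice from the question, together with two nearby variants that do return elements:

```python
a = [1, 2, 3, 4, 5, 6, 7, 8, 9]

forward = a[-1:-5]        # start at index 8, stop at index 4, step +1: nothing to take
backward = a[-1:-5:-1]    # same bounds with step -1: walks from index 8 down to 5
tail = a[-5:-1]           # start at index 4, stop before index 8

print(forward, backward, tail)
```

The negative step is what makes a slice run right to left; negative indices alone do not reverse the direction.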
How to fit a time series model? State all the steps you would follow.
Q18. Why do we use machine learning? Machine learning is used to analyse data and make predictions; combined with additional algorithms, it is mainly used for prediction and AI.
Machine learning is used for data analysis and prediction, and it underpins many AI applications.
Machine learning is a subset of artificial intelligence that focuses on predicting outcomes by learning from data.
It involves using algorithms to learn patterns and make predictions based on new data.
Examples include image recognition, natural language processing, and recommendation systems.
Q19. How to retain special characters (that pandas discards by default) in the data while reading it?
To retain special characters in pandas data, use encoding parameter while reading the data.
Use encoding parameter while reading the data in pandas
Specify the encoding type of the data file
Example: pd.read_csv('filename.csv', encoding='utf-8')
Q20. Q1.) Given sample data in text, read it in python Solution: Take the text to notepad, save it as CSV and then read it in python Check the number of null values Check the number of unique values Make a new colum...
Read sample data in text, check for null and unique values, create new column by multiplying two features
Save text data as CSV and read in Python using pandas
Use isnull() to check for null values
Use nunique() to check for unique values
Create a new column by multiplying two existing columns
Add the new column to the existing dataframe
Q21. How did you prevent your model from overfitting ? What did you do when it was underfit ?
To prevent overfitting, I used techniques like regularization, cross-validation, and early stopping. For underfitting, I tried increasing model complexity and adding more features.
Used regularization techniques like L1 and L2 regularization to penalize large weights
Used cross-validation to evaluate model performance on different subsets of data
Used early stopping to prevent the model from continuing to train when performance on validation set stops improving
For underfitting, ...
Q22. 4. What is the difference between Linear Regression and Logistic Regression?
Linear Regression is used for predicting continuous numerical values, while Logistic Regression is used for predicting binary categorical values.
Linear Regression predicts a continuous output, while Logistic Regression predicts a binary output.
Linear Regression uses a linear equation to model the relationship between the independent and dependent variables, while Logistic Regression uses a logistic function.
Linear Regression assumes a linear relationship between the variables...
Q23. What do these hyper parameters in the above mentioned algorithms actually mean?
Hyperparameters are settings that control the behavior of machine learning algorithms.
Hyperparameters are set before training the model.
They control the learning process and affect the model's performance.
Examples include learning rate, regularization strength, and number of hidden layers.
Optimizing hyperparameters is important for achieving better model accuracy.
What are outlier values and how do you treat them?
No questions were asked.
I was asked if I could relocate to Pune. We discussed the work hours and days.
We discussed the compensation.
That's it
Q26. Write pandas query to separate the names as first and last name from the full name. Drop the duplicate columns and also the missing values. Write output for the Python code. Write SQL query to retrieve the name...
Answering questions related to data science concepts and techniques.
Recall is the ratio of correctly predicted positive observations to the total actual positives. Precision is the ratio of correctly predicted positive observations to the total predicted positives.
To reduce variance, ensemble techniques like bagging and stacking can be used; boosting mainly targets bias. Bagging involves training multiple models on different bootstrap samples of the data and averaging their predictions. Boost...
Q27. 3. How do you deal with senior customer when you don't have enough data?
Communicate transparently and offer alternative solutions.
Explain the limitations of the available data and the potential risks of making decisions based on incomplete information.
Offer alternative solutions that can be implemented with the available data.
Collaborate with the customer to identify additional data sources or explore other options to gather more data.
Provide regular updates on the progress of data collection and analysis.
Ensure that all decisions are based on so...
Q28. Which test is used in logistic regression to check the significance of the variable
The Wald test is used in logistic regression to check the significance of the variable.
The Wald statistic is the ratio of the estimated coefficient to its standard error.
Under the null hypothesis this ratio is approximately standard normal, and its square follows a chi-square distribution with one degree of freedom.
A small p-value indicates that the variable is significant.
For example, in Python, the statsmodels library provides the Wald test in the summary of a logistic regression model.
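The arithmetic behind the test is small enough to write out by hand. The sketch below uses made-up coefficient and standard-error values and a hypothetical function name; it relies on the normal approximation for the p-value.

```python
import math


def wald_test(coef, std_err):
    """Return the Wald z statistic, its square, and the two-sided p-value."""
    z = coef / std_err
    chi2 = z * z                                      # chi-square statistic, 1 degree of freedom
    p_value = math.erfc(abs(z) / math.sqrt(2.0))      # two-sided tail of the standard normal
    return z, chi2, p_value


# Hypothetical estimates for one variable of a fitted logistic regression.
z, chi2, p = wald_test(coef=0.98, std_err=0.5)
print(z, chi2, p)
```

With these numbers z is 1.96, which sits right at the conventional 5% significance boundary.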
Q29. What will the resultant table look like when you "merge" two tables on a matching column and the second table has many repeated keys?
The resultant table will have all the columns from both tables and the rows will be a combination of matching rows.
The resultant table will have all the columns from both tables
The rows in the resultant table will be a combination of matching rows
If the second table has repeated keys, each matching row of the first table is repeated once per matching row of the second, so the result can have more rows than the first table.
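A small pandas example makes the row multiplication visible. The table contents and column names below are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical tables: keys are unique on the left and repeated on the right.
left = pd.DataFrame({"key": ["a", "b", "c"], "left_val": [1, 2, 3]})
right = pd.DataFrame({"key": ["a", "a", "a", "b"], "right_val": [10, 11, 12, 20]})

inner = left.merge(right, on="key", how="inner")      # only keys present in both tables
left_join = left.merge(right, on="key", how="left")   # keeps "c" with a missing right_val

print(inner)
print(left_join)
```

Passing `validate="one_to_one"` to `merge` makes pandas raise an error when keys repeat, which is a cheap guard against unexpected row growth.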
What are the assumptions of linear regression model?
Q31. 1) Model building process of one of my previous projects 2) Random forest hyperparameters 3) ROC curve, using the ROC curve to set probability cutoffs in classification models 4) Gradient boosting techniques like...
Data Scientist interview questions on model building, random forest, ROC curve, gradient boosting, and real estate valuation
For model building, I followed the CRISP-DM process and used various algorithms like logistic regression, decision trees, and random forest
Random forest hyperparameters include number of trees, maximum depth, minimum samples split, and minimum samples leaf
ROC curve is a graphical representation of the trade-off between true positive rate and false positi...
Q32. What is R square and how R square is different from Adjusted R square
R square is a statistical measure that represents the proportion of the variance in the dependent variable explained by the independent variables.
R square is a value between 0 and 1, where 0 indicates that the independent variables do not explain any of the variance in the dependent variable, and 1 indicates that they explain all of it.
It is used to evaluate the goodness of fit of a regression model.
Adjusted R square takes into account the number of predictors in the model an...
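The usual adjustment is 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of samples and p the number of predictors. A minimal sketch with made-up numbers (the function name is an assumption):

```python
def adjusted_r2(r2, n_samples, n_predictors):
    """Adjusted R-squared: penalizes R-squared for the number of predictors."""
    dof = n_samples - n_predictors - 1
    if dof <= 0:
        raise ValueError("need more samples than predictors plus one")
    return 1.0 - (1.0 - r2) * (n_samples - 1) / dof


few = adjusted_r2(0.90, n_samples=101, n_predictors=10)
many = adjusted_r2(0.90, n_samples=101, n_predictors=50)
print(few, many)
```

For the same R square, the model with more predictors gets the lower adjusted value, which is the point of the correction.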
Q33. What is the neighbourhood in which superhosts have the biggest median price difference with respect to non superhosts?
The neighbourhood with the biggest median price difference between superhosts and non superhosts is X.
Calculate the median price for superhosts and non superhosts in each neighbourhood
Find the neighbourhood with the largest difference in median prices between superhosts and non superhosts
Example: Neighbourhood X has a median price of $200 for superhosts and $150 for non superhosts, resulting in a $50 difference
What problems does multicollinearity cause in regression analysis?
Q35. How to fit a time series model? State all the steps you would follow.
Steps to fit a time series model
Identify the time series pattern
Choose a suitable model
Split data into training and testing sets
Fit the model to the training data
Evaluate model performance on testing data
Refine the model if necessary
Forecast future values using the model
Q36. How to read large .csv files in pandas quickly?
Use pandas' read_csv() method with appropriate parameters to read large .csv files quickly.
Use the chunksize parameter to read the file in smaller chunks
Use the low_memory parameter to optimize memory usage
Use the dtype parameter to specify data types for columns
Use the usecols parameter to read only necessary columns
Use the skiprows parameter to skip unnecessary rows
Use the nrows parameter to read only a specific number of rows
Use the na_values parameter to specify values to...
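Several of these parameters can be combined in one call. The sketch below reads a small in-memory CSV in place of a large file so it can run anywhere; the column names and chunk size are assumptions.

```python
import io

import pandas as pd

# Small in-memory CSV standing in for a large file on disk.
csv_text = "id,category,amount\n" + "\n".join(
    f"{i},{'a' if i % 2 else 'b'},{i * 1.5}" for i in range(10)
)

total = 0.0
rows = 0
reader = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["id", "amount"],                     # skip columns that are not needed
    dtype={"id": "int32", "amount": "float32"},   # smaller types, no type inference
    chunksize=4,                                  # iterate over the file 4 rows at a time
)
for chunk in reader:
    total += float(chunk["amount"].sum())
    rows += len(chunk)

print(rows, total)
```

Aggregating per chunk keeps only one chunk in memory at a time, which is what makes this pattern usable on files larger than RAM.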
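Q37. How does lookup happen in a list when you do my_list[5]?
my_list[5] retrieves the 6th element of the list. In CPython a list is a contiguous array of object references, so the element is fetched directly by its offset in constant time, without scanning the list.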
Indexing starts from 0 in Python.
The integer inside the square brackets is the index of the element to retrieve.
If the index is out of range, an IndexError is raised.
Q38. What is the bias-variance tradeoff? Explain p-values to non-technical and technical audiences.
The bias-variance tradeoff is the balance between underfitting (high bias) and overfitting (high variance). A p-value is the probability of seeing results at least as extreme as those observed if the null hypothesis were true.
The bias-variance tradeoff is the tradeoff between the model's ability to fit the training data and its ability to generalize to new data.
Overfitting occurs when the model is too complex and fits the training data too closely, resulting in poor performance on new data.
Underfitting occurs when the model is too simple and fails to capture the underlyi...
What are different measures used to check performance of classification model?
Q40. DBMS question - What are joins and what are their types?
Joins are used in DBMS to combine rows from two or more tables based on a related column between them.
Types of joins include INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL JOIN.
INNER JOIN returns rows when there is at least one match in both tables.
LEFT JOIN returns all rows from the left table and the matched rows from the right table.
RIGHT JOIN returns all rows from the right table and the matched rows from the left table.
FULL JOIN returns rows when there is a match in one of ...
Q41. Why was this model/ approach used instead of others ?
The model/approach was chosen based on its accuracy, interpretability, and scalability.
The chosen model/approach had the highest accuracy compared to others.
The chosen model/approach was more interpretable and easier to explain to stakeholders.
The chosen model/approach was more scalable and could handle larger datasets.
Other models/approaches were considered but did not meet the requirements or had limitations.
The chosen model/approach was also more suitable for the specific ...
Q42. What is the purpose of lambda functions when regular functions (defined with def) exist? How are they different?
Lambda functions are anonymous functions used for short and simple operations. They are different from regular functions in their syntax and usage.
Lambda functions are defined without a name and keyword 'lambda' is used to define them.
They can take any number of arguments but can only have one expression.
They are commonly used in functional programming and as arguments to higher-order functions.
Lambda functions are often used for short and simple operations that do not requir...
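A brief side-by-side illustration; the names and data are made up for the example.

```python
def square(x):
    """A regular function: has a name, can hold several statements and a docstring."""
    return x * x


square_lambda = lambda x: x * x   # a single expression, no statements allowed

words = ["pear", "fig", "banana"]
# Typical use: a throwaway key function passed to a higher-order function.
by_length = sorted(words, key=lambda w: len(w))

print(square(4), square_lambda(4), by_length)
print(square.__name__, square_lambda.__name__)
```

Both forms produce ordinary function objects; the difference is mainly that a lambda is limited to one expression and carries the generic name `<lambda>`.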
Q43. How to find if a point is inside or outside of a regular polygon?
To find if a point is inside or outside of a regular polygon, we can use the Ray Casting algorithm.
Draw a line from the point to a point outside the polygon
Count the number of times the line intersects with the polygon edges
If the number of intersections is odd, the point is inside the polygon
If the number of intersections is even, the point is outside the polygon
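The parity rule above can be sketched with a horizontal ray cast to the right of the point. The function name and the square used as a test polygon are assumptions; points lying exactly on an edge are not handled specially here.

```python
def point_in_polygon(x, y, vertices):
    """Ray casting: toggle 'inside' each time a rightward ray crosses an edge."""
    inside = False
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]      # next vertex, wrapping around
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x_cross > x:                 # crossing lies to the right of the point
                inside = not inside
    return inside


square = [(0, 0), (2, 0), (2, 2), (0, 2)]
print(point_in_polygon(1, 1, square))
print(point_in_polygon(3, 1, square))
```

The same routine works for any simple polygon, not only regular ones, because it depends only on the list of edges.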
Q44. Pattern based - Three memory chips, each of 1GB. You have to store 3GB of data in these chips in such a way that even if one memory chip is corrupted, no data is lost.
Use a RAID 5 style layout that stripes data across the three memory chips with parity for fault tolerance.
Distribute data blocks and parity blocks across all three memory chips.
If one memory chip is corrupted, its contents can be reconstructed by XORing the corresponding blocks on the other two chips.
Note that parity takes up one chip's worth of space, so three 1GB chips hold 2GB of protected data; fitting the full 3GB would require the data to be compressible to 2GB first.
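The reconstruction step relies on XOR parity. The sketch below shows a single stripe with two data blocks and one parity block; the block contents are made up, and real RAID 5 additionally rotates which chip holds the parity from stripe to stripe.

```python
def xor_bytes(a, b):
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


# One stripe: two data blocks on two chips, parity on the third.
block_a = b"DATA-ONE"
block_b = b"DATA-TWO"
parity = xor_bytes(block_a, block_b)

# If the chip holding block_a fails, rebuild it from the surviving two.
recovered_a = xor_bytes(parity, block_b)
# Likewise for a failure of the chip holding block_b.
recovered_b = xor_bytes(parity, block_a)

print(recovered_a, recovered_b)
```

Losing the parity chip costs nothing, since parity can be recomputed from the two data blocks.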
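Q45. In what scenarios would you advise me not to use ReLU in my hidden layers?
Avoid ReLU when many units receive negative pre-activations and risk "dying", or when a layer needs negative or bounded outputs.
When negative inputs matter, use Leaky ReLU or ELU instead, which keep a nonzero gradient for negative values.
Switching to tanh or sigmoid does not fix vanishing gradients; these saturating functions are more prone to them than ReLU.
In some cases, using ReLU in all layers can lead to dead neurons that output zero for every input and stop learning.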
Consider the nature of your data and the problem you are trying to solve before choosing an activation function.
Q46. How is your project related to a business problem, and how have you solved it?
Developed a predictive model to identify potential customer churn for a telecom company
Identified key factors contributing to customer churn through exploratory data analysis
Built a logistic regression model to predict customer churn with 85% accuracy
Provided actionable insights to the business team to reduce customer churn and improve customer retention
Implemented the model in production environment using Python and SQL
Q47. how to create dictionaries in python with repeated keys?
A Python dictionary cannot hold the same key twice; to keep several values under one key, use defaultdict(list) from the collections module.
Import the collections module
Create a defaultdict object
Append values under the same key multiple times
Access the values using the key
Example: from collections import defaultdict; d = defaultdict(list); d['key'].append('value1'); d['key'].append('value2')
Q48. 1. How to choose optimum probability threshold from ROC?
To choose optimum probability threshold from ROC, we need to balance between sensitivity and specificity.
Choose the threshold that maximizes the sum of sensitivity and specificity
Use Youden's J statistic to find the optimal threshold
Consider the cost of false positives and false negatives
Use cross-validation to evaluate the performance of different thresholds
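Youden's J is sensitivity + specificity - 1, evaluated at every candidate threshold. A plain-Python sketch on made-up labels and scores (the function name is an assumption, and equal costs for both error types are assumed):

```python
def best_threshold_youden(labels, scores):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1."""
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):            # each observed score is a candidate cutoff
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= t)
        fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= t)
        j = tp / pos - fp / neg              # equals sensitivity + specificity - 1
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j


labels = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.1, 0.2, 0.55, 0.6, 0.5, 0.7, 0.8, 0.9]
threshold, j_stat = best_threshold_youden(labels, scores)
print(threshold, j_stat)
```

When false positives and false negatives have different costs, the quantity to maximize should weight the two terms accordingly instead of using J directly.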
Q49. Explain eigenvectors and eigenvalues. What purpose do they serve in ML?
Eigenvalues and eigenvectors are linear algebra concepts used in machine learning for dimensionality reduction and feature extraction.
Eigenvalues represent the scaling factor of the eigenvectors.
Eigenvectors are the directions along which a linear transformation acts by stretching or compressing.
In machine learning, eigenvectors are used for principal component analysis (PCA) to reduce the dimensionality of data.
Eigenvalues and eigenvectors are also used in image processing f...
Q50. Why do optimisers matter? What's their purpose? What do they do in addition to the weight updates that vanilla gradient descent and back-prop do?
Optimizers are used to improve the efficiency and accuracy of the training process in machine learning models.
Optimizers help in finding the optimal set of weights for a given model by minimizing the loss function.
They use various techniques like momentum, learning rate decay, and adaptive learning rates to speed up the training process.
Momentum and adaptive methods can help the optimizer move past saddle points and shallow local minima, though they do not guarantee a global minimum or better generalization on their own.
Examples of optimizers...
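To make the momentum point concrete, the sketch below compares plain gradient descent with a momentum update on a toy, badly scaled quadratic. The loss function, learning rate, and momentum coefficient are assumptions picked for illustration, not recommended settings.

```python
def grad(x, y):
    """Gradient of f(x, y) = 0.5 * (x**2 + 100 * y**2)."""
    return x, 100.0 * y


def loss(x, y):
    return 0.5 * (x * x + 100.0 * y * y)


def run_gd(steps, lr):
    """Vanilla gradient descent from (1, 1)."""
    x, y = 1.0, 1.0
    for _ in range(steps):
        gx, gy = grad(x, y)
        x -= lr * gx
        y -= lr * gy
    return loss(x, y)


def run_momentum(steps, lr, beta):
    """Gradient descent with a velocity term that accumulates past gradients."""
    x, y = 1.0, 1.0
    vx, vy = 0.0, 0.0
    for _ in range(steps):
        gx, gy = grad(x, y)
        vx = beta * vx - lr * gx
        vy = beta * vy - lr * gy
        x += vx
        y += vy
    return loss(x, y)


gd_loss = run_gd(steps=100, lr=0.01)
momentum_loss = run_momentum(steps=100, lr=0.01, beta=0.9)
print(gd_loss, momentum_loss)
```

The learning rate is capped by the steep direction, so plain descent crawls along the flat one; the velocity term builds up speed there, which is why the momentum run ends at a lower loss in the same number of steps.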