Data Scientist

800+ Data Scientist Interview Questions and Answers

Updated 28 Feb 2025
Q101. Difference between supervised and unsupervised learning; machine learning vs deep learning; CNN vs RNN

Ans.

Supervised learning uses labeled data for training, while unsupervised learning uses unlabeled data. Machine learning is a subset of AI, while deep learning is a subset of machine learning. CNNs are used for image recognition, while RNNs are used for sequential data.

  • Supervised learning requires labeled data for training, where the model learns to map input data to output labels (e.g., classification or regression tasks). Examples include linear regression, logistic regression...read more

Q102. Explain precision and recall. In which scenarios is each used?

Ans.

Precision and recall are metrics used in evaluating the performance of classification models.

  • Precision measures the accuracy of positive predictions, while recall measures the ability of the model to find all positive instances.

  • Precision = TP / (TP + FP)

  • Recall = TP / (TP + FN)

  • Precision is important when false positives are costly, while recall is important when false negatives are costly.

  • For example, in a spam email detection system, high precision is desired to avoid classif...read more
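The two formulas above can be checked on a small example. The following is a minimal sketch, assuming binary labels where 1 marks the positive class; the function name `precision_recall` and the sample labels are illustrative, not part of the original answer.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels: 4 actual positives, 4 predicted positives, 3 in common.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1]
precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)
```

Here TP = 3, FP = 1 and FN = 1, so both metrics come out to 3/4.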

Q103. What are random forest and KNN?

Ans.

Random forest and KNN are machine learning algorithms used for classification and regression tasks.

  • Random forest is an ensemble learning method that constructs multiple decision trees and combines their outputs to make a final prediction.

  • KNN (k-nearest neighbors) is a non-parametric algorithm that classifies new data points based on the majority class of their k-nearest neighbors in the training set.

  • Random forest is useful for handling high-dimensional data and avoiding overf...read more

Q104. What are gradients? (Not in relation to machine learning)

Ans.

The gradient of a function is the vector of its partial derivatives with respect to each of its variables; it describes how the function's value changes as each variable changes.

  • Gradients are used in calculus to measure the rate of change of a function.

  • They are represented as vectors and indicate the direction of steepest ascent.

  • Gradients are used in optimization problems to find the minimum or maximum value of a function.

  • They are also used in physics: for a conservative force field, the force acting on a particle is the negative gradient of its potential energy.

  • Gradients can be calculated using partial derivatives.
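As an illustration of computing a gradient from partial derivatives, here is a minimal sketch that approximates each partial derivative with a central finite difference. The helper `numerical_gradient`, the step size `h`, and the example function are assumptions chosen for illustration.

```python
def numerical_gradient(f, point, h=1e-6):
    """Approximate the gradient of f at `point` with central differences."""
    grad = []
    for i in range(len(point)):
        forward = list(point)
        backward = list(point)
        forward[i] += h
        backward[i] -= h
        # Partial derivative with respect to the i-th variable.
        grad.append((f(forward) - f(backward)) / (2 * h))
    return grad


def f(v):
    # Example function f(x, y) = x^2 + 3y; its exact gradient is (2x, 3).
    x, y = v
    return x ** 2 + 3 * y


grad = numerical_gradient(f, [2.0, 1.0])
print(grad)  # close to [4.0, 3.0]
```

The result points in the direction of steepest ascent of `f` at the point (2, 1).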

Q105. What is Python? What is your name? What is your dream?

Ans.

Python is a high-level, interpreted programming language used for web development, data analysis, and artificial intelligence.

  • Python is easy to learn and has a simple syntax

  • It supports multiple programming paradigms such as object-oriented, functional, and procedural programming

  • Python has a vast collection of libraries and frameworks for various purposes such as NumPy, Pandas, Django, Flask, etc.

  • Python is widely used in data science and machine learning due to its simplicity ...read more

Q106. What are your relevant projects in data science, and which tools and technologies do you have expertise in?

Ans.

Relevant projects in Data Science and expertise in tools and technologies

  • Projects: Predictive modeling, Natural Language Processing, Computer Vision, Recommender Systems, Time Series Analysis

  • Tools: Python, R, SQL, Tableau, Hadoop, Spark, TensorFlow, Keras, Scikit-learn

  • Technologies: Machine Learning, Deep Learning, Big Data, Cloud Computing, Data Visualization

Q107. How do you handle outliers? How do you handle an imbalanced dataset? What are some feature engineering techniques?

Ans.

Outliers can be handled by removing, transforming or imputing them. Imbalanced datasets can be handled by resampling techniques. Feature engineering involves creating new features from existing ones.

  • Outliers can be detected using statistical rules like the z-score or IQR, and then removed.

  • Outliers can be transformed using techniques like log transformation or box-cox transformation.

  • Outliers can be imputed using techniques like mean imputation or regression imputation.

  • Imbalanced datasets can be ha...read more
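The IQR rule for outliers mentioned above can be sketched as follows. This is a minimal example using only the standard library; the multiplier `k=1.5` is the conventional choice, and the function name and sample data are hypothetical.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical measurements with one obviously extreme value.
data = [10, 12, 11, 13, 12, 11, 10, 95]
outliers = iqr_outliers(data)
print(outliers)
```

Once flagged, the values can be dropped, transformed, or replaced, depending on which of the strategies above suits the data.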

Q108. What is Encoder Decoder? What is a Transformer model and explain its architecture?

Ans.

Encoder Decoder is a neural network architecture used for sequence-to-sequence tasks. Transformer model is a type of neural network architecture that relies entirely on self-attention mechanisms.

  • Encoder Decoder is commonly used in machine translation tasks where the input sequence is encoded into a fixed-length vector representation by the encoder and then decoded into the target sequence by the decoder.

  • Transformer model consists of an encoder and a decoder, both of which are...read more

Q109. Find unique keys in 2 dictionaries

Ans.

Use set operations on the keys of the two dictionaries to find the unique keys.

  • Create a set of keys for each dictionary

  • Use set operations to find the unique keys

  • Return the unique keys
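The question can be read in more than one way, so the sketch below shows several readings: the distinct keys across both dictionaries, the keys present in only one of them, and the keys present only in the first. The dictionaries `a` and `b` are made-up examples.

```python
a = {"id": 1, "name": "x", "age": 30}
b = {"id": 2, "name": "y", "city": "Pune"}

# Dictionary key views support set operators directly.
all_keys = a.keys() | b.keys()      # every distinct key across both dicts
only_in_one = a.keys() ^ b.keys()   # keys that appear in exactly one dict
only_in_a = a.keys() - b.keys()     # keys that appear in a but not in b

print(sorted(all_keys), sorted(only_in_one), sorted(only_in_a))
```

Each expression returns a `set`, so sort the result if a stable order is needed.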

Q110. Given a scenario to set up a RAG system, how would you do it?

Ans.

Setting up a RAG system involves defining criteria for red, amber, and green statuses to track progress or performance.

  • Define clear criteria for red, amber, and green statuses based on key metrics or thresholds.

  • Establish a method for regularly monitoring and updating the status of each item or project.

  • Communicate the RAG system and its criteria to all stakeholders to ensure understanding and consistency.

  • Use visual indicators such as color-coding or dashboards to easily identi...read more

Q111. How do you handle a negative outcome?

Ans.

I handle negative outcomes by analyzing the root cause and identifying potential solutions.

  • I remain calm and objective to avoid making hasty decisions.

  • I review the data and identify the root cause of the negative outcome.

  • I brainstorm potential solutions and evaluate their feasibility.

  • I implement the best solution and monitor its effectiveness.

  • I use the experience as a learning opportunity to improve future outcomes.

Q112. What is the difference between an array and a DataFrame?

Ans.

Arrays (such as NumPy arrays) are homogeneous, n-dimensional data structures, while DataFrames are two-dimensional, labeled tables used in data analysis.

  • Arrays can have any number of dimensions but hold only one data type, while DataFrames are two-dimensional and can hold a different data type in each column.

  • Dataframes are commonly used in data analysis with libraries like Pandas in Python, while arrays are more basic data structures.

  • Arrays are typically used for simple data storage and manipulation, while dataframes are used for...read more

Q113. How to plot scatter plot of 1000 features at a time?

Ans.

Use dimensionality reduction techniques like PCA or t-SNE to reduce the number of features and plot the scatter plot.

  • Apply Principal Component Analysis (PCA) to reduce the dimensionality of the data

  • Use t-Distributed Stochastic Neighbor Embedding (t-SNE) for non-linear dimensionality reduction

  • Plot the scatter plot using the reduced feature set

Q114. Why Machine Learning?

Ans.

Machine learning enables computers to learn from data and make predictions or decisions without being explicitly programmed.

  • Machine learning can automate and optimize complex processes

  • It can help identify patterns and insights in large datasets

  • It can improve accuracy and efficiency in decision-making

  • Examples include image recognition, natural language processing, and predictive analytics

  • It can also be used for anomaly detection and fraud prevention

Q115. Rate yourself in Python and deep dive into the Python programming language

Ans.

I rate myself 8/10 in Python. I have experience in data manipulation, visualization, and machine learning.

  • Proficient in Python libraries such as Pandas, NumPy, Matplotlib, and Scikit-learn

  • Experience in data cleaning, preprocessing, and feature engineering

  • Developed machine learning models for classification, regression, and clustering

  • Familiar with deep learning frameworks such as TensorFlow and Keras

  • Implemented neural networks for image classification and natural language proc...read more

Q116. what types of errors do you accept (you can see the non meaningfulness of the question already)?

Ans.

I accept errors that are within an acceptable margin of error and do not significantly impact the overall analysis.

  • Acceptable margin of error depends on the specific analysis and its purpose

  • Errors that significantly impact the analysis should be addressed and corrected

  • Human error should be minimized through careful data collection and analysis

  • Errors should be documented and reported to ensure transparency and reproducibility

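Q117. How do you convert a 3D object collision problem to a 2D collision problem?

Ans.

A 3D object collision problem can be reduced to 2D by projecting the objects onto a 2D plane. Overlap of the projections is necessary but not sufficient for a 3D collision, so depth along the projection axis must still be checked.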

  • Project the 3D objects onto a 2D plane using a projection matrix.

  • Calculate the 2D coordinates of the projected objects.

  • Perform collision detection in 2D using standard algorithms.

  • Example: projecting a sphere onto a 2D plane results in a circle.

  • Example: projecting a cube onto a 2D plane parallel to one of its faces results in a square.
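The projection idea can be sketched for spheres, whose orthographic projection onto the xy-plane is a circle of the same radius. The function names and the sample spheres below are hypothetical. The example also shows that an overlap in 2D does not by itself prove a collision in 3D, since the projection discards depth.

```python
import math


def project_to_xy(center):
    """Orthographic projection onto the xy-plane: drop the z coordinate."""
    x, y, _z = center
    return (x, y)


def circles_overlap(c1, r1, c2, r2):
    """2D test: circles overlap when their centers are at most r1 + r2 apart."""
    return math.dist(c1, c2) <= r1 + r2


def spheres_collide(c1, r1, c2, r2):
    """Full 3D test, used here only to compare against the 2D result."""
    return math.dist(c1, c2) <= r1 + r2


# Two unit spheres that are close in x and y but far apart in z.
a_center, b_center, radius = (0.0, 0.0, 0.0), (1.0, 0.0, 10.0), 1.0

overlap_2d = circles_overlap(project_to_xy(a_center), radius,
                             project_to_xy(b_center), radius)
collide_3d = spheres_collide(a_center, radius, b_center, radius)
print(overlap_2d, collide_3d)
```

A common design is to use the cheap 2D test as a filter and run a depth check only on the pairs that pass it.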

Q118. Which loss functions deal with imbalanced data?

Ans.

The loss that deals with imbalanced data is weighted loss.

  • Weighted loss assigns different weights to different classes based on their frequency in the dataset.

  • It helps in giving more importance to minority class samples during training.

  • Examples include weighted cross-entropy loss, focal loss, and class-weighted loss.
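As a rough illustration of a class-weighted loss, the sketch below implements a weighted binary cross-entropy in plain Python. The weights `w_pos` and `w_neg`, the clipping constant `eps`, and the sample predictions are assumptions for the example. In practice the weights are often set inversely proportional to class frequency.

```python
import math


def weighted_binary_cross_entropy(y_true, p_pred, w_pos, w_neg, eps=1e-12):
    """Mean binary cross-entropy with a separate weight per class."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        if y == 1:
            total += -w_pos * math.log(p)
        else:
            total += -w_neg * math.log(1 - p)
    return total / len(y_true)


# One minority-class sample (label 1) that the model predicts poorly.
y_true = [1, 0, 0, 0]
p_pred = [0.2, 0.1, 0.1, 0.1]

unweighted = weighted_binary_cross_entropy(y_true, p_pred, w_pos=1.0, w_neg=1.0)
weighted = weighted_binary_cross_entropy(y_true, p_pred, w_pos=3.0, w_neg=1.0)
print(unweighted, weighted)
```

Raising `w_pos` increases the contribution of the misclassified minority sample, which pushes training to pay more attention to it.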

Q119. Use R as a calculator to compute the following values. After you do so, cut and paste your input and output from R to Word. Add numbering in Word to identify each part of each problem.

Ans.

Using R as a calculator to compute values for a Data Scientist interview question.

  • Use R's console to input mathematical expressions and compute values.

  • Make sure to follow the order of operations (PEMDAS) when entering expressions.

  • Use functions like 'sqrt()' for square roots and 'exp()' for the exponential function e^x; use the '^' operator for general powers.

  • Remember to assign variables using the '<-' operator before using them in calculations.

Q120. What is the difference between logistic regression and linear regression? How do you decide the threshold?

Ans.

Logistic regression is used for classification while linear regression is used for regression. Threshold is decided based on the problem.

  • Logistic regression predicts the probability of an event occurring, while linear regression predicts a continuous outcome.

  • Logistic regression uses a sigmoid function to map the predicted values between 0 and 1.

  • Linear regression uses a linear equation to model the relationship between the independent and dependent variables.

  • The threshold in l...read more

Q121. How do you convince clients that the model is useful? What are the metrics?

Ans.

I convince clients of model usefulness by showcasing its accuracy, precision, recall, and F1 score.

  • Explain the model's accuracy in predicting outcomes compared to actual results

  • Discuss precision - the proportion of true positive predictions out of all positive predictions

  • Highlight recall - the proportion of true positive predictions out of all actual positives

  • Mention F1 score - the harmonic mean of precision and recall, useful for imbalanced datasets

Q122. What is the difference between sigmoid and softmax activation function?

Ans.

Sigmoid is used for binary classification while softmax is used for multi-class classification.

  • Sigmoid function outputs values between 0 and 1, suitable for binary classification tasks.

  • Softmax function outputs a probability distribution over multiple classes, summing up to 1.

  • Sigmoid is used in the output layer for binary classification, while softmax is used for multi-class classification.

  • Softmax is the generalization of the sigmoid function for multiple classes.
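The relationship between the two functions can be checked numerically. In this minimal sketch, a two-class softmax over the logits `[z, 0]` gives the same probability as `sigmoid(z)`. The function names and sample logits are illustrative.

```python
import math


def sigmoid(z):
    """Map a single logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(logits):
    """Map a list of logits to a probability distribution that sums to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


z = 1.5
two_class = softmax([z, 0.0])          # first entry matches sigmoid(z)
three_class = softmax([2.0, 1.0, 0.1])

print(sigmoid(z), two_class[0])
print(three_class, sum(three_class))
```

This is why a single sigmoid output is enough for binary classification, while softmax is used when there are more than two mutually exclusive classes.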

Q123. Data science project pipeline: what components are involved, and what is the step-by-step process?

Ans.

Data science project pipeline involves multiple components and follows a step-by-step process.

  • 1. Define the problem statement and objectives of the project.

  • 2. Collect and preprocess the data needed for analysis.

  • 3. Explore and visualize the data to gain insights.

  • 4. Build and train machine learning models to solve the problem.

  • 5. Evaluate the models using appropriate metrics.

  • 6. Deploy the model into production and monitor its performance.

  • 7. Communicate the results and findings t...read more

Q124. Explain SQL joins, explain join in a given situation

Ans.

SQL joins are used to combine data from two or more tables based on a related column.

  • Joins are used to retrieve data from multiple tables in a single query.

  • Common types of joins are inner join, left join, right join, and full outer join.

  • Joining tables can be done using the JOIN keyword and specifying the columns to join on.

  • Example: SELECT * FROM table1 JOIN table2 ON table1.column = table2.column;

  • Joins can be used to combine data from different tables such as customer and ord...read more

Q125. How does one-vs-rest work for logistic regression?

Ans.

One vs Rest is a technique used to extend binary classification to multi-class problems in logistic regression.

  • It involves training multiple binary classifiers, one for each class.

  • In each classifier, one class is treated as the positive class and the rest as negative.

  • The class with the highest probability is predicted as the final output.

  • It is also known as one vs all or one vs others.

  • Example: In a 3-class problem, we train 3 binary classifiers: class 1 vs rest, class 2 vs re...read more

Q126. What are the different types of algorithms used for classification?

Ans.

There are several algorithms used for classification, including decision trees, logistic regression, k-nearest neighbors, and support vector machines.

  • Decision trees: a tree-like model where each internal node represents a test on an attribute, each branch represents the outcome of the test, and each leaf node represents a class label

  • Logistic regression: a statistical method that uses a logistic function to model a binary dependent variable

  • K-nearest neighbors: a non-parametric...read more

Q127. What is the effect of multicollinearity in regression analysis?

Ans.

Multicollinearity in regression analysis makes coefficient estimates unstable and hard to interpret, even when the model's overall predictions remain reasonable.

  • Multicollinearity occurs when two or more independent variables are highly correlated.

  • It leads to unstable and unreliable estimates of regression coefficients.

  • It reduces the precision of the estimates and increases the standard errors.

  • It makes it difficult to interpret the individual effects of the independent variables.

  • It can be detected using correlation matrix, variance inflatio...read more

Q128. Are all the decision trees the same in a random forest?

Ans.

No, decision trees in a random forest are different due to the use of bootstrapping and feature randomization.

  • Decision trees in a random forest are trained on different subsets of the data through bootstrapping.

  • Each decision tree in a random forest also considers only a random subset of features at each split.

  • The final prediction in a random forest is made by aggregating the predictions of all individual decision trees.

Q129. How will you handle a class-imbalanced dataset to increase the F1 score?

Ans.

Handling class imbalanced dataset involves techniques like resampling, using different algorithms, adjusting class weights, and using ensemble methods.

  • Use resampling techniques like oversampling the minority class or undersampling the majority class.

  • Try using different algorithms that are less sensitive to class imbalance, such as Random Forest or XGBoost.

  • Adjust class weights in the model to give more importance to the minority class.

  • Utilize ensemble methods like bagging or b...read more

Q130. Difference between Precision and Recall? How to select features? What’s p-value?

Ans.

Precision and Recall are evaluation metrics for classification models. Feature selection is important for model performance. P-value is a statistical measure.

  • Precision is the ratio of true positives to the total predicted positives. Recall is the ratio of true positives to the total actual positives.

  • Precision is important when false positives are costly, while recall is important when false negatives are costly.

  • Feature selection involves identifying the most relevant features...read more

Q131. Why data science, given that you come from electrical engineering?

Ans.

Data science offers a new challenge and opportunity to apply analytical skills from my engineering background.

  • Data science allows me to utilize my analytical skills in a new and challenging field.

  • I can apply my knowledge of statistics and programming to extract insights from data.

  • Data science offers opportunities to work on diverse projects and industries.

  • My background in electrical engineering provides a strong foundation for understanding complex systems and data analysis.

Q132. Why is cross-entropy loss used in classification? Why not SSE?

Ans.

Cross-entropy loss is used in classification because it penalizes confident incorrect predictions much more heavily than SSE, making it better suited to probabilistic outputs.

  • For a true class predicted with probability p, cross-entropy is -log(p), which grows without bound as p approaches 0, whereas the squared error stays bounded by 1.

  • Cross entropy loss is commonly used in scenarios where the output is a probability distribution, such as in multi-class classification.

  • SSE (Sum of Squared Errors) is more suitable...read more
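To make the comparison concrete, the sketch below evaluates both losses for a single sample whose true label is 1 as the predicted probability `p` gets worse. The function names and the chosen probabilities are illustrative assumptions.

```python
import math


def cross_entropy(p):
    """Cross-entropy for one sample whose true label is 1."""
    return -math.log(p)


def squared_error(p):
    """Squared error for one sample whose true label is 1."""
    return (1.0 - p) ** 2


# Predicted probability of the true class, from uncertain to confidently wrong.
rows = [(p, cross_entropy(p), squared_error(p)) for p in (0.5, 0.1, 0.01, 0.001)]
for p, ce, se in rows:
    print(f"p={p:<6} cross_entropy={ce:.3f} squared_error={se:.3f}")
```

The squared error saturates just below 1, while the cross-entropy keeps growing as the prediction becomes more confidently wrong.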

Q133. How do you select the appropriate learning algorithm for a problem?

Ans.

Selecting the appropriate learning algorithm involves considering the problem's characteristics and requirements.

  • Understand the problem's nature, such as classification, regression, clustering, etc.

  • Consider the size of the dataset and the computational resources available.

  • Evaluate the complexity of the relationships within the data.

  • Experiment with different algorithms and compare their performance using metrics like accuracy, precision, recall, etc.

  • Choose algorithms based on ...read more

Q134. Can you explain gradient descent?

Ans.

Gradient descent is an iterative optimization algorithm used to minimize a cost function by adjusting model parameters.

  • Gradient descent is used in machine learning to optimize models.

  • It works by iteratively adjusting model parameters to minimize a cost function.

  • The algorithm calculates the gradient of the cost function and moves in the direction of steepest descent.

  • There are different variants of gradient descent, such as batch, stochastic, and mini-batch.

  • Gradient descent can...read more
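The update rule can be sketched on a one-dimensional cost function. This minimal example assumes the cost f(x) = (x - 3)^2, whose gradient is 2(x - 3). The learning rate, step count and starting point are arbitrary illustrative choices.

```python
def gradient_descent(grad, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from `start`."""
    x = start
    for _ in range(steps):
        x -= learning_rate * grad(x)  # move in the direction of steepest descent
    return x


# Cost f(x) = (x - 3)^2 has its minimum at x = 3.
minimum = gradient_descent(lambda x: 2 * (x - 3), start=0.0)
print(minimum)  # approaches 3
```

With this learning rate the distance to the minimum shrinks by a factor of 0.8 per step; a rate that is too large would make the iterates diverge instead.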

Q135. How do you identify anomalies in a specific area of operation?

Ans.

Anomalies can be identified by analyzing data using statistical methods and machine learning algorithms.

  • Collect and preprocess data from the area of operation

  • Visualize the data to identify patterns and outliers

  • Apply statistical methods such as mean, standard deviation, and Z-score to detect anomalies

  • Use machine learning algorithms such as clustering, classification, and regression to identify anomalies

  • Validate the results and investigate the anomalies to understand their root...read more

Q136. What are the different types of distances you have worked on?

Ans.

I have worked on various types of distances such as Euclidean, Manhattan, Cosine, and Hamming distances.

  • Euclidean distance measures the straight-line distance between two points in a Euclidean space.

  • Manhattan distance calculates the distance between two points by summing the absolute differences of their coordinates.

  • Cosine distance is one minus the cosine of the angle between two vectors (1 - cosine similarity), so it depends on direction rather than magnitude.

  • Hamming distance calculates the number of positions at which the corresponding symbols are di...read more
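The four distances can be written in a few lines each. This is a minimal sketch with the standard library; the function names and test inputs are illustrative, and cosine distance is taken here as 1 minus cosine similarity.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def hamming(a, b):
    """Number of positions where two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


print(euclidean([0, 0], [3, 4]))        # 5.0
print(manhattan([0, 0], [3, 4]))        # 7
print(cosine_distance([1, 0], [0, 1]))  # 1.0 for orthogonal vectors
print(hamming("karolin", "kathrin"))    # 3
```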

Q137. How to check model performance? Overfit vs underfit?

Ans.

Model performance can be checked using various metrics such as accuracy, precision, recall, F1 score, and confusion matrix.

  • Split data into training and testing sets

  • Train the model on the training set

  • Evaluate the model on the testing set using metrics such as accuracy, precision, recall, F1 score, and confusion matrix

  • If the model performs well on both the training and testing sets, it is neither overfit nor underfit

  • If the model performs well on the training set but poorly on the testing set, it is ov...read more

Q138. What methods can be used to remove a specific type of noise from an image?

Ans.

Methods like median filtering, Gaussian filtering, and wavelet denoising can be used to remove noise from an image.

  • Median filtering replaces each pixel's value with the median value of its neighborhood.

  • Gaussian filtering uses a weighted average of neighboring pixels to smooth out noise.

  • Wavelet denoising decomposes the image into different frequency bands and removes noise from each band separately.

Q139. What metrics can be used to evaluate the performance of a regression model?

Ans.

Various metrics like Mean Squared Error, R-squared, Mean Absolute Error can be used to evaluate regression model performance.

  • Mean Squared Error (MSE) - measures the average of the squares of the errors or deviations

  • R-squared (R2) - indicates the proportion of the variance in the dependent variable that is predictable from the independent variables

  • Mean Absolute Error (MAE) - measures the average of the absolute errors between predicted and actual values

  • Root Mean Squared Error ...read more
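The metrics listed above can be computed directly from their definitions. The following is a minimal sketch; the function name `regression_metrics` and the sample values are hypothetical.

```python
def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE and R-squared for paired actual/predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return {"mse": mse, "rmse": mse ** 0.5, "mae": mae, "r2": 1 - ss_res / ss_tot}


y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

For this data the errors are 1, 0, -1 and 0, so MSE and MAE are both 0.5 and R-squared is 1 - 2/20 = 0.9.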

Q140. What is Dropout & Batch Normalization?

Ans.

Dropout is a regularization technique to prevent overfitting by randomly setting some neuron outputs to zero during training. Batch Normalization is a technique to normalize the inputs of each layer to improve training speed and stability.

  • Dropout randomly sets a fraction of neuron outputs to zero during training to prevent overfitting.

  • Batch Normalization normalizes the inputs of each layer to improve training speed and stability.

  • Dropout is commonly used in neural networks to ...read more

Q141. What is CNN? What is the difference between RCNN and RNN?

Ans.

CNN stands for Convolutional Neural Network. RCNN is Region-based Convolutional Neural Network and RNN is Recurrent Neural Network.

  • CNN is a type of neural network commonly used for image recognition and classification.

  • RCNN is an object detection approach that applies a CNN to proposed regions of interest within an image.

  • RNN is a type of neural network designed for sequence data, such as text or speech.

  • The main difference between RCNN and RNN is the focus on regions of interest vs sequential data...read more

Q142. What is structured and unstructured data? What is supervised and unsupervised learning?

Ans.

Structured data is organized and easily searchable, while unstructured data lacks a predefined format. Supervised learning uses labeled data for training, while unsupervised learning finds patterns in unlabeled data.

  • Structured data is organized in a predefined format, such as databases or spreadsheets.

  • Unstructured data lacks a specific format and includes text, images, videos, etc.

  • Supervised learning uses labeled data to train a model to predict outcomes, like classifying ema...read more

Q143. Can we use random forest for text data (which contains 1000 features, all of which are important)?

Ans.

Yes, random forest can be used for text data with important features.

  • Random forest can handle both numerical and categorical features, including text data.

  • Text data needs to be converted into numerical features using techniques like bag-of-words or TF-IDF.

  • Important features can be identified using feature importance scores provided by random forest.

  • Examples: Classifying emails as spam or not spam, sentiment analysis of customer reviews.

Q144. What is the difference between inner join, outer join, left outer join, and right outer join?

Ans.

Inner join returns only the rows that have matching values in both tables, while a full outer join returns all rows from both tables.

  • Inner join: returns rows with matching values in both tables

  • Outer join (full outer join): returns all rows from both tables, filling in missing values with NULL

  • Left outer join: returns all rows from the left table and the matched rows from the right table

  • Right outer join: returns all rows from the right table and the matched rows from the left table
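The difference between an inner join and a left outer join can be seen on a tiny in-memory database. This sketch uses Python's built-in `sqlite3` module; the `customers` and `orders` tables and their rows are made up for the example. Right and full outer joins are left out because older SQLite versions do not support them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, item TEXT);
INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ravi'), (3, 'Meera');
INSERT INTO orders VALUES (10, 1, 'laptop'), (11, 1, 'mouse'), (12, 2, 'desk');
""")

# Inner join: only customers that have at least one matching order.
inner_rows = conn.execute("""
    SELECT c.name, o.item
    FROM customers c
    INNER JOIN orders o ON o.customer_id = c.id
    ORDER BY o.id
""").fetchall()

# Left outer join: every customer, with NULL (None) where no order matches.
left_rows = conn.execute("""
    SELECT c.name, o.item
    FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.id
    ORDER BY c.id, o.id
""").fetchall()

conn.close()
print(inner_rows)
print(left_rows)
```

The customer with no orders is missing from the inner join but appears in the left join with `None` in the `item` column.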

Q145. Difference between Bayesian and Frequentist statistics

Ans.

Bayesian statistics treats parameters as random variables and updates prior beliefs with data, while frequentist statistics treats parameters as fixed and interprets probability as long-run frequency over repeated sampling.

  • Bayesian statistics combines a prior distribution with the observed data to obtain a posterior distribution for a parameter, while frequentist statistics estimates the parameter from the sample alone.

  • Bayesian statistics involves the use of Bayes' theorem, while frequentist statistics involves hypothesis testing and confidence intervals.

  • Bayesian statistics can handle small sample sizes and complex models,...read more

Q146. Find elements from list 2 which are in list 1 without changing their order.

Ans.

The task is to return the elements of list 2 that also appear in list 1, in the order they occur in list 2.

  • Build a set from list 1 for fast membership checks

  • Iterate through list 2 in its original order

  • If an element is in the set, add it to the result list
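One way to do this is to build a set from list 1 and filter list 2 against it, which keeps list 2's order and runs in linear time. The function name and sample lists are illustrative, and the sketch assumes the elements are hashable.

```python
def common_in_order(list1, list2):
    """Return the elements of list2 that also occur in list1, in list2's order."""
    lookup = set(list1)  # O(1) average-time membership checks
    return [item for item in list2 if item in lookup]


list1 = [5, 1, 8, 3]
list2 = [3, 9, 5, 5, 2, 8]
result = common_in_order(list1, list2)
print(result)  # [3, 5, 5, 8]
```

Duplicates in list 2 are kept. If each match should appear only once, track the values already emitted in a second set.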

Q147. How are boosting and bagging algorithms different?

Ans.

Boosting and bagging are ensemble learning techniques used to improve model performance.

  • Bagging involves training multiple models on different subsets of the data and averaging their predictions.

  • Boosting involves training multiple models sequentially, with each model focusing on the errors of the previous model.

  • Bagging reduces variance and overfitting, while boosting reduces bias and underfitting.

  • Examples of bagging algorithms include Random Forest and bagged decision trees, ...read more

Q148. How do you choose which ML model to use?

Ans.

The choice of ML model depends on the problem, data, and desired outcome.

  • Consider the problem type: classification, regression, clustering, etc.

  • Analyze the data: size, quality, features, and target variable.

  • Evaluate model performance: accuracy, precision, recall, F1-score.

  • Consider interpretability, scalability, and computational requirements.

  • Experiment with multiple models: decision trees, SVM, neural networks, etc.

  • Use cross-validation and hyperparameter tuning for model sele...read more

Q149. What are PCA, decision trees, and computer vision?

Ans.

PCA is a dimensionality reduction technique, a decision tree is a supervised learning algorithm for classification and regression, and computer vision is a field of study focused on enabling computers to interpret and understand visual information.

  • PCA is used to reduce the number of variables in a dataset while retaining as much of the variance as possible.

  • Decision trees predict a class or value by applying a learned set of rules and conditions.

  • Computer vision involves using algorithms and techniques to enable computers to interpr...read more

Q150. Which algorithms have you used? Statistical measures, challenges, and outcome

Ans.

I have used various algorithms such as linear regression, decision trees, and neural networks to analyze data and make predictions.

  • Used linear regression to predict housing prices based on various features

  • Implemented decision trees to classify customer behavior and recommend products

  • Utilized neural networks for image recognition tasks

  • Challenges included dealing with missing data and overfitting

  • Outcome was improved accuracy and insights into the data

Made with ❤️ in India. Trademarks belong to their respective owners. All rights reserved © 2024 Info Edge (India) Ltd.
