Data Science Intern

100+ Data Science Intern Interview Questions and Answers

Updated 29 Nov 2024

Q1. Puzzle Question

In a bank, suppose there are 5 counters. Which approach mentioned below is better?
1) The new customer goes to whichever counter has a smaller queue
2) Each counter has a specific purpose (e.g., ca...read more

Q2. Rotate matrix by 90 degrees

You are given a square matrix of non-negative integers 'MATRIX'. Your task is to rotate that array by 90 degrees in an anti-clockwise direction using constant extra space.

For example...read more

Ans.

The task is to rotate a square matrix by 90 degrees in an anti-clockwise direction using constant extra space.

  • Iterate through each layer of the matrix from outer to inner

  • For each layer, perform a four-way swap of elements

  • Continue this process until all layers have been rotated
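The layer-by-layer four-way swap described above can be sketched as follows. This is a minimal illustration, not the interviewer's reference solution; the function name `rotate_anticlockwise` is an assumption made for this example.

```python
def rotate_anticlockwise(matrix):
    """Rotate a square matrix 90 degrees anti-clockwise, in place."""
    n = len(matrix)
    for layer in range(n // 2):              # outer layer first, then inwards
        first, last = layer, n - 1 - layer
        for i in range(first, last):
            offset = i - first
            top = matrix[first][i]           # save the top element
            # right column moves to the top row
            matrix[first][i] = matrix[i][last]
            # bottom row moves to the right column
            matrix[i][last] = matrix[last][last - offset]
            # left column moves to the bottom row
            matrix[last][last - offset] = matrix[last - offset][first]
            # saved top element moves to the left column
            matrix[last - offset][first] = top
    return matrix


grid = [[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]]
rotate_anticlockwise(grid)
```

Only one temporary variable is used per swap, so the extra space stays constant regardless of the matrix size.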

Data Science Intern Interview Questions and Answers for Freshers

Q3. SQL Question

A table containing the details of all the drivers, and another table containing the details of trips is provided. Query the name and number of trips taken by each driver which has rating more than 4...read more

Q4. SQL Question

A table containing the details of all the drivers, and another table containing the details of trips is provided. Query the name of drivers that have taken at least 1 trip.

Q5. Puzzle Question

How would you predict which user is likely to churn in the next 1 month? What variables would you consider for this?

Q6. SQL Question

User table containing user_id, ph_no, email_id, signup_city, banned, role, timestamp and trips table containing trip_id, client_id, driver_id, city_id, product_id, request_at, status={completed,ride...read more

Q7. Puzzle Question

What new feature would you like to add in Uber?

Q8. What is gradient descent? Why does gradient descent follow tan angles? Please explain it and write down its formula.

Ans.

Gradient descent is an optimization algorithm used to minimize the cost function of a machine learning model.

  • Gradient descent is used to update the parameters of a model to minimize the cost function.

  • It follows the direction of steepest descent, which is the negative gradient of the cost function.

  • The learning rate determines the step size of the algorithm.

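  • For linear regression with a mean-squared-error cost, the update rule is: theta = theta - alpha * (1/m) * sum((hypothesis - y) * x), where alpha is the learning rate and m is the number of training examples.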

  • The cost function should be ...read more
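As a rough sketch of the update rule, the following applies batch gradient descent to a one-feature linear regression. The function name, the learning rate, the step count, and the toy data are all assumptions chosen for illustration.

```python
def gradient_descent(xs, ys, alpha=0.05, steps=2000):
    """Fit y = theta0 + theta1 * x by batch gradient descent on the MSE cost."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(steps):
        # hypothesis - y for every training example
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        # partial derivatives of the cost with respect to each parameter
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # step against the gradient, scaled by the learning rate
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]   # generated from y = 1 + 2x
theta0, theta1 = gradient_descent(xs, ys)
```

On this toy data the parameters should approach an intercept of 1 and a slope of 2. A learning rate that is too large would make the updates diverge instead.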

Q9. Get the second-highest element from an array (duplicate elements are allowed). Required time complexity: O(N) in a single traversal. Space complexity: O(1).

Ans.

Get the second-highest element from an array with O(N) time complexity and O(1) space complexity.

  • Initialize two variables to store the highest and second highest elements.

  • Traverse the array and update the variables accordingly.

  • Return the second highest element.

  • Handle edge cases like empty array or array with only one element.
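A single-pass sketch of these steps is shown below. It assumes that "second highest" means the second-largest distinct value, so duplicates of the maximum are skipped; the function name is hypothetical.

```python
def second_highest(values):
    """Return the second-largest distinct value, or None if there is none."""
    highest = second = None
    for v in values:
        if highest is None or v > highest:
            # new maximum: the old maximum becomes the runner-up
            highest, second = v, highest
        elif v != highest and (second is None or v > second):
            # strictly between the runner-up and the maximum
            second = v
    return second
```

For example, `second_highest([5, 1, 5, 3])` gives 3, while an empty array or an array with one distinct value gives `None`.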

Q10. SQL Question

A table containing the details of Uber employees was provided. I had to query the maximum number of employees that worked in the company during each employee's tenure.

Q11. Complete the Code

Complete the function to find the value in a specified categorical column with the highest average of a specified numeric column.

import pandas as pd
def category_avg(df, cat_column, num_column):...read more

Ans.

The function finds the value in a specified categorical column with the highest average of a specified numeric column.

  • Use the groupby() function in pandas to group the data by the categorical column

  • Calculate the average of the numeric column for each group

  • Find the group with the highest average and return the corresponding value from the categorical column
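One possible way to complete the function is sketched below. The original function body is truncated in the question, so this is an assumption about the intended behaviour, and the sample DataFrame is made up for illustration.

```python
import pandas as pd


def category_avg(df, cat_column, num_column):
    # mean of the numeric column within each category
    means = df.groupby(cat_column)[num_column].mean()
    # index label (category value) whose mean is the largest
    return means.idxmax()


# hypothetical data for a quick check
df = pd.DataFrame({
    "city": ["A", "A", "B", "B"],
    "fare": [10, 20, 30, 50],
})
best = category_avg(df, "city", "fare")
```

Here the means are 15 for "A" and 40 for "B", so `best` is "B". If two categories tie, `idxmax()` returns the first one it encounters.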

Q12. Coefficient of x^7 in the expression y = (x^101 - 1)(x^100 + 1)(x^99 - 1)...(x^0 + 1)?

Ans.

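Find the coefficient of x^7 in the given product.

  • The factors are all different, so the binomial theorem does not apply directly; instead, multiply the factors out while keeping only terms of degree 7 or lower (a factor x^k with k > 7 can contribute only its constant term)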

  • Identify the term with x^7

  • The coefficient of x^7 is the coefficient of that term
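The coefficient can also be checked numerically by multiplying the factors one at a time and discarding every term above degree 7. The sketch below assumes the sign pattern suggested by the question: (x^k - 1) for odd k and (x^k + 1) for even k, for k from 101 down to 0. The function name is hypothetical.

```python
def coefficient(target=7, max_exp=101):
    """Coefficient of x**target in the product of (x**k +/- 1), k = 0..max_exp."""
    # poly[d] is the coefficient of x**d; higher degrees are never stored
    poly = [0] * (target + 1)
    poly[0] = 1
    for k in range(max_exp + 1):
        const = 1 if k % 2 == 0 else -1      # assumed sign pattern
        new = [0] * (target + 1)
        for d, c in enumerate(poly):
            if c == 0:
                continue
            new[d] += c * const              # pick the constant term of the factor
            if d + k <= target:
                new[d + k] += c              # pick the x**k term of the factor
        poly = new
    return poly[target]


result = coefficient()
```

Under that sign pattern the function returns 10. The same value follows by hand: 7 can be written as a sum of distinct positive exponents in five ways (7, 6+1, 5+2, 4+3, 4+2+1), each contributes +1, and the factor (x^0 + 1) = 2 doubles the total.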

Q13. From where did you complete the Data Science course?

Ans.

I completed the Data Science course at XYZ University.

  • Completed Data Science course at XYZ University

  • Received hands-on training in machine learning algorithms

  • Worked on real-world projects during the course

Q14. Explain Random forest. What is gini impurity.

Ans.

Random forest is an ensemble learning method that constructs a multitude of decision trees and outputs the mode of the classes. Gini impurity is a measure of impurity or randomness used in decision trees.

  • Random forest is a collection of decision trees, each trained on a bootstrap sample (a random sample drawn with replacement) of the training data.

  • At each split, a tree considers only a random subset of the features, which reduces the correlation between trees.

  • The final prediction is made by taking the mode of the predictions of all the trees.

  • Gini impurity i...read more
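For reference, the standard definition of Gini impurity for a set of labels is 1 minus the sum of the squared class proportions. A minimal sketch of that formula follows; the function name is an assumption.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 - sum(p_k ** 2) over the class proportions p_k."""
    n = len(labels)
    if n == 0:
        return 0.0                           # convention for an empty node
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())
```

A node containing a single class has impurity 0, and a node split evenly between two classes has impurity 0.5. A decision tree picks the split that lowers the weighted impurity of the child nodes the most.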

Q15. Which ML algorithm did you use in your project?

Ans.

I used the Random Forest algorithm in my project.

  • Random Forest is an ensemble learning method that combines multiple decision trees to make predictions.

  • It is used for both classification and regression tasks.

  • Random Forest reduces overfitting and provides feature importance.

  • Example: I used Random Forest to predict customer churn in a telecom company.

Q16. What is the difference between call by reference and call by value?

Ans.

Call by value passes a copy of the value while call by reference passes the address of the value.

  • Call by value passes a copy of the value while call by reference passes the address of the value.

  • Call by value does not modify the original value while call by reference can modify the original value.

  • Which mechanism applies depends on the language: C passes everything by value (pointers are used to simulate call by reference), while C++ supports both through reference parameters.

Q17. How to Choose K value in K means? if there are any techniques, name them and explain.

Ans.

Choosing the optimal K value in K-means clustering is crucial for accurate results.

  • Elbow method: Plotting the sum of squared distances vs. K and selecting the K value where the curve bends like an elbow.

  • Silhouette method: Calculating the average silhouette score for different K values and choosing the one with the highest score.

  • Gap statistic method: Comparing the within-cluster dispersion to a reference null distribution to find the optimal K value.

  • Cross-validation: Splitting...read more

Q18. Similar table. Find students who scored more than the average marks in both 11th and 12th.

Ans.

Find students who scored more than avg marks in both 11th and 12th grades.

  • Calculate the average marks across all students, separately for 11th and 12th grades.

  • Compare each student's marks with the respective average marks to find those who scored higher in both grades.
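The question does not show the table, so the sketch below assumes a hypothetical `students` table with `name`, `marks_11`, and `marks_12` columns, and runs the query through Python's built-in `sqlite3` module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# hypothetical schema and rows, for illustration only
conn.execute("CREATE TABLE students (name TEXT, marks_11 REAL, marks_12 REAL)")
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?)",
    [("Asha", 90, 85), ("Ravi", 60, 95), ("Meera", 80, 80), ("John", 50, 40)],
)

query = """
SELECT name
FROM students
WHERE marks_11 > (SELECT AVG(marks_11) FROM students)
  AND marks_12 > (SELECT AVG(marks_12) FROM students)
ORDER BY name
"""
above_both = [row[0] for row in conn.execute(query)]
```

With these rows the averages are 70 (11th) and 75 (12th), so only Asha and Meera are above both.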

Q19. SQL query: customers who have ordered all products from all categories.

Ans.

Use a SQL query to find customers who have ordered all products from all categories.

  • Join the Customers, Orders, and Products tables

  • Group by customer and count the distinct products ordered

  • Keep only the customers whose count of distinct ordered products equals the total number of products (which also covers every category)
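A sketch of this counting approach, run through Python's `sqlite3` module, is given below. The `products` and `orders` tables and their columns are assumptions, since the question does not give a schema, and the query relies on every ordered `product_id` existing in `products`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# hypothetical schema and rows, for illustration only
conn.executescript("""
CREATE TABLE products (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE orders (customer_id INTEGER, product_id INTEGER);
INSERT INTO products VALUES (1, 'books'), (2, 'books'), (3, 'toys');
INSERT INTO orders VALUES (10, 1), (10, 2), (10, 3), (10, 1), (20, 1), (20, 3);
""")

query = """
SELECT customer_id
FROM orders
GROUP BY customer_id
HAVING COUNT(DISTINCT product_id) = (SELECT COUNT(*) FROM products)
"""
complete_customers = [row[0] for row in conn.execute(query)]
```

Customer 10 ordered all three products (one of them twice, which `COUNT(DISTINCT ...)` ignores), while customer 20 covered both categories but missed product 2, so only customer 10 is returned.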

Q20. What do you understand by joins in sql?

Ans.

Joins in SQL are used to combine rows from two or more tables based on a related column between them.

  • Joins are used to retrieve data from multiple tables in a single query.

  • Common types of joins include inner join, left join, right join, and full outer join.

  • Joins are performed using the JOIN keyword and specifying the columns to join on.

  • Joins can be used to combine tables based on matching values or non-matching values.

  • Joins help in creating relationships between tables and fe...read more

Q21. Which algorithm would you prefer between Random Forest and XGBoost when the model is low bias?

Ans.

XGBoost is preferred over Random Forest for low bias models due to its ability to reduce bias further.

  • XGBoost is a more complex algorithm compared to Random Forest, allowing it to reduce bias further in low bias models.

  • XGBoost uses gradient boosting which helps in reducing bias by optimizing the loss function iteratively.

  • Random Forest may not be able to further reduce bias in low bias models as effectively as XGBoost.

  • In scenarios where the model already has low bias, XGBoost'...read more

Q22. What is overfitting, How to handle missing values, etc

Ans.

Overfitting is when a model is too complex and fits the training data too closely, leading to poor performance on new data.

  • Regularization techniques like L1 and L2 can be used to prevent overfitting

  • Cross-validation can be used to evaluate model performance on new data

  • Reducing the complexity of the model can also help prevent overfitting

  • Handling missing values can be done by imputing them with mean, median or mode values

  • Alternatively, missing values can be dropped if they are ...read more

Q23. What is SQL? And what is a database?

Ans.

SQL (Structured Query Language) is a language used for managing and manipulating relational databases. A database is a structured collection of data.

  • SQL is used to retrieve, insert, update, and delete data from a database.

  • A database is a software system that stores and organizes data in a structured manner.

  • SQL allows users to define the structure of a database, create tables, and establish relationships between tables.

  • Examples of databases include MySQL, Oracle, and SQL Server.

Q24. What is the average number of cups of tea people drink in a day in Delhi?

Ans.

The average number of cups of tea people drink in a day in Delhi varies depending on individual preferences and habits.

  • Consumption can range from 1-5 cups per person per day.

  • Factors such as age, gender, occupation, and cultural background can influence tea consumption.

  • Some people may drink more tea in the morning for a caffeine boost, while others may prefer tea throughout the day for relaxation.

  • Tea consumption may also vary based on the season, with more tea being consum...read more

Q25. What motivated you to choose data science?

Ans.

I chose data science because of its potential to solve complex problems and derive meaningful insights from data.

  • Fascination with the power of data to drive decision-making

  • Interest in solving real-world problems using data-driven approaches

  • Passion for exploring patterns and trends in data

  • Desire to contribute to advancements in technology and innovation

  • Excitement about the interdisciplinary nature of data science

  • Examples: Predictive analytics in healthcare, fraud detection in fi...read more

Q26. What will happen if linear regression is used for classification

Ans.

Using linear regression for classification can lead to inaccurate predictions and unreliable results.

  • Linear regression assumes a continuous output, making it unsuitable for discrete classification tasks.

  • It may not handle outliers well, leading to incorrect classification boundaries.

  • The predicted values may fall outside the 0-1 range for binary classification.

  • Logistic regression is a more appropriate choice for classification tasks.

Q27. What are mutable and immutable data structures?

Ans.

Mutable data structures can be modified after creation, while immutable data structures cannot be changed once created.

  • Mutable data structures allow for in-place modifications, while immutable data structures require creating a new instance when modifications are needed.

  • Examples of mutable data structures include lists, dictionaries, and sets in Python.

  • Examples of immutable data structures include tuples and strings in Python.
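The difference can be seen in a few lines of Python. This is only an illustrative sketch; the variable names are arbitrary.

```python
# Lists are mutable: both names refer to the same object,
# so an in-place change is visible through either name.
numbers = [1, 2, 3]
alias = numbers
alias.append(4)

# Tuples are immutable: item assignment raises TypeError.
point = (1, 2)
try:
    point[0] = 10
    tuple_changed = True
except TypeError:
    tuple_changed = False

# Strings are immutable: methods return a new string
# and leave the original untouched.
name = "data"
upper = name.upper()
```

After this runs, `numbers` contains four elements even though only `alias` was modified, `point` is unchanged, and `name` still holds the lowercase string.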

Q28. What is regression model in ML and describe intercept and coefficient.

Ans.

Regression model is a ML algorithm used to predict continuous numerical values. Intercept is the value of the predicted variable when all predictors are zero. Coefficient represents the change in the predicted variable for a one-unit change in the predictor variable.

  • Regression model predicts continuous numerical values.

  • Intercept is the value of the predicted variable when all predictors are zero.

  • Coefficient represents the change in the predicted variable for a one-unit change...read more

Q29. How would you handle missing data in time series data?

Ans.

Handle missing data in time series data by imputation, interpolation, or deletion.

  • Impute missing values using mean, median, mode, or predictive modeling techniques.

  • Interpolate missing values using linear interpolation, spline interpolation, or time-based interpolation.

  • Delete rows with missing values if they are few and do not significantly impact the analysis.

Q30. What are linear regression and logistic regression?

Ans.

Linear regression is a statistical method to model the relationship between a dependent variable and one or more independent variables. Logistic regression is used to model the probability of a binary outcome.

  • Linear regression is used for predicting continuous outcomes, while logistic regression is used for predicting binary outcomes.

  • Linear regression assumes a linear relationship between the independent and dependent variables, while logistic regression uses a logistic funct...read more

Q31. Are you aware of supervised and unsupervised learning?

Ans.

Supervised learning uses labeled data to train a model, while unsupervised learning uses unlabeled data.

  • Supervised learning requires a target variable for training

  • Examples of supervised learning include regression and classification tasks

  • Unsupervised learning finds patterns in data without predefined labels

  • Examples of unsupervised learning include clustering and dimensionality reduction

Q32. Please write a dictionary and try to sort it.

Ans.

A dictionary sorted in ascending order based on keys.

  • Create a dictionary with key-value pairs

  • Use the sorted() function on the dictionary's items() to sort the key-value pairs by key

  • sorted() returns a list of (key, value) tuples

  • Use the dict() constructor to create a new dictionary from the sorted list of tuples
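A short sketch of these steps follows, with an extra line showing a sort by value. The dictionary contents are made up for illustration.

```python
scores = {"ravi": 95, "asha": 70, "meera": 85}

# sorted() on items() returns a list of (key, value) tuples ordered by key;
# dict() rebuilds a dictionary that keeps that order (Python 3.7+).
by_key = dict(sorted(scores.items()))

# Sorting by value instead: use the second element of each pair as the key.
by_value = dict(sorted(scores.items(), key=lambda pair: pair[1], reverse=True))
```

`by_key` lists the names alphabetically, and `by_value` lists them from the highest score to the lowest. The original `scores` dictionary is left unchanged.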

Q33. SQL syntax and difference between having and where clause

Ans.

HAVING clause is used with GROUP BY to filter grouped rows, WHERE clause is used to filter individual rows.

  • HAVING clause is used with GROUP BY to filter grouped rows based on aggregate functions

  • WHERE clause is used to filter individual rows based on conditions

  • HAVING clause is applied after GROUP BY, WHERE clause is applied before GROUP BY

  • HAVING is normally used together with GROUP BY; without a GROUP BY clause, many databases treat the whole result set as a single group
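The order in which the two clauses apply can be seen in one query. The sketch below uses Python's `sqlite3` module with a hypothetical `trips` table; the table, columns, and rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# hypothetical schema and rows, for illustration only
conn.executescript("""
CREATE TABLE trips (driver TEXT, status TEXT, fare REAL);
INSERT INTO trips VALUES
  ('amit', 'completed', 100), ('amit', 'completed', 150), ('amit', 'cancelled', 0),
  ('bela', 'completed', 80),  ('bela', 'cancelled', 0),
  ('chen', 'completed', 200), ('chen', 'completed', 50), ('chen', 'completed', 70);
""")

query = """
SELECT driver, COUNT(*) AS completed_trips
FROM trips
WHERE status = 'completed'   -- row filter, applied before grouping
GROUP BY driver
HAVING COUNT(*) >= 2         -- group filter, applied after aggregation
ORDER BY driver
"""
rows = conn.execute(query).fetchall()
```

WHERE first discards the cancelled trips, then HAVING drops the drivers with fewer than two completed trips, leaving amit (2) and chen (3).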

Q34. Different types of OOP; explain OOP.

Ans.

Object-oriented programming (OOP) is a programming paradigm that uses objects to represent and manipulate data.

  • OOP is based on the concept of classes and objects.

  • Encapsulation, inheritance, and polymorphism are key principles of OOP.

  • Examples of OOP languages include Java, C++, and Python.

Q35. What is Digital Image Processing?

Ans.

Digital Image Processing is the manipulation of digital images using algorithms to enhance, analyze, or extract information.

  • It involves techniques like image enhancement, restoration, segmentation, and compression.

  • Common applications include medical imaging, satellite imaging, facial recognition, and object detection.

  • Algorithms like edge detection, noise reduction, and image stitching are used in digital image processing.

  • It plays a crucial role in fields like computer vision,...read more

Q36. Different types of algorithms used in e-commerce companies

Ans.

Recommendation, personalization, fraud detection, search algorithms are used in e-commerce companies.

  • Recommendation algorithms suggest products based on user behavior and preferences.

  • Personalization algorithms customize the user experience based on their past behavior.

  • Fraud detection algorithms identify and prevent fraudulent transactions.

  • Search algorithms help users find products based on their search queries.

  • Clustering algorithms group similar products together for easier n...read more

Q37. Given a number, check whether it is a palindrome or not.

Ans.

Check if a number is a palindrome or not

  • Convert the number to a string

  • Reverse the string and compare it with the original string

  • If they are the same, the number is a palindrome
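The string-reversal approach above fits in a few lines. The function name is an assumption; note that with this approach a negative number is never a palindrome, because the minus sign ends up at the other end of the reversed string.

```python
def is_palindrome(number):
    """Return True if the decimal digits read the same in both directions."""
    text = str(number)
    return text == text[::-1]   # slicing with step -1 reverses the string
```

For example, `is_palindrome(121)` is `True` and `is_palindrome(123)` is `False`.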

Q38. Two good and two bad things you think about data science

Ans.

Good and bad aspects of Data Science

  • Good: Data science helps in making informed decisions based on data-driven insights

  • Good: Data science can uncover valuable patterns and trends in large datasets

  • Bad: Data science can be time-consuming and resource-intensive

  • Bad: Data science may face challenges with data privacy and ethical considerations

Q39. How are you going to tackle high-pressure situations?

Ans.

I tackle high-pressure situations by staying calm, prioritizing tasks, seeking help when needed, and maintaining a positive attitude.

  • Stay calm and composed under pressure

  • Prioritize tasks based on urgency and importance

  • Seek help or guidance from colleagues or supervisors

  • Maintain a positive attitude and focus on finding solutions

Q40. What technologies do you know?

Ans.

I am familiar with a variety of technologies commonly used in data science, including programming languages, databases, and machine learning tools.

  • Programming languages: Python, R, SQL

  • Databases: MySQL, MongoDB

  • Machine learning tools: TensorFlow, scikit-learn

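Q41. What is the central limit theorem? Why do we use it?

Ans.

The Central Limit Theorem states that, for independent samples from a population with finite variance, the sampling distribution of the sample mean approaches a normal distribution as the sample size increases, whatever the shape of the population distribution.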

  • Central Limit Theorem is used to make inferences about a population mean based on the sample mean.

  • It allows us to use the properties of the normal distribution to estimate population parameters.

  • It is essential in hypothesis testing and constructing confidence intervals.

  • For example, if we take multiple samples of a population and calculat...read more

Q42. What is data science? What do you like about data science?

Ans.

Data science is a field that uses scientific methods, algorithms, and systems to extract knowledge and insights from structured and unstructured data.

  • Data science involves analyzing large amounts of data to uncover patterns, trends, and insights.

  • It combines statistics, machine learning, and domain knowledge to solve complex problems.

  • Data science is used in various industries such as healthcare, finance, marketing, and more.

  • Examples include predicting customer behavior, optimi...read more

Q43. Formulas for Precision, Recall, accuracy, F1 Score.

Ans.

Formulas for Precision, Recall, Accuracy, F1 Score in data science.

  • Precision = TP / (TP + FP)

  • Recall = TP / (TP + FN)

  • Accuracy = (TP + TN) / (TP + TN + FP + FN)

  • F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
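These formulas translate directly into code. The sketch below takes the four confusion-matrix counts as inputs; the function name and the choice to return 0.0 when a denominator is zero are assumptions, not a fixed convention.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute precision, recall, accuracy and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}


metrics = classification_metrics(tp=6, tn=10, fp=2, fn=2)
```

With 6 true positives, 10 true negatives, 2 false positives and 2 false negatives, precision and recall are both 0.75, accuracy is 0.8, and F1 is 0.75.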

Q44. OverFitting and underfitting conditions and example

Ans.

Overfitting and underfitting are common issues in machine learning where the model either learns the noise in the training data or fails to capture the underlying patterns.

  • Overfitting occurs when a model learns the training data too well, including noise and outliers, leading to poor generalization on new data.

  • Underfitting happens when a model is too simple to capture the underlying patterns in the data, resulting in high bias and low variance.

  • Examples of overfitting include ...read more

Q45. What do you understand by a credit card company?

Ans.

A credit card company is a financial institution that issues credit cards to consumers for making purchases and borrowing money.

  • Credit card companies issue credit cards to consumers for making purchases and borrowing money

  • They charge interest on outstanding balances and fees for late payments

  • They provide customer service for cardholders and handle disputes and fraud claims

  • Examples of issuers include American Express and Discover, as well as banks that issue cards on the Visa and Mastercard payment networks

Q46. Which feature engineering methods have you used?

Ans.

I have used techniques like one-hot encoding, feature scaling, polynomial features, and interaction terms for feature engineering.

  • One-hot encoding for categorical variables

  • Feature scaling for numerical variables

  • Polynomial features for capturing non-linear relationships

  • Interaction terms for capturing interactions between features

Q47. What happens when you type google.com?

Ans.

Typing google.com sends a request to Google's servers, which respond with the website's HTML code.

  • The browser sends a DNS request to resolve the domain name 'google.com' to an IP address

  • The browser establishes a TCP connection with Google's server

  • The browser sends an HTTP request to the server for the website's HTML code

  • The server responds with the HTML code, which the browser renders as a webpage

Q48. How would you deal with time series data?

Ans.

Time series data can be analyzed using techniques like smoothing, decomposition, forecasting, and anomaly detection.

  • Use smoothing techniques like moving averages to remove noise and identify trends.

  • Apply decomposition methods like STL (Seasonal and Trend decomposition using Loess) to separate trend, seasonality, and residual components.

  • Utilize forecasting models such as ARIMA, SARIMA, or Prophet to predict future values.

  • Implement anomaly detection algorithms like Isolation Forest or...read more

Q49. Sort a nearly sorted array.

Ans.

Sort a nearly sorted array (each element is at most k positions away from its sorted position) using a min heap

  • Create a min heap of size k+1

  • Insert first k+1 elements into min heap

  • For remaining elements, extract min and insert new element

  • Extract all remaining elements from min heap

  • Time complexity: O(nlogk)

  • Example: ['apple', 'banana', 'cherry', 'date', 'elderberry']
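The heap steps above can be sketched with Python's standard `heapq` module. The function name is hypothetical, and the code assumes every element is at most k positions away from its sorted position.

```python
import heapq


def sort_nearly_sorted(values, k):
    """Sort a list whose elements are each at most k positions out of place."""
    # the smallest remaining element is always among the next k + 1 items
    heap = list(values[:k + 1])
    heapq.heapify(heap)
    result = []
    for v in values[k + 1:]:
        # heapreplace pops the minimum, then pushes the new element
        result.append(heapq.heapreplace(heap, v))
    # drain whatever is left in the heap
    while heap:
        result.append(heapq.heappop(heap))
    return result


ordered = sort_nearly_sorted([3, 1, 2, 6, 4, 5], k=2)
```

The heap never holds more than k + 1 elements, which is where the O(n log k) bound comes from. The same function works on strings, since `heapq` only needs the elements to be comparable.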

Q50. What is a GAN? Have you worked with it?

Ans.

GAN stands for Generative Adversarial Network, a type of neural network used for generating new data.

  • Consists of two neural networks - generator and discriminator

  • Generator creates new data samples while discriminator tries to distinguish between real and generated data

  • Used in image generation, text generation, and other creative applications
