Data Science Intern
100+ Data Science Intern Interview Questions and Answers
In a bank, suppose there are 5 counters. Which approach mentioned below is better?
1) The new customer goes to whichever counter has a smaller queue
2) Each counter has a specific purpose (e.g., ca...read more
You are given a square matrix of non-negative integers 'MATRIX'. Your task is to rotate that array by 90 degrees in an anti-clockwise direction using constant extra space.
For example...read more
The task is to rotate a square matrix by 90 degrees in an anti-clockwise direction using constant extra space.
Iterate through each layer of the matrix from outer to inner
For each layer, perform a four-way swap of elements
Continue this process until all layers have been rotated
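The layer-by-layer rotation described above can be sketched as follows. This is a minimal illustration, not necessarily the expected interview solution; the function name `rotate_anticlockwise` is an assumption made for this example.

```python
def rotate_anticlockwise(matrix):
    """Rotate a square matrix 90 degrees anti-clockwise in place."""
    n = len(matrix)
    for layer in range(n // 2):
        first, last = layer, n - 1 - layer
        for i in range(first, last):
            offset = i - first
            top = matrix[first][i]  # save the top element
            # top <- right
            matrix[first][i] = matrix[i][last]
            # right <- bottom
            matrix[i][last] = matrix[last][last - offset]
            # bottom <- left
            matrix[last][last - offset] = matrix[last - offset][first]
            # left <- saved top
            matrix[last - offset][first] = top
    return matrix
```

Only one temporary variable is used per four-way swap, so the extra space stays constant regardless of the matrix size.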
Data Science Intern Interview Questions and Answers for Freshers
A table containing the details of all the drivers and another table containing the details of trips are provided. Query the name and number of trips taken by each driver who has a rating of more than 4...read more
A table containing the details of all the drivers and another table containing the details of trips are provided. Query the names of drivers who have taken at least 1 trip.
How would you predict which user is likely to churn in the next 1 month? What variables would you consider for this?
User table containing user_id, ph_no, email_id, signup_city, banned, role, timestamp and trips table containing trip_id, client_id, driver_id, city_id, product_id, request_at, status={completed,ride...read more
What new feature would you like to add in Uber?
Q8. What is gradient descent? Why does gradient descent follow tan angles? Please explain and write down the formula.
Gradient descent is an optimization algorithm used to minimize the cost function of a machine learning model.
Gradient descent is used to update the parameters of a model to minimize the cost function.
It follows the direction of steepest descent, which is the negative gradient of the cost function. The gradient is the slope of the tangent to the cost curve, and a slope is the tan of the angle the tangent makes with the axis.
The learning rate determines the step size of the algorithm.
The formula for gradient descent (for a linear model with a squared-error cost, m training examples, and learning rate alpha) is: theta = theta - alpha * (1/m) * sum((hypothesis - y) * x)
The cost function should be ...read more
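As a rough sketch of the update rule above, the following applies it to a one-feature linear model with an intercept. The function name, the default learning rate, and the number of steps are assumptions chosen for illustration only.

```python
def gradient_descent(xs, ys, alpha=0.1, steps=2000):
    """Fit y = theta0 + theta1 * x by batch gradient descent."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(steps):
        # hypothesis - y for every training example
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        # (1/m) * sum((hypothesis - y) * x), with x = 1 for the intercept
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # step against the gradient, scaled by the learning rate
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1


# Points that lie exactly on y = 2x + 1
theta0, theta1 = gradient_descent([0, 1, 2, 3], [1, 3, 5, 7])
```

A learning rate that is too large makes the updates diverge, while one that is too small makes convergence slow, so in practice it is tuned.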
Q9. Get the second highest element from an array (duplicate elements are allowed). Required time complexity: O(N) in a single traversal. Space complexity: O(1).
Get the second highest element from an array with O(N) time complexity and O(1) space complexity.
Initialize two variables to store the highest and second highest elements.
Traverse the array and update the variables accordingly.
Return the second highest element.
Handle edge cases like empty array or array with only one element.
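The steps above can be sketched as a single pass with two variables. This sketch assumes that duplicates of the maximum do not count as the second highest, and that `None` is an acceptable return value when no second highest exists; both are assumptions, and the function name is hypothetical.

```python
def second_highest(values):
    """Return the second highest distinct value, or None if there is none."""
    highest = second = None
    for v in values:
        if highest is None or v > highest:
            # new maximum: the old maximum becomes the runner-up
            second, highest = highest, v
        elif v != highest and (second is None or v > second):
            # strictly between the runner-up and the maximum
            second = v
    return second
```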
A table containing the details of Uber employees was provided. I had to query the maximum number of employees that worked in the company during each employee's tenure.
Complete the function to find the value in a specified categorical column with the highest average of a specified numeric column.
import pandas as pd
def category_avg(df, cat_column, num_column):...read more
The function finds the value in a specified categorical column with the highest average of a specified numeric column.
Use the groupby() function in pandas to group the data by the categorical column
Calculate the average of the numeric column for each group
Find the group with the highest average and return the corresponding value from the categorical column
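One possible way to complete the function, following the groupby steps above. The original function body is truncated in the source, so this is a sketch rather than the expected solution; the sample column names `city` and `fare` are made up.

```python
import pandas as pd


def category_avg(df, cat_column, num_column):
    """Return the category whose rows have the highest mean of num_column."""
    # mean of the numeric column for each category
    means = df.groupby(cat_column)[num_column].mean()
    # index label (category) of the largest mean
    return means.idxmax()


# Hypothetical sample data
sample = pd.DataFrame({"city": ["A", "A", "B", "B"], "fare": [10, 20, 30, 50]})
best_city = category_avg(sample, "city", "fare")
```

If two categories tie for the highest average, `idxmax` returns the first one it encounters.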
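Q12. Coefficient of x^7 in the equation y = (x^101 - 1)(x^100 + 1)(x^99 - 1)...(x^0 + 1)?
Find the coefficient of x^7 in the expanded product.
Expanding the product means picking either the x^k term or the constant from each factor, so the binomial theorem does not apply directly
A term in x^7 arises whenever the chosen exponents are distinct and sum to 7, so only factors with k <= 7 can contribute an x^k term
The coefficient of x^7 is the sum, over all such choices, of the product of the constants taken from the remaining factors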
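The coefficient can also be checked numerically by multiplying the factors one at a time and discarding every power above x^7. This sketch assumes the sign pattern suggested by the question: factors with an odd exponent are (x^k - 1) and factors with an even exponent, including x^0, are (x^k + 1).

```python
def coefficient(target=7, max_power=101):
    """Coefficient of x^target in the product of (x^k + c_k) for k = 0..max_power."""
    # coeffs[i] holds the coefficient of x^i; powers above target are dropped
    coeffs = [1] + [0] * target
    for k in range(max_power + 1):
        c = 1 if k % 2 == 0 else -1  # assumed sign pattern
        # contribution of the constant term c
        new = [c * coeffs[i] for i in range(target + 1)]
        # contribution of the x^k term
        for i in range(k, target + 1):
            new[i] += coeffs[i - k]
        coeffs = new
    return coeffs[target]


x7_coefficient = coefficient()  # 10 under the assumed sign pattern
```

The result agrees with a hand count: there are five sets of distinct exponents summing to 7 ({7}, {6,1}, {5,2}, {4,3}, {4,2,1}), and each contributes 2 once the remaining constants and the (x^0 + 1) = 2 factor are multiplied in.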
Q13. From where did you complete the Data Science course?
I completed the Data Science course at XYZ University.
Completed Data Science course at XYZ University
Received hands-on training in machine learning algorithms
Worked on real-world projects during the course
Q14. Explain Random forest. What is gini impurity.
Random forest is an ensemble learning method that constructs a multitude of decision trees and outputs the mode of the classes. Gini impurity is a measure of impurity or randomness used in decision trees.
Random forest is a collection of decision trees, each trained on a different bootstrap sample of the data.
At each split, a tree considers only a random subset of the features, which decorrelates the trees.
The final prediction is made by taking the mode of the predictions of all the trees.
Gini impurity i...read more
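For reference, Gini impurity for a set of class labels is 1 minus the sum of squared class proportions. A minimal sketch, with a hypothetical function name:

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 - sum(p_i ** 2) over the class proportions p_i."""
    n = len(labels)
    if n == 0:
        return 0.0  # convention chosen here for an empty node
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())
```

A pure node (one class only) scores 0, and a perfectly balanced two-class node scores 0.5, which is why decision trees prefer splits that lower this value.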
Q15. Which ML algorithm did you use in your project?
I used the Random Forest algorithm in my project.
Random Forest is an ensemble learning method that combines multiple decision trees to make predictions.
It is used for both classification and regression tasks.
Random Forest reduces overfitting and provides feature importance.
Example: I used Random Forest to predict customer churn in a telecom company.
Q16. What is the difference between call by reference and call by value?
Call by value passes a copy of the value while call by reference passes the address of the value.
Call by value does not modify the original value while call by reference can modify the original value.
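Which mechanism applies depends on the language: C passes everything by value (pointers are used to simulate references), C++ supports both, and Python passes object references, so mutable arguments can be changed in place.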
Q17. How to Choose K value in K means? if there are any techniques, name them and explain.
Choosing the optimal K value in K-means clustering is crucial for accurate results.
Elbow method: Plotting the sum of squared distances vs. K and selecting the K value where the curve bends like an elbow.
Silhouette method: Calculating the average silhouette score for different K values and choosing the one with the highest score.
Gap statistic method: Comparing the within-cluster dispersion to a reference null distribution to find the optimal K value.
Cross-validation: Splitting...read more
Q18. Similar table. Find students who scored more than the average marks in both 11th and 12th.
Find students who scored more than the average marks in both 11th and 12th grades.
Calculate the average marks across all students for 11th grade and for 12th grade.
Compare each student's marks with the respective average marks to find those who scored higher in both grades.
Q19. SQL query: customers who have ordered all products from all categories.
Use a SQL query to find customers who have ordered all products from all categories.
Join the Customers, Orders, and Products tables
Group by customer and count the distinct products ordered
Keep the customers whose distinct product count equals the total number of products across all categories
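The counting approach above can be tried out with Python's built-in sqlite3 module. The table and column names below (`products`, `orders`, `customer_id`, `product_id`) are hypothetical, since the source does not give the actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE orders (customer_id INTEGER, product_id INTEGER);
INSERT INTO products VALUES (1, 'books'), (2, 'books'), (3, 'toys');
-- customer 10 ordered every product (one of them twice); customer 20 did not
INSERT INTO orders VALUES (10, 1), (10, 2), (10, 3), (10, 1), (20, 1), (20, 3);
""")

query = """
SELECT customer_id
FROM orders
GROUP BY customer_id
HAVING COUNT(DISTINCT product_id) = (SELECT COUNT(*) FROM products)
"""
customers_with_everything = [row[0] for row in conn.execute(query)]
```

COUNT(DISTINCT ...) matters here: without DISTINCT, a customer who ordered the same product repeatedly could be counted as having ordered everything.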
Q20. What do you understand by joins in sql?
Joins in SQL are used to combine rows from two or more tables based on a related column between them.
Joins are used to retrieve data from multiple tables in a single query.
Common types of joins include inner join, left join, right join, and full outer join.
Joins are performed using the JOIN keyword and specifying the columns to join on.
Joins can be used to combine tables based on matching values or non-matching values.
Joins help in creating relationships between tables and fe...read more
Q21. Which algorithm would you prefer between Random Forest and XGBoost when the model has low bias?
When bias is already low, the remaining error is mostly variance, so Random Forest is usually the safer first choice, although a well-regularized XGBoost can also work.
Random Forest uses bagging: it averages many deep, low-bias trees, which mainly reduces variance.
XGBoost uses gradient boosting, which fits trees sequentially to the remaining error and mainly reduces bias.
XGBoost can still control variance through its learning rate, subsampling, and regularization parameters, so the final choice should be validated on held-out data.
In scenarios where the model already has low bias, XGBoost'...read more
Q22. What is overfitting, How to handle missing values, etc
Overfitting is when a model is too complex and fits the training data too closely, leading to poor performance on new data.
Regularization techniques like L1 and L2 can be used to prevent overfitting
Cross-validation can be used to evaluate model performance on new data
Reducing the complexity of the model can also help prevent overfitting
Handling missing values can be done by imputing them with mean, median or mode values
Alternatively, missing values can be dropped if they are ...read more
Q23. What is Sql? And what is a database?
SQL (Structured Query Language) is the standard language for managing and querying relational databases. A database is a structured collection of data.
SQL is used to retrieve, insert, update, and delete data from a database.
A database is an organized collection of data; the software that stores and manages it is a database management system (DBMS).
SQL allows users to define the structure of a database, create tables, and establish relationships between tables.
Examples of relational database management systems include MySQL, Oracle, and SQL Server.
Q24. What is the average number of cups of tea people drink in a day in Delhi?
The average number of cups of tea people drink in a day in Delhi varies depending on individual preferences and habits.
Tea consumption can range from roughly 1-5 cups per person per day.
Factors such as age, gender, occupation, and cultural background can influence tea consumption.
Some people may drink more tea in the morning for a caffeine boost, while others may prefer tea throughout the day for relaxation.
Tea consumption may also vary based on the season, with more tea being consum...read more
Q25. What motivated you to choose data science?
I chose data science because of its potential to solve complex problems and make meaningful insights from data.
Fascination with the power of data to drive decision-making
Interest in solving real-world problems using data-driven approaches
Passion for exploring patterns and trends in data
Desire to contribute to advancements in technology and innovation
Excitement about the interdisciplinary nature of data science
Examples: Predictive analytics in healthcare, fraud detection in fi...read more
Q26. What will happen if linear regression is used for classification
Using linear regression for classification can lead to inaccurate predictions and unreliable results.
Linear regression assumes a continuous output, making it unsuitable for discrete classification tasks.
It may not handle outliers well, leading to incorrect classification boundaries.
The predicted values may fall outside the 0-1 range for binary classification.
Logistic regression is a more appropriate choice for classification tasks.
Q27. What are mutable and immutable data structures?
Mutable data structures can be modified after creation, while immutable data structures cannot be changed once created.
Mutable data structures allow for in-place modifications, while immutable data structures require creating a new instance when modifications are needed.
Examples of mutable data structures include lists, dictionaries, and sets in Python.
Examples of immutable data structures include tuples and strings in Python.
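A short illustration of the difference, using throwaway variable names:

```python
# Lists are mutable: a change through one name is visible through every
# other name bound to the same object.
numbers = [1, 2, 3]
alias = numbers
alias.append(4)

# Tuples are immutable: item assignment raises TypeError.
point = (1, 2)
try:
    point[0] = 10
except TypeError:
    tuple_is_immutable = True
else:
    tuple_is_immutable = False

# Strings are immutable: methods return a new string and leave the
# original untouched.
name = "data"
upper_name = name.upper()
```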
Q28. What is regression model in ML and describe intercept and coefficient.
Regression model is a ML algorithm used to predict continuous numerical values. Intercept is the value of the predicted variable when all predictors are zero. Coefficient represents the change in the predicted variable for a one-unit change in the predictor variable.
Regression model predicts continuous numerical values.
Intercept is the value of the predicted variable when all predictors are zero.
Coefficient represents the change in the predicted variable for a one-unit change...read more
Q29. How would you handle missing data in time series data?
Handle missing data in time series data by imputation, interpolation, or deletion.
Impute missing values using mean, median, mode, or predictive modeling techniques.
Interpolate missing values using linear interpolation, spline interpolation, or time-based interpolation.
Delete rows with missing values if they are few and do not significantly impact the analysis.
Q30. What are linear regression and logistic regression?
Linear regression is a statistical method to model the relationship between a dependent variable and one or more independent variables. Logistic regression is used to model the probability of a binary outcome.
Linear regression is used for predicting continuous outcomes, while logistic regression is used for predicting binary outcomes.
Linear regression assumes a linear relationship between the independent and dependent variables, while logistic regression uses a logistic funct...read more
Q31. Are you aware of supervised and unsupervised learning?
Supervised learning uses labeled data to train a model, while unsupervised learning uses unlabeled data.
Supervised learning requires a target variable for training
Examples of supervised learning include regression and classification tasks
Unsupervised learning finds patterns in data without predefined labels
Examples of unsupervised learning include clustering and dimensionality reduction
Q32. Please write a dictionary and try to sort it.
A dictionary sorted in ascending order based on keys.
Create a dictionary with key-value pairs
Call sorted() on the dictionary's items() to get a list of (key, value) tuples ordered by key
Use the dict() constructor to create a new dictionary from the sorted list of tuples
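The steps above can be sketched as follows; the dictionary contents are made up for illustration. Sorting by value is shown as well, since interviewers often ask for it as a follow-up.

```python
scores = {"ravi": 95, "asha": 91, "meena": 85}

# sorted() on items() yields (key, value) tuples ordered by key;
# dict() rebuilds a dictionary that keeps that order (Python 3.7+).
by_key = dict(sorted(scores.items()))

# Sort by value instead, highest first.
by_value_desc = dict(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
```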
Q33. SQL syntax and difference between having and where clause
HAVING clause is used with GROUP BY to filter grouped rows, WHERE clause is used to filter individual rows.
HAVING clause is used with GROUP BY to filter grouped rows based on aggregate functions
WHERE clause is used to filter individual rows based on conditions
HAVING clause is applied after GROUP BY, WHERE clause is applied before GROUP BY
HAVING clause is normally used together with GROUP BY; without GROUP BY, most databases treat the whole result set as a single group
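To see both clauses in one query, here is a small sketch using Python's built-in sqlite3 module. The `trips` table and its columns are hypothetical sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE trips (driver TEXT, fare REAL, status TEXT);
INSERT INTO trips VALUES
  ('asha', 100, 'completed'), ('asha', 300, 'completed'),
  ('ravi', 50, 'completed'), ('ravi', 500, 'cancelled');
""")

query = """
SELECT driver, SUM(fare) AS total
FROM trips
WHERE status = 'completed'   -- filters individual rows before grouping
GROUP BY driver
HAVING SUM(fare) > 200       -- filters groups after aggregation
"""
rows = conn.execute(query).fetchall()
```

The cancelled 500 fare is removed by WHERE before any sum is computed, so only one driver passes the HAVING threshold.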
Q34. Explain OOP and its different concepts.
Object-oriented programming (OOP) is a programming paradigm that uses objects to represent and manipulate data.
OOP is based on the concept of classes and objects.
Encapsulation, abstraction, inheritance, and polymorphism are key principles of OOP.
Examples of OOP languages include Java, C++, and Python.
Q35. What is Digital Image Processing?
Digital Image Processing is the manipulation of digital images using algorithms to enhance, analyze, or extract information.
It involves techniques like image enhancement, restoration, segmentation, and compression.
Common applications include medical imaging, satellite imaging, facial recognition, and object detection.
Algorithms like edge detection, noise reduction, and image stitching are used in digital image processing.
It plays a crucial role in fields like computer vision,...read more
Q36. Different types of algorithms used in e-commerce companies
Recommendation, personalization, fraud detection, search algorithms are used in e-commerce companies.
Recommendation algorithms suggest products based on user behavior and preferences.
Personalization algorithms customize the user experience based on their past behavior.
Fraud detection algorithms identify and prevent fraudulent transactions.
Search algorithms help users find products based on their search queries.
Clustering algorithms group similar products together for easier n...read more
Q37. Given a number, check whether it is a palindrome or not.
Check if a number is a palindrome or not
Convert the number to a string
Reverse the string and compare it with the original string
If they are the same, the number is a palindrome
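The string-reversal approach above fits in a few lines. The function name is hypothetical; note that under this approach a negative number is never a palindrome, because of the leading minus sign.

```python
def is_palindrome(number):
    """Return True if the decimal digits of number read the same both ways."""
    text = str(number)
    # text[::-1] is the reversed string
    return text == text[::-1]
```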
Q38. Two good and two bad things you think about data science
Good and bad aspects of Data Science
Good: Data science helps in making informed decisions based on data-driven insights
Good: Data science can uncover valuable patterns and trends in large datasets
Bad: Data science can be time-consuming and resource-intensive
Bad: Data science may face challenges with data privacy and ethical considerations
Q39. How are you going to tackle high-pressure situations?
I tackle high-pressure situations by staying calm, prioritizing tasks, seeking help when needed, and maintaining a positive attitude.
Stay calm and composed under pressure
Prioritize tasks based on urgency and importance
Seek help or guidance from colleagues or supervisors
Maintain a positive attitude and focus on finding solutions
Q40. What technologies do you know?
I am familiar with a variety of technologies commonly used in data science, including programming languages, databases, and machine learning tools.
Programming languages: Python, R, SQL
Databases: MySQL, MongoDB
Machine learning tools: TensorFlow, scikit-learn
Q41. What is the central limit theorem? Why do we use it?
Central Limit Theorem states that the sampling distribution of the sample mean approaches a normal distribution as the sample size increases, regardless of the population's distribution, provided its variance is finite.
Central Limit Theorem is used to make inferences about a population mean based on the sample mean.
It allows us to use the properties of the normal distribution to estimate population parameters.
It is essential in hypothesis testing and constructing confidence intervals.
For example, if we take multiple samples of a population and calculat...read more
Q42. What is data science? What do you like about data science?
Data science is a field that uses scientific methods, algorithms, and systems to extract knowledge and insights from structured and unstructured data.
Data science involves analyzing large amounts of data to uncover patterns, trends, and insights.
It combines statistics, machine learning, and domain knowledge to solve complex problems.
Data science is used in various industries such as healthcare, finance, marketing, and more.
Examples include predicting customer behavior, optimi...read more
Q43. Formulas for Precision, Recall, accuracy, F1 Score.
Formulas for Precision, Recall, Accuracy, F1 Score in data science.
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
Accuracy = (TP + TN) / (TP + TN + FP + FN)
F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
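The four formulas can be computed directly from confusion-matrix counts. In this sketch, returning 0.0 when a denominator is zero is a convention chosen for illustration, not part of the definitions.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute precision, recall, accuracy, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall, "accuracy": accuracy, "f1": f1}


# Hypothetical counts: 6 true positives, 2 false positives,
# 4 false negatives, 8 true negatives
metrics = classification_metrics(tp=6, fp=2, fn=4, tn=8)
```

With these counts, precision is 0.75, recall is 0.6, accuracy is 0.7, and F1 is 2/3.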
Q44. Overfitting and underfitting: conditions and examples
Overfitting and underfitting are common issues in machine learning where the model either learns the noise in the training data or fails to capture the underlying patterns.
Overfitting occurs when a model learns the training data too well, including noise and outliers, leading to poor generalization on new data.
Underfitting happens when a model is too simple to capture the underlying patterns in the data, resulting in high bias and low variance.
Examples of overfitting include ...read more
Q45. What do you understand by a credit card company?
A credit card company is a financial institution that issues credit cards to consumers for making purchases and borrowing money.
Credit card companies issue credit cards to consumers for making purchases and borrowing money
They charge interest on outstanding balances and fees for late payments
They provide customer service for cardholders and handle disputes and fraud claims
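Examples include American Express and Discover, which issue cards and run their own networks; Visa and Mastercard are card networks whose cards are issued by partner banks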
Q46. What feature engineering methods have you used?
I have used techniques like one-hot encoding, feature scaling, polynomial features, and interaction terms for feature engineering.
One-hot encoding for categorical variables
Feature scaling for numerical variables
Polynomial features for capturing non-linear relationships
Interaction terms for capturing interactions between features
Q47. What happens when you type google.com?
Typing google.com sends a request to Google's servers, which respond with the website's HTML code.
The browser sends a DNS request to resolve the domain name 'google.com' to an IP address
The browser establishes a TCP connection with Google's server, followed by a TLS handshake because the site is served over HTTPS
The browser sends an HTTP request over that encrypted connection for the website's HTML code
The server responds with the HTML code, which the browser renders as a webpage
Q48. How would you deal with time series data?
Time series data can be analyzed using techniques like smoothing, decomposition, forecasting, and anomaly detection.
Use smoothing techniques like moving averages to remove noise and identify trends.
Apply decomposition methods like STL (Seasonal and Trend decomposition using Loess) to separate trend, seasonality, and residual components.
Utilize forecasting models such as ARIMA, SARIMA, or Prophet to predict future values.
Implement anomaly detection algorithms like Isolation Forest or...read more
Q49. Sort a nearly sorted array.
Sort a nearly sorted array (each element is at most k positions away from its sorted position) using a min heap
Create a min heap of size k+1
Insert the first k+1 elements into the min heap
For each remaining element, extract the min to the output and insert the new element
Extract all remaining elements from the min heap
Time complexity: O(n log k)
Example: with k = 1, [2, 1, 4, 3, 6, 5] becomes [1, 2, 3, 4, 5, 6]
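A minimal sketch of the heap-based steps above, using Python's heapq module. The function name is hypothetical, and the sketch assumes every element is at most k positions away from its sorted position.

```python
import heapq


def sort_nearly_sorted(values, k):
    """Sort a list in which each element is at most k positions out of place."""
    # min heap holding the first k + 1 elements
    heap = list(values[:k + 1])
    heapq.heapify(heap)
    result = []
    for v in values[k + 1:]:
        # heapreplace pops the smallest element, then pushes v
        result.append(heapq.heapreplace(heap, v))
    # drain whatever is left in the heap
    while heap:
        result.append(heapq.heappop(heap))
    return result
```

The heap never holds more than k + 1 elements, so each of the n heap operations costs O(log k).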
Q50. What is a GAN? Have you worked with it?
GAN stands for Generative Adversarial Network, a type of neural network used for generating new data.
Consists of two neural networks - generator and discriminator
Generator creates new data samples while discriminator tries to distinguish between real and generated data
Used in image generation, text generation, and other creative applications