Data Science Intern

100+ Data Science Intern Interview Questions and Answers for Freshers

Updated 10 Jan 2025

Q1. Rotate Matrix by 90 Degrees Problem Statement

Given a square matrix 'MATRIX' of non-negative integers, rotate the matrix by 90 degrees in an anti-clockwise direction using only constant extra space.

Input:

The ...
Ans.

The task is to rotate a square matrix by 90 degrees in an anti-clockwise direction using constant extra space.

  • Iterate through each layer of the matrix from outer to inner

  • For each layer, perform a four-way swap of elements

  • Continue this process until all layers have been rotated
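
The layer-by-layer steps above can be sketched as follows. This is a minimal illustration, not a reference solution; the function name `rotate_anticlockwise` is an assumption, and the input is assumed to be a square list of lists.

```python
def rotate_anticlockwise(matrix):
    """Rotate a square matrix 90 degrees anti-clockwise in place."""
    n = len(matrix)
    for layer in range(n // 2):
        first, last = layer, n - 1 - layer
        for i in range(first, last):
            offset = i - first
            top = matrix[first][i]                               # save the top element
            matrix[first][i] = matrix[i][last]                   # right  -> top
            matrix[i][last] = matrix[last][last - offset]        # bottom -> right
            matrix[last][last - offset] = matrix[last - offset][first]  # left -> bottom
            matrix[last - offset][first] = top                   # top    -> left
    return matrix


print(rotate_anticlockwise([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
```

Only one temporary variable (`top`) is used per four-way swap, which is what keeps the extra space constant.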

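Q2. Get the second-highest element from an array (duplicate elements are allowed). Required time complexity: O(N) in a single traversal; space complexity: O(1).

Ans.

Find the second-highest element of an array in one pass, using O(N) time and O(1) extra space.

  • Initialize two variables to store the highest and second-highest elements.

  • Traverse the array once: when an element exceeds the highest, move the old highest into the second-highest slot; when it lies strictly between the two, update only the second-highest.

  • If the second-highest distinct value is wanted, skip elements equal to the current highest so that duplicates of the maximum are not reported.

  • Return the second-highest element.

  • Handle edge cases such as an empty array, a single element, or all elements being equal.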
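
A minimal single-pass sketch of this approach is shown here. It assumes that "second highest" means the second-highest distinct value and returns `None` when no such value exists; the function name `second_highest` is hypothetical.

```python
def second_highest(values):
    """Return the second-highest distinct value, or None if there is none."""
    highest = second = None
    for v in values:
        if highest is None or v > highest:
            # New maximum: the old maximum becomes the runner-up.
            highest, second = v, highest
        elif v != highest and (second is None or v > second):
            # Strictly below the maximum but above the current runner-up.
            second = v
    return second


print(second_highest([5, 1, 5, 3]))
```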

Q3. What new feature would you like to add to Uber?
Ans.

I would like to add a feature that allows users to schedule rides in advance.

  • Users can schedule rides for important events or appointments

  • Option to choose specific driver or vehicle for scheduled rides

  • Notifications for upcoming scheduled rides

  • Ability to edit or cancel scheduled rides

Q4. How do you implement a machine learning algorithm based on a given case study, and which algorithm do you choose and why?

Ans.

To implement a machine learning algorithm based on a case study, choose an algorithm based on the type of data and problem to be solved.

  • Understand the problem statement and the type of data available.

  • Preprocess the data by handling missing values, encoding categorical variables, and scaling features.

  • Split the data into training and testing sets.

  • Choose an appropriate algorithm based on the problem type (classification, regression, clustering) and data characteristics.

  • Train the...

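Q5. Coefficient of x^7 in the expression y = (x^101 - 1)(x^100 + 1)(x^99 - 1)...(x^0 + 1)?

Ans.

Find the coefficient of x^7 in the given product.

  • The binomial theorem does not apply directly, because the factors are all different; expand the product by choosing either the power of x or the constant from each factor.

  • Only factors with exponent 7 or less can contribute a power of x to the x^7 term; every other factor contributes its constant (+1 or -1), and the factor (x^0 + 1) equals 2.

  • Sum the signed contributions of every set of distinct exponents that adds up to 7; that sum is the coefficient of x^7.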
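
One way to check the result is to multiply the factors numerically while discarding every power above x^7. The sketch below assumes the sign pattern suggested by the question, namely (x^k - 1) for odd k and (x^k + 1) for even k, with k running from 101 down to 0; the function name is hypothetical.

```python
def coefficient_in_product(max_exponent=101, target=7):
    """Coefficient of x^target in the product of (x^k + c_k) for k = 0..max_exponent.

    Assumed sign pattern: c_k = +1 for even k and -1 for odd k.
    """
    poly = [0] * (target + 1)   # poly[d] is the coefficient of x^d
    poly[0] = 1
    for k in range(max_exponent + 1):
        const = 1 if k % 2 == 0 else -1
        new = [0] * (target + 1)
        for d, c in enumerate(poly):
            if c == 0:
                continue
            new[d] += c * const       # pick the constant from this factor
            if d + k <= target:
                new[d + k] += c       # pick x^k from this factor
        poly = new
    return poly[target]


print(coefficient_in_product())  # 10 under the assumed sign pattern
```

Under that assumption, the five sets {7}, {6, 1}, {5, 2}, {4, 3} and {4, 2, 1} each contribute +1, and the factor (x^0 + 1) = 2 doubles the total.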

Q6. From where did you complete the Data Science course?

Ans.

I completed the Data Science course at XYZ University.

  • Completed Data Science course at XYZ University

  • Received hands-on training in machine learning algorithms

  • Worked on real-world projects during the course

Q7. Explain random forest. What is Gini impurity?

Ans.

Random forest is an ensemble learning method that builds many decision trees and, for classification, outputs the mode of their predicted classes. Gini impurity is a measure of how mixed the class labels are at a node of a decision tree.

  • Random forest is a collection of decision trees, each trained on a different bootstrap sample of the data.

  • At each split, a tree considers only a random subset of the features, which decorrelates the trees.

  • The final prediction is the mode of the trees' predictions for classification, or their average for regression.

  • Gini impurity i...
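
As a small illustration, Gini impurity for a set of labels is commonly computed as 1 minus the sum of squared class proportions. The sketch below uses that standard definition; the function name `gini_impurity` is an assumption.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 - sum(p_i ** 2) over the class proportions p_i."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


print(gini_impurity(["churn", "stay", "stay", "stay"]))
```

A pure node (one class only) scores 0, and a 50/50 split between two classes scores 0.5, the maximum for a binary problem.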

Q8. Which ML algorithm did you use in your project?

Ans.

I used the Random Forest algorithm in my project.

  • Random Forest is an ensemble learning method that combines multiple decision trees to make predictions.

  • It is used for both classification and regression tasks.

  • Random Forest reduces overfitting and provides feature importance.

  • Example: I used Random Forest to predict customer churn in a telecom company.

Q9. What is the difference between call by reference and call by value?

Ans.

Call by value passes a copy of the value while call by reference passes the address of the value.

  • Call by value does not modify the original value, while call by reference can modify the original value.

  • Which mechanism applies depends on the language: many languages copy simple types such as integers, while large objects are often passed by reference or by pointer to avoid copying.

Q10. How do you choose the value of K in K-means? If there are any techniques, name them and explain.

Ans.

Choosing the optimal K value in K-means clustering is crucial for accurate results.

  • Elbow method: Plot the within-cluster sum of squared distances against K and select the K where the curve bends like an elbow.

  • Silhouette method: Calculate the average silhouette score for different K values and choose the one with the highest score.

  • Gap statistic method: Compare the within-cluster dispersion to that expected under a reference null distribution to find the optimal K value.

  • Cross-validation: Splitting...

Q11. What is the difference between Data Definition Language (DDL) and Data Manipulation Language (DML)?

Ans.

DDL is used to define the structure of database objects, while DML is used to manipulate data within those objects.

  • DDL is used to create, modify, and delete database objects such as tables, indexes, and views.

  • DML is used to insert, update, retrieve, and delete data within those database objects.

  • Examples of DDL statements include CREATE TABLE, ALTER INDEX, and DROP VIEW.

  • Examples of DML statements include INSERT INTO, UPDATE SET, and DELETE FROM.

Q12. Similar table. Find students who scored more than the average marks in both 11th and 12th.

Ans.

Find students who scored more than the average marks in both 11th and 12th grades.

  • Calculate the average marks across all students for the 11th grade and for the 12th grade separately.

  • Compare each student's marks with the respective averages and keep those who scored higher in both grades.
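
The two steps above can be sketched with Python's built-in `sqlite3` module. The table `marks` and its columns `student`, `marks_11` and `marks_12` are hypothetical, since the original table definition is not given.

```python
import sqlite3


def students_above_both_averages(rows):
    """Return students whose marks beat the overall average in both grades."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema: one row per student with marks for each grade.
    conn.execute("CREATE TABLE marks (student TEXT, marks_11 REAL, marks_12 REAL)")
    conn.executemany("INSERT INTO marks VALUES (?, ?, ?)", rows)
    query = """
        SELECT student
        FROM marks
        WHERE marks_11 > (SELECT AVG(marks_11) FROM marks)
          AND marks_12 > (SELECT AVG(marks_12) FROM marks)
        ORDER BY student
    """
    result = [row[0] for row in conn.execute(query)]
    conn.close()
    return result


sample = [("A", 90, 80), ("B", 60, 95), ("C", 50, 40), ("D", 85, 90)]
print(students_above_both_averages(sample))
```

Each scalar subquery computes one grade-wide average, so the outer `WHERE` compares every student against both.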

Q13. SQL query: customers who have ordered all products from all categories.

Ans.

Use a SQL query to find customers who have ordered all products from all categories.

  • Join the Customers, Orders, and Products tables

  • Group by customer and count the distinct products ordered

  • Keep only customers whose distinct product count equals the total number of products; a customer who has ordered every product has necessarily covered every category
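
A minimal sketch of the group-and-count approach, run through Python's built-in `sqlite3` module. The simplified tables `products(product_id, category)` and `orders(customer, product_id)` are hypothetical stand-ins for the real schema.

```python
import sqlite3


def customers_with_all_products(products, orders):
    """Return customers whose orders cover every row in the products table."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical, simplified schema for illustration only.
    conn.execute("CREATE TABLE products (product_id INTEGER, category TEXT)")
    conn.execute("CREATE TABLE orders (customer TEXT, product_id INTEGER)")
    conn.executemany("INSERT INTO products VALUES (?, ?)", products)
    conn.executemany("INSERT INTO orders VALUES (?, ?)", orders)
    query = """
        SELECT o.customer
        FROM orders AS o
        JOIN products AS p ON p.product_id = o.product_id
        GROUP BY o.customer
        HAVING COUNT(DISTINCT o.product_id) = (SELECT COUNT(*) FROM products)
        ORDER BY o.customer
    """
    result = [row[0] for row in conn.execute(query)]
    conn.close()
    return result


products = [(1, "toys"), (2, "toys"), (3, "books")]
orders = [("ann", 1), ("ann", 2), ("ann", 3), ("ann", 1), ("bob", 1), ("bob", 3)]
print(customers_with_all_products(products, orders))
```

`COUNT(DISTINCT ...)` matters here because a customer may order the same product more than once.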

Q14. What is LDA? Represent LDA using a diagram.

Ans.

LDA stands for Latent Dirichlet Allocation, a topic modeling technique used in natural language processing.

  • LDA is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar.

  • It is commonly used in text mining to extract topics from a collection of documents.

  • LDA assumes that each document is a mixture of a small number of topics and that each word's presence is attributable to one of t...

Q15. What is the SQL query for calculating a moving average?

Ans.

The SQL query for calculating a moving average involves using window functions.

  • Use the OVER clause with the ORDER BY clause to define the window frame for the moving average calculation.

  • Use the AVG() function to calculate the average within the window frame.

  • Example: SELECT value, AVG(value) OVER (ORDER BY date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS moving_avg FROM table_name;
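
To try a query of this shape without a database server, it can be run through Python's built-in `sqlite3` module. This sketch assumes the bundled SQLite is version 3.25 or later (the first release with window functions); the table `readings` and its columns `day` and `value` are hypothetical.

```python
import sqlite3


def moving_average(rows):
    """Return (day, value, moving_avg) rows using a 3-row trailing window."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (day TEXT, value REAL)")
    conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
    query = """
        SELECT day, value,
               AVG(value) OVER (
                   ORDER BY day
                   ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
               ) AS moving_avg
        FROM readings
        ORDER BY day
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


sample = [("2025-01-01", 10), ("2025-01-02", 20), ("2025-01-03", 30), ("2025-01-04", 40)]
for row in moving_average(sample):
    print(row)
```

The first two rows average over fewer than three values because the window is truncated at the start of the data.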

Q16. What do you understand by joins in SQL?

Ans.

Joins in SQL are used to combine rows from two or more tables based on a related column between them.

  • Joins are used to retrieve data from multiple tables in a single query.

  • Common types of joins include inner join, left join, right join, and full outer join.

  • Joins are performed using the JOIN keyword and specifying the columns to join on.

  • Joins can be used to combine tables based on matching values or non-matching values.

  • Joins help in creating relationships between tables and fe...

Q17. Which algorithm would you prefer, Random Forest or XGBoost, when the model has low bias?

Ans.

When a model already has low bias, its error is dominated by variance, so Random Forest is usually the safer choice; XGBoost is mainly aimed at reducing bias.

  • Random Forest uses bagging: it averages many deep, low-bias trees trained on bootstrap samples, which mainly reduces variance.

  • XGBoost uses gradient boosting, which reduces bias by fitting each new tree to the errors of the current ensemble.

  • Boosting a model that already has low bias adds little and raises the risk of overfitting unless it is regularized, for example with a small learning rate or early stopping.

  • In scenarios where the model already has low bias, XGBoost'...

Q18. What is overfitting, how do you handle missing values, etc.?

Ans.

Overfitting is when a model is too complex and fits the training data too closely, leading to poor performance on new data.

  • Regularization techniques like L1 and L2 can be used to prevent overfitting

  • Cross-validation can be used to evaluate model performance on new data

  • Reducing the complexity of the model can also help prevent overfitting

  • Handling missing values can be done by imputing them with mean, median or mode values

  • Alternatively, missing values can be dropped if they are ...

Q19. What is SQL? And what is a database?

Ans.

SQL (Structured Query Language) is a language used for managing and manipulating relational databases. A database is a structured collection of data.

  • SQL is used to retrieve, insert, update, and delete data from a database.

  • A database stores and organizes data in a structured manner; the software that manages it is a database management system (DBMS).

  • SQL allows users to define the structure of a database, create tables, and establish relationships between tables.

  • Examples of relational database management systems include MySQL, Oracle, and SQL Server.

Q20. What is the average number of cups of tea people drink in a day in Delhi?

Ans.

The average number of cups of tea people drink in a day in Delhi varies depending on individual preferences and habits.

  • The average number of cups consumed can range from 1-5 per day.

  • Factors such as age, gender, occupation, and cultural background can influence tea consumption.

  • Some people may drink more tea in the morning for a caffeine boost, while others may prefer tea throughout the day for relaxation.

  • Tea consumption may also vary based on the season, with more tea being consum...

Q21. What will happen if linear regression is used for classification

Ans.

Using linear regression for classification can lead to inaccurate predictions and unreliable results.

  • Linear regression assumes a continuous output, making it unsuitable for discrete classification tasks.

  • It may not handle outliers well, leading to incorrect classification boundaries.

  • The predicted values may fall outside the 0-1 range for binary classification.

  • Logistic regression is a more appropriate choice for classification tasks.

Q22. What are mutable and immutable data structures?

Ans.

Mutable data structures can be modified after creation, while immutable data structures cannot be changed once created.

  • Mutable data structures allow for in-place modifications, while immutable data structures require creating a new instance when modifications are needed.

  • Examples of mutable data structures include lists, dictionaries, and sets in Python.

  • Examples of immutable data structures include tuples and strings in Python.
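
A short sketch of the difference, using only built-in Python types. The function name `demo_mutability` is hypothetical and exists only to bundle the checks.

```python
def demo_mutability():
    """Contrast in-place changes to a list with the immutability of tuples and strings."""
    numbers = [1, 2, 3]
    alias = numbers
    alias.append(4)              # in-place change, visible through both names

    point = (1, 2)
    try:
        point[0] = 99            # tuples do not support item assignment
        tuple_changed = True
    except TypeError:
        tuple_changed = False

    word = "data"
    upper = word.upper()         # returns a new string; `word` is untouched
    return numbers, tuple_changed, word, upper


print(demo_mutability())
```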

Q23. How would you handle missing data in time series data?

Ans.

Handle missing data in time series data by imputation, interpolation, or deletion.

  • Impute missing values using mean, median, mode, or predictive modeling techniques.

  • Interpolate missing values using linear interpolation, spline interpolation, or time-based interpolation.

  • Delete rows with missing values if they are few and do not significantly impact the analysis.
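
As an illustration of the interpolation option, here is a minimal pure-Python sketch of linear interpolation. It assumes evenly spaced observations with gaps marked as `None`, and it leaves leading and trailing gaps untouched; the function name is hypothetical.

```python
def interpolate_linear(series):
    """Fill None gaps by linear interpolation between the nearest known values."""
    filled = list(series)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled


print(interpolate_linear([1.0, None, 3.0, None, None, 6.0]))
```

Gaps at the very start or end have no neighbour on one side, so a different strategy such as forward or backward fill would be needed there.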

Q24. What is regression model in ML and describe intercept and coefficient.

Ans.

Regression model is a ML algorithm used to predict continuous numerical values. Intercept is the value of the predicted variable when all predictors are zero. Coefficient represents the change in the predicted variable for a one-unit change in the predictor variable.

  • Regression model predicts continuous numerical values.

  • Intercept is the value of the predicted variable when all predictors are zero.

  • Coefficient represents the change in the predicted variable for a one-unit change...
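
To make the intercept and coefficient concrete, here is a minimal sketch of simple linear regression with one predictor, fitted by ordinary least squares in pure Python. The function name and the sample data are assumptions for illustration.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (intercept, coefficient) of the least-squares line y = intercept + coefficient * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    coefficient = sxy / sxx                      # change in y per one-unit change in x
    intercept = mean_y - coefficient * mean_x    # predicted y when x is zero
    return intercept, coefficient


# Data generated from y = 1 + 2x, so the fit should recover those values.
print(fit_simple_linear_regression([0, 1, 2, 3], [1, 3, 5, 7]))
```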

Q25. Are you aware of supervised and unsupervised learning?

Ans.

Supervised learning uses labeled data to train a model, while unsupervised learning uses unlabeled data.

  • Supervised learning requires a target variable for training

  • Examples of supervised learning include regression and classification tasks

  • Unsupervised learning finds patterns in data without predefined labels

  • Examples of unsupervised learning include clustering and dimensionality reduction

Q26. SQL syntax and difference between having and where clause

Ans.

HAVING clause is used with GROUP BY to filter grouped rows, WHERE clause is used to filter individual rows.

  • HAVING clause is used with GROUP BY to filter grouped rows based on aggregate functions

  • WHERE clause is used to filter individual rows based on conditions

  • HAVING clause is applied after GROUP BY, WHERE clause is applied before GROUP BY

  • HAVING is normally used in a SELECT statement that contains a GROUP BY clause; without GROUP BY, many databases treat the whole result set as a single group
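
The order of evaluation can be seen in a small sketch run through Python's built-in `sqlite3` module. The table `sales` and its columns `region` and `amount` are hypothetical.

```python
import sqlite3


def regions_over_threshold(rows, threshold):
    """Sum positive amounts per region and keep regions whose total exceeds threshold."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    query = """
        SELECT region, SUM(amount) AS total
        FROM sales
        WHERE amount > 0          -- row filter, applied before grouping
        GROUP BY region
        HAVING SUM(amount) > ?    -- group filter, applied after aggregation
        ORDER BY region
    """
    result = conn.execute(query, (threshold,)).fetchall()
    conn.close()
    return result


sample = [("north", 80), ("north", 50), ("south", 60), ("south", -500), ("east", 30)]
print(regions_over_threshold(sample, 100))
```

The negative row is discarded by `WHERE` before any totals are computed, and only then does `HAVING` compare each region's total with the threshold.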

Q27. Different types of OOP; explain OOP.

Ans.

Object-oriented programming (OOP) is a programming paradigm that uses objects to represent and manipulate data.

  • OOP is based on the concept of classes and objects.

  • Encapsulation, abstraction, inheritance, and polymorphism are key principles of OOP.

  • Examples of OOP languages include Java, C++, and Python.

Q28. What is Digital Image Processing?

Ans.

Digital Image Processing is the manipulation of digital images using algorithms to enhance, analyze, or extract information.

  • It involves techniques like image enhancement, restoration, segmentation, and compression.

  • Common applications include medical imaging, satellite imaging, facial recognition, and object detection.

  • Algorithms like edge detection, noise reduction, and image stitching are used in digital image processing.

  • It plays a crucial role in fields like computer vision,...

Q29. Given a number, check whether it is a palindrome or not.

Ans.

Check if a number is a palindrome or not

  • Convert the number to a string

  • Reverse the string and compare it with the original string

  • If they are the same, the number is a palindrome
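
The string-reversal steps above can be sketched in a few lines. The function name `is_palindrome` is an assumption; under this approach a negative number is never a palindrome because of its leading minus sign.

```python
def is_palindrome(number):
    """Return True if the decimal digits of number read the same in both directions."""
    text = str(number)
    return text == text[::-1]


print(is_palindrome(12321), is_palindrome(123))
```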

Q30. Two good and two bad things you think about data science.

Ans.

Good and bad aspects of Data Science

  • Good: Data science helps in making informed decisions based on data-driven insights

  • Good: Data science can uncover valuable patterns and trends in large datasets

  • Bad: Data science can be time-consuming and resource-intensive

  • Bad: Data science may face challenges with data privacy and ethical considerations

Q31. How are you going to tackle high-pressure situations?

Ans.

I tackle high-pressure situations by staying calm, prioritizing tasks, seeking help when needed, and maintaining a positive attitude.

  • Stay calm and composed under pressure

  • Prioritize tasks based on urgency and importance

  • Seek help or guidance from colleagues or supervisors

  • Maintain a positive attitude and focus on finding solutions

Q32. What technologies do you know?

Ans.

I am familiar with a variety of technologies commonly used in data science, including programming languages, databases, and machine learning tools.

  • Programming languages: Python, R, SQL

  • Databases: MySQL, MongoDB

  • Machine learning tools: TensorFlow, scikit-learn

Q33. What is data science? What do you like about data science?

Ans.

Data science is a field that uses scientific methods, algorithms, and systems to extract knowledge and insights from structured and unstructured data.

  • Data science involves analyzing large amounts of data to uncover patterns, trends, and insights.

  • It combines statistics, machine learning, and domain knowledge to solve complex problems.

  • Data science is used in various industries such as healthcare, finance, marketing, and more.

  • Examples include predicting customer behavior, optimi...

Q34. Formulas for Precision, Recall, accuracy, F1 Score.

Ans.

Formulas for Precision, Recall, Accuracy, F1 Score in data science.

  • Precision = TP / (TP + FP)

  • Recall = TP / (TP + FN)

  • Accuracy = (TP + TN) / (TP + TN + FP + FN)

  • F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
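
The four formulas translate directly into code. This is a small sketch with a hypothetical function name; returning 0.0 when a denominator is zero is a convention chosen here, not part of the formulas.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute precision, recall, accuracy and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    if precision + recall:
        f1 = 2 * (precision * recall) / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall, "accuracy": accuracy, "f1": f1}


print(classification_metrics(tp=6, tn=10, fp=2, fn=2))
```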

Q35. Overfitting and underfitting: conditions and examples.

Ans.

Overfitting and underfitting are common issues in machine learning where the model either learns the noise in the training data or fails to capture the underlying patterns.

  • Overfitting occurs when a model learns the training data too well, including noise and outliers, leading to poor generalization on new data.

  • Underfitting happens when a model is too simple to capture the underlying patterns in the data, resulting in high bias and low variance.

  • Examples of overfitting include ...

Q36. What is SVM? Any project you performed using it?

Ans.

SVM stands for Support Vector Machine, a supervised machine learning algorithm used for classification and regression tasks.

  • SVM finds the hyperplane that best separates different classes in the feature space.

  • It can handle both linear and non-linear data by using different kernel functions.

  • Example project: Sentiment analysis using SVM to classify movie reviews as positive or negative.

Q37. What do you understand by a credit card company?

Ans.

A credit card company is a financial institution that issues credit cards to consumers for making purchases and borrowing money.

  • They charge interest on outstanding balances and fees for late payments

  • They provide customer service for cardholders and handle disputes and fraud claims

  • Examples include American Express and Discover, which issue cards themselves; Visa and Mastercard are payment networks whose cards are issued by banks

Q38. What feature engineering methods have you used?

Ans.

I have used techniques like one-hot encoding, feature scaling, polynomial features, and interaction terms for feature engineering.

  • One-hot encoding for categorical variables

  • Feature scaling for numerical variables

  • Polynomial features for capturing non-linear relationships

  • Interaction terms for capturing interactions between features

Q39. What happens when you type google.com?

Ans.

Typing google.com sends a request to Google's servers, which respond with the website's HTML code.

  • The browser sends a DNS request to resolve the domain name 'google.com' to an IP address

  • The browser establishes a TCP connection with Google's server

  • The browser sends an HTTP request to the server for the website's HTML code

  • The server responds with the HTML code, which the browser renders as a webpage

Q40. How would you deal with time series data?

Ans.

Time series data can be analyzed using techniques like smoothing, decomposition, forecasting, and anomaly detection.

  • Use smoothing techniques like moving averages to remove noise and identify trends.

  • Apply decomposition methods like STL (seasonal and trend decomposition using Loess) to separate trend, seasonality, and residual components.

  • Utilize forecasting models such as ARIMA, SARIMA, or Prophet to predict future values.

  • Implement anomaly detection algorithms like Isolation Forest or...

Q41. When do type errors and syntax errors occur?

Ans.

Type error occurs when an operation is performed on an object of an inappropriate type, while syntax error occurs when the code is not written according to the syntax rules of the programming language.

  • Type error occurs when trying to perform an operation on incompatible data types, such as adding a string to an integer.

  • Syntax error occurs when the code is not written correctly according to the rules of the programming language, such as missing parentheses or semicolons.

  • Type e...
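
In Python the two errors surface at different stages, which the following sketch tries to show: a syntax error is raised when the source is compiled, before anything runs, while a type error is raised only when the offending operation executes. The helper name `classify_error` is hypothetical.

```python
def classify_error(source):
    """Report whether a snippet fails at compile time, at run time, or not at all."""
    try:
        code = compile(source, "<example>", "exec")   # parsing happens here
    except SyntaxError:
        return "SyntaxError"
    try:
        exec(code, {})                                # execution happens here
    except TypeError:
        return "TypeError"
    return "ok"


print(classify_error("print('hi'"))   # unbalanced parenthesis
print(classify_error("'1' + 1"))      # adding a string to an integer
```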

Q42. Sort a nearly sorted array.

Ans.

Sort a nearly sorted array, in which every element is at most k positions away from its sorted position, using a min heap.

  • Create a min heap of size k+1

  • Insert first k+1 elements into min heap

  • For each remaining element, extract the min into the output and insert the new element

  • Extract all remaining elements from min heap

  • Time complexity: O(n log k)

  • Example (k = 1): [2, 1, 4, 3, 6, 5] becomes [1, 2, 3, 4, 5, 6]
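
The heap-based steps can be sketched with Python's `heapq` module. The function name `sort_nearly_sorted` is an assumption, and the input is assumed to be k-sorted, meaning every element is at most k positions from where it belongs.

```python
import heapq


def sort_nearly_sorted(values, k):
    """Sort a k-sorted list using a min heap that never holds more than k + 1 items."""
    heap = list(values[:k + 1])
    heapq.heapify(heap)
    result = []
    for v in values[k + 1:]:
        # heapreplace pops the smallest item, then pushes the new one.
        result.append(heapq.heapreplace(heap, v))
    while heap:
        result.append(heapq.heappop(heap))
    return result


print(sort_nearly_sorted([6, 5, 3, 2, 8, 10, 9], 3))
```

Because the heap holds at most k + 1 items, each push or pop costs O(log k), which gives the O(n log k) total.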

Q43. What is a GAN? Have you worked with it?

Ans.

GAN stands for Generative Adversarial Network, a type of neural network used for generating new data.

  • Consists of two neural networks - generator and discriminator

  • Generator creates new data samples while discriminator tries to distinguish between real and generated data

  • Used in image generation, text generation, and other creative applications

Q44. What programming knowledge do you have?

Ans.

Proficient in Python, R, and SQL with experience in data manipulation, visualization, and machine learning algorithms.

  • Proficient in Python for data analysis and machine learning tasks

  • Experience with R for statistical analysis and visualization

  • Knowledge of SQL for querying databases and extracting data

  • Familiarity with libraries such as Pandas, NumPy, Matplotlib, and Scikit-learn

Q45. Difference between array and linked list, and between stack and queue.

Ans.

Arrays store elements in contiguous memory, while linked lists use nodes with pointers. Stacks follow LIFO, queues follow FIFO.

  • Arrays store elements in contiguous memory locations, allowing for constant time access to elements using indices.

  • Linked lists use nodes with pointers to the next node, allowing for dynamic memory allocation and insertion/deletion at any position.

  • Stacks follow Last In First Out (LIFO) principle, where elements are added and removed from the same end (...

Q46. Why do you want to pursue data science

Ans.

I want to pursue data science because of my passion for analyzing and interpreting data to solve complex problems.

  • I enjoy working with data and finding patterns and insights

  • I want to use my skills to help businesses make data-driven decisions

  • Data science is a rapidly growing field with endless opportunities

  • I have experience in programming and statistics, which are essential skills for data science

  • For example, I have worked on projects analyzing customer behavior and predictin...

Q47. Which is the best clustering algorithm?

Ans.

There is no one-size-fits-all answer as the best clustering algorithm depends on the specific dataset and goals.

  • The best clustering algorithm depends on the dataset characteristics such as size, dimensionality, and noise level.

  • K-means is popular for its simplicity and efficiency, but may not perform well on non-spherical clusters or clusters of very different sizes.

  • DBSCAN is good for clusters of varying shapes and sizes, but may struggle with high-dimensional data.

  • Hierarchical clustering is useful for visualizing clus...

Q48. How to handle missing data in a dataset?

Ans.

Missing data can be handled by imputation, deletion, or using algorithms that can handle missing values.

  • Imputation: Fill missing values with mean, median, mode, or using predictive modeling.

  • Deletion: Remove rows or columns with missing values.

  • Algorithms: Use implementations such as XGBoost or LightGBM that handle missing values natively; Random Forest support depends on the library and version.

  • Consider the reason for missing data and choose the appropriate method for handling it.

Q49. How did you use that particular ML algorithm

Ans.

I used the Random Forest algorithm to predict customer churn in a telecom company.

  • Preprocessed the data by handling missing values and encoding categorical variables

  • Split the data into training and testing sets

  • Tuned hyperparameters using grid search

  • Trained the Random Forest model on the training data

  • Evaluated the model's performance using metrics like accuracy, precision, recall, and F1 score

  • Interpreted feature importance to understand key drivers of customer churn

Q50. What are Regularization Techniques ?

Ans.

Regularization techniques are methods used to prevent overfitting in machine learning models by adding a penalty term to the loss function.

  • Regularization techniques help in reducing the complexity of the model by penalizing large coefficients.

  • Common regularization techniques include L1 regularization (Lasso), L2 regularization (Ridge), and Elastic Net regularization.

  • Regularization helps in improving the generalization of the model by preventing it from fitting noise in the tr...

Made with ❤️ in India. Trademarks belong to their respective owners. All rights reserved © 2024 Info Edge (India) Ltd.
