Over the past few years, data has come to be viewed as a valuable asset, making data generation and collection a critical part of any business. Data science gives organizations the ability to process large volumes of data.
Every day, enormous volumes of data are generated worldwide, which has made data science an essential capability and created strong demand among recruiters for data scientists. To prepare yourself for these job roles, you should be familiar with the commonly asked Data Science Interview Questions and Answers.
- Define Data Science.
Data Science can be defined as an interdisciplinary field that combines predictive analytics, algorithms, statistics, system tools, and machine learning principles, with an emphasis on extracting knowledge from huge data sets or big data. These are repeatedly asked Data Science Interview Questions and Answers for Freshers.
- The use of data science can be seen in everyday life. How?
YouTube uses a data-science-based recommendation algorithm that tracks the history of previously watched videos and generates suggestions from it, which are displayed in the "Up next" section. This reduces the effort of manually searching for related videos.
- Why do we require data science?
In today's world, information is collected from various data sources, resulting in massive, heterogeneous data. Simple business intelligence tools cannot process this kind of data, so data science provides advanced analytics tools that use high-level algorithms for processing it.
- Give 3 reasons why data science is crucial for the industry.
Helps to build a personal connection with each customer so that their needs can be understood better.
Allows the organization to know their target audience.
Facilitates effective use of resources and provides the best possible solution.
- What is a data scientist?
Data scientists are expert analysts who gather and analyze huge structured and unstructured datasets. Their primary work involves transforming the available raw data into a usable form and presenting it in a way that is easy to understand.
- What responsibilities does a data scientist have?
Some of the responsibilities are:
Collecting data from various sources and also cleaning it.
Data analysis and processing
Understanding business requirements
Training and deployment of the model
Documentation, Visualization, and Presentation of final results
These are commonly asked Data Scientist Interview Questions and Answers.
- What is the use of data science in the healthcare industry?
Medical images such as X-rays, MRIs, and CT scans can be interpreted more easily with the help of data science. A data scientist can predict a patient's future health based on data from their medical history. Diseases like cancer, schizophrenia, and Alzheimer's can be diagnosed at early stages with the help of pattern matching and spectrum analysis. Data science also provides a greater understanding of how genes and tissues are affected by, and react to, certain drugs or diseases.
- Differentiate between supervised and unsupervised learning.
| Supervised Learning | Unsupervised Learning |
| --- | --- |
| Known and labeled input and output data is used. | The data used is unknown and unlabeled. |
| It has a feedback mechanism. | No such mechanism is present. |
| Its goal is to predict the outcome of new data. | Its goal is to gain insights from large volumes of data. |
| Eg: logistic regression, decision trees. | Eg: k-means, Apriori. |
- Define Linear Regression in terms of data science.
Linear Regression is a supervised learning algorithm that provides a mathematical relationship between two or more variables, one dependent and others independent.
Y = mx + c
Here Y is the dependent variable and x is the independent variable; m (the slope) and c (the intercept) are constants.
In data science, this relationship helps predict the outcome of events.
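As a quick illustration, here is a minimal sketch of fitting a linear regression with scikit-learn on made-up data (the library and the data values are assumptions for demonstration, not part of the original answer):

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: x is the independent variable, y the dependent variable (y = 2x + 1)
x = np.array([[1], [2], [3], [4], [5]])
y = np.array([3, 5, 7, 9, 11])

model = LinearRegression()
model.fit(x, y)

print(model.coef_[0], model.intercept_)  # approximately m = 2 and c = 1
print(model.predict([[6]]))              # predicted outcome for a new value, about 13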
- What is overfitting?
Overfitting is a modeling error. In overfitting, a model fits a particular set of data too closely (in some cases exactly), so it fails to fit other data or predict future outcomes. An overfitted model learns the training data too deeply, including its noise, and therefore fails to generalize.
- What do you mean by NLP?
NLP stands for Natural Language Processing. It is one of the branches of artificial intelligence that converts human language into a language that a machine understands.
- Give some examples of NLP in real life.
NLP is used in Google translator, chatbots, and various virtual voice assistants like Siri and Alexa. It also finds its application in sentence correction, text completion, or word suggestion.
- What are the drawbacks of linear regression?
Its usefulness is restricted to linear relationships.
Does not provide a descriptive relationship between variables.
Is highly sensitive to noise and outliers.
Assumes the observations are independent of one another.
- Differentiate between regression and classification.
| Regression | Classification |
| --- | --- |
| Used for predicting continuous values like age and salary. | Used for predicting discrete values like True or False. |
| It predicts ordered data. | It predicts unordered data. |
| Output quality is measured with error metrics such as root mean squared error. | Output quality is measured with accuracy. |
| Eg: decision tree regression. | Eg: random forest classification. |
- What is a root mean squared error?
Root mean square error is a general-purpose error metric for numerical predictions. It is a standard way to measure errors in a quantitative prediction model. Taking the square root of the mean of the squared errors gives the root mean squared error.
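Written as a formula, RMSE = sqrt( (1/n) * sum( (actual_i - predicted_i)^2 ) ), where n is the number of predictions.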
- What is logistic regression?
Logistic regression is a type of supervised learning algorithm. It is a statistical model that makes use of a logistic function to predict the binary value of a dependent variable like 0 and 1 or True and False. It is similar to linear regression except for the fact that linear regression is used for regression problems and logistic regression is used for solving classification problems.
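As a brief illustration, here is a minimal scikit-learn sketch of logistic regression on a made-up binary data set (the library and the data are assumptions for demonstration, not part of the original answer):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: hours studied vs. pass (1) / fail (0)
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression()
clf.fit(X, y)

print(clf.predict([[4.5]]))        # predicted class (0 or 1)
print(clf.predict_proba([[4.5]]))  # probability of each class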
- What are the assumptions of the logistic regression model?
Assumes that the dependent variable is binary.
The correlation between the independent variables is almost negligible.
Its accuracy is directly proportional to the size of the data set.
These are frequently asked Data Scientist Interview Questions and Answers for Freshers.
- What do you mean by regression analysis?
Regression analysis is a type of predictive modeling technique that establishes a relation between a dependent variable and one or more independent variables. When we have only one independent variable, it is called simple regression. In the case of more than one independent variable, it is called multiple regression. It can be further classified into linear regression and logistic regression.
- What are the types of logistic regression?
According to the number of categories, Logistic regression has the following types:
Binomial: The dependent variable can have only 2 types of value like 0 or 1.
Multinomial: Three or more values of the target variable are possible. These values are not ordered. Eg: sun, moon, and stars.
Ordinal: Three or more ordered values of target variables are allowed. For example, very low, low, medium, high, very high.
- What is data sampling?
Data sampling is a statistical technique in which we take a sample out of the whole data set and analyze it to find patterns in the original large data set. Sampling can be done in two ways: probability and non-probability.
- What are the advantages of sampling?
It helps in quick and easy analysis of the data set.
More efficient results.
Cost-effective models.
- Explain probability sampling.
All the elements of the population have a known and non-zero probability. Its features like bias and sampling error are usually known. It can further be divided into:
Simple random sampling: Subjects are randomly selected from the whole population.
Stratified sampling: Based on a common factor, the data is divided into subsets, and samples are collected randomly from each subset.
Cluster sampling: The data set is divided into clusters based on a defining factor, then a random cluster sample is analyzed.
Multistage sampling: This method involves subsetting the larger population into a number of clusters. The subsets are further divided based on a secondary factor, and the obtained clusters are sampled and analyzed. This division of clusters continues till multiple subsets are identified. It is a more complicated version of cluster sampling.
Systematic sampling: An interval is set at which a sample is created and data is extracted from the larger data set. Eg: If we select every 15th row in data containing 150 items, a sample size of 10 rows would be created.
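For illustration, here is a minimal pandas sketch of simple random and systematic sampling (the DataFrame and its contents are invented for demonstration):

import pandas as pd

# Hypothetical data set of 150 rows
df = pd.DataFrame({"value": range(150)})

# Simple random sampling: 10 rows selected at random
random_sample = df.sample(n=10, random_state=42)

# Systematic sampling: every 15th row, giving a sample of 10 rows
systematic_sample = df.iloc[::15]

print(len(random_sample), len(systematic_sample))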
- What is non-probability sampling?
In non-probability sampling, the analyst defines the factor based on which the data would be sampled and extracted. It can be difficult to estimate if the sample accurately represents the larger population.
Some of the non-probability data sampling methods are:
Convenience sampling: Easily available and accessible groups are used to collect data.
Consecutive sampling: Every subject that meets the criteria is selected until the sample size limit is reached.
Purposive or judgmental sampling: A predefined criterion is used to select the data from the sample.
Quota sampling: Equal representation is given to all subgroups within the sample population by the researcher.
- What is an underfit model?
An underfit model is a statistical model that cannot make accurate predictions because it fails to capture the relationship between the input and output data. Underfitting simply means that the model does not fit the data well. This usually happens when the model is not trained well or there are not enough features in the data.
- How can you reduce underfitting?
By increasing the complexity of the model.
Increasing the number of attributes (features).
Removing noise and outliers from the data.
Increasing the duration of training.
- Define bias and variance.
Bias is a kind of error caused by the assumptions made by the model. A high bias value causes an algorithm to miss the relevant relations between features and outputs.
Variance is a type of error that occurs due to fluctuations in the training set. A high-variance model is sensitive to even very small changes in the data: the higher the fluctuation, the higher the variance.
- Overfit models have low variance and high bias. True or False?
This statement is false. Overfit models have high variance and low bias.
- What are Type 1 and Type 2 errors?
In a Type 1 error, a true null hypothesis is rejected. It is also called a false positive.
A Type 2 error occurs when a false null hypothesis is accepted. It is known as a false negative.
- What kind of biases can occur during sampling?
In sampling, there are three kinds of biases:
Selection bias
Under coverage bias
Survivorship bias
- Give 2 real-life examples of Type 1 errors.
Beeping of metal detectors without the presence of any metal.
Convicting an innocent person. Data Science Online Course at FITA Academy provides extensive training on the Data Science lifecycle and its concepts with numerous real-time practices.
- List any 5 important languages used by data scientists.
Python
R
SQL
JavaScript
Scala
- What are Recommender Systems?
A Recommender System predicts the rating a user is likely to give to an item or product. It is a subset of information filtering techniques.
- What is the purpose of A/B Testing?
A/B testing is a type of randomized controlled experiment. The goal of this testing method is to find out which variant or version of a variable works better in a controlled environment. These are commonly asked Data Scientist Interview Questions and Answers for Freshers and Experienced candidates.
- List the important Python libraries for data science analysis.
Pandas
Matplotlib
SciPy
Seaborn
NumPy
Scikit-learn
- Mention some methods to reduce overfitting.
Some methods to reduce overfitting are:
Increasing the amount of training data.
Reducing the complexity of the model.
Ridge and Lasso Regularization.
Reducing the training time.
- Define data analysis.
Data analysis involves using statistical methods to collect, clean, analyze, and manipulate data in order to discover valuable information that can be used for better decision-making.
- What is univariate and bivariate analysis?
Data containing only one variable is known as univariate data, and the analysis performed on it is called univariate analysis. Eg: boxplot.
Bivariate data contains two variables. Bivariate analysis determines the relationship between the two variables.
- Explain unsupervised learning.
In unsupervised learning, a model trains itself without the use of any classification or labels in the data. They act without any supervision from the user. Eg: Kmeans.
- What is clustering?
Clustering is a technique of grouping objects into different sets or clusters in such a way that the objects belonging to the same cluster are more similar to each other than to the objects in other clusters.
- What are the different types of clustering techniques?
Density-Based Clustering
Distribution Based Clustering.
Partition Based Clustering
Hierarchical Clustering.
- Write a program to print numbers ranging from one to 50. For multiples of 3 it should print "Apple", for multiples of 5, print "Pine", and for multiples of both 3 and 5, print "Pineapple".
for num in range(1, 51):
    # check multiples of both 3 and 5 first, otherwise this branch would never be reached
    if num % 3 == 0 and num % 5 == 0:
        print("Pineapple")
    elif num % 3 == 0:
        print("Apple")
    elif num % 5 == 0:
        print("Pine")
    else:
        print(num)
- How will you deal with a data set containing more than 30% missing values?
For large data sets, we can remove the rows with missing data values and the rest of the data can be used to predict the values.
For small data sets, the mean of the column can replace the null values. This can be done using methods from Python's pandas library such as df.mean() and df.fillna(df.mean()).
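A minimal pandas sketch of the second approach (the column name and values are made up for illustration):

import numpy as np
import pandas as pd

# Hypothetical data set with missing values in the "age" column
df = pd.DataFrame({"age": [25, np.nan, 30, np.nan, 35]})

# Replace missing values with the column mean
df["age"] = df["age"].fillna(df["age"].mean())

print(df)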
- What does Kmeans clustering mean?
Kmeans is a type of unsupervised learning algorithm. It categorizes data into K groups or clusters on the basis of similarity. The similarity between data points is calculated using Euclidean distance.
- What are the steps of the Kmeans algorithm?
The K-means algorithm works as follows:
First, the number of clusters, k, is decided.
A mean value for each cluster is randomly selected.
The data points are assigned to each cluster depending on the closest distance to the mean value.
The mean value is updated to the average of the data points in the cluster.
This process is repeated until the assignments stop changing or the maximum number of iterations is reached, and then we have our desired clusters.
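A minimal scikit-learn sketch of K-means (the toy data points and the choice of k = 2 are invented for illustration):

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D data points forming two loose groups
X = np.array([[1, 2], [1, 4], [2, 3],
              [8, 8], [9, 10], [10, 9]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(X)

print(kmeans.labels_)           # cluster assignment for each point
print(kmeans.cluster_centers_)  # mean value (centroid) of each cluster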
- How will you calculate the euclidean distance between 2 data points?
For two data points A(x, y) and B(x1, y1), the Euclidean distance is calculated as:
sqrt( (x - x1)**2 + (y - y1)**2 )
- How do statistics benefit data scientists?
Statistics help in summarizing the data quickly. It provides various tools for analyzing the data. Statistics concepts help data scientists in gaining valuable insights from the data to perform quantitative analysis on it. Statistical methods such as classification, regression, hypothesis testing, time-series analyses are of great assistance to data scientists while experimenting on the data.
- What is data wrangling?
Data wrangling is the process of cleaning the data and organizing it so that it can be used for analyses.
- What is the prime difference between a data scientist and a data analyst?
A data analyst works on existing data while a data scientist finds new methods of manipulating, capturing, and analyzing the data for the use of data analysts.
- What are the different types of data analyses?
There are 4 types of analysis:
Predictive analysis
Prescriptive analysis
Descriptive analysis
Diagnostic analysis
- What will be the Euclidean distance between A(3,4) and B(5,2)?
ED = sqrt( (3-5)^2 + (4-2)^2 ) = sqrt(8) ≈ 2.83
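A quick check of this arithmetic in Python (using math.dist, available in Python 3.8+):

import math

# Euclidean distance between A(3, 4) and B(5, 2)
distance = math.dist((3, 4), (5, 2))
print(round(distance, 2))  # 2.83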
- What are the commonly used algorithms for data science?
Linear regression
Random Forest
KNN
Logistic regression
- What is a decision tree? Which algorithm is used to build it?
The decision tree algorithm is a type of supervised learning algorithm that can be used to solve classification and regression problems. It has a tree-like structure where the internal nodes represent the attributes of a dataset, the branches represent the decision, and outcomes are represented by leaf nodes. It is a graphical representation of problems and their solutions according to the given conditions.
The CART algorithm is used for building the tree. It stands for Classification and Regression Tree algorithm. These are commonly asked Data Science Interview Questions and Answers.
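For illustration, here is a minimal scikit-learn sketch of a decision tree classifier (scikit-learn's tree implementation is an optimized version of CART; the bundled iris data set is used purely for demonstration):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Load a small sample data set and fit a decision tree
X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X, y)

print(tree.predict(X[:5]))  # predicted classes for the first five samples
print(tree.get_depth())     # depth of the fitted tree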
- What is dimensionality reduction?
Dimensionality reduction is the process of reducing the size of a data set by removing some of its attributes in such a way that the information it conveys is largely preserved.
- What is the use of decision trees?
They are easy to understand, as they mimic human thinking while making a decision.
The tree-like structure makes understanding the model easy.
- What is pruning? Why is it done?
The process of eliminating unwanted tree nodes to obtain an optimal tree is known as pruning. It is done to reduce the complexity of the tree and preserve its accuracy by preventing overfitting.
- List the steps in making a decision tree.
The entire dataset is taken as input.
Find a test or split such that the separation of the classes is maximum.
The split is applied to the input data. This is known as the division step.
Apply steps one and two again to the divided data.
Stop at stopping criteria.
The tree is cleaned up if there are too many splits.
- What is ensemble learning?
In ensemble learning, multiple learners are combined in order to improve the model's stability and predictive power. There are two main types of ensemble learning methods: bagging and boosting.
- Explain bagging and boosting.
Bagging trains the same type of learner on multiple random samples of the data and combines (averages) their predictions, which reduces variance.
Boosting helps build stronger models by reducing bias. It iterates over weak learners, adjusting the weight of each observation based on the previous classification.
- Elucidate RMSE?
The term RMSE stands for "root mean square error". It is a measure of accuracy in regression. Generally, RMSE lets you calculate the overall magnitude of the error produced by a regression model. You can calculate the RMSE by the method given below:
Firstly, calculate the errors in the predictions made by the regression model. To do this, take the differences between the actual and predicted values.
Secondly, you should square those errors.
Thirdly, you can calculate the mean of the square errors.
Finally, you should take the square root of the total mean of all the squared errors.
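Following those steps with NumPy (the actual and predicted values below are made up for illustration):

import numpy as np

# Hypothetical actual and predicted values
actual = np.array([3.0, 5.0, 7.5, 10.0])
predicted = np.array([2.5, 5.5, 7.0, 11.0])

errors = actual - predicted                  # step 1: differences
squared_errors = errors ** 2                 # step 2: square the errors
mean_squared_error = squared_errors.mean()   # step 3: mean of the squared errors
rmse = np.sqrt(mean_squared_error)           # step 4: square root of the mean

print(rmse)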
- List 3 advantages of decision trees.
Compared to other algorithms, it requires less data cleaning.
Follows the same decision-making approach as a human.
Extremely useful in decision-related problems.
- What is imbalanced data?
Data that is allocated to different categories in a highly unequal manner is called imbalanced data. It gives disproportionate weight to the larger categories in the data set, affecting the performance of a model.
- What is a random forest algorithm?
Random forest is an ensemble learning method that uses a supervised learning approach. It builds multiple decision trees on various subsets of the data and combines their outputs (by averaging or voting) for improved predictive accuracy.
- What are the benefits of using a random forest?
Provides high accuracy irrespective of the size of the dataset.
Less training time.
Accuracy is maintained in case of missing data as well.
Can help in classification as well as regression.
- List a few disadvantages of using a decision tree.
High complexity due to the presence of multiple layers.
Can produce an overfit model.
Computational complexity increases with an increase in the number of class labels.
- What are the applications of random forest in the banking and medicinal sector?
Random forest helps in identifying the risk of a loan in the banking sector. In the healthcare sector, it helps in finding patterns of diseases and the risk they can cause.
- What do you mean by cross-validation?
Cross-validation helps in estimating the accuracy of a model. It is a statistical method in which a part of a data set, called validation data, is removed while training the model and later on used for testing the model. If positive results are received after testing, the model is approved.
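A minimal scikit-learn sketch of k-fold cross-validation (the choice of model, the bundled iris data set, and cv = 5 are illustrative assumptions, not prescribed by the original answer):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation: train on 4 folds, validate on the held-out fold each time
scores = cross_val_score(model, X, y, cv=5)
print(scores)         # accuracy on each fold
print(scores.mean())  # average accuracy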
- What is LASSO?
LASSO is a regression analysis method that stands for Least Absolute Shrinkage and Selection Operator. It enhances the accuracy of a model by performing both variable selection and regularization.
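A minimal scikit-learn sketch of LASSO regression (the toy data and alpha = 0.1 are invented for illustration; alpha controls the strength of the penalty):

import numpy as np
from sklearn.linear_model import Lasso

# Hypothetical data with three features; the third feature is uninformative
X = np.array([[1, 2, 0], [2, 4, 1], [3, 6, 0], [4, 8, 1], [5, 10, 0]])
y = np.array([5, 10, 15, 20, 25])

lasso = Lasso(alpha=0.1)
lasso.fit(X, y)

# Coefficients of uninformative features are shrunk towards (or to) zero
print(lasso.coef_)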
- How can overfitting be avoided?
Using fewer variables in the dataset.
Making use of techniques like cross-validation.
Using regularization techniques such as LASSO.
- What is hypothesis testing?
A hypothesis is a theory that describes the nature of a population. Hypothesis testing compares two mutually exclusive statements about a population and concludes the statement which best describes the sample data.
- What is the p-value?
P-value is a numerical value ranging from 0 to 1 which helps in determining the strength of your outcome in a hypothesis test. Data Science Course in Bangalore at FITA Academy imparts the students of the training program with the required skills and knowledge that are required for a professional Data Scientist.
- What happens when the p-value is <= 0.05 or > 0.05?
When the p-value is <= 0.05, there is strong evidence against the null hypothesis, so it is rejected. A value greater than 0.05 indicates weak evidence against the null hypothesis, so it is not rejected. These are commonly asked Data Science Interview Questions and Answers for Freshers.
- What do you mean by machine learning?
Machine learning is the ability of a machine to learn from data and automatically predict the outcomes of an event without being explicitly programmed by a developer.
- What are artificial neural networks?
Artificial neural networks are computational networks inspired by biological neural networks. They are designed to replicate the working of a human brain i.e., how the human brain processes and analyzes information. It adapts to the input to provide the best possible output.
- Name some cross-validation techniques.
K- Fold Cross-Validation
Leave p-out Cross-Validation
Leave-one-out cross-validation.
Holdout method
- Define deep learning.
Deep learning belongs to the family of machine learning algorithms. It is based on artificial neural networks. It contains 3 or more layers. Deep learning models absorb data and learn from it automatically.
- Which language is better for text analysis? R or Python?
Python's pandas library contains high-level data analysis tools and data structures that make it more suitable for text analysis, so Python is generally preferred.
- What is collaborative filtering?
Collaborative filtering is a technique used by recommender systems. Its algorithm automatically filters the preferences of a user and makes recommendations according to the user's interests.
- Give a real-life example of collaborative filtering.
The most popular e-commerce website, Amazon, makes use of collaborative filtering. If a buyer purchases items A and B, it recommends item C to the buyer based on the purchase histories of other customers who bought similar items.
- List any 5 websites using collaborative filtering.
Amazon
YouTube
Netflix
Spotify
LinkedIn
- What are different types of collaborative filtering?
Memory Based: Recommendations are made based on the likeness of an item through user rating information.
Model-Based: Data mining helps in creating models which find trends based on training data. Then predictions for actual data are made using these models.
Hybrid: It is a combined approach of memory and model-based collaborative filtering.
- Which algorithm would be best for the prediction of the death rate due to heart disease?
Linear regression would be the best algorithm, since it models the relationship between an outcome and multiple independent variables.
- What do you mean by regularization?
Regularization is the addition of a penalty term to the model's loss (or cost) function to discourage overly complex models. It is used to solve the problem of overfitting by keeping the model appropriately fitted to the data.
- What is the importance of data cleansing?
It is important to clean your data before using it because it increases the productivity of the model as it removes unwanted and duplicate values from the data. It eliminates the possibility of errors and inconsistencies in the model.
- List a few deep learning frameworks.
Pytorch
Tensorflow
Keras
Sonnet
Chainer
- Define precision.
Precision is a numerical value ranging from 0 to 1. It is the fraction of the results that the algorithm classifies as positive that are actually relevant.
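As a formula, Precision = TP / (TP + FP), where TP is the number of true positives and FP is the number of false positives.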
- State the law of large numbers.
The Law of large numbers states that with the increase in the number of trials, the mean or the average result comes in close range to the expected value.
- What is a normal distribution?
The normal distribution is a bell-shaped curve that shows the distribution of continuous variables. It is a kind of probability distribution that shows the position of variables with respect to the mean of the data.
- In which scenarios is an algorithm updated?
If there is a change in the data source.
When the underlying data model evolves with the infrastructure.
If the underlying data is non-stationary.
- What is the t-test?
The t-test helps in determining the similarity or differences between the means of two groups. It is often used in hypothesis testing to test the differences between the two populations.
- Explain the DBSCAN algorithm.
The DBSCAN algorithm is a clustering technique that uses an unsupervised learning approach. It divides the data set into clusters based on the distance between data points and the number of points that lie close together. It uses two primary parameters for clustering:
Epsilon (eps) - the maximum distance between two data points for them to be considered neighbors.
MinPts - the minimum number of data points that must lie within an epsilon neighborhood to form a dense region.
- List some real-life applications of deep learning.
Self-driving cars
Virtual assistants
Chatbots
Computer vision
Image processing
- What are the different types of data points in DBSCAN?
Core Point: It has at least MinPts points within its epsilon neighborhood.
Border Point: It lies in the neighborhood of a core point but has fewer than MinPts points within its own epsilon neighborhood.
Noise or outlier: It is neither a core point nor a border point.
- State advantages of the DBSCAN algorithm.
No need to set the number of clusters.
Clusters can be arbitrarily shaped.
Remains unaffected by outliers.
Only 2 parameters are required.
Is not sensitive to data ordering.
- What are the disadvantages of DBSCAN?
Data points reachable from more than one cluster can be placed in any cluster.
Data with widely varying densities cannot be clustered properly.
Choosing epsilon can be difficult if data is not well understood.
- Which R packages are used for DBSCAN implementation?
dbscan
fpc
factoextra
- Write code for implementation of DBSCAN in R.
install.packages(c("factoextra", "fpc", "dbscan"))
library(factoextra)
library(fpc)
library(dbscan)

# Example data set bundled with factoextra
data("multishapes", package = "factoextra")
df <- multishapes[, 1:2]

# Run DBSCAN with eps = 0.15 and a minimum of 5 points per cluster
db <- fpc::dbscan(df, eps = 0.15, MinPts = 5)
plot(db, df, main = "DBSCAN", frame = FALSE)
- Explain Naive Bayes classifier.
The Naive Bayes classifier is a type of probabilistic classifier based on Bayes' theorem. It assumes that the features are independent of each other. It can be combined with kernel functions to increase its accuracy.
- What is the difference between the validation set and the test set?
The validation set is used for parameter selection so that overfitting can be avoided whereas the test set is used for testing the trained model performance.
- What is statistical power?
Statistical power is the probability that a binary hypothesis test correctly rejects the null hypothesis when the alternative hypothesis is true.
- Differentiate between RNN and CNN.
| Recurrent Neural Network (RNN) | Convolutional Neural Network (CNN) |
| --- | --- |
| Used for sequential data. | Used for images and spatially distributed data. |
| Can handle variable-dimension inputs and outputs. | Requires fixed-size inputs and outputs. |
| Uses its own mechanism for internal memory. | Is a type of feed-forward neural network. |
| Used in time series and text classification. | Used in image processing. |
These are commonly asked Data Science Interview Questions and Answers.
- What is the major drawback of Naive Bayes? How can it be solved?
Naive Bayes assumes that the variables are not correlated with one another, which is rarely true in practice; this is its major drawback.
Decorrelating the features can resolve this issue so that the independence assumption holds more closely.
Conclusion
This article covers commonly asked questions in Data Science interviews, which are extremely important for acing any interview. We hope these questions and answers help you through your interview process. Apart from these questions and answers, if you are considering upskilling your Data Science knowledge, check out the Data Science Course in Chennai at FITA Academy, which provides extensive training on data science under expert mentorship.