Data Scientists who have done some initial research are expected to readily explain basic statistical terms (such as sampling, hypothesis testing, and variance vs. standard deviation). Similarly, you should understand the popular technical models and methodologies (such as Neural Networks, Bayesian Networks, supervised machine learning, unsupervised and reinforcement learning, and much more).

But ask most Data Science enthusiasts what data science actually is, and you will find the majority of them lost. Terms like Data Mining, Big Data, Data Science, and Data Engineering all seem synonymous. This is because we lack an understanding of what lies behind the emergence of these individual technologies. In future tutorials, we'll address the difference between these so-called synonymous terms. However, we'll restrict the scope of this tutorial.
This tutorial will help you identify the terminology and concepts that a professional Data Scientist must comprehend to crack an entry-level interview.
Some tips:
- Do not rush into answering the question. Rather, first, ensure that you understand the concepts that are relevant to the question. Only then try to answer the question.
- Even if you do not understand the question completely, make a habit of narrowing the answer choices, especially in multiple-choice questions, and then make an informed guess.
- This tutorial will guide you towards more realistic questions that are usually asked in interviews. You can maximize the output from this tutorial if you give us feedback on topics that you'd like to see.
1. What is the bias-variance trade-off?
Bias and variance are two sources of prediction error in a model. An oversimplified model leads to high bias, a measure of how far the model's predicted values are from the correct values.
A more complex model can lead to high variance, meaning the predictions vary more between models trained on different samples of the data. These two errors generally move in opposite directions: reducing one tends to increase the other.
The goal of any supervised machine learning algorithm is to have low bias and low variance to achieve strong prediction performance. Balancing the two is called the bias-variance trade-off.
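
To make the trade-off concrete, here is a small simulation sketch. Everything in it is an assumption chosen for illustration: the ground-truth function `true_f`, the noise level, and the two toy models (one that always predicts the average target, one that copies the nearest training point). It repeatedly draws training sets and measures the squared bias and the variance of each model's prediction at a single point.

```python
import random
import statistics

def true_f(x):
    return x * x  # assumed ground-truth function for the demo

def make_training_set(rng, n=20, noise=0.3):
    # Draw n noisy observations of true_f on [0, 1).
    xs = [rng.random() for _ in range(n)]
    ys = [true_f(x) + rng.gauss(0.0, noise) for x in xs]
    return xs, ys

def predict_mean(xs, ys, x0):
    # Oversimplified model: ignores x0 and always predicts the average target.
    return statistics.fmean(ys)

def predict_1nn(xs, ys, x0):
    # Flexible model: copies the target of the single nearest training point.
    nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x0))
    return ys[nearest]

def bias_sq_and_variance(predict, x0=0.9, trials=2000, seed=0):
    # Train on many independent samples and collect the prediction at x0.
    rng = random.Random(seed)
    preds = [predict(*make_training_set(rng), x0) for _ in range(trials)]
    bias_sq = (statistics.fmean(preds) - true_f(x0)) ** 2
    variance = statistics.pvariance(preds)
    return bias_sq, variance

mean_bias_sq, mean_var = bias_sq_and_variance(predict_mean)
nn_bias_sq, nn_var = bias_sq_and_variance(predict_1nn)
print(f"mean predictor: bias^2={mean_bias_sq:.4f}, variance={mean_var:.4f}")
print(f"1-NN predictor: bias^2={nn_bias_sq:.4f}, variance={nn_var:.4f}")
```

In this setup the constant predictor should show the larger squared bias and the nearest-neighbor predictor the larger variance, which is the pattern described above.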

2. How do we combat overfitting and underfitting?
To combat overfitting and underfitting, you can resample the data to estimate the model's accuracy (k-fold cross-validation) and hold out a validation dataset to evaluate the model.
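
As a rough sketch of how k-fold cross-validation divides the data, the helper below (the name `k_fold_indices` is hypothetical, not a library function) yields index lists for each train/validation split. In practice the data would usually be shuffled first, and a library implementation would normally be preferred.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k roughly equal contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(10, 3))
for train_idx, val_idx in folds:
    print("train:", train_idx, "validate:", val_idx)
```

Each sample lands in exactly one validation fold, so every observation is used for both training and evaluation across the k rounds.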

3. How would you effectively represent data with multiple dimensions?
We can use a method like a pair-wise correlation matrix or pair-wise scatter plot. Another common solution is to transform the data and then use traditional 2D and 3D data visualization techniques. There are 3 options to adjust the data:
- Feature selection
- Feature extraction
- Manifold learning
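
A pair-wise correlation matrix can be sketched in plain Python as follows. The column names and values are made up for illustration, and the helpers `pearson` and `correlation_matrix` are hypothetical names; a data-frame library would typically provide this directly.

```python
import math

def pearson(xs, ys):
    # Pearson correlation coefficient between two equally long columns.
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def correlation_matrix(columns):
    # Correlate every column with every other column (and with itself).
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Made-up example data with three dimensions.
data = {
    "height": [150, 160, 170, 180, 190],
    "weight": [52, 58, 69, 77, 88],
    "shoe": [44, 36, 41, 38, 43],
}
matrix = correlation_matrix(data)
for name, row in matrix.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```

Reading the matrix row by row shows which pairs of dimensions move together, which is often a first step before choosing features to keep or combine.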
4. How is k-NN different from k-means clustering?
K-NN | K-Means |
---|---|
Supervised learning algorithm | Unsupervised learning algorithm |
Used for classification | Used for clustering |
It takes a bunch of labeled points and uses them to learn how to label other points. | It takes a bunch of unlabeled points and tries to group them into “k” number of clusters. |
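
The difference is easiest to see side by side. The sketch below uses one-dimensional points and two hypothetical helpers, `knn_predict` and `k_means_1d`; the data, labels, and starting centroids are assumptions for illustration only. Note that only the k-NN function receives labels.

```python
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    # Supervised: vote among the labels of the k closest labeled points.
    nearest = sorted(range(len(train_points)),
                     key=lambda i: abs(train_points[i] - query))[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

def k_means_1d(points, centroids, iterations=10):
    # Unsupervised: alternate between assigning points to the nearest
    # centroid and moving each centroid to the mean of its cluster.
    centroids = list(centroids)
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda c: abs(p - centroids[c]))
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

points = [1.0, 1.5, 2.0, 8.0, 8.5, 9.0]
labels = ["low", "low", "low", "high", "high", "high"]

print(knn_predict(points, labels, query=7.0))   # classification of a new point
centroids, clusters = k_means_1d(points, centroids=[0.0, 10.0])
print(centroids, clusters)                      # grouping without any labels
```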
5. How would you validate a model you created to generate a predictive model of a quantitative outcome variable using multiple regression?
Multiple regression suits a quantitative outcome; to model a qualitative dependent variable such as a simple yes/no, we can use logistic regression instead.
The first step is to split the data in a ratio of 70-30 or 60-40. We keep the larger portion for building/training the model and use the rest to validate the trained model.
The final step is to review the residuals of your model on both sets of data and look at other parameters such as R squared and standard error.
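
The split-then-review workflow might look like the sketch below. To keep it short it fits a single predictor rather than a full multiple regression, and the data set is made up; `fit_line` and `r_squared` are hypothetical helper names.

```python
import random

def fit_line(xs, ys):
    # Ordinary least squares for one predictor: returns (slope, intercept).
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

def r_squared(xs, ys, slope, intercept):
    # 1 - (residual sum of squares / total sum of squares).
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    mean_y = sum(ys) / len(ys)
    ss_res = sum(r ** 2 for r in residuals)
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Made-up data: y is roughly 2x + 1 with small fixed disturbances.
noise = [0.1, -0.2, 0.05, 0.0, -0.1, 0.2, -0.05, 0.1, -0.15, 0.05]
rows = [(float(x), 2 * x + 1 + e) for x, e in zip(range(10), noise)]

# 70-30 split after a seeded shuffle.
random.Random(0).shuffle(rows)
cut = len(rows) * 7 // 10
train, test = rows[:cut], rows[cut:]

train_x, train_y = zip(*train)
test_x, test_y = zip(*test)
slope, intercept = fit_line(train_x, train_y)

r2_train = r_squared(train_x, train_y, slope, intercept)
r2_test = r_squared(test_x, test_y, slope, intercept)
print(f"R^2 on training data: {r2_train:.3f}")
print(f"R^2 on held-out data: {r2_test:.3f}")
```

A large gap between the two R squared values would suggest the model has fit the training data more closely than it generalizes.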
Try this exercise. Fill in the missing part by typing it in.
A low p-value __ indicates strong evidence against the null hypothesis, which means we can reject the null hypothesis.

Try this exercise. Fill in the missing part by typing it in.
Differencing a time series is the best way to remove __ from it.

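
Once you have answered, the effect of differencing can be checked with a short sketch. The helper `difference` and both series are assumptions made up for illustration: the first series has a pure linear trend, the second a pure repeating pattern of period 4.

```python
def difference(series, lag=1):
    # Subtract from each value the value observed `lag` steps earlier.
    return [series[i] - series[i - lag] for i in range(lag, len(series))]

# A series with a linear trend: first differences are constant.
trend = [3 * t + 10 for t in range(8)]
trend_removed = difference(trend)
print(trend_removed)

# A series that repeats every 4 steps: lag-4 differences are all zero.
pattern = [5, 9, 2, 6]
seasonal = [pattern[t % 4] for t in range(12)]
season_removed = difference(seasonal, lag=4)
print(season_removed)
```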
Let's test your knowledge. Fill in the missing part by typing it in.
Sensitivity and Specificity are ____ to each other.
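
After answering, you can see how the two measures respond to the decision threshold with the sketch below. The scores and labels are made up, and `sensitivity_specificity` is a hypothetical helper name.

```python
def sensitivity_specificity(scores, labels, threshold):
    # Predict positive when the score reaches the threshold.
    predictions = [score >= threshold for score in scores]
    tp = sum(p and y for p, y in zip(predictions, labels))
    tn = sum((not p) and (not y) for p, y in zip(predictions, labels))
    fp = sum(p and (not y) for p, y in zip(predictions, labels))
    fn = sum((not p) and y for p, y in zip(predictions, labels))
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return sensitivity, specificity

# Made-up classifier scores with their true labels.
scores = [0.1, 0.3, 0.35, 0.4, 0.6, 0.65, 0.8, 0.9]
labels = [False, False, True, False, True, False, True, True]

for threshold in (0.2, 0.5, 0.7):
    sens, spec = sensitivity_specificity(scores, labels, threshold)
    print(f"threshold={threshold}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

On this toy data, raising the threshold lowers sensitivity while specificity rises.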
Try this exercise. Click the correct answer from the options.
A classifier that predicts if an image contains only a cat, a dog, or a llama produced the following confusion matrix:

What is the accuracy of the model, in percentage?
- 35%
- 60%
- 75%
- 80%
Try this exercise. Fill in the missing part by typing it in.
An excellent model has an AUC near _, which means it has a good measure of separability.
Try this exercise. Is this statement true or false?
Predicted labels usually match with all of the observed labels in real-world scenarios.
Try this exercise. Is this statement true or false?
ROC is a probability curve.

Let's test your knowledge. Is this statement true or false?
We need to define an index in Pandas.
Are you sure you're getting this? Click the correct answer from the options.
You are carrying out primary market research to understand how many days in a week people generally do physical activities. You asked 3 of your friends. The answers you got were 2, 3, and 5. Based on this, which of the following is correct about the results you found?
- The mean of the sample is 3.33 and the variance is 1.53
- The mean of the population is 3.33 and variance is -2.33
- The mean of the population is 3.33 and variance is -1.53
- The mean of the sample is 3.33 and the variance is 2.33
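
Once you have picked an option, the arithmetic can be checked with Python's standard `statistics` module. The sketch below treats the three answers as a sample, so `variance` divides by n - 1, while `pvariance` treats them as a whole population and divides by n.

```python
import statistics

answers = [2, 3, 5]

mean = statistics.mean(answers)
sample_variance = statistics.variance(answers)        # divides by n - 1
population_variance = statistics.pvariance(answers)   # divides by n
sample_std_dev = statistics.stdev(answers)            # square root of the sample variance

print(round(mean, 2), round(sample_variance, 2))
print(round(population_variance, 2), round(sample_std_dev, 2))
```

A variance is a sum of squares divided by a positive count, so it can never be negative, which rules out two of the options immediately.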
Build your intuition. Click the correct answer from the options.
In simple linear regression, the goodness of fit...
- Indicates the spread of data on which analysis is run
- Is an index of how closely the analysis reaches statistical significance
- Represents how close the predicted findings are to actual findings
- None
Let's test your knowledge. Click the correct answer from the options.
The mean of a Poisson distribution equals the mean of an exponential distribution only when the mean of the Poisson distribution equals...
- Mean of both can never be equal
- 0.5
- 1
- 2
Build your intuition. Is this statement true or false?
K-means clustering is a useful technique to reduce the number of dimensions from 5 to 3.
Let's test your knowledge. Is this statement true or false?
A data scientist of high caliber should have competency in mathematics, stats, and computer programming.
Are you sure you're getting this? Is this statement true or false?
In Python, the `in` operator returns a boolean value.
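
A quick way to check this statement yourself is to try membership tests on a few built-in containers. The variable names below are made up for illustration.

```python
days = [2, 3, 5]
config = {"k": 3}

found = 3 in days          # membership test on a list
missing = 4 in days
has_key = "k" in config    # on a dict, membership checks the keys
has_substring = "sci" in "data science"  # on a string, it checks substrings

print(found, missing, has_key, has_substring)
print(type(found))
```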
Let's test your knowledge. Click the correct answer from the options.
Assuming all dice are fair, which of the following events is most likely?

- At least one 6, when 6 dice are rolled
- At least two 6's, when 12 dice are rolled
- At least three 6's, when 18 dice are rolled
- All the above have the same probability
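
After choosing an answer, the three probabilities can be computed from the binomial distribution: the chance of at least m sixes is one minus the chance of fewer than m. The helper name `prob_at_least` is hypothetical.

```python
from math import comb

def prob_at_least(successes, dice, p=1 / 6):
    # 1 - P(fewer than `successes` sixes) under a binomial model.
    fewer = sum(comb(dice, k) * p ** k * (1 - p) ** (dice - k)
                for k in range(successes))
    return 1 - fewer

one_in_six = prob_at_least(1, 6)
two_in_twelve = prob_at_least(2, 12)
three_in_eighteen = prob_at_least(3, 18)

print(f"{one_in_six:.4f} {two_in_twelve:.4f} {three_in_eighteen:.4f}")
```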
Build your intuition. Is this statement true or false?
In machine learning, predictive modeling works on labels.
Let's test your knowledge. Is this statement true or false?
The following slicing technique is right:
tuple[-4]
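
To see what a negative index returns, here is a minimal sketch with a made-up tuple named `values` (a hypothetical name, used so the built-in `tuple` type is not shadowed). It also shows the related slice for comparison.

```python
values = (10, 20, 30, 40, 50, 60)

fourth_from_end = values[-4]   # a single item, counted from the end
last_four = values[-4:]        # a slice: a new tuple with the last four items

print(fourth_from_end)
print(last_four)
```

A single index such as `[-4]` returns one element, while the slice `[-4:]` returns a tuple.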
One Pager Cheat Sheet
- This tutorial will provide you with the necessary knowledge to answer entry-level data science interview questions, as well as tips on understanding concepts, narrowing answer choices and getting feedback when needed.
- A p-value of ≤ 0.05 indicates that results at least as extreme as those observed would be highly unlikely if the null hypothesis were true, allowing the null hypothesis to be rejected in favor of the alternative hypothesis.
- Differencing a time series is an effective way to remove its trend; differencing at the seasonal lag removes seasonality in the same way.
- The performance of a binary classification model is measured by two measures, Sensitivity (true positive rate) and Specificity (true negative rate), which are inversely related: as the decision threshold moves, raising one typically lowers the other.
- The accuracy of this model can be calculated by dividing the number of correct predictions (the diagonal of the confusion matrix) by the total number of observations (15), resulting in an accuracy of 75%.
- The model's accuracy was calculated as 63.8% and its Area Under the Curve (AUC) was used to measure the model's ability to discriminate between two classes, with a value of 1.0 indicating a perfect classifier.
- Model predictions often differ from the actual results seen in a real-world environment, because overfitting or underfitting can lead to inaccurate predictions.
- The ROC plot displays the trade-off between the true positive rate (TPR) and false positive rate (FPR) of a binary classifier system.
- It is not necessary to manually create the index, an immutable array of labels, when creating a DataFrame in Pandas since a default one is automatically generated.
- The mean of the responses from your three friends is 3.33, calculated by adding the responses and dividing by the total number, and the resulting variance of the sample is 2.33.
- The goodness of fit in simple linear regression measures how closely the observed data points match the model's predicted values, with a value of 1 indicating a perfect fit.
- The means of a Poisson and an exponential distribution that share the same rate parameter λ are λ and 1/λ respectively, so they are equal only when λ = 1.
- K-means clustering is a machine learning algorithm used for clustering data points into groups, but it does not actually reduce the number of dimensions.
- Data scientists play an essential role in managing and analyzing data, requiring a deep understanding of mathematics, statistics, and computer programming to effectively develop statistical models, numerical simulations, data extraction processes, and visualizations to help explain the data.
- The `in` operator returns a boolean value, True or False, which lets data scientists check for and act on the presence of certain elements in a data set.
- With fair dice, each die shows a 6 with probability 1/6, but the three events are not equally likely: at least one 6 in 6 rolls (about 0.665) is more likely than at least two 6's in 12 rolls (about 0.619) or at least three 6's in 18 rolls (about 0.597).
- Predictive modeling in Machine Learning uses features to make predictions, as opposed to labels, which are the actual classifications or outcomes being predicted.
- Using a negative index, the fourth item from the end of a tuple in Python can be returned.