Let’s see how we can use univariate analysis to detect outliers.
- Univariate analysis is the simplest form of data analysis. “Uni” means “one”, so in other words, the analysis looks at only one variable at a time.
- It doesn’t deal with causes or relationships (unlike regression). Its major purpose is to describe: it takes data, summarizes that data, and finds patterns in the data.
- A variable in univariate analysis is just a condition or subset that your data falls into. You can think of it as a “category.” For example, the analysis might look at a variable of “age”, or it might look at “height” or “weight”.
Univariate analysis can be done for two kinds of variables: categorical and numerical.
There are three common ways to perform univariate analysis. The following examples show how to perform each type of univariate analysis using the Household Size variable from our dataset.
1. Summary Statistics: The most common way to perform univariate analysis is to describe a variable using summary statistics. There are two popular types of summary statistics:
- Measures of central tendency: these numbers describe where the center of a dataset is located. Examples include the mean and the median.
We can calculate the following measures of central tendency for Household Size:
- Mean (the average value): 3.8
- Median (the middle value): 4
These values give us an idea of where the “center” value is located.
- Measures of dispersion: these numbers describe how spread out the values are in the dataset. Examples include the range, interquartile range, standard deviation, and variance.
We can also calculate the following measures of dispersion:
- Range (the difference between the max and min): 6
- Interquartile Range (the spread of the middle 50% of values): 2.5
- Standard Deviation (an average measure of spread): 1.87
These values give us an idea of how spread out the values are for this variable.
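The summary statistics above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library. The `household_sizes` list is hypothetical data made up for the example (it is not the dataset used in this article, so the numbers differ), and the "inclusive" quartile method is an assumption, since different tools use slightly different quartile conventions.

```python
import statistics

# Hypothetical household sizes, invented for illustration only.
household_sizes = [1, 2, 3, 3, 4, 4, 4, 5, 6, 7]

# Measures of central tendency
mean_size = statistics.mean(household_sizes)
median_size = statistics.median(household_sizes)

# Measures of dispersion
value_range = max(household_sizes) - min(household_sizes)
# quantiles() with n=4 returns the three quartile cut points Q1, Q2, Q3.
q1, q2, q3 = statistics.quantiles(household_sizes, n=4, method="inclusive")
iqr = q3 - q1
std_dev = statistics.stdev(household_sizes)      # sample standard deviation
variance = statistics.variance(household_sizes)  # sample variance

print("Mean:", mean_size)
print("Median:", median_size)
print("Range:", value_range)
print("IQR:", iqr)
print("Standard deviation:", round(std_dev, 2))
print("Variance:", round(variance, 2))
```

`statistics.stdev` and `statistics.variance` compute the sample versions; `statistics.pstdev` and `statistics.pvariance` would be the choice if the data were treated as the whole population.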
2. Frequency Distributions: Another way to perform univariate analysis is to create a frequency distribution, which describes how often different values occur in a dataset.
Mode: The frequency distribution allows us to quickly see that the most frequent household size is 4.
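A frequency distribution can be built by counting how often each value occurs. The sketch below uses `collections.Counter` from the standard library on hypothetical data (the `household_sizes` list is made up for illustration and is not this article's dataset).

```python
from collections import Counter
import statistics

# Hypothetical household sizes, invented for illustration only.
household_sizes = [1, 2, 3, 3, 4, 4, 4, 5, 6, 7]

# Count how many times each household size occurs.
frequency = Counter(household_sizes)

print("Size | Frequency")
for size in sorted(frequency):
    print(f"{size:>4} | {frequency[size]}")

# The mode is the value with the highest frequency.
mode_size = statistics.mode(household_sizes)
print("Mode:", mode_size)
```

In this made-up sample the value 4 occurs three times, more than any other value, so it is the mode.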
3. Charts: Yet another way to perform univariate analysis is to create charts to visualize the distribution of values for a certain variable. Common examples include:
a. Histogram: A histogram is a type of chart that uses vertical bars to display frequencies. This type of chart is a useful way to visualize the distribution of values in a dataset.
On the left we have a histogram for the variable Household Size from our data; it has no outliers.
On the right is a histogram that shows how the distribution looks when the data contains an outlier. Outliers are often easy to spot in histograms. For example, the point on the far left of the right-hand histogram is an outlier.
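To show why an outlier stands out in a histogram, here is a rough text-based sketch using only the standard library. The `values` list, the `bin_width`, and the helper `binned_counts` are all hypothetical and chosen for illustration; in practice a plotting library would normally draw the chart.

```python
from collections import Counter

def binned_counts(values, bin_width):
    """Count integer values in fixed-width bins, keyed by each bin's start."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    start, stop = min(counts), max(counts)
    # Keep empty bins so that gaps in the data stay visible.
    return {b: counts.get(b, 0) for b in range(start, stop + bin_width, bin_width)}

# Hypothetical values: most lie between 40 and 59, one lies far to the left.
values = [2, 41, 43, 45, 47, 48, 50, 52, 53, 55, 58]
bin_width = 10

bins = binned_counts(values, bin_width)
for bin_start, count in bins.items():
    print(f"{bin_start:>3}-{bin_start + bin_width - 1:<3} | {'#' * count}")
```

The printed rows show one short bar at the far left, several empty bins, and then the main cluster. That isolated bar separated by a gap is the visual signature of an outlier.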
b. Box Plots: A box plot is a plot that shows the five-number summary of a dataset.
Let n be the number of data values in the data set. The five-number summary includes:
- The minimum value: The far left of the chart is the minimum (the smallest number in the set).
- The first quartile: The lower quartile (Q1) is the median of the lower half of the data set.
- The median value: The median (Q2) is the middle value of the data set.
- The third quartile: The upper quartile (Q3) is the median of the upper half of the data set.
- The maximum value: The far right of the chart is the maximum (the largest number in the set).
- The Interquartile range (IQR) is the spread of the middle 50% of the data values.
Interquartile Range (IQR) = Upper Quartile (Q3) – Lower Quartile (Q1)
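The lower and upper limits for outliers are commonly placed 1.5 × IQR beyond the quartiles:
Lower Limit = Lower Quartile (Q1) – 1.5 × IQR
Upper Limit = Upper Quartile (Q3) + 1.5 × IQR
So any value that is greater than the upper limit or less than the lower limit is treated as an outlier. Only the data that lies within the lower and upper limits is considered typical of the distribution.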
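The box plot rule can be sketched in code. This is a minimal illustration, assuming the limits follow the common 1.5 × IQR convention (Tukey's fences) and the "inclusive" quartile method. The function name `iqr_outliers`, the multiplier parameter `k`, and the sample data are hypothetical.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return (lower_limit, upper_limit, outliers) using the IQR rule.

    k is the multiplier applied to the IQR; 1.5 is a common convention,
    assumed here for illustration.
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_limit = q1 - k * iqr
    upper_limit = q3 + k * iqr
    outliers = [v for v in values if v < lower_limit or v > upper_limit]
    return lower_limit, upper_limit, outliers

# Hypothetical household sizes with one unusually large value.
household_sizes = [1, 2, 3, 3, 4, 4, 4, 5, 6, 15]

lower_limit, upper_limit, outliers = iqr_outliers(household_sizes)
print("Lower limit:", lower_limit)
print("Upper limit:", upper_limit)
print("Outliers:", outliers)
```

For this made-up sample, Q1 is 3.0 and Q3 is 4.75, so the IQR is 1.75 and the limits are 0.375 and 7.375. The value 15 lies above the upper limit and is flagged as an outlier. A different quartile convention would shift the limits slightly.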