Over the last few months, I have probably run the t-test dozens of times, but I recently realized that I did not fully understand some concepts, such as why it is not possible to accept the null hypothesis or where the numbers in the t-tables come from. After doing some research, I found that several articles answer these questions, but few gather all of the information in one place.
Therefore, I decided to write this article to explain the t-test step-by-step so anyone can use it as a reference whenever they have to run the test or review the concepts. Depending on your level, I recommend:
- For beginners: Read the whole article carefully.
- For experts: Read section 3 (Types of t-test), section 4 (What are the t-scores?) and section 6 (Multiple comparison problem).
1. What is a t-test?
Imagine you are running an experiment where you want to compare two groups and quantify the difference between them. For example:
- Compare if the people of one country are taller than people of another one.
- Compare if the brain of a person is more activated while watching happy movies than sad movies.
This comparison can be analyzed by conducting different statistical analyses, such as the t-test, which is the one described in this article.
So, what is a t-test? It is a type of inferential statistic used to study whether there is a statistically significant difference between the means of two groups. Mathematically, it sets up the problem by assuming that the means of the two distributions are equal (H₀: µ₁=µ₂). If the t-test rejects this null hypothesis, the observed data would be unlikely under equal means, which suggests that the groups are different.
The t-test is particularly useful when the groups are small (for example, 20–30 samples each) and the population standard deviation is unknown. Other tests cover other situations: the Z-test for large samples with a known variance, the F-test (ANOVA) for more than two groups, and the chi-square test for categorical data.
Important: The t-test rejects or fails to reject the null hypothesis, never accepts it.
2. What are the p-value and the critical value?
The p-value and critical value are defined in Wikipedia as:
The p-value or probability value is the probability of obtaining test results at least as extreme as the results actually observed during the test, assuming that the null hypothesis is correct.
The critical values of a statistical test are the boundaries of the acceptance region of the test.
The p-value is the quantity that allows us to reject the null hypothesis (H₀: µ₁=µ₂) or, in other words, to conclude that the two groups are different. However, since the p-value is just a number, we need to compare it with a threshold chosen in advance, the significance level (⍺). The critical value is the value of the test statistic that corresponds to ⍺, so comparing the p-value with ⍺ is equivalent to comparing the t-value with the critical value:
- p_value > ⍺: Fail to reject the null hypothesis of the statistical test.
- p_value ≤ ⍺: Reject the null hypothesis of the statistical test.
The significance level that most statisticians choose is ⍺ = 0.05. This 0.05 means that, if the null hypothesis is true and we run the experiment 100 times, we expect to reject it wrongly about 5 times and to fail to reject it about 95 times.
In some cases, statisticians choose ⍺ = 0.01. Reducing the significance level from 0.05 to 0.01 decreases the chance of a false positive (called a Type I error), but it also makes it more difficult to reject the null hypothesis. Therefore, with ⍺ = 0.01, the results are more trustworthy but also more difficult to obtain. A common rule of thumb for reading the p-value as evidence against the null hypothesis is:
- p_value > 0.1: No evidence
- p_value between 0.05 and 0.1: Weak evidence
- p_value between 0.01 and 0.05: Evidence
- p_value between 0.001 and 0.01: Strong evidence
- p_value < 0.001: Very strong evidence
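The decision rule and the evidence scale above can be sketched as a small helper. This is only an illustration: the function name `interpret_p_value` is hypothetical, and how the exact boundary values (0.001, 0.01, 0.05, 0.1) are assigned is my assumption, since the list does not say which side they belong to.

```python
def interpret_p_value(p_value, alpha=0.05):
    """Return (decision, evidence label) for a p-value.

    The decision compares the p-value with the significance level alpha.
    The evidence label follows the rule-of-thumb scale from the list above;
    the handling of the exact boundary values is an assumption.
    """
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must be between 0 and 1")

    decision = "reject H0" if p_value <= alpha else "fail to reject H0"

    if p_value < 0.001:
        evidence = "very strong evidence"
    elif p_value < 0.01:
        evidence = "strong evidence"
    elif p_value < 0.05:
        evidence = "evidence"
    elif p_value <= 0.1:
        evidence = "weak evidence"
    else:
        evidence = "no evidence"
    return decision, evidence


for p in (0.15, 0.03, 0.0005):
    print(p, interpret_p_value(p), interpret_p_value(p, alpha=0.01))
```

Note that the decision depends on the chosen ⍺, while the evidence label does not: p = 0.03 rejects H₀ at ⍺ = 0.05 but not at ⍺ = 0.01.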
Important: It is always necessary to report the p-value and the significance level (⍺) used.
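The statistical test can be one-tailed or two-tailed. The difference lies in the alternative hypothesis: the two-tailed test asks whether the means differ in either direction (H₁: µ₁≠µ₂), while the one-tailed test asks about one direction only (H₁: µ₁>µ₂ or H₁: µ₁<µ₂).
The one-tailed test is appropriate when we are only interested in a difference between groups in a specific direction. It is less common than the two-tailed test, so the rest of the article focuses on the two-tailed test.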
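To make the difference concrete, here is a minimal sketch that runs both versions on the same data with SciPy. The data are made up (random samples with a fixed seed), and the `alternative` argument of `scipy.stats.ttest_ind` is only available in SciPy 1.6 or newer.

```python
import numpy as np
from scipy import stats

# Hypothetical data: two groups of 30 samples, drawn with a fixed seed.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.5, scale=1.0, size=30)
group_b = rng.normal(loc=0.0, scale=1.0, size=30)

# Two-tailed: the alternative hypothesis is mean(a) != mean(b).
two_tailed = stats.ttest_ind(group_a, group_b, equal_var=False,
                             alternative="two-sided")

# One-tailed: the alternative hypothesis is mean(a) > mean(b).
one_tailed = stats.ttest_ind(group_a, group_b, equal_var=False,
                             alternative="greater")

print(f"t-value      = {two_tailed.statistic:.3f}")
print(f"two-tailed p = {two_tailed.pvalue:.4f}")
print(f"one-tailed p = {one_tailed.pvalue:.4f}")
```

The t-value is the same in both cases; only the p-value changes. When the observed difference goes in the hypothesized direction, the one-tailed p-value is half of the two-tailed one.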
3. Types of t-test
Depending on the assumptions of your distributions, there are different types of statistical tests.
The assumptions that you have to analyze when deciding the kind of test you have to implement are:
- Paired or unpaired: The data of both groups come from the same participants or not.
- Parametric or non-parametric: The data are assumed to follow a specific distribution (for example, the normal distribution) or not.
There are three types of t-tests:
- One sample t-test (Not displayed in the figure)
- Unpaired two-sample t-test (Displayed in the figure)
- Paired sample t-test (Displayed in the figure)
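As a rough sketch, each of the three types maps to one SciPy function. The data below are invented for illustration (a "before" and "after" measurement of the same 25 participants, plus an unrelated group of 28), and the population mean of 10.0 is an assumed reference value.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements, generated with a fixed seed.
rng = np.random.default_rng(1)
before = rng.normal(loc=10.0, scale=2.0, size=25)
after = before + rng.normal(loc=0.8, scale=1.0, size=25)  # same participants
other_group = rng.normal(loc=10.5, scale=2.0, size=28)    # different participants

# One sample t-test: compare one group against a reference mean.
one_sample = stats.ttest_1samp(before, popmean=10.0)

# Unpaired two-sample t-test (Welch version, unequal variances allowed).
unpaired = stats.ttest_ind(before, other_group, equal_var=False)

# Paired sample t-test: both samples come from the same participants.
paired = stats.ttest_rel(before, after)

for name, result in (("one sample", one_sample),
                     ("unpaired", unpaired),
                     ("paired", paired)):
    print(f"{name:>10}: t = {result.statistic:.3f}, p = {result.pvalue:.4f}")
```

The paired test is equivalent to a one sample t-test on the per-participant differences, which is why it needs both samples to have the same length and order.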
As mentioned, what distinguishes these t-tests from the other tests are the assumptions about our experiment:
- The data has to follow a continuous or ordinal scale.
- The data has to be randomly selected.
- The data should be normally distributed.
In case you are not sure which test to implement, I recommend checking the web page Laerd Statistics. Then, in case you are interested, Ref  contains the flowchart when the number of groups being compared is higher than three. Lastly, ref  is a nice tutor’s guide to learning more about commonly used statistical tests.
4. What are the t-scores?
A t-score is one form of a standardized test statistic. The t-score formula transforms the difference between the groups into a standardized form, which we can then compare against a reference distribution.
The t-score formula for the Welch t-test is:
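In its standard textbook form, the Welch statistic can be written as follows. The symbols are my notation, not taken from the article: x̄ᵢ is the sample mean, sᵢ² the sample variance and nᵢ the number of samples of group i.

```latex
\[
t = \frac{\bar{x}_1 - \bar{x}_2}
         {\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}
\]
```

The numerator is the observed difference between the means, and the denominator is its estimated standard error, so t measures the difference in units of standard errors.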
Once we have the t-value, we have to look at the t-tables. If the absolute value of our t-value is higher than the value in the tables, we can reject the null hypothesis.
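The t-value itself is easy to compute by hand. The sketch below assumes the usual Welch definition (difference of the sample means divided by the square root of s₁²/n₁ + s₂²/n₂) and checks it against `scipy.stats.ttest_ind` with `equal_var=False`. The helper name `welch_t` and the data are hypothetical.

```python
import numpy as np
from scipy import stats


def welch_t(x1, x2):
    """Welch t-value: difference of means over its estimated standard error."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    var1 = x1.var(ddof=1)  # unbiased sample variance of group 1
    var2 = x2.var(ddof=1)  # unbiased sample variance of group 2
    standard_error = np.sqrt(var1 / x1.size + var2 / x2.size)
    return (x1.mean() - x2.mean()) / standard_error


# Hypothetical data with different means and standard deviations.
rng = np.random.default_rng(2)
x1 = rng.normal(loc=0.0, scale=1.0, size=30)
x2 = rng.normal(loc=0.3, scale=1.5, size=30)

t_manual = welch_t(x1, x2)
t_scipy = stats.ttest_ind(x1, x2, equal_var=False).statistic
print(f"manual t = {t_manual:.4f}, scipy t = {t_scipy:.4f}")
```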
The parameters to look at the table are:
- The cumulative probability or the probability that the value of a random variable falls within a specified range.
- One-tail or two-tail, depending on the statistical analysis that you are running.
- The number of degrees of freedom, which refers to the maximum number of logically independent values in the data sample. A simple, conservative choice for looking up the t-value is the smaller of n₁–1 and n₂–1. The Welch t-test itself uses the Welch–Satterthwaite approximation, which is never smaller than this value.
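Instead of reading a printed table, the same lookup can be done with the inverse cumulative distribution function of the t-distribution, `scipy.stats.t.ppf`. The helper `critical_t` below is a hypothetical name, and the degrees of freedom follow the "smaller of n₁–1 and n₂–1" rule from the list above.

```python
from scipy import stats


def critical_t(alpha, df, two_tailed=True):
    """Value from the t-table for a significance level and degrees of freedom."""
    # A two-tailed test splits alpha between both tails of the distribution.
    tail_probability = alpha / 2 if two_tailed else alpha
    return stats.t.ppf(1 - tail_probability, df)


n1, n2 = 30, 30
df = min(n1 - 1, n2 - 1)  # conservative degrees of freedom: 29

print(round(critical_t(0.05, df), 3))                    # 2.045
print(round(critical_t(0.05, df, two_tailed=False), 3))  # 1.699
```

If the absolute value of the t-value is higher than the two-tailed value, the null hypothesis is rejected at ⍺ = 0.05.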
But, what do the numbers mean? The numbers indicate the distribution of observed t-values when the null hypothesis is true.
To explain this in more detail, I found an interesting blog whose explanation I tried to replicate using Python code. Here is the histogram of the t-values obtained over 100,000 iterations, each comparing two random normal samples of 30 values with the same mean and the same standard deviation.
The value in the t-table for two groups of 30 samples (29 degrees of freedom), two-tail and ⍺ of 0.05 is 2.045. The fraction of simulated t-values above 2.045 or below −2.045, since we are doing a two-tailed test, is ≅5%. This number matches the significance level selected.
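The experiment can be reproduced along these lines. This is my own sketch rather than the original code: the seed, the standard normal parameters and the use of the Welch version of the test are assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_iterations = 100_000
n_samples = 30

# Each row is one experiment: two samples drawn from the SAME normal
# distribution, so the null hypothesis is true in every iteration.
a = rng.normal(loc=0.0, scale=1.0, size=(n_iterations, n_samples))
b = rng.normal(loc=0.0, scale=1.0, size=(n_iterations, n_samples))

# One t-value per row (axis=1 runs the test independently for each row).
t_values = stats.ttest_ind(a, b, axis=1, equal_var=False).statistic

# Two-tailed table value for alpha = 0.05 and 29 degrees of freedom.
critical = stats.t.ppf(0.975, df=n_samples - 1)
fraction_beyond = np.mean(np.abs(t_values) > critical)

print(f"table value            = {critical:.3f}")
print(f"fraction beyond +/- it = {fraction_beyond:.4f}")
```

The fraction should come out close to, and slightly below, 0.05, because looking up the table with 29 degrees of freedom is conservative for this setup. Plotting `t_values` with a histogram gives the bell-shaped distribution that the t-table summarizes.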
Lastly, everything explained above can be run with a few lines of Python. Here is the output of the statistical analysis of three normal distributions.
- X1 and X2: p_value = 0.15
- X1 and X3: p_value < 1.00e-04
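A comparable analysis could look like the sketch below. The parameters of the three distributions are assumptions (X1 and X2 share the same mean, X3 has a clearly different one), so the exact p-values will differ from the ones listed above.

```python
import numpy as np
from scipy import stats

# Assumed setup: three normal samples of 30 values each.
rng = np.random.default_rng(7)
x1 = rng.normal(loc=0.0, scale=1.0, size=30)
x2 = rng.normal(loc=0.0, scale=1.0, size=30)  # same mean as X1
x3 = rng.normal(loc=2.0, scale=1.0, size=30)  # shifted mean

alpha = 0.05
p_values = {}
for name, other in (("X2", x2), ("X3", x3)):
    # Welch two-sample t-test, two-tailed by default.
    result = stats.ttest_ind(x1, other, equal_var=False)
    p_values[name] = result.pvalue
    decision = "reject H0" if result.pvalue <= alpha else "fail to reject H0"
    print(f"X1 and {name}: t = {result.statistic:.2f}, "
          f"p_value = {result.pvalue:.2e} -> {decision}")
```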
6. Multiple comparison problem
After reading this article, you may be wondering what happens when we run several tests in the same experiment because, with enough tests, we will eventually reject a null hypothesis by chance even if the groups are similar. This is known as the multiple comparison problem, and it has also been well studied. In case you are interested, I wrote about this problem in this other article.
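To get a feeling for the size of the problem, the chance of at least one false positive can be computed directly. The sketch assumes that the tests are independent and that every null hypothesis is true. The function name `family_wise_error_rate` is hypothetical.

```python
def family_wise_error_rate(alpha, n_tests):
    """Probability of at least one false positive across independent tests."""
    # Each test avoids a false positive with probability (1 - alpha).
    return 1.0 - (1.0 - alpha) ** n_tests


for n_tests in (1, 5, 20, 100):
    rate = family_wise_error_rate(0.05, n_tests)
    print(f"{n_tests:>3} tests: {rate:.3f}")
```

With ⍺ = 0.05, running 20 independent tests already gives about a 64% chance of at least one false positive.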
Stats Direct, p_value.
Institute of Digital Research and Education, What are the differences between one-tailed and two-tailed tests?
Liz Thiele, Two sample t-test for Means.
Stack Exchange, Why does the normalized z-score introduce a square root?
Bozeman Science, Student’s t-test, YouTube.
Will Koehrsen, Statistical Significance Explained, Medium.
The post was originally published on Medium.