In this part, we assess normality in real data both visually (histograms, Q-Q plots) and theoretically (Law of Large Numbers and Central Limit Theorem). Approximate normality is the bridge between descriptive analysis and statistical inference.
Note
Objectives of this post
Identify variables that approximately follow the normal distribution.
Recognize that normality is a key assumption for many statistical methods.
Use plots (histograms, Q-Q plots) to assess the normality of data.
Interpret the results of normality analysis in a practical and applied way.
1.1 Complements: Understanding Normality in the Real World
In this part, based on Levine et al., Statistics for Managers Using Microsoft Excel, we explore when and how the normal distribution can be used as a valid approximation in real-world contexts.
Objectives:
Understand under which conditions variables can be treated as approximately normal.
Recognize the importance of normality for methods of statistical inference.
Use practical and graphical criteria to assess the normality of data.
Let's deepen our understanding!
1.2 What Is an Approximately Normal Distribution?
A variable is approximately normally distributed when, even without following the normal curve exactly, it shares enough of the curve's characteristics for statistical methods based on normality to be applied.
Main characteristics:
Bell-shaped form and symmetry around the mean.
Higher concentration of observations close to the mean, with few extreme occurrences.
Most values lie within 1 to 2 standard deviations of the mean (for an exact normal curve, about 68% within 1 and about 95% within 2).
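As a quick check of the last characteristic, the share of a normal distribution within 1 and 2 standard deviations of the mean can be computed with pnorm() and compared with a simulated sample. This is a minimal sketch: the helper name within_k_sd and the sample parameters (mean 170, sd 8, 10,000 draws) are illustrative assumptions, not values taken from Levine et al.

```r
# Share of a normal distribution lying within k standard deviations of the mean.
# within_k_sd is a helper defined only for this illustration.
within_k_sd <- function(k) pnorm(k) - pnorm(-k)

p1 <- within_k_sd(1)  # theoretical share within 1 sd
p2 <- within_k_sd(2)  # theoretical share within 2 sd

# Empirical check on a simulated sample (illustrative parameters).
set.seed(1)
x <- rnorm(10000, mean = 170, sd = 8)
emp1 <- mean(abs(x - mean(x)) <= 1 * sd(x))
emp2 <- mean(abs(x - mean(x)) <= 2 * sd(x))

print(round(c(theory_1sd = p1, theory_2sd = p2,
              sample_1sd = emp1, sample_2sd = emp2), 3))
```

If a real variable shows shares far from these theoretical values, that is a first hint that the normal approximation may be poor.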
Important notes:
Not every variable needs to be perfectly normal for us to apply statistical tests.
Small asymmetries or irregularities are usually tolerable.
Many real-world distributions are not exactly normal, but rather approximately normal.
Typical examples:
Adult height.
Service times in operations.
Industrial processes under statistical control.
Examples of Distributions: Normal and Non-Normal
Approximately normal variables:
Adult height within the same population.
Service time in standardized operations.
Measurement errors under controlled conditions.
Non-normally distributed variables:
Household income (right-skewed, i.e., positive skewness).
Number of children per family (discrete, skewed).
Lifetime of electronic equipment (long right tail).
Note: Some variables can approach normality after transformations, such as the logarithm or square root.
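To see the effect of a transformation, the sketch below simulates hypothetical right-skewed "income" data from a log-normal distribution and compares the skewness before and after taking logarithms. The data, the parameters, and the small skewness() helper are all assumptions made for this illustration; the helper is defined here and does not come from a package.

```r
# Moment-based sample skewness; a small helper written for this example.
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

set.seed(42)
# Hypothetical right-skewed "income" data (log-normal, illustrative parameters).
income <- rlnorm(1000, meanlog = 10, sdlog = 0.8)

skew_raw <- skewness(income)       # positive: long right tail
skew_log <- skewness(log(income))  # near zero after the log transform

print(round(c(raw = skew_raw, log = skew_log), 2))
```

A skewness near zero after the transformation suggests that normality-based methods are more reasonable on the log scale than on the original scale.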
Real-World Examples of Approximate Normality
Examples of variables with approximately normal distribution:
Heights of university students.
Phone service times in standardized call centers.
Weights of newborns in hospitals.
Scores on standardized tests (e.g., proficiency exams).
Measurement errors in controlled physics experiments.
Retirement ages in large populations.
Important: Even if the actual distribution is not perfectly normal, a normal approximation is often sufficient for practical applications and statistical inference.
1.3 What Is a Q-Q Plot?
The Q-Q Plot (Quantile-Quantile Plot) is a chart used to compare the distribution of sample data with a theoretical distribution, usually the normal.
Objectives:
Assess whether the data approximately follow a normal distribution.
Identify relevant deviations, such as skewness or heavy tails.
How to interpret:
If the points align close to a diagonal straight line, the data are approximately normal.
Systematic deviations (curvature or tail departures) indicate lack of normality.
Note: The Q-Q Plot is especially informative with larger samples; small imperfections are expected in any real sample and do not compromise the overall interpretation.
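To make the idea of comparing quantiles concrete, the following sketch builds the Q-Q coordinates by hand and checks them against what qqnorm() computes. The simulated sample, the exponential contrast sample, and the use of a correlation as a rough summary of straightness are illustrative assumptions, not part of the original material.

```r
set.seed(123)
x <- rnorm(200, mean = 170, sd = 8)  # illustrative sample
n <- length(x)

# Sample quantiles: simply the sorted data.
sample_q <- sort(x)
# Theoretical quantiles: standard normal quantiles at the plotting positions.
theor_q <- qnorm(ppoints(n))

# qqnorm() produces the same coordinates (plot.it = FALSE skips the drawing).
qq <- qqnorm(x, plot.it = FALSE)
same_x <- isTRUE(all.equal(sort(qq$x), theor_q))
same_y <- isTRUE(all.equal(sort(qq$y), sample_q))

# Rough summary of how straight the point cloud is.
r_normal <- cor(theor_q, sample_q)
# A right-skewed sample for contrast (exponential, hypothetical data).
r_skewed <- cor(theor_q, sort(rexp(n)))

print(c(same_x = same_x, same_y = same_y))
print(round(c(r_normal = r_normal, r_skewed = r_skewed), 3))

# The hand-built Q-Q plot.
plot(theor_q, sample_q,
     xlab = "Theoretical quantiles", ylab = "Sample quantiles",
     main = "Q-Q Plot Built by Hand")
```

The skewed sample typically gives a visibly curved plot and a lower correlation, which is the kind of systematic deviation described above.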
Visual Example: Histogram of a Normal Distribution
Situation:
Sample of 200 adult heights.
Observed mean: \(170\) cm.
Observed standard deviation: \(8\) cm.
Chart:
Chart generated in R from 200 simulated observations of \(X \sim \mathcal N(170,\,8^2)\).
Interpretation: The histogram shows a bell shape, symmetric around the mean. Small variations are expected, but the approximation to the normal distribution is very good.
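One optional way to judge the bell shape is to overlay a normal density curve on the histogram. The sketch below does this for a simulated sample with the same parameters; the overlay is an addition for illustration and is not part of the original chart.

```r
set.seed(123)
heights <- rnorm(200, mean = 170, sd = 8)

# freq = FALSE draws the histogram on the density scale,
# so that it is comparable with the normal curve.
h <- hist(heights, breaks = 15, freq = FALSE,
          main = "Histogram of Heights with Normal Curve",
          xlab = "Height (cm)", col = "lightblue", border = "black")

# Overlay the normal density using the sample mean and standard deviation.
curve(dnorm(x, mean = mean(heights), sd = sd(heights)),
      add = TRUE, col = "red", lwd = 2)
```

When the bars follow the red curve closely, the visual impression of approximate normality is easier to defend.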
Visual Example: Q-Q Plot (Levine)
Situation: The same sample of 200 adult heights (\(\mu=170,\; \sigma=8\)) was used to build the Q-Q plot.
Chart:
Chart generated in R with 200 simulated observations of \(X \sim \mathcal N(170,\,8^2)\).
Interpretation:
When the points align approximately along the straight line, we conclude that the distribution is approximately normal.
Small fluctuations are expected in real samples and do not invalidate the analysis.
Generating Plots in R: Histogram and Q-Q Plot
Objective: Generate a sample of heights and visualize the histogram and the Q-Q plot directly in RStudio.
# Generate the sample of heights
set.seed(123)  # Ensures reproducibility
heights <- rnorm(200, mean = 170, sd = 8)

# Display the Histogram
hist(heights,
     breaks = 15,
     main = "Histogram of Heights (Approximate Normal)",
     xlab = "Height (cm)",
     col = "lightblue",
     border = "black")

# Display the Q-Q Plot
qqnorm(heights, main = "Q-Q Plot of Heights")
qqline(heights, col = "red", lwd = 2)
Note: The code generates the plots directly in the RStudio window.
1.4 Step-by-Step to Generate the Plots in RStudio
Objective: Build the Histogram and the Q-Q Plot of the height sample using RStudio.
(1) Generate the sample:
Use rnorm() to create random data from a normal distribution.
(2) Build the Histogram:
Use the hist() function to visualize the data distribution.
(3) Build the Q-Q Plot:
Use qqnorm() to draw the points and qqline() to add the reference line.
Important: Visualize and interpret the plots on screen before applying statistical methods!
Why visualize plots before the analysis?
Before applying any statistical technique, it is essential to explore the data visually. Plots such as histograms and Q-Q plots help check fundamental assumptions, such as normality and symmetry of the distribution, and reveal the presence of outliers.
Applying statistical tests without this prior check may lead to misleading or statistically invalid conclusions. Visualization allows you to detect patterns, deviations, and anomalies that numbers alone may not reveal; it is therefore a critical step in the data analysis workflow.
Activity 1: Generating and Interpreting New Plots in RStudio
Objective: Apply what you've learned to generate new plots in RStudio.
Task:
(1) Generate a new sample of 200 normally distributed observations with:
Answer Key for Activity 1: New Plots Generated in RStudio
R Code:
# Generate new sample
set.seed(456)  # New seed to vary the data
new_heights <- rnorm(200, mean = 160, sd = 5)

# Histogram
hist(new_heights,
     breaks = 15,
     main = "Histogram of Heights (New Sample)",
     xlab = "Height (cm)",
     col = "lightgreen",
     border = "black")

# Q-Q Plot
qqnorm(new_heights, main = "Q-Q Plot of Heights (New Sample)")
qqline(new_heights, col = "blue", lwd = 2)
Interpretation: The new data also approximately follow a normal distribution.
1.5 Step-by-Step to Generate Plots in Excel
Objective: Build the Histogram and the Q-Q Plot of the height sample using Excel.
Histogram in Excel:
Enter the sample data in a column.
Select the data.
Go to Insert > Statistical Charts > Histogram.
Adjust the number of bins as needed.
Q-Q Plot in Excel:
Sort the sample data (ascending).
Compute the theoretical quantile for each sorted value: =NORM.INV((ROW()-0.5)/Total, Mean, StdDev), where Total is the sample size. This form assumes the data start in row 1; if there is a header row and the data start in row 2, use ROW()-1.5 instead. (Tip: you can obtain Mean and StdDev from the data using AVERAGE(range) and STDEV.S(range).)
Build an XY (Scatter) plot of sample data vs. theoretical quantiles.
Add a linear trendline for reference.
Note: The Q-Q Plot must be built manually in Excel, but it is easy to do!
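For readers who want to cross-check the spreadsheet, the same manual steps can be mirrored in R. This is a sketch under stated assumptions: the simulated heights stand in for the Excel column, qnorm() plays the role of NORM.INV, and the least-squares line from lm() plays the role of the linear trendline.

```r
set.seed(123)
heights <- rnorm(200, mean = 170, sd = 8)  # stand-in for the Excel column

n <- length(heights)
sorted <- sort(heights)   # step 1: sort ascending
i <- seq_len(n)           # rank of each sorted value
p <- (i - 0.5) / n        # plotting position, as in (ROW()-0.5)/Total
# Step 2: theoretical quantiles (qnorm is the analogue of NORM.INV).
theoretical <- qnorm(p, mean = mean(heights), sd = sd(heights))

# Step 3: scatter plot of sample data vs. theoretical quantiles.
plot(theoretical, sorted,
     xlab = "Theoretical quantiles (cm)", ylab = "Sample data (cm)",
     main = "Manual Q-Q Plot (Excel-style steps)")

# Step 4: least-squares reference line, analogous to the linear trendline.
fit <- lm(sorted ~ theoretical)
abline(fit, col = "red", lwd = 2)
print(round(coef(fit), 3))
```

Because the theoretical quantiles use the sample mean and standard deviation, points from approximately normal data should fall near a line with slope close to 1.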
1.6 Law of Large Numbers (LLN)
Large samples tend to reflect the true population mean.
The variability of the sample mean decreases as we increase the sample size.
Summary: The LLN ensures that sample means approach the population mean as the sample size grows.
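A short simulation can illustrate the LLN. The sketch below tracks the running mean of draws from a hypothetical skewed population (exponential with true mean 5); the population and the checkpoints are illustrative choices only.

```r
set.seed(2024)
# Hypothetical population: exponential with true mean 5 (rate = 1/5).
true_mean <- 5
x <- rexp(100000, rate = 1 / true_mean)

# Running mean after 1, 2, ..., n observations.
running_mean <- cumsum(x) / seq_along(x)

# Inspect the running mean at a few sample sizes.
checkpoints <- c(10, 100, 1000, 10000, 100000)
print(data.frame(
  n = checkpoints,
  running_mean = round(running_mean[checkpoints], 3),
  abs_error = round(abs(running_mean[checkpoints] - true_mean), 3)
))
```

The error does not have to shrink at every single step, but it tends to become small as the number of observations grows.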
1.7 Central Limit Theorem (CLT)
The mean of a large sample tends to follow a normal distribution.
This holds regardless of the shape of the original distribution, provided its variance is finite.
Conclusion: The CLT is the theoretical basis for using the normal distribution in practice.
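The CLT can also be checked by simulation. The sketch below draws many samples from a strongly right-skewed population (exponential with mean 1 and standard deviation 1) and looks at the distribution of their means; the sample size of 50 and the 5,000 repetitions are arbitrary illustrative choices.

```r
set.seed(99)
n <- 50       # size of each sample
reps <- 5000  # number of samples drawn

# Means of repeated samples from a right-skewed population
# (exponential with mean 1 and standard deviation 1).
sample_means <- replicate(reps, mean(rexp(n, rate = 1)))

mean_of_means <- mean(sample_means)  # expected to be near 1
sd_of_means <- sd(sample_means)      # expected to be near 1 / sqrt(n)
# Share of sample means within 2 standard errors of the population mean;
# a normal distribution would give roughly 95%.
share_2se <- mean(abs(sample_means - 1) <= 2 / sqrt(n))

print(round(c(mean_of_means = mean_of_means,
              sd_of_means = sd_of_means,
              expected_sd = 1 / sqrt(n),
              share_within_2se = share_2se), 3))

# The histogram of the means looks bell-shaped even though the population is skewed.
hist(sample_means, breaks = 30,
     main = "Sample Means from an Exponential Population",
     xlab = "Sample mean", col = "lightblue", border = "black")
```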
1.8 Variability and the Shape of the Normal Curve
\(\sigma\) small \(\rightarrow\) narrower, taller curve.
\(\sigma\) large \(\rightarrow\) wider, flatter curve.
Figure: Variability and Shape of the Normal Curve.
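The relationship between \(\sigma\) and the shape of the curve can be verified numerically. The sketch below compares two normal curves with the same mean; the values 170, 4, and 12 and the helper names peak and width95 are illustrative assumptions.

```r
# Peak height of a normal density, which equals 1 / (sigma * sqrt(2 * pi)).
# peak and width95 are helpers defined only for this illustration.
peak <- function(sigma) dnorm(0, mean = 0, sd = sigma)
# Width of the interval that holds the central 95% of the distribution.
width95 <- function(sigma) diff(qnorm(c(0.025, 0.975), mean = 0, sd = sigma))

peak_small <- peak(4)   # small sigma: taller curve
peak_large <- peak(12)  # large sigma: flatter curve

print(round(c(peak_sigma_4 = peak_small, peak_sigma_12 = peak_large,
              width95_sigma_4 = width95(4), width95_sigma_12 = width95(12)), 4))

# Draw both curves around the same mean (170) for a visual comparison.
curve(dnorm(x, mean = 170, sd = 4), from = 130, to = 210,
      xlab = "x", ylab = "Density", main = "Normal Curves: sigma = 4 vs sigma = 12")
curve(dnorm(x, mean = 170, sd = 12), add = TRUE, lty = 2)
```

Tripling \(\sigma\) makes the curve three times as wide and one third as tall, so the total area under it stays equal to 1.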
Quick Test: True or False?
A normal curve with larger \(\sigma\) is narrower? (T or F)
According to the LLN, small samples already reflect the true mean? (T or F)
The CLT explains the prevalence of normality? (T or F)
Answer Key: Quick Test
A normal curve with larger \(\sigma\) is narrower? (F)
According to the LLN, small samples already reflect the true mean? (F)
The CLT explains the prevalence of normality? (T)
Note: Understanding normality is essential to correctly apply statistical tests and make data-driven decisions!
1.9 Conclusion of Part 3: Plots, CLT, and Approximate Normality
In this final part of the course, you learned:
To identify variables that follow an approximately normal distribution.
To recognize that normality is a key assumption for many statistical methods.
To use plots such as histograms and normal probability plots (Q-Q plots) to assess data normality.
To interpret the results of normality analysis in a practical and applied way.
2 References
Important
Schmuller, Joseph. Statistical Analysis with Excel® For Dummies®, 5th ed. Wiley, 2016.
Schmuller, Joseph. Statistical Analysis with R For Dummies® (Portuguese edition), 2nd ed. Alta Books, 2021.
Levine, D. M.; Stephan, D.; Szabat, K. A. Statistics for Managers Using Microsoft Excel, 8th ed. Pearson, 2017.
Morettin, L. G. Estatística Básica: Probabilidade e Inferência, 7th ed. Pearson, 2017.
Morettin, P. A.; Bussab, W. O. Estatística Básica, 10th ed. SaraivaUni, 2023.