Utilizing The QQ Plot Python (Full Code) » EML (2024)

In data science, some underlying assumptions are made when you use specific machine learning models.

Many of these assumptions are based on the distributions of your data.

In this 5-minute read, we will explore how to find the current distributions of your dependent and independent variables, and also go over some models and the distributions they need.



How To Create a Q-Q Plot in Python using SciPy

import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
# pull in some random data
df = pd.read_csv('ds_salaries.csv')
# let's work with salary
df = df[['salary']]
# use scipy.stats to plot against a normal distribution
stats.probplot(df['salary'], dist="norm", plot=plt)
plt.show()

[Figure: Q-Q plot of salary against the normal distribution]

As we see above, our values are not normally distributed.

We could use transformations like Box-Cox and np.log to try to move the data toward normality.

After applying the transformations, we can replot and check.

import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
# pull in some random data
df = pd.read_csv('ds_salaries.csv')
# let's work with salary
df = df[['salary']]
# use scipy.stats to plot the log of salary against a normal distribution
stats.probplot(np.log(df['salary']), dist="norm", plot=plt)
plt.show()

[Figure: Q-Q plot of log-transformed salary against the normal distribution]

This is much closer to normality but still needs work on the tails.
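
Box-Cox, mentioned above, can be checked the same way. The following is a minimal sketch rather than the article's own code: it assumes synthetic lognormal data in place of ds_salaries.csv, and it uses the r value that stats.probplot returns alongside the fitted line as a rough measure of how straight the Q-Q points are.

```python
import numpy as np
import scipy.stats as stats

# Synthetic right-skewed, strictly positive data standing in for salaries
# (an assumption; the article's ds_salaries.csv is not used here).
rng = np.random.default_rng(0)
salary_like = rng.lognormal(mean=11.0, sigma=0.6, size=1000)

# Box-Cox requires strictly positive input; when no lambda is given,
# SciPy also estimates the lambda that best normalizes the data.
transformed, fitted_lambda = stats.boxcox(salary_like)

# With fit=True (the default), probplot returns (slope, intercept, r)
# next to the plotted points; r close to 1.0 means a nearly straight line.
_, (_, _, r_raw) = stats.probplot(salary_like, dist="norm")
_, (_, _, r_boxcox) = stats.probplot(transformed, dist="norm")

print(f"lambda={fitted_lambda:.3f}  r_raw={r_raw:.4f}  r_boxcox={r_boxcox:.4f}")
```

Passing plot=plt to probplot, as in the examples above, would draw the transformed data instead of only returning the numbers.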


How to Create a Q-Q Plot Manually in Python Using Pandas, Matplotlib and SciPy

# imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.special import ndtri
# pull in some random data
df = pd.read_csv('ds_salaries.csv')
# let's work with salary
df = df[['job_title', 'salary']]
# see our dataframe
df.head()


# to build our Q-Q plot, sort the column in ascending order
df = df.sort_values(by=['salary'], ascending=True).reset_index(drop=True)
# let's run a counter variable to see what we're doing
df['count'] = df.index + 1
df.head()


# build out our points to be plotted:
# take each point minus the mean of the column,
# divided by the std of that column
df['comparison_points'] = (df['salary'] - df['salary'].mean()) / df['salary'].std(ddof=0)
# compare these to the theoretical normal quantiles
# the probability for each row is its row number divided by the total number of rows,
# shifted by 0.5 so the last row is not exactly 1 (ndtri(1) is infinite)
# remember, these plots are sequential
df['real_normal'] = ndtri((df['count'] - 0.5) / df.shape[0])
df.head()
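
One detail worth checking when building the theoretical quantiles by hand: the probability passed to ndtri has to stay strictly between 0 and 1. A small sketch (the five-row size and the variable names are illustrative assumptions) comparing row number divided by row count with the common shift of half a row:

```python
import numpy as np
from scipy.special import ndtri

n = 5
ranks = np.arange(1, n + 1)

# rank / n reaches exactly 1.0 for the last row, and ndtri(1.0) is +inf
naive_quantiles = ndtri(ranks / n)

# (rank - 0.5) / n keeps every probability strictly inside (0, 1)
shifted_quantiles = ndtri((ranks - 0.5) / n)

print(naive_quantiles)
print(shifted_quantiles)
```

An infinite quantile cannot be placed on the plot, so the largest observation would be lost without the shift. The shifted positions are also symmetric around zero, which matches the symmetry of the normal distribution.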


# we would want all points to line up on our line
# we quickly see this is not the case
# our salary data is not normal
plt.scatter(df['real_normal'], df['comparison_points'])
plt.plot([-3, -2, -1, 0, 1, 2, 3], [-3, -2, -1, 0, 1, 2, 3], color='red')
plt.show()

[Figure: manual Q-Q plot of standardized salary against the normal reference line]

If we swap our data for the values the theoretical distribution says it should have, we get the plot below.

# example of a normal plot (from our values)
plt.scatter(df['real_normal'], df['real_normal'])
plt.plot([-3, -2, -1, 0, 1, 2, 3], [-3, -2, -1, 0, 1, 2, 3], color='red')
plt.show()

[Figure: Q-Q plot of theoretical normal quantiles falling exactly on the reference line]

While I know this example is theoretical, when the points sit on the line as in the plot above, we can confidently say our data is normally distributed.

What are probability distributions?

Probability distributions are a very dense subject, and the easiest way to think about them is just a pool of values.

And with this pool of values, when pulling values from it, the values you pull will (hopefully) resemble the same distribution as the original probability distribution.

For this to happen, you’ll need to pull a decent amount of values from your original distribution.

For example, check the chart below, where we pull 500 random points from a population (size 50,000).

As we can see, our sample distribution nearly matches our population data but has a much lower count.

[Figure: distribution of a 500-point sample compared with the 50,000-point population]

This is important because we rarely have access to the whole population data.

Still, it shows that if we have enough samples, our sample distribution will follow the same distribution as the population data.
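
The sampling idea above can be sketched in a few lines. This is an illustration under stated assumptions: the population is synthetic and normally distributed (mean 100, standard deviation 15 are made-up values), since the data behind the chart is not available.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical population of 50,000 values (normal here purely as an assumption)
population = rng.normal(loc=100.0, scale=15.0, size=50_000)

# Draw 500 points without replacement, as in the example above
sample = rng.choice(population, size=500, replace=False)

print(f"population: mean={population.mean():.2f}, std={population.std():.2f}")
print(f"sample:     mean={sample.mean():.2f}, std={sample.std():.2f}")

# Compare a few quantiles of the population and the sample side by side
probs = [0.1, 0.25, 0.5, 0.75, 0.9]
print(np.quantile(population, probs).round(1))
print(np.quantile(sample, probs).round(1))
```

The two rows of quantiles should land close to each other, which is the same comparison a QQ plot makes visually.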


Why are probability distribution types important in data science?

This idea of underlying distributions quickly becomes vital in data science.

This is because many statistical tests have the assumption of normality.

The normal distribution looks like a bell curve and is foundational in many data science and machine learning applications.

[Figure: the bell curve of the normal distribution]


What is the normality assumption in statistics?

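The normality assumption in statistics is that the data (or the errors of a model) come from a normally distributed population. One defining property of the normal distribution is that the sample mean and the sample standard deviation are independent of each other, which only happens in normally distributed data. This assumption is foundational in hypothesis testing, where the normality of data is assumed before running the test.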


Normality Assumption in Linear Regression

With a linear regression model, the residuals should follow a normal distribution. This can be checked with a QQ plot or a Shapiro-Wilk test. If the residuals are not normal, this is sometimes an indicator of outliers, a skewed target variable, or a relationship that is not actually linear.
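
A minimal sketch of that residual check, assuming synthetic data (a straight line plus normal noise) and a simple one-feature fit with np.polyfit; a real model would supply its own residuals:

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(1)

# Synthetic data: a straight line plus normal noise (an assumption for illustration)
x = rng.uniform(0, 10, size=200)
y = 3.0 * x + 2.0 + rng.normal(scale=1.5, size=200)

# Ordinary least squares fit of y = slope * x + intercept
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)

# Q-Q correlation of the residuals against the normal distribution
_, (_, _, r) = stats.probplot(residuals, dist="norm")

# Shapiro-Wilk: a small p-value is evidence against normality
w_stat, p_value = stats.shapiro(residuals)

print(f"Q-Q r={r:.4f}  Shapiro W={w_stat:.4f}  p={p_value:.4f}")
```

An r value close to 1 together with a p-value that is not small is consistent with normal residuals. Neither number proves normality, so it is still worth looking at the plot itself.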


How QQ plots can help us identify the distribution types

QQ plots help us identify distribution types by plotting the quantiles of two sources against each other on one plot: our data and a reference distribution. This quickly allows us to see if our data follows the tested distribution. A QQ plot can be used to test for a match with any distribution.

A common misconception is that QQ plots only work for the normal distribution.

This is entirely false. Here is a QQ plot testing a salary dataset against the Maxwell distribution using SciPy.

As we can see, our data does not come from a Maxwell distribution.

import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
# pull in some random data
df = pd.read_csv('ds_salaries.csv')
# let's work with salary
df = df[['salary']]
# use scipy.stats to plot against a Maxwell distribution
stats.probplot(df['salary'], dist=stats.maxwell, plot=plt)
plt.show()

[Figure: Q-Q plot of salary against the Maxwell distribution]

Here is another example, testing my data against the uniform distribution, whose values lie in the range [0, 1].

We could have easily used any distribution, like the chi-square distribution.

import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
# pull in some random data
df = pd.read_csv('ds_salaries.csv')
# let's work with salary
df = df[['salary']]
# use scipy.stats to plot against a uniform distribution
stats.probplot(df['salary'], dist=stats.uniform, plot=plt)
plt.show()

[Figure: Q-Q plot of salary against the uniform distribution]
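
The chi-square distribution mentioned above needs one extra step, because it takes a shape parameter (degrees of freedom) that probplot expects through its sparams argument. A sketch under stated assumptions: the sample is synthetic, and the choice of 4 degrees of freedom is arbitrary.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(7)

# Synthetic sample from a chi-square distribution with 4 degrees of freedom
dof = 4
sample = rng.chisquare(dof, size=1000)

# Distributions with shape parameters need them passed through sparams
_, (_, _, r_chi2) = stats.probplot(sample, dist=stats.chi2, sparams=(dof,))

# The same sample against the normal distribution, for comparison
_, (_, _, r_norm) = stats.probplot(sample, dist="norm")

print(f"r against chi-square: {r_chi2:.4f}")
print(f"r against normal:     {r_norm:.4f}")
```

The sample lines up better against the distribution it was drawn from, so the chi-square r value comes out higher than the normal one.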

How The QQ Plot Can Ensure Your Data Is The Right Distribution

A QQ plot can give you confidence that your data follows the expected distribution, because when it does, the points fall close to the reference line. If they do not, your data either comes from a different distribution, has outliers, or is skewed, pulling it away from the true theoretical distribution.


Frequently Asked Questions


What does the QQ plot tell us?

A QQ plot tells you whether the data you currently have matches the distribution you are testing it against. It compares two sets of quantiles: the sorted values of your dataset and the theoretical quantiles of your suspected distribution. The closer the points fall to the reference line, the better the match.


Why is the QQ plot used?

A QQ plot is used because it gives a visual representation of your dataset compared to a distribution. The QQ plot quickly allows you to identify if your data matches your proposed distribution. It also allows you to identify skewness, tails, and potential bimodal situations.
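
As an illustration of spotting skewness, the sketch below measures how far the points sit from the fitted reference line at the two ends and in the middle. It assumes a synthetic right-skewed (lognormal) sample; the variable names are illustrative.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(3)

# Synthetic right-skewed sample (lognormal is an assumption for illustration)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

# osm: theoretical normal quantiles, osr: sorted sample values
(osm, osr), (slope, intercept, r) = stats.probplot(skewed, dist="norm")

# Vertical distance of each point from the fitted reference line
gap = osr - (slope * osm + intercept)

print(f"lowest point gap:  {gap[0]:+.2f}")
print(f"middle point gap:  {gap[len(gap) // 2]:+.2f}")
print(f"highest point gap: {gap[-1]:+.2f}")
```

For right-skewed data, both ends sit above the line while the middle sits below it, which gives the bowed shape that signals skewness on a QQ plot.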

Stewart Kaplan

Stewart Kaplan has years of experience as a Senior Data Scientist. He enjoys coding and teaching and has created this website to make Machine Learning accessible to everyone.

