Spurious Correlations: The Comedy and Drama of Statistics

What not to do with statistics

By Celia Banks, PhD and Paul Boothroyd III


Introduction

Since Tyler Vigen coined the term 'spurious correlations' for "any random correlations dredged up from silly data" (Vigen, 2014; see Tyler Vigen's personal website), there have been many articles that pay tribute to the perils and pitfalls of this whimsical tendency to manipulate statistics to make correlation equal causation. See: HBR (2015), Medium (2016), FiveThirtyEight (2016). As data scientists, we are tasked with providing statistical analyses that either accept or reject null hypotheses. We are taught to be ethical in how we source data, extract it, preprocess it, and make statistical assumptions about it. And that is no small matter: global corporations depend on the validity and accuracy of our analyses. It is just as important that our work be reproducible. But, regardless of all the 'good' we are taught to practice, there may be that one occasion (or more) where a boss or client will insist that you work the data until it supports the hypothesis and, above all, show how variable y causes variable x when correlated. That is the premise of p-hacking, where you enter into a territory that is far from supported by 'good' practice. In this report, we learn how to conduct flawed research using spurious correlations. We get to delve into 'bad' with the objective of learning what not to do when you are confronted with that inevitable moment to deliver what the boss or client whispers in your ear.

The objective of this project is to teach you

what not to do with statistics

We will demonstrate the spurious correlation of two unrelated variables. Datasets from two different sources were preprocessed and merged together in order to produce visuals of relationships. Spurious correlations occur when two variables are misleadingly correlated, and it is further assumed that one variable directly impacts the other so as to cause a certain outcome. The reason we chose this project idea is that we were interested in ways of managing a client's expectations of what a data analysis project should produce. Team member Banks has at times had clients express displeasure with analysis outcomes, and on one occasion she was asked to go back and look at other data sources and opportunities to "help" arrive at the answers they were seeking. Yes, that is p-hacking: in this case, the client insisted that causal relationships existed because they believed the correlations existed to cause an outcome.

Examples of Spurious Correlations

Excerpts of Tyler Vigen's Spurious Correlations. Retrieved February 1, 2024, from Spurious Correlations (tylervigen.com). Reprinted with permission from the author.

Research Questions Pertinent to this Study

What are the research questions?

Why the heck do we need them?

We're doing a "bad" analysis, right?

Research questions are the foundation of a research study. They guide the research process by focusing on specific topics that the researcher will investigate. The reasons they are essential include, but are not limited to: providing focus and clarity; guiding methodology; establishing the relevance of the study; helping to structure the report; and helping the researcher evaluate results and interpret findings. In learning how a 'bad' analysis is conducted, we addressed the following questions:

(1) Are the data sources valid (not made up)?

(2) How were missing values handled?

(3) How were you able to merge dissimilar datasets?

(4) What are the response and predictor variables?

(5) Is the relationship between the response and predictor variables linear?

(6) Is there a correlation between the response and predictor variables?

(7) Can we say that there is a causal relationship between the variables?

(8) What explanation would you provide a client about the relationship between these two variables?

(9) Did you find spurious correlations in the chosen datasets?

(10) What learning was your takeaway from conducting this project?

Methodology

How did we conduct a study about

Spurious Correlations?

To investigate the presence of spurious correlations between variables, a comprehensive analysis was conducted. The datasets spanned different domains of economic and environmental factors and were collected and affirmed as being from public sources. The datasets contained variables with no apparent causal relationship but which exhibited statistical correlation. The chosen datasets were Apple stock data (the primary) and daily high temperatures in New York City (the secondary). The datasets spanned the period from January 2017 through December 2022.

Rigorous statistical techniques were used to analyze the data. Pearson correlation coefficients were calculated to quantify the strength and direction of linear relationships between pairs of the variables. To complete this analysis, scatter plots of the 5-year daily high temperatures in New York City, candlestick charting of the 5-year Apple stock trend, and dual-axis charting of the daily high temperatures versus the stock trend were used to visualize the relationship between the variables and to identify patterns or trends. The areas the program followed are covered in the sections below.


The Data: Source/Extract/Process

Primary dataset: Apple Stock Price History | Historical AAPL Company Stock Prices | FinancialContent Business Page

Secondary dataset: New York City daily high temperatures from Jan 2017 to Dec 2022: https://www.extremeweatherwatch.com/cities/new-york/year-{year}

The data was affirmed as publicly sourced and accessible for reproducibility. Capturing the data over a period of five years gave a meaningful view of patterns, trends, and linearity. Temperature readings showed seasonal trends. For both temperature and stock, there were troughs and peaks in the data points. Note that temperature was in Fahrenheit, a meteorological setting. We used the astronomical setting to further manipulate our data to pose stronger spuriousness. While the data could be downloaded as CSV or XLS files, for this assignment, Python's Beautiful Soup web scraping library was used.
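As a minimal sketch of the scraping pattern (the full scraper appears in the Code Section at the end; the 'bordered-table daily-table' class reflects the site's markup at the time and may have changed since):

# Minimal scraping sketch; assumes the markup targeted by the full scraper in the Code Section
import requests
from bs4 import BeautifulSoup

url = "https://www.extremeweatherwatch.com/cities/new-york/year-2017"
soup = BeautifulSoup(requests.get(url).text, "html.parser")

# Pull the first daily-temperature table on the page and count its data rows
table = soup.find("table", {"class": "bordered-table daily-table"})
rows = table.find_all("tr")[1:] if table else []
print(f"Parsed {len(rows)} daily rows for 2017")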

Next, the data was checked for missing values and for how many records each dataset contained. The weather data contained date, daily high, and daily low temperature; the Apple stock data contained date, opening price, closing price, volume, stock price, and stock name. To merge the datasets, the date columns needed to be in datetime format. An inner join matched records and discarded non-matching ones. For Apple stock, date and daily closing price were the columns of interest. For the weather, date and daily high temperature were the columns of interest.


The Data: Manipulation

From Duarte® Slide Deck

To do ‘unhealthy’ the proper approach, you have got to

therapeutic massage the information till you discover the

relationship that you simply’re wanting for…​

Our earlier approach did not quite yield the intended results. So, instead of using the summer 2018 temperatures in five U.S. cities, we pulled five years of daily high temperatures for New York City and Apple stock performance from January 2017 through December 2022. In conducting exploratory analysis, we observed weak correlations across the seasons and years. So, our next step was to convert the seasons. Instead of meteorological, we chose astronomical. This gave us 'meaningful' correlations across seasons.
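The conversion itself is just a month-to-season lookup, mirroring the map_season helper in the Code Section at the end:

# Map calendar months to astronomical seasons (mirrors map_season in the Code Section)
def map_season(month: int) -> str:
    if month in (4, 5, 6):
        return 'Spring'
    elif month in (7, 8, 9):
        return 'Summer'
    elif month in (10, 11, 12):
        return 'Fall'
    else:  # 1, 2, 3
        return 'Winter'

print(map_season(7))  # July falls in astronomical Summer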

With the new approach in place, we noticed that merging the datasets was problematic. The date fields were different: for weather, the date was month and day; for stock, the date was in year-month-day format. We addressed this by converting each dataset's date column to datetime. Also, the date columns were sorted in opposite orders, one chronological and one reverse chronological. This was resolved by sorting both date columns in ascending order.

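A minimal sketch of that fix, using tiny made-up frames in place of the scraped data (the column names 'Date', 'High (°F)', and 'Close' match the Code Section):

import pandas as pd

# Hypothetical miniature frames standing in for the scraped weather and stock data
weather = pd.DataFrame({'Date': ['May 2', 'May 1'], 'High (°F)': [71.0, 68.0]})
stock = pd.DataFrame({'Date': ['2020-05-01', '2020-05-02'], 'Close': [71.33, 72.27]})

# Normalize both 'Date' columns to datetime and sort ascending
weather['Date'] = pd.to_datetime(weather['Date'] + ', 2020', format='%B %d, %Y')
stock['Date'] = pd.to_datetime(stock['Date'])
weather = weather.sort_values('Date')
stock = stock.sort_values('Date')

# Inner join keeps only the dates present in both frames
merged = pd.merge(stock, weather, on='Date', how='inner')
print(merged)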

Analysis I: Do We Have Spurious Correlation? Can We Prove It?

The spurious nature of the correlations

here is shown by shifting from

meteorological seasons (Spring: Mar-May,

Summer: Jun-Aug, Fall: Sep-Nov, Winter:

Dec-Feb), which are based on weather

patterns in the northern hemisphere, to

astronomical seasons (Spring: Apr-Jun,

Summer: Jul-Sep, Fall: Oct-Dec, Winter:

Jan-Mar), which are based on Earth's tilt.

Once we finished the exploration, a key point in our analysis of spurious correlation was to determine whether the variables of interest correlate. We eyeballed that Spring 2020 had a correlation of 0.81. We then determined whether there was statistical significance: yes, and at a p-value ≈ 0.000000000000001066818316115281, I'd say we have significance!

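The check itself is a one-liner with SciPy; a minimal sketch follows, assuming the merged_df frame with its 'Season' and 'year' columns as built in the Code Section (the subset variable name is illustrative):

import scipy.stats as stats

# Pearson r and p-value for the Spring 2020 subset (assumes merged_df from the Code Section)
spring_2020 = merged_df.loc[
    (merged_df['year'] == 2020) & (merged_df['Season'] == 'Spring'),
    ['High (°F)', 'Close']
].dropna()
r, p = stats.pearsonr(spring_2020['High (°F)'], spring_2020['Close'])
print(f"Pearson r = {r:.2f}, p-value = {p:.2e}")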
Spring 2020 temperatures correlate with Apple stock

Analysis II: Additional Statistics to Test the Nature of Spuriousness

If there is truly spurious correlation, we will need to

consider whether the correlation equates to causation; that

is, does a change in astronomical-season temperature cause

Apple stock to fluctuate? We employed further

statistical testing to prove or reject the hypothesis

that one variable causes the other variable.

There are numerous statistical tools that test for causality, such as Instrumental Variable (IV) Analysis, Panel Data Analysis, Structural Equation Modeling (SEM), Vector Autoregression Models, Cointegration Analysis, and Granger Causality. IV analysis accounts for omitted variables in regression analysis; Panel Data analysis studies fixed-effects and random-effects models; SEM analyzes structural relationships; Vector Autoregression considers dynamic multivariate time series interactions; and Cointegration Analysis determines whether variables move together in a stochastic trend. We wanted a tool that could finely distinguish between genuine causality and coincidental association. To achieve this, our choice was Granger Causality.

Granger Causality

A Granger test checks whether past values of one series can predict future values of another. In our case, we tested whether past daily high temperatures in New York City could predict future values of Apple stock prices.

Ho: Daily high temperatures in New York City do not Granger-cause Apple stock price fluctuations.

To conduct the test, we ran through 100 lags to see if there was a standout p-value. We encountered near-1.0 p-values, which suggested that we could not reject the null hypothesis, and we concluded that there was no evidence of a causal relationship between the variables of interest.

Granger Causality Test at lags=100
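A minimal sketch of the test, assuming merged_df from the Code Section; statsmodels treats the second column as the candidate predictor of the first, so this ordering tests whether temperature helps predict the stock close (running all 100 lags can take a while):

from statsmodels.tsa.stattools import grangercausalitytests

# Granger test: does 'High (°F)' help predict 'Close'? (assumes merged_df from the Code Section)
pair = merged_df[['Close', 'High (°F)']].dropna()
results = grangercausalitytests(pair, maxlag=100)

# Collect the ssr F-test p-value at each lag; near-1.0 values mean we cannot reject Ho
p_values = {lag: round(res[0]['ssr_ftest'][1], 4) for lag, res in results.items()}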

Analysis III: Statistics to Validate Not Rejecting the Null Ho

Granger causality proved the p-value

insignificant in rejecting the null

hypothesis. But is that enough?

Let's validate our analysis.

To help mitigate the risk of misinterpreting spuriousness as genuine causal effects, performing a cross-correlation analysis in conjunction with a Granger causality test will confirm its finding. Using this approach, if spurious correlation exists, we will observe significance in cross-correlation at some lags without a consistent causal direction or without Granger causality being present.

Cross-Correlation Analysis

This method proceeds in the following steps (a minimal sketch follows the list):

  • Examine temporal patterns of correlations between the variables;
  • If variable A Granger-causes variable B, significant cross-correlation will occur between variable A and variable B at positive lags;
  • Significant peaks in cross-correlation at specific lags suggest the time delay between changes in the causal variable.
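A minimal sketch of that check, again assuming merged_df from the Code Section; a rough ±2/sqrt(N) band serves as a rule of thumb for flagging notable lags:

import numpy as np
from statsmodels.tsa.stattools import ccf

# Cross-correlation of temperature vs. stock close (assumes merged_df from the Code Section)
clean = merged_df[['High (°F)', 'Close']].dropna()
ccf_values = ccf(clean['High (°F)'], clean['Close'])
lags = np.arange(len(ccf_values))  # ccf returns correlations at lags 0, 1, 2, ...

# Rough 95% significance band: |r| > 2/sqrt(N)
threshold = 2 / np.sqrt(len(clean))
significant_lags = lags[np.abs(ccf_values) > threshold]
print(f"{len(significant_lags)} of {len(lags)} lags exceed the rough significance band")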

Interpretation:

The ccf and lag values show significant positive correlation at certain lags. This confirms that spurious correlation exists. However, like the Granger causality test, the cross-correlation analysis cannot support the claim that causality exists in the relationship between the two variables.

Wrapup: Key Learnings

  • Spurious correlations are a form of p-hacking. Correlation does not imply causation.
  • Even with 'bad' data tactics, statistical testing will root out the lack of significance. While there was statistical evidence of spuriousness in the variables, causality testing could not support the claim that causality existed in the relationship between the variables.
  • A study cannot rest on the sole premise that variables displaying linearity can be correlated to demonstrate causality. Instead, other factors that contribute to each variable must be considered.
  • A non-statistical test of whether daily high temperatures in New York City cause Apple stock to fluctuate could be to simply consider: if you owned an Apple stock certificate and you placed it in the freezer, would the value of the certificate be impacted by the cold? Similarly, if you placed the certificate outside on a sunny, hot day, would the sun affect the value of the certificate?

Ethical Considerations: P-Hacking is Not a Valid Analysis

https://www.freepik.com/free-vector/business-people-saying-no-concept-illustration_38687005.htm#query=refuse%20work&position=20&from_view=keyword&track=ais&uuid=e5cd742b-f902-40f7-b7c4-812b147fe1df Image by storyset on Freepik

Spurious correlations are not causality.

P-hacking may affect your credibility as a

data scientist. Be the adult in the room and

refuse to participate in bad statistics.

This study portrayed an analysis that involved 'bad' statistics. It demonstrated how a data scientist could source, extract, and manipulate data in such a way as to statistically show correlation. In the end, statistical testing withstood the challenge and demonstrated that correlation does not equal causality.

Conducting a spurious correlation brings ethical questions of using statistics to derive causation from two unrelated variables. It is an example of p-hacking, which exploits statistics in order to achieve a desired outcome. This study was done as academic research to show the absurdity of misusing statistics.

Another area of ethical consideration is the practice of web scraping. Many website owners warn against pulling data from their sites to use in nefarious ways or ways unintended by them. For that reason, sites like Yahoo Finance make stock data downloadable to CSV files. This is also true for many weather sites where you can request time-series datasets of temperature readings. Again, this study is for academic research and to demonstrate one's ability to extract data in a nonconventional way.

When confronted with a boss or client who compels you to p-hack and supply something like a spurious correlation as proof of causality, explain the implications of their ask and respectfully refuse the project. Whatever your decision, it will have a lasting impact on your credibility as a data scientist.

Dr. Banks is CEO of I-Meta, maker of the patented Spice Chip Technology that provides Big Data analytics for various industries. Mr. Boothroyd, III is a retired Military Analyst. Both are veterans having honorably served in the United States military, and both enjoy discussing spurious correlations. They are cohorts of the University of Michigan, School of Information MADS program…Go Blue!

References

Aschwanden, Christie. January 2016. You Can't Trust What You Read About Nutrition. FiveThirtyEight. Retrieved January 24, 2024 from https://fivethirtyeight.com/features/you-cant-trust-what-you-read-about-nutrition/

Business Management: From the Magazine. June 2015. Beware Spurious Correlations. Harvard Business Review. Retrieved January 24, 2024 from https://hbr.org/2015/06/beware-spurious-correlations

Extreme Weather Watch. 2017–2023. Retrieved January 24, 2024 from https://www.extremeweatherwatch.com/cities/new-york/year-2017

Financial Content Services, Inc. Apple Stock Price History | Historical AAPL Company Stock Prices | FinancialContent Business Page. Retrieved January 24, 2024 from https://markets.financialcontent.com/stocks/quote/historical?Symbol=537%3A908440&Year=2019&Month=1&Range=12

Plotlygraphs. July 2016. Spurious-Correlations. Medium. Retrieved January 24, 2024 from https://plotlygraphs.medium.com/spurious-correlations-56752fcffb69

Vigen, Tyler. Spurious Correlations. Retrieved February 1, 2024 from https://www.tylervigen.com/spurious-correlations

Mr. Vigen's graphs were reprinted with permission from the author, obtained on January 31, 2024.

Images were licensed from their respective owners.

Code Part

##########################
# IMPORT LIBRARIES SECTION
##########################
# Import web scraping tools
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

# Import appropriate visualization libraries
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import seaborn as sns  # New York temperature plotting
import plotly.graph_objects as go  # Apple stock charting
from pandas.plotting import scatter_matrix  # scatterplot matrix

# Import appropriate libraries for New York temperature plotting
import seaborn as sns
import matplotlib.pyplot as plt
from datetime import datetime, timedelta
import re

# Convert day to datetime library
import calendar

# Cross-correlation analysis library
from statsmodels.tsa.stattools import ccf

# Stats library
import scipy.stats as stats

# Granger causality library
from statsmodels.tsa.stattools import grangercausalitytests

##################################################################################
# EXAMINE THE NEW YORK CITY WEATHER AND APPLE STOCK DATA IN READYING FOR MERGE ...
##################################################################################

# Extract New York City weather data for the years 2017 to 2022 for all 12 months
# 5-YEAR NEW YORK CITY TEMPERATURE DATA

# Function to convert the 'Day' column to a consistent date format for merging
def convert_nyc_date(day, month_name, year):
    month_num = datetime.strptime(month_name, '%B').month

    # Extract the numeric day using a regular expression
    day_match = re.search(r'\d+', day)
    day_value = int(day_match.group()) if day_match else 1

    date_str = f"{month_num:02d}-{day_value:02d}-{year}"

    try:
        return pd.to_datetime(date_str, format='%m-%d-%Y')
    except ValueError:
        return pd.to_datetime(date_str, errors='coerce')

# Set variables
years = range(2017, 2023)
all_data = []  # Initialize an empty list to store data for all years

# Enter for loop
for year in years:
    url = f'https://www.extremeweatherwatch.com/cities/new-york/year-{year}'
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

    div_container = soup.find('div', {'class': 'page city-year-page'})

    if div_container:
        select_month = div_container.find('select', {'class': 'form-control url-selector'})

        if select_month:
            monthly_data = []
            for option in select_month.find_all('option'):
                month_name = option.text.strip().lower()

                h5_tag = soup.find('a', {'name': option['value'][1:]}).find_next('h5', {'class': 'mt-4'})

                if h5_tag:
                    responsive_div = h5_tag.find_next('div', {'class': 'responsive'})
                    table = responsive_div.find('table', {'class': 'bordered-table daily-table'})

                    if table:
                        data = []
                        for row in table.find_all('tr')[1:]:
                            cols = row.find_all('td')
                            day = cols[0].text.strip()
                            high_temp = float(cols[1].text.strip())
                            data.append([convert_nyc_date(day, month_name, year), high_temp])

                        monthly_df = pd.DataFrame(data, columns=['Date', 'High (°F)'])
                        monthly_data.append(monthly_df)
                    else:
                        print(f"Table not found for {month_name.capitalize()} {year}")
                else:
                    print(f"h5 tag not found for {month_name.capitalize()} {year}")

            # Concatenate monthly data to form the complete dataframe for the year
            yearly_nyc_df = pd.concat(monthly_data, ignore_index=True)

            # Extract the month name from the 'Date' column
            yearly_nyc_df['Month'] = yearly_nyc_df['Date'].dt.strftime('%B')

            # Capitalize the month names
            yearly_nyc_df['Month'] = yearly_nyc_df['Month'].str.capitalize()

            all_data.append(yearly_nyc_df)


######################################################################################################
# Generate a time series plot of the 5-year New York City daily high temperatures
######################################################################################################

# Concatenate the data for all years
if all_data:
    combined_df = pd.concat(all_data, ignore_index=True)

    # Create a line plot for each year
    plt.figure(figsize=(12, 6))
    sns.lineplot(data=combined_df, x='Date', y='High (°F)', hue=combined_df['Date'].dt.year)
    plt.title('New York City Daily High Temperature Time Series (2017-2022) - 5-Year Trend', fontsize=18)
    plt.xlabel('Date', fontsize=16)  # Set x-axis label
    plt.ylabel('High Temperature (°F)', fontsize=16)  # Set y-axis label
    plt.legend(title='Year', bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=14)  # Display legend outside the plot
    plt.tick_params(axis='both', which='major', labelsize=14)  # Set font size for both axes' ticks
    plt.show()

# APPLE STOCK CODE

# Set variables
years = range(2017, 2023)
data = []  # Initialize an empty list to store data for all years

# Extract Apple's historical data for the years 2017 to 2022
for year in years:
    url = f'https://markets.financialcontent.com/stocks/quote/historical?Symbol=537%3A908440&Year={year}&Month=12&Range=12'
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    table = soup.find('table', {'class': 'quote_detailed_price_table'})

    if table:
        for row in table.find_all('tr')[1:]:
            cols = row.find_all('td')
            date = cols[0].text

            # Check if the year is within the desired range
            if str(year) in date:
                open_price = cols[1].text
                high = cols[2].text
                low = cols[3].text
                close = cols[4].text
                volume = cols[5].text
                change_percent = cols[6].text
                data.append([date, open_price, high, low, close, volume, change_percent])

# Create a DataFrame from the extracted data
apple_df = pd.DataFrame(data, columns=['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Change(%)'])

# Verify that the DataFrame contains 5 years
# apple_df.head(50)

#################################################################
# Generate a candlestick charting of the 5-year stock performance
#################################################################

new_apple_df = apple_df.copy()

# Convert the Apple 'Date' column to a consistent date format
new_apple_df['Date'] = pd.to_datetime(new_apple_df['Date'], format='%b %d, %Y')

# Sort the dataset by 'Date' in ascending order
new_apple_df = new_apple_df.sort_values('Date')

# Convert numerical columns to float, handling empty strings
numeric_cols = ['Open', 'High', 'Low', 'Close', 'Volume', 'Change(%)']
for col in numeric_cols:
    new_apple_df[col] = pd.to_numeric(new_apple_df[col], errors='coerce')

# Create a candlestick chart
fig = go.Figure(data=[go.Candlestick(x=new_apple_df['Date'],
                                     open=new_apple_df['Open'],
                                     high=new_apple_df['High'],
                                     low=new_apple_df['Low'],
                                     close=new_apple_df['Close'])])

# Set the layout
fig.update_layout(title='Apple Stock Candlestick Chart',
                  xaxis_title='Date',
                  yaxis_title='Stock Price',
                  xaxis_rangeslider_visible=False,
                  font=dict(family="Arial", size=16, color="Black"),
                  title_font=dict(family="Arial", size=20, color="Black"),
                  xaxis=dict(
                      title=dict(text="Date", font=dict(family="Arial", size=18, color="Black")),
                      tickfont=dict(family="Arial", size=16, color="Black")
                  ),
                  yaxis=dict(
                      title=dict(text="Stock Price", font=dict(family="Arial", size=18, color="Black")),
                      tickfont=dict(family="Arial", size=16, color="Black")
                  ))

# Show the chart
fig.show()

##########################################
# MERGE THE NEW_NYC_DF WITH NEW_APPLE_DF
##########################################
# Convert the 'Day' column in the New York City combined_df to a consistent date format ...

new_nyc_df = combined_df.copy()

# Add missing weekends to the NYC temperature data
start_date = new_nyc_df['Date'].min()
end_date = new_nyc_df['Date'].max()
weekend_dates = pd.date_range(start_date, end_date, freq='B')  # B: business day frequency (excludes weekends)
missing_weekends = weekend_dates[~weekend_dates.isin(new_nyc_df['Date'])]
missing_data = pd.DataFrame({'Date': missing_weekends, 'High (°F)': None})
new_nyc_df = pd.concat([new_nyc_df, missing_data]).sort_values('Date').reset_index(drop=True)  # Resetting index
new_apple_df = apple_df.copy()

# Convert the Apple 'Date' column to a consistent date format
new_apple_df['Date'] = pd.to_datetime(new_apple_df['Date'], format='%b %d, %Y')

# Sort the datasets by 'Date' in ascending order
new_nyc_df = combined_df.sort_values('Date')
new_apple_df = new_apple_df.sort_values('Date')

# Merge the datasets on the 'Date' column
merged_df = pd.merge(new_apple_df, new_nyc_df, on='Date', how='inner')

# Verify the correct merge -- should keep only NYC temp records that match Apple stock records by Date
merged_df

# Ensure the columns of interest are numeric
merged_df['High (°F)'] = pd.to_numeric(merged_df['High (°F)'], errors='coerce')
merged_df['Close'] = pd.to_numeric(merged_df['Close'], errors='coerce')

# UPDATED CODE BY PAUL USES ASTRONOMICAL TEMPERATURES

# CORRELATION HEATMAP OF YEAR-OVER-YEAR
# DAILY HIGH NYC TEMPERATURES VS.
# APPLE STOCK 2017-2023

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Convert 'Date' to datetime
merged_df['Date'] = pd.to_datetime(merged_df['Date'])

# Define a function to map months to astronomical seasons
def map_season(month):
    if month in [4, 5, 6]:
        return 'Spring'
    elif month in [7, 8, 9]:
        return 'Summer'
    elif month in [10, 11, 12]:
        return 'Fall'
    else:
        return 'Winter'

# Extract the month from the Date column and map it to seasons
merged_df['Season'] = merged_df['Date'].dt.month.map(map_season)

# Extract the years present in the data
years = merged_df['Date'].dt.year.unique()

# Create labels for each combination of year and season
seasons = ['Spring', 'Summer', 'Fall', 'Winter']

# Convert the 'Close' column to numeric
merged_df['Close'] = pd.to_numeric(merged_df['Close'], errors='coerce')

# Create an empty DataFrame to store the correlation matrix
corr_matrix = pd.DataFrame(index=years, columns=seasons)

# Calculate the correlation for each combination of year and season
for year in years:
    year_data = merged_df[merged_df['Date'].dt.year == year]
    for season in seasons:
        data = year_data[year_data['Season'] == season]
        corr = data['High (°F)'].corr(data['Close'])
        corr_matrix.loc[year, season] = corr

# Plot the correlation matrix
plt.figure(figsize=(10, 6))
sns.heatmap(corr_matrix.astype(float), annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Temperature-Stock Correlation', fontsize=18)  # Set main title font size
plt.xlabel('Season', fontsize=16)  # Set x-axis label font size
plt.ylabel('Year', fontsize=16)  # Set y-axis label font size
plt.tick_params(axis='both', which='major', labelsize=14)  # Set annotation font size
plt.tight_layout()
plt.show()

#######################
# STAT ANALYSIS SECTION
#######################
#############################################################
# GRANGER CAUSALITY TEST
# test whether past values of temperature (or stock prices)
# can predict future values of stock prices (or temperature).
# perform the Granger causality test between the 'High (°F)' and
# 'Close' columns in merged_df up to a maximum lag of 255
#############################################################

# Perform the Granger causality test
max_lag = 1  # Choose the maximum lag of 100 - Jupyter times out at higher lags
test_results = grangercausalitytests(merged_df[['High (°F)', 'Close']], max_lag)

# Interpretation:

# looks like none of the lags gives a significant p-value
# at alpha .05, we cannot reject the null hypothesis, that is,
# we cannot conclude that Granger causality exists between daily high
# temperatures in NYC and Apple stock

#################################################################
# CROSS-CORRELATION ANALYSIS
# calculate the cross-correlation between the 'High (°F)' and 'Close'
# columns in merged_df; ccf_values will contain the
# cross-correlation coefficients, while lag_values will
# contain the corresponding lag values
#################################################################

# Calculate the cross-correlation
ccf_values = ccf(merged_df['High (°F)'], merged_df['Close'])
lag_values = np.arange(-len(merged_df)+1, len(merged_df))

ccf_values, lag_values

# Interpretation:
# Looks like there is strong positive correlation in the variables
# in latter years and positive correlation in their respective
# lags. This confirms what our plotting shows us

########################################################
# LOOK AT THE BEST CORRELATION COEFFICIENT - 2020? LET'S
# EXPLORE FURTHER AND CALCULATE THE p-VALUE AND
# CONFIDENCE INTERVAL
########################################################

# Get dataframes for specific periods of spurious correlation

merged_df['year'] = merged_df['Date'].dt.year
best_season_data = merged_df.loc[(merged_df['year'] == 2020) & (merged_df['Season'] == 'Spring')]

# Calculate the correlation coefficient and p-value
corr_coeff, p_value = stats.pearsonr(best_season_data['High (°F)'], best_season_data['Close'])
corr_coeff, p_value

# Perform bootstrapping to obtain a confidence interval
def bootstrap_corr(data, n_bootstrap=1000):
    corr_values = []
    for _ in range(n_bootstrap):
        sample = data.sample(n=len(data), replace=True)
        corr_coeff, _ = stats.pearsonr(sample['High (°F)'], sample['Close'])
        corr_values.append(corr_coeff)
    return np.percentile(corr_values, [2.5, 97.5])  # 95% confidence interval

confidence_interval = bootstrap_corr(best_season_data)
confidence_interval

#####################################################################
# VISUALIZE RELATIONSHIP BETWEEN APPLE STOCK AND NYC DAILY HIGH TEMPS
#####################################################################

# Dual y-axis plotting using the twinx() function from matplotlib
date = merged_df['Date']
temperature = merged_df['High (°F)']
stock_close = merged_df['Close']

# Create a figure and axis
fig, ax1 = plt.subplots(figsize=(10, 6))

# Plot temperature on the left y-axis (ax1)
color = 'tab:red'
ax1.set_xlabel('Date', fontsize=16)
ax1.set_ylabel('Temperature (°F)', color=color, fontsize=16)
ax1.plot(date, temperature, color=color)
ax1.tick_params(axis='y', labelcolor=color)

# Create a secondary y-axis for the stock close prices
ax2 = ax1.twinx()
color = 'tab:blue'
ax2.set_ylabel('Stock Close Price', color=color, fontsize=16)
ax2.plot(date, stock_close, color=color)
ax2.tick_params(axis='y', labelcolor=color)

# Title and show the plot
plt.title('Apple Stock correlates with New York City Temperature', fontsize=18)
plt.show()



Spurious Correlations: The Comedy and Drama of Statistics was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.


