X is high, would we expect Y to also be high, or be low?
Some terms:
Explaining Y with another X means predicting your Y by looking at the distribution of Y for your value of X
table(wage1$numdep,wage1$smsa,dnn=c('Num. Dependents','Lives in Metropolitan Area'))
##                Lives in Metropolitan Area
## Num. Dependents   0   1
##               0  60 192
##               1  27  78
##               2  38  61
##               3  13  32
##               4   3  13
##               5   3   4
##               6   2   0
prop.table(table(wage1$numdep,wage1$smsa,dnn=c('Num. Dependents','Lives in Metropolitan Area')),margin=2)
##                Lives in Metropolitan Area
## Num. Dependents          0          1
##               0 0.41095890 0.50526316
##               1 0.18493151 0.20526316
##               2 0.26027397 0.16052632
##               3 0.08904110 0.08421053
##               4 0.02054795 0.03421053
##               5 0.02054795 0.01052632
##               6 0.01369863 0.00000000
prop.table(table(wage1$numdep,wage1$smsa,dnn=c('Number of Dependents','Lives in Metropolitan Area')),margin=1)
##                     Lives in Metropolitan Area
## Number of Dependents         0         1
##                    0 0.2380952 0.7619048
##                    1 0.2571429 0.7428571
##                    2 0.3838384 0.6161616
##                    3 0.2888889 0.7111111
##                    4 0.1875000 0.8125000
##                    5 0.4285714 0.5714286
##                    6 1.0000000 0.0000000
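To make the role of margin concrete, here is a minimal sketch on a small made-up data frame (toy, its columns kids and metro, and all of its values are hypothetical, not taken from wage1). With margin=1 each row of the result should sum to 1, and with margin=2 each column should.

```r
# Hypothetical toy data, invented for illustration (not from wage1)
toy <- data.frame(
  kids  = c(0, 0, 0, 1, 1, 2, 2, 2),
  metro = c(1, 1, 0, 1, 0, 0, 0, 1)
)

# Counts of each kids/metro combination
counts <- table(toy$kids, toy$metro, dnn = c('Kids', 'Metro'))

# margin=1: proportions within each row (each row sums to 1)
by_row <- prop.table(counts, margin = 1)

# margin=2: proportions within each column (each column sums to 1)
by_col <- prop.table(counts, margin = 2)

print(by_row)
print(by_col)
```

Reading by_row answers "among people with this many kids, what share live in a metro area?", while by_col answers "among metro (or non-metro) residents, what share have this many kids?".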
group_by() to organize the data into groups
summarize() the data within those groups
wage1 %>%
group_by(smsa) %>%
summarize(numdep=mean(numdep))
## # A tibble: 2 x 2
##    smsa numdep
##   <dbl>  <dbl>
## 1     0  1.24
## 2     1  0.968
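As an aside, the same group-then-summarize idea can be sketched without the pipe. This example uses base R's tapply() and aggregate() on made-up data (toy and its values are hypothetical), so it should run without loading any packages.

```r
# Hypothetical toy data, invented for illustration (not from wage1)
toy <- data.frame(
  metro = c(0, 0, 0, 1, 1, 1, 1),
  kids  = c(2, 1, 3, 0, 1, 0, 2)
)

# Mean of kids within each value of metro, returned as a named vector
means_tapply <- tapply(toy$kids, toy$metro, mean)

# The same group means as a data frame, one row per group
means_agg <- aggregate(kids ~ metro, data = toy, FUN = mean)

print(means_tapply)
print(means_agg)
```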
When smsa is high, numdep tends to be low - negative correlation!
“A one-standard-deviation increase in X is associated with a correlation-standard-deviation increase in Y”
cor(wage1$numdep,wage1$smsa)
## [1] -0.09636769
cor(wage1$smsa,wage1$numdep)
## [1] -0.09636769
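One way to check the standard-deviation reading of correlation is to rescale both variables to have a standard deviation of 1 and fit a line: the slope should match cor(). The sketch below tries this on simulated data (x, y, and the numbers used to generate them are assumptions for illustration only).

```r
# Simulated data, invented for illustration
set.seed(1000)
x <- rnorm(200)
y <- 0.5 * x + rnorm(200)

r <- cor(x, y)

# scale() subtracts the mean and divides by the standard deviation
zx <- as.numeric(scale(x))
zy <- as.numeric(scale(y))

# Slope of the best-fit line of standardized y on standardized x
slope <- unname(coef(lm(zy ~ zx))[2])

print(c(correlation = r, standardized_slope = slope))

# The order of the arguments doesn't matter
print(all.equal(cor(x, y), cor(y, x)))
```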
Let’s go back to those different means:
## # A tibble: 2 x 2
##    smsa numdep
##   <dbl>  <dbl>
## 1     0  1.24
## 2     1  0.968
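For a 0/1 variable like smsa, the gap between the two group means and the correlation are two views of the same relationship: rescaling the gap by the two standard deviations gives the correlation. A minimal sketch on made-up numbers (x and y here are hypothetical, not wage1):

```r
# Hypothetical data: x is a 0/1 variable, y is numeric
x <- c(0, 0, 0, 0, 1, 1, 1, 1, 1, 1)
y <- c(3, 1, 2, 2, 0, 1, 2, 0, 1, 2)

# Difference between the mean of y when x is 1 and when x is 0
diff_means <- mean(y[x == 1]) - mean(y[x == 0])

# Rescale the gap: multiply by sd(x) and divide by sd(y)
implied_cor <- diff_means * sd(x) / sd(y)

print(c(diff_means = diff_means, implied_cor = implied_cor, cor = cor(x, y)))
```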
table(df$var1,df$var2) to look at two variables together
prop.table(table(df$var1,df$var2)) for the proportion in each cell
prop.table(table(df$var1,df$var2),margin=2) to get proportions within each column
prop.table(table(df$var1,df$var2),margin=1) to get proportions within each row
df %>% group_by(var1) %>% summarize(mean(var2)) to get mean of var2 for each value of var1
cor(df$var1,df$var2) to calculate correlation
plot(xvar,yvar) with two variables
plot(wage1$educ,wage1$wage,xlab="Years of Education",ylab="Wage")
As we look at different values of educ, what changes about the values of wage we see?
plot(wage1$educ,wage1$wage,xlab="Years of Education",ylab="Wage")
abline(-.9,.5,col='red')
plot(function(x) 5.4-.6*x+.05*(x^2),0,18,add=TRUE,col='blue')
plot(xvar,yvar) is extremely powerful, and will show you relationships at a glance
cor(wage1$wage,wage1$educ) = 0.4059033
library(Ecdat)
data(Clothing)
plot(Clothing$sales,Clothing$margin,xlab="Gross Sales",ylab="Margin")
library(Ecdat)
data(Diamond)
plot(Diamond$carat,Diamond$price,xlab="Number of Carats",ylab="Price")
density() function for different values
wage1 %>% group_by(married) %>% summarize(wage = mean(wage))
## # A tibble: 2 x 2
##   married  wage
##     <dbl> <dbl>
## 1       0  4.84
## 2       1  6.57
cor(wage1$wage,wage1$married)
## [1] 0.2288172
If cor(x,y) is positive, it could be that x causes y … or that y causes x, or that something else causes both!
addata <- read.csv('http://www.nickchk.com/ad_spend_and_gdp.csv')
plot(addata$AdSpending,addata$GDP,
xlab='Ad Spend/Year (Mil.)',ylab='US GDP (Bil.)')
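The "something else causes both" case can be simulated. In this sketch, a made-up variable z drives both x and y, and x has no effect on y at all (all names and numbers are assumptions chosen for illustration, not the ad spending data). x and y should still come out strongly correlated, and the correlation should largely disappear once the part explained by z is removed.

```r
# Simulated data, invented for illustration: z causes both x and y
set.seed(2024)
n <- 500
z <- rnorm(n)
x <- 2 * z + rnorm(n)   # x depends on z only
y <- 3 * z + rnorm(n)   # y depends on z only, not on x

raw_cor <- cor(x, y)

# Remove the part of each variable that z predicts, then correlate what's left
x_resid <- resid(lm(x ~ z))
y_resid <- resid(lm(y ~ z))
resid_cor <- cor(x_resid, y_resid)

print(c(raw = raw_cor, after_removing_z = resid_cor))
```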
Install the SMCRM package, load it, get the customerAcquisition data. Rename it ca
Among acquisition==1 observations, see if the size of first purchase is related to duration as a customer, with cor and (labeled) plot
See if industry and acquisition are dependent on each other using prop.table with the margin option
Find mean revenue by industry using aggregate, then check the cor
Plot the density of revenues for industry==0 in blue and, on the same graph, revenues for industry==1 in red
install.packages('SMCRM')
library(SMCRM)
data(customerAcquisition)
ca <- customerAcquisition
cor(filter(ca,acquisition==1)$first_purchase,filter(ca,acquisition==1)$duration)
plot(filter(ca,acquisition==1)$first_purchase,filter(ca,acquisition==1)$duration,
xlab="Value of First Purchase",ylab="Customer Duration")
prop.table(table(ca$industry,ca$acquisition),margin=1)
prop.table(table(ca$industry,ca$acquisition),margin=2)
aggregate(revenue~industry,data=ca,FUN=mean)
cor(ca$revenue,ca$industry)
plot(density(filter(ca,industry==0)$revenue),col='blue',xlab="Revenues",main="Revenue Distribution")
lines(density(filter(ca,industry==1)$revenue),col='red')