X is high, would we expect Y to also be high, or be low?
Some terms:
Explaining one variable Y with another X means predicting your Y by looking at the distribution of Y for your value of X
table(wage1$numdep,wage1$smsa,dnn=c('Num. Dependents','Lives in Metropolitan Area'))
##                Lives in Metropolitan Area
## Num. Dependents   0   1
##               0  60 192
##               1  27  78
##               2  38  61
##               3  13  32
##               4   3  13
##               5   3   4
##               6   2   0
prop.table(table(wage1$numdep,wage1$smsa,dnn=c('Num. Dependents','Lives in Metropolitan Area')),margin=2)
##                Lives in Metropolitan Area
## Num. Dependents          0          1
##               0 0.41095890 0.50526316
##               1 0.18493151 0.20526316
##               2 0.26027397 0.16052632
##               3 0.08904110 0.08421053
##               4 0.02054795 0.03421053
##               5 0.02054795 0.01052632
##               6 0.01369863 0.00000000
prop.table(table(wage1$numdep,wage1$smsa,dnn=c('Number of Dependents','Lives in Metropolitan Area')),margin=1)
##                     Lives in Metropolitan Area
## Number of Dependents         0         1
##                    0 0.2380952 0.7619048
##                    1 0.2571429 0.7428571
##                    2 0.3838384 0.6161616
##                    3 0.2888889 0.7111111
##                    4 0.1875000 0.8125000
##                    5 0.4285714 0.5714286
##                    6 1.0000000 0.0000000
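To keep the two margin options straight, here is a minimal sketch on made-up data (the `toy` data frame and its `kids` and `metro` columns are hypothetical, not part of wage1). It only checks which direction sums to 1 under each option.

```r
# Hypothetical toy data, standing in for two categorical variables
toy <- data.frame(
  kids  = c(0, 0, 0, 1, 1, 2, 2, 2),
  metro = c(1, 1, 0, 1, 0, 0, 0, 1)
)
counts <- table(toy$kids, toy$metro, dnn = c('kids', 'metro'))

# margin=1: proportions within each row (distribution of metro for each value of kids)
by_row <- prop.table(counts, margin = 1)
# margin=2: proportions within each column (distribution of kids for each value of metro)
by_col <- prop.table(counts, margin = 2)

print(rowSums(by_row))  # each row of by_row sums to 1
print(colSums(by_col))  # each column of by_col sums to 1
```

If the rows (or columns) look different from each other, knowing one variable tells you something about the other.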
group_by() to organize the data into groups
summarize() the data within those groups
wage1 %>%
  group_by(smsa) %>%
  summarize(numdep=mean(numdep))
## # A tibble: 2 x 2
##    smsa numdep
##   <dbl>  <dbl>
## 1     0  1.24
## 2     1  0.968
When smsa is high, numdep tends to be low - negative correlation!
“A one-standard-deviation increase in X is associated with a correlation-standard-deviation increase in Y”
cor(wage1$numdep,wage1$smsa)
## [1] -0.09636769
cor(wage1$smsa,wage1$numdep)
## [1] -0.09636769
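One way to see the standard-deviation reading of correlation is a rough sketch on simulated data (the variables `x` and `y` below are made up, and fitting a line with `lm()` is my choice of illustration, not something the slides do). After rescaling both variables to standard-deviation units, the slope of the fitted line should match `cor(x,y)`, and swapping the arguments should not matter.

```r
set.seed(1)
# Simulated data (hypothetical): y rises with x, plus noise
x <- rnorm(200)
y <- 0.5 * x + rnorm(200)

r <- cor(x, y)

# Put both variables in standard-deviation units
zx <- (x - mean(x)) / sd(x)
zy <- (y - mean(y)) / sd(y)

# Slope of the least-squares line of zy on zx
slope <- unname(coef(lm(zy ~ zx))[2])

print(all.equal(slope, r))              # slope in SD units matches the correlation
print(all.equal(cor(x, y), cor(y, x)))  # correlation is symmetric
```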
Let’s go back to those different means:
## # A tibble: 2 x 2
##    smsa numdep
##   <dbl>  <dbl>
## 1     0  1.24
## 2     1  0.968
table(df$var1,df$var2) to look at two variables together
prop.table(table(df$var1,df$var2)) for the proportion in each cell
prop.table(table(df$var1,df$var2),margin=2) to get proportions within each column
prop.table(table(df$var1,df$var2),margin=1) to get proportions within each row
df %>% group_by(var1) %>% summarize(mean(var2)) to get mean of var2 for each value of var1
cor(df$var1,df$var2) to calculate correlation
plot(xvar,yvar) with two variables
plot(wage1$educ,wage1$wage,xlab="Years of Education",ylab="Wage")
As we look at different values of educ, what changes about the values of wage we see?
plot(wage1$educ,wage1$wage,xlab="Years of Education",ylab="Wage")
abline(-.9,.5,col='red')
plot(function(x) 5.4-.6*x+.05*(x^2),0,18,add=TRUE,col='blue')
plot(xvar,yvar) is extremely powerful, and will show you relationships at a glance
cor(wage1$wage,wage1$educ) = 0.4059033
library(Ecdat)
data(Clothing)
plot(Clothing$sales,Clothing$margin,xlab="Gross Sales",ylab="Margin")
library(Ecdat)
data(Diamond)
plot(Diamond$carat,Diamond$price,xlab="Number of Carats",ylab="Price")
density() function for different values
wage1 %>% group_by(married) %>% summarize(wage = mean(wage))
## # A tibble: 2 x 2
##   married  wage
##     <dbl> <dbl>
## 1       0  4.84
## 2       1  6.57
cor(wage1$wage,wage1$married)
## [1] 0.2288172
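When one of the variables only takes the values 0 and 1, the gap between the two group means and the correlation carry the same information, just on different scales. A small sketch on invented numbers (the `wage` and `married` vectors below are hypothetical, not the wage1 columns) illustrates this: the correlation equals the gap in means rescaled by the two standard deviations.

```r
# Hypothetical toy data: an outcome and a 0/1 group indicator
wage    <- c(4, 5, 6, 5, 7, 8, 6, 9)
married <- c(0, 0, 0, 0, 1, 1, 1, 1)

# Mean of wage within each group
means <- tapply(wage, married, mean)
gap <- unname(means["1"] - means["0"])

# Correlation between the outcome and the indicator
r <- cor(wage, married)

print(gap)
print(all.equal(r, gap * sd(married) / sd(wage)))  # same relationship, rescaled
```

So a positive gap in means always goes with a positive correlation, and vice versa.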
If cor(x,y) is positive, it could be that x causes y… or that y causes x, or that something else causes both!
addata <- read.csv('http://www.nickchk.com/ad_spend_and_gdp.csv')
plot(addata$AdSpending,addata$GDP,
     xlab='Ad Spend/Year (Mil.)',ylab='US GDP (Bil.)')
Install the SMCRM package, load it, get the customerAcquisition data. Rename it ca
Among acquisition==1 observations, see if the size of first purchase is related to duration as a customer, with cor and (labeled) plot
See if industry and acquisition are dependent on each other using prop.table with the margin option
Find the mean revenue by industry using aggregate, then check the cor
Plot the density of revenues for industry==0 in blue and, on the same graph, revenues for industry==1 in red
install.packages('SMCRM')
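library(SMCRM)
library(dplyr) # provides filter(); not needed if the tidyverse is already loaded
data(customerAcquisition)
ca <- customerAcquisition
# First purchase vs. duration among acquired customers
cor(filter(ca,acquisition==1)$first_purchase,filter(ca,acquisition==1)$duration)
plot(filter(ca,acquisition==1)$first_purchase,filter(ca,acquisition==1)$duration,
     xlab="Value of First Purchase",ylab="Customer Duration")
# Are industry and acquisition dependent? Proportions within rows, then within columns
prop.table(table(ca$industry,ca$acquisition),margin=1)
prop.table(table(ca$industry,ca$acquisition),margin=2)
# Mean revenue by industry, then the correlation
aggregate(revenue~industry,data=ca,FUN=mean)
cor(ca$revenue,ca$industry)
# Revenue densities: industry==0 in blue, industry==1 in red
plot(density(filter(ca,industry==0)$revenue),col='blue',xlab="Revenues",main="Revenue Distribution")
lines(density(filter(ca,industry==1)$revenue),col='red')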