1 Department of Bioinformatics, Institute of Computer Science, University of Białystok
✉ Correspondence: Jarosław Kotowicz <j.kotowicz@uwb.edu.pl>
Example (code copied and then tidied up)
acsNY <- read_csv("https://jaredlander.com/data/acs_ny.csv",
                  col_types = cols(ElectricBill = col_integer(),
                                   FamilyIncome = col_integer(),
                                   FamilyType = col_factor(levels = c("Married", "Female Head", "Male Head")),
                                   HouseCosts = col_integer(),
                                   Insurance = col_integer(),
                                   NumBedrooms = col_integer(),
                                   NumChildren = col_integer(),
                                   NumPeople = col_integer(),
                                   NumRooms = col_integer(),
                                   NumVehicles = col_integer(),
                                   NumWorkers = col_integer()))
f(a, b) is the same as a %>% f(b)
f(a, b) is the same as b %>% f(a, .) Note: the dot marks the place where the piped value is inserted.
g(f(a)) is the same as a %>% f %>% g
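The three rewriting rules for the pipe can be checked directly. This is a minimal sketch, assuming the magrittr package (which provides `%>%` and is loaded with the tidyverse) is installed; the helper functions `f` and `g` and the values of `a` and `b` are hypothetical and serve only as an illustration.

```r
library(magrittr)  # provides the %>% operator

# Hypothetical helpers, not part of the original notes
f <- function(x, y = 1) x - y
g <- function(x) x * 10

a <- 7
b <- 2

r1 <- a %>% f(b)      # same as f(a, b): a becomes the first argument
r2 <- b %>% f(a, .)   # same as f(a, b): the dot puts b in the second position
r3 <- a %>% f %>% g   # same as g(f(a))

stopifnot(r1 == f(a, b), r2 == f(a, b), r3 == g(f(a)))
```

Subtraction is used on purpose: because it is not commutative, the check would fail if the dot placed the value in the wrong position.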
Acres FamilyIncome FamilyType NumBedrooms NumChildren
Length:22745 Min. : 50 Married :18326 Min. :0.000 Min. : 0.0000
Class :character 1st Qu.: 52540 Female Head: 3266 1st Qu.:3.000 1st Qu.: 0.0000
Mode :character Median : 87000 Male Head : 1153 Median :3.000 Median : 0.0000
Mean : 110281 Mean :3.385 Mean : 0.9012
3rd Qu.: 133800 3rd Qu.:4.000 3rd Qu.: 2.0000
Max. :1605000 Max. :8.000 Max. :12.0000
NumPeople NumRooms NumUnits NumVehicles NumWorkers
Min. : 2.00 Min. : 1.000 Length:22745 Min. :0.000 Min. :0.000
1st Qu.: 2.00 1st Qu.: 6.000 Class :character 1st Qu.:2.000 1st Qu.:1.000
Median : 3.00 Median : 7.000 Mode :character Median :2.000 Median :2.000
Mean : 3.39 Mean : 7.175 Mean :2.113 Mean :1.745
3rd Qu.: 4.00 3rd Qu.: 8.000 3rd Qu.:3.000 3rd Qu.:2.000
Max. :18.00 Max. :21.000 Max. :6.000 Max. :3.000
OwnRent YearBuilt HouseCosts ElectricBill FoodStamp
Length:22745 Length:22745 Min. : 4 Min. : 1 Length:22745
Class :character Class :character 1st Qu.: 650 1st Qu.:100 Class :character
Mode :character Mode :character Median :1200 Median :150 Mode :character
Mean :1480 Mean :175
3rd Qu.:2000 3rd Qu.:220
Max. :7090 Max. :580
HeatingFuel Insurance Language
Length:22745 Min. : 0.0 Length:22745
Class :character 1st Qu.: 400.0 Class :character
Mode :character Median : 720.0 Mode :character
Mean : 960.9
3rd Qu.:1200.0
Max. :6600.0
Function name structure: prefix + distribution name
Types of prefixes
Distribution names
norm - normal,
gamma
beta
chisq
etc.
We compute the value of the standard normal density at the argument 0.
[1] 0.3989423
[1] 0.07365403
[1] 0.5000000 0.5792597 0.6554217 0.7257469 0.7881446 0.8413447
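In base R the prefixes are `d` (density), `p` (cumulative distribution function), `q` (quantile function) and `r` (random generation). The calls behind the printed values are not shown in the notes; the sketch below is one reading that agrees with them: `dnorm(0)` gives 0.3989423, and `pnorm` on the grid 0, 0.2, ..., 1 gives the vector that starts with 0.5000000. The grid and the seed are assumptions.

```r
# Prefix + distribution name, shown for the normal distribution ("norm")
d0 <- dnorm(0)                     # d: density at 0
p  <- pnorm(seq(0, 1, by = 0.2))   # p: cumulative distribution function on an assumed grid
q  <- qnorm(0.975)                 # q: quantile function, the inverse of pnorm
set.seed(1)                        # arbitrary seed, only for reproducibility
r  <- rnorm(5)                     # r: five random draws from N(0, 1)

print(d0)
print(p)
print(q)
```

The same four prefixes combine with `gamma`, `beta`, `chisq` and the other distribution names, for example `qchisq` or `rgamma`.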
[-3.17,-2.17] (-2.17,-1.17] (-1.17,-0.169] (-0.169,0.831] (0.831,1.83] (1.83,2.83] (2.83,3.83]
15 109 293 381 175 24 3
x <character>
# total N=1000 valid N=1000 mean=1.88 sd=0.68
Value | N | Raw % | Valid % | Cum. %
--------------------------------------
x | 297 | 29.70 | 29.70 | 29.70
y | 525 | 52.50 | 52.50 | 82.20
z | 178 | 17.80 | 17.80 | 100.00
<NA> | 0 | 0.00 | <NA> | <NA>
df01 <- data.frame(zm1 = sample(c("a", "b"), 100, replace = TRUE),
                   zm2 = sample(c("c", "d"), 100, replace = TRUE))
zm1 <character>
# total N=100 valid N=100 mean=1.38 sd=0.49
Value | N | Raw % | Valid % | Cum. %
-------------------------------------
a | 62 | 62 | 62 | 62
b | 38 | 38 | 38 | 100
<NA> | 0 | 0 | <NA> | <NA>
zm2 <character>
# total N=100 valid N=100 mean=1.55 sd=0.50
Value | N | Raw % | Valid % | Cum. %
-------------------------------------
c | 45 | 45.00 | 45.00 | 45
d | 55 | 55.00 | 55.00 | 100
<NA> | 0 | 0.00 | <NA> | <NA>
[1] -3.1686257 -0.6939557 0.0476431 0.6494921 3.6518932
0% 10% 20% 30% 40% 50% 60% 70%
-3.1686257 -1.2769306 -0.8797193 -0.5415714 -0.2266632 0.0476431 0.2510401 0.4988340
80% 90% 100%
0.8378308 1.2253506 3.6518932
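The five-number summary and the deciles above have the shape produced by `fivenum()` and `quantile()`. The original calls and data are not shown, so the sketch below uses a freshly simulated sample; the seed and the sample itself are assumptions, and the numbers will differ from those printed above.

```r
set.seed(123)       # arbitrary seed, an assumption for reproducibility
x <- rnorm(1000)    # stand-in sample; the original data are not available here

five <- fivenum(x)                                # min, lower hinge, median, upper hinge, max
dec  <- quantile(x, probs = seq(0, 1, by = 0.1))  # deciles, named 0%, 10%, ..., 100%

print(five)
print(dec)
```

The third element of `fivenum()` and the 50% decile are both the sample median, which gives a quick consistency check between the two outputs.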
Min. 1st Qu. Median Mean 3rd Qu. Max.
-3.337 1.612 3.095 2.982 4.293 10.304
[1] -3.337251 1.612089 3.095286 4.298984 10.303786
x.norm32
n missing distinct Info Mean Gmd .05 .10 .25 .50
1000 0 1000 1 2.982 2.229 -0.3510 0.4461 1.6121 3.0953
.75 .90 .95
4.2929 5.4507 6.0952
lowest : -3.337251 -3.304396 -2.868932 -2.750761 -1.974559
highest: 8.327035 8.470047 8.957606 9.994734 10.303786
[1] -0.03115604
[1] -0.03115604
[1] -0.03124973
[1] -0.03115604
[1] 0.1129802
[1] 0.1258378
[1] 0.1129802
[1] -0.02882743
norm01 norm32 ga
norm01 0.97850552 1.95701104 -0.02882743
norm32 1.95701104 3.91402208 -0.05765486
ga -0.02882743 -0.05765486 2.19773905
norm01 norm32 ga
norm01 1.00000000 1.00000000 -0.01965786
norm32 1.00000000 1.00000000 -0.01965786
ga -0.01965786 -0.01965786 1.00000000
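The matrices above show a correlation of exactly 1 between norm01 and norm32, a covariance that is twice the variance of norm01, and a variance of norm32 that is four times as large. That pattern is what a linear transform `norm32 = 3 + 2 * norm01` would produce. The notes do not state this relation, so it is an inference here; the gamma parameters and the seed below are also assumptions.

```r
set.seed(42)                               # arbitrary seed
norm01 <- rnorm(1000)                      # standard normal sample
norm32 <- 3 + 2 * norm01                   # assumed linear transform of norm01
ga     <- rgamma(1000, shape = 2, rate = 1)  # hypothetical gamma parameters

df <- data.frame(norm01, norm32, ga)

print(cov(df))   # cov(norm01, norm32) equals 2 * var(norm01)
print(cor(df))   # cor(norm01, norm32) equals 1 for an increasing linear transform
```

Because correlation is invariant under increasing linear transforms, norm01 and norm32 also share the same correlation with `ga`, as in the output above.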
norm01 norm32 ga
norm01 1.00 1.00 -0.02
norm32 1.00 1.00 -0.02
ga -0.02 -0.02 1.00
n= 1000
P
norm01 norm32 ga
norm01 0.0000 0.5347
norm32 0.0000 0.5347
ga 0.5347 0.5347
norm01 norm32 ga
norm01 1.00 1.00 -0.01
norm32 1.00 1.00 -0.01
ga -0.01 -0.01 1.00
n= 1000
P
norm01 norm32 ga
norm01 0.0000 0.8288
norm32 0.0000 0.8288
ga 0.8288 0.8288
mean sd
-0.009051554 0.988699659
( 0.031265428) ( 0.022107996)
shape rate
1.81814044 0.93017351
(0.07504241) (0.04415700)
upper mean lower
0.052332592 -0.009051554 -0.070435701
mean lwr.ci upr.ci
-0.009051554 -0.070435701 0.052332592
One Sample t-test
data: x.norm01
t = -0.28936, df = 999, p-value = 0.7724
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
-0.07043570 0.05233259
sample estimates:
mean of x
-0.009051554
No class or unkown class. Using default calcuation.
Estimate CI lower CI upper Std. Error
-0.009051554 -0.070435701 0.052332592 0.031281073
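All of the intervals for the mean above agree, because each is the classical t interval: mean ± t quantile × standard error. A minimal sketch of that computation follows; the helper `ci_mean` is hypothetical, while the numbers in the second part are taken from the output above.

```r
# Hypothetical helper: t-based confidence interval for the mean
ci_mean <- function(x, conf = 0.95) {
  n    <- length(x)
  se   <- sd(x) / sqrt(n)                          # standard error of the mean
  half <- qt(1 - (1 - conf) / 2, df = n - 1) * se  # half-width of the interval
  c(mean = mean(x), lower = mean(x) - half, upper = mean(x) + half)
}

# Check against t.test() on a simulated sample (the seed is arbitrary)
set.seed(1)
x <- rnorm(1000)
print(ci_mean(x))
print(t.test(x)$conf.int)

# Recompute the interval from the reported estimate and standard error
est   <- -0.009051554
se    <- 0.031281073
lower <- est - qt(0.975, df = 999) * se
upper <- est + qt(0.975, df = 999) * se
print(c(lower, upper))
```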
Estimate CI lower CI upper Std. Error
[1,] 0.294 0.2659037 0.3233147 0.01440708
1-sample proportions test with continuity correction
data: sum(x.bin) out of length(x.bin), null probability 0.3
X-squared = 0.14405, df = 1, p-value = 0.7043
alternative hypothesis: true p is not equal to 0.3
95 percent confidence interval:
0.2661099 0.3234946
sample estimates:
p
0.294
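The test statistic reported by `prop.test` can be reproduced by hand. The counts below (294 successes out of 1000) are inferred from the estimate 0.294 and the reported standard error, not read from the data, so treat them as an assumption.

```r
x_succ <- 294    # inferred number of successes
n      <- 1000   # inferred sample size
p0     <- 0.3    # null proportion, as in the output

# Chi-squared statistic with continuity correction (subtract 0.5 from the deviation)
stat_manual <- (abs(x_succ - n * p0) - 0.5)^2 / (n * p0 * (1 - p0))

res <- prop.test(x = x_succ, n = n, p = p0)
print(stat_manual)
print(res)
```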
# Confidence interval for the variance of a normally distributed sample
CI_var_N <- function(dane, alpha = .05) {
  n <- length(dane)
  konieclewy <- (n - 1) * var(dane) / qchisq(1 - alpha/2, df = n - 1)  # lower end
  koniecprawy <- (n - 1) * var(dane) / qchisq(alpha/2, df = n - 1)     # upper end
  cat(paste("Confidence interval for the variance:\n [", konieclewy, ",", koniecprawy, "]\n"))
}
CI_var_N(x.norm01)
Confidence interval for the variance:
[ 0.898060293361267 , 1.07032294630049 ]
One Sample t-test
data: x.norm01
t = -0.28936, df = 999, p-value = 0.7724
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
-0.07043570 0.05233259
sample estimates:
mean of x
-0.009051554
Interpret the result!
One Sample t-test
data: x.norm32
t = 47.663, df = 999, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
2.859129 3.104665
sample estimates:
mean of x
2.981897
Interpret the result!
One Sample t-test
data: x.norm32
t = -0.28936, df = 999, p-value = 0.7724
alternative hypothesis: true mean is not equal to 3
95 percent confidence interval:
2.859129 3.104665
sample estimates:
mean of x
2.981897
Interpret the result!
One Sample t-test
data: x.norm32
t = -0.28936, df = 999, p-value = 0.3862
alternative hypothesis: true mean is less than 3
95 percent confidence interval:
-Inf 3.084898
sample estimates:
mean of x
2.981897
Interpret the result!
One Sample t-test
data: x.norm32
t = -0.28936, df = 999, p-value = 0.6138
alternative hypothesis: true mean is greater than 3
95 percent confidence interval:
2.878896 Inf
sample estimates:
mean of x
2.981897
Interpret the result!
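The three tests of the mean against 3 share one t statistic and differ only in how the p-value is read off the t distribution. A short sketch using the reported values t = -0.28936 and df = 999 reproduces the three p-values (0.7724, 0.3862, 0.6138) up to rounding.

```r
t_stat <- -0.28936   # reported t statistic
df     <- 999        # reported degrees of freedom

p_two     <- 2 * pt(-abs(t_stat), df)            # alternative: mean != 3
p_less    <- pt(t_stat, df)                      # alternative: mean < 3
p_greater <- pt(t_stat, df, lower.tail = FALSE)  # alternative: mean > 3

print(round(c(two.sided = p_two, less = p_less, greater = p_greater), 4))
```

The one-sided p-values always sum to 1, and the two-sided value is twice the smaller of them.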
Welch Two Sample t-test
data: x.norm32 and x.norm01
t = 42.76, df = 1469.1, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
2.853743 3.128154
sample estimates:
mean of x mean of y
2.981896891 -0.009051554
Interpret the result!
Welch Two Sample t-test
data: x.norm32[1:500] and x.norm32[501:1000]
t = 0.46174, df = 997.23, p-value = 0.6444
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.1878366 0.3034311
sample estimates:
mean of x mean of y
3.010796 2.952998
Interpret the result!
F test to compare two variances
data: x.norm01 and x.norm32
F = 0.25, num df = 999, denom df = 999, p-value < 2.2e-16
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.2208247 0.2830300
sample estimates:
ratio of variances
0.25
Interpret the result!
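A variance ratio of exactly 0.25 is what one would see if x.norm32 were obtained as `3 + 2 * x.norm01`, since doubling a variable multiplies its variance by 4. The notes do not state that relation, so it is an assumption in this sketch; the seed is arbitrary. Under that assumption the ratio and its confidence interval do not depend on the particular sample.

```r
set.seed(5)            # arbitrary seed
x <- rnorm(1000)       # stand-in for x.norm01
y <- 3 + 2 * x         # assumed construction of x.norm32

res <- var.test(x, y)  # F test for equality of two variances
print(res)

# The same interval computed by hand from the F quantiles
ratio <- var(x) / var(y)
ci    <- c(ratio / qf(0.975, 999, 999), ratio / qf(0.025, 999, 999))
print(ci)
```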
F test to compare two variances
data: x.norm01[1:100] and x.norm01[901:100]
F = 0.97939, num df = 99, denom df = 801, p-value = 0.9215
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.7406962 1.3406822
sample estimates:
ratio of variances
0.979391
Interpret the result!
F test to compare two variances
data: x.norm01 and x.norm32
F = 0.25, num df = 999, denom df = 999, p-value < 2.2e-16
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.2208247 0.2830300
sample estimates:
ratio of variances
0.25
Interpret the result!
Mood two-sample test of scale
data: x.norm01 and x.norm32
Z = -5.3831, p-value = 7.323e-08
alternative hypothesis: two.sided
Interpret the result!
Ansari-Bradley test
data: x.norm01 and x.norm32
AB = 541388, p-value = 2.408e-10
alternative hypothesis: true ratio of scales is not equal to 1
Interpret the result!
1-sample proportions test with continuity correction
data: 470 out of 1000, null probability 0.5
X-squared = 3.481, df = 1, p-value = 0.06208
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
0.4387437 0.5014896
sample estimates:
p
0.47
Interpret the result!
Cramer-von Mises normality test
data: x.norm01
W = 0.10747, p-value = 0.08869
Interpret the result!
Anderson-Darling normality test
data: x.norm01
A = 0.52378, p-value = 0.1819
Interpret the result!
Shapiro-Wilk normality test
data: x.norm01
W = 0.99829, p-value = 0.428
Interpret the result!
Lilliefors (Kolmogorov-Smirnov) normality test
data: x.norm01
D = 0.028205, p-value = 0.05865
Interpret the result!
Pearson chi-square normality test
data: x.norm01
P = 31.04, p-value = 0.3635
Interpret the result!
Title:
D'Agostino Normality Test
Test Results:
STATISTIC:
Chi2 | Omnibus: 0.8855
Z3 | Skewness: -0.4057
Z4 | Kurtosis: 0.8491
P VALUE:
Omnibus Test: 0.6423
Skewness Test: 0.685
Kurtosis Test: 0.3959
Description:
Thu May 14 22:08:00 2020 by user: user
Chi-squared approximation may be incorrect
Chi-squared test for given probabilities
data: x.unif
X-squared = 166.57, df = 999, p-value = 1
Interpret the result!
Chi-squared test for given probabilities
data: x.norm32 - min(x.norm32)
X-squared = 618.77, df = 999, p-value = 1
Interpret the result!
One-sample Kolmogorov-Smirnov test
data: x.unif
D = 1, p-value < 2.2e-16
alternative hypothesis: two-sided
Interpret the result!
One-sample Kolmogorov-Smirnov test
data: x.unif
D = 0.022248, p-value = 0.7054
alternative hypothesis: two-sided
Interpret the result!
One-sample Kolmogorov-Smirnov test
data: x.norm32
D = 0.70315, p-value < 2.2e-16
alternative hypothesis: two-sided
Interpret the result!
One-sample Kolmogorov-Smirnov test
data: x.norm32
D = 0.024589, p-value = 0.581
alternative hypothesis: two-sided
Interpret the result!
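The paired one-sample Kolmogorov-Smirnov outputs (a very large D followed by a small one for the same data) are consistent with running `ks.test` first against the default parameters of the reference distribution and then against parameters that match the sample. The original calls are not shown, so this reading is an assumption, as are the seed and the parameters below.

```r
set.seed(2020)                      # arbitrary seed
x <- rnorm(1000, mean = 3, sd = 2)  # stand-in for a N(3, 2) sample

# Reference distribution left at its defaults: N(0, 1)
ks_default <- ks.test(x, "pnorm")

# Reference distribution with the parameters passed through to pnorm()
ks_matched <- ks.test(x, "pnorm", mean = 3, sd = 2)

print(ks_default)
print(ks_matched)
```

Extra arguments after the distribution name are forwarded to the distribution function, so omitting them tests against the standard member of the family rather than the one the data came from.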
Two-sample Kolmogorov-Smirnov test
data: x.norm01 and x.norm32
D = 0.714, p-value < 2.2e-16
alternative hypothesis: two-sided
Interpret the result!
Two-sample Kolmogorov-Smirnov test
data: x.norm01 and x.norm32
D^+ = 0.714, p-value < 2.2e-16
alternative hypothesis: the CDF of x lies above that of y
Interpret the result!
Two-sample Kolmogorov-Smirnov test
data: x.norm01 and x.norm32
D^- = 0.002, p-value = 0.996
alternative hypothesis: the CDF of x lies below that of y
Interpret the result!
Pearson's product-moment correlation
data: x.norm01 and x.norm32
t = Inf, df = 998, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
1 1
sample estimates:
cor
1
Interpret the result!
Pearson's product-moment correlation
data: x.norm01 and df$ga
t = -0.62113, df = 998, p-value = 0.5347
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.08155156 0.04238688
sample estimates:
cor
-0.01965786
Interpret the result!
Spearman's rank correlation rho
data: x.norm01 and df$ga
S = 167807514, p-value = 0.8288
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
-0.006846091
Interpret the result!
Pearson's Chi-squared test with Yates' continuity correction
data: tabela
X-squared = 1.9825, df = 1, p-value = 0.1591
Interpret the result!
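The object `tabela` is not defined in the visible part of the notes. One plausible construction, shown here purely as an assumption, is a 2 x 2 contingency table of two categorical variables such as `zm1` and `zm2`; for such a table `chisq.test` applies Yates' continuity correction by default. The seed is arbitrary, so the numbers will differ from the output above.

```r
set.seed(7)   # arbitrary seed
df01 <- data.frame(zm1 = sample(c("a", "b"), 100, replace = TRUE),
                   zm2 = sample(c("c", "d"), 100, replace = TRUE))

tabela <- table(df01$zm1, df01$zm2)   # assumed construction of the 2 x 2 table
res    <- chisq.test(tabela)          # Yates' correction is applied for 2 x 2 tables

print(tabela)
print(res)
```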
daneMieszkania <- read_delim("http://www.biecek.pl/R/dane/daneMieszkania.csv",
                             ";", escape_double = FALSE, trim_ws = TRUE)
Parsed with column specification:
cols(
  cena = col_double(),
  pokoi = col_double(),
  powierzchnia = col_double(),
  dzielnica = col_character(),
  `typ budynku` = col_character()
)
cena pokoi powierzchnia dzielnica typ budynku
Min. : 83280 Min. :1.00 Min. :17.00 Biskupin :65 kamienica :61
1st Qu.:143304 1st Qu.:2.00 1st Qu.:31.15 Krzyki :79 niski blok:63
Median :174935 Median :3.00 Median :43.70 Srodmiescie:56 wiezowiec :76
Mean :175934 Mean :2.55 Mean :46.20
3rd Qu.:208741 3rd Qu.:3.00 3rd Qu.:61.40
Max. :295762 Max. :4.00 Max. :87.70
Call:
lm(formula = cena ~ dzielnica, data = daneMieszkania)
Residuals:
Min 1Q Median 3Q Max
-84893 -31892 -880 29611 106268
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 189494 5238 36.178 < 2e-16 ***
dzielnicaKrzyki -21321 7072 -3.015 0.00291 **
dzielnicaSrodmiescie -18351 7699 -2.383 0.01810 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 42230 on 197 degrees of freedom
Multiple R-squared: 0.04873, Adjusted R-squared: 0.03907
F-statistic: 5.046 on 2 and 197 DF, p-value: 0.007294
Analysis of Variance Table
Response: cena
Df Sum Sq Mean Sq F value Pr(>F)
dzielnica 2 1.7995e+10 8997691613 5.0456 0.007294 **
Residuals 197 3.5130e+11 1783263361
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Interpret the result!
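The regression summary and the ANOVA table above come from the same one-factor model, `lm(cena ~ dzielnica, data = daneMieszkania)`. The sketch below mirrors that structure on simulated data so that it runs without downloading the file: the group sizes (65, 79, 56) are taken from the summary above, while the simulated prices and the seed are hypothetical.

```r
set.seed(10)   # arbitrary seed
dane <- data.frame(
  dzielnica = factor(rep(c("Biskupin", "Krzyki", "Srodmiescie"),
                         times = c(65, 79, 56))),
  cena      = rnorm(200, mean = 175000, sd = 42000)   # hypothetical prices
)

model <- lm(cena ~ dzielnica, data = dane)   # one-factor linear model
tab   <- anova(model)                        # ANOVA table for the same model

print(summary(model))
print(tab)
```

With a single factor, the F statistic at the bottom of `summary(model)` and the F value in the ANOVA table are the same number, which is also the case in the output above (5.046).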
Harvey-Collier test
data: model
HC = 1.3246, df = 196, p-value = 0.1868
Interpret the result!
Rainbow test
data: model
Rain = 0.98189, df1 = 100, df2 = 97, p-value = 0.5364
Interpret the result!
Jarque-Bera test for normality
data: model$residuals
JB = 5.2583, p-value = 0.0555
Interpret the result!
Durbin-Watson test
data: model
DW = 2.1565, p-value = 0.8655
alternative hypothesis: true autocorrelation is greater than 0
Interpret the result!
Goldfeld-Quandt test
data: model
GQ = 1.0691, df1 = 97, df2 = 97, p-value = 0.3713
alternative hypothesis: variance increases from segment 1 to 2
Interpret the result!
studentized Breusch-Pagan test
data: model
BP = 2.2201, df = 2, p-value = 0.3295
Interpret the result!
Harrison-McCabe test
data: model
HMC = 0.48313, p-value = 0.362
Interpret the result!
Bartlett test of homogeneity of variances
data: cena by dzielnica
Bartlett's K-squared = 1.3075, df = 2, p-value = 0.5201
Interpret the result!
Fligner-Killeen test of homogeneity of variances
data: cena by dzielnica
Fligner-Killeen:med chi-squared = 1.7358, df = 2, p-value = 0.4198
Interpret the result!
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 2 0.6934 0.5011
197
Modified robust Brown-Forsythe Levene-type test based on the absolute deviations from
the median
data: residuals(model)
Test Statistic = 0.6934, p-value = 0.5011
Classical Levene's test based on the absolute deviations from the mean ( none not
applied because the location is not set to median )
data: residuals(model)
Test Statistic = 0.72784, p-value = 0.4842
Error in 'mood.test.formula(cena ~ dzielnica, data = daneMieszkania)':
grouping factor must have exactly 2 levels
Interpret the result!
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = model)
$dzielnica
diff lwr upr p adj
Krzyki-Biskupin -21321.019 -38021.10 -4620.9333 0.0081457
Srodmiescie-Biskupin -18350.541 -36532.88 -168.2053 0.0473579
Srodmiescie-Krzyki 2970.478 -14450.28 20391.2340 0.9145465
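Tukey's procedure compares every pair of group means while keeping the family-wise confidence level at 95%. A minimal sketch with base R follows; the data are simulated with the group sizes from the summary of daneMieszkania, and the prices and seed are hypothetical, so the numbers will not match the table above.

```r
set.seed(10)   # arbitrary seed
dane <- data.frame(
  dzielnica = factor(rep(c("Biskupin", "Krzyki", "Srodmiescie"),
                         times = c(65, 79, 56))),
  cena      = rnorm(200, mean = 175000, sd = 42000)   # hypothetical prices
)

fit <- aov(cena ~ dzielnica, data = dane)   # one-way ANOVA fit
tk  <- TukeyHSD(fit, conf.level = 0.95)     # all pairwise differences of means

print(tk)
```

Each row gives the difference of two group means (`diff`), the simultaneous confidence limits (`lwr`, `upr`) and the adjusted p-value (`p adj`); a pair differs significantly when its interval excludes 0.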
Study: aov(model) ~ "dzielnica"
Scheffe Test for cena
Mean Square Error : 1783263361
dzielnica, means
Alpha: 0.05 ; DF Error: 197
Critical Value of F: 3.041753
Groups according to probability of means differences and alpha level( 0.05 )
Means with the same letter are not significantly different.
Study: aov(model) ~ "dzielnica"
LSD t Test for cena
Mean Square Error: 1783263361
dzielnica, means and individual ( 95 %) CI
Alpha: 0.05 ; DF Error: 197
Critical Value of t: 1.972079
Comparison between treatments means
Study: aov(model) ~ "dzielnica"
LSD t Test for cena
P value adjustment method: bonferroni
Mean Square Error: 1783263361
dzielnica, means and individual ( 95 %) CI
Alpha: 0.05 ; DF Error: 197
Critical Value of t: 2.414597
Comparison between treatments means
Study: aov(model) ~ "dzielnica"
Duncan's new multiple range test
for cena
Mean Square Error: 1783263361
dzielnica, means
Groups according to probability of means differences and alpha level( 0.05 )
Means with the same letter are not significantly different.