1 Department of Bioinformatics, Institute of Computer Science, University of Białystok
This notebook was inspired by section 6.1 of Jared P. Lander's book “R dla każdego. Zaawansowane analizy i grafika statystyczna”.1 The data used here also come from Jared P. Lander's website.2
We use
Note:
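library(readr)  # provides read_csv(), cols() and col_factor()
acsNew <- read_csv("http://www.jaredlander.com/data/acsNew.csv",
                   col_types = cols(
                     # read these two columns as factors with a fixed level order
                     FamilyType = col_factor(levels = c("Married", "Female Head", "Male Head")),
                     Income = col_factor(levels = c("Below", "Above"))
                   )
)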
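As a rough illustration of what the col_types argument does, the sketch below parses a tiny inline CSV instead of the remote file. The two-row data set and the names csv_text and toy are made up for this example, and passing literal text through I() assumes readr 2.0 or later.

```r
library(readr)

# Hypothetical inline data standing in for acsNew.csv (not the real file).
csv_text <- "FamilyType,Income\nMarried,Below\nMale Head,Above\n"

toy <- read_csv(
  I(csv_text),  # I() marks the string as literal data (assumes readr >= 2.0)
  col_types = cols(
    FamilyType = col_factor(levels = c("Married", "Female Head", "Male Head")),
    Income     = col_factor(levels = c("Below", "Above"))
  )
)

# The levels keep the order given in col_factor(), including unused ones.
levels(toy$FamilyType)
levels(toy$Income)
```

Declaring the levels up front fixes their order, which later shows up in the order of rows printed by summary.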
We load the dplyr library.
Let's see what the data contain (we use the summary function).
Acres FamilyIncome FamilyType NumBedrooms
Length:2273 Min. : 1125 Married :1831 Min. :0.000
Class :character 1st Qu.: 53700 Female Head: 327 1st Qu.:3.000
Mode :character Median : 89200 Male Head : 115 Median :3.000
Mean : 110982 Mean :3.404
3rd Qu.: 136630 3rd Qu.:4.000
Max. :1014000 Max. :8.000
NumChildren NumPeople NumRooms NumUnits
Min. :0.0000 Min. : 2.000 Min. : 1.000 Length:2273
1st Qu.:0.0000 1st Qu.: 2.000 1st Qu.: 6.000 Class :character
Median :0.0000 Median : 3.000 Median : 7.000 Mode :character
Mean :0.8847 Mean : 3.401 Mean : 7.241
3rd Qu.:2.0000 3rd Qu.: 4.000 3rd Qu.: 8.000
Max. :8.0000 Max. :12.000 Max. :21.000
NumVehicles NumWorkers OwnRent YearBuilt HouseCosts
Min. :0.000 Min. :0.000 Length:2273 Length:2273 Min. : 4
1st Qu.:2.000 1st Qu.:1.000 Class :character Class :character 1st Qu.: 670
Median :2.000 Median :2.000 Mode :character Mode :character Median :1200
Mean :2.118 Mean :1.778 Mean :1488
3rd Qu.:3.000 3rd Qu.:2.000 3rd Qu.:2000
Max. :6.000 Max. :3.000 Max. :6500
ElectricBill FoodStamp HeatingFuel Insurance
Min. : 1.0 Length:2273 Length:2273 Min. : 0.0
1st Qu.:100.0 Class :character Class :character 1st Qu.: 400.0
Median :150.0 Mode :character Mode :character Median : 720.0
Mean :176.6 Mean : 968.1
3rd Qu.:220.0 3rd Qu.:1200.0
Max. :580.0 Max. :6600.0
Language Income
Length:2273 Below:1817
Class :character Above: 456
Mode :character
Let's check the type of the data (the class function).
[1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
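The chunks that produced the two outputs above are not preserved here. A minimal sketch of the same two calls, run on a small made-up tibble (toy and its values are assumptions, not the real acsNew data), could look like this:

```r
library(dplyr)

# Hypothetical stand-in for a few acsNew columns.
toy <- tibble(
  Language = c("English", "Spanish", "English"),
  NumRooms = c(6, 7, 8)
)

# For character columns summary() reports only Length, Class and Mode;
# for numeric columns it reports quartiles and the mean.
toy %>% summary()

# A tibble built with tibble() has classes tbl_df, tbl and data.frame.
# The extra spec_tbl_df class seen for acsNew is added by readr when it reads a file.
toy %>% class()
```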
The Language variable looks like it should be a factor. Let's extract the Language column with select.
Let's look at the first 6 rows of this column
and check the class of this object.
[1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
So let's try to turn it into a factor.
Because select did not do the job (it returns a one-column tibble rather than a vector), we use the pull function to create the object acsNew.lang.pull, check its type, and view the first 6 elements.
[1] "character"
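The difference between the two functions can be sketched on a small made-up tibble. The names toy, lang_tbl and lang_vec are hypothetical and used only for illustration:

```r
library(dplyr)

# Hypothetical miniature of acsNew.
toy <- tibble(
  Language = c("English", "Spanish", "Asian Pacific"),
  NumRooms = c(6, 7, 8)
)

lang_tbl <- toy %>% select(Language)  # still a tibble, with a single column
lang_vec <- toy %>% pull(Language)    # a plain character vector

class(lang_tbl)
class(lang_vec)
head(lang_vec)
```

Functions such as as.factor expect a vector, which is why pull is the more convenient choice here.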
We create a new object, acsNew.lang.pull.fac, from the vector acsNew.lang.pull, inspect it, and overwrite the Language column in acsNew with this object using the $ operator.
[1] English Spanish Spanish English Asian Pacific
[6] English
Levels: Asian Pacific English Other Other European Spanish
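A minimal sketch of this step, again on made-up data (toy and lang_fac are hypothetical names, not the objects from the notebook):

```r
library(dplyr)

# Hypothetical miniature of acsNew with a character Language column.
toy <- tibble(Language = c("English", "Spanish", "Spanish", "Asian Pacific"))

# Convert the pulled character vector to a factor.
lang_fac <- as.factor(pull(toy, Language))
head(lang_fac)

# Overwrite the column in place with the $ operator.
toy$Language <- lang_fac

# as.factor() sorts the distinct values to form the levels.
levels(toy$Language)
```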
To save memory by avoiding intermediate objects, the operations described above can be written as a single chain of commands using pipes. Let's also look at the summary.
Acres FamilyIncome FamilyType NumBedrooms
Length:2273 Min. : 1125 Married :1831 Min. :0.000
Class :character 1st Qu.: 53700 Female Head: 327 1st Qu.:3.000
Mode :character Median : 89200 Male Head : 115 Median :3.000
Mean : 110982 Mean :3.404
3rd Qu.: 136630 3rd Qu.:4.000
Max. :1014000 Max. :8.000
NumChildren NumPeople NumRooms NumUnits
Min. :0.0000 Min. : 2.000 Min. : 1.000 Mobile home : 67
1st Qu.:0.0000 1st Qu.: 2.000 1st Qu.: 6.000 Single attached: 235
Median :0.0000 Median : 3.000 Median : 7.000 Single detached:1971
Mean :0.8847 Mean : 3.401 Mean : 7.241
3rd Qu.:2.0000 3rd Qu.: 4.000 3rd Qu.: 8.000
Max. :8.0000 Max. :12.000 Max. :21.000
NumVehicles NumWorkers OwnRent YearBuilt HouseCosts
Min. :0.000 Min. :0.000 Length:2273 Length:2273 Min. : 4
1st Qu.:2.000 1st Qu.:1.000 Class :character Class :character 1st Qu.: 670
Median :2.000 Median :2.000 Mode :character Mode :character Median :1200
Mean :2.118 Mean :1.778 Mean :1488
3rd Qu.:3.000 3rd Qu.:2.000 3rd Qu.:2000
Max. :6.000 Max. :3.000 Max. :6500
ElectricBill FoodStamp HeatingFuel Insurance
Min. : 1.0 Length:2273 Length:2273 Min. : 0.0
1st Qu.:100.0 Class :character Class :character 1st Qu.: 400.0
Median :150.0 Mode :character Mode :character Median : 720.0
Mean :176.6 Mean : 968.1
3rd Qu.:220.0 3rd Qu.:1200.0
Max. :580.0 Max. :6600.0
Language Income
Asian Pacific : 62 Below:1817
English :1786 Above: 456
Other : 39
Other European: 219
Spanish : 167
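The piped form of the conversion, sketched on a made-up tibble (toy is a hypothetical stand-in for acsNew), needs no intermediate objects:

```r
library(dplyr)

# Hypothetical miniature of acsNew.
toy <- tibble(Language = c("English", "Spanish", "Other", "English"))

# Pull the column, convert it, and assign it back in one chain.
toy$Language <- toy %>%
  pull(Language) %>%
  as.factor()

# For a factor column summary() now prints a count per level.
toy %>% summary()
```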
Let's repeat this for the remaining character variables.
# Convert each column to a factor, one at a time
acsNew$Acres <- acsNew %>%
  pull(Acres) %>%
  as.factor()
acsNew$OwnRent <- acsNew %>%
  pull(OwnRent) %>%
  as.factor()
acsNew$FoodStamp <- acsNew %>%
  pull(FoodStamp) %>%
  as.factor()
acsNew$HeatingFuel <- acsNew %>%
  pull(HeatingFuel) %>%
  as.factor()
# Check the result
acsNew %>% summary()
Acres FamilyIncome FamilyType NumBedrooms NumChildren
1-10 : 452 Min. : 1125 Married :1831 Min. :0.000 Min. :0.0000
10+ : 90 1st Qu.: 53700 Female Head: 327 1st Qu.:3.000 1st Qu.:0.0000
Sub 1:1731 Median : 89200 Male Head : 115 Median :3.000 Median :0.0000
Mean : 110982 Mean :3.404 Mean :0.8847
3rd Qu.: 136630 3rd Qu.:4.000 3rd Qu.:2.0000
Max. :1014000 Max. :8.000 Max. :8.0000
NumPeople NumRooms NumUnits NumVehicles
Min. : 2.000 Min. : 1.000 Mobile home : 67 Min. :0.000
1st Qu.: 2.000 1st Qu.: 6.000 Single attached: 235 1st Qu.:2.000
Median : 3.000 Median : 7.000 Single detached:1971 Median :2.000
Mean : 3.401 Mean : 7.241 Mean :2.118
3rd Qu.: 4.000 3rd Qu.: 8.000 3rd Qu.:3.000
Max. :12.000 Max. :21.000 Max. :6.000
NumWorkers OwnRent YearBuilt HouseCosts ElectricBill
Min. :0.000 Mortgage:2008 Length:2273 Min. : 4 Min. : 1.0
1st Qu.:1.000 Outright: 15 Class :character 1st Qu.: 670 1st Qu.:100.0
Median :2.000 Rented : 250 Mode :character Median :1200 Median :150.0
Mean :1.778 Mean :1488 Mean :176.6
3rd Qu.:2.000 3rd Qu.:2000 3rd Qu.:220.0
Max. :3.000 Max. :6500 Max. :580.0
FoodStamp HeatingFuel Insurance Language Income
No :2106 Coal : 16 Min. : 0.0 Asian Pacific : 62 Below:1817
Yes: 167 Electricity: 109 1st Qu.: 400.0 English :1786 Above: 456
Gas :1387 Median : 720.0 Other : 39
None : 4 Mean : 968.1 Other European: 219
Oil : 622 3rd Qu.:1200.0 Spanish : 167
Other : 18 Max. :6600.0
Wood : 117
Something is wrong with one value of the YearBuilt variable.
We load the DT library.
We filter the data (the filter function) and inspect the result.
We filter the data on the NumUnits variable. Because the value contains a space, we put it in single quotes ('Mobile home'), and inspect the result.
There is an error, but we will ignore it. We display the summary and introduce a new object, acsNew.ver.StepByStep, in which we will store the result of our operations.
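The filtering chunks themselves are not preserved, so the following is only a sketch under assumptions: the tibble toy and the odd value "15" are made up, and the notebook's actual problematic value is not shown in this text.

```r
library(dplyr)

# Hypothetical miniature of acsNew; "15" is an invented odd value.
toy <- tibble(
  NumUnits  = c("Mobile home", "Single detached", "Single attached", "Single detached"),
  YearBuilt = c("1950-1959", "Before 1939", "15", "1950-1959")
)

# Counting values is a quick way to spot a suspicious category.
toy %>% count(YearBuilt)

# Keep only the rows with the suspicious value.
odd_rows <- toy %>% filter(YearBuilt == "15")

# String values are always quoted; single or double quotes both work.
mobile <- toy %>% filter(NumUnits == 'Mobile home')

# With the DT package installed, DT::datatable(odd_rows) would display
# the result as an interactive table.
odd_rows
mobile
```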
Acres FamilyIncome FamilyType NumBedrooms NumChildren
1-10 : 452 Min. : 1125 Married :1831 Min. :0.000 Min. :0.0000
10+ : 90 1st Qu.: 53700 Female Head: 327 1st Qu.:3.000 1st Qu.:0.0000
Sub 1:1731 Median : 89200 Male Head : 115 Median :3.000 Median :0.0000
Mean : 110982 Mean :3.404 Mean :0.8847
3rd Qu.: 136630 3rd Qu.:4.000 3rd Qu.:2.0000
Max. :1014000 Max. :8.000 Max. :8.0000
NumPeople NumRooms NumUnits NumVehicles
Min. : 2.000 Min. : 1.000 Mobile home : 67 Min. :0.000
1st Qu.: 2.000 1st Qu.: 6.000 Single attached: 235 1st Qu.:2.000
Median : 3.000 Median : 7.000 Single detached:1971 Median :2.000
Mean : 3.401 Mean : 7.241 Mean :2.118
3rd Qu.: 4.000 3rd Qu.: 8.000 3rd Qu.:3.000
Max. :12.000 Max. :21.000 Max. :6.000
NumWorkers OwnRent YearBuilt HouseCosts ElectricBill
Min. :0.000 Mortgage:2008 Before 1939:588 Min. : 4 Min. : 1.0
1st Qu.:1.000 Outright: 15 1950-1959 :423 1st Qu.: 670 1st Qu.:100.0
Median :2.000 Rented : 250 1960-1969 :269 Median :1200 Median :150.0
Mean :1.778 1970-1979 :229 Mean :1488 Mean :176.6
3rd Qu.:2.000 1990-1999 :198 3rd Qu.:2000 3rd Qu.:220.0
Max. :3.000 1980-1989 :195 Max. :6500 Max. :580.0
(Other) :371
FoodStamp HeatingFuel Insurance Language Income
No :2106 Coal : 16 Min. : 0.0 Asian Pacific : 62 Below:1817
Yes: 167 Electricity: 109 1st Qu.: 400.0 English :1786 Above: 456
Gas :1387 Median : 720.0 Other : 39
None : 4 Mean : 968.1 Other European: 219
Oil : 622 3rd Qu.:1200.0 Spanish : 167
Other : 18 Max. :6600.0
Wood : 117
This can be done in a single chain of commands. For this we use the mutate_if function, which transforms only those variables that are of type character (the is.character argument) into factors (the list(~as.factor(.)) argument). The dot in the last argument stands for the variable to which as.factor is applied.
Acres FamilyIncome FamilyType NumBedrooms
Length:2273 Min. : 1125 Married :1831 Min. :0.000
Class :character 1st Qu.: 53700 Female Head: 327 1st Qu.:3.000
Mode :character Median : 89200 Male Head : 115 Median :3.000
Mean : 110982 Mean :3.404
3rd Qu.: 136630 3rd Qu.:4.000
Max. :1014000 Max. :8.000
NumChildren NumPeople NumRooms NumUnits
Min. :0.0000 Min. : 2.000 Min. : 1.000 Length:2273
1st Qu.:0.0000 1st Qu.: 2.000 1st Qu.: 6.000 Class :character
Median :0.0000 Median : 3.000 Median : 7.000 Mode :character
Mean :0.8847 Mean : 3.401 Mean : 7.241
3rd Qu.:2.0000 3rd Qu.: 4.000 3rd Qu.: 8.000
Max. :8.0000 Max. :12.000 Max. :21.000
NumVehicles NumWorkers OwnRent YearBuilt HouseCosts
Min. :0.000 Min. :0.000 Length:2273 Length:2273 Min. : 4
1st Qu.:2.000 1st Qu.:1.000 Class :character Class :character 1st Qu.: 670
Median :2.000 Median :2.000 Mode :character Mode :character Median :1200
Mean :2.118 Mean :1.778 Mean :1488
3rd Qu.:3.000 3rd Qu.:2.000 3rd Qu.:2000
Max. :6.000 Max. :3.000 Max. :6500
ElectricBill FoodStamp HeatingFuel Insurance
Min. : 1.0 Length:2273 Length:2273 Min. : 0.0
1st Qu.:100.0 Class :character Class :character 1st Qu.: 400.0
Median :150.0 Mode :character Mode :character Median : 720.0
Mean :176.6 Mean : 968.1
3rd Qu.:220.0 3rd Qu.:1200.0
Max. :580.0 Max. :6600.0
Language Income
Asian Pacific : 62 Below:1817
English :1786 Above: 456
Other : 39
Other European: 219
Spanish : 167
We view the summary output.
Acres FamilyIncome FamilyType NumBedrooms NumChildren
1-10 : 452 Min. : 1125 Married :1831 Min. :0.000 Min. :0.0000
10+ : 90 1st Qu.: 53700 Female Head: 327 1st Qu.:3.000 1st Qu.:0.0000
Sub 1:1731 Median : 89200 Male Head : 115 Median :3.000 Median :0.0000
Mean : 110982 Mean :3.404 Mean :0.8847
3rd Qu.: 136630 3rd Qu.:4.000 3rd Qu.:2.0000
Max. :1014000 Max. :8.000 Max. :8.0000
NumPeople NumRooms NumUnits NumVehicles
Min. : 2.000 Min. : 1.000 Mobile home : 67 Min. :0.000
1st Qu.: 2.000 1st Qu.: 6.000 Single attached: 235 1st Qu.:2.000
Median : 3.000 Median : 7.000 Single detached:1971 Median :2.000
Mean : 3.401 Mean : 7.241 Mean :2.118
3rd Qu.: 4.000 3rd Qu.: 8.000 3rd Qu.:3.000
Max. :12.000 Max. :21.000 Max. :6.000
NumWorkers OwnRent YearBuilt HouseCosts ElectricBill
Min. :0.000 Mortgage:2008 Before 1939:588 Min. : 4 Min. : 1.0
1st Qu.:1.000 Outright: 15 1950-1959 :423 1st Qu.: 670 1st Qu.:100.0
Median :2.000 Rented : 250 1960-1969 :269 Median :1200 Median :150.0
Mean :1.778 1970-1979 :229 Mean :1488 Mean :176.6
3rd Qu.:2.000 1990-1999 :198 3rd Qu.:2000 3rd Qu.:220.0
Max. :3.000 1980-1989 :195 Max. :6500 Max. :580.0
(Other) :371
FoodStamp HeatingFuel Insurance Language Income
No :2106 Coal : 16 Min. : 0.0 Asian Pacific : 62 Below:1817
Yes: 167 Electricity: 109 1st Qu.: 400.0 English :1786 Above: 456
Gas :1387 Median : 720.0 Other : 39
None : 4 Mean : 968.1 Other European: 219
Oil : 622 3rd Qu.:1200.0 Spanish : 167
Other : 18 Max. :6600.0
Wood : 117
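The one-step conversion with mutate_if can be sketched as follows. The tibble toy is made up, and the across() variant is an addition that assumes dplyr 1.0 or later; it is shown only as an equivalent spelling, not as the notebook's method.

```r
library(dplyr)

# Hypothetical miniature of acsNew with two character columns and one numeric.
toy <- tibble(
  Acres    = c("1-10", "Sub 1", "Sub 1"),
  OwnRent  = c("Mortgage", "Rented", "Mortgage"),
  NumRooms = c(6, 7, 8)
)

# The form described in the text: the predicate picks the columns,
# and the dot in the formula stands for each selected column.
toy_if <- toy %>%
  mutate_if(is.character, list(~ as.factor(.)))

# Equivalent spelling with across() (assumes dplyr >= 1.0).
toy_across <- toy %>%
  mutate(across(where(is.character), as.factor))

toy_if %>% summary()
```

Because the list passed to mutate_if holds a single unnamed function, the converted columns keep their original names instead of getting a suffix.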
We save the result, renaming the object to acsNew.ver.OneStep, to the file acsNewVerOneStep.RData.
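A minimal sketch of saving and restoring an object in the .RData format, using base R only. The object name acs_copy and the temporary file path are hypothetical; the notebook itself writes acsNewVerOneStep.RData.

```r
# Hypothetical small object standing in for the converted data.
toy <- data.frame(Language = factor(c("English", "Spanish")))

# "Renaming" is just assigning the object to a new name before saving.
acs_copy <- toy

# Write to a temporary location so the sketch leaves no files behind.
rdata_path <- file.path(tempdir(), "toy.RData")
save(acs_copy, file = rdata_path)

# Remove the object and restore it from the file.
rm(acs_copy)
loaded <- load(rdata_path)  # load() returns the names of the restored objects
loaded
```

An .RData file stores the object together with its name, so load recreates acs_copy under that name rather than returning the data.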
Finally, we remove the libraries from the environment
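The original chunk is not preserved; one common way to do this is detach. The sketch below uses the tools package, which ships with R, purely as a stand-in so that it runs without extra installs. For the packages used in this notebook the analogous call would name package:dplyr, package:readr, and so on.

```r
# Attach a package that ships with R (stand-in for dplyr, readr, DT).
library(tools)
"package:tools" %in% search()

# Remove it from the search path again.
detach("package:tools")
"package:tools" %in% search()
```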
1 J.P. Lander, R dla każdego. Zaawansowane analizy i grafika statystyczna (APN Promise, Warszawa, 2018).
2 J.P. Lander, (2020).
3 R Core Team, R: A Language and Environment for Statistical Computing (R Foundation for Statistical Computing, Vienna, Austria, 2019).
4 H. Wickham, J. Hester, and R. Francois, readr: Read Rectangular Text Data (2018).
5 H. Wickham, R. François, L. Henry, and K. Müller, dplyr: A Grammar of Data Manipulation (2020).
6 Y. Xie, J. Cheng, and X. Tan, DT: A Wrapper of the JavaScript Library ’DataTables’ (2020).
7 Y. Xie, J.J. Allaire, and G. Grolemund, R Markdown: The Definitive Guide (Chapman and Hall/CRC, Boca Raton, Florida, 2018).
8 J. Allaire, Y. Xie, J. McPherson, J. Luraschi, K. Ushey, A. Atkins, H. Wickham, J. Cheng, W. Chang, and R. Iannone, rmarkdown: Dynamic Documents for R (2020).