
R Essentials #5 - dplyr


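Data Preprocessing in R: dplyr

Today we cover data preprocessing in R. Data preprocessing, also called data munging or data wrangling, means transforming raw data into the shape we need. Whether you are removing unneeded information or preparing input for another package, it is an essential step, and it usually takes a lot of time. The dplyr package covered here is used for extracting specific data, adding new variables, computing per-group summaries, and more, so it is well worth learning.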

dplyr

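Be sure to remember these three functions in dplyr.

* filter(data, condition, condition, ...) : select the rows that satisfy the conditions

* select(data, column name, ...) : select columns by name

* mutate(data, new variable = expression of existing variables, ...) : add a new column computed from existing variables

In addition, learn to match arrange, summarise, and group_by to the task you want to perform.

* arrange() : sort rows

* summarise() : compute summaries such as the mean or standard deviation of columns

* group_by() : group rows so that later operations run per group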

Examples using dplyr

First, install dplyr. We will also install the hflights dataset package.

install.packages('dplyr')
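install.packages('hflights') # 2011 departure and arrival records for all flights leaving Houston, USA

The commands above download the packages and install them on your computer. To use a package in an R session, you must also load it with library(package name).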

library(dplyr)
library(hflights)
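Since install.packages() only needs to run once per machine while library() runs in every session, a script can check what is missing before installing. The following is a minimal sketch; missing_packages is a hypothetical helper name introduced here for illustration, not a function from dplyr or base R.

```r
# Hypothetical helper (an assumption for this sketch):
# returns the names of the packages that are not installed yet.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("dplyr", "hflights"))
if (length(to_install) > 0) {
  message("Not installed yet: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to actually install
}
```

The install call is left commented out so that running the sketch never downloads anything by accident.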

head(hflights)

##      Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
## 5424 2011     1          1         6    1400    1500            AA
## 5425 2011     1          2         7    1401    1501            AA
## 5426 2011     1          3         1    1352    1502            AA
## 5427 2011     1          4         2    1403    1513            AA
## 5428 2011     1          5         3    1405    1507            AA
## 5429 2011     1          6         4    1359    1503            AA
##      FlightNum TailNum ActualElapsedTime AirTime ArrDelay DepDelay Origin
## 5424       428  N576AA                60      40      -10        0    IAH
## 5425       428  N557AA                60      45       -9        1    IAH
## 5426       428  N541AA                70      48       -8       -8    IAH
## 5427       428  N403AA                70      39        3        3    IAH
## 5428       428  N492AA                62      44       -3        5    IAH
## 5429       428  N262AA                64      45       -7       -1    IAH
##      Dest Distance TaxiIn TaxiOut Cancelled CancellationCode Diverted
## 5424  DFW      224      7      13         0                         0
## 5425  DFW      224      6       9         0                         0
## 5426  DFW      224      5      17         0                         0
## 5427  DFW      224      9      22         0                         0
## 5428  DFW      224      9       9         0                         0
## 5429  DFW      224      6      13         0                         0

str(hflights)

## 'data.frame':    227496 obs. of  21 variables:
##  $ Year             : int  2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
##  $ Month            : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ DayofMonth       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ DayOfWeek        : int  6 7 1 2 3 4 5 6 7 1 ...
##  $ DepTime          : int  1400 1401 1352 1403 1405 1359 1359 1355 1443 1443 ...
##  $ ArrTime          : int  1500 1501 1502 1513 1507 1503 1509 1454 1554 1553 ...
##  $ UniqueCarrier    : chr  "AA" "AA" "AA" "AA" ...
##  $ FlightNum        : int  428 428 428 428 428 428 428 428 428 428 ...
##  $ TailNum          : chr  "N576AA" "N557AA" "N541AA" "N403AA" ...
##  $ ActualElapsedTime: int  60 60 70 70 62 64 70 59 71 70 ...
##  $ AirTime          : int  40 45 48 39 44 45 43 40 41 45 ...
##  $ ArrDelay         : int  -10 -9 -8 3 -3 -7 -1 -16 44 43 ...
##  $ DepDelay         : int  0 1 -8 3 5 -1 -1 -5 43 43 ...
##  $ Origin           : chr  "IAH" "IAH" "IAH" "IAH" ...
##  $ Dest             : chr  "DFW" "DFW" "DFW" "DFW" ...
##  $ Distance         : int  224 224 224 224 224 224 224 224 224 224 ...
##  $ TaxiIn           : int  7 6 5 9 9 6 12 7 8 6 ...
##  $ TaxiOut          : int  13 9 17 22 9 13 15 12 22 19 ...
##  $ Cancelled        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ CancellationCode : chr  "" "" "" "" ...
##  $ Diverted         : int  0 0 0 0 0 0 0 0 0 0 ...

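Before we start: dplyr - tbl_df() / as_tibble()

This is not a special feature so much as a convenience. If you accidentally print a large data frame, every row is written to the console and it takes a long time. Converting the data to a tibble prevents this, because only the first few rows are printed. Note that tbl_df() is deprecated as of dplyr 1.0.0, so the code below uses its replacement, as_tibble().

hflights_df <- as_tibble(hflights) # older dplyr versions: tbl_df(hflights)
hflights_df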

## # A tibble: 227,496 x 21
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
## *  <int> <int>      <int>     <int>   <int>   <int>         <chr>
## 1   2011     1          1         6    1400    1500            AA
## 2   2011     1          2         7    1401    1501            AA
## 3   2011     1          3         1    1352    1502            AA
## 4   2011     1          4         2    1403    1513            AA
## 5   2011     1          5         3    1405    1507            AA
## 6   2011     1          6         4    1359    1503            AA
## 7   2011     1          7         5    1359    1509            AA
## 8   2011     1          8         6    1355    1454            AA
## 9   2011     1          9         7    1443    1554            AA
## 10  2011     1         10         1    1443    1553            AA
## # ... with 227,486 more rows, and 14 more variables: FlightNum <int>,
## #   TailNum <chr>, ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## #   DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## #   TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## #   Diverted <int>

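dplyr - 1: selecting rows with filter()

For example, the hflights_df dataset has the variables Month, DayofMonth, and Distance. To select only the rows where Month is 2, DayofMonth is 1, and Distance is 200 or less, use filter.

The syntax is filter(data, condition1, condition2, ...): list the data first, then the conditions. Conditions separated by commas are combined with AND.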

filter(hflights_df, Month==2, DayofMonth ==1, Distance <=200)

## # A tibble: 39 x 21
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
##    <int> <int>      <int>     <int>   <int>   <int>         <chr>
## 1   2011     2          1         2    1559    1707            CO
## 2   2011     2          1         2    1533    1624            CO
## 3   2011     2          1         2    1303    1407            CO
## 4   2011     2          1         2    1322    1436            CO
## 5   2011     2          1         2    1909    2020            CO
## 6   2011     2          1         2    1455    1556            CO
## 7   2011     2          1         2    1939    2033            CO
## 8   2011     2          1         2    2101    2207            CO
## 9   2011     2          1         2    2100    2151            CO
## 10  2011     2          1         2    1424    1521            CO
## # ... with 29 more rows, and 14 more variables: FlightNum <int>,
## #   TailNum <chr>, ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## #   DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## #   TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## #   Diverted <int>

filter() can also apply several conditions to the same column. For example, to select the rows for both January and February, use the or (|) operator as shown below.

filter(hflights_df, Month == 1 | Month == 2)

## # A tibble: 36,038 x 21
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
##    <int> <int>      <int>     <int>   <int>   <int>         <chr>
## 1   2011     1          1         6    1400    1500            AA
## 2   2011     1          2         7    1401    1501            AA
## 3   2011     1          3         1    1352    1502            AA
## 4   2011     1          4         2    1403    1513            AA
## 5   2011     1          5         3    1405    1507            AA
## 6   2011     1          6         4    1359    1503            AA
## 7   2011     1          7         5    1359    1509            AA
## 8   2011     1          8         6    1355    1454            AA
## 9   2011     1          9         7    1443    1554            AA
## 10  2011     1         10         1    1443    1553            AA
## # ... with 36,028 more rows, and 14 more variables: FlightNum <int>,
## #   TailNum <chr>, ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## #   DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## #   TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## #   Diverted <int>
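When the same column is compared against several values, the %in% operator is a common alternative to chaining | conditions. Here is a small sketch on toy data; the flights data frame is made up for illustration and merely stands in for hflights_df.

```r
library(dplyr)

# Toy data standing in for hflights_df (assumed for illustration only).
flights <- data.frame(Month = c(1, 2, 3, 1, 12),
                      Distance = c(224, 150, 900, 300, 120))

# Two ways to keep the January and February rows.
with_or <- filter(flights, Month == 1 | Month == 2)
with_in <- filter(flights, Month %in% c(1, 2))

# Compare the two results.
all.equal(with_or, with_in)
```

The %in% form tends to stay readable as the list of accepted values grows.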

Note: dplyr's filter is similar to subset, a base R function.

subset(hflights_df, Month == 2 & DayofMonth == 1 & Distance <=200)

## # A tibble: 39 x 21
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
##    <int> <int>      <int>     <int>   <int>   <int>         <chr>
## 1   2011     2          1         2    1559    1707            CO
## 2   2011     2          1         2    1533    1624            CO
## 3   2011     2          1         2    1303    1407            CO
## 4   2011     2          1         2    1322    1436            CO
## 5   2011     2          1         2    1909    2020            CO
## 6   2011     2          1         2    1455    1556            CO
## 7   2011     2          1         2    1939    2033            CO
## 8   2011     2          1         2    2101    2207            CO
## 9   2011     2          1         2    2100    2151            CO
## 10  2011     2          1         2    1424    1521            CO
## # ... with 29 more rows, and 14 more variables: FlightNum <int>,
## #   TailNum <chr>, ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## #   DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## #   TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## #   Diverted <int>

Quiz

In the iris data, which rows have Sepal.Length of 5 or less and Species equal to setosa?

##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1           4.9         3.0          1.4         0.2  setosa
## 2           4.7         3.2          1.3         0.2  setosa
## 3           4.6         3.1          1.5         0.2  setosa
## 4           5.0         3.6          1.4         0.2  setosa
## 5           4.6         3.4          1.4         0.3  setosa
## 6           5.0         3.4          1.5         0.2  setosa
## 7           4.4         2.9          1.4         0.2  setosa
## 8           4.9         3.1          1.5         0.1  setosa
## 9           4.8         3.4          1.6         0.2  setosa
## 10          4.8         3.0          1.4         0.1  setosa
## 11          4.3         3.0          1.1         0.1  setosa
## 12          4.6         3.6          1.0         0.2  setosa
## 13          4.8         3.4          1.9         0.2  setosa
## 14          5.0         3.0          1.6         0.2  setosa
## 15          5.0         3.4          1.6         0.4  setosa
## 16          4.7         3.2          1.6         0.2  setosa
## 17          4.8         3.1          1.6         0.2  setosa
## 18          4.9         3.1          1.5         0.2  setosa
## 19          5.0         3.2          1.2         0.2  setosa
## 20          4.9         3.6          1.4         0.1  setosa
## 21          4.4         3.0          1.3         0.2  setosa
## 22          5.0         3.5          1.3         0.3  setosa
## 23          4.5         2.3          1.3         0.3  setosa
## 24          4.4         3.2          1.3         0.2  setosa
## 25          5.0         3.5          1.6         0.6  setosa
## 26          4.8         3.0          1.4         0.3  setosa
## 27          4.6         3.2          1.4         0.2  setosa
## 28          5.0         3.3          1.4         0.2  setosa
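One possible solution, sketched with the filter() syntax introduced above (the variable name quiz1 is just a placeholder):

```r
library(dplyr)

# Comma-separated conditions are combined with AND.
quiz1 <- filter(iris, Sepal.Length <= 5, Species == "setosa")

nrow(quiz1)  # number of matching rows
quiz1
```

The row count should match the 28 rows listed in the output above.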

dplyr - 2: extracting columns with select()

Use select() when you want specific columns.

select(hflights_df, Year, Month) # select Year and Month
select(hflights_df, Year:ArrTime) # select every column from Year through ArrTime
select(hflights_df, -(Year:ArrTime)) # select everything except Year through ArrTime
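Beyond listing names and ranges, select() also accepts helper functions that match columns by name pattern. The sketch below uses the built-in iris data rather than hflights so that it runs without extra packages; the variable names are placeholders.

```r
library(dplyr)

# Columns whose names start with "Sepal".
sepal_cols <- select(iris, starts_with("Sepal"))

# Columns whose names end with "Width".
width_cols <- select(iris, ends_with("Width"))

# Everything except Species.
no_species <- select(iris, -Species)

names(sepal_cols)
names(width_cols)
names(no_species)
```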

dplyr - 3: adding columns with mutate()

For example, suppose we want to compute ArrDelay - DepDelay and add the result as a new column named gain.

mutate(hflights_df, gain = ArrDelay - DepDelay)

## # A tibble: 227,496 x 22
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
##    <int> <int>      <int>     <int>   <int>   <int>         <chr>
## 1   2011     1          1         6    1400    1500            AA
## 2   2011     1          2         7    1401    1501            AA
## 3   2011     1          3         1    1352    1502            AA
## 4   2011     1          4         2    1403    1513            AA
## 5   2011     1          5         3    1405    1507            AA
## 6   2011     1          6         4    1359    1503            AA
## 7   2011     1          7         5    1359    1509            AA
## 8   2011     1          8         6    1355    1454            AA
## 9   2011     1          9         7    1443    1554            AA
## 10  2011     1         10         1    1443    1553            AA
## # ... with 227,486 more rows, and 15 more variables: FlightNum <int>,
## #   TailNum <chr>, ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## #   DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## #   TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## #   Diverted <int>, gain <int>

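Another feature of mutate is that a column created in a call can be used right away within the same call. To compute gain_per_hour from the newly created gain, a single mutate call is enough: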

mutate(hflights_df, gain = ArrDelay - DepDelay,
       gain_per_hour = gain/(AirTime/60))

## # A tibble: 227,496 x 23
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
##    <int> <int>      <int>     <int>   <int>   <int>         <chr>
## 1   2011     1          1         6    1400    1500            AA
## 2   2011     1          2         7    1401    1501            AA
## 3   2011     1          3         1    1352    1502            AA
## 4   2011     1          4         2    1403    1513            AA
## 5   2011     1          5         3    1405    1507            AA
## 6   2011     1          6         4    1359    1503            AA
## 7   2011     1          7         5    1359    1509            AA
## 8   2011     1          8         6    1355    1454            AA
## 9   2011     1          9         7    1443    1554            AA
## 10  2011     1         10         1    1443    1553            AA
## # ... with 227,486 more rows, and 16 more variables: FlightNum <int>,
## #   TailNum <chr>, ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## #   DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## #   TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## #   Diverted <int>, gain <int>, gain_per_hour <dbl>

dplyr - 4: sorting with arrange()

Syntax: arrange(data, columns to sort by)

arrange(hflights_df, Month, Year) # ascending sort by Month first, then by Year

## # A tibble: 227,496 x 21
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
##    <int> <int>      <int>     <int>   <int>   <int>         <chr>
## 1   2011     1          1         6    1400    1500            AA
## 2   2011     1          2         7    1401    1501            AA
## 3   2011     1          3         1    1352    1502            AA
## 4   2011     1          4         2    1403    1513            AA
## 5   2011     1          5         3    1405    1507            AA
## 6   2011     1          6         4    1359    1503            AA
## 7   2011     1          7         5    1359    1509            AA
## 8   2011     1          8         6    1355    1454            AA
## 9   2011     1          9         7    1443    1554            AA
## 10  2011     1         10         1    1443    1553            AA
## # ... with 227,486 more rows, and 14 more variables: FlightNum <int>,
## #   TailNum <chr>, ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## #   DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## #   TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## #   Diverted <int>

arrange(hflights_df, desc(Month)) # descending sort

## # A tibble: 227,496 x 21
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
##    <int> <int>      <int>     <int>   <int>   <int>         <chr>
## 1   2011    12         15         4    2113    2217            AA
## 2   2011    12         16         5    2004    2128            AA
## 3   2011    12         18         7    2007    2113            AA
## 4   2011    12         19         1    2108    2223            AA
## 5   2011    12         20         2    2008    2107            AA
## 6   2011    12         21         3    2025    2124            AA
## 7   2011    12         22         4    2021    2118            AA
## 8   2011    12         23         5    2015    2118            AA
## 9   2011    12         26         1    2013    2118            AA
## 10  2011    12         27         2    2007    2123            AA
## # ... with 227,486 more rows, and 14 more variables: FlightNum <int>,
## #   TailNum <chr>, ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## #   DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## #   TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## #   Diverted <int>

dplyr - 5: summarising with summarise()

To compute the mean of the DepDelay and Cancelled columns of hflights_df:

summarise(hflights_df, 
          Delay = mean(DepDelay, na.rm = TRUE),
          Cancel_rate = mean(Cancelled))

## # A tibble: 1 x 2
##      Delay Cancel_rate
##      <dbl>       <dbl>
## 1 9.444951  0.01306836
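The na.rm = TRUE argument matters because mean() returns NA as soon as its input contains a missing value. A tiny sketch with made-up numbers:

```r
# Made-up delay values with one missing entry (for illustration only).
delays <- c(10, NA, 20)

with_na <- mean(delays)                   # NA: the missing value propagates
without_na <- mean(delays, na.rm = TRUE)  # mean of 10 and 20 only

with_na
without_na
```

Cancelled contains no missing values in the str() output shown earlier, which is presumably why the example omits na.rm for that column.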

dplyr - 6: group_by()

TailNum is the unique identifier of each aircraft. Suppose we want the number of flights, the mean distance, and the mean delay for each aircraft. Combining group_by with summarise gives the result.

by_tailnum <- group_by(hflights_df, TailNum)
by_tailnum

## Source: local data frame [227,496 x 21]
## Groups: TailNum [3,320]
## 
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
## *  <int> <int>      <int>     <int>   <int>   <int>         <chr>
## 1   2011     1          1         6    1400    1500            AA
## 2   2011     1          2         7    1401    1501            AA
## 3   2011     1          3         1    1352    1502            AA
## 4   2011     1          4         2    1403    1513            AA
## 5   2011     1          5         3    1405    1507            AA
## 6   2011     1          6         4    1359    1503            AA
## 7   2011     1          7         5    1359    1509            AA
## 8   2011     1          8         6    1355    1454            AA
## 9   2011     1          9         7    1443    1554            AA
## 10  2011     1         10         1    1443    1553            AA
## # ... with 227,486 more rows, and 14 more variables: FlightNum <int>,
## #   TailNum <chr>, ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## #   DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## #   TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## #   Diverted <int>

delay <- summarise(by_tailnum,
                   count = n(), # n(): number of rows in each group
                   dist = mean(Distance, na.rm = TRUE),
                   delay = mean(ArrDelay, na.rm = TRUE))
delay

## # A tibble: 3,320 x 4
##    TailNum count      dist     delay
##      <chr> <int>     <dbl>     <dbl>
## 1            795  938.7157       NaN
## 2   N0EGMQ    40 1095.2500  1.918919
## 3   N10156   317  801.7192  8.199357
## 4   N10575    94  631.5319 18.148936
## 5   N11106   308  774.9805 10.101639
## 6   N11107   345  768.1130  8.052786
## 7   N11109   331  772.4532 10.280000
## 8   N11113   282  772.8298  4.057143
## 9   N11119   130  790.2385  7.396825
## 10  N11121   333  774.8018  6.740854
## # ... with 3,310 more rows
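The first row of this output is a group whose TailNum is an empty string and whose mean delay is NaN. One way to handle it, assuming those rows are not needed for the analysis, is to filter them out before grouping. The toy data frame below is made up for illustration and only mimics the relevant columns.

```r
library(dplyr)

# Toy data (an assumption for this sketch): two flights without a tail
# number and without an arrival delay, two flights for aircraft "N1".
toy <- data.frame(TailNum = c("", "", "N1", "N1"),
                  ArrDelay = c(NA, NA, 5, 7))

# Drop rows with an empty tail number, then group and summarise.
known <- filter(toy, TailNum != "")
clean <- summarise(group_by(known, TailNum),
                   count = n(),
                   delay = mean(ArrDelay, na.rm = TRUE))
clean
```

Whether dropping these rows is appropriate depends on the question being asked, so treat this as one option rather than a rule.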

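Good to know

mutate_each() / summarise_each()

#dplyr - mutate_each()
#The _each functions apply summary functions to several variables at once; mutate_each() adds the results as columns.
#Note: mutate_each(), summarise_each(), and funs() are deprecated in current dplyr; across() is the documented replacement.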
summarized <- mutate_each(by_tailnum, funs(min, max),
                          Distance)

#dplyr - summarise_each()
summarise_each(by_tailnum, funs(min, max, median), 
               Distance)
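In dplyr 1.0.0 and later, the same kind of computation is usually written with across(). The sketch below is a rough equivalent on the built-in iris data, grouped by Species instead of TailNum so that it runs without hflights; the variable names are placeholders.

```r
library(dplyr)  # across() needs dplyr 1.0.0 or later

by_species <- group_by(iris, Species)

# Counterpart of summarise_each(): one row per group, one column per function.
ranges <- summarise(by_species,
                    across(Sepal.Length,
                           list(min = min, max = max, median = median)))
ranges

# Counterpart of mutate_each(): the group-wise results are added as new columns.
with_range <- mutate(by_species,
                     across(Sepal.Length, list(min = min, max = max)))
names(with_range)
```

By default across() names the new columns as column_function, for example Sepal.Length_min.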

Quiz

Steps 1) through 4) build on each other. Solve them in order.

1) From the iris data, keep only the setosa species,
2) extract the Sepal.Length and Sepal.Width columns,
3) add a new variable Sepal_sum holding the sum of the two, along with Sepal_square, the square of Sepal_sum, and
4) sort in descending order of Sepal_sum.

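Using the chain operator %>%

dplyr offers a particularly handy feature: chaining with the %>% operator (RStudio shortcut: Ctrl+Shift+M). The operator comes from the magrittr package and is re-exported by dplyr. It lets you write intuitive, concise code. Solving the quiz above took four steps, but with %>% they can be written as a single chained expression: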

data <- iris %>% 
  filter(Species == 'setosa') %>% 
  select(Sepal.Length, Sepal.Width) %>% 
  mutate(Sepal_sum = Sepal.Length + Sepal.Width, 
         Sepal_square = Sepal_sum^2) %>% 
  arrange(desc(Sepal_sum))
data

##    Sepal.Length Sepal.Width Sepal_sum Sepal_square
## 1           5.7         4.4      10.1       102.01
## 2           5.8         4.0       9.8        96.04
## 3           5.5         4.2       9.7        94.09
## 4           5.7         3.8       9.5        90.25
## 5           5.4         3.9       9.3        86.49
## 6           5.4         3.9       9.3        86.49
## 7           5.2         4.1       9.3        86.49
## 8           5.4         3.7       9.1        82.81
## 9           5.5         3.5       9.0        81.00
## 10          5.3         3.7       9.0        81.00
## 11          5.1         3.8       8.9        79.21
## 12          5.1         3.8       8.9        79.21
## 13          5.1         3.8       8.9        79.21
## 14          5.4         3.4       8.8        77.44
## 15          5.1         3.7       8.8        77.44
## 16          5.4         3.4       8.8        77.44
## 17          5.2         3.5       8.7        75.69
## 18          5.1         3.5       8.6        73.96
## 19          5.0         3.6       8.6        73.96
## 20          5.1         3.5       8.6        73.96
## 21          5.2         3.4       8.6        73.96
## 22          4.9         3.6       8.5        72.25
## 23          5.1         3.4       8.5        72.25
## 24          5.0         3.5       8.5        72.25
## 25          5.0         3.5       8.5        72.25
## 26          5.0         3.4       8.4        70.56
## 27          5.0         3.4       8.4        70.56
## 28          5.1         3.3       8.4        70.56
## 29          5.0         3.3       8.3        68.89
## 30          4.8         3.4       8.2        67.24
## 31          4.6         3.6       8.2        67.24
## 32          4.8         3.4       8.2        67.24
## 33          5.0         3.2       8.2        67.24
## 34          4.6         3.4       8.0        64.00
## 35          4.9         3.1       8.0        64.00
## 36          5.0         3.0       8.0        64.00
## 37          4.9         3.1       8.0        64.00
## 38          4.9         3.0       7.9        62.41
## 39          4.7         3.2       7.9        62.41
## 40          4.7         3.2       7.9        62.41
## 41          4.8         3.1       7.9        62.41
## 42          4.8         3.0       7.8        60.84
## 43          4.8         3.0       7.8        60.84
## 44          4.6         3.2       7.8        60.84
## 45          4.6         3.1       7.7        59.29
## 46          4.4         3.2       7.6        57.76
## 47          4.4         3.0       7.4        54.76
## 48          4.4         2.9       7.3        53.29
## 49          4.3         3.0       7.3        53.29
## 50          4.5         2.3       6.8        46.24
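To see what the chain does, it helps to compare it with the nested form: x %>% f(y) passes x as the first argument, so it is equivalent to f(x, y). A short sketch on iris (the variable names are placeholders):

```r
library(dplyr)

# Nested calls: read from the inside out.
nested <- arrange(filter(iris, Species == "setosa"), desc(Sepal.Length))

# Chained calls: read from top to bottom.
piped <- iris %>%
  filter(Species == "setosa") %>%
  arrange(desc(Sepal.Length))

# Compare the two results.
all.equal(nested, piped)
```

Both forms compute the same thing; the chained version simply keeps the steps in the order they are performed.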

Quiz

Let's practice dplyr with the Cars93 data.

install.packages("MASS")

library(MASS)

## 
## Attaching package: 'MASS'

## The following object is masked from 'package:dplyr':
## 
##     select
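The message above is worth noticing: because MASS is attached after dplyr, a plain select() call now refers to MASS::select, and dplyr-style column selection will fail. One way around this, sketched below, is to qualify the call with the package name (cars_small is a placeholder name).

```r
library(dplyr)
library(MASS)  # attached after dplyr, so MASS::select masks dplyr::select

# Qualifying the call with the package name avoids the ambiguity.
cars_small <- dplyr::select(Cars93, Manufacturer, Model, Price)
head(cars_small)
```

Attaching MASS before dplyr, or using MASS::Cars93 without attaching MASS at all, are other options.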

str(Cars93)

## 'data.frame':    93 obs. of  27 variables:
##  $ Manufacturer      : Factor w/ 32 levels "Acura","Audi",..: 1 1 2 2 3 4 4 4 4 5 ...
##  $ Model             : Factor w/ 93 levels "100","190E","240",..: 49 56 9 1 6 24 54 74 73 35 ...
##  $ Type              : Factor w/ 6 levels "Compact","Large",..: 4 3 1 3 3 3 2 2 3 2 ...
##  $ Min.Price         : num  12.9 29.2 25.9 30.8 23.7 14.2 19.9 22.6 26.3 33 ...
##  $ Price             : num  15.9 33.9 29.1 37.7 30 15.7 20.8 23.7 26.3 34.7 ...
##  $ Max.Price         : num  18.8 38.7 32.3 44.6 36.2 17.3 21.7 24.9 26.3 36.3 ...
##  $ MPG.city          : int  25 18 20 19 22 22 19 16 19 16 ...
##  $ MPG.highway       : int  31 25 26 26 30 31 28 25 27 25 ...
##  $ AirBags           : Factor w/ 3 levels "Driver & Passenger",..: 3 1 2 1 2 2 2 2 2 2 ...
##  $ DriveTrain        : Factor w/ 3 levels "4WD","Front",..: 2 2 2 2 3 2 2 3 2 2 ...
##  $ Cylinders         : Factor w/ 6 levels "3","4","5","6",..: 2 4 4 4 2 2 4 4 4 5 ...
##  $ EngineSize        : num  1.8 3.2 2.8 2.8 3.5 2.2 3.8 5.7 3.8 4.9 ...
##  $ Horsepower        : int  140 200 172 172 208 110 170 180 170 200 ...
##  $ RPM               : int  6300 5500 5500 5500 5700 5200 4800 4000 4800 4100 ...
##  $ Rev.per.mile      : int  2890 2335 2280 2535 2545 2565 1570 1320 1690 1510 ...
##  $ Man.trans.avail   : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 1 1 1 1 ...
##  $ Fuel.tank.capacity: num  13.2 18 16.9 21.1 21.1 16.4 18 23 18.8 18 ...
##  $ Passengers        : int  5 5 5 6 4 6 6 6 5 6 ...
##  $ Length            : int  177 195 180 193 186 189 200 216 198 206 ...
##  $ Wheelbase         : int  102 115 102 106 109 105 111 116 108 114 ...
##  $ Width             : int  68 71 67 70 69 69 74 78 73 73 ...
##  $ Turn.circle       : int  37 38 37 37 39 41 42 45 41 43 ...
##  $ Rear.seat.room    : num  26.5 30 28 31 27 28 30.5 30.5 26.5 35 ...
##  $ Luggage.room      : int  11 15 14 17 13 16 17 21 14 18 ...
##  $ Weight            : int  2705 3560 3375 3405 3640 2880 3470 4105 3495 3620 ...
##  $ Origin            : Factor w/ 2 levels "USA","non-USA": 2 2 2 2 2 1 1 1 1 1 ...
##  $ Make              : Factor w/ 93 levels "Acura Integra",..: 1 2 4 3 5 6 7 9 8 10 ...

head(Cars93)

##   Manufacturer   Model    Type Min.Price Price Max.Price MPG.city
## 1        Acura Integra   Small      12.9  15.9      18.8       25
## 2        Acura  Legend Midsize      29.2  33.9      38.7       18
## 3         Audi      90 Compact      25.9  29.1      32.3       20
## 4         Audi     100 Midsize      30.8  37.7      44.6       19
## 5          BMW    535i Midsize      23.7  30.0      36.2       22
## 6        Buick Century Midsize      14.2  15.7      17.3       22
##   MPG.highway            AirBags DriveTrain Cylinders EngineSize
## 1          31               None      Front         4        1.8
## 2          25 Driver & Passenger      Front         6        3.2
## 3          26        Driver only      Front         6        2.8
## 4          26 Driver & Passenger      Front         6        2.8
## 5          30        Driver only       Rear         4        3.5
## 6          31        Driver only      Front         4        2.2
##   Horsepower  RPM Rev.per.mile Man.trans.avail Fuel.tank.capacity
## 1        140 6300         2890             Yes               13.2
## 2        200 5500         2335             Yes               18.0
## 3        172 5500         2280             Yes               16.9
## 4        172 5500         2535             Yes               21.1
## 5        208 5700         2545             Yes               21.1
## 6        110 5200         2565              No               16.4
##   Passengers Length Wheelbase Width Turn.circle Rear.seat.room
## 1          5    177       102    68          37           26.5
## 2          5    195       115    71          38           30.0
## 3          5    180       102    67          37           28.0
## 4          6    193       106    70          37           31.0
## 5          4    186       109    69          39           27.0
## 6          6    189       105    69          41           28.0
##   Luggage.room Weight  Origin          Make
## 1           11   2705 non-USA Acura Integra
## 2           15   3560 non-USA  Acura Legend
## 3           14   3375 non-USA       Audi 90
## 4           17   3405 non-USA      Audi 100
## 5           13   3640 non-USA      BMW 535i
## 6           16   2880     USA Buick Century

Quiz

Using the Cars93 data, select the rows whose Manufacturer is Chevrolet or Ford, then compute the number of cars and the mean and standard deviation of Price for each of the two manufacturers.

## # A tibble: 2 x 4
##   Manufacturer count Price.mean Price.std
##         <fctr> <int>      <dbl>     <dbl>
## 1    Chevrolet     8    18.1875  8.304463
## 2         Ford     8    14.9625  5.114667
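A possible solution, sketched as one chain. The column names count, Price.mean, and Price.std follow the expected output above; the variable name by_maker is a placeholder.

```r
library(dplyr)
data(Cars93, package = "MASS")  # loads the data without attaching MASS

by_maker <- Cars93 %>%
  filter(Manufacturer %in% c("Chevrolet", "Ford")) %>%
  group_by(Manufacturer) %>%
  summarise(count = n(),
            Price.mean = mean(Price),
            Price.std = sd(Price))
by_maker
```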

Compute the difference Max.Price (price of the premium version) - Min.Price (price of the basic version) and add it as a column named Price.Diff, then sort in descending order of Price.Diff. Finally, select only the Manufacturer, Model, Type, and Price.Diff columns.

##    Manufacturer    Model    Type Price.Diff
## 1 Mercedes-Benz     300E Midsize       36.2
## 2          Saab      900 Compact       16.8
## 3         Dodge  Stealth  Sporty       14.6
## 4          Audi      100 Midsize       13.8
## 5           BMW     535i Midsize       12.5
## 6          Ford Aerostar     Van       10.8
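A possible solution for this quiz, again as a sketch. It calls dplyr::select explicitly because select is masked once MASS has been attached; price_diff is a placeholder name.

```r
library(dplyr)
data(Cars93, package = "MASS")  # loads the data without attaching MASS

price_diff <- Cars93 %>%
  mutate(Price.Diff = Max.Price - Min.Price) %>%
  arrange(desc(Price.Diff)) %>%
  dplyr::select(Manufacturer, Model, Type, Price.Diff)

head(price_diff)  # the expected output above shows the first six rows
```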
