데이터 정리 (tidyverse)

R

by mignong 2022. 6. 13. 11:37

728x90

tidyverse 패키지를 활용한 데이터 정리

library(tidyverse)
# 데이터 정리하는 것은 tidyr 패지키에 기능이 구현되어 있다.

분석에 적합한 형태의 tidy한 데이터의 특징

1) variables : 각 변수는 하나의 열(컬럼)로 표현되어야 함
2) observations : 각 관측치는 하나의 행으로 표현되어야 함
3) values : 각 값은 하나의 셀(cell)로 표현되어야 함

tidy한 데이터로 정리하기
table2,3,4를 table1처럼 잘 정리(tidy)된 데이터로 바꾸기

table1

> table1
# A tibble: 6 x 4
  country      year  cases population
  <chr>       <int>  <int>      <int>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

table2 cases와 populatain이 count에 혼재된 데이터

> table2
# A tibble: 12 x 4
   country      year type            count
   <chr>       <int> <chr>           <int>
 1 Afghanistan  1999 cases             745
 2 Afghanistan  1999 population   19987071
 3 Afghanistan  2000 cases            2666
 4 Afghanistan  2000 population   20595360
 5 Brazil       1999 cases           37737
 6 Brazil       1999 population  172006362
 7 Brazil       2000 cases           80488
 8 Brazil       2000 population  174504898
 9 China        1999 cases          212258
10 China        1999 population 1272915272
11 China        2000 cases          213766
12 China        2000 population 1280428583

# tidy한 데이터로 정리하기
> table2 %>% pivot_wider(names_from = type, values_from = count)

table3 cases와 population이 rate에 혼재된 데이터

> table3
# A tibble: 6 x 3
  country      year rate             
* <chr>       <int> <chr>            
1 Afghanistan  1999 745/19987071     
2 Afghanistan  2000 2666/20595360    
3 Brazil       1999 37737/172006362  
4 Brazil       2000 80488/174504898  
5 China        1999 212258/1272915272
6 China        2000 213766/1280428583

# tidy한 데이터로 정리하기
> table3 %>% separate(rate, into = c("cases", "population"))

table4a 변수 지정 없이 cases 값만 삽입되고 population 값은 누락된 데이터
table4b 변수 지정 없이 population 값만 삽입되고 cases 값은 누락된 데이터

> table4a
# A tibble: 3 x 3
  country     `1999` `2000`
* <chr>        <int>  <int>
1 Afghanistan    745   2666
2 Brazil       37737  80488
3 China       212258 213766

> table4b
# A tibble: 3 x 3
  country         `1999`     `2000`
* <chr>            <int>      <int>
1 Afghanistan   19987071   20595360
2 Brazil       172006362  174504898
3 China       1272915272 1280428583

# tidy한 데이터로 정리하기

> table4a %>% pivot_longer(c('1999','2000'), names_to="year", values_to="cases")
# A tibble: 6 x 3
  country     year   cases
  <chr>       <chr>  <int>
1 Afghanistan 1999     745
2 Afghanistan 2000    2666
3 Brazil      1999   37737
4 Brazil      2000   80488
5 China       1999  212258
6 China       2000  213766

> table4b %>% pivot_longer(c('1999','2000'), names_to="year", values_to="population")
# A tibble: 6 x 3
  country     year  population
  <chr>       <chr>      <int>
1 Afghanistan 1999    19987071
2 Afghanistan 2000    20595360
3 Brazil      1999   172006362
4 Brazil      2000   174504898
5 China       1999  1272915272
6 China       2000  1280428583

# 변환된 table4a, table4b 합치기

> tidy4a <- table4a %>% pivot_longer(c('1999','2000'), names_to="year", values_to="cases")
> tidy4b <- table4b %>% pivot_longer(c('1999','2000'), names_to="year", values_to="population")
> left_join(tidy4a, tidy4b)
Joining, by = c("country", "year")
# A tibble: 6 x 4
  country     year   cases population
  <chr>       <chr>  <int>      <int>
1 Afghanistan 1999     745   19987071
2 Afghanistan 2000    2666   20595360
3 Brazil      1999   37737  172006362
4 Brazil      2000   80488  174504898
5 China       1999  212258 1272915272
6 China       2000  213766 1280428583

데이터 형태 변환하는 함수

pivot_longer() 데이터 형태를 길게 변환하는 함수
left_join() 왼쪽 테이블의 key를 기준으로 병합하는 함수
pivot_wider() 데이터 형태를 넓게 변환하는 함수
separate() 혼재된 데이터를 분리할 때 사용하는 함수. 옵션 미지정 시 알파벳이 아닌 기호를 기준으로 값을 분리
unite() 지정된 컬럼들을 합쳐서 새로운 컬럼 생성할 때 사용하는 함수

# table5 table3에서 year를 century와 year로 분할한 데이터
# table5 <- table3 %>% separate(year, into = c("century", "year"), sep=2)
> table5
# A tibble: 6 x 4
  country     century year  rate             
* <chr>       <chr>   <chr> <chr>            
1 Afghanistan 19      99    745/19987071     
2 Afghanistan 20      00    2666/20595360    
3 Brazil      19      99    37737/172006362  
4 Brazil      20      00    80488/174504898  
5 China       19      99    212258/1272915272
6 China       20      00    213766/1280428583

> table5 %>% unite(new, century, year)
# A tibble: 6 x 3
  country     new   rate             
  <chr>       <chr> <chr>            
1 Afghanistan 19_99 745/19987071     
2 Afghanistan 20_00 2666/20595360    
3 Brazil      19_99 37737/172006362  
4 Brazil      20_00 80488/174504898  
5 China       19_99 212258/1272915272
6 China       20_00 213766/1280428583

# sep="" 언더바(_)를 제거하여 합치기
> table5 %>% unite(new, century, year, sep="")
# A tibble: 6 x 3
  country     new   rate             
  <chr>       <chr> <chr>            
1 Afghanistan 1999  745/19987071     
2 Afghanistan 2000  2666/20595360    
3 Brazil      1999  37737/172006362  
4 Brazil      2000  80488/174504898  
5 China       1999  212258/1272915272
6 China       2000  213766/1280428583

728x90

'R' 카테고리의 다른 글

데이터 시각화 (ggplot) (0)	2022.06.13
데이터 변환 - 파이프연산자, dplyr (0)	2022.06.09
장바구니 분석 (0)	2022.06.09
로지스틱 회귀분석 (0)	2022.06.09
분류 성능 평가-정확도/정밀도/재현율/F값/ROC그래프/K-Fold검증 (0)	2022.06.08