Hive: collect_list, collect_set, concat, concat_ws (combining data from multiple rows, concatenating strings, Hive string concatenation, Hive listagg)


CosmosProject 2021. 1. 14. 19:39


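Hive provides two functions called collect_list and collect_set.

They are used to combine the values from multiple rows into a single value.

You can think of them as roughly similar to Redshift's listagg, except that they return an array rather than a delimited string.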

Table name = r_test

Suppose we have a table with the columns department_no, name, and join_dt.
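
If you want to follow along, a small table like the following would be enough to try the queries in this post. This is only a sketch: the column types and every row are made up for illustration. Only the column names and the fact that Alice appears twice in department 10 come from the post itself.

```sql
-- Hypothetical sample data: column types and row values are assumptions.
create table if not exists r_test (
      department_no int
    , name          string
    , join_dt       string
);

-- Alice is inserted twice for department 10 on purpose,
-- so that collect_list and collect_set give different results.
insert into table r_test values
      (10, 'Alice', '2020-01-01')
    , (10, 'Alice', '2020-02-01')
    , (10, 'Bob',   '2020-03-01')
    , (20, 'Carol', '2020-04-01')
    , (20, 'Dave',  '2020-05-01')
;
```

Storing join_dt as a string keeps the example simple; a date column would probably work just as well with the queries below.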

-- Gather every name in each department into an array (duplicates are kept).
select department_no
     , collect_list(name) as name_collect
from r_test
group by department_no
;

In the result, the values in the name column are gathered into one array for each department_no.

Also, in the row for department_no = 10, Alice appears twice. In other words, collect_list keeps duplicate values as they are.

-- Gather the distinct names in each department into an array (duplicates are dropped).
select department_no
     , collect_set(name) as name_collect
from r_test
group by department_no
;

The query is the same, but with collect_set the duplicate values disappear from the collected result.
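
To see the difference at a glance, both functions can be put in one query. The sketch below assumes the same r_test table; the aliases name_list, name_set, list_cnt, and set_cnt are just illustrative names. size() is a built-in Hive function that returns the number of elements in an array.

```sql
-- Compare collect_list and collect_set side by side.
select department_no
     , collect_list(name)       as name_list  -- keeps duplicates
     , collect_set(name)        as name_set   -- removes duplicates
     , size(collect_list(name)) as list_cnt   -- number of collected rows
     , size(collect_set(name))  as set_cnt    -- number of distinct names
from r_test
group by department_no
;
```

For a department where a name repeats, list_cnt should come out larger than set_cnt. Note that neither function guarantees the order of the elements in the array.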

-- Join join_dt and name with an underscore, row by row, for department 10.
select department_no
     , concat(join_dt, '_', name) as name_concat
from r_test
where 1=1
and department_no = 10
;

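The concat function simply joins values that sit in the same row, along with any other values such as string literals.

In the example above, an underscore is appended to the join_dt value, and the name value is appended after that.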
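
The behavior is easy to check with literals only. This is a minimal sketch that does not need any table; the date and name are arbitrary values chosen for illustration.

```sql
-- concat glues its arguments together in the order they are given.
select concat('2021-01-14', '_', 'Alice') as name_concat;
-- 2021-01-14_Alice

-- If any argument is NULL, concat returns NULL for the whole expression.
select concat('2021-01-14', '_', cast(null as string)) as name_concat;
-- NULL
```

If the name column can contain NULL, wrapping it as coalesce(name, '') is one way to keep the rest of the string.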

-- Collect the distinct names per department, then join them with commas.
select department_no
     , concat_ws(',', collect_set(name)) as name_concat
from r_test
group by department_no
;

concat_ws joins all the values in a list or set (that is, an array) into a single string.

(If you use Redshift, it is similar to Redshift's listagg function.)

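We already looked at collect_list and collect_set in the examples above.

The thing about these collect functions is that they return a value shaped like a list, such as ["value1", "value2"].

In practice, though, we usually work with a single merged string such as value1, value2 rather than a list.

This is exactly what concat_ws does.

In the example above, concat_ws takes a comma as its first argument. This is the separator used to join the values in the list.

Next it takes collect_set(name), which is the list of all distinct name values gathered for each department_no.

concat_ws then joins the values in that list into one string, using the comma as the separator.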
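
The same thing can be tried on a hand-written array. The sketch below assumes the r_test table for the second query; wrapping the collected array in sort_array is my own addition, not part of the original query, and is there only because the collect functions do not guarantee element order.

```sql
-- Join the elements of an array literal with a comma.
select concat_ws(',', array('value1', 'value2')) as joined;
-- value1,value2

-- Same idea per department, with the names sorted for a stable result.
select department_no
     , concat_ws(',', sort_array(collect_set(name))) as name_concat
from r_test
group by department_no
;
```

concat_ws expects strings or arrays of strings, so a non-string column would probably need a cast first, for example collect_set(cast(department_no as string)).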
