Python Pandas : groupby & aggregate (applying different aggregate functions to multiple columns at once)

CosmosProject 2023. 2. 3. 19:13

In its most basic usage, a pandas groupby applies a single aggregate function to all of the selected columns at once.

import pandas as pd

dict_item = {
    'date': [
        20200101, 20200102, 20200103,
        20200101, 20200102, 20200103,
        20200101, 20200102, 20200103,
        20200101, 20200102, 20200103, 20200104
    ],
    'item_id': [
        1, 1, 1,
        2, 2, 2,
        3, 3, 3,
        4, 4, 4, 4
    ],
    'item_name': [
        'a', 'a', 'a',
        'b', 'b', 'b',
        'c', 'c', 'c',
        'd', 'd', 'd', 'd'
    ],
    'price': [
        1000, 1000, 1010,
        2000, 2100, 2050,
        3000, 3100, 2950,
        4000, 3950, 3900, 3980
    ],
    'quantity': [
        100, 105, 98,
        50, 51, 55,
        201, 200, 220,
        30, 40, 38, 50
    ]
}
df_item = pd.DataFrame(dict_item)

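# Group by item_id and sum price and quantity within each group.
# .sum() is used instead of .apply(sum): recent pandas versions warn when the builtin sum is passed to apply.
df_item_agg = df_item.groupby(by=['item_id'])[['price', 'quantity']].sum()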
print(df_item_agg)



-- Result
         price  quantity
item_id                 
1         3010       303
2         6150       156
3         9050       621
4        15830       158

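In the example above, groupby is applied to a DataFrame.

The grouping key is the item_id column, and the sum aggregate function is applied to both the price and quantity columns.

In other words, for rows that share the same item_id, the price values are added together and the quantity values are added together.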
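To make the grouping concrete, here is a minimal sketch that computes the sum for item_id 1 by hand and compares it with the groupby result. The sample data is the same as in the article, written compactly; the variable names rows_item_1, manual_sum and grouped_sum are only for illustration.

```python
import pandas as pd

# Same sample data as the article (only the columns needed here).
df_item = pd.DataFrame({
    'item_id':  [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4],
    'price':    [1000, 1000, 1010, 2000, 2100, 2050,
                 3000, 3100, 2950, 4000, 3950, 3900, 3980],
    'quantity': [100, 105, 98, 50, 51, 55,
                 201, 200, 220, 30, 40, 38, 50],
})

# Keep only the rows for item_id 1, then sum the two columns by hand.
rows_item_1 = df_item[df_item['item_id'] == 1]
manual_sum = rows_item_1[['price', 'quantity']].sum()

# groupby does the same thing for every item_id in one call.
grouped_sum = df_item.groupby(by=['item_id'])[['price', 'quantity']].sum()

print(manual_sum['price'], manual_sum['quantity'])                  # 3010 303
print(grouped_sum.loc[1, 'price'], grouped_sum.loc[1, 'quantity'])  # 3010 303
```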

But what if you want the mean of the price column and the sum of the quantity column?

In that case, you can use the aggregate method.

import pandas as pd

dict_item = {
    'date': [
        20200101, 20200102, 20200103,
        20200101, 20200102, 20200103,
        20200101, 20200102, 20200103,
        20200101, 20200102, 20200103, 20200104
    ],
    'item_id': [
        1, 1, 1,
        2, 2, 2,
        3, 3, 3,
        4, 4, 4, 4
    ],
    'item_name': [
        'a', 'a', 'a',
        'b', 'b', 'b',
        'c', 'c', 'c',
        'd', 'd', 'd', 'd'
    ],
    'price': [
        1000, 1000, 1010,
        2000, 2100, 2050,
        3000, 3100, 2950,
        4000, 3950, 3900, 3980
    ],
    'quantity': [
        100, 105, 98,
        50, 51, 55,
        201, 200, 220,
        30, 40, 38, 50
    ]
}
df_item = pd.DataFrame(dict_item)

df_item_agg = df_item.groupby(by=['item_id']).aggregate(
    {
        'price': 'mean',
        'quantity': 'sum'
    }
)
print(df_item_agg)



-- Result
               price  quantity
item_id                       
1        1003.333333       303
2        2050.000000       156
3        3016.666667       621
4        3957.500000       158

The result of the code above shows, for each item_id, the mean of the price column and the sum of the quantity column.

df_item_agg = df_item.groupby(by=['item_id']).aggregate(
    {
        'price': 'mean',
        'quantity': 'sum'
    }
)

The part above is the key.

Using groupby is the same as before, but the aggregate method is then applied to the result.

Here a dictionary is passed as the argument to the aggregate method, and it looks like this:

    {
        'price': 'mean',
        'quantity': 'sum'
    }

Each key is a column name, and each value specifies which aggregate function to apply to that column.

Here, the mean is applied to the price column and the sum is applied to the quantity column.
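One way to read the dictionary is as a shortcut for aggregating each column separately and then placing the results side by side. The sketch below checks that reading with the article's data; by_dict and by_hand are illustrative names, and pd.concat is used here only for the comparison, not as the recommended approach.

```python
import pandas as pd

# Same sample data as the article (only the columns needed here).
df_item = pd.DataFrame({
    'item_id':  [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4],
    'price':    [1000, 1000, 1010, 2000, 2100, 2050,
                 3000, 3100, 2950, 4000, 3950, 3900, 3980],
    'quantity': [100, 105, 98, 50, 51, 55,
                 201, 200, 220, 30, 40, 38, 50],
})

grouped = df_item.groupby(by=['item_id'])

# Dictionary form: one aggregate function per column.
by_dict = grouped.aggregate({'price': 'mean', 'quantity': 'sum'})

# By hand: aggregate each column on its own, then combine side by side.
by_hand = pd.concat([grouped['price'].mean(), grouped['quantity'].sum()], axis=1)

# Raises an AssertionError if the two DataFrames differ.
pd.testing.assert_frame_equal(by_dict, by_hand)
print(by_dict)
```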

import pandas as pd

dict_item = {
    'date': [
        20200101, 20200102, 20200103,
        20200101, 20200102, 20200103,
        20200101, 20200102, 20200103,
        20200101, 20200102, 20200103, 20200104
    ],
    'item_id': [
        1, 1, 1,
        2, 2, 2,
        3, 3, 3,
        4, 4, 4, 4
    ],
    'item_name': [
        'a', 'a', 'a',
        'b', 'b', 'b',
        'c', 'c', 'c',
        'd', 'd', 'd', 'd'
    ],
    'price': [
        1000, 1000, 1010,
        2000, 2100, 2050,
        3000, 3100, 2950,
        4000, 3950, 3900, 3980
    ],
    'quantity': [
        100, 105, 98,
        50, 51, 55,
        201, 200, 220,
        30, 40, 38, 50
    ]
}
df_item = pd.DataFrame(dict_item)

df_item_agg = df_item.groupby(by=['item_id']).aggregate(
    {
        'price': 'min',
        'quantity': 'max'
    }
)
print(df_item_agg)



-- Result
         price  quantity
item_id                 
1         1000       105
2         2000        55
3         2950       220
4         3900        50

min and max can also be used as aggregate functions.
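Other groupby method names can be passed as strings in the same way. The sketch below uses 'median' and 'count', which do not appear in the original article and are chosen here only as an illustration.

```python
import pandas as pd

# Same sample data as the article (only the columns needed here).
df_item = pd.DataFrame({
    'item_id':  [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4],
    'price':    [1000, 1000, 1010, 2000, 2100, 2050,
                 3000, 3100, 2950, 4000, 3950, 3900, 3980],
    'quantity': [100, 105, 98, 50, 51, 55,
                 201, 200, 220, 30, 40, 38, 50],
})

# Median price and number of rows per item_id.
df_item_agg = df_item.groupby(by=['item_id']).aggregate(
    {
        'price': 'median',
        'quantity': 'count'
    }
)
print(df_item_agg)

# item_id 4 has four rows; its median price is the average of 3950 and 3980.
print(df_item_agg.loc[4, 'price'], df_item_agg.loc[4, 'quantity'])  # 3965.0 4
```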

import pandas as pd

dict_item = {
    'date': [
        20200101, 20200102, 20200103,
        20200101, 20200102, 20200103,
        20200101, 20200102, 20200103,
        20200101, 20200102, 20200103, 20200104
    ],
    'item_id': [
        1, 1, 1,
        2, 2, 2,
        3, 3, 3,
        4, 4, 4, 4
    ],
    'item_name': [
        'a', 'a', 'a',
        'b', 'b', 'b',
        'c', 'c', 'c',
        'd', 'd', 'd', 'd'
    ],
    'price': [
        1000, 1000, 1010,
        2000, 2100, 2050,
        3000, 3100, 2950,
        4000, 3950, 3900, 3980
    ],
    'quantity': [
        100, 105, 98,
        50, 51, 55,
        201, 200, 220,
        30, 40, 38, 50
    ]
}
df_item = pd.DataFrame(dict_item)

df_item_agg = df_item.groupby(by=['item_id']).aggregate(
    {
        'price': ['sum', 'mean'],
        'quantity': ['sum', 'mean']
    }
)
print(df_item_agg)



-- Result
         price              quantity       
           sum         mean      sum   mean
item_id                                    
1         3010  1003.333333      303  101.0
2         6150  2050.000000      156   52.0
3         9050  3016.666667      621  207.0
4        15830  3957.500000      158   39.5

The example above uses the aggregate method and passes several aggregate functions for a single column.

In the result, the sum and mean of the price column

and the sum and mean of the quantity column all appear together.

Passing the aggregate functions as a list like this returns the results of several aggregate functions for a single column.
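When lists are passed, the result has two header rows, so each column label is a (column, function) pair. The sketch below shows one way to pick a single column from that result, and one possible way to flatten the labels into names such as price_sum. The flattening step and the names price_mean and df_flat are additions for illustration, not part of the original example.

```python
import pandas as pd

# Same sample data as the article (only the columns needed here).
df_item = pd.DataFrame({
    'item_id':  [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4],
    'price':    [1000, 1000, 1010, 2000, 2100, 2050,
                 3000, 3100, 2950, 4000, 3950, 3900, 3980],
    'quantity': [100, 105, 98, 50, 51, 55,
                 201, 200, 220, 30, 40, 38, 50],
})

df_item_agg = df_item.groupby(by=['item_id']).aggregate(
    {
        'price': ['sum', 'mean'],
        'quantity': ['sum', 'mean']
    }
)

# Select one result column with a (column, function) tuple.
price_mean = df_item_agg[('price', 'mean')]

# Optional: join the two header levels into single column names.
df_flat = df_item_agg.copy()
df_flat.columns = ['_'.join(col) for col in df_flat.columns]

print(list(df_flat.columns))
# ['price_sum', 'price_mean', 'quantity_sum', 'quantity_mean']
```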
