Python Pandas : contains (checking whether a string is contained)


CosmosProject 2021. 6. 30. 19:00


The Pandas str.contains method can be applied to a Series.

It returns True for each value in the Series that contains the given string, and False for each value that does not.

Syntax

Series.str.contains(string/pattern, case=True/False, regex=True/False)

string/pattern : the string or pattern to search for

case : if True, the match is case sensitive; if False, it is case insensitive. The default is True.

regex : if True, string/pattern is interpreted as a regular expression pattern; if False, it is taken as a literal string. The default is True.
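To make the effect of the regex parameter concrete, here is a minimal sketch that searches for the same text, a+, once as a literal string and once as a regular expression. The sample Series and the variable names literal and pattern are made up for illustration and are not part of the original examples.

```python
import pandas as pd

# Hypothetical sample data: only the first value contains the literal text "a+"
s = pd.Series(['a+b', 'aab', 'xyz'])

# regex=False: "a+" is matched character for character
literal = s.str.contains('a+', regex=False)

# regex=True: "a+" means "one or more a"
pattern = s.str.contains('a+', regex=True)

print(literal.tolist())  # [True, False, False]
print(pattern.tolist())  # [True, True, False]
```

When the search text contains characters with a special meaning in regular expressions (such as +, . or *), passing regex=False is a simple way to avoid accidental pattern matching.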

import pandas as pd

dict_test = {
    'col1': [1, 2, 3, 4, 5, 6],
    'col2': ['apple', 'abcde', 'lelele', 'Ppa', 'xyzab', '123']
}

df_test = pd.DataFrame(dict_test)

s = df_test.loc[:, 'col2'] # 1. Create the Series
s = s.str.contains('pp', case=False, regex=False) # 2. Check whether each value in the Series contains the text pp (ignoring case).
print(s)


-- Result
0     True
1    False
2    False
3     True
4    False
5    False
Name: col2, dtype: bool

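Let's walk through the example above.

1. loc is used to pull only the col2 data out of the DataFrame as a Series.

2. str.contains checks, for each value in the Series, whether it contains the text pp.

As the result shows, the return value is a Series. The values at index 0 and 3 (apple, Ppa) contain pp, so they are True; the remaining values do not contain pp, so they are False.

Because case=False here, the match ignores case.

So apple contains pp and is True,

and Ppa also counts as containing pp (its Pp matches when case is ignored) and is True.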

import pandas as pd

dict_test = {
    'col1': [1, 2, 3, 4, 5, 6],
    'col2': ['apple', 'abcde', 'lelele', 'Ppa', 'xyzab', '123']
}

df_test = pd.DataFrame(dict_test)

s = df_test.loc[:, 'col2'] # 1. Create the Series
s = s.str.contains('pp', case=True, regex=False) # 2. Check whether each value in the Series contains the text pp (case sensitive).
print(s)


-- Result
0     True
1    False
2    False
3    False
4    False
5    False
Name: col2, dtype: bool

Changing to case=True makes the match case sensitive.

Ppa is therefore no longer considered to contain pp, and the result for the row at index 3 becomes False.

import pandas as pd

dict_test = {
    'col1': [1, 2, 3, 4, 5, 6],
    'col2': ['apple', 'abcde', 'lelele', 'Ppa', 'xyzab', '123']
}

df_test = pd.DataFrame(dict_test)

s = df_test.loc[:, 'col2'] # Create the Series
s = s.str.contains('a+', case=False, regex=True) # True for each value that contains one or more a
print(s)


-- Result
0     True
1     True
2    False
3     True
4     True
5    False
Name: col2, dtype: bool

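With regex=True, a+ is treated as a regular expression pattern rather than as literal text.

a+ means one or more consecutive occurrences of a,

so True is returned only for the strings in the Series that contain at least one a.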
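Other regular expression patterns can be used in the same way. The sketch below reuses the col2 values from the examples and tries two patterns chosen purely for illustration: ^a (the value starts with a) and ^\d+$ (the value consists only of digits). The variable names starts_with_a and only_digits are hypothetical.

```python
import pandas as pd

s = pd.Series(['apple', 'abcde', 'lelele', 'Ppa', 'xyzab', '123'])

# ^ anchors the match to the start of the string
starts_with_a = s.str.contains('^a', regex=True)

# ^\d+$ matches strings made up of digits only (raw string keeps the backslash intact)
only_digits = s.str.contains(r'^\d+$', regex=True)

print(starts_with_a.tolist())  # [True, True, False, False, False, False]
print(only_digits.tolist())    # [False, False, False, False, False, True]
```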

import pandas as pd

dict_test = {
    'col1': [1, 2, 3, 4, 5, 6],
    'col2': ['apple', 'abcde', 'lelele', 'Ppa', 'xyzab', '123']
}

df_test = pd.DataFrame(dict_test)

s = df_test.loc[:, 'col2'] # Create the Series
s = s.str.contains('a+', case=False, regex=True) # True for each value that contains one or more a

df_test = df_test.loc[s, :]
print(df_test)


-- Result
   col1   col2
0     1  apple
1     2  abcde
3     4    Ppa
4     5  xyzab

As in the example above, contains can be combined with loc to extract only the rows whose values contain a particular string.
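The same boolean Series can also be negated with ~ to select the rows that do not match. The sketch below additionally assumes the column may hold missing values, which the original examples do not cover: the na parameter of str.contains sets the result for missing entries, and na=False makes them count as non-matches so the mask stays purely boolean. The DataFrame and the names mask, matched and not_matched are made up for illustration.

```python
import pandas as pd

# Hypothetical data with one missing value in col2
df = pd.DataFrame({
    'col1': [1, 2, 3, 4],
    'col2': ['apple', None, 'Ppa', 'xyzab'],
})

# na=False: treat missing values as "does not contain"
mask = df['col2'].str.contains('pp', case=False, regex=False, na=False)

matched = df.loc[mask, :]       # rows whose col2 contains pp
not_matched = df.loc[~mask, :]  # all remaining rows, including the missing one

print(matched['col1'].tolist())      # [1, 3]
print(not_matched['col1'].tolist())  # [2, 4]
```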
