Python Programming and Pandas Practice (3)-2

2024. 1. 4. 22:23ㆍTIL

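Working with DataFrames - Inspecting the dataset

The Titanic dataset
- The Titanic dataset contains information related to the sinking of the Titanic.
- Passenger data (e.g., name, age, sex, socioeconomic status) is used to build a predictive model
  that answers the question "What kinds of people were more likely to survive?"
Possible analyses
- Survival rate analysis
- Feature engineering
- Data visualization
- Predictive modeling
- Feature importance analysis
Column descriptions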

pclass	Passenger class (1 = 1st; 2 = 2nd; 3 = 3rd)
survived	Survival (0 = No; 1 = Yes)
name	Name
sex	Sex
age	Age
sibsp	Number of siblings/spouses aboard
parch	Number of parents/children aboard
ticket	Ticket number
fare	Passenger fare (British pounds)
cabin	Cabin number
embarked	Port of embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
boat	Lifeboat number
body	Body identification number
home.dest	Home/destination

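Special notes

pclass is a proxy for socioeconomic status (SES).
1st class represents the upper class, 2nd class the middle class, and 3rd class the lower class.
age is given in years; ages under 1 are expressed as fractions.
An estimated age takes the form xx.5.
fare is given in pre-1970 British pounds.
- Conversion factors: 1 pound = 20 shillings = 240 pence, 1 shilling = 12 pence
The family relation variables sibsp and parch ignore some relationships.
- The definitions used for sibsp and parch are as follows.
- Sibling: brother, sister, stepbrother, or stepsister travelling aboard the Titanic
- Spouse: husband or wife travelling aboard the Titanic (lovers and fiancés are excluded)
- Parent: mother or father travelling aboard the Titanic
- Child: son, daughter, stepson, or stepdaughter travelling aboard the Titanic
Other family relationships excluded from this study include cousins, nephews/nieces, aunts/uncles, and in-laws.
- Some children travelled only with a nanny, so parch=0 is recorded for them.
- Some people also travelled with very close friends or neighbours, but the definitions do not cover such relationships.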

import pandas as pd
titanic_df = pd.read_excel('titanic3.xls')
# Titanic dataset info
titanic_df.info()
'''
1. There are 1,309 rows in total.
2. pclass, survived, name, sex, sibsp, parch, and ticket have no missing values.
3. age, fare, cabin, embarked, boat, body, and home.dest have missing values.
4. int64: pclass, survived, sibsp, parch
5. float64: age, fare, body
6. object: name, sex, ticket, cabin, embarked, boat, home.dest
'''
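
The missing-value summary read off from info() can also be computed directly. Below is a minimal sketch on a tiny made-up frame (sample_df and its values are hypothetical, not rows from titanic3.xls), assuming only that isna().sum() is an acceptable way to count missing values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for the Titanic data (values are made up).
sample_df = pd.DataFrame({
    "pclass": [1, 2, 3, 3],
    "age": [29.0, np.nan, 0.75, np.nan],
    "cabin": ["B5", None, None, None],
})

# Number of missing values in each column.
missing_counts = sample_df.isna().sum()

# Names of the columns that contain at least one missing value.
columns_with_missing = missing_counts[missing_counts > 0].index.tolist()

print(missing_counts)
print(columns_with_missing)
```

Running the same two lines on titanic_df should list the columns named in item 3 above.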

Working with DataFrames - Row lookup and handling duplicates (loc, duplicated, keep=)

# Set index_col='name'
titanic_df = pd.read_excel('titanic3.xls', index_col='name')

# Sort by name
# Strings are compared by character code, so lowercase letters sort after uppercase ones
titanic_df = titanic_df.sort_index()
# Sort by name, ignoring case
# key: a function applied to the index values before they are compared
titanic_df = titanic_df.sort_index(key=lambda x: x.str.lower())
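
To see the difference between the two sorts on something small, here is a sketch with three made-up names (names_df and the names in it are hypothetical). It only assumes that sort_index accepts a key function, as used above.

```python
import pandas as pd

# Hypothetical names, chosen so that one starts with a lowercase letter.
names_df = pd.DataFrame(
    {"pclass": [3, 1, 2]},
    index=["de Mulder, Mr. A", "Zabour, Miss. B", "Abbott, Mr. C"],
)

# Default sort: compares character codes, so "Z" (90) comes before "d" (100).
default_order = names_df.sort_index().index.tolist()

# key receives the whole Index and must return an Index of the same length.
case_insensitive_order = names_df.sort_index(
    key=lambda idx: idx.str.lower()
).index.tolist()

print(default_order)           # Abbott, Zabour, de Mulder
print(case_insensitive_order)  # Abbott, de Mulder, Zabour
```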

'''
.loc
- Looks up rows by index label
- An attribute (property), not a method (function)
- .loc[] is mainly used to access rows and columns by label,
  but it also accepts a boolean array.
'''
# 1. Look up Abbing, Mr. Anthony
titanic_df.loc['Abbing, Mr. Anthony']
# 1-1. To get a DataFrame instead of a Series
titanic_df.loc[['Abbing, Mr. Anthony']]

# 2. Look up Abbing, Mr. Anthony and Zimmerman, Mr. Leo
titanic_df.loc[['Abbing, Mr. Anthony','Zimmerman, Mr. Leo']]

# 3. Look up everyone from Abbott, Master. Eugene Joseph through Abelseth, Miss. Karen Marie (both ends included)
titanic_df.loc['Abbott, Master. Eugene Joseph':'Abelseth, Miss. Karen Marie']

# 4. Look up a name that does not exist -> raises KeyError
titanic_df.loc['james']
# How to handle a missing label
# Option 1: guard with an if statement
if 'james' in titanic_df.index:
    result = titanic_df.loc['james']
else:
    result = None  # or take some other action
# Option 2: guard with try/except
try:
    result = titanic_df.loc['james']
except KeyError:
    result = None  # or take some other action

'''.duplicated() - when several passengers share the same name'''
# 1. DataFrame.index.duplicated()
titanic_df.index.duplicated()
# 2. Show the rows that satisfy the condition from step 1
titanic_df.loc[titanic_df.index.duplicated()]  # 4 people are expected, but only 2 appear
'''
keep: which occurrence among the duplicates is kept (left unflagged)
- 'first' (default)
  The first occurrence is kept; every later occurrence is flagged True.
- 'last'
  The last occurrence is kept; every earlier occurrence is flagged True.
- False
  No occurrence is kept; every duplicated value is flagged True.
'''
# 3. Include every duplicated row in the result from step 2
titanic_df.loc[titanic_df.index.duplicated(keep=False)]
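
The three keep options are easier to compare side by side. A small sketch with a made-up index (people and its names are hypothetical) in which one name appears twice:

```python
import pandas as pd

# Hypothetical passengers; "Kim, Mr. A" appears at positions 0 and 2.
people = pd.DataFrame(
    {"age": [22, 34, 40, 19]},
    index=["Kim, Mr. A", "Lee, Miss. B", "Kim, Mr. A", "Park, Mr. C"],
)

flag_first = people.index.duplicated().tolist()             # keep='first' is the default
flag_last = people.index.duplicated(keep="last").tolist()
flag_all = people.index.duplicated(keep=False).tolist()

print(flag_first)  # [False, False, True, False]
print(flag_last)   # [True, False, False, False]
print(flag_all)    # [True, False, True, False]

# Only keep=False returns both rows of the shared name.
both_rows = people.loc[people.index.duplicated(keep=False)]
```

This is why step 2 shows only half of the people involved: with the default, the first row of each shared name is not flagged.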

Working with DataFrames - Looking up columns of a row (loc)

import pandas as pd
# Set index_col='name'
titanic_df = pd.read_excel('titanic3.xls', index_col='name')

# loc - look up specific columns of a specific row
# 1. What is the passenger class of Blank, Mr. Henry?
titanic_df.loc['Blank, Mr. Henry','pclass']
# 2. What are the passenger class and fare of Blank, Mr. Henry?
titanic_df.loc['Blank, Mr. Henry',['pclass','fare']]
# 3. What are the passenger class and fare of Blank, Mr. Henry? -> as a DataFrame
titanic_df.loc[['Blank, Mr. Henry'],['pclass','fare']]

# Check how Herman, Miss. Alice and Herman, Miss. Kate are related
# Passengers
passenger = ['Herman, Miss. Alice','Herman, Miss. Kate']
# 1. Look up all columns
titanic_df.loc[passenger]
# 2. What are their passenger class and fare?
titanic_df.loc[passenger,['pclass','fare']]
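
What loc returns depends on whether each selector is a single label or a list. A minimal sketch on a made-up two-row frame (fares and its labels are hypothetical):

```python
import pandas as pd

# Hypothetical two-passenger frame.
fares = pd.DataFrame(
    {"pclass": [1, 3], "fare": [31.0, 7.25]},
    index=["Passenger A", "Passenger B"],
)

# label, label -> a single value
one_value = fares.loc["Passenger A", "pclass"]
# label, list of labels -> Series
one_row = fares.loc["Passenger A", ["pclass", "fare"]]
# list of labels, list of labels -> DataFrame
one_row_frame = fares.loc[["Passenger A"], ["pclass", "fare"]]

print(type(one_value).__name__, type(one_row).__name__, type(one_row_frame).__name__)
```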

Working with DataFrames - Position-based row and column lookup (iloc)

import pandas as pd
cols = ['name','sex','age','pclass','fare']
titanic = pd.read_excel('titanic3.xls', usecols=cols)
'''
loc[ ] vs iloc[ ]
- loc selects one or more rows by label; iloc selects them by index position.
- An index label can be a string or a number.
- An index position is fixed (like a Python list index).
- It is fine as long as you do not mix them up, but numeric index labels are best avoided where possible.
- If duplicate index values are a concern, integer index labels are acceptable.
'''
# Drop the row labeled 0
titanic.drop(labels=0, inplace=True)
# Look up 0 with loc -> KeyError, because label 0 no longer exists
titanic.loc[0]
# Look up 0 with iloc -> returns the row whose index label is 1
titanic.iloc[0]

# Slicing 1
titanic.iloc[5:11]
# Slicing 2 (every second row)
titanic.iloc[5:11:2]
# Avoiding numeric index labels
cols = ['name','sex','age','pclass','fare']
titanic = pd.read_excel('titanic3.xls', usecols=cols, index_col='name')
# iloc still works - slicing 1
titanic.iloc[5:11]
# iloc still works - slicing 2
titanic.iloc[5:11:2]
# Look up several rows
titanic.iloc[[50,100,150,200]]
# Negative indexing, as with a Python list
titanic.iloc[-1]
# To get a DataFrame instead of a Series, pass a list of positions
titanic.iloc[[-1,2]]

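'''
iloc may feel more convenient,
but loc is the one that reads meaningfully and unambiguously.
loc is the better fit for "find a specific person".
Sorting tends to change index positions,
so for "do something to a specific row", prefer loc where possible.
'''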
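
The point about sorting can be checked on a small made-up frame (crew and its labels are hypothetical): the same label keeps returning the same person, while position 0 points at someone else once the rows are reordered.

```python
import pandas as pd

# Hypothetical frame whose rows are not in alphabetical order.
crew = pd.DataFrame(
    {"age": [40, 25, 33]},
    index=["Passenger C", "Passenger A", "Passenger B"],
)

age_by_label_before = crew.loc["Passenger A", "age"]
age_at_pos0_before = crew.iloc[0, 0]

sorted_crew = crew.sort_index()

age_by_label_after = sorted_crew.loc["Passenger A", "age"]
age_at_pos0_after = sorted_crew.iloc[0, 0]

print(age_by_label_before, age_by_label_after)  # 25 25
print(age_at_pos0_before, age_at_pos0_after)    # 40 25
```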
# Row at position 3 (the fourth person)
titanic.iloc[3]
# iloc[index_position, column_position] -> scalar
titanic.iloc[3,2]
# iloc[[index_position], [column_position]] -> DataFrame 1
titanic.iloc[[3,5,7],[2]]
# iloc[[index_position], [column_position]] -> DataFrame 2
titanic.iloc[[3,5,7],[2,3]]
# iloc[index_position, [column_position]] -> Series
titanic.iloc[3,[2,3]]
# iloc[index_position, [column_position]] -> Series (with a single element)
titanic.iloc[3,[2]]
# Columns 2 and 3 of every row -> DataFrame
titanic.iloc[:,[2,3]]

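Working with DataFrames - Operations on arrays of different dimensions (Matching, Broadcasting)

# NumPy stretches (broadcasts) the smaller array to fit the larger one,
# whereas pandas matches values by label and fills the places it cannot match with missing values (NA)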
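
A short side-by-side sketch of the two behaviours, using made-up arrays (all names and values here are illustrative):

```python
import numpy as np
import pandas as pd

# NumPy: the 1-D row is stretched across both rows of the 2-D array.
matrix = np.array([[1, 2, 3], [4, 5, 6]])
row = np.array([10, 20, 30])
stretched = matrix + row
print(stretched.tolist())  # [[11, 22, 33], [14, 25, 36]]

# pandas: values are paired by index label, not by position.
left = pd.Series([1, 2, 3], index=["a", "b", "c"])
right = pd.Series([10, 20, 30], index=["b", "c", "d"])
aligned = left + right
print(aligned)  # a and d have no partner, so they become NaN
```

Only the labels b and c exist on both sides, so only those two positions get a numeric result.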

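# When the sizes are the same
a = pd.Series(data = [1, 2, 3])
b = pd.Series(data = [3, 5, 7])
# Element-wise arithmetic works without any problem

# When the sizes differ
# Case 1. Building a DataFrame from Series of different sizes
# (the column labels '1열', '2열', '3열' mean "column 1", "column 2", "column 3")
s1 = pd.Series(data = [1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series(data = [4, 5, 6, 7], index=['a', 'b', 'c', 'd'])
s3 = pd.Series(data = [8, 9, 10], index=['a', 'b', 'd'])
df1 = pd.DataFrame(
    data = {
        '1열': s1, '2열': s2, '3열': s3
    }
)
df2 = pd.DataFrame(
    data = {
        '2열': s2
    }
)

# Case 2. Both are DataFrames of different shapes that share a column name. What does arithmetic do?
df1 - df2  # only the column whose label exists in both is computed; every other column becomes NaN

# Case 3. What about a DataFrame and a Series?
s4 = df2.squeeze()
# s4 is identical to selecting column '2열' of df1
df1['2열']
# Subtraction with the - operator
df1 - s4  # a mess: the Series index (a-d) is matched against the column labels of df1, so everything is NaN
# Changing the axis with sub() makes the stretch happen
df1.sub(s4, axis='index')

# Case 4. What happens when a row is the operand?
s5 = df1.iloc[2]
# Stretching happens here too: s5 is matched on the column labels and applied to every row
df1.sub(s5)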

Operator	Method	Usage example
+	add()	df.add(row, axis='columns'), df.add(column, axis='index')
-	sub()	df.sub(row, axis='columns'), df.sub(column, axis='index')
*	mul()	df.mul(row, axis='columns'), df.mul(column, axis='index')
/	div()	df.div(row, axis='columns'), df.div(column, axis='index')
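
The two axis settings in the table can be tried on a small made-up frame (scores and its labels are hypothetical). With axis='columns' the Series is matched against the column labels; with axis='index' it is matched against the row labels.

```python
import pandas as pd

scores = pd.DataFrame(
    {"x": [1, 2, 3], "y": [10, 20, 30]},
    index=["a", "b", "c"],
)

first_row = scores.loc["a"]  # Series indexed by the column labels x, y
x_column = scores["x"]       # Series indexed by the row labels a, b, c

# Subtract the row from every row (matched on column labels).
by_row = scores.sub(first_row, axis="columns")
# Subtract the column from every column (matched on row labels).
by_column = scores.sub(x_column, axis="index")

print(by_row.values.tolist())     # [[0, 0], [1, 10], [2, 20]]
print(by_column.values.tolist())  # [[0, 9], [0, 18], [0, 27]]
```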

Working with DataFrames - Conditional statements + quizzes

import pandas as pd
# Create the DataFrame
cols = ['name', 'survived', 'pclass', 'fare', 'sex', 'age']
titanic = pd.read_excel('titanic3.xls', usecols=cols)
titanic = titanic[cols]
# Set name as the index
titanic.set_index('name', inplace=True)

# Find the survivors (survived == 1)
titanic['survived'] == 1
titanic[titanic['survived'] == 1]
# Preferred style: give the mask a name first
survived_mask = titanic['survived'] == 1
titanic[survived_mask]

male_mask = titanic['sex'] == 'male'
female_mask = titanic['sex'] == 'female'
old_mask = titanic['age'] > 65
young_mask = titanic['age'] < 15
# Men older than 65
titanic[male_mask & old_mask]
# Narrow it down further to survivors
titanic[male_mask & old_mask & survived_mask]

# Quiz 1. List the upper-class (1st class) women who survived
pclass_1_mask = titanic['pclass'] == 1
female_mask = titanic['sex'] == 'female'
survived_mask = titanic['survived'] == 1
titanic[pclass_1_mask & female_mask & survived_mask]

# Quiz 2. List the surviving children under 15 who paid more than 250
high_fare_mask = titanic['fare'] > 250
young_mask = titanic['age'] < 15
survived_mask = titanic['survived'] == 1
titanic[high_fare_mask & young_mask & survived_mask]

# Quiz 3. List the survivors who were under 15 or female
young_mask = titanic['age'] < 15
female_mask = titanic['sex'] == 'female'
survived_mask = titanic['survived'] == 1
titanic[(young_mask | female_mask) & survived_mask]
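
The parentheses in Quiz 3 matter because & binds more tightly than |. A small sketch on made-up rows (riders and its values are hypothetical) shows the difference, and also that a missing age never satisfies a comparison:

```python
import numpy as np
import pandas as pd

# Hypothetical passengers; P3 has an unknown age.
riders = pd.DataFrame(
    {
        "sex": ["female", "male", "male", "female"],
        "age": [30.0, 8.0, np.nan, 12.0],
        "survived": [1, 1, 0, 0],
    },
    index=["P1", "P2", "P3", "P4"],
)

young = riders["age"] < 15          # NaN < 15 evaluates to False
female = riders["sex"] == "female"
survived = riders["survived"] == 1

# (young OR female) AND survived
with_parens = riders[(young | female) & survived].index.tolist()
# Without parentheses this reads as young OR (female AND survived)
without_parens = riders[young | female & survived].index.tolist()
# ~ negates a mask
not_survived = riders[~survived].index.tolist()

print(with_parens)     # ['P1', 'P2']
print(without_parens)  # ['P1', 'P2', 'P4']
print(not_survived)    # ['P3', 'P4']
```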

Working with DataFrames - Range filtering (between)

import pandas as pd
# Create the DataFrame
cols = ['name', 'survived', 'pclass', 'fare', 'sex', 'age']
titanic = pd.read_excel('titanic3.xls', usecols=cols, index_col='name')

# Find people in their twenties with comparison masks
upper_20_mask = titanic['age'] >= 20
lower_30_mask = titanic['age'] < 30
titanic[upper_20_mask & lower_30_mask]

# Find people in their twenties with between
age_20s_mask = titanic['age'].between(20,30,inclusive='left')
titanic[age_20s_mask]

# 20 <= age <= 30
titanic['age'].between(20,30,inclusive='both')
# 20 < age <= 30
titanic['age'].between(20,30,inclusive='right')
# 20 < age < 30
titanic['age'].between(20,30,inclusive='neither')
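
The four inclusive settings differ only at the two boundaries, which a made-up Series with values on and around 20 and 30 makes visible (ages here is illustrative). The string values for inclusive assume pandas 1.3 or later.

```python
import pandas as pd

ages = pd.Series([19, 20, 25, 30, 31])

left = ages.between(20, 30, inclusive="left").tolist()
both = ages.between(20, 30, inclusive="both").tolist()
right = ages.between(20, 30, inclusive="right").tolist()
neither = ages.between(20, 30, inclusive="neither").tolist()

print(left)     # [False, True, True, False, False]
print(both)     # [False, True, True, True, False]
print(right)    # [False, False, True, True, False]
print(neither)  # [False, False, True, False, False]
```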

Working with DataFrames - Filtering and handling missing values (isin, isnull, isna, notnull)

import pandas as pd
cols = ['name', 'survived', 'pclass', 'fare', 'sex', 'age']
titanic = pd.read_excel('titanic3.xls', usecols=cols, index_col='name')

# Building a condition (mask) that matches both the upper class and the middle class
pclass_1_mask = titanic['pclass'] == 1
pclass_2_mask = titanic['pclass'] == 2
pclass_1_mask | pclass_2_mask
'''
.isin()
- Think of it as similar to Python's in.
- 1 in [1, 2] and 2 in [1, 2] both evaluate to True.
- isin checks whether each element belongs to a given collection of values.
- It returns a boolean Series indicating whether each element of the Series is contained in values.
- values is a set or a list-like sequence of values.
- Passing a single string as values raises TypeError.
- A single string must be wrapped in a one-element list.
'''
titanic['pclass'].isin([1,2])  # same as the mask above
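
The note about single strings can be checked on a small made-up Series (ports is illustrative and only borrows the embarkation codes from the column table):

```python
import pandas as pd

ports = pd.Series(["S", "C", "Q", "S"])

# A list of candidate values works as expected.
in_s_or_c = ports.isin(["S", "C"]).tolist()
print(in_s_or_c)  # [True, True, False, True]

# A bare string is rejected with TypeError.
try:
    ports.isin("S")
    raised_type_error = False
except TypeError:
    raised_type_error = True
print(raised_type_error)  # True

# Wrapping the string in a one-element list fixes it.
only_s = ports.isin(["S"]).tolist()
print(only_s)  # [True, False, False, True]
```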

'''
.isnull()
- Checks each element for null
- Returns True wherever it finds NA.
'''
# isnull() on the age column
titanic['age'].isnull()
# Count the missing values
titanic['age'].isnull().sum()

'''
.notnull()
- The opposite of isnull: returns True wherever the value is not NA
'''
titanic['age'].notnull()
# Count the non-missing values
titanic['age'].notnull().sum()

unknown_age_mask = titanic['age'].isnull()
known_age_mask = titanic['age'].notnull()
# People whose age is unknown
titanic[unknown_age_mask]
# People whose age is known
titanic[known_age_mask]
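
The section title also lists isna, which does not appear in the code above. In pandas, isna and isnull are aliases. A minimal sketch on a made-up Series (ages is illustrative) confirms that, and that the two masks are exact complements:

```python
import numpy as np
import pandas as pd

# Hypothetical ages; both np.nan and None are stored as NaN in a float Series.
ages = pd.Series([22.0, np.nan, 35.0, None])

unknown = ages.isnull()
known = ages.notnull()

# isna() gives the same mask as isnull().
same_as_isna = unknown.equals(ages.isna())
# notnull() is the negation of isnull().
is_complement = (~unknown).equals(known)

print(unknown.sum(), known.sum())   # 2 2
print(same_as_isna, is_complement)  # True True
```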

Working with DataFrames - Renaming row and column index labels (rename)

import pandas as pd
cols = ['name', 'survived', 'pclass', 'fare', 'sex', 'age']
titanic = pd.read_excel('titanic3.xls', usecols=cols, index_col='name')

# List the columns (column index)
titanic.columns
# Access a specific column name
titanic.columns[0]
# Attempt: modify by accessing the element and assigning to it
titanic.columns[0] = 'class'  # TypeError: an Index is immutable
# Method 1: use rename
titanic.rename(columns={'pclass':'class'}, inplace=True)
# Method 2: prepare a list of the same length and replace the whole column index
titanic.columns = ['class', 'survived', 'sex', 'age', 'cost']

# 1. Renaming two columns
titanic.rename({'class':'Pclass', 'cost':'Fare'}, axis=1)
# Without the axis argument
titanic.rename({'class':'Pclass', 'cost':'Fare'})  # no error, but nothing is renamed: the mapping is applied to the row index (axis=0)

# 2. Rename 'Allen, Miss. Elisabeth Walton' in the index
titanic.rename({'Allen, Miss. Elisabeth Walton':'Allen'}, axis=0)
# Without the axis argument
titanic.rename({'Allen':'Allen, Miss. Elisabeth Walton'})  # default: axis=0

# 3. Name the row index and the column index
titanic.index.name = 'Passengers'
titanic.columns.name = 'information'
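
Two details of rename are easy to miss: without inplace=True it returns a new DataFrame, and a mapping passed without axis is applied to the row index. A small sketch on a made-up frame (table and its labels are hypothetical):

```python
import pandas as pd

table = pd.DataFrame(
    {"class": [1, 3], "cost": [31.0, 7.25]},
    index=["Passenger A", "Passenger B"],
)

# Returns a renamed copy; table itself is unchanged.
renamed_cols = table.rename(columns={"class": "Pclass", "cost": "Fare"})

# No axis given: the mapping targets the row index, finds no match, and raises nothing.
no_axis = table.rename({"class": "Pclass"})

# Renaming a row label.
renamed_rows = table.rename(index={"Passenger A": "A"})

print(renamed_cols.columns.tolist())  # ['Pclass', 'Fare']
print(table.columns.tolist())         # ['class', 'cost']
print(no_axis.columns.tolist())       # ['class', 'cost']
print(renamed_rows.index.tolist())    # ['A', 'Passenger B']
```

Using the columns= and index= keywords, as in Method 1 above, avoids having to remember which axis the positional mapping applies to.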

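What I found difficult while studying

Each day's class covers so much that just organizing my notes was a struggle, let alone absorbing and understanding the material. The DataFrame axes still confuse me.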
