September 18, 2021

Analyzing Data with pandas

In this post, we'll look at how to analyze the data in a DataFrame to extract meaningful information (insight). Previous posts worked with small, hand-made datasets; this time we'll analyze a dataset obtained from an external source.

Dataset: FIFA19 complete player dataset

1. Statistical Data Analysis

# Import the required library
import pandas as pd

# Read the data from the CSV file
df = pd.read_csv("../input/fifa19/data.csv")

df.head(5)
# df.tail()

df.shape
-------------------
(18207, 89)

df.columns
--------------------------------------------------------------------------------
Index(['Unnamed: 0', 'ID', 'Name', 'Age', 'Photo', 'Nationality', 'Flag',
       'Overall', 'Potential', 'Club', 'Club Logo', 'Value', 'Wage', 'Special',
       'Preferred Foot', 'International Reputation', 'Weak Foot',
       'Skill Moves', 'Work Rate', 'Body Type', 'Real Face', 'Position',
       'Jersey Number', 'Joined', ...
        'GKPositioning', 'GKReflexes', 'Release Clause'],
      dtype='object')

df.info()
----------------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18207 entries, 0 to 18206
Data columns (total 89 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Unnamed: 0                18207 non-null  int64  
 1   ID                        18207 non-null  int64  
 2   Name                      18207 non-null  object 
 3   Age                       18207 non-null  int64  
 4   Photo                     18207 non-null  object 
 5   Nationality               18207 non-null  object 
 6   Flag                      18207 non-null  object 
 7   Overall                   18207 non-null  int64  
 8   Potential                 18207 non-null  int64  
 9   Club                      17966 non-null  object 
 ...
 86  GKPositioning             18159 non-null  float64
 87  GKReflexes                18159 non-null  float64
 88  Release Clause            16643 non-null  object 
dtypes: float64(38), int64(6), object(45)
memory usage: 12.4+ MB

df.describe()

df['Position'].unique()
------------------------------------------------------
array(['RF', 'ST', 'LW', 'GK', 'RCM', 'LF', 'RS', 'RCB', 'LCM', 'CB',
       'LDM', 'CAM', 'CDM', 'LS', 'LCB', 'RM', 'LAM', 'LM', 'LB', 'RDM',
       'RW', 'CM', 'RB', 'RAM', 'CF', 'RWB', 'LWB', nan], dtype=object)

# Aggregation functions

df['Overall'].mean()
----------------------------------
66.23869940132916

df['Age'].max()
--------------------
45

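# Applying an aggregation function to a DataFrame applies it to each column separately.
# (The output below comes from an older pandas release; newer releases may raise a TypeError
# on text columns that contain missing values, in which case pass numeric_only=True.)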
df.max()
-------------------------------------------
Unnamed: 0                                                         18206
ID                                                                246620
Name                                                       Óscar Whalley
Age                                                                   45
Photo                       https://cdn.sofifa.org/players/4/19/9833.png
Nationality                                                     Zimbabwe
Flag                                 https://cdn.sofifa.org/flags/99.png
Overall                                                               94
Potential                                                             95
Club Logo                   https://cdn.sofifa.org/teams/2/light/983.png
Value                                                                €9M
Wage                                                                 €9K
Special                                                             2346
International Reputation                                             5.0
Weak Foot                                                            5.0
Skill Moves                                                          5.0
Jersey Number                                                       99.0
Crossing                                                            93.0
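
Because the dataset mixes numeric and text columns, a column-wise aggregation is often easier to read when it is restricted to the numeric columns. The sketch below uses a small hypothetical frame (the `toy` data is made up for illustration and is not part of the FIFA19 dataset) to show `numeric_only=True` and `agg` with several functions at once.

```python
import pandas as pd

# Hypothetical toy data standing in for a few FIFA19 columns.
toy = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Age": [31, 33, 26],
    "Overall": [94, 94, 92],
})

# Restrict the aggregation to numeric columns; "Name" is skipped.
numeric_max = toy.max(numeric_only=True)
print(numeric_max)

# Apply several aggregation functions at once: one row per function.
summary = toy[["Age", "Overall"]].agg(["min", "max", "mean"])
print(summary)
```

Selecting the columns first, as in the second call, keeps the result limited to the statistics you actually want to compare.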

2. Exploratory Data Analysis

1) Selection and Filtering

# selection
condition = df['Club'] == 'Juventus'
df[condition].head(5)

# filtering
condition = df['Overall'] > 91
df[condition]

# selection & filtering
condition_1 = df['Nationality'] == 'Italy'
condition_2 = df['Overall'] >= 85

df[condition_1 & condition_2]
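
The conditions above are boolean Series (masks), and they can be combined with `&` (and), `|` (or), and `~` (not). A minimal sketch on a hypothetical `players` frame (made-up rows, not the real dataset) shows the inline form, where each comparison needs its own parentheses:

```python
import pandas as pd

# Hypothetical rows for illustration only.
players = pd.DataFrame({
    "Name": ["P1", "P2", "P3", "P4"],
    "Nationality": ["Italy", "Italy", "Spain", "France"],
    "Overall": [88, 79, 90, 85],
})

# Parentheses are required: & binds more tightly than == and >=.
italian_stars = players[(players["Nationality"] == "Italy") & (players["Overall"] >= 85)]

# | means "or"; ~ negates a mask.
italy_or_spain = players[(players["Nationality"] == "Italy") | (players["Nationality"] == "Spain")]
not_italian = players[~(players["Nationality"] == "Italy")]

print(italian_stars)
print(italy_or_spain)
print(not_italian)
```

Storing each mask in a named variable, as the post does, avoids the parentheses issue and tends to read better once there are more than two conditions.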

# selection & aggregation

condition = df['Position'] == 'GK'


df[condition][['GKDiving', 'GKHandling', 'GKPositioning', 'GKReflexes']].mean()
----------------------------------------------------------------------------------------
GKDiving         65.323951
GKHandling       62.868148
GKPositioning    63.047407
GKReflexes       66.101728
dtype: float64


df[condition][['GKDiving', 'GKHandling', 'GKPositioning', 'GKReflexes']].mean(axis=1)
----------------------------------------------------------------------------------------
3        89.25
9        88.75
18       86.75
19       87.50
22       87.50
         ...  
18178    47.25
18180    47.25
18183    47.00
18194    47.00
18198    47.00
Length: 2025, dtype: float64
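
The two results above differ only in the `axis` argument: the default (`axis=0`) averages down each column, while `axis=1` averages across each row. A small sketch with two hypothetical goalkeepers (the values are invented for illustration) makes the difference easier to check by hand:

```python
import pandas as pd

# Two made-up goalkeepers with the same four GK columns used above.
gk = pd.DataFrame(
    {
        "GKDiving": [90, 80],
        "GKHandling": [85, 70],
        "GKPositioning": [88, 75],
        "GKReflexes": [94, 79],
    },
    index=["keeper_a", "keeper_b"],
)

# axis=0 (default): one mean per column, i.e. per skill.
per_skill = gk.mean()

# axis=1: one mean per row, i.e. per goalkeeper.
per_keeper = gk.mean(axis=1)

print(per_skill)
print(per_keeper)
```

The per-row result keeps the original index, which is why the output above is labeled with the players' row numbers.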

2) Groupby

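# numeric_only=True limits the mean to numeric columns (text columns cannot be averaged)
df.groupby(by='Club').mean(numeric_only=True)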

df[['Club', 'Nationality', 'Name']].groupby(by=['Club', 'Nationality']).count()

df.groupby(by=['Club', 'Nationality'])[['Age', 'Overall', 'Potential']].mean()

df.groupby(by=['Club', 'Nationality'])[['Age', 'Overall', 'Potential']].agg(['max', 'min'])

df.groupby(by=['Club', 'Nationality']).agg({'Nationality': 'count', 'Age': 'mean', 'Overall': 'max', 'Potential': 'mean'})
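
As an alternative to the dictionary form, pandas also supports named aggregation, which lets you choose the output column names. The sketch below uses a hypothetical `squad` frame; the output names `players`, `mean_age`, and `best_overall` are illustrative choices, not names from the original post.

```python
import pandas as pd

# Hypothetical rows for illustration only.
squad = pd.DataFrame({
    "Club": ["X", "X", "X", "Y"],
    "Nationality": ["Italy", "Italy", "Spain", "Italy"],
    "Name": ["a", "b", "c", "d"],
    "Age": [30, 20, 25, 28],
    "Overall": [85, 70, 80, 77],
})

# Named aggregation: output_name=(source_column, function).
summary = squad.groupby(["Club", "Nationality"]).agg(
    players=("Name", "count"),
    mean_age=("Age", "mean"),
    best_overall=("Overall", "max"),
)
print(summary)

# reset_index turns the (Club, Nationality) index back into ordinary columns.
flat = summary.reset_index()
print(flat)
```

Counting a column other than the grouping key (here `Name`) makes it clearer what is being counted.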

3. Useful Functions for Analysis

# DataFrame.sort_values(by='col', ascending=False, inplace=True)
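
A minimal sketch of `sort_values`, assuming a small hypothetical `players` frame (made-up rows). By default the method returns a new, sorted DataFrame and leaves the original untouched.

```python
import pandas as pd

# Hypothetical rows for illustration only.
players = pd.DataFrame({
    "Name": ["a", "b", "c"],
    "Overall": [80, 92, 80],
    "Age": [30, 25, 22],
})

# Sort by one column, highest first.
by_overall = players.sort_values(by="Overall", ascending=False)

# Sort by several columns: Overall descending, then Age ascending to break ties.
by_overall_then_age = players.sort_values(by=["Overall", "Age"], ascending=[False, True])

print(by_overall)
print(by_overall_then_age)
```

Assigning the result to a new name is usually easier to follow than sorting in place, since the original row order stays available.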

# Series.value_counts()
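
A short sketch of `value_counts` on a hypothetical Series of positions (the values are invented). It counts how often each value occurs, sorted from most to least frequent, and skips missing values unless told otherwise.

```python
import pandas as pd

# Hypothetical positions, including one missing value.
positions = pd.Series(["GK", "ST", "ST", "CB", None, "ST"])

# Counts per value; missing values are excluded by default.
counts = positions.value_counts()

# Include missing values as their own entry.
with_missing = positions.value_counts(dropna=False)

# Relative frequencies instead of raw counts.
shares = positions.value_counts(normalize=True)

print(counts)
print(with_missing)
print(shares)
```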

# DataFrame.apply()
# Series.apply()
# Groupby.apply()
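
The three `apply` variants differ in what is passed to the function: a single value (`Series.apply`), a whole row or column (`DataFrame.apply`), or one group at a time (`GroupBy.apply`). The sketch below illustrates each on a hypothetical `players` frame; the lambdas and the names `tier` and `growth` are illustrative assumptions.

```python
import pandas as pd

# Hypothetical rows for illustration only.
players = pd.DataFrame({
    "Club": ["X", "X", "Y"],
    "Name": ["a", "b", "c"],
    "Overall": [85, 70, 77],
    "Potential": [86, 80, 77],
})

# Series.apply: the function receives one value at a time.
tier = players["Overall"].apply(lambda x: "high" if x >= 80 else "regular")

# DataFrame.apply with axis=1: the function receives one row at a time.
growth = players.apply(lambda row: row["Potential"] - row["Overall"], axis=1)

# GroupBy.apply: the function receives the values of one group at a time.
best_per_club = players.groupby("Club")["Overall"].apply(lambda s: s.max())

print(tier)
print(growth)
print(best_per_club)
```

For simple arithmetic such as `growth`, the vectorized form `players["Potential"] - players["Overall"]` gives the same result and is typically faster; `apply` is most useful when no built-in operation fits.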

JaeYeong Kim
