https://www.kaggle.com/c/nlp-getting-started/overview
Real or Not? NLP with Disaster Tweets
Predict which Tweets are about real disasters and which ones are not
1. Evaluation metric: F1 score — https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html (sklearn.metrics.f1_score)
So far I have only done text analysis in R, so this time I'm taking it on in Python!
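Since the competition is scored with F1, it is worth sanity-checking the metric once before submitting anything. A minimal sketch with made-up labels and predictions:

```python
from sklearn.metrics import f1_score

# Hypothetical ground truth and predictions for the binary target
# (1 = real disaster tweet, 0 = not)
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]

# F1 = 2 * precision * recall / (precision + recall)
# Here precision = 3/3 and recall = 3/4, so F1 = 6/7 ≈ 0.857
print(f1_score(y_true, y_pred))
```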
https://www.kaggle.com/c/house-prices-advanced-regression-techniques/overview
House Prices: Advanced Regression Techniques
Predict sales prices and practice feature engineering, RFs, and gradient boosting
This is a house-price prediction competition running on Kaggle.
Feeling that my analysis skills still need work, I decided to take this competition on.
Today's post covers the preprocessing step.
# coding: utf-8
# ### Data
#
# 1. train data shape
# - 81 columns
# - 1460 rows
# 2. target
# - SalePrice
# 3. test data shape
# - 80 columns
# - 1459 rows
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import math
#pd.set_option('display.max_columns', 100)
#pd.set_option('display.max_rows', 100)
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
# ### preprocessing
# 1. split the columns into categorical data and numeric data
# 2. drop the Id column
# 3. fill numeric NAs with each column's mean
# ---
# Caveat: some columns are numeric in form but should actually be treated as categorical (see the data description - data_description.txt)
#
# -> MSSubClass, OverallCond, KitchenAbvGr, BedroomAbvGr, TotRmsAbvGrd
df_train = train.drop(['Id'],axis=1)
# set the Ids aside, since they need to be attached to the submission result
test_ids = test['Id']
df_test = test.drop(['Id'],axis=1)
df_num_to_category_cols = ['MSSubClass', 'OverallCond', 'KitchenAbvGr', 'BedroomAbvGr','TotRmsAbvGrd']
df_train_numbers = df_train[df_train.columns[df_train.dtypes != object]]
df_train_numbers = df_train_numbers.drop(df_num_to_category_cols, axis=1)
df_test_numbers = df_test[df_test.columns[df_test.dtypes != object]]
df_test_numbers = df_test_numbers.drop(df_num_to_category_cols, axis=1)
df_train_categorys = df_train.select_dtypes(['object']).apply(lambda x: x.astype('category'))
df_train_categorys = pd.concat([df_train_categorys,
                                df_train[df_num_to_category_cols]], axis=1)
df_train_categorys = df_train_categorys.apply(lambda x: x.astype('category'))
df_test_categorys = df_test.select_dtypes(['object']).apply(lambda x: x.astype('category'))
df_test_categorys = pd.concat([df_test_categorys,
                               df_test[df_num_to_category_cols]], axis=1)
df_test_categorys = df_test_categorys.apply(lambda x: x.astype('category'))
df_train_categorys.head()
df_test_numbers.isnull().sum()
# GarageYrBlt -> fill with YearBuilt
# MasVnrArea -> fill with 0
# LotFrontage -> fill with the mean
## the test data also has NAs in the columns below -> fill them all with the mean
# BsmtFinSF1
# BsmtFinSF2
# BsmtUnfSF
# TotalBsmtSF
# BsmtFullBath
# BsmtHalfBath
# GarageCars
# GarageArea
year_built_list = df_train_numbers[df_train_numbers['GarageYrBlt'].isnull()]['YearBuilt'].tolist()
year_built_list_2 = df_test_numbers[df_test_numbers['GarageYrBlt'].isnull()]['YearBuilt'].tolist()
garage_list = []
i = 0
for v in df_train_numbers['GarageYrBlt'].values:
    if math.isnan(v):
        garage_list.append(year_built_list[i])
        i += 1
    else:
        garage_list.append(v)
garage_list_2 = []
i = 0
for v in df_test_numbers['GarageYrBlt'].values:
    if math.isnan(v):
        garage_list_2.append(year_built_list_2[i])
        i += 1
    else:
        garage_list_2.append(v)
df_train_numbers['GarageYrBlt'] = garage_list
df_test_numbers['GarageYrBlt'] = garage_list_2
df_train_numbers['GarageYrBlt'].isnull().sum()
df_test_numbers['GarageYrBlt'].isnull().sum()
df_train_numbers['MasVnrArea'] = df_train_numbers['MasVnrArea'].apply(lambda x: 0 if math.isnan(x) else x)
df_test_numbers['MasVnrArea'] = df_test_numbers['MasVnrArea'].apply(lambda x: 0 if math.isnan(x) else x)
df_train_numbers['MasVnrArea'].isnull().sum()
df_test_numbers['MasVnrArea'].isnull().sum()
df_train_numbers['LotFrontage'] = df_train_numbers['LotFrontage'].apply(lambda x: int(np.mean(df_train_numbers['LotFrontage'])) if math.isnan(x) else x)
df_test_numbers['LotFrontage'] = df_test_numbers['LotFrontage'].apply(lambda x: int(np.mean(df_test_numbers['LotFrontage'])) if math.isnan(x) else x)
df_train_numbers['LotFrontage'].isnull().sum()
df_test_numbers['LotFrontage'].isnull().sum()
df_test_numbers['BsmtHalfBath']
test_data_nan_cols = ['BsmtFinSF1','BsmtFinSF2','BsmtUnfSF','TotalBsmtSF','BsmtFullBath','BsmtHalfBath','GarageCars','GarageArea']
for col in test_data_nan_cols:
    df_test_numbers[col] = df_test_numbers[col].apply(lambda x: int(np.mean(df_test_numbers[col])) if math.isnan(x) else x)
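The element-wise loops and `apply` calls above can also be written with pandas `fillna`, which handles all three fill strategies in one line each. A sketch on a small made-up frame (the column names just mirror the ones used above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'GarageYrBlt': [2000.0, np.nan, 1995.0],
    'YearBuilt':   [1998,   1980,   1995],
    'MasVnrArea':  [np.nan, 120.0,  0.0],
    'LotFrontage': [60.0,   np.nan, 80.0],
})

# fill one column from another column (values are aligned on the index)
df['GarageYrBlt'] = df['GarageYrBlt'].fillna(df['YearBuilt'])
# fill with a constant
df['MasVnrArea'] = df['MasVnrArea'].fillna(0)
# fill with the column mean
df['LotFrontage'] = df['LotFrontage'].fillna(df['LotFrontage'].mean())

print(df.isnull().sum().sum())  # 0
```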
# ### preprocessing
# 4. fill categorical NAs with the most frequent value of each column
# ---
# One problem here -> some categorical columns have a different number of levels in train and test (e.g. a feature with 5 levels in train but only 4 in test), so running get_dummies on each frame produces different column sets. Therefore:
#
#
# 5. find the columns that exist in train but not in test, and add them to test filled with 0 (and vice versa)
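As an alternative to the manual column loops used later in the post, `DataFrame.align` can pad the missing dummy columns of both frames in one call. A sketch on toy frames (the dummy column names are made up; the real frames would come out of `get_dummies`):

```python
import pandas as pd

train = pd.DataFrame({'A': [1, 2], 'Hall_5': [0, 1], 'SalePrice': [100, 200]})
test  = pd.DataFrame({'A': [3],    'Hall_4': [1]})

# keep the target out of the alignment so it is not added to test
target = train.pop('SalePrice')

# outer-join the two column sets; columns missing on either side
# are created and filled with 0
train, test = train.align(test, join='outer', axis=1, fill_value=0)

print(sorted(train.columns) == sorted(test.columns))  # True
```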
df_train_categorys.isnull().sum()
df_test_categorys.isnull().sum()
# drop Alley, PoolQC, Fence, MiscFeature (almost entirely NA)
df_train_categorys = df_train_categorys.drop(['Alley','PoolQC','Fence','MiscFeature'], axis=1)
df_test_categorys = df_test_categorys.drop(['Alley','PoolQC','Fence','MiscFeature'], axis=1)
# BsmtQual 37
# BsmtCond 37
# BsmtExposure 38
# BsmtFinType1 37
# BsmtFinType2 38
# GarageType 76
# GarageFinish 78
# GarageQual 78
# GarageCond 78
# FireplaceQu
# BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1, BsmtFinType2
# (and the Garage*/FireplaceQu columns below) are filled with 'No'
var_list = ['BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2',
'GarageType','GarageFinish','GarageQual','GarageCond','FireplaceQu']
for var in var_list:
    fill_no_list = []
    for v in df_train_categorys[var].values:
        if type(v) != str:
            fill_no_list.append('No')
        else:
            fill_no_list.append(v)
    df_train_categorys[var] = fill_no_list
    df_train_categorys[var] = df_train_categorys[var].astype('category')
for var in var_list:
    fill_no_list_2 = []
    for v in df_test_categorys[var].values:
        if type(v) != str:
            fill_no_list_2.append('No')
        else:
            fill_no_list_2.append(v)
    df_test_categorys[var] = fill_no_list_2
    df_test_categorys[var] = df_test_categorys[var].astype('category')
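The rebuild-as-list loops above can also be expressed with `cat.add_categories` plus `fillna`. The extra step is needed because filling a categorical series with a value that is not one of its categories raises an error. A sketch on a toy series:

```python
import pandas as pd

s = pd.Series(['Gd', None, 'TA'], dtype='category')

# register 'No' as a category first, then fill the NaNs with it
s = s.cat.add_categories('No').fillna('No')

print(s.tolist())  # ['Gd', 'No', 'TA']
```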
# Electrical has 1 NaN -> fill with the most common value, SBrkr
df_train_categorys.loc[df_train_categorys['Electrical'].isnull(), 'Electrical'] = 'SBrkr'
# MasVnrType has 8 NaNs -> fill with 'None'
df_train_categorys.loc[df_train_categorys['MasVnrType'].isnull(), 'MasVnrType'] = 'None'
# handle the remaining NaNs in the test data
# MSZoning 4 -> fill according to OverallQual: ['C (all)','C (all)','RL','C (all)']
# Utilities 2 -> fill with AllPub
# Exterior1st 1 -> VinylSd
# Exterior2nd 1 -> VinylSd
# MasVnrType 16 -> Stone when OverallQual is 6 or higher, otherwise BrkCmn
# KitchenQual 1 -> TA
# Functional 2 -> Typ
# SaleType 1 -> WD
# eyeball the SaleType distribution to pick the most common value
df_test_categorys['SaleType'].value_counts().plot(kind='bar')
df_test_categorys.loc[df_test_categorys['MSZoning'].isnull(), 'MSZoning'] = ['C (all)','C (all)','RL','C (all)']
df_test_categorys.loc[df_test_categorys['Utilities'].isnull(), 'Utilities'] = 'AllPub'
df_test_categorys.loc[df_test_categorys['Exterior1st'].isnull(), 'Exterior1st'] = 'VinylSd'
df_test_categorys.loc[df_test_categorys['Exterior2nd'].isnull(), 'Exterior2nd'] = 'VinylSd'
MasVnrType_list = []
for v in df_test_numbers['OverallQual'][df_test_categorys['MasVnrType'].isnull()].values:
    if v > 5:
        MasVnrType_list.append('Stone')
    else:
        MasVnrType_list.append('BrkCmn')
df_test_categorys.loc[df_test_categorys['MasVnrType'].isnull(), 'MasVnrType'] = MasVnrType_list
df_test_categorys.loc[df_test_categorys['KitchenQual'].isnull(), 'KitchenQual'] = 'TA'
df_test_categorys.loc[df_test_categorys['Functional'].isnull(), 'Functional'] = 'Typ'
df_test_categorys.loc[df_test_categorys['SaleType'].isnull(), 'SaleType'] = 'WD'
### merge the numeric data and the categorical data
new_train_data = pd.concat([df_train_numbers, df_train_categorys], axis=1)
new_test_data = pd.concat([df_test_numbers, df_test_categorys], axis=1)
### category one hot encoding
new_train_data = pd.get_dummies(new_train_data)
new_test_data = pd.get_dummies(new_test_data)
### add the columns missing from either train or test, filled with 0
empty_cols = []
for col in new_train_data.columns:
    if col not in new_test_data.columns:
        empty_cols.append(col)
# the target SalePrice only exists in train, so skip it explicitly
# rather than adding it to the test data
for col in empty_cols:
    if col == 'SalePrice':
        continue
    new_test_data[col] = 0
empty_cols = []
for col in new_test_data.columns:
    if col not in new_train_data.columns:
        empty_cols.append(col)
for col in empty_cols:
    new_train_data[col] = 0
new_train_data.info()
new_test_data.info()
# remove outliers from the train data
plt.scatter(new_train_data['SalePrice'], new_train_data['PoolArea'])
new_train_data = new_train_data.drop(new_train_data.sort_values('GrLivArea', ascending=False)[:2]['GrLivArea'].index)
That's it for the preprocessing code.
Handling the NA values will differ from person to person, but I read the data description and filled them in according to my own interpretation.
Next, I need to work on extracting good features from the preprocessed data.