Data Analysis 2020. 2. 27. 22:09

https://www.kaggle.com/c/nlp-getting-started/overview

Real or Not? NLP with Disaster Tweets
Predict which Tweets are about real disasters and which ones are not

1. Evaluation metric: F1 score (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html, sklearn.metrics.f1_score)
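
A quick local sanity check of the metric with scikit-learn (a minimal sketch with made-up 0/1 labels, not actual competition data):

from sklearn.metrics import f1_score

y_true = [1, 0, 1, 1, 0]  # hypothetical ground-truth labels (1 = real disaster)
y_pred = [1, 0, 0, 1, 1]  # hypothetical predictions
print(f1_score(y_true, y_pred))  # harmonic mean of precision and recall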

 

I have only ever done text analysis in R, so this time I'm taking it on with Python!

posted by 초코렛과자
Data Analysis 2020. 1. 4. 11:50

https://www.kaggle.com/c/house-prices-advanced-regression-techniques/overview

House Prices: Advanced Regression Techniques
Predict sales prices and practice feature engineering, RFs, and gradient boosting

This is a house price prediction competition currently running on Kaggle.

I felt that my analysis skills needed more practice, so I decided to take on this competition.

Today I'm going to post the preprocessing step.

 

# coding: utf-8

# ### Data
# 
# 1. train data shape
#  - 81 columns
#  - 1460 rows
# 2. target
#  - SalePrice
# 3. test data shape
#  - 80 columns
#  - 1459 rows
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import math
#pd.set_option('display.max_columns', 100)
#pd.set_option('display.max_rows', 100)

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
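
# Quick sanity check against the shapes listed above (train: 1460 x 81, test: 1459 x 80)
print(train.shape)
print(test.shape)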

# ### preprocessing
# 1. Split the data into categorical columns and numeric columns
# 2. Drop the Id column
# 3. Fill NAs in the numeric data with the mean of each column
# ---
# Caution: some columns are numeric in form but should really be treated as categorical
# (see the data description file, data_description.txt)
# 
# -> MSSubClass, OverallCond, KitchenAbvGr, BedroomAbvGr, TotRmsAbvGrd

df_train = train.drop(['Id'],axis=1)
# Keep the Id column separately, since it has to be added back to the submission result
test_ids = test['Id']
df_test = test.drop(['Id'],axis=1)
df_num_to_category_cols = ['MSSubClass', 'OverallCond', 'KitchenAbvGr', 'BedroomAbvGr','TotRmsAbvGrd']
df_train_numbers = df_train[df_train.columns[df_train.dtypes != object]]
df_train_numbers = df_train_numbers.drop(df_num_to_category_cols, axis=1)
df_test_numbers = df_test[df_test.columns[df_test.dtypes != object]]
df_test_numbers = df_test_numbers.drop(df_num_to_category_cols, axis=1)

# Categorical data: the object-dtype columns plus the numeric columns that are really categories
df_train_categorys = df_train.select_dtypes(['object']).apply(lambda x: x.astype('category'))
df_train_categorys = pd.concat([df_train_categorys,
                         df_train[df_num_to_category_cols]]
                        , axis=1)
df_train_categorys = df_train_categorys.apply(lambda x: x.astype('category'))

df_test_categorys = df_test.select_dtypes(['object']).apply(lambda x: x.astype('category'))
df_test_categorys = pd.concat([df_test_categorys,
                         df_test[df_num_to_category_cols]]
                        , axis=1)
df_test_categorys = df_test_categorys.apply(lambda x: x.astype('category'))
df_train_categorys.head()
df_test_numbers.isnull().sum()

# GarageYrBlt -> fill with YearBuilt
# MasVnrArea  -> fill with 0
# LotFrontage -> fill with the column mean
## The test data also has NAs in the columns below -> all filled with the column mean
# BsmtFinSF1
# BsmtFinSF2
# BsmtUnfSF
# TotalBsmtSF
# BsmtFullBath
# BsmtHalfBath
# GarageCars
# GarageArea

# For the rows where GarageYrBlt is missing, collect the corresponding YearBuilt values in row order
year_built_list = df_train_numbers[df_train_numbers['GarageYrBlt'].isnull() ==True]['YearBuilt'].tolist()
year_built_list_2 = df_test_numbers[df_test_numbers['GarageYrBlt'].isnull() ==True]['YearBuilt'].tolist()

# Walk through GarageYrBlt and replace each NaN with the matching YearBuilt value
garage_list = []
i = 0
for v in df_train_numbers['GarageYrBlt'].values:
    if math.isnan(v):
        garage_list.append(year_built_list[i])
        i += 1
    else:
        garage_list.append(v)

garage_list_2 = []
i = 0
for v in df_test_numbers['GarageYrBlt'].values:
    if math.isnan(v):
        garage_list_2.append(year_built_list_2[i])
        i += 1
    else:
        garage_list_2.append(v)

df_train_numbers['GarageYrBlt'] = garage_list

df_test_numbers['GarageYrBlt'] = garage_list_2

df_train_numbers['GarageYrBlt'].isnull().sum()

df_test_numbers['GarageYrBlt'].isnull().sum()
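
# (Aside) The same GarageYrBlt fill can be written without the explicit loops above, since
# Series.fillna accepts another Series aligned on the index. The NaNs are already filled at
# this point, so these two lines are effectively a no-op and just show the vectorized form:
df_train_numbers['GarageYrBlt'] = df_train_numbers['GarageYrBlt'].fillna(df_train_numbers['YearBuilt'])
df_test_numbers['GarageYrBlt'] = df_test_numbers['GarageYrBlt'].fillna(df_test_numbers['YearBuilt'])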

df_train_numbers['MasVnrArea'] = df_train_numbers['MasVnrArea'].apply(lambda x: 0 if math.isnan(x) else x)

df_test_numbers['MasVnrArea'] = df_test_numbers['MasVnrArea'].apply(lambda x: 0 if math.isnan(x) else x)

df_train_numbers['MasVnrArea'].isnull().sum()

df_test_numbers['MasVnrArea'].isnull().sum()

df_train_numbers['LotFrontage'] = df_train_numbers['LotFrontage'].apply(lambda x: int(np.mean(df_train_numbers['LotFrontage'])) if math.isnan(x) else x)

df_test_numbers['LotFrontage'] = df_test_numbers['LotFrontage'].apply(lambda x: int(np.mean(df_test_numbers['LotFrontage'])) if math.isnan(x) else x)

df_train_numbers['LotFrontage'].isnull().sum()

df_test_numbers['LotFrontage'].isnull().sum()


df_test_numbers['BsmtHalfBath']

test_data_nan_cols = ['BsmtFinSF1','BsmtFinSF2','BsmtUnfSF','TotalBsmtSF','BsmtFullBath','BsmtHalfBath','GarageCars','GarageArea']

for col in test_data_nan_cols:
    df_test_numbers[col] = df_test_numbers[col].apply(lambda x: int(np.mean(df_test_numbers[col])) if math.isnan(x) else x)
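
# (Aside) The per-column mean fill above can also be done in one call: DataFrame.fillna with the
# Series returned by mean() fills each column from its own mean (note: this keeps the float mean
# rather than the int() cast used above). The NaNs are already gone here, so this line is a no-op:
df_test_numbers[test_data_nan_cols] = df_test_numbers[test_data_nan_cols].fillna(df_test_numbers[test_data_nan_cols].mean())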


# ### preprocessing
# 4. Fill NAs in the categorical data with the most frequent value of each column
# ---
# A problem here -> some categorical columns have a different number of levels in train and test
# (for example, a column with 5 levels in train but only 4 in test), so running get_dummies on
# each dataset separately produces different columns. Therefore:
# 
# 
# 5. Find the columns that exist in the train data but not in the test data, and add them to the
#    test data filled with 0

df_train_categorys.isnull().sum()

df_test_categorys.isnull().sum()

# Drop Alley, PoolQC, Fence, MiscFeature (these are almost entirely NA)

df_train_categorys = df_train_categorys.drop(['Alley','PoolQC','Fence','MiscFeature'], axis=1)
df_test_categorys = df_test_categorys.drop(['Alley','PoolQC','Fence','MiscFeature'], axis=1)

# BsmtQual          37
# BsmtCond          37
# BsmtExposure      38
# BsmtFinType1      37
# BsmtFinType2      38
# GarageType        76
# GarageFinish      78
# GarageQual        78
# GarageCond        78
# FireplaceQu

# BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1, BsmtFinType2
# (and the garage columns and FireplaceQu) are filled with 'No'
var_list = ['BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2',
           'GarageType','GarageFinish','GarageQual','GarageCond','FireplaceQu']

for var in var_list:
    fill_no_list = []
    for v in df_train_categorys[var].values:
        if type(v) != str:
            fill_no_list.append('No')
        else:
            fill_no_list.append(v)
    df_train_categorys[var] = fill_no_list
    df_train_categorys[var] = df_train_categorys[var].astype('category')

for var in var_list:
    fill_no_list_2 = []
    for v in df_test_categorys[var].values:
        if type(v) != str:
            fill_no_list_2.append('No')
        else:
            fill_no_list_2.append(v)
    df_test_categorys[var] = fill_no_list_2
    df_test_categorys[var] = df_test_categorys[var].astype('category')
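
# (Aside) Because these columns are category dtype, a plain fillna('No') raises an error until
# 'No' is registered as a category. The loops above could be replaced with cat.add_categories,
# sketched here on a fresh copy of two raw train columns (basement_demo is only an illustration
# and is not used below):
basement_demo = train[['BsmtQual', 'BsmtCond']].astype('category')
basement_demo = basement_demo.apply(lambda col: col.cat.add_categories('No').fillna('No'))
print(basement_demo.isnull().sum())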

# Electrical has 1 NaN -> fill with the most frequent value, SBrkr
df_train_categorys.loc[df_train_categorys['Electrical'].isnull(), 'Electrical'] = 'SBrkr'

# MasVnrType has 8 NaNs -> fill with 'None'
df_train_categorys.loc[df_train_categorys['MasVnrType'].isnull(), 'MasVnrType'] = 'None'

# Handle the remaining NaNs in the test data
# MSZoning          4 -> fill with ['C (all)','C (all)','RL','C (all)'] based on OverallQual
# Utilities         2 -> fill with AllPub
# Exterior1st       1 -> VinylSd
# Exterior2nd       1 -> VinylSd
# MasVnrType       16 -> Stone if OverallQual is 6 or higher, otherwise BrkCmn
# KitchenQual       1 -> TA
# Functional        2 -> Typ
# SaleType          1 -> WD


# Check the SaleType distribution to pick a fill value (WD is by far the most common)
df_test_categorys['SaleType'].value_counts()

df_test_categorys.loc[df_test_categorys['MSZoning'].isnull(), 'MSZoning'] = ['C (all)','C (all)','RL','C (all)']
df_test_categorys.loc[df_test_categorys['Utilities'].isnull(), 'Utilities'] = 'AllPub'
df_test_categorys.loc[df_test_categorys['Exterior1st'].isnull(), 'Exterior1st'] = 'VinylSd'
df_test_categorys.loc[df_test_categorys['Exterior2nd'].isnull(), 'Exterior2nd'] = 'VinylSd'

# Choose MasVnrType by OverallQual: Stone for 6 or higher, BrkCmn otherwise
MasVnrType_list = []
for v in df_test_numbers['OverallQual'][df_test_categorys['MasVnrType'].isnull() == True].values:
    if v > 5:
        MasVnrType_list.append('Stone')
    else:
        MasVnrType_list.append('BrkCmn')

df_test_categorys.loc[df_test_categorys['MasVnrType'].isnull(), 'MasVnrType'] = MasVnrType_list
df_test_categorys.loc[df_test_categorys['KitchenQual'].isnull(), 'KitchenQual'] = 'TA'
df_test_categorys.loc[df_test_categorys['Functional'].isnull(), 'Functional'] = 'Typ'
df_test_categorys.loc[df_test_categorys['SaleType'].isnull(), 'SaleType'] = 'WD'

### Combine the numeric data and the categorical data
new_train_data = pd.concat([df_train_numbers, df_train_categorys], axis=1)
new_test_data = pd.concat([df_test_numbers, df_test_categorys], axis=1)

### category one hot encoding
new_train_data = pd.get_dummies(new_train_data)
new_test_data = pd.get_dummies(new_test_data)

### Fill in the columns that train and test do not share (with 0)
# Columns present in train but missing from test; this list also contains the SalePrice
# target, which must not be added to the test data
empty_cols = []
for col in new_train_data.columns:
    if col not in new_test_data.columns:
        empty_cols.append(col)

for col in empty_cols:
    if col == 'SalePrice':  # keep the target out of the test data
        continue
    new_test_data[col] = 0

empty_cols = []
for col in new_test_data.columns:
    if col not in new_train_data.columns:
        empty_cols.append(col)

for col in empty_cols:
    new_train_data[col] = 0
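
# (Aside) pandas can do this column alignment in a single call with DataFrame.align:
# join='outer' keeps the union of columns and fill_value=0 fills the ones missing on either
# side. It also copies the SalePrice target into the test frame, so that column is dropped
# again. The aligned_* frames are only a sketch and are not used below:
aligned_train, aligned_test = new_train_data.align(new_test_data, join='outer', axis=1, fill_value=0)
aligned_test = aligned_test.drop('SalePrice', axis=1)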

new_train_data.info()
new_test_data.info()

# Remove outliers from the train data
plt.scatter(new_train_data['SalePrice'], new_train_data['PoolArea'])
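# (Aside) The two rows dropped below are the houses with the largest GrLivArea, which in this
# dataset are commonly treated as outliers (very large living area but comparatively low price);
# a GrLivArea-vs-SalePrice scatter makes them easier to spot than the PoolArea plot above:
plt.scatter(new_train_data['GrLivArea'], new_train_data['SalePrice'])
plt.xlabel('GrLivArea')
plt.ylabel('SalePrice')
plt.show()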
new_train_data = new_train_data.drop(new_train_data.sort_values('GrLivArea', ascending=False)[:2]['GrLivArea'].index)

 

That's it for the preprocessing code for now.

How to handle the NA values will differ from person to person, but I read the data description and filled them in as seemed reasonable based on my own interpretation.

Next, I need to work on extracting appropriate features from the preprocessed data.

 

 

posted by 초코렛과자