[DACON] StarCraft II game data analysis competition
Data Analysis
2020. 4. 6. 21:49
https://dacon.io/competitions/official/235583/overview/
[Game] Monthly DACON 3: Behavior Data Analysis Competition
Source: DACON - Data Science Competition (dacon.io)
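The DACON behavior data analysis competition ran throughout March.
I worked on it whenever I had spare time, but time was short and I ended up with a result that only barely beat the baseline.
Memory also blew up several times while I was analyzing this -_- The train data is about 5 GB and my machine has 4 GB of RAM, so the analysis was a struggle. This was the competition that made me think it is time to replace my PC.
The competition's discussion board had posts sharing ways to handle large data, and they helped a lot.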
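As an illustration of the kind of large-data handling mentioned above, here is a minimal sketch of chunked aggregation with pandas. It is one common approach and not necessarily what the discussion board recommended. The function name `count_events_in_chunks` and the toy CSV are made up for this example; the column names `game_id`, `player`, `event` and `time` are the ones used by the preprocessing code in this post.

```python
import io

import pandas as pd


def count_events_in_chunks(source, chunksize=100_000):
    """Count events per (game_id, player, event) without loading the whole CSV."""
    partial_counts = []
    # usecols skips columns that are not needed; chunksize bounds peak memory.
    reader = pd.read_csv(source, usecols=["game_id", "player", "event"], chunksize=chunksize)
    for chunk in reader:
        partial_counts.append(chunk.groupby(["game_id", "player", "event"]).size())
    # A game can straddle a chunk boundary, so the partial counts are summed.
    return pd.concat(partial_counts).groupby(level=["game_id", "player", "event"]).sum()


# Toy stand-in for train.csv (invented rows, for illustration only).
sample_csv = io.StringIO(
    "game_id,time,player,event\n"
    "0,0.00,0,Camera\n"
    "0,0.01,0,Camera\n"
    "0,0.02,0,Camera\n"
    "0,0.03,1,Ability\n"
    "1,0.00,0,Camera\n"
)
counts = count_events_in_chunks(sample_csv, chunksize=2)
print(counts)
```

Only the small per-chunk count tables are kept in memory, so the peak footprint depends on the chunk size rather than on the size of the file.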
I finished with a score that barely(?) cleared the baseline.
Below is the preprocessing code.
1. Read the data
2. Group by game_id
3. Parse the string data attached to each event (the event_contents column) and build one column per word (scv, build, and so on)
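The three steps above can be sketched on a toy frame as follows. This is a simplified illustration, not the code used for the submission: the `event_contents` strings are invented, and `tokenize`, `STOP_TOKENS` and `features` are names chosen for this example. The token rules mirror the ones in the preprocessing code (strip punctuation, drop "Location", drop tokens containing digits, skip "None", "at" and "nan").

```python
import re
import string
from collections import Counter

import pandas as pd

PUNCT_TABLE = str.maketrans("", "", string.punctuation)
STOP_TOKENS = {"", "None", "at", "nan"}


def tokenize(content):
    """Strip punctuation, then keep only tokens that contain no digits."""
    cleaned = str(content).translate(PUNCT_TABLE).replace("Location", "")
    return [t for t in cleaned.split(" ") if t not in STOP_TOKENS and not re.search(r"\d", t)]


# Step 1: a toy frame stands in for the real CSV (strings are invented).
toy = pd.DataFrame({
    "game_id": [0, 0, 0, 1],
    "player": [0, 0, 1, 0],
    "event_contents": ["(1360) - BuildSCV", "(1360) - BuildSCV", "(5A0) - Stop", "at (12.5, 30.0)"],
})

# Steps 2 and 3: group by game and player, then count tokens per group.
rows = []
for (game_id, player), group in toy.groupby(["game_id", "player"]):
    counts = Counter(tok for content in group["event_contents"] for tok in tokenize(content))
    rows.append({"game_id": game_id, **{f"{tok}_{player}": n for tok, n in counts.items()}})

# One row per game; tokens a player never produced count as 0.
features = pd.DataFrame(rows).groupby("game_id").sum().astype(int)
print(features)
```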
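Looking back, the time data probably had plenty of room to be used, and there was also the option of deriving each player's starting position from the camera coordinates. It is a bit of a pity that I did not get as far as applying those.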
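For reference, the starting-position idea could look roughly like the sketch below. It rests on assumptions that are not verified here: that camera rows have `event == "Camera"`, that their `event_contents` look like `at (x, y)`, and that rows are already in time order. The helper names `parse_xy` and `starting_positions` and the toy rows are hypothetical.

```python
import re

import pandas as pd

# Assumed coordinate format: "(x, y)" with plain decimal numbers.
COORD_PATTERN = re.compile(r"\(\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\)")


def parse_xy(content):
    """Return (x, y) as floats, or (nan, nan) when no coordinate pair is found."""
    match = COORD_PATTERN.search(str(content))
    if match is None:
        return (float("nan"), float("nan"))
    return (float(match.group(1)), float(match.group(2)))


def starting_positions(events):
    """First Camera coordinate per (game_id, player), taken in row order."""
    camera = events[events["event"] == "Camera"]
    first = camera.groupby(["game_id", "player"], as_index=False).first()
    xy = first["event_contents"].map(parse_xy)
    first["start_x"] = xy.map(lambda p: p[0])
    first["start_y"] = xy.map(lambda p: p[1])
    return first[["game_id", "player", "start_x", "start_y"]]


# Toy rows, invented for illustration.
toy_events = pd.DataFrame({
    "game_id": [0, 0, 0, 0],
    "player": [0, 1, 0, 1],
    "event": ["Camera", "Camera", "Camera", "Ability"],
    "event_contents": ["at (10.5, 20.0)", "at (150.0, 140.25)", "at (12.0, 22.0)", "(1360) - BuildSCV"],
})
positions = starting_positions(toy_events)
print(positions)
```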
import pandas as pd
import numpy as np
import string
import re
def remove_punct(text):
    # Strip all punctuation, then drop the literal word 'Location'.
    table = str.maketrans('', '', string.punctuation)
    # Alternatives tried earlier:
    #return text.translate(table)
    #return re.sub(r"\d+", " ", text.translate(table)).replace('Location','').replace('NW',' ')
    return text.translate(table).replace('Location', '')

def remove_include_number_string(target):
    # Split on spaces and keep only the tokens that contain no digit.
    return [s for s in remove_punct(str(target)).split(' ') if not re.search(r'\d', s)]
def create_events_dict(event_contents):
    # Collect every usable token across all event_contents strings.
    total_strings = []
    for eg in event_contents:
        for ts in remove_include_number_string(eg):
            if ts != '' and ts not in ['None', 'at', 'nan']:
                total_strings.append(ts)
    # Count each distinct token once instead of recomputing np.unique for every key.
    tokens, counts = np.unique(total_strings, return_counts=True)
    return dict(zip(tokens, counts))
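# Load the raw data; 'time' is not used in this preprocessing, so drop it right away.
train_origin = pd.read_csv('train.csv')
train_sample = train_origin.drop(['time'], axis=1)
# Category dtype keeps the heavily repeated columns compact in memory.
train_sample[['winner','player','species','event']] = train_sample[['winner','player','species','event']].astype('category')
# One (game_id, DataFrame) pair per match.
match_group = train_sample.groupby('game_id')
match_groups = [g for g in match_group]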
SPECIES_CODE = {'T': 1, 'P': 2}  # any other species is encoded as 3

# DataFrame.append was removed in pandas 2.0, so collect the rows and concatenate once.
match_frames = []
for game_id, match_df in match_groups:
    match_info = pd.DataFrame()
    for player, player_df in match_df.groupby('player'):
        player_info = player_df[['game_id', 'winner', 'player', 'species']].drop_duplicates()
        # Encode species as an integer and give the column a per-player name.
        player_info['species'] = player_info['species'].astype(str).map(lambda s: SPECIES_CODE.get(s, 3))
        player_info = player_info.rename(columns={'species': 'species_' + str(player)})
        player_info = player_info.drop('player', axis=1)
        # One column per event type, holding how often this player triggered it.
        for event_name, count in player_df['event'].value_counts().items():
            player_info[str(event_name) + '_' + str(player)] = count
        # Put both players of the match side by side on a single row.
        if 'game_id' in match_info.columns:
            match_info = pd.merge(match_info, player_info, on=['game_id', 'winner'], sort=True)
        else:
            match_info = player_info
    match_frames.append(match_info)
df = pd.concat(match_frames)
# Every '<token>_<player>' column seen so far, in first-seen order.
total_st_dict_keys = {}
event_rows = []
for game_id, match_df in match_groups:
    event_groups = [g for g in match_df.groupby('event')]
    # As written, only the first event group of each match is parsed.
    event = event_groups[0]
    total_events_dict_0 = create_events_dict(event[1][event[1]['player'] == 0]['event_contents'])
    total_events_dict_1 = create_events_dict(event[1][event[1]['player'] == 1]['event_contents'])
    # Union of both players' tokens (the original list only repeated player 0's keys).
    keys = list(total_events_dict_0.keys())
    keys.extend(key for key in total_events_dict_1.keys() if key not in total_events_dict_0)
    # Register a column for each player for every token.
    for key in keys:
        total_st_dict_keys[key + '_0'] = 0
        total_st_dict_keys[key + '_1'] = 0
    # Start the row at 0 for all known columns, then fill in this match's counts.
    row = {column: 0 for column in total_st_dict_keys}
    for key, count in total_events_dict_0.items():
        row[key + '_0'] = count
    for key, count in total_events_dict_1.items():
        row[key + '_1'] = count
    event_rows.append(row)
# Rows follow the order of match_groups; columns first seen in a later match are NaN in earlier rows.
df_events = pd.DataFrame(event_rows)
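The per-game loops above build the event-count columns one row at a time, which is slow on a file this size, and `DataFrame.append` no longer exists in pandas 2.0 and later. As a possible alternative, not the approach used for the submission, the same kind of `<event>_<player>` count columns can be produced in a single pass with `pd.crosstab`. The function name `event_count_features` and the toy rows are made up for this sketch.

```python
import pandas as pd


def event_count_features(events):
    """One row per game_id, one '<event>_<player>' column per observed pair."""
    wide = pd.crosstab(events["game_id"], [events["event"], events["player"]])
    # Flatten the (event, player) column MultiIndex into plain names.
    wide.columns = [f"{event}_{player}" for event, player in wide.columns]
    return wide


# Toy rows, invented for illustration.
toy_events = pd.DataFrame({
    "game_id": [0, 0, 0, 1, 1],
    "player": [0, 0, 1, 0, 1],
    "event": ["Camera", "Camera", "Ability", "Ability", "Camera"],
})
wide_counts = event_count_features(toy_events)
print(wide_counts)
```

`pd.crosstab` fills pairs that never occur in a game with 0, so no separate bookkeeping of the column vocabulary is needed. Only event and player combinations that occur somewhere in the data get a column.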