Python
2020. 2. 29. 21:09
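1. Remove URLs
import re

example = "New competition launched :https://www.kaggle.com/c/nlp-getting-started"
def remove_URL(text):
    # Match http(s):// links and bare www. links, up to the next whitespace
    url = re.compile(r'https?://\S+|www\.\S+')
    return url.sub(r'', text)
remove_URL(example)  # 'New competition launched :'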
2. Remove HTML tags
example = """<div>
<h1>Real or Fake</h1>
<p>Kaggle </p>
<a href="https://www.kaggle.com/c/nlp-getting-started">getting started</a>
</div>"""
def remove_html(text):
    # Non-greedy match of everything between '<' and '>' on a single line
    html = re.compile(r'<.*?>')
    return html.sub(r'', text)
print(remove_html(example))
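The regex above is fine for simple fragments, but `.` does not match a newline by default, so a tag that spans two lines is left in place, and entities such as `&amp;` are not decoded. As a rough alternative sketch (not the method used in this post), the standard library's `html.parser` can collect only the text nodes. The names `_TextExtractor` and `remove_html_parsed` are hypothetical, chosen just for this example.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        # convert_charrefs defaults to True, so '&amp;' arrives as '&'
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags; the tags themselves are skipped
        self.chunks.append(data)


def remove_html_parsed(text):
    # Hypothetical helper name, not part of the original post
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()
    return "".join(parser.chunks)


# The <a> tag is split across two lines on purpose
sample = '<p>Real &amp; Fake</p><a\nhref="https://www.kaggle.com">Kaggle</a>'
print(remove_html_parsed(sample))
```

For clean, single-line tags the regex version is shorter and faster, so this is worth the extra code only when the input is messy.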
3. Remove emoji
def remove_emoji(text):
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               u"\U00002702-\U000027B0"  # dingbats
                               u"\U000024C2-\U0001F251"  # enclosed characters; note: this wide range also covers CJK and Hangul
                               "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)
remove_emoji("Omg another Earthquake 😔😔")  # 'Omg another Earthquake '
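One caveat about the last range in the pattern, `\U000024C2-\U0001F251`: it spans everything from the enclosed letters up to the enclosed ideographs, which includes the Hiragana, CJK, and Hangul blocks. For English tweets this does not matter, but on Korean text it deletes the text itself. The sketch below shows the effect and a narrower variant. The narrower set of ranges is an assumption made for this example, not a complete emoji definition, and `remove_emoji_narrow` is a hypothetical name.

```python
import re

# The broad range from the pattern above, isolated to show its side effect
BROAD_RANGE = re.compile("[\U000024C2-\U0001F251]+")

# Narrower variant (an assumption for illustration, not an exhaustive emoji list)
EMOJI_NARROW = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # regional indicator symbols (flags)
    "\U00002702-\U000027B0"  # dingbats
    "\U00002600-\U000026FF"  # miscellaneous symbols
    "\U0001F900-\U0001F9FF"  # supplemental symbols and pictographs
    "]+"
)


def remove_emoji_narrow(text):
    # Hypothetical helper name, not part of the original post
    return EMOJI_NARROW.sub("", text)


print(repr(BROAD_RANGE.sub("", "지진 발생")))        # Hangul is removed as well
print(repr(remove_emoji_narrow("지진 발생 😔😔")))   # only the emoji are removed
```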
4. Remove punctuation
import string
def remove_punct(text):
    # Translation table that deletes every character in string.punctuation
    table = str.maketrans('', '', string.punctuation)
    return text.translate(table)
example = "I am a #king"
print(remove_punct(example))  # I am a king
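The four steps can be chained into one cleaning function. The following is a minimal sketch under a few assumptions: the name `clean_text` is hypothetical, the emoji pattern leaves out the broad `\U000024C2-\U0001F251` range, and the final whitespace normalization is an extra step that is not in the post. The order matters: HTML tags go first, because the URL pattern `\S+` would otherwise swallow the `">` that follows a link inside an `href` attribute, and punctuation goes last, because stripping `://` earlier would stop the URL pattern from matching.

```python
import re
import string

HTML_RE = re.compile(r"<.*?>")
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMOJI_RE = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # flags
    "\U00002702-\U000027B0"  # dingbats
    "]+"
)
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(text):
    # Hypothetical helper combining the four steps of this post
    text = HTML_RE.sub("", text)         # 1. tags first, so href URLs vanish with them
    text = URL_RE.sub("", text)          # 2. remaining bare URLs
    text = EMOJI_RE.sub("", text)        # 3. emoji
    text = text.translate(PUNCT_TABLE)   # 4. punctuation last
    return " ".join(text.split())        # extra step: collapse leftover whitespace


tweet = "<p>Omg another #Earthquake 😔 https://t.co/abc</p>"
print(clean_text(tweet))  # Omg another Earthquake
```

With pandas, a function like this would typically be applied to a text column with `df["text"].apply(clean_text)`, assuming a column of that name exists.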