
[Basics] 3. The Natural Language Processing Pipeline - Part-of-Speech Tagging to Stopword Removal (Including the NLTK Part-of-Speech Tag List)

DongJin Jeong 2021. 1. 1. 22:49

1. Part-of-Speech Tagging

Part-of-speech tagging (POS tagging) is the task of assigning a part of speech to each token produced by word tokenization.

Implementation

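import nltk

# Requires the NLTK POS tagger model, e.g. nltk.download('averaged_perceptron_tagger')
# (the resource name may differ depending on the NLTK version).

# A list of tokenized sentences: each inner list holds the tokens of one sentence.
test_text = [['All', 'rights', 'reserved', '.']]

def POS_tagging(token_list):
    POS_list = list()

    # Tag one sentence at a time; nltk.pos_tag expects a list of tokens.
    for sentence in token_list:
        POS_list.append(nltk.pos_tag(sentence))

    return POS_list

print(POS_tagging(test_text))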

Output

[[('All', 'DT'), ('rights', 'NNS'), ('reserved', 'VBN'), ('.', '.')]]

POS tagging returns each token as a (token, tag) pair, with the tag written as an abbreviation. The NLTK tag list below gives the meaning of each abbreviation.

NLTK Part-of-Speech tag list
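
The tags returned by nltk.pos_tag follow the Penn Treebank convention. As a minimal sketch, the few tags that appear in this post can be decoded with a small lookup table. TAG_MEANINGS and describe_tags are hypothetical names introduced here for illustration, not part of NLTK, and the table covers only a handful of tags rather than the full list.

```python
# Illustrative subset of Penn Treebank tags; this is NOT the full NLTK tag list.
TAG_MEANINGS = {
    'DT': 'determiner',
    'NN': 'noun, singular or mass',
    'NNS': 'noun, plural',
    'VB': 'verb, base form',
    'VBN': 'verb, past participle',
    'JJ': 'adjective',
    'RB': 'adverb',
    '.': 'sentence-final punctuation',
}

def describe_tags(tagged_sentence):
    """Return (token, tag, meaning) triples; tags outside the subset map to 'unknown'."""
    return [(token, tag, TAG_MEANINGS.get(tag, 'unknown'))
            for token, tag in tagged_sentence]

# The tagged sentence from the output above.
tagged = [('All', 'DT'), ('rights', 'NNS'), ('reserved', 'VBN'), ('.', '.')]
for token, tag, meaning in describe_tags(tagged):
    print(f'{token}\t{tag}\t{meaning}')
```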

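2. Lemmatization

The dictionary definition of a lemma (headword) is "a word entered as a heading in a dictionary or similar reference and explained in plain terms." For example, 'am', 'are', and 'is' are different words, but they all share the root 'be'. In that case 'be' is the lemma of these words, and lemmatization is the step that reduces each token to its lemma.

Implementation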

import nltk

# Requires the WordNet data, e.g. nltk.download('wordnet').

test_text = [[('All', 'DT'), ('rights', 'NNS'), ('reserved', 'VBN'), ('.', '.')]]

def lemmatization(POS_list):
    lemma_list = list()
    lemmatizer = nltk.stem.WordNetLemmatizer()
    # nltk.pos_tag marks adjectives with tags that start with J, so 'j' must be changed to 'a' before lemmatizing.
    func_j2a = lambda x : x if x != 'j' else 'a'
    for sentence in POS_list:
        # lemmatize() can be told the token's part of speech. Without it the token is treated as a noun, which can give a wrong result.
        # The parts of speech passed here are verb, adjective, noun, and adverb, given as v, a, n, r.
        # Tokens with any other tag (here 'All' and '.') are dropped from the output.
        pos_contraction = [(token, func_j2a(POS.lower()[0])) for token, POS in sentence if POS[0] in ['V', 'J', 'N', 'R']]
        lemma_list.append([lemmatizer.lemmatize(token, POS) for token, POS in pos_contraction])

    return lemma_list

print(lemmatization(test_text))

Output

[['right', 'reserve']]
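
The conversion from Penn Treebank tags to WordNet codes described in the code comments can also be written as a standalone helper. This is only a sketch: penn_to_wordnet is a hypothetical name, and returning None for tags without a WordNet counterpart is an assumption that leaves it to the caller to drop or keep those tokens.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS code, or None if there is no counterpart."""
    # Only the first letter matters: J* -> adjective, N* -> noun, V* -> verb, R* -> adverb.
    mapping = {'J': 'a', 'N': 'n', 'V': 'v', 'R': 'r'}
    return mapping.get(tag[:1].upper())

tagged = [('All', 'DT'), ('rights', 'NNS'), ('reserved', 'VBN'), ('.', '.')]

# Keep only tokens whose tag has a WordNet counterpart, paired with that code.
convertible = [(token, penn_to_wordnet(tag)) for token, tag in tagged
               if penn_to_wordnet(tag) is not None]
print(convertible)  # [('rights', 'n'), ('reserved', 'v')]
```

Each resulting pair can then be passed to lemmatizer.lemmatize(token, pos) in the same way as in the implementation above.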

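3. Stopword Removal

A stopword is a token that carries little meaningful information. Words such as I, your, once, and that appear frequently, but they often contribute little to interpreting a sentence. Removing these stopwords can make later natural language processing steps more efficient.

Implementation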

import nltk

# Requires the stopword corpus, e.g. nltk.download('stopwords').

test_text = [['above', 'copyright', 'notice', 'permission', 'notice', 'be', 'include', 'copy', 'substantial', 'portion', 'Software']]

def remove_stopwords(lemma_list):
    no_stopword_list = list()
    # Load the English stopword list
    stop_words = set(nltk.corpus.stopwords.words('english'))

    for lemma_sentence in lemma_list:
        no_stopword_list.append([lemma for lemma in lemma_sentence if lemma not in stop_words])

    return no_stopword_list

print(remove_stopwords(test_text))

Output

[['copyright', 'notice', 'permission', 'notice', 'include', 'copy', 'substantial', 'portion', 'Software']]
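
The filter above compares tokens exactly, and the NLTK English stopword list is lowercase, so a capitalized token such as 'Above' at the start of a sentence would not be removed. The following is a sketch of a case-insensitive variant. SAMPLE_STOPWORDS is a small hand-picked stand-in so the example runs without downloading the NLTK corpus, and remove_stopwords_ci is a hypothetical name; neither comes from NLTK.

```python
# Hand-picked stand-in for nltk.corpus.stopwords.words('english'); an assumption for illustration.
SAMPLE_STOPWORDS = {'i', 'your', 'once', 'that', 'above', 'be'}

def remove_stopwords_ci(lemma_list, stop_words=SAMPLE_STOPWORDS):
    """Drop tokens whose lowercase form is in stop_words, sentence by sentence."""
    return [[lemma for lemma in sentence if lemma.lower() not in stop_words]
            for sentence in lemma_list]

sample = [['Above', 'copyright', 'notice', 'be', 'include', 'Software']]
print(remove_stopwords_ci(sample))  # [['copyright', 'notice', 'include', 'Software']]
```

Passing set(nltk.corpus.stopwords.words('english')) as stop_words would apply the same filter with the full NLTK list.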
