nlp - How to treat <s> and </s> when calculating a unigram LM?
I am a beginner in NLP, and I'm confused about how to treat the <s> and </s> symbols when calculating counts for a unigram model. Should I count them or ignore them?
If I understand correctly, <s> and </s> are special (fake) unigrams that mark the start and end of each text (actually, the position before the first word and the position after the last word). There is no need for them in a unigram model, because every string contains these unigrams, so they provide no additional information.
Such special unigrams can be useful for higher-order n-grams: for example, they allow a 1-word string hello to be split into 2 bigrams, <s> hello and hello </s>, or 3 trigrams: <s0> <s1> hello, <s1> hello </s1>, and hello </s1> </s0>.
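To make the padding idea above concrete, here is a minimal Python sketch. It assumes the common convention of adding n-1 boundary symbols on each side, so a unigram model (n = 1) gets no padding at all. The function name `padded_ngrams` is hypothetical, and for simplicity it repeats plain <s> and </s> instead of the numbered <s0>, <s1> variants used in the trigram example.

```python
def padded_ngrams(tokens, n):
    """Return the n-grams of `tokens` after boundary padding.

    Assumption: n - 1 copies of "<s>" are prepended and n - 1 copies of
    "</s>" are appended, so for n = 1 nothing is added.
    """
    pad = n - 1
    seq = ["<s>"] * pad + list(tokens) + ["</s>"] * pad
    # Slide a window of width n over the padded sequence.
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, padded_ngrams(["hello"], n))
```

With this convention the 1-word string hello yields 1 unigram, 2 bigrams and 3 trigrams, which matches the counts in the question.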