nlp - How should <s> and </s> be treated when calculating a unigram LM?


I am a beginner in NLP, and I'm confused about how to treat the <s> and </s> symbols when calculating counts for a unigram model. Should I count them or ignore them?

If I understand correctly, <s> and </s> are special (fake) unigrams that mark the first and last positions (actually, the pre-first and after-last positions) of each text. There seems to be no need for them in a unigram model, because every string contains these unigrams, so they provide no additional information.
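To make my reasoning concrete, here is a minimal sketch of what I mean. The function name `unigram_counts`, the `pad` flag, and the toy corpus are my own illustration, and I am assuming simple whitespace tokenization. It counts unigrams with and without the markers; with the markers, each of them simply occurs once per sentence.

```python
from collections import Counter

def unigram_counts(sentences, pad=False):
    """Count unigrams; optionally wrap each sentence in <s> ... </s>."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()  # assumption: whitespace tokenization
        if pad:
            tokens = ["<s>"] + tokens + ["</s>"]
        counts.update(tokens)
    return counts

corpus = ["hello world", "hello"]  # toy data, purely illustrative
plain = unigram_counts(corpus)
padded = unigram_counts(corpus, pad=True)

# Each marker appears exactly once per sentence, so its count
# is just the number of sentences in the corpus.
print(plain)
print(padded)
```

The counts of the real words are the same in both cases; only the total number of tokens changes, which is the part I am unsure how to handle.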

Such special unigrams can be useful for higher-order n-grams. For example, they allow the 1-word string hello to yield 2 bigrams: <s> hello and hello </s>, or 3 trigrams: <s0> <s1> hello, <s1> hello </s1>, and hello </s1> </s0>.
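The padding scheme I have in mind for higher orders could be sketched like this. The helper names `pad` and `ngrams` are hypothetical, and I am assuming n - 1 markers on each side, unnumbered for bigrams and numbered as in the trigram example for larger n.

```python
def pad(tokens, n):
    """Add n - 1 start markers and n - 1 end markers around the tokens."""
    k = n - 1
    if k == 1:
        # Bigram case: plain <s> and </s>
        starts, ends = ["<s>"], ["</s>"]
    else:
        # Higher orders: numbered markers, closed in reverse order
        starts = [f"<s{i}>" for i in range(k)]
        ends = [f"</s{i}>" for i in reversed(range(k))]
    return starts + list(tokens) + ends

def ngrams(tokens, n):
    """Return all n-grams of the padded token sequence as tuples."""
    padded = pad(tokens, n)
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]

unigrams = ngrams(["hello"], 1)  # no padding is added when n == 1
bigrams = ngrams(["hello"], 2)
trigrams = ngrams(["hello"], 3)
print(unigrams, bigrams, trigrams, sep="\n")
```

With this scheme the unigram case gets no markers at all, which matches my intuition above, but I don't know whether that is the standard convention.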

