nlp - How to treat <s> and </s> when calculating a unigram LM?
I'm a beginner in NLP, and I'm confused about how to treat the <s> and </s> symbols when calculating counts for a unigram model. Should I count them or ignore them?
If I understand correctly, <s> and </s> are special (fake) unigrams that stand for the first and last unigrams (actually, pre-first and after-last) of each text. There is no need for them in a unigram model, because every string contains these unigrams, so they provide no additional information.
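As a rough illustration of ignoring the markers, here is a minimal sketch of unigram estimation. The function name `unigram_probs`, the whitespace tokenization, and the exact marker spellings are assumptions made for the example, not part of any particular toolkit.

```python
from collections import Counter

# Assumed spellings of the boundary markers; adjust to match your corpus.
BOUNDARY = {"<s>", "</s>"}

def unigram_probs(sentences):
    """Maximum-likelihood unigram probabilities, skipping boundary markers."""
    counts = Counter(
        tok
        for sent in sentences
        for tok in sent.split()      # naive whitespace tokenization
        if tok not in BOUNDARY       # the markers are not counted
    )
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

probs = unigram_probs(["<s> hello world </s>", "<s> hello </s>"])
```

Here the probabilities are normalized over real words only, so the markers take no probability mass.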
Such special unigrams can be useful in the case of higher-order n-grams: for example, they allow you to extract from the 1-word string hello 2 bigrams: <s> hello, hello </s>, or 3 trigrams: <s0> <s1> hello, <s1> hello </s1>, hello </s1> </s0>.
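The padding described above can be sketched in a few lines. The helper name `padded_ngrams` is hypothetical, and the numbered markers for n > 2 simply follow the notation used in this answer; other toolkits may repeat a single <s> / </s> symbol instead.

```python
def padded_ngrams(tokens, n):
    """Return the n-grams of `tokens` after adding n-1 markers on each side."""
    if n == 1:
        left, right = [], []                 # unigrams: no padding needed
    elif n == 2:
        left, right = ["<s>"], ["</s>"]      # bigrams: one marker per side
    else:
        # Numbered markers, outermost first on the left and last on the right.
        left = [f"<s{i}>" for i in range(n - 1)]
        right = [f"</s{i}>" for i in reversed(range(n - 1))]
    padded = left + list(tokens) + right
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]

bigrams = padded_ngrams(["hello"], 2)
trigrams = padded_ngrams(["hello"], 3)
```

With this padding a 1-word string yields 2 bigrams and 3 trigrams, while for n = 1 the function returns just the word itself, which is why the markers add nothing to a unigram model.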