twitter - Train corpus of Tweets for Sentiment Analysis, using NLTK for Python
I'm trying to train my own corpora for sentiment analysis, using NLTK for Python. I have two text files: one has 25k positive tweets, separated per line, and the other one has 25k negative tweets.
I am following method 2 from a Stack Overflow article.
When I run this code to create the corpora:
    import string
    from itertools import chain
    from nltk.corpus import stopwords
    from nltk.probability import FreqDist
    from nltk.classify import NaiveBayesClassifier as nbc
    from nltk.corpus import CategorizedPlaintextCorpusReader
    import nltk

    # Root directory holding the pos/ and neg/ subfolders
    mydir = 'c:\users\gerbuiker\desktop\sentiment analyse\my_movie_reviews'

    # The category is taken from the leading folder name of each file id
    mr = CategorizedPlaintextCorpusReader(mydir, r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*', encoding='ascii')
    stop = stopwords.words('english')
    # One (tokens, label) pair per file, without stop words and punctuation
    documents = [([w for w in mr.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]

    # Use the first 100 keys of the frequency distribution as features
    word_features = FreqDist(chain(*[i for i,j in documents]))
    word_features = word_features.keys()[:100]

    # 90% of the documents for training, the rest for testing
    numtrain = int(len(documents) * 90 / 100)
    train_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[:numtrain]]
    test_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[numtrain:]]

    classifier = nbc.train(train_set)
    print nltk.classify.accuracy(classifier, test_set)
    classifier.show_most_informative_features(5)
I receive this error message:
    c:\users\gerbuiker\anaconda\python.exe "c:/users/gerbuiker/desktop/sentiment analyse/corpus_pos_neg/createcorpus.py"
    Traceback (most recent call last):
      File "c:/users/gerbuiker/desktop/sentiment analyse/corpus_pos_neg/createcorpus.py", line 23, in <module>
        documents = [([w for w in mr.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]
      File "c:\users\gerbuiker\appdata\roaming\python\python27\site-packages\nltk\corpus\reader\util.py", line 336, in iterate_from
        assert self._len is not None
    AssertionError

    Process finished with exit code 1
Does anyone know how to fix this?
I'm not 100% positive, since I'm not on a Windows machine to test at the moment, but I think you may be running into a difference in path slash direction between @alvas's original example and your adaptation for Windows.
Specifically, you use 'c:\users\gerbuiker\desktop\sentiment analyse\my_movie_reviews' while the example uses '/home/alvas/my_movie_reviews'. That part is fine, but you then attempt to re-use the cat_pattern regex r'(neg|pos)/.*', which will match the slash in his paths but reject the one in yours.
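One way to check this guess without touching NLTK is to run the pattern against both separator styles by hand. The sketch below assumes the reader applies cat_pattern with re.match. The file ids, the helper category_of, and the variant pattern CAT_PATTERN_ANY_SEP are made up for illustration. Whether NLTK really hands back backslash-separated file ids on Windows is not verified here, so print mr.fileids() first to see what you actually get.

```python
import re

# Pattern copied from the question: requires a forward slash after the category.
CAT_PATTERN = r'(neg|pos)/.*'

# Hypothetical variant (not from the original example) that accepts / or \.
CAT_PATTERN_ANY_SEP = r'(neg|pos)[/\\].*'


def category_of(fileid, pattern):
    """Return the captured category name, or None if the pattern does not match."""
    match = re.match(pattern, fileid)
    return match.group(1) if match else None


# Hypothetical file ids, one per separator style.
posix_id = 'pos/tweets.txt'
windows_id = 'neg\\tweets.txt'

for fileid in (posix_id, windows_id):
    for pattern in (CAT_PATTERN, CAT_PATTERN_ANY_SEP):
        # Show which combinations yield a category and which yield None.
        print((fileid, pattern, category_of(fileid, pattern)))
```

If the file ids do turn out to contain backslashes, the same assumption would also affect i.split('/')[0] in your list comprehension, so that split would need the same treatment as the regex.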