twitter - Train corpus of Tweets for Sentiment Analysis, using NLTK for Python

- July 15, 2011

I'm trying to train my own corpora for sentiment analysis, using NLTK for Python. I have two text files: one has 25k positive tweets, one per line, and the other has 25k negative tweets.

I followed a Stack Overflow article (method 2).

When I run this code to create the corpora:

    import string
    from itertools import chain

    from nltk.corpus import stopwords
    from nltk.probability import FreqDist
    from nltk.classify import NaiveBayesClassifier as nbc
    from nltk.corpus import CategorizedPlaintextCorpusReader
    import nltk

    mydir = 'c:\users\gerbuiker\desktop\sentiment analyse\my_movie_reviews'

    mr = CategorizedPlaintextCorpusReader(mydir, r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*', encoding='ascii')
    stop = stopwords.words('english')
    documents = [([w for w in mr.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]

    word_features = FreqDist(chain(*[i for i, j in documents]))
    word_features = word_features.keys()[:100]

    numtrain = int(len(documents) * 90 / 100)
    train_set = [({i: (i in tokens) for i in word_features}, tag) for tokens, tag in documents[:numtrain]]
    test_set = [({i: (i in tokens) for i in word_features}, tag) for tokens, tag in documents[numtrain:]]

    classifier = nbc.train(train_set)
    print nltk.classify.accuracy(classifier, test_set)
    classifier.show_most_informative_features(5)

I receive this error message:

    c:\users\gerbuiker\anaconda\python.exe "c:/users/gerbuiker/desktop/sentiment analyse/corpus_pos_neg/createcorpus.py"
    Traceback (most recent call last):
      File "c:/users/gerbuiker/desktop/sentiment analyse/corpus_pos_neg/createcorpus.py", line 23, in <module>
        documents = [([w for w in mr.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]
      File "c:\users\gerbuiker\appdata\roaming\python\python27\site-packages\nltk\corpus\reader\util.py", line 336, in iterate_from
        assert self._len is not None
    AssertionError

    Process finished with exit code 1

Does anyone know how to fix this?

I'm not 100% positive, as I'm not on a Windows machine to test at the moment, but I think you may be catching a difference in path slash direction between @alvas' original example and your adaptation for Windows.

Specifically, you use 'c:\users\gerbuiker\desktop\sentiment analyse\my_movie_reviews' while the example uses '/home/alvas/my_movie_reviews'. That part is fine, but you then attempt to re-use the cat_pattern regex r'(neg|pos)/.*', which will match the slash in his paths but reject the one in yours.
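To make the suspected mismatch concrete, here is a minimal sketch using only the standard `re` module. It assumes the category pattern is applied to each file id with `re.match`, and that the file ids on Windows really do contain backslashes; neither has been verified here. The file names, the helper `category_of`, and the alternative pattern `either_slash` are hypothetical and only for illustration.

```python
import re

# Pattern from the question: expects a forward slash after the category folder.
posix_only = re.compile(r'(neg|pos)/.*')

# Hypothetical alternative: accept either slash direction after the folder name.
either_slash = re.compile(r'(neg|pos)[/\\].*')


def category_of(pattern, file_id):
    """Return the captured category name, or None if the pattern does not match."""
    m = pattern.match(file_id)
    return m.group(1) if m else None


forward = 'pos/tweets.txt'     # hypothetical file id with a forward slash
backward = 'pos\\tweets.txt'   # the same file id with a Windows backslash

print(category_of(posix_only, forward))     # pos
print(category_of(posix_only, backward))    # None
print(category_of(either_slash, backward))  # pos
```

If backslashes do turn out to be the cause, note that `i.split('/')[0]` in the question's code makes the same forward-slash assumption and would need the same treatment.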
