Text mining - Modifying a corpus by inserting codewords using Python
I have a corpus (30,000 customer reviews) in a CSV file (or a text file), meaning that each customer review is one line in the text file. Some examples are:
- this bike amazing, brake poor
- this ice maker works great, price reasonable, bad smell ice maker
- the food awesome, water rude
I want to change these texts to the following:
- this bike amazing positive, brake poor negative
- this ice maker works great positive, price reasonable positive, bad negative smell ice maker
- the food awesome positive, water rude negative
I have two separate lists (lexicons) of positive words and negative words. For example, one text file contains positive words such as:
- amazing
- great
- awesome
- very cool
- reasonable
- pretty
- fast
- tasty
- kind
And another text file contains negative words such as:
- rude
- poor
- worst
- dirty
- slow
- bad
So, I want a Python script that reads each customer review: when any of the positive words is found, insert "positive" after the positive word; when any of the negative words is found, insert "negative" after the negative word.
Here is the code I have tested so far. It works (see the comments in the code below), but it needs improvement to meet the needs described above.
Specifically, my_escaper works (it finds words such as "cheap" and "good" and replaces them with "cheap positive" and "good positive"), but the problem is that I have two files (lexicons), each containing about a thousand positive or negative words. I want the code to read the word lists from the lexicons, search for them in the corpus, and replace the words in the corpus (for example, "good" with "good positive" and "bad" with "bad negative").
    # Adapted from http://stackoverflow.com/questions/6116978/python-replace-multiple-strings
    import re

    def multiple_replacer(*key_values):
        replace_dict = dict(key_values)
        replacement_function = lambda match: replace_dict[match.group(0)]
        pattern = re.compile("|".join([re.escape(k) for k, v in key_values]), re.M)
        return lambda string: pattern.sub(replacement_function, string)

    def multiple_replace(string, *key_values):
        return multiple_replacer(*key_values)(string)

    # my_escaper works: it finds words such as "cheap" and "good" and replaces
    # them with "cheap positive" and "good positive". The pairs are hard-coded
    # here, though; I want them to be read from the two lexicon files instead.
    my_escaper = multiple_replacer(('cheap', 'cheap positive'),
                                   ('good', 'good positive'),
                                   ('avoid', 'avoid negative'))

    # Read the reviews, one per line.
    d = []
    with open("review.txt", "r") as file:
        for line in file:
            review = line.strip()
            d.append(review)

    for line in d:
        print(my_escaper(line))
A straightforward way to code this is to load the positive and negative words from the lexicons into separate sets. Then, for each review, split the sentence into a list of words and look up each word in the sentiment sets. Checking set membership is O(1) in the average case. Insert the sentiment label (if any) into the word list and then join the list to build the final string.
Example:
    import re

    reviews = [
        "this bike amazing, brake poor",
        "this ice maker works great, price reasonable, bad smell ice maker",
        "the food awesome, water rude"
    ]

    positive_words = set(['amazing', 'great', 'awesome', 'reasonable'])
    negative_words = set(['poor', 'bad', 'rude'])

    for sentence in reviews:
        tagged = []
        # Split on runs of non-word characters, which discards punctuation.
        for word in re.split(r'\W+', sentence):
            tagged.append(word)
            if word.lower() in positive_words:
                tagged.append("positive")
            elif word.lower() in negative_words:
                tagged.append("negative")
        print(' '.join(tagged))
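The sets in the example are hard-coded, while the question mentions lexicon files with about a thousand entries each. A minimal sketch of filling a set from such a file, assuming Python 3, UTF-8 text, and one word or phrase per line (optionally prefixed with "- " as in the lists shown in the question). The helper name `load_lexicon` and the file names `positive.txt` and `negative.txt` are hypothetical:

```python
def load_lexicon(path):
    """Read one word or phrase per line into a set of lowercase strings."""
    words = set()
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            # Drop surrounding whitespace and an optional leading "-" bullet.
            entry = line.strip().lstrip("-").strip().lower()
            if entry:  # skip blank lines
                words.add(entry)
    return words

# Hypothetical file names; adjust to wherever the lexicons actually live.
# positive_words = load_lexicon("positive.txt")
# negative_words = load_lexicon("negative.txt")
```

Note that multi-word entries such as "very cool" load fine this way, but a word-by-word lookup can never match them, because each review is split into single words first.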
While this approach is straightforward, there is a downside: you lose the punctuation due to the use of re.split().
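One possible way around the lost punctuation is to leave the review text intact and insert the labels with re.sub() instead of splitting. The following is only a sketch under some assumptions: Python 3, lexicon entries made of ordinary word characters and spaces, and a hypothetical helper name `build_tagger`. It swaps the set lookup for a single compiled alternation pattern, which also lets multi-word entries such as "very cool" match:

```python
import re

def build_tagger(positive_words, negative_words):
    """Return a function that appends a sentiment label after each lexicon hit."""
    labels = {}
    for word in negative_words:
        labels[word.lower()] = "negative"
    for word in positive_words:
        # Assumption: if a word appears in both lexicons, "positive" wins.
        labels[word.lower()] = "positive"
    if not labels:
        return lambda text: text  # nothing to tag

    # Longest entries first so "very cool" is tried before a shorter "cool".
    ordered = sorted(labels, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(w) for w in ordered) + r")\b",
        re.IGNORECASE,
    )

    def tag(text):
        # Keep the matched text as written and add the label right after it.
        return pattern.sub(
            lambda m: m.group(0) + " " + labels[m.group(0).lower()], text
        )

    return tag

tag = build_tagger(
    {"amazing", "great", "awesome", "reasonable", "very cool"},
    {"poor", "bad", "rude"},
)
print(tag("this ice maker works great, price reasonable, bad smell ice maker"))
# prints: this ice maker works great positive, price reasonable positive, bad negative smell ice maker
```

The `\b` word boundaries keep "bad" from matching inside "badly". With lexicons of a thousand entries the alternation becomes long, so whether this is faster or slower than the set-based lookup on 30,000 reviews would need to be measured.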