Text mining - Modifying a corpus by inserting codewords using Python

- January 15, 2011

I have a corpus (30,000 customer reviews) in a CSV file (or a txt file). This means each customer review is one line in the text file. Some examples are:

this bike amazing, brake poor
this ice maker works great, price reasonable, bad smell ice maker
the food awesome, water rude

I want to change these texts to the following:

this bike amazing positive, brake poor negative
this ice maker works great positive, price reasonable positive, bad negative smell ice maker
the food awesome positive, water rude negative

I have 2 separate lists (lexicons) of positive words and negative words. For example, one text file contains such positive words as:

amazing
great
awesome
very cool
reasonable
pretty
fast
tasty
kind

And another text file contains such negative words as:

rude
poor
worst
dirty
slow
bad

So, I want a Python script that reads each customer review: when any of the positive words is found, insert "positive" after the positive word; when any of the negative words is found, insert "negative" after the negative word.

Here is the code I have tested so far. It works (see my comments in the code below), but it needs improvement to meet the needs described above.

Specifically, my_escaper works (this code finds such words as cheap and good, and replaces them with cheap positive and good positive), but my problem is that I have 2 files (lexicons), each containing about a thousand positive/negative words. I want the code to read the word lists from the lexicons, search for them in the corpus, and replace those words in the corpus (for example, "good" with "good positive", "bad" with "bad negative").

    # Adapted from http://stackoverflow.com/questions/6116978/python-replace-multiple-strings
    import re

    def multiple_replacer(*key_values):
        # Map each search string to its replacement.
        replace_dict = dict(key_values)
        replacement_function = lambda match: replace_dict[match.group(0)]
        # One alternation pattern covering every search string.
        pattern = re.compile("|".join([re.escape(k) for k, v in key_values]), re.M)
        return lambda string: pattern.sub(replacement_function, string)

    def multiple_replace(string, *key_values):
        return multiple_replacer(*key_values)(string)

    # This my_escaper works (it finds such words as cheap and good, and replaces
    # them with cheap positive and good positive), but the pairs are hard-coded.
    # I want the code to read the word lists from the 2 lexicon files instead.
    my_escaper = multiple_replacer(('cheap', 'cheap positive'), ('good', 'good positive'), ('avoid', 'avoid negative'))

    d = []
    with open("review.txt", "r") as file:
        for line in file:
            review = line.strip()
            d.append(review)

    for line in d:
        print(my_escaper(line))

A straightforward way to code this would be to load the positive and negative words from your lexicons into separate sets. Then, for each review, split the sentence into a list of words and look up each word in the sentiment sets. Checking set membership is O(1) in the average case. Insert the sentiment label (if any) into the word list and then join the list to build the final string.
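The loading step might look like the following minimal sketch. It assumes each lexicon is a UTF-8 text file with one entry per line, and it is written for Python 3; the helper name load_lexicon and the file names in the usage note are hypothetical and do not come from the question.

```python
def load_lexicon(path):
    """Read one lexicon entry per line into a set.

    Entries are stripped and lowercased; blank lines are skipped.
    Assumes a UTF-8 text file with one word or phrase per line.
    """
    with open(path, encoding="utf-8") as handle:
        return set(line.strip().lower() for line in handle if line.strip())
```

With files named, say, positive.txt and negative.txt, the two sets would be built as positive_words = load_lexicon("positive.txt") and negative_words = load_lexicon("negative.txt").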

Example:

    import re

    reviews = [
        "this bike amazing, brake poor",
        "this ice maker works great, price reasonable, bad smell ice maker",
        "the food awesome, water rude"
        ]

    positive_words = set(['amazing', 'great', 'awesome', 'reasonable'])
    negative_words = set(['poor', 'bad', 'rude'])

    for sentence in reviews:
        tagged = []
        # Split on runs of non-word characters (this drops punctuation).
        for word in re.split(r'\W+', sentence):
            tagged.append(word)
            if word.lower() in positive_words:
                tagged.append("positive")
            elif word.lower() in negative_words:
                tagged.append("negative")
        print(' '.join(tagged))

While this approach is straightforward, there is a downside: you lose the punctuation due to the use of re.split().
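One possible way around the lost punctuation is to substitute in place instead of splitting, much like the multiple_replacer idea in the question. The sketch below is a different technique from the set lookup above, not part of the original answer: it builds a single alternation pattern from both lexicons and appends the label after each match. The names build_tagger and tag are hypothetical, and it assumes lexicon entries start and end with word characters so that \b behaves as expected.

```python
import re

def build_tagger(positive_words, negative_words):
    """Return a function that appends a sentiment label after each lexicon match."""
    labels = {}
    for word in negative_words:
        labels[word.lower()] = "negative"
    # Assumption: if an entry appears in both lexicons, "positive" wins.
    for word in positive_words:
        labels[word.lower()] = "positive"
    if not labels:
        # Nothing to match; return the text unchanged.
        return lambda text: text
    # Longest entries first, so a phrase such as "very cool" is tried
    # before any shorter entry that overlaps with it.
    entries = sorted(labels, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(entry) for entry in entries) + r")\b",
        re.IGNORECASE,
    )

    def tag(text):
        # Keep the matched text as written and add its label after it.
        return pattern.sub(
            lambda m: m.group(0) + " " + labels[m.group(0).lower()], text
        )

    return tag

tag = build_tagger(
    {"amazing", "great", "awesome", "reasonable", "very cool"},
    {"poor", "bad", "rude"},
)
print(tag("this bike amazing, brake poor"))
# prints: this bike amazing positive, brake poor negative
```

Because the pattern matches whole phrases, multi-word entries such as "very cool" are handled as well, which the word-by-word set lookup would miss. With lexicons of a thousand or so entries the alternation gets long, so it is worth compiling it once and reusing the returned function for every review.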
