1. I love Python. If you don't have it, get it. It's awesomely easy to learn.
2. Get Twitter data. I used the normal search page and some BeautifulSoup parsing. A better way would be to use the API with OAuth.
# Python 2 script; uses BeautifulSoup 3 (the "BeautifulSoup" module).
import time
import urllib
from BeautifulSoup import BeautifulSoup
from HTMLParser import HTMLParser

class MLStripper(HTMLParser):
    """Collects only the text content, dropping HTML tags."""
    def __init__(self):
        self.reset()
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

q="iphone"
html=urllib.urlopen("http://search.twitter.com/search?q="+q).read()
soup=BeautifulSoup(html)
# One output file per search term and run: tw_<query>_<timestamp>.txt
text_file=open("tw_"+q+"_"+str(time.time())+".txt", "w")
# Each English tweet sits in a <span class="msgtxt en"> element
for msg in soup.findAll("span",{"class":"msgtxt en"}):
    s=MLStripper()
    s.feed(str(msg))
    text_file.write(s.get_data()+"\r\n\r\n")
text_file.close()
3. Open the file written above and manually mark each tweet as good (Y) or bad (N) based on its content, by appending a comma and the label to the end of the line. The more tweets you mark, the more accurate the model trained in the next step will be.
Iphone is awesome,Y
Iphone sucks, big time for me,N
Just bought Iphone 4, its feels really nice resting in my bare hands,Y
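Since the tweets themselves can contain commas, the label has to be taken from the last comma on each line. Here is a minimal sketch of how such lines could be parsed. The helper name parse_labeled_line is made up for illustration and is not part of the original script, and it assumes the label is always a single Y or N at the end of the line.

```python
def parse_labeled_line(line):
    """Split one 'tweet text,LABEL' line into (text, label).

    Returns None for blank lines or lines without a Y/N label.
    (Hypothetical helper, assuming the 'text,Y' / 'text,N' format.)
    """
    line = line.strip()
    if not line:
        return None
    # Split on the LAST comma only, because tweets may contain commas.
    text, sep, label = line.rpartition(",")
    label = label.strip().upper()
    if not sep or label not in ("Y", "N"):
        return None
    return text.strip(), label


sample = [
    "Iphone is awesome,Y",
    "Iphone sucks, big time for me,N",
    "",                  # blank separator line
    "not labeled yet",   # no label, skipped
]
parsed = [p for p in map(parse_labeled_line, sample) if p is not None]
print(parsed)
```

Skipping unlabeled lines instead of failing means you can mark the file a bit at a time and still train on whatever is labeled so far.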
4. Download and install the NLTK Python package. Train a Naive Bayes classifier on the tweets you marked in the previous step.
from nltk.classify import NaiveBayesClassifier

def word_feats(words):
    # Bag-of-words features: every word present maps to True
    return dict([(word,True) for word in words])

f=open("tw_iphone_23423423.txt")
trainfeats=[]
for line in f:
    line=line.strip()
    if ',' not in line:
        continue  # skip blank or unlabeled lines
    # Split on the last comma: the tweet text may itself contain commas
    text,label=line.rsplit(',',1)
    print(text)
    print(label)
    trainfeats.append((word_feats(text.split(' ')),label))
f.close()
classifier=NaiveBayesClassifier.train(trainfeats)
5. We are done. Test the model on any random text/tweet.
>>> classifier.classify(word_feats("I like my iphone. It's awesome".split(' ')))
'Y'
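One thing to watch with features built by splitting on spaces: "Iphone", "iphone" and "iphone." all count as different words, so a test sentence may share fewer features with the training tweets than it seems to. Below is a rough sketch of a normalizing feature extractor. The name normalized_word_feats and the choice of regular expression are my own assumptions, not something NLTK provides, and the same function would have to be used for both training and classifying.

```python
import re


def normalized_word_feats(text):
    """Lowercase the text and keep runs of letters, digits and apostrophes.

    Hypothetical replacement for a plain split(' '); each word maps to True.
    """
    words = re.findall(r"[a-z0-9']+", text.lower())
    return dict((word, True) for word in words)


feats = normalized_word_feats("I like my iphone. It's awesome")
print(sorted(feats))
```

With this, "Iphone is awesome" from the training file and "I like my iphone." from a test sentence both produce the feature "iphone".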



