Sentiment Analysis with the Naive Bayes Theorem | by kaiku | July 2023



Naive Bayes Class

import numpy as np
import pandas as pd
from collections import defaultdict
from tqdm import tqdm

class NaiveBayes:

    def __init__(self, unique_classes):

        self.classes = unique_classes  # the constructor is only handed the unique classes of the training set

    def addToBow(self, example, dict_index):

        '''
        Parameters:
        1. example
        2. dict_index - tells which class's BoW this example belongs to
        What the function does?
        -----------------------
        It simply splits the example on whitespace as a tokenizer and adds every tokenized word to
        its corresponding dictionary/BoW
        Returns:
        ---------
        Nothing

        '''

        if isinstance(example, np.ndarray): example = example[0]

        for token_word in example.split():  # for every word in the preprocessed example

            self.bow_dicts[dict_index][token_word] += 1  # increment its count

    def train(self, dataset, labels):

        '''
        Parameters:
        1. dataset - shape = (m X d)
        2. labels - shape = (m,)
        What the function does?
        -----------------------
        This is the training function which will train the Naive Bayes model, i.e. compute a BoW for each
        category/class.
        Returns:
        ---------
        Nothing

        '''

        self.examples = dataset
        self.labels = labels
        self.bow_dicts = np.array([defaultdict(lambda: 0) for index in range(self.classes.shape[0])])

        # only convert to numpy arrays if not originally passed as numpy arrays - otherwise it's a useless recomputation

        if not isinstance(self.examples, np.ndarray): self.examples = np.array(self.examples)
        if not isinstance(self.labels, np.ndarray): self.labels = np.array(self.labels)

        # constructing the BoW for each class
        for cat_index, cat in tqdm(enumerate(self.classes), total=len(self.classes), desc="Training"):

            all_cat_examples = self.examples[self.labels == cat]  # filter all examples of class == cat

            # get the examples preprocessed

            cleaned_examples = [preprocess_string(cat_example) for cat_example in all_cat_examples]

            cleaned_examples = pd.DataFrame(data=cleaned_examples)

            # now construct the BoW of this particular class
            np.apply_along_axis(self.addToBow, 1, cleaned_examples, cat_index)

        ###################################################################################################

        '''
        Although we are done with the training of the Naive Bayes model, BUT!!!!!!
        ------------------------------------------------------------------------------------
        Remember the test-time formula? : p(c|example) ∝ p(c) * Π [ count(w|c) + 1 ] / [ count(c) + |V| + 1 ]
        ------------------------------------------------------------------------------------

        We are done with constructing the BoW for each class. But we need to precompute a few
        other values at training time too:
        1. prior probability of each class - p(c)
        2. vocabulary |V|
        3. denominator value of each class - [ count(c) + |V| + 1 ]

        Reason for doing these precomputations ???
        ---------------------
        We could do all three calculations at test time too, BUT doing so means re-computing them
        over and over every time the test function is called - this would significantly
        increase the computation time, especially when we have lots of test examples to classify!
        Moreover, it doesn't make sense to repeatedly compute the same thing -
        why do extra computations ???
        So we will precompute all of them & use them during test time to speed up predictions.

        '''

        ###################################################################################################
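
        # A tiny hypothetical illustration of the formula above (numbers assumed, not part of the original code):
        # suppose class c has count(c) = 10 total word occurrences in its BoW, the vocabulary size is |V| = 8,
        # and the prior is p(c) = 0.5. For a test word w seen 3 times in c's BoW:
        #     p(w|c) = (3 + 1) / (10 + 8 + 1) = 4/19
        # and for a word never seen in class c (Laplace smoothing keeps it non-zero):
        #     p(w'|c) = (0 + 1) / (10 + 8 + 1) = 1/19
        # The example's posterior is then accumulated in log space: log p(c) + sum of log p(w|c) over its words.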

        prob_classes = np.empty(self.classes.shape[0])
        all_words = []
        cat_word_counts = np.empty(self.classes.shape[0])
        for cat_index, cat in enumerate(self.classes):

            # calculating the prior probability p(c) for each class
            prob_classes[cat_index] = np.sum(self.labels == cat) / float(self.labels.shape[0])

            # calculating the total count of all the words of each class
            cat_word_counts[cat_index] = np.sum(np.array(list(self.bow_dicts[cat_index].values()))) + 1  # |V| is still to be added

            # get all words of this class
            all_words += self.bow_dicts[cat_index].keys()

        # combine all words of every class & make them unique to get the vocabulary -V- of the whole training set

        self.vocab = np.unique(np.array(all_words))
        self.vocab_length = self.vocab.shape[0]

        # computing the denominator value
        denoms = np.array([cat_word_counts[cat_index] + self.vocab_length + 1 for cat_index, cat in enumerate(self.classes)])

        '''
        Now that we have everything precomputed as well, it's better to organize everything in a tuple
        rather than keeping a separate list for every item.

        Every element of self.cats_info holds a tuple of values:
        a dict at index 0, prior probability at index 1, denominator value at index 2
        '''

        self.cats_info = [(self.bow_dicts[cat_index], prob_classes[cat_index], denoms[cat_index]) for cat_index, cat in enumerate(self.classes)]
        self.cats_info = np.array(self.cats_info, dtype=object)
        print('training complete')

    def getExampleProb(self, test_example):

        '''
        Parameters:
        -----------
        1. a single test example
        What the function does?
        -----------------------
        Estimates the posterior probability of the given test example
        Returns:
        ---------
        probability of the test example in ALL CLASSES
        '''

        likelihood_prob = np.zeros(self.classes.shape[0])  # to store the likelihood w.r.t each class

        # finding the likelihood of the given test example w.r.t each class
        for cat_index, cat in enumerate(self.classes):

            for test_token in test_example.split():  # split the test example and get p of every test word

                ####################################################################################

                # This loop computes : for every word w [ count(w|c)+1 ] / [ count(c) + |V| + 1 ]

                ####################################################################################

                # get the total count of this test token from its respective training dict to get the numerator value
                test_token_counts = self.cats_info[cat_index][0].get(test_token, 0) + 1

                # now get the probability of this test_token word
                test_token_prob = test_token_counts / float(self.cats_info[cat_index][2])

                # remember why we take the log? To prevent underflow!
                likelihood_prob[cat_index] += np.log(test_token_prob)

        # we have the likelihood estimate of the given example against every class, but we need the posterior probability
        post_prob = np.empty(self.classes.shape[0])
        for cat_index, cat in enumerate(self.classes):
            post_prob[cat_index] = likelihood_prob[cat_index] + np.log(self.cats_info[cat_index][1])

        return post_prob

    def test(self, test_set):
        '''
        Parameters:
        -----------
        1. A complete test set of shape (m,)

        What the function does?
        -----------------------
        Determines the probability of each test example against all classes and predicts the label
        for which the class probability is maximum
        Returns:
        ---------
        Predictions of test examples - a single prediction for every test example
        '''

        predictions = []  # to store the prediction of every test example
        for example in tqdm(test_set, desc="Testing"):
            # preprocess the test example the same way we did for the training set examples
            cleaned_example = preprocess_string(example)

            # simply get the posterior probability of every example
            post_prob = self.getExampleProb(cleaned_example)  # get the prob of this example for every class

            # simply pick the max value and map it against self.classes!
            predictions.append(self.classes[np.argmax(post_prob)])

        print('test complete')
        return np.array(predictions)
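
The class relies on a preprocess_string helper that is called during training and testing but is not shown in the article. A minimal sketch, assuming it simply lowercases the text, strips non-alphabetic characters, and collapses whitespace, could look like the following (the exact cleaning steps, e.g. whether stopwords are also removed, are an assumption here):

import re

def preprocess_string(text):
    # a minimal, assumed implementation - the article's actual helper is not shown
    text = str(text).lower()                  # lowercase everything
    text = re.sub(r'[^a-z\s]', ' ', text)     # keep letters and whitespace only
    text = re.sub(r'\s+', ' ', text).strip()  # collapse repeated whitespace
    return text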

View the DataFrame

import pandas as pd 

df = pd.read_csv('nb.csv')
df.head(20)
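
The contents of nb.csv are not shown in the article; it only needs the two columns referenced in the next step. A hypothetical stand-in built in pandas (column names taken from the code below, sample texts assumed) could be:

import pandas as pd

# hypothetical stand-in for nb.csv (the real file is not shown in the article);
# the column names match the ones used in the next step, with 10 balanced toy samples
df = pd.DataFrame({
    'training samples': [
        'i love this movie', 'great acting and story', 'what a wonderful film',
        'best film i have seen', 'truly enjoyable',
        'what a terrible film', 'i hate this movie', 'boring and too long',
        'worst acting ever', 'truly awful',
    ],
    'labels': [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
})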

Define X and y

X = df['training samples'].values
y = df['labels'].values

Train-Test Split

from sklearn.model_selection import train_test_split
import numpy as np
# split the data
train_data, test_data, train_labels, test_labels = train_test_split(X, y,
                                                                    shuffle=True,
                                                                    test_size=0.2,
                                                                    random_state=42,
                                                                    stratify=y)
labels = np.unique(train_labels)

Naive Bayes Instance

nb=NaiveBayes(labels)

Training…

from collections import defaultdict
from tqdm import tqdm
# nb.train will do the word cleaning for us
nb.train(train_data, train_labels)

output:
Training: 100%|██████████| 2/2 [00:00<00:00, 2001.10it/s]
training complete

Evaluation…

# prediction for our x_test samples
y_predict = nb.test(test_data)
y_predict

output:
Testing: 100%|██████████| 2/2 [00:00<?, ?it/s]
test complete

Accuracy

# prediction vs. true labels (test_labels is sometimes named y_test)
accuracy = np.sum(y_predict == test_labels) / float(test_labels.shape[0])
accuracy

output:
1.0

Our model made accurate predictions for our test samples. However, it is important to note that our dataset is small, consisting of only 10 samples, with the test data comprising only 2 samples. Next, we repeat the same steps on the much larger labeledTrainData.tsv movie-review dataset.

df = pd.read_csv('labeledTrainData.tsv', sep='\t')
df.shape
df.head()
# Define X and y
y = df.sentiment.values
X = df.review.values

# Split data
train_data, test_data, train_labels, test_labels = train_test_split(X, y,
                                                                    shuffle=True,
                                                                    test_size=0.25,
                                                                    random_state=69,
                                                                    stratify=y)
labels = np.unique(train_labels)

# naive bayes instance
nb = NaiveBayes(labels)

# training
# nb.train will do the word cleaning for us
nb.train(train_data, train_labels)

# Testing
# prediction for test_data (sometimes named x_test)
y_predict = nb.test(test_data)

output:
Training: 100%|██████████| 2/2 [00:05<00:00, 2.90s/it]
training complete
Testing: 100%|██████████| 6250/6250 [00:10<00:00, 616.60it/s]
test complete

Accuracy

# prediction vs. true labels (test_labels is sometimes named y_test)
accuracy = np.sum(y_predict == test_labels) / float(test_labels.shape[0])
accuracy

output:
0.8424

Testing the Model on New Reviews

comment1 = ['i love you baby baby']
comment2 = ['i hate YOU very much Zzzz']
comment3 = ['your always in my heart ;)']

comment1_predict = nb.test(comment1)
comment2_predict = nb.test(comment2)
comment3_predict = nb.test(comment3)
print(comment1 + list(comment1_predict))
print(comment2 + list(comment2_predict))
print(comment3 + list(comment3_predict))

output:
['i love you baby baby', 1]
['i hate YOU very much Zzzz', 0]
['your always in my heart ;)', 1]

In this section we will use the Naive Bayes MultinomialNB class from the sklearn library.

import pandas as pd
import numpy as np
from collections import defaultdict
import re
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
df = pd.read_csv('labeledTrainData.tsv', sep='\t')
print(df.shape)
df.head()

Define X and y

y = df.sentiment.values
X = df.review.values

Cleaning the Data

# we clean the data (X) before splitting
X = [preprocess_string(i) for i in X]
# view a sample of the cleaned reviews
X[0]

output:
'with all stuff happening second with mj i ve began listening to his music watching odd documentary right here and there watched wiz and watched moonwalker once more possibly i simply wish to get a sure perception into man who i assumed was actually cool eighties simply to possibly make up my thoughts whether or not he responsible or harmless moonwalker half biography half function movie which i keep in mind going to see cinema when was initially launched a few of has delicate messages about mj s feeling in direction of press and in addition apparent message of medicine are dangerous m kay br br visually spectacular however in fact all about michael jackson except you remotely like mj anyway then you will hate and discover boring some might name mj an egotist for consenting to creating of film however mj and most of his followers would say that he made for followers which if true very nice of him br br precise function movie bit when lastly begins just for minutes or excluding easy felony sequence and joe pesci convincing as a psychopathic all highly effective drug lord why he desires mj useless dangerous past me as a result of mj overheard his plans nah joe pesci s character ranted that he wished individuals to know he who supplying medicine and so on i dunno possibly he simply hates mj s music br br numerous cool issues like mj turning right into a automotive and a robotic and complete pace demon sequence additionally director should have had persistence of a saint when got here to filming kiddy dangerous sequence as often administrators hate working with one child not to mention a complete bunch of them performing a fancy dance scene br br backside line film for individuals who like mj one stage or one other which i feel most individuals if not then keep away does try to give off a healthful message and paradoxically mj s bestest buddy film a woman michael jackson really considered one of most gifted individuals ever to grace planet however he responsible effectively with all consideration i ve gave topic hmmm effectively i don t know as a result of individuals will be completely different behind closed doorways i do know for a truth he both an especially good however silly man or considered one of most sickest liars i hope he not latter'

Splitting the Data

train_data, test_data, train_labels, test_labels = train_test_split(X, y,
                                                                    shuffle=True,
                                                                    test_size=0.25,
                                                                    random_state=69,
                                                                    stratify=y)

Transforming the train and test data into a format readable by MultinomialNB.

# instantiate the CountVectorizer object
count_vect = CountVectorizer()

# learn the vocabulary, then transform the training data into a Naive Bayes readable format
train_data_vectorize = count_vect.fit_transform(train_data)

# transform only, NOT fit_transform:
# the vocabulary is learned on the training data that our model will learn from;
# the test data is used for evaluation only but needs to be transformed into the same format.
test_data_vectorize = count_vect.transform(test_data)

Naive Bayes MultinomialNB

# simply instantiate a Multinomial Naive Bayes object
clf = MultinomialNB()

# call the fit method, passing train_data_vectorize and train_labels
clf.fit(train_data_vectorize, train_labels)

Evaluation

# using test_data_vectorize (test_data transformed into a clf-readable format)
y_predict = clf.predict(test_data_vectorize)

y_predict

output:
array([1, 1, 0, ..., 1, 0, 1], dtype=int64)

Accuracy

# the same accuracy computed two ways
print(clf.score(test_data_vectorize, test_labels))
print(np.sum(y_predict == test_labels) / float(len(test_labels)))

output:
0.84736
0.84736

Testing

# Transform our comments first, assuming that the comments are already cleaned
c1 = count_vect.transform(comment1)
c2 = count_vect.transform(comment2)
c3 = count_vect.transform(comment3)
print(comment1 + list(clf.predict(c1)))
print(comment2 + list(clf.predict(c2)))
print(comment3 + list(clf.predict(c3)))

output:
['i love you baby baby', 1]
['i hate YOU very much Zzzz', 0]
['your always in my heart ;)', 1]

Our custom Naive Bayes class and the sklearn Naive Bayes implementation achieve nearly identical accuracy scores.



