In this article you will learn the fundamentals of NLP.
Here is what you will learn:
1 - What is a tokenizer?
2 - What is texts_to_sequences?
3 - What is pad_sequences?
4 - What is Embedding?
5 - How to make a prediction model?
You can use this code for any binary NLP dataset that contains text data.
As we all know, machines can only understand numbers, so we need to convert words to numbers.
For example, if we want a machine to understand 'hello world', we should convert it to numbers like this:
hello is represented by 0
world is represented by 1
To do this we use a tokenizer.
With a tokenizer we convert words, subwords, or characters to numbers, and every word that has been converted to a number is a token.
In summary, a tokenizer converts text into tokens.
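To make the idea concrete, here is a tiny sketch using a plain Python dictionary (just an illustration, not the Keras Tokenizer we will use later):
vocab = {'hello': 0, 'world': 1}  # each word is mapped to a number
sentence = 'hello world'
tokens = [vocab[word] for word in sentence.split()]
print(tokens)  # [0, 1]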
First of all, we need to import the dataset. You can find the dataset used in this article through the link below:
Let's define a variable for the dataset:
import pandas as pd

dataset = pd.read_csv(r"D:\IT\ML project\Predict depression\depression_dataset_reddit_cleaned.csv")
Now we need to define two variables, one for the sentences and another for the labels:
sentences = dataset['clean_text']
labels = dataset['is_depression']
To train a model we need training data for training the model and test data for testing and optimizing the model.
So now we need to split the data into two parts, train and test.
The data contains 7731 rows (samples). We define the training data as indices 0 to 6000, meaning all samples before index 6000 are for training and all samples after it are for testing:
training_size = 6000
training_sentences = sentences[0:training_size]
testing_sentences = sentences[training_size:]
training_labels = labels[0:training_size]
testing_labels = labels[training_size:]
Let's work with the tokenizer.
'''
In this article we work with the Keras (TensorFlow) tokenizer
'''
from keras.preprocessing.text import Tokenizer  # import the tokenizer

vocab_size = 10000  # number of words the tokenizer will keep
tokenizer = Tokenizer(num_words=vocab_size, oov_token='<OOV>', lower=True)
tokenizer.fit_on_texts(training_sentences)  # build the word -> number mapping from the training sentences
# word_index = tokenizer.word_index  # shows the number (token) of each word
# print(word_index)
oov_token='<OOV>': this parameter helps the tokenizer handle words that were not in the vocabulary, as sketched below.
lower=True: converts all words to lower case.
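Here is a small sketch (assumed for illustration, not part of the main code) of how the <OOV> token works: any word that was not seen during fit_on_texts is replaced by the <OOV> token, which gets index 1.
from keras.preprocessing.text import Tokenizer

demo_tokenizer = Tokenizer(num_words=10, oov_token='<OOV>', lower=True)  # hypothetical example tokenizer
demo_tokenizer.fit_on_texts(['dog is a good animal'])
print(demo_tokenizer.texts_to_sequences(['cat is a good animal']))
# 'cat' was never seen during fitting, so it becomes the <OOV> token: [[1, 3, 4, 5, 6]]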
With the texts_to_sequences method, all the numbers that represent the words of a sentence are put together in a sequence.
Let's see an example:
sentence1 = 'dog is a good animal'
sentence2 = 'my name is omid'

# use a separate tokenizer for this small example so we don't overwrite the main one
example_tokenizer = Tokenizer(num_words=10, oov_token='<OOV>', lower=True)
example_tokenizer.fit_on_texts([sentence1, sentence2])
word_index = example_tokenizer.word_index
print(word_index)

sequences = example_tokenizer.texts_to_sequences([sentence1, sentence2])
print(sequences)
'''
Output:
{'<OOV>': 1, 'is': 2, 'dog': 3, 'a': 4, 'good': 5, 'animal': 6, 'my': 7, 'name': 8, 'omid': 9}
[[3, 2, 4, 5, 6], [7, 8, 2, 9]]
'''
As you can see, the sentences don't all have the same length. To handle this we use pad_sequences.
Imagine we have 2 sentences, one with 3 words and the other with 4 words. In this situation pad_sequences will make a 2x4 matrix, and for the sentence that has 3 words the last or first matrix element will be 0.
Let's see it with an example:
from keras.preprocessing.sequence import pad_sequences

sequences = example_tokenizer.texts_to_sequences([sentence1, sentence2])  # like the previous code
sentences_padded = pad_sequences(sequences)
print(sentences_padded)
'''
output:
[[3 2 4 5 6]
[0 7 8 2 9]]
'''
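By default pad_sequences puts the zeros at the beginning of the shorter sentence ('pre' padding). Here is a short sketch (assumed, not in the original example) of the padding parameter, which moves the zeros to the end instead:
print(pad_sequences(sequences, padding='post'))
'''
output:
[[3 2 4 5 6]
 [7 8 2 9 0]]
'''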
To get more details about it you can read its documentation.
Let's get back to our main code (depression prediction).
Now that we know what texts_to_sequences and pad_sequences are, let's process the data with them.
from keras.preprocessing.sequence import pad_sequences

max_length = 100  # the max length of a sentence that will be accepted
training_sequences = tokenizer.texts_to_sequences(training_sentences)
training_padded = pad_sequences(training_sequences, maxlen=max_length)
testing_sequences = tokenizer.texts_to_sequences(testing_sentences)
testing_padded = pad_sequences(testing_sequences, maxlen=max_length)
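As a quick sanity check (this step is assumed, not part of the original code), both splits should now be matrices with max_length columns:
print(training_padded.shape)  # expected (6000, 100)
print(testing_padded.shape)   # expected (1731, 100), the remaining 7731 - 6000 samples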
Now the data is ready and we can build the model, but before that, let's see what an embedding is.
With an embedding, words are converted to vectors. By doing this, the model can understand the relationships between words.
For instance, consider the words 'good' and 'bad'; but what about a phrase like 'not so bad'? This phrase carries a negative feeling, meaning bad, and an embedding helps the model understand this (the relationship between words).
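To see what this means in practice, here is a minimal sketch (assumed, with made-up sizes) of an embedding layer turning token numbers into vectors:
import numpy as np
from keras.layers import Embedding

demo_embedding = Embedding(input_dim=10, output_dim=4)  # 10 possible tokens, 4 numbers per word
demo_input = np.array([[3, 2, 4, 5, 6]])                # one tokenized sentence of 5 words
print(demo_embedding(demo_input).shape)                 # (1, 5, 4): every token became a 4-dimensional vector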
The model is trained with an embedding layer, followed by a GlobalAveragePooling1D layer, a dense (fully connected) layer with 24 units and a ReLU activation function, and a last layer with 1 dense unit and a sigmoid activation function, for 10 epochs.
Activation functions help the model understand the data better.
ReLU activation function
ReLU is an activation function that only passes values larger than 0:
R(x) = max(0,x)
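A minimal sketch of this formula (using NumPy just for illustration):
import numpy as np

def relu(x):
    return np.maximum(0, x)  # R(x) = max(0, x)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5, 3.0])))  # [0.  0.  0.  1.5 3. ]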
Sigmoid activation function
We use the sigmoid activation function when the labels of the data are binary (0 or 1), exactly like the dataset we're using.
If the output of the sigmoid activation function (the last layer) is larger than 0.5 it is assigned to label 1, and if it's lower than 0.5 it is assigned to label 0.
In summary:
output > 0.5 → 1
output < 0.5 → 0
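A small sketch of the sigmoid function and the 0.5 threshold (again just an illustration with NumPy, not part of the model code):
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))  # squashes any value into the range (0, 1)

for x in [-2.0, 0.0, 2.0]:
    p = sigmoid(x)
    print(x, '->', round(p, 3), '-> label', 1 if p > 0.5 else 0)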
Code
from keras.models import Sequential
from keras.layers import Embedding, Dense, GlobalAveragePooling1D

embedding_dim = 16  # dimension of the embedding vectors

model = Sequential([
    Embedding(vocab_size, output_dim=embedding_dim, input_length=max_length),
    GlobalAveragePooling1D(),
    Dense(24, activation='relu'),
    Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
num_epochs = 10
history = model.fit(training_padded, training_labels, epochs=num_epochs, validation_data=(testing_padded, testing_labels))
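Optionally (this call is not in the original article), we can also evaluate the trained model on the test split:
loss, accuracy = model.evaluate(testing_padded, testing_labels)
print(f"test loss: {loss:.3f}, test accuracy: {accuracy:.3f}")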
Plot
Let's see the progress of the model over 10 epochs.
import matplotlib.pyplot as plt

plt.plot(history.history['accuracy'])
plt.plot(history.history['loss'])
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend(['accuracy', 'loss'])
plt.show()
As you can see, with every epoch the model has improved: accuracy increased and loss decreased.
Testing the model
Now let's test the model. Don't forget that we need to apply texts_to_sequences and pad_sequences to the input text.
test_sentence = ['the life became so hard i can not take it any more i just wanna die ']
test_sentence = tokenizer.texts_to_sequences(test_sentence)
padded_test_sentence = pad_sequences(test_sentence, maxlen=max_length)
print(model.predict(padded_test_sentence))
'''
output :
[[0.6440944]]
'''
As you can see, there are clearly sad feelings in the input text (test_sentence), and the output of the model is 0.64, which is larger than 0.5, so as I mentioned before it is assigned to label 1, which means depression is positive.
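If we want the final 0/1 label instead of the probability, a small sketch (assumed) is:
probability = model.predict(padded_test_sentence)[0][0]
predicted_label = 1 if probability > 0.5 else 0
print(predicted_label)  # 1 means depression is positive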
The code is available on GitHub through the link below:
Thanks for reading, I hope you enjoyed it.