In Machine Learning, it’s okay if you know a little less about programming, but it is very important that your concepts are clear. So, let’s start with a basic intro to ML data.
Types of data
In Machine Learning, data is classified into two types-
1. Structured data: It is highly organized and formatted so that its patterns make it easily searchable in relational databases.
Like- row-based data (databases, table data)
2. Unstructured data: It has no pre-defined format or organization, so it is very hard to find patterns in and much more difficult to collect, process, and analyze.
Like- audio, video, text, images
In this article we will see different techniques for text processing and learn how we can process text data using Word2Vec and build a model with it.
You can find the whole code, with text pre-processing and the Word2Vec model, at my GitHub account.
Machine Learning’s core concepts work only on algebra and statistics. So, it is very important that whatever we give to the algorithm is in numerical form.
But how can we convert a large file that has millions of words into numeric values?
Don’t worry, ML provides techniques for it.
Various techniques for converting text to vectors-
- Bag of Words (BoW)
- Word to Vector (Word2Vec or W2V)
- tf-idf (term frequency – inverse document frequency)
- tf-idf and W2V
Today we will only look at W2V in detail.
Word to Vector (Word2Vec or W2V)
Word2vec is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct the linguistic contexts of words. Word2vec takes a large corpus of text as input and produces a vector space.
Machine Learning is actually 90% data pre-processing and 10% data training.
Text Pre-processing
There are lots of techniques by which you can do text pre-processing. Here, I will show you one of mine. The steps are as below-
1. Removing the stop-words
A stop-word is a word that you can remove from a sentence and still understand the sentence’s meaning.
For example,
Normal Sentence: Pasta is very famous in Surat.
After removing stop-words: Pasta famous Surat.
As a human being, it’s still quite easy to understand the sentence.
You may think: do we have to make this list manually? The answer is no; NLTK (Natural Language Toolkit) provides it. You can download the stop-words as follows-
2. Make all text lower case
By making all text the same case, it becomes much easier for the computer to match words. If we don’t do this, then the computer will treat “Like” and “like” as two different words.
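As a tiny illustration of this step:

```python
# Lowercasing collapses "Like", "like" and "LIKE" into one token.
tokens = ["Like", "like", "LIKE"]
lowered = [t.lower() for t in tokens]
print(lowered)  # ['like', 'like', 'like']
```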
3. Stemming
Stemming reduces similar words to one root word. It is very important in pre-processing as it reduces the dimensionality of the vector.
Two common stemming algorithms are:
- Porter Stemmer
- Snowball Stemmer
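A short sketch comparing the two stemmers, assuming NLTK is installed:

```python
# Both stemmers reduce inflected forms to a common stem.
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")

words = ["tasty", "tastes", "tasting", "running"]
print([porter.stem(w) for w in words])
print([snowball.stem(w) for w in words])
```

Note that a stem need not be a dictionary word; it only has to be the same for all variants of the word.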
4. Lemmatization
Lemmatization in morphology is the process of grouping together the inflected forms of a word so they can be analyzed as a single item.
There are also techniques for grouping the words of a sentence-
Uni-gram: takes a single word at a time.
Bi-gram: takes two words (a pair of words) at a time.
n-gram: takes n words at a time.
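The grouping above can be sketched in plain Python, with no external libraries:

```python
# Build n-grams: tuples of n consecutive tokens from a token list.
def ngrams(tokens, n):
    """Return the list of n-grams of the token list."""
    return list(zip(*[tokens[i:] for i in range(n)]))

tokens = ["pasta", "famous", "surat"]
print(ngrams(tokens, 1))  # uni-grams: one word at a time
print(ngrams(tokens, 2))  # bi-grams: pairs of consecutive words
```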
5. Semantic Meaning
It works similarly to stemming. Instead of reducing similar words to one word, it takes all words with a similar meaning (synonyms) and maps them to one word.
For example,
{tasty, delicious, luscious, yum-yum, yummy, flavourful} are all represented by the single word “tasty”.
Handling semantic meaning is a limitation of BoW (Bag of Words).
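The idea can be sketched with a hand-built synonym dictionary; this is purely illustrative — a real system would use a lexical resource such as WordNet rather than a hard-coded map:

```python
# Illustrative only: map every synonym to one canonical word.
synonym_map = {w: "tasty" for w in
               ["tasty", "delicious", "luscious", "yum-yum", "yummy", "flavourful"]}

tokens = ["the", "pasta", "was", "delicious"]
# Replace each token by its canonical word if it is in the map.
normalized = [synonym_map.get(t, t) for t in tokens]
print(normalized)  # ['the', 'pasta', 'was', 'tasty']
```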
Now we perform the Word2Vec technique to convert the text into vectors.
Here we use the Amazon Fine Food Reviews dataset. You can download this dataset here.
# Build a list of tokenized sentences from the review text.
# after_dup is the de-duplicated reviews DataFrame; cleanhtmltag() and
# cleanpunc() are helper functions (defined in the full notebook) that
# strip HTML tags and punctuation respectively.
list_of_sent = []
for sent in after_dup['Text'].values:
    filtered_sentence = []
    sent = cleanhtmltag(sent)
    for w in sent.split():
        for cleaned_words in cleanpunc(w).split():
            if cleaned_words.isalpha():  # keep alphabetic tokens only
                filtered_sentence.append(cleaned_words.lower())
    list_of_sent.append(filtered_sentence)
import gensim

# Train Word2Vec: 50-dimensional vectors, ignoring words that appear
# fewer than 5 times, using 4 worker threads.
# (Note: in gensim >= 4.0 the `size` parameter is named `vector_size`.)
w2v_model = gensim.models.Word2Vec(list_of_sent, min_count=5, size=50, workers=4)