A Simple Introduction to Bag of Words (BOW) with Python
** Reference: this whole post was copied from this site to keep as a note
Bag of Words (BOW) is a method to extract features from text documents. These features can be used for training machine learning algorithms. It creates a vocabulary of all the unique words occurring in all the documents in the training set.
The following image describes its steps:
[Image: the steps of the BOW pipeline.]
The generated vectors can then be fed to your machine learning algorithm.
Let’s start with an example: we will take some sentences and generate bag-of-words vectors for them.
Consider the two sentences below.
1. "John likes to watch movies. Mary likes movies too."2. "John also likes to watch football games."
These two sentences can be also represented with a collection of words.
1. ['John', 'likes', 'to', 'watch', 'movies.', 'Mary', 'likes', 'movies', 'too.']
2. ['John', 'also', 'likes', 'to', 'watch', 'football', 'games']
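In Python, a plain str.split() already produces the first of these token lists; note that punctuation stays attached to words such as 'movies.' at this stage:
sentence_1 = "John likes to watch movies. Mary likes movies too."
print(sentence_1.split())
# ['John', 'likes', 'to', 'watch', 'movies.', 'Mary', 'likes', 'movies', 'too.']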
Next, for each sentence, collapse repeated words and represent the sentence as a mapping from each word to its count.
1. {"John":1,"likes":2,"to":1,"watch":1,"movies":2,"Mary":1,"too":1}
2. {"John":1,"also":1,"likes":1,"to":1,"watch":1,"football":1, "games":1}
Assuming these sentences form one document, below are the combined word frequencies for the entire document, taking both sentences into account.
{"John":2,"likes":3,"to":2,"watch":2,"movies":2,"Mary":1,"too":1, "also":1,"football":1,"games":1}
This vocabulary, built from all the words in the document together with their counts, is used to create a vector for each sentence.
The length of each vector always equals the vocabulary size; in this case it is 10.
In order to represent our original sentences in a vector, each vector is initialized with all zeros — [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Then, for each word in our vocabulary, we compare it against the sentence and increment the corresponding vector entry whenever the sentence contains that word.
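Here is a minimal sketch of that counting loop in plain Python (the vocabulary order is chosen to match the vectors shown below):
vocab = ['John', 'likes', 'to', 'watch', 'movies', 'Mary', 'too', 'also', 'football', 'games']

def vectorize(tokens):
    # start from all zeros and increment the slot of every vocabulary word we see
    vector = [0] * len(vocab)
    for token in tokens:
        if token in vocab:
            vector[vocab.index(token)] += 1
    return vector

print(vectorize(['John', 'likes', 'to', 'watch', 'movies', 'Mary', 'likes', 'movies', 'too']))
# [1, 2, 1, 1, 2, 1, 1, 0, 0, 0]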
John likes to watch movies. Mary likes movies too.
[1, 2, 1, 1, 2, 1, 1, 0, 0, 0]
John also likes to watch football games.
[1, 1, 1, 1, 0, 0, 0, 1, 1, 1]
For example, in sentence 1 the word likes occupies the second vocabulary position and appears two times, so the second element of the vector for sentence 1 is 2: [1, 2, 1, 1, 2, 1, 1, 0, 0, 0].

Coding our BOW algorithm
The input array is this:
["Joe waited for the train", "The train was late", "Mary and Samantha took the bus",
"I looked for Mary and Samantha at the bus station",
"Mary and Samantha arrived at the bus station early but waited until noon for the bus"]
Step 1: Tokenize a sentence
We will start by removing stopwords from the sentences.
Stopwords are words that carry too little meaning to be useful to our algorithm. We do not want these words taking up space in our database or valuable processing time, so we remove them by keeping a list of words we consider stopwords.
Tokenization is the act of breaking a sequence of strings into pieces such as words, keywords, phrases, symbols, and other elements called tokens. Tokens can be individual words, phrases, or even whole sentences. In the process of tokenization, some characters such as punctuation marks are discarded.
import re

def word_extraction(sentence):
    ignore = ['a', 'the', 'is']
    # strip punctuation, then split on whitespace
    words = re.sub(r"[^\w]", " ", sentence).split()
    # drop ignored words and lowercase the rest; note the ignore check runs
    # before lowercasing, so a capitalized "The" is kept (as 'the')
    cleaned_text = [w.lower() for w in words if w not in ignore]
    return cleaned_text
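For example, running it on two of the input sentences:
print(word_extraction("Joe waited for the train"))
# ['joe', 'waited', 'for', 'train']
print(word_extraction("The train was late"))
# ['the', 'train', 'was', 'late']  (the capitalized "The" slips past the ignore list)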
For a more robust treatment of stopwords, you can use the Python nltk library. It has a set of predefined stopwords per language. Here is an example:
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')  # fetch the corpus once, if not already present
print(set(stopwords.words('english')))
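A variant of word_extraction built on this list might look as follows; this is a sketch (word_extraction_nltk is an illustrative name) that assumes the stopwords corpus has already been downloaded:
import re
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

def word_extraction_nltk(sentence):
    # strip punctuation, split into words, then drop NLTK's English stopwords
    words = re.sub(r"[^\w]", " ", sentence).split()
    return [w.lower() for w in words if w.lower() not in stop_words]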
Step 2: Apply tokenization to all sentences
def tokenize(sentences):
    words = []
    for sentence in sentences:
        w = word_extraction(sentence)
        words.extend(w)
    # deduplicate and sort to form the vocabulary
    words = sorted(list(set(words)))
    return words
The method iterates over all the sentences, collects the extracted words into one list, and then deduplicates and sorts them to form the vocabulary.
The output of this method will be:
['and', 'arrived', 'at', 'bus', 'but', 'early', 'for', 'i', 'joe', 'late', 'looked', 'mary', 'noon', 'samantha', 'station', 'the', 'took', 'train', 'until', 'waited', 'was']
Step 3: Build vocabulary and generate vectors
Use the methods defined in steps 1 and 2 to create the document vocabulary and extract the words from the sentences.
import numpy

def generate_bow(allsentences):
    vocab = tokenize(allsentences)
    print("Word List for Document \n{0} \n".format(vocab))

    for sentence in allsentences:
        words = word_extraction(sentence)
        # one slot per vocabulary word, initialized to zero
        bag_vector = numpy.zeros(len(vocab))
        for w in words:
            for i, word in enumerate(vocab):
                if word == w:
                    bag_vector[i] += 1
        print("{0}\n{1}\n".format(sentence, bag_vector))
Here is the defined input and execution of our code:
allsentences = ["Joe waited for the train train",
                "The train was late",
                "Mary and Samantha took the bus",
                "I looked for Mary and Samantha at the bus station",
                "Mary and Samantha arrived at the bus station early but waited until noon for the bus"]

generate_bow(allsentences)
The output vectors for each of the sentences are:
Joe waited for the train train
[0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 2. 0. 1. 0.]
The train was late
[0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 1.]
Mary and Samantha took the bus
[1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 1. 0. 0. 0. 0.]
I looked for Mary and Samantha at the bus station
[1. 0. 1. 1. 0. 0. 1. 1. 0. 0. 1. 1. 0. 1. 1. 0. 0. 0. 0. 0. 0.]
Mary and Samantha arrived at the bus station early but waited until noon for the bus
[1. 1. 1. 2. 1. 1. 1. 0. 0. 0. 0. 1. 1. 1. 1. 0. 0. 0. 1. 1. 0.]
As you can see, each sentence is compared against the vocabulary built from the whole document, and the matching vector elements are incremented. These vectors can then be used in ML algorithms for document classification and prediction.
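In practice you rarely need to hand-roll this. For example, scikit-learn's CountVectorizer builds the same kind of count vectors; below is a minimal sketch (its default tokenization and lack of a stopword list mean the exact vocabulary may differ slightly from ours):
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(allsentences)   # sparse document-term count matrix
print(vectorizer.get_feature_names_out())    # the learned vocabulary
print(X.toarray())                           # one count vector per sentence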
