The Naive Bayes text classification algorithm is a probabilistic machine learning model. Harry R. Felson and Robert M. Maxwell designed one of the first text classification methods, using the words appearing in a document to classify it by attributes such as authorship or genre.
Since then, Naive Bayes has become one of the most popular and effective methods for supervised text classification. This article is an introduction to building a simple Naive Bayes document classification system in Python.
Naive Bayes is a probability-based machine learning algorithm that applies Bayes' theorem with a "naive" assumption of independence between the variables (features), which makes it effective even on small datasets. Naive Bayes algorithms are most useful for classification problems and predictive modeling.
An algorithm based on Naive Bayes is a probabilistic classifier: it uses probability models built on strong independence assumptions. These independence assumptions often do not hold in reality, which is why the models are considered "naive".
The probability models themselves come from Bayes' theorem (credited to Thomas Bayes). Depending on the nature of the probability model, the Naive Bayes algorithm can be trained in a supervised learning setting.
A Naive Bayes model can be pictured as a large cube whose dimensions are the input feature, the feature's value, and the target class value.
Bayes' theorem
Suppose you define a hypothesis about your data and then observe some evidence. Bayes' theorem states that the probability of the hypothesis being true given the evidence equals the prior probability of the hypothesis multiplied by the probability of the evidence given the hypothesis, divided by the overall probability of the evidence:
Pr(H|E) = Pr(H) * Pr(E|H) / Pr(E)
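As a toy illustration of the formula (all numbers below are made up for the example), the theorem can be applied directly in a few lines of Python:

```python
# Toy Bayes' theorem example with invented numbers:
# H = "the document is spam", E = "the word 'free' appears".
pr_h = 0.3          # prior Pr(H): fraction of documents that are spam
pr_e_given_h = 0.8  # likelihood Pr(E|H): spam documents containing "free"
pr_e = 0.4          # evidence Pr(E): all documents containing "free"

# Posterior: Pr(H|E) = Pr(H) * Pr(E|H) / Pr(E)
pr_h_given_e = pr_h * pr_e_given_h / pr_e
print(round(pr_h_given_e, 2))  # 0.6
```

In other words, seeing the word "free" raises the probability of spam from the prior 0.3 to a posterior of 0.6 under these (invented) numbers.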
Because we are classifying documents, the hypothesis is that the document belongs to category C. The evidence is the occurrence of the word W.
We can use the ratio form of Bayes' theorem in classification tasks because we are comparing two or more hypotheses; this amounts to comparing the numerators of the formula (for Bayes aficionados: the prior times the likelihood) for each hypothesis:
Pr(C₁|W) / Pr(C₂|W) = (Pr(C₁) * Pr(W|C₁)) / (Pr(C₂) * Pr(W|C₂))
Because a document contains many words, the formula becomes:
Pr(C₁|W₁, W₂, ..., Wn) / Pr(C₂|W₁, W₂, ..., Wn) =
(Pr(C₁) * Pr(W₁|C₁) * Pr(W₂|C₁) * ... * Pr(Wn|C₁)) /
(Pr(C₂) * Pr(W₁|C₂) * Pr(W₂|C₂) * ... * Pr(Wn|C₂))
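To make the ratio form concrete, here is a minimal sketch with invented per-word likelihoods for two hypothetical categories; it multiplies the prior by the per-word likelihoods for each class and compares the results:

```python
# Hypothetical per-word likelihoods Pr(W|C) for two classes,
# as if estimated from (imaginary) training counts.
likelihoods = {
    "sports":   {"ball": 0.05,  "game": 0.04,  "win": 0.03},
    "politics": {"ball": 0.001, "game": 0.005, "win": 0.02},
}
priors = {"sports": 0.5, "politics": 0.5}  # Pr(C)

def class_score(words, category):
    """Numerator of Bayes' rule: Pr(C) times the product of Pr(W|C)."""
    score = priors[category]
    for w in words:
        score *= likelihoods[category][w]
    return score

doc = ["ball", "game", "win"]
ratio = class_score(doc, "sports") / class_score(doc, "politics")
print(ratio)  # much greater than 1, so "sports" is the more likely category
```

Whichever class yields the larger numerator wins; the denominator Pr(W₁, ..., Wn) cancels out in the ratio, which is why it never needs to be computed.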
The Naive Bayes classifier is used in applications such as spam filtering, sentiment analysis, and document categorization.
Often, even very sophisticated classification methods fail to outperform Naive Bayes, in large part because Naive Bayes is so simple and robust.
Assuming that each word is independent of all the others simplifies both the equation and, ultimately, the code. In reality, the words that appear before or after a given word do influence it, but in practice the independence assumption works quite well.
Naive Bayes is built on this assumption, which lets us decompose the numerator into a product of individual per-word probabilities, as in the formula above.
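One practical consequence of multiplying many small per-word probabilities is numerical underflow, which is why implementations typically sum log-probabilities instead of multiplying raw probabilities. A minimal sketch, using made-up probabilities:

```python
import math

word_probs = [1e-4] * 300  # 300 words, each with a tiny Pr(W|C)

# The direct product underflows to exactly 0.0 in 64-bit floats...
product = 1.0
for p in word_probs:
    product *= p
print(product)  # 0.0

# ...but the equivalent sum of logs is perfectly representable.
log_score = sum(math.log(p) for p in word_probs)
print(log_score)  # about -2763.1
```

Since log is monotonic, comparing log-scores picks the same winning class as comparing the raw products would.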
Because of the strict assumptions it makes about the data, a Naive Bayesian classifier can perform worse than a more complex classifier. It does, however, have clear advantages as an initial baseline: if it performs well, you end up with a classifier for your problem that is intuitive, fast, and easy to interpret.
If it does not perform well initially, you now have a baseline for how well a model should perform, and you can explore more sophisticated models from there.
Let's get started by importing the required libraries:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import confusion_matrix, accuracy_score

sns.set()  # use seaborn plotting style
We will now load the data (training and test data):
# Load the dataset
data = fetch_20newsgroups()
# Get the text categories
text_categories = data.target_names
# Define the training set
train_data = fetch_20newsgroups(subset="train", categories=text_categories)
# Define the test set
test_data = fetch_20newsgroups(subset="test", categories=text_categories)
Let's count the classes and samples:
print("Number of unique classes {}".format(len(text_categories)))
print("Number of training samples {}".format(len(train_data.data)))
print("Number of test samples {}".format(len(test_data.data)))
You will get output as:
Number of unique classes 20
Number of training samples 11314
Number of test samples 7532
As a result, we have a 20-class text classification problem (the 20 Newsgroups dataset contains 20 categories) with a training sample size of 11,314 and a test sample size of 7,532 text documents.
Let's take a look at one of the test samples:
print(test_data.data[3])
You should see something like the following, since our data consists of raw text (specifically, newsgroup posts):
Output:
It looks like Ben Baz's mind and heart are also blind, not only his eyes. >I used to respect him, today I lost the minimal amount of respect that >I struggled to keep for him. >To All Muslim netters: This is the same guy who gave a "Fatwah" that >Saudi Arabia can be used by the United States to attack Iraq .They were attacking the Iraqis to drive them out of Kuwait, a country whose citizens have close blood and business ties to Saudi citizens. And me thinks if the US had not helped out the Iraqis would have swallowed Saudi Arabia, too (or at least the eastern oilfields). And no Muslim country was doing much of anything to help liberate Kuwait and protect Saudi Arabia; indeed, in some masses of citizens were demonstrating in favor of that butcher Saddam (who killed lotsa Muslims), just because he was killing, raping, and looting relatively rich Muslims and also thumbing his nose at the West.
So how would have you defended Saudi Arabia and rolled back the Iraqi invasion, were you in charge of Saudi Arabia???
>Fatwah is as legitimate as this one. With that kind of "Clergy", it might >be an Islamic duty to separate religion and politics, if religion >means "official Clergy".
I think that it is a very good idea to not have governments have an official religion (de facto or de jure), because with human nature like it is, the ambitious and not the pious will always be the ones who rise to power. There are just too many people in this world (or any country) for the citizens to really know if a leader is really devout or if he is just a slick operator.
> > CAIRO, Egypt (UPI) -- The Cairo-based Arab Organization for Human > Rights (AOHR) Thursday welcomed the establishement last week of the > Committee for Defense of Legal Rights in Saudi Arabia and said it was > necessary to have such groups operating in all Arab countries.
You make it sound like these guys are angels, Ilyess. (In your clarinet posting you edited out some stuff; was it the following???) Friday's New York Times reported that this group definitely is more conservative than even Sheikh Baz and his followers (who think that the House of Saud does not rule the country conservatively enough). The NYT reported that, besides complaining that the government was not conservative enough, they have:
- asserted that the (approx. 500,000) Shiites in the Kingdom are apostates, a charge that under Saudi (and Islamic) law brings the death penalty. Diplomatic guy (Sheikh bin Jibrin), isn't he Ilyess? - called for severe punishment of the 40 or so women who drove in public a while back to protest the ban on women driving. The guy from the group who said this, Abdelhamoud al-Toweijri, said that these women should be fired from their jobs, jailed, and branded as prostitutes. Is this what you want to see happen, Ilyess? I've heard many Muslims say that the ban on women driving has no basis in the Qur'an, the ahadith, etc. Yet these folks not only like the ban, they want these women falsely called prostitutes? If I were you, I'd choose my heroes wisely, Ilyess, not just reflexively rally behind anyone who hates anyone you hate. - say that women should not be allowed to work. - say that TV and radio are too immoral in the Kingdom.
Now, the House of Saud is neither my least nor my most favorite government on earth; I think they restrict religious and political reedom a lot, among other things. I just think that the most likely replacements for them are going to be a lot worse for the citizens of the country. But I think the House of Saud is feeling the heat lately. In the last six months or so I've read there have been stepped up harassing by the muttawain (religious police---not government) of Western women not fully veiled (something stupid for women to do, IMO, because it sends the wrong signals about your morality). And I've read that they've cracked down on the few, home-based expartiate religious gatherings, and even posted rewards in (government-owned) newspapers offering money for anyone who turns in a group of expartiates who dare worship in their homes or any other secret place. So the government has grown even more intolerant to try to take some of the wind out of the sails of the more-conservative opposition. As unislamic as some of these things are, they're just a small taste of what would happen if these guys overthrow the House of Saud, like they're trying to in the long run.
Is this really what you (and Rached and others in the general west-is-evil-zionists-rule-hate-west-or-you-are-a-puppet crowd) want, Ilyess?
-- Dave Bakken ==>"the President is doing a fine job, but the problem is we don't know what to do with her husband." James Carville (Clinton campaign strategist),2/93 ==>"Oh, please call Daddy. Mom's far too busy." Chelsea to nurse, CSPAN, 2/93
Next, we will build and train a Naive Bayes classifier. We will use the make_pipeline function to chain a TfidfVectorizer, which converts the collection of text documents into a matrix of TF-IDF features, with a MultinomialNB classifier.
# Build the model
model = make_pipeline(TfidfVectorizer(), MultinomialNB())

# Train the model with the training data
model.fit(train_data.data, train_data.target)

# Predict the test data categories
predicted_categories = model.predict(test_data.data)

The last line predicts the labels of the test set.
Here are the predicted category names:
print(np.array(test_data.target_names)[predicted_categories])

Output:

array(['rec.autos', 'sci.crypt', 'alt.atheism', ..., 'rec.sport.baseball',
       'comp.sys.ibm.pc.hardware', 'soc.religion.christian'], dtype='<U24')
Let's construct a multi-class confusion matrix to check whether the model performs well across all categories or only predicts certain text types correctly.
# Plot the confusion matrix
mat = confusion_matrix(test_data.target, predicted_categories)
sns.heatmap(mat.T, square=True, annot=True, fmt="d",
            xticklabels=train_data.target_names,
            yticklabels=train_data.target_names)
plt.xlabel("true labels")
plt.ylabel("predicted labels")
plt.show()

print("Accuracy: {}".format(accuracy_score(test_data.target, predicted_categories)))

Output:

Accuracy: 0.7738980350504514
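With the pipeline trained, a small helper can classify arbitrary new text. The sketch below re-creates the same pipeline as in the article (it assumes scikit-learn is installed and the 20 Newsgroups dataset can be downloaded); `predict_category` is a hypothetical helper name:

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Re-create the article's pipeline (downloads the dataset on first run)
train_data = fetch_20newsgroups(subset="train")
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_data.data, train_data.target)

def predict_category(text):
    """Return the predicted newsgroup name for a raw text string."""
    prediction = model.predict([text])  # the pipeline handles vectorization
    return train_data.target_names[prediction[0]]

print(predict_category("NASA just launched a new probe toward Jupiter"))
print(predict_category("The team won the game in the final inning"))
```

Because the vectorizer is part of the pipeline, raw strings can be passed straight to `model.predict` with no manual preprocessing.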
Naive Bayes is a powerful machine learning algorithm that you can use in Python to create your own spam filters and text classifiers. Naive Bayes classifiers are simple and robust probabilistic classifiers that are particularly useful for text classification tasks. The Naive Bayes algorithm relies on an assumption of conditional independence of features given a class, which is often a good first approximation to real-world phenomena.
Naive Bayes is a popular text classification technique that can quickly provide a reasonably accurate guess as to the category of a document. It is a probabilistic classifier that can give impressive results, and it scales nicely, allowing you to process thousands of documents while often remaining competitive with more sophisticated methods.
Sanskriti is a tech writer and freelance data scientist. She has rich experience in writing technical content and also enjoys writing about mental health, productivity, and self-improvement.