A Naive Bayes spam classifier implemented in Go: a text classification library that uses the Naive Bayes algorithm with Laplace smoothing to classify messages as spam or not spam.
- Naive Bayes classification: probabilistic classification based on Bayes' theorem with naive independence assumptions
- Laplace smoothing: additive smoothing to handle zero probabilities for unseen words
- Training and classification: a simple API for training on labeled datasets and classifying new messages
- Real dataset tests: includes tests against a real spam/ham email dataset
Install:

```shell
go get github.com/igomez10/nspammer
```

Usage:

```go
package main

import (
	"fmt"

	"github.com/igomez10/nspammer"
)

func main() {
	// Create training dataset (map[string]bool where true = spam, false = not spam)
	trainingData := map[string]bool{
		"buy viagra now":        true,
		"get rich quick":       true,
		"meeting at 3pm":        false,
		"project update report": false,
	}

	// Create and train classifier
	classifier := nspammer.NewSpamClassifier(trainingData)

	// Classify new messages
	isSpam := classifier.Classify("buy now")
	fmt.Printf("Is spam: %v\n", isSpam)
}
```

NewSpamClassifier(dataset map[string]bool) *SpamClassifier
Creates a new spam classifier and trains it on the given dataset. The dataset is a map whose keys are text messages and whose values indicate whether the message is spam (true) or not spam (false).
(*SpamClassifier).Classify(input string) bool
Classifies the input text as spam (true) or not spam (false) based on the trained model.
The classifier uses the Naive Bayes algorithm:
Training phase:
- Computes the prior probabilities P(spam) and P(not spam)
- Builds a vocabulary from all training messages
- Calculates word frequencies in spam and non-spam messages
- Stores the word frequencies for later probability calculations
Classification stage:
- Calculates log likelihoods to avoid numerical underflow:
  - Spam score: log(P(spam)) + Σ log(P(word|spam))
  - Not-spam score: log(P(notspam)) + Σ log(P(word|notspam))
- Returns true (spam) if the spam score exceeds the not-spam score
Laplace smoothing:
- Adds a smoothing constant to avoid zero probabilities for unseen terms
- Formula: P(word|class) = (count + α) / (total + α × vocabulary_size)
- Default: α = 1.0
This project includes support for the Kaggle Spam Mails dataset. The download script requires the Kaggle CLI to be installed and configured.
Run the test suite:
The tests include:
- A simple classification example
- Evaluation on a real-world email dataset
- Accuracy measurement on a train/test split
