igomez10/nspammer: Spam classifier in pure Go using Naive Bayes

nspammer is a Naive Bayes spam classifier written in pure Go. It classifies text messages as spam or not spam using the Naive Bayes algorithm with Laplace smoothing.

  • Naive Bayes classification: probabilistic classification based on Bayes' theorem with naive independence assumptions
  • Laplace smoothing: additive smoothing to handle zero probabilities for unseen words
  • Training and classification: simple API for training on labeled datasets and classifying new messages
  • Real dataset test: includes tests against a real spam/ham email dataset
Installation:

go get github.com/igomez10/nspammer

Usage:
package main

import (
    "fmt"
    "github.com/igomez10/nspammer"
)

func main() {
    // Create training dataset (map[string]bool where true = spam, false = not spam)
    trainingData := map[string]bool{
        "buy viagra now":           true,
        "get rich quick":           true,
        "meeting at 3pm":           false,
        "project update report":    false,
    }

    // Create and train classifier
    classifier := nspammer.NewSpamClassifier(trainingData)

    // Classify new messages
    isSpam := classifier.Classify("buy now")
    fmt.Printf("Is spam: %v\n", isSpam)
}

NewSpamClassifier(dataset map[string]bool) *SpamClassifier

Creates a new spam classifier and trains it on the given dataset. The dataset is a map whose keys are text messages and whose values indicate whether the message is spam (true) or not spam (false).

(*SpamClassifier).Classify(input string) bool

Classifies the input text as spam (true) or not spam (false) based on the trained model.

The classifier uses the Naive Bayes algorithm:

  1. Training phase:

    • Computes the prior probabilities P(spam) and P(not spam)
    • Builds a vocabulary from all training messages
    • Calculates word frequencies in spam and non-spam messages
    • Stores the word frequencies for later probability calculations
  2. Classification phase:

    • Works with log likelihoods to avoid numerical underflow
    • Spam score: log(P(spam)) + Σ log(P(word|spam))
    • Non-spam score: log(P(notspam)) + Σ log(P(word|notspam))
    • Returns true (spam) if the spam score exceeds the non-spam score
  3. Laplace smoothing:

    • Adds a smoothing constant to avoid zero probabilities for unseen terms
    • Formula: P(word|class) = (count + α) / (total + α × vocabulary_size)
    • Default α = 1.0
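
The three steps above can be sketched in a few dozen lines of Go. This is a simplified, self-contained illustration, not nspammer's actual implementation; the type and function names are invented for the example.

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// classifier holds per-class word counts, document counts, and the
// vocabulary. A sketch only; nspammer's internals may differ.
type classifier struct {
	spamWords, hamWords map[string]int
	spamTotal, hamTotal int // total word occurrences per class
	spamDocs, hamDocs   int // document counts, used for priors
	vocab               map[string]struct{}
	alpha               float64 // Laplace smoothing constant
}

// train computes priors, builds the vocabulary, and counts word
// frequencies per class (the training phase above).
func train(dataset map[string]bool) *classifier {
	c := &classifier{
		spamWords: map[string]int{},
		hamWords:  map[string]int{},
		vocab:     map[string]struct{}{},
		alpha:     1.0,
	}
	for msg, isSpam := range dataset {
		if isSpam {
			c.spamDocs++
		} else {
			c.hamDocs++
		}
		for _, w := range strings.Fields(strings.ToLower(msg)) {
			c.vocab[w] = struct{}{}
			if isSpam {
				c.spamWords[w]++
				c.spamTotal++
			} else {
				c.hamWords[w]++
				c.hamTotal++
			}
		}
	}
	return c
}

// classify sums log likelihoods with Laplace smoothing and returns
// true if the spam score exceeds the non-spam score.
func (c *classifier) classify(input string) bool {
	v := float64(len(c.vocab))
	total := float64(c.spamDocs + c.hamDocs)
	spamScore := math.Log(float64(c.spamDocs) / total)
	hamScore := math.Log(float64(c.hamDocs) / total)
	for _, w := range strings.Fields(strings.ToLower(input)) {
		// P(word|class) = (count + α) / (total + α × vocabulary_size)
		spamScore += math.Log((float64(c.spamWords[w]) + c.alpha) / (float64(c.spamTotal) + c.alpha*v))
		hamScore += math.Log((float64(c.hamWords[w]) + c.alpha) / (float64(c.hamTotal) + c.alpha*v))
	}
	return spamScore > hamScore
}

func main() {
	data := map[string]bool{
		"buy viagra now":        true,
		"get rich quick":        true,
		"meeting at 3pm":        false,
		"project update report": false,
	}
	c := train(data)
	fmt.Println(c.classify("buy now"))        // true: both words occur only in spam
	fmt.Println(c.classify("meeting report")) // false: both words occur only in ham
}
```

Note how the smoothing constant α = 1.0 keeps the log of unseen words finite: a word with count 0 still contributes log(α / (total + α × V)) rather than log(0).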

This project includes support for the Kaggle Spam Mails dataset. The download script requires the Kaggle CLI to be installed and configured.

Run the test suite:

The tests include:

  • Simple classification example
  • Real-world email dataset evaluation
  • Accuracy measurement on a train/test split
