ErfanMomeniii/simhash

simhash

simhash is a lightweight Go package for generating Simhash tokens and calculating their similarity using Moses Charikar's Simhash algorithm. It is well suited to text deduplication, plagiarism detection, near-duplicate content detection, and content fingerprinting.

For detailed usage, see the package documentation.


Documentation

Install

To get started with simhash, install it using:

go get github.com/erfanmomeniii/simhash

Next, include it in your application:

import "github.com/erfanmomeniii/simhash"

Quick Start

The following example demonstrates how to generate Simhash tokens and calculate similarity:

package main

import (
	"fmt"
	"github.com/erfanmomeniii/simhash"
)

func main() {
	// Create a new Simhash instance
	s := simhash.NewSimhash()

	// Add features with weights
	s.AddFeature("example", 2)
	s.AddFeature("test", 5)

	// Generate a Simhash token
	token1 := s.GenerateToken()

	// Create another Simhash instance with different features
	s2 := simhash.NewSimhash()
	s2.AddFeature("example", 2)
	s2.AddFeature("testcase", 5)

	// Generate another token
	token2 := s2.GenerateToken()

	// Compute similarity between the two tokens
	similarity := simhash.ComputeSimilarity(token1, token2)

	// Print both tokens and the similarity rounded to two decimal places
	fmt.Printf("Token1: %s\nToken2: %s\nSimilarity: %.2f\n", token1, token2, similarity)
}

Output:

Token1: F9E6E6EF197C2B25
Token2: FDA981914657B7D1
Similarity: 43.75

Features

Add Feature

Add features with their weights to the Simhash generator:

s.AddFeature("example", 5)
s.AddFeature(12345, 10)

Generate Token

Generate a 64-bit Simhash token, returned as a 16-character hexadecimal string, based on the added features:

token := s.GenerateToken()
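
For intuition, the general idea behind Charikar-style token generation can be sketched as follows: hash each feature to 64 bits, add the feature's weight to a per-bit counter where the hash bit is 1 and subtract it where the bit is 0, then set each output bit whose counter ends up positive. This is a minimal sketch and not this package's implementation. The names weightedFeature and fingerprint are hypothetical, and FNV-1a is an assumed hash function, so the values it prints will not match tokens produced by GenerateToken.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// weightedFeature is a hypothetical pair of a feature and its weight.
type weightedFeature struct {
	value  string
	weight int
}

// fingerprint folds weighted features into a 64-bit Simhash-style value.
// Assumption: FNV-1a is used as the per-feature hash; the real package may differ.
func fingerprint(features []weightedFeature) uint64 {
	var counts [64]int
	for _, f := range features {
		h := fnv.New64a()
		h.Write([]byte(f.value)) // Write on a hash.Hash never returns an error
		sum := h.Sum64()
		for bit := 0; bit < 64; bit++ {
			if sum&(1<<uint(bit)) != 0 {
				counts[bit] += f.weight // a set bit votes for 1
			} else {
				counts[bit] -= f.weight // a clear bit votes for 0
			}
		}
	}
	var out uint64
	for bit, c := range counts {
		if c > 0 {
			out |= 1 << uint(bit)
		}
	}
	return out
}

func main() {
	fp := fingerprint([]weightedFeature{{"example", 2}, {"test", 5}})
	fmt.Printf("%016X\n", fp)
}
```

Because every bit is decided by a weighted vote, a feature with a large weight dominates the result, and inputs that share most of their heavy features tend to produce fingerprints that differ in only a few bits.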

Compute Similarity

Calculate the similarity between two Simhash tokens as a percentage derived from their normalized Hamming distance (a higher value means more matching bits):

similarity := simhash.ComputeSimilarity(token1, token2)
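
The sample output in Quick Start is consistent with reading the similarity as the share of matching bits: the two tokens differ in 36 of 64 bits, and 100 × (64 − 36) / 64 = 43.75. The sketch below reproduces that arithmetic with the standard library only. The helper name similarityPercent is hypothetical, and the formula is an assumption inferred from the sample output rather than taken from the package source.

```go
package main

import (
	"fmt"
	"math/bits"
	"strconv"
)

// similarityPercent is a hypothetical helper that parses two 64-bit hex tokens
// and returns the percentage of bit positions in which they agree.
func similarityPercent(a, b string) (float64, error) {
	x, err := strconv.ParseUint(a, 16, 64)
	if err != nil {
		return 0, err
	}
	y, err := strconv.ParseUint(b, 16, 64)
	if err != nil {
		return 0, err
	}
	// XOR leaves a 1 in every position where the tokens differ.
	distance := bits.OnesCount64(x ^ y)
	return 100 * float64(64-distance) / 64, nil
}

func main() {
	sim, err := similarityPercent("F9E6E6EF197C2B25", "FDA981914657B7D1")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%.2f\n", sim) // prints 43.75
}
```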

Supported Feature Types

The AddFeature method accepts the following types:

  • Strings: e.g., "example"
  • Numbers: integer and floating-point values, e.g., 123 or 3.14
  • Byte slices: e.g., []byte("example")
  • Any other type: Converted using JSON serialization
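
One way such type handling could work is a type switch that turns each feature into bytes before hashing, with JSON as the fallback. The following is a sketch under stated assumptions: featureBytes is a hypothetical helper, the exact set of numeric cases and their byte encoding are guesses, and none of this is taken from the package source.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// featureBytes is a hypothetical helper that normalizes a feature of any
// supported type into the byte slice that would be fed to the hash function.
func featureBytes(v any) ([]byte, error) {
	switch f := v.(type) {
	case string:
		return []byte(f), nil
	case []byte:
		return f, nil
	case int:
		return []byte(strconv.Itoa(f)), nil
	case float64:
		return []byte(strconv.FormatFloat(f, 'g', -1, 64)), nil
	default:
		// Assumption: everything else goes through JSON serialization.
		return json.Marshal(f)
	}
}

func main() {
	inputs := []any{"example", 12345, 1.5, []byte("raw"), map[string]int{"a": 1}}
	for _, v := range inputs {
		b, err := featureBytes(v)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%T -> %s\n", v, b)
	}
}
```

Note that values JSON cannot encode, such as channels or functions, make json.Marshal return an error, so a fallback of this kind needs an error path.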

Contributing

Pull requests are welcome! For any changes, please open an issue first to discuss the proposed modification. Ensure tests are updated accordingly.
