HAC-T clustering is very slow with larger data size, for 500K tlsh list it took ~6 hours #124

Description

@SrikanthPusarla

Hi,
HAC-T clustering of a list of 500K TLSH hashes took about 6 hours on my machine, but the paper reports about 2 hours 10 minutes for 10 million samples
(HAC-T and Fast Search for Similarity in Security: https://tlsh.org/papersDir/COINS_2020_camera_ready.pdf).

How did you achieve the faster clustering? Does hac-t.py support multithreading?

My experiment:
Data: 500K TLSH hashes as input
Command: python hac-t.py -f -o -cdist 90 -showtime 1 -showcl 1
Machine: 16 cores, 122 GB RAM
Python 3.8.8
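
To put the gap in numbers, here is a rough throughput comparison that uses only the figures quoted above (500K hashes in about 6 hours here, 10 million samples in about 2 hours 10 minutes in the paper). The helper name `samples_per_hour` is made up for illustration, and the result is only a per-sample average: it says nothing about how either run scales with input size.

```python
def samples_per_hour(n_samples, hours):
    """Average number of samples clustered per hour of wall-clock time."""
    return n_samples / hours

# Figures from this issue: 500K hashes in ~6 hours.
mine = samples_per_hour(500_000, 6.0)

# Figures quoted from the paper: 10 million samples in ~2 h 10 min.
paper = samples_per_hour(10_000_000, 2 + 10 / 60)

# How many times faster the paper's run is, per sample.
ratio = paper / mine

print(f"my run: {mine:,.0f} samples/hour")
print(f"paper:  {paper:,.0f} samples/hour")
print(f"paper is roughly {ratio:.0f}x faster per sample")
```

So the run described here is roughly 55 times slower per sample than the paper's reported result, which is why I am asking whether something like multithreading is involved.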

Thanks
