Days of Future Past: Towards Robust Detection of Malware Variants via LLM-Based Embedding Generation

Abstract

The evolution of a parent malware into a family of slightly different mutations may hinder signature-based detection mechanisms, while the limited number of training examples may reduce the effectiveness of machine learning methods in the early stages of an infection. To address these challenges, we define a framework that improves how well detection generalizes to “evolving” malware samples. Specifically, we leverage a Large Language Model (LLM) to map malware instructions into a latent space. The resulting embeddings are then used to train a Variational Autoencoder that generates realistic variants. Experimental results obtained by training a detector on both real and synthetic embeddings demonstrate the effectiveness of our approach, especially when evaluated against three real malware families. Our LLM-based feature extraction approach should therefore be considered a promising mechanism for robust malware detection in dynamic threat environments.

Publication
IEEE International Conference on Data Mining, ICDM 2025 - Workshops, Washington, DC, USA, November 12-15, 2025