Applying Data Science to Malware — Part 3

Dec 30, 2020

Now we will build a machine learning detector. To do that, we need to extract a substantial number of features from software binaries in general, not just malware, because the point of the detector is to determine whether a software binary is malicious or benign.
At the moment I’m only using the strings feature; in the future I plan to add more.

Strings feature

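import re
import numpy

def get_string_features(path, hasher):
    # Match runs of 5 or more printable ASCII characters (space through tilde).
    chars = r" -~"
    min_length = 5
    string_regexp = "[%s]{%d,}" % (chars, min_length)
    pattern = re.compile(string_regexp.encode("ascii"))
    # Read the file as raw bytes; a binary is not valid text.
    with open(path, "rb") as file_object:
        data = file_object.read()
    strings = pattern.findall(data)
    # Mark each distinct string as present (1) in the binary.
    string_features = {}
    for string in strings:
        string_features[string.decode("ascii")] = 1
    # The hasher takes a list of dictionaries and returns a sparse matrix.
    hashed_features = hasher.transform([string_features])
    # Convert the single sparse row into a plain 1-D NumPy array.
    hashed_features = hashed_features.todense()
    hashed_features = numpy.asarray(hashed_features)
    hashed_features = hashed_features[0]
    print("Extracted {0} strings from {1}".format(len(string_features), path))
    return hashed_features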

We start off by defining a function that takes 2 parameters: the path to a file and a hasher. The hasher comes from the sklearn library.

Sklearn

Sklearn is short for scikit-learn, a highly popular open-source machine learning package. You can learn more about it here: https://scikit-learn.org/stable/getting_started.html

The hasher lets us compress an enormous number of features down to a smaller, fixed-size chunk so that your hardware can handle the amount of data being processed. Working with 4,000 compressed features instead of 1 million makes a big difference.
We then extract all the strings from the file passed in, keeping only strings that are 5 or more characters long.
A for loop then goes through every extracted string and stores it in our “string_features” dictionary with a value of “1” to say that the feature is present in the software binary. The sklearn hasher expects a list of dictionaries, which is why we wrap the dictionary in a list:

hashed_features = hasher.transform([string_features])

After that, we convert hashed_features from the sparse matrix the hasher returns into a normal NumPy vector, and then we return it.
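To make the hashing step concrete, here is a minimal sketch that runs the same steps on a few in-memory bytes instead of a file. It assumes the hasher is scikit-learn's FeatureHasher with 16 output features; the post has not shown how the hasher is created at this point, so both the class and the size are assumptions chosen for illustration, and the sample bytes are made up.

```python
import re

import numpy
from sklearn.feature_extraction import FeatureHasher

# Assumption: the hasher is a FeatureHasher; 16 outputs keeps the demo small.
hasher = FeatureHasher(n_features=16)

# Made-up bytes standing in for the contents of a binary.
data = b"\x00\x01kernel32.dll\x00\xffabc\x00LoadLibraryA\x02\x03"

# Runs of 5 or more printable ASCII characters (space through tilde).
pattern = re.compile(b"[ -~]{5,}")
strings = [s.decode("ascii") for s in pattern.findall(data)]

# One dictionary per sample, with each found string marked as present.
string_features = {string: 1 for string in strings}

# transform() takes a list of dictionaries and returns a sparse matrix.
hashed = hasher.transform([string_features])
vector = numpy.asarray(hashed.todense())[0]

print(strings)       # "abc" is dropped because it is shorter than 5 characters
print(vector.shape)  # the length is fixed by n_features, not by the string count
```

However many strings a binary contains, the resulting vector always has the same length, which is what lets the classifier treat every file the same way.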

Train detector

Now that we have built the strings feature, we can build a function that extracts the data from the software binaries we pass in and trains our detector.

import os
import pickle
from sklearn import ensemble

def train_detector(benign_path, malicious_path, hasher):
    # Helper: build the full path of every file in a directory.
    def get_training_paths(directory):
        targets = []
        for path in os.listdir(directory):
            targets.append(os.path.join(directory, path))
        return targets
    malicious_paths = get_training_paths(malicious_path)
    benign_paths = get_training_paths(benign_path)
    # X holds one feature vector per binary; y holds its label (1 = malicious, 0 = benign).
    X = [get_string_features(path, hasher) for path in malicious_paths + benign_paths]
    y = [1 for i in range(len(malicious_paths))] + [0 for i in range(len(benign_paths))]
    #classifier = tree.DecisionTreeClassifier()
    classifier = ensemble.RandomForestClassifier(64)
    classifier.fit(X, y)
    # Pickle writes bytes, so the file is opened in binary mode.
    with open("saved_detector.pkl", "wb") as saved_detector:
        pickle.dump((classifier, hasher), saved_detector)

Our function here takes 3 parameters: the first is the directory of “non-malicious” binaries, the second is the directory of malware, and the third is our hasher (which is defined later).

Like before, we need the full file path of each file within the directory we supply, and these will be our targets. This time it is done by a sub-function (helper function) of the “train_detector” function.
Straight after, you can see we use it to get the file paths for both “benign_path” and “malicious_path”.
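As a quick check of what the helper produces, the sketch below runs the same listing logic against a temporary directory. The directory and the file names are made up for the example.

```python
import os
import tempfile

def get_training_paths(directory):
    # os.listdir returns bare names, so join the directory back onto each one.
    return [os.path.join(directory, name) for name in os.listdir(directory)]

with tempfile.TemporaryDirectory() as directory:
    # Create two empty placeholder files (hypothetical sample names).
    for name in ("sample_a.bin", "sample_b.bin"):
        open(os.path.join(directory, name), "wb").close()
    paths = sorted(get_training_paths(directory))
    names = [os.path.basename(path) for path in paths]
    all_exist = all(os.path.exists(path) for path in paths)

print(names)
```

Note that os.listdir makes no promise about ordering, which is why the sketch sorts the result before printing it.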

Now we can extract our features for each of the supplied paths and create our label vector.

Vectors

A vector in machine learning (ML) is an array of numbers where each index corresponds to a single feature. An example from above is an extracted string set to “1” to say it exists. In another scenario, we could set it to “0” to say it does not exist.

In our example, we have two of these, X and y. X holds the feature vectors (one per binary, as returned from “get_string_features”) and y is the label vector, which labels each binary as either malware (1) or benign (0).
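Here is a small sketch of how the labels line up with the feature vectors. The file names and the three-element vectors are invented for illustration; real vectors come from get_string_features and are as long as the hasher's output size.

```python
# Hypothetical file lists; malicious files come first, as in train_detector.
malicious_paths = ["mal_1.exe", "mal_2.exe"]
benign_paths = ["ok_1.exe", "ok_2.exe", "ok_3.exe"]

# Made-up feature vectors, one row per file, in the same order as the paths.
X = [
    [1, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
    [0, 1, 0],
    [0, 0, 0],
]

# One label per file: 1 for malicious, 0 for benign.
y = [1] * len(malicious_paths) + [0] * len(benign_paths)

for path, features, label in zip(malicious_paths + benign_paths, X, y):
    print(path, features, label)
```

The only thing tying a row of X to its label is its position, so the order used to build X has to match the order used to build y.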

Next, we build our decision tree. I won’t delve into how decision trees work, but know that they can be used for detection. A decision tree asks a series of questions. For example: does the binary contain 50%+ strings that match our known malware strings? If yes, follow this path of questions; otherwise, go down the other path. That is the way I look at it for now to keep it simple.

#classifier = tree.DecisionTreeClassifier()
classifier = ensemble.RandomForestClassifier(64)

Once we’ve decided which classifier we want to use (a random forest is a collection of many decision trees, 64 of them here), we pass X and y into it and that trains it.

classifier.fit(X,y)
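To get a feel for the “series of questions” idea and for what fit does, here is a minimal sketch on toy data. The four two-feature samples and the feature names are invented for illustration, and export_text is used only to print the questions a single tree learned; it is not part of the detector in this post.

```python
from sklearn import ensemble, tree

# Toy data: each row is [has_string_a, has_string_b]; label 1 = malicious.
X = [[1, 0], [1, 1], [0, 1], [0, 0]]
y = [1, 1, 0, 0]

# A single decision tree learns a sequence of threshold questions.
single_tree = tree.DecisionTreeClassifier(random_state=0)
single_tree.fit(X, y)
print(tree.export_text(single_tree, feature_names=["has_string_a", "has_string_b"]))

# A random forest trains many trees (64 here) and combines their predictions.
forest = ensemble.RandomForestClassifier(64, random_state=0)
forest.fit(X, y)

tree_predictions = [int(p) for p in single_tree.predict(X)]
forest_size = len(forest.estimators_)
print(tree_predictions, forest_size)
```

In this toy set the first feature separates the two classes on its own, so the single tree only needs one question to classify its training rows.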

And we’ll save our detector and hasher using the Python pickle module.

with open("saved_detector.pkl", "wb") as saved_detector:
    pickle.dump((classifier, hasher), saved_detector)
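Below is a minimal sketch of the save and load round trip. Plain Python objects stand in for the classifier and hasher (an assumption made to keep the example free of dependencies), and a temporary directory is used so nothing is left on disk. The same pattern should apply to the real pair.

```python
import os
import pickle
import tempfile

# Stand-ins for (classifier, hasher); any picklable pair is handled the same way.
detector = ({"kind": "classifier"}, {"kind": "hasher"})

with tempfile.TemporaryDirectory() as directory:
    saved_path = os.path.join(directory, "saved_detector.pkl")
    # Pickle reads and writes bytes, so both opens use binary mode.
    with open(saved_path, "wb") as saved_detector:
        pickle.dump(detector, saved_detector)
    with open(saved_path, "rb") as saved_detector:
        classifier, hasher = pickle.load(saved_detector)

print(classifier, hasher)
```

Saving both objects as one tuple means they are always loaded together, so a file is never scanned with a hasher that differs from the one used in training.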

Scan file

Now we’ll write a function that takes a binary and checks it against our trained detector to see if we can tell whether it is malicious or benign.

import os
import pickle
import sys

def scan_file(path):
    if not os.path.exists("saved_detector.pkl"):
        print("Train a detector before scanning files.")
        sys.exit(1)
    # Pickle data is binary, so the file is opened in binary mode.
    with open("saved_detector.pkl", "rb") as saved_detector:
        classifier, hasher = pickle.load(saved_detector)
    features = get_string_features(path, hasher)
    # Row 0 is our single sample; column 1 is the probability of class 1 (malicious).
    result_proba = classifier.predict_proba([features])[0, 1]
    if result_proba > 0.5:
        print("It appears this file is malicious!", result_proba)
    else:
        print("It appears this file is benign.", result_proba)

Most of it is self-explanatory, but the main bit I want to cover is:

classifier, hasher = pickle.load(saved_detector)
features = get_string_features(path,hasher)
result_proba = classifier.predict_proba([features])[0, 1]

We load our classifier and hasher from the trained detector using the pickle.load method.
We run “get_string_features” against the new binary with our original hasher and store the result in the local features variable.
We pass that to our classifier (the random forest) to predict the probability of the binary being malware.
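The shape of what predict_proba returns is easy to get wrong, so here is a minimal sketch on toy data. The training rows are invented; the point is only that the input must be a list of samples, and that the output has one row per sample and one column per class.

```python
from sklearn import ensemble

# Toy training set: label 1 = malicious, 0 = benign.
X = [[1, 0], [1, 1], [0, 1], [0, 0]]
y = [1, 1, 0, 0]

classifier = ensemble.RandomForestClassifier(64, random_state=0)
classifier.fit(X, y)

features = [1, 0]  # a single sample

# Wrap the sample in a list: predict_proba expects a 2-D input.
probabilities = classifier.predict_proba([features])

# Columns follow classifier.classes_, so column 1 is the "malicious" class.
malicious_probability = probabilities[0, 1]

print(classifier.classes_, probabilities.shape, malicious_probability)
```

Each row of the result sums to 1, so the benign probability is simply one minus the malicious probability.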

Put to the test

Now let’s test the detector out. I passed in the same malware samples from my previous posts and some random benign binary that was on my machine.

That will train our detector to tell the difference between the two supplied paths. Now let’s pass in a malware sample to see the result:

………….Hmm………