Published on March 11, 2014
FFRI, Inc.
Monthly Research
Effectiveness of unknown malware classification by logistic regression analysis
FFRI, Inc. http://www.ffri.jp Ver 2.00.01
Malware Classification by Static Information
• Classifies malware using static information from executables
• Examples of the information used:
– Section names
– Imported DLLs and APIs
– File size
• Malware often has structures or uses APIs that are rare in ordinary executables, so combinations of these features allow us to classify malware.
Problems
• These features are used in various ways, including logistic regression analysis, to classify malware. However, we still do not know whether features that are effective on one file set remain effective on an unknown file set.
• It is also unclear whether the detection rate and false positive rate measured on the learning file set carry over to other files.
Investigation
• Apply logistic regression analysis to static information of executables and measure the detection rate and false positive rate.
• Investigate how these rates change on a different file set.
• For the detection rate in particular, it is important to see how the features of malware collected in one period differ from those of malware collected in the following period.
Evaluation method
• Prepare 16,000 malware samples
– Randomly pick 8,000 from malware found from January to June 2013
– Randomly pick 8,000 from malware found from July to December 2013
• Randomly pick 16,000 normal files
– Divide them into two sets of 8,000 each
• Apply logistic regression analysis to one file set to obtain a classification function, then apply that function to the other file set.
Evaluation method (diagram)
• Learning set: malware 8,000 (Jan to Jun 2013) + normal files 8,000
– Gives the detection rate and false positive rate at a threshold
• Evaluation set: malware 8,000 (Jul to Dec 2013) + normal files 8,000
– How do the detection rate and false positive rate change?
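The learn-then-evaluate procedure above can be sketched in a few lines. The slides do not say which software or fitting procedure was used, so this is only a minimal sketch under assumptions: plain batch gradient descent, and synthetic two-feature data standing in for the real file sets. All names (`train_logistic`, `score`, `make_set`, `accuracy`) are hypothetical.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score(w, b, x):
    """Classification function: output between 0 and 1, higher = more malware-like."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def train_logistic(X, y, lr=0.5, epochs=300):
    """Fit weights by batch gradient descent on the logistic loss (assumed optimizer)."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = score(w, b, xi) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def make_set(rng, n_per_class):
    """Synthetic stand-in for one file set: label 1 = malware, 0 = normal."""
    X, y = [], []
    for label, mean in ((1, 1.5), (0, -1.5)):
        for _ in range(n_per_class):
            X.append([rng.gauss(mean, 1.0), rng.gauss(mean, 1.0)])
            y.append(label)
    return X, y


def accuracy(w, b, X, y, threshold=0.5):
    """Fraction of samples whose thresholded score matches the label."""
    hits = sum(1 for xi, yi in zip(X, y)
               if (score(w, b, xi) >= threshold) == (yi == 1))
    return hits / len(X)


rng = random.Random(0)
X_learn, y_learn = make_set(rng, 200)   # plays the role of the learning set
X_eval, y_eval = make_set(rng, 200)     # plays the role of the evaluation set

w, b = train_logistic(X_learn, y_learn)  # fit on the learning set only
print("learning accuracy:  ", accuracy(w, b, X_learn, y_learn))
print("evaluation accuracy:", accuracy(w, b, X_eval, y_eval))
```

The important point is that the evaluation set is never seen during fitting, which mirrors the separation between the two halves of 2013 in the slides.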
Features
• Extract the features below
– File size
– Is packed? (0 or 1)
– Is the packer UPX? (0 or 1)
– Is a DLL? (0 or 1)
– Is a driver? (0 or 1)
– Is a Visual Basic application? (0 or 1)
– Is a .NET application? (0 or 1)
– Is a control panel application? (0 or 1)
– Has GUI? (0 or 1)
– Has invalid DOS stub? (0 or 1)
– Number of APIs often used by malware (8 at maximum)
– Number of DLLs often used by malware (8 at maximum)
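One way to hold the twelve features listed above is a small record that turns into a numeric vector. This is an illustrative sketch only: the field names and the `to_vector` method are hypothetical, the file size is passed through unscaled because the slides do not say how it was converted, and the cap of 8 follows the "8 at maximum" note.

```python
from dataclasses import dataclass
from typing import List

MAX_COUNT = 8  # "8 at maximum" for the API and DLL counts


@dataclass
class StaticFeatures:
    file_size: int
    is_packed: int
    is_upx: int
    is_dll: int
    is_driver: int
    is_visual_basic: int
    is_dotnet: int
    is_control_panel: int
    has_gui: int
    has_invalid_dos_stub: int
    suspicious_api_count: int
    suspicious_dll_count: int

    def to_vector(self) -> List[float]:
        """Return the 12 features as floats: size, ten 0/1 flags... wait for counts."""
        flags = [
            self.is_packed, self.is_upx, self.is_dll, self.is_driver,
            self.is_visual_basic, self.is_dotnet, self.is_control_panel,
            self.has_gui, self.has_invalid_dos_stub,
        ]
        counts = [self.suspicious_api_count, self.suspicious_dll_count]
        return (
            [float(self.file_size)]                       # raw size (assumed unscaled)
            + [1.0 if f else 0.0 for f in flags]          # nine 0/1 flags
            + [float(min(c, MAX_COUNT)) for c in counts]  # two counts capped at 8
        )


sample = StaticFeatures(
    file_size=4096, is_packed=1, is_upx=1, is_dll=0, is_driver=0,
    is_visual_basic=0, is_dotnet=0, is_control_panel=0, has_gui=1,
    has_invalid_dos_stub=0, suspicious_api_count=11, suspicious_dll_count=3,
)
print(sample.to_vector())
```

Which APIs and DLLs count as "often used by malware" is not given in the slides, so the two counts are taken here as already computed inputs.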
Result
• First, classify the learning file set by applying logistic regression analysis.
• The closer the output is to 1, the more likely the file is malware.
• The features we selected produced a distinguishable difference between normal files and malware.
[Figure: classification result by sample ID. Sample IDs 0 - 8000: malware (Jan to Jun 2013); 8001 - 16000: normal files]
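For reference, the output that ranges between 0 and 1 is the standard logistic regression model. The notation below is an assumption, not taken from the slides: $x_1, \dots, x_{12}$ are the twelve feature values and $w_0, \dots, w_{12}$ are the fitted coefficients.

```latex
p(\mathbf{x}) = \frac{1}{1 + \exp\!\Bigl(-\bigl(w_0 + \sum_{i=1}^{12} w_i x_i\bigr)\Bigr)}
```

A large positive weighted sum pushes $p(\mathbf{x})$ toward 1 (malware-like), and a large negative sum pushes it toward 0 (normal-like).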
Result
• Next, examine the result for the evaluation file set.
• This also shows a distinguishable difference between normal files and malware.
• The result is similar to that of the learning file set.
[Figure: classification result by sample ID. Sample IDs 0 - 8000: malware (Jul to Dec 2013); 8001 - 16000: normal files]
Result
• From a practical standpoint, we want to keep the false positive rate below 1.0%.
• Overlay both results and set the threshold to 0.9.

Threshold 0.9    Detection rate    False positive rate
Learning         19.2%             0.825%
Evaluation       22.0%             1.13%
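The two rates in the table can be computed from classifier outputs as sketched below. The score lists are made-up toy values, the function name `rates_at_threshold` is hypothetical, and treating a score equal to the threshold as a detection is an assumption (the slides do not say whether the comparison is strict).

```python
def rates_at_threshold(malware_scores, normal_scores, threshold=0.9):
    """Return (detection rate, false positive rate) at the given threshold.

    A file is flagged when its score is >= threshold (assumed convention).
    """
    detected = sum(1 for s in malware_scores if s >= threshold)
    false_positives = sum(1 for s in normal_scores if s >= threshold)
    return detected / len(malware_scores), false_positives / len(normal_scores)


# Toy scores, not data from the study.
malware_scores = [0.95, 0.40, 0.91, 0.10, 0.99]
normal_scores = [0.05, 0.20, 0.93, 0.01]

detection_rate, false_positive_rate = rates_at_threshold(malware_scores, normal_scores)
print(detection_rate, false_positive_rate)  # 0.6 0.25
```

By the same arithmetic, with 8,000 files per set the percentages in the table correspond to roughly 1,536 (learning) and 1,760 (evaluation) detected malware samples, and about 66 and 90 false positives. These counts are derived from the percentages and are not stated in the slides.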
Consideration
• The results for the learning file set and the evaluation file set do not differ greatly.
• Lowering the threshold improves the detection rate if more false positives are acceptable.
• On the other hand, there are groups of files that cannot be distinguished using the features we selected.
[Figure annotations: "Threshold may be here (needs to be weighed against false positives)"; "The files on this line cannot be distinguished."]
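The trade-off described above, lowering the threshold until the false positive budget is used up, can be sketched as a simple scan. This is a hypothetical illustration: the function name, the 0.01 step grid, the toy scores and the 10% budget are all assumptions chosen to keep the example small (the slides aim for a budget of 1.0%).

```python
def lowest_threshold_within_budget(normal_scores, max_fp_rate, steps=100):
    """Return the smallest grid threshold whose false positive rate fits the budget.

    Thresholds 0.00, 0.01, ..., 1.00 are tried in increasing order; a file is
    flagged when score >= threshold. Returns None if no grid point qualifies.
    """
    for i in range(steps + 1):
        threshold = i / steps
        flagged = sum(1 for s in normal_scores if s >= threshold)
        if flagged / len(normal_scores) <= max_fp_rate:
            return threshold
    return None


# Toy scores, not data from the study.
normal_scores = [0.05, 0.20, 0.93, 0.01, 0.30, 0.15, 0.02, 0.60, 0.08, 0.11]
malware_scores = [0.95, 0.40, 0.91, 0.70, 0.99]

threshold = lowest_threshold_within_budget(normal_scores, max_fp_rate=0.10)
detected = sum(1 for s in malware_scores if s >= threshold)
print(threshold, detected / len(malware_scores))  # 0.61 0.8
```

In this toy data, one malware sample scores 0.40, below most of what a usable threshold could reach. No threshold choice recovers such files without also flagging many normal files, which is the situation the last bullet describes.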
Summary
• The method and features used this time showed the same tendency on the learning file set and the evaluation file set.
• For malware in particular, the tendency was similar between samples from the first half and the second half of 2013 (in terms of the features we selected).
• As future work, we should revisit the choice of features, change how feature values are converted, and look for an optimized method.
• For the files that cannot be classified with the current features in particular, we need to investigate other features that classify them well.
FFRI,Inc. Contact Information E-Mail : research—firstname.lastname@example.org Twitter: @FFRI_Research 13