INTRODUCTION

    In the past few years, the malware industry has grown rapidly; syndicates invest heavily in technologies to evade traditional protection, forcing anti-malware groups and communities to build more robust software to detect and terminate these attacks. A major part of protecting a computer system from a malware attack is identifying whether a given file or piece of software is malware. Microsoft, which has been building anti-malware products for years, runs its anti-malware utilities on over 150 million computers around the world. This generates tens of millions of daily data points to be analyzed as potential malware.

    To analyze and classify such large numbers of files effectively, we need to group them and identify their respective families. The dataset provided by Microsoft covers 9 classes of malware.

About the given data:

->There are 2 kinds of files
    .asm
    .bytes

->The total train dataset is about 200GB of data,
out of which 50GB is .bytes files and 150GB is .asm files.

->There are 10868 .bytes files and 10868 .asm files, 21736 files in total

->There are 9 classes present in the given data
    1.Ramnit
    2.Lollipop
    3.Kelihos_ver3
    4.Vundo
    5.Simda
    6.Tracur
    7.Kelihos_ver1
    8.Obfuscator.ACY
    9.Gatak

Objective

->Classify which class the malware belongs to

In [1]:
import shutil
import os

source = 'train/'
dest1 = 'byteFiles'

files = os.listdir(source)
fi = os.listdir(dest1)
for f in files:
    if (f.endswith("bytes")):
        shutil.move(source+f,dest1)
The above code moves the .bytes files out of train/, separating them from the .asm files

Plot showing the distribution of malware classes

In [3]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

df=pd.read_csv("trainLabels.csv")
total = len(df)*1.
ax=sns.countplot(x="Class", data=df)
for p in ax.patches:
        ax.annotate('{:.1f}%'.format(100*p.get_height()/total), (p.get_x()+0.1, p.get_height()+5))

#put 11 ticks (therefore 10 steps), from 0 to the total number of rows in the dataframe
ax.yaxis.set_ticks(np.linspace(0, total, 11))

#adjust the ticklabel to the desired format, without changing the position of the ticks. 
_ = ax.set_yticklabels(map('{:.1f}%'.format, 100*ax.yaxis.get_majorticklocs()/total))
plt.show()

The above count plot shows that the dataset is imbalanced
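The imbalance can also be read off numerically with `value_counts`; here is a minimal sketch using a toy label frame in place of trainLabels.csv (the Ids and counts below are made up for illustration):

```python
import pandas as pd

# Toy stand-in for trainLabels.csv; the real file has 10868 rows.
df = pd.DataFrame({'Id': list('abcdefghij'),
                   'Class': [3, 3, 3, 3, 2, 2, 2, 1, 1, 5]})

# Per-class share in percent -- the same numbers the countplot annotates.
share = (df['Class'].value_counts(normalize=True) * 100).round(1)
print(share)
```

On the real labels, the same two lines reveal the dominant classes (e.g. Kelihos_ver3) and the rare ones (e.g. Simda) at a glance.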

Metadata of the files

In [4]:
import os
import seaborn as sns
import matplotlib.pyplot as plt
files=os.listdir('byteFiles')
filenames=df['Id'].tolist()
Class=df['Class'].tolist()
Class1=[]
sizebytes=[]
fnames=[]
for f in files:
    statinfo=os.stat('byteFiles/'+f)
    f=f[:-4]
    if any(f == filename for filename in filenames):
        i=filenames.index(f)
        Class1.append(Class[i])
        sizebytes.append(statinfo.st_size/(1024.0*1024.0))
        fnames.append(f)

df1=pd.DataFrame({'Id':fnames,'size':sizebytes,'Class':Class1})
print (df1.head())
x=df1['Class'].tolist()
y=df1['size'].tolist()
plt.scatter(x,y)
plt.show()
   Class                    Id      size
0      2  58FUPhL3HDB0xZ6VMfcG  5.402344
1      8  HtlDexkruVqzoW5aGpIv  0.363281
2      8  IfP1BbnUAq2upStjrwQ3  0.363281
3      3  DHqruMgb167fkLaiURX3  6.703125
4      3  Iv9Kd3LelrsxDHcYhnJp  8.941406
The above scatter plot shows the .bytes file sizes (in MB) per class
In [5]:
import os

files=os.listdir('train')
filenames1=df['Id'].tolist()
Class=df['Class'].tolist()
Class2=[]
sizeasm=[]
fnames=[]
for f in files:
    statinfo=os.stat('train/'+f)
    f=f[:-4]
    if any(f == filename for filename in filenames1):
        i=filenames1.index(f)
        Class2.append(Class[i])
        sizeasm.append(statinfo.st_size/(1024.0*1024.0))
        fnames.append(f)
    else:
        print(f)
#print len(Class2)
df2=pd.DataFrame({'Id':fnames,'size in MB':sizeasm,'Class':Class2})
print (df2.head())
x=df2['Class'].tolist()
y=df2['size in MB'].tolist()
plt.scatter(x,y)
plt.show()
   Class                    Id  size in MB
0      4  buBXk0VawKfpYLytPxFj   11.312403
1      2  j6gIME8eK4T1mn9PFqhb   36.868539
2      8  hb51AsQJ86yW3C4g2wGu    1.386490
3      3  hOtXLq9on7svPe4NTpAd    0.261862
4      3  1hToAGU8YgHQxfVONLqJ    0.163283
The above scatter plot shows the .asm file sizes (in MB) per class
In [43]:
ax = sns.boxplot(x="Class", y="size", data=df1)
plt.title("boxplot of .bytes file sizes")
plt.show()
In [44]:
ax = sns.boxplot(x="Class", y="size in MB", data=df2)
plt.title("boxplot of .asm file sizes")
plt.show()
The above plots show the per-class boxplots of the file sizes

Feature extraction from byte files

The given files contain raw data; I extracted features from both file types.
Feature extraction is one of the most important steps in this project.
Code for removing the address column from .bytes files
In [17]:
import os
files = os.listdir('byteFiles')
for f in files:
    if f.endswith("bytes"):
        f2 = f[:-6]
        # keep only the hex tokens, dropping the leading address column
        with open('byteFiles/' + f2 + ".txt", 'w+') as f1, open('byteFiles/' + f, "r") as fli:
            for line in fli:
                f1.write(' '.join(line.rstrip().split(" ")[1:]) + "\n")
        os.remove('byteFiles/' + f)
Code for feature extraction from .bytes files
The result is stored in a csv file for easy access
In [72]:
import os
import numpy as np
files = os.listdir('byteFiles')
filenames2=[]
matrix = np.zeros((len(files),257),dtype=int)
k=-1
count=0
f1=open('result.csv','w+')
for f in files:
    if(count>=8422):  # resume: the first 8422 files were already processed in an earlier run
        k+=1
        filenames2.append(f)
        f1.write(f+" ")
        if(f.endswith("txt")):
            with open('byteFiles/'+f,"r") as fli:
                for lines in fli:
                    line=lines.rstrip().split(" ")
                    for a in line:
                        if a=='??':
                            matrix[k][256]+=1
                        else:
                            x=int(a,16)
                            matrix[k][x]+=1
        for i in matrix[k]:
            f1.write(str(i)+" ")
        f1.write("\n")
    count+=1
f1.close()
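The cell above is, in essence, a bag-of-bytes count: 256 hex values plus a '??' wildcard bucket per file. A minimal, self-contained sketch of the same idea with collections.Counter (the sample line below is made up; real lines come from the cleaned .bytes files):

```python
from collections import Counter

def byte_unigrams(lines):
    """Count the 256 byte values plus the '??' wildcard across lines."""
    counts = Counter()
    for line in lines:
        for tok in line.rstrip().split():
            if tok == '??':
                counts[256] += 1           # wildcard bucket, column 257
            else:
                counts[int(tok, 16)] += 1  # hex token -> 0..255
    return counts

sample = ["8B FF 55 8B ?? 00"]  # toy line, address column already removed
c = byte_unigrams(sample)
```

Each file then becomes one 257-dimensional count row, which is exactly the layout of result.csv.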
In [6]:
import pandas as pd
df3=pd.read_csv("result.csv")
print(df3.head())
                     ID       0     1     2     3     4     5     6     7  \
0  01azqd4InC7m9JpocGv5  601905  3905  2816  3832  3345  3242  3650  3201   
1  01IsoiSMh5gxyDYTl4CB   39755  8337  7249  7186  8663  6844  8420  7589   
2  01jsnpXSAlgw6aPeDxrU   93506  9542  2568  2438  8925  9330  9007  2342   
3  01kcPWA9K2BOxQeS5Rju   21091  1213   726   817  1257   625   550   523   
4  01SuzwMJEIXsK7A8dQbl   19764   710   302   433   559   410   262   249   

      8  ...      f7    f8    f9    fa    fb    fc    fd     fe     ff     ??  
0  2965  ...    2804  3687  3101  3211  3097  2758  3099   2759   5753   1824  
1  9291  ...     451  6536   439   281   302  7639   518  17001  54902   8588  
2  9107  ...    2325  2358  2242  2885  2863  2471  2786   2680  49144    468  
3  1078  ...     478   873   485   462   516  1133   471    761   7998  13940  
4   422  ...     847   947   350   209   239   653   221    242   2199   9008  

[5 rows x 258 columns]
Adding the class label for each file
In [7]:
Id3=df3['ID'].tolist()
ID=df['Id'].tolist()
Class=df['Class'].tolist()
ClassM=[]
s=[]
count=0
for f in Id3:
    if any(f == filename for filename in ID):
        i=ID.index(f)
        ClassM.append(Class[i])
        count+=1
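A note on the join above: `ID.index(f)` is a linear scan per file, so the loop is O(n^2) overall. A dict built once gives the same labels in O(1) per lookup; here is a sketch with toy Ids (a pandas merge would work equally well):

```python
import pandas as pd

# Toy stand-ins for the label frame (df) and the feature frame (df3).
labels = pd.DataFrame({'Id': ['aaa', 'bbb', 'ccc'], 'Class': [2, 8, 3]})
features = pd.DataFrame({'ID': ['ccc', 'aaa'], 'f0': [10, 20]})

# Build the Id -> Class mapping once, then map in O(1) per row.
class_of = dict(zip(labels['Id'], labels['Class']))
features['Class'] = features['ID'].map(class_of)
```

With ~10868 files this makes no practical difference, but on larger datasets the linear-scan version becomes the bottleneck.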
Normalising Data
In [8]:
from sklearn.manifold import TSNE
from sklearn import preprocessing
df4=df3.drop('ID',axis=1)
normalized_df=(df4-df4.min())/(df4.max()-df4.min())
normalized_df['Class']=ClassM
print(normalized_df.head())
          0         1         2         3         4         5         6  \
0  0.262806  0.005498  0.001567  0.002067  0.002048  0.001835  0.002058   
1  0.017358  0.011737  0.004033  0.003876  0.005303  0.003873  0.004747   
2  0.040827  0.013434  0.001429  0.001315  0.005464  0.005280  0.005078   
3  0.009209  0.001708  0.000404  0.000441  0.000770  0.000354  0.000310   
4  0.008629  0.001000  0.000168  0.000234  0.000342  0.000232  0.000148   

          7         8         9  ...          f8        f9        fa  \
0  0.002946  0.002638  0.003531  ...    0.019969  0.013560  0.013107   
1  0.006984  0.008267  0.000394  ...    0.035399  0.001920  0.001147   
2  0.002155  0.008104  0.002707  ...    0.012771  0.009804  0.011777   
3  0.000481  0.000959  0.000521  ...    0.004728  0.002121  0.001886   
4  0.000229  0.000376  0.000246  ...    0.005129  0.001530  0.000853   

         fb        fc        fd        fe        ff        ??  Class  
0  0.013634  0.031724  0.014549  0.014348  0.007843  0.000129      9  
1  0.001329  0.087867  0.002432  0.088411  0.074851  0.000606      2  
2  0.012604  0.028423  0.013080  0.013937  0.067001  0.000033      9  
3  0.002272  0.013032  0.002211  0.003957  0.010904  0.000984      1  
4  0.001052  0.007511  0.001038  0.001258  0.002998  0.000636      8  

[5 rows x 258 columns]
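Min-max scaling as used above maps every column to [0, 1] via (x - min) / (max - min); a toy sketch of the same expression:

```python
import pandas as pd

# Two toy columns with different ranges.
df4 = pd.DataFrame({'a': [0, 5, 10], 'b': [2, 4, 6]})

# Column-wise min-max normalisation, identical to the cell above.
normalized = (df4 - df4.min()) / (df4.max() - df4.min())
```

This keeps the byte-count columns comparable in magnitude, which matters for distance-based methods like t-SNE below.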
In [9]:
from sklearn.model_selection import train_test_split
datanormalised=(df4-df4.min())/(df4.max()-df4.min())
X_train, X_test, y_train, y_test = train_test_split(datanormalised, ClassM,stratify=ClassM,test_size=0.30)
In [52]:
xtsne=TSNE(perplexity=50)
results=xtsne.fit_transform(X_train)
vis_x = results[:, 0]
vis_y = results[:, 1]
plt.scatter(vis_x, vis_y, c=y_train, cmap=plt.cm.get_cmap("jet", 9))
plt.colorbar(ticks=range(10))
plt.clim(0.5, 9)
plt.show()

TSNE with perplexity 50

In [54]:
xtsne=TSNE(perplexity=30)
results=xtsne.fit_transform(X_train)
vis_x = results[:, 0]
vis_y = results[:, 1]
plt.scatter(vis_x, vis_y, c=y_train, cmap=plt.cm.get_cmap("jet", 9))
plt.colorbar(ticks=range(10))
plt.clim(0.5, 9)
plt.show()
TSNE with perplexity 30

Feature extraction of ASM files

There are 10868 .asm files
All the files together amount to about 150 GB
The asm files contain
Address
Segments
Opcodes
Registers
Function calls
APIs
With the help of parallel processing I extracted all the features; multiprocessing lets us use all the cores present in the computer.


Here I took only 52 features from the asm files, the ones that matter most.

I read the top solutions and hand-picked the features from those papers

Feature extraction from asm files

The code below extracts segment, opcode, register, and keyword counts from the asm files, and also records the opcode sequences used later for n-gram features

In [ ]:
from multiprocessing import Process
import multiprocessing
import codecs
import os
import numpy as np

# features counted in every asm file
prefixes = ['HEADER:','.text:','.Pav:','.idata:','.data:','.bss:','.rdata:','.edata:','.rsrc:','.tls:','.reloc:','.BSS:','.CODE']
opcodes = ['jmp', 'mov', 'retf', 'push', 'pop', 'xor', 'retn', 'nop', 'sub', 'inc', 'dec', 'add','imul', 'xchg', 'or', 'shr', 'cmp', 'call', 'shl', 'ror', 'rol', 'jnb','jz','rtn','lea','movzx']
keywords = ['.dll','std::',':dword']
registers = ['edx','esi','eax','ebx','ecx','edi','ebp','esp','eip']

def extract_asm_features(sourcedir, countsfile, opcodesfile, bigramsset, trigramsset, fourgramsset):
    """Count segment prefixes, opcodes, registers and keywords for every file
    in sourcedir, and record the opcode bi/tri/four-grams seen."""
    file1 = open(countsfile, "w+")
    opcodefile = open(opcodesfile, "w+")
    for f in os.listdir(sourcedir):
        prefixescount = np.zeros(len(prefixes), dtype=int)
        opcodescount = np.zeros(len(opcodes), dtype=int)
        keywordcount = np.zeros(len(keywords), dtype=int)
        registerscount = np.zeros(len(registers), dtype=int)
        features = []
        f2 = f[:-4]
        file1.write(f2 + ",")
        opcodefile.write(f2 + " ")
        with codecs.open(sourcedir + '/' + f, encoding='cp1252', errors='replace') as fli:
            for lines in fli:
                line = lines.rstrip().split()
                if not line:
                    continue
                l = line[0]  # segment prefix, e.g. '.text:00401000'
                for i in range(len(prefixes)):
                    if prefixes[i] in l:
                        prefixescount[i] += 1
                line = line[1:]
                for i in range(len(opcodes)):
                    if any(opcodes[i] == li for li in line):
                        features.append(opcodes[i])
                        opcodescount[i] += 1
                for i in range(len(registers)):
                    for li in line:
                        if registers[i] in li and ('text' in l or 'CODE' in l):
                            registerscount[i] += 1
                for i in range(len(keywords)):
                    for li in line:
                        if keywords[i] in li:
                            keywordcount[i] += 1
        for c in np.concatenate([prefixescount, opcodescount, registerscount, keywordcount]):
            file1.write(str(c) + ",")
        file1.write("\n")
        for feature in features:
            opcodefile.write(str(feature) + " ")
        opcodefile.write("\n")
        # record opcode n-grams in the shared dicts (used as sets)
        for i in range(len(features)):
            if i < len(features) - 1:
                bigramsset[features[i] + features[i+1]] = 1
            if i < len(features) - 2:
                trigramsset[features[i] + features[i+1] + features[i+2]] = 1
            if i < len(features) - 3:
                fourgramsset[features[i] + features[i+1] + features[i+2] + features[i+3]] = 1
    file1.close()
    opcodefile.close()

def main():
    manager = multiprocessing.Manager()
    # one process per size bucket; each gets its own shared n-gram dicts
    configs = [
        ('smalldatasize',  r'c:\output\asmsmallfile.txt',  r'c:\output\opcodesmallfeatures.txt'),
        ('mediumdatasize', r'c:\output\mediumasmfile.txt', r'c:\output\opcodemediumfeatures.txt'),
        ('largedatasize',  r'c:\output\largeasmfile.txt',  r'c:\output\opcodelargefeatures.txt'),
        ('hugedatasize',   r'c:\output\hugeasmfile.txt',   r'c:\output\opcodehugefeatures.txt'),
        ('train',          r'c:\output\trainasmfile.txt',  r'c:\output\opcodetrainfeatures.txt'),
    ]
    jobs, grams = [], []
    for sourcedir, countsfile, opcodesfile in configs:
        bi, tri, four = manager.dict(), manager.dict(), manager.dict()
        grams.append((bi, tri, four))
        p = Process(target=extract_asm_features, args=(sourcedir, countsfile, opcodesfile, bi, tri, four))
        jobs.append(p)
        p.start()
    for p in jobs:
        p.join()
    allbi, alltri, allfour = set(), set(), set()
    for bi, tri, four in grams:
        print(len(bi.keys()), len(tri.keys()), len(four.keys()))
        allbi.update(bi.keys())
        alltri.update(tri.keys())
        allfour.update(four.keys())
    print(len(allbi))
    print(len(alltri))
    print(len(allfour))


if __name__=="__main__":
    main()

I separated the files by size into different folders so that the parallel processes finish in roughly the same time; this makes the extraction much faster.
If you are running on a single machine, this code is still useful for extracting the selected features
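The split into small/medium/large/huge folders can be sketched as below; the cut-off sizes here are assumptions for illustration, not the exact ones used above (pick them so each bucket carries a comparable total number of bytes):

```python
import os
import shutil

def bucket_for(size_mb):
    # Hypothetical cut-offs; tune to balance the per-process workload.
    if size_mb < 5:
        return 'smalldatasize'
    if size_mb < 20:
        return 'mediumdatasize'
    if size_mb < 50:
        return 'largedatasize'
    return 'hugedatasize'

def partition_by_size(source='train/'):
    """Move each file into the folder matching its size bucket."""
    for f in os.listdir(source):
        size_mb = os.stat(os.path.join(source, f)).st_size / (1024.0 * 1024.0)
        dest = bucket_for(size_mb)
        os.makedirs(dest, exist_ok=True)
        shutil.move(os.path.join(source, f), dest)
```

Balancing the buckets matters because the slowest process determines the total wall-clock time of the join.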

In [10]:
import pandas as pd
dfasm=pd.read_csv("asmoutputfile.csv")
print(dfasm.head())
                     ID  HEADER:  .text:  .Pav:  .idata:  .data:  .bss:  \
0  01kcPWA9K2BOxQeS5Rju       19     744      0      127      57      0   
1  1E93CpP60RHFNiT5Qfvn       17     838      0      103      49      0   
2  3ekVow2ajZHbTnBcsDfX       17     427      0       50      43      0   
3  3X2nY7iQaPBIWDrAZqJe       17     227      0       43      19      0   
4  46OZzdsSKDCFV8h7XWxf       17     402      0       59     170      0   

   .rdata:  .edata:  .rsrc: ...   :dword  edx  esi  eax  ebx  ecx  edi  ebp  \
0      323        0       3 ...      137   18   66   15   43   83    0   17   
1        0        0       3 ...      130   18   29   48   82   12    0   14   
2      145        0       3 ...       84   13   42   10   67   14    0   11   
3        0        0       3 ...       25    6    8   14    7    2    0    8   
4        0        0       3 ...       18   12    9   18   29    5    0   11   

   esp  eip  
0   48   29  
1    0   20  
2    0    9  
3    0    6  
4    0   11  

[5 rows x 52 columns]
Adding class labels to the files
In [11]:
Idasm=dfasm['ID'].tolist()
ID=df2['Id'].tolist()
Class=df2['Class'].tolist()
si=df2['size in MB'].tolist()
sizeasm=[]
Classasm=[]
count=0
for f in Idasm:
    if any(f == filename for filename in ID):
        i=ID.index(f)
        Classasm.append(Class[i])
        sizeasm.append(si[i])
        count+=1
dfasm['size']=sizeasm
dfm=dfasm
dfm["Class"]=Classasm
In [14]:
asmtsne=dfasm.drop('ID',axis=1)
normalized_dfasm=(asmtsne-asmtsne.min())/(asmtsne.max()-asmtsne.min())
from sklearn.manifold import TSNE
xtsne=TSNE(perplexity=50)
results=xtsne.fit_transform(normalized_dfasm)  # use the normalised frame computed above
vis_x = results[:, 0]
vis_y = results[:, 1]
TSNE for asm data with perplexity 50
In [15]:
plt.scatter(vis_x, vis_y, c=Classasm, cmap=plt.cm.get_cmap("jet", 9))
plt.colorbar(ticks=range(9))
plt.clim(0.5, 9)
plt.show()
In [16]:
y=dfasm['.text:'].tolist()
x=Classasm
ax=plt.scatter(x,y)
plt.ylabel('.text:', fontsize=16)
plt.xlabel('Class',fontsize=16)
plt.title("plot between Class labels and .text segment")
plt.show()
In [51]:
ax = sns.boxplot(x="Class", y=".text:", data=dfm)
plt.title("boxplot of .asm text segment")
plt.show()
The plot is between the .text segment count and the class 
Classes 1, 2 and 9 can be easily separated
In [17]:
y=dfasm['.Pav:'].tolist()
x=Classasm
ax=plt.scatter(x,y)
plt.ylabel('.Pav:', fontsize=16)
plt.xlabel('Class',fontsize=16)
plt.title("plot between Class labels and .Pav segment")
plt.show()
In [52]:
ax = sns.boxplot(x="Class", y=".Pav:", data=dfm)
plt.title("boxplot of .asm pav segment")
plt.show()
Observation
Only a few class-7 files contain the ".Pav" segment
If the Pav segment is present, we can classify the file as class 7 to some extent 
In [18]:
y=dfasm['.data:'].tolist()
x=Classasm
ax=plt.scatter(x,y)
plt.ylabel('.data:', fontsize=16)
plt.xlabel('class',fontsize=16)
plt.title("plot between Class labels and .data segment")
plt.show()
In [53]:
ax = sns.boxplot(x="Class", y=".data:", data=dfm)
plt.title("boxplot of .asm data segment")
plt.show()