Transcription

Huffman Algorithm for Antivirus QuarantineSonny Lazuardi Hermawan 13511029Program Studi Teknik InformatikaSekolah Teknik Elektro dan InformatikaInstitut Teknologi Bandung, Jl. Ganesha 10 Bandung 40132, gmail.comAbstract— This paper is about using Huffman Algorithmfor Antivirus to secure malware from computer users. Sincesystem of software is becoming more and more sophisticated,the malware will try to innovate to keep their existence in thesystem. We need to secure the infected file so that user wouldnot be in danger. Quarantine is one solution to secure theinfected file. The best way of antivirus to do quarantine is byencrypting the infected file. By using Huffman algorithm inencrypting the infected file, we will get some benefits such asdata compression, safe encryption for the infected file, and asmall running time.Index Terms— Huffman, algorithm, malware, quarantine,compression, encryption.but malware with a zero population growth may selfreplicate.3. Parasitic malware requires some other executablecode in order to exist. "Executable" in this contextshould be taken very broadly to include anythingthat can be executed, such as boot block code on adisk, binary code in applications, and interpretedcode. It also includes source code, like applicationscripting languages, and code that may requirecompilation before being executed.Based on the characteristics, there are some types ofviruses that may harm your computer and your data. Thisis the statistics for malware spreading based on the types.I. INTRODUCTIONA. Computer and MalwareComputer has becoming a part of human life. Actually,computer has taken role to help the people to get theirneeds. Computerized inventory system manages thesupply chains for food we bought. Computer-controlledwater system dispenses the water. Computer managesfinancial transactions to pay for it all. Students need acomputer for learning and doing their task. Those thingsprove that computers is holding main role in society‟sinfrastructure. Have you ever imagined when computerstop working and losing data? There will be a big loss if ithappens. As the computer system grows, malware willnever stop infecting computer system. They always comewith new way to spread themselves and avoiding theantivirus detection. There are three characteristics ofmalware [2].1. Self-replicating malware actively attempts topropagate by creating new copies, or instances, ofitself. Malware may also be propagated passively,by a user copying it accidentally, for example, butthis isn't self-replication.2. The population growth of malware describes theoverall change in the number of malware instancesdue to self-replication. Malware that doesn't selfreplicate will always have a zero population growth,Makalah IF2091 Struktur Diskrit – Sem. I Tahun 2011/2012http://en.wikipedia.org/wiki/File:Malware statics 2011-03-16-en.svgFig. 1 Malware spreading by typesSince system of software is becoming more and moresophisticated, the malware will try to innovate to keeptheir existence in the system. The traditional method ofantivirus that uses virus definition database for scanningviruses would have taken a lot of time to be updated.Antivirus now has already had some heuristics method toanticipate the threat of malware. Even they havesucceeded to suspect a virus; they have to ensure that thefile is really a malware and analyze the file so that thevirus file can be cleaned later. The problem is how tosecure the file that may not be a malware or may be asystem file so that it can be restored later? The answer isquarantine.

II. QUARANTINEAfter finding a suspected virus file, antivirus softwareneeds to quarantine the infected file. Quarantine is only atemporary measure, and may only be done until the userdecides how to handle the file (e.g., giving approval todisinfect it) [2]. In the other words, Antivirus softwaremay have generically detected a virus, but have no ideahow to clean it. Quarantine will be used until the antivirussoftware has been updated and can deal with the virus thatwas discovered.Actually, Quarantine is a simple technique to secure thesuspected file by copying the file to the “quarantine”directory, removing original infected file, and disallowinguser to access the infected file. But then come theproblem, user may have changed the permission to theinfected file and they can be executed even in aquarantine. That means it is not solving the dangerous file,it just move the danger from one place to another. Theuser may have the virus back if they are not aware of whatthe quarantine is. So, there come some solutions to securethe infected file.antivirus. The best way to do this is in solution number 3.We can secure the infected file by encrypting the fileitself. But is that XOR encryption is really the bestsolution? Can we use other encoding method?Fig. 2 Windows Defender QuarantineIII. HUFFMAN ALGORITHM1.Renaming the infected fileThe simplest solution to this problem is by renamingthe infected file, so it can‟t be directly executed. Bychanging the extension name of a file, it will not beexecuted as an executable (.exe). The weakness of thissolution is that the virus body is still there. Once theinfected file renamed back or executed by other program,the virus would run again in the system. They just have alittle effort to be alive again.2.Render files in Invisible directoryAnother solution is to render the files in the quarantinedirectory invisible - what can't be seen can't be copied.Antivirus software can accomplish this feat using filehiding techniques like stealth viruses and rootkits use.However, this may not be the best idea, as viruses maythen try to hide in the quarantine directory, letting theanti-virus software cloak their presence. There could alsobe issues with false positives produced by virus-likebehavior from anti-virus software [2].3.Encrypt infected fileOne solution is to encrypt quarantined files by sometrivial means, like an XOR with a constant. The virus isthereby rendered inert, because an executable fileencrypted this way will no longer be runnable, andcopying the file does no harm. Also, an encrypted,quarantined file is readily accessible for disinfection [2].Based on three solutions above, we can conclude thatwe must secure the infected file so that it can‟t beaccessed by user and can be easily cleaned after updatingMakalah IF2091 Struktur Diskrit – Sem. I Tahun 2011/2012In data storage, file with big size will take a lot of spaceof storage that we have. This problem can be solved withthe encoding of file content in concise way, so that thespace needed become smaller. This encoding method isalso called data compression. Data compression is doneby encoding each character in file content with shortercode [3].Huffman algorithm is a very popular way to representdata with minimum memory needed to store the data.Huffman algorithm was created to reduce data redundancyin a file. Huffman algorithm was developed by David A.Huffman while he was a Ph.D., student at MIT, andpublished in the 1952 paper “A Method for theConstruction of Minimum-Redundancy Codes” [5].A. Huffman EncodingCoding system which is widely used is ASCIIencoding. With ASCII, each character will be coded in 8bit binary. These are some example of ASCII B01001011K841245401010100T851255501010101UTable I ASCII tableBased on the encoding rule above, string„KAKAKTUA‟ can be represented by this sequence ofbit:

01010101000001From the ASCII encoding, 8 characters need 8 x 8 64bit (8 byte). To minimize the sum of bit, length of codefor each character must be shorten as short as possible,especially for the character which has the biggestfrequency. This thought is the basic of the Huffmanencoding. These are the steps to get the Huffman code.1. Count the frequency of occurrence of eachcharacter in the string;2. Choose two symbols that has the fewest probability3. Make a Huffman tree from the symbol from bottomup using greedy algorithm;4. Determine the Huffman code for each symbolbased on the convention bit;5. Convert the string into new bit representation fromthe Huffman code for each symbol.The first thing to get Huffman code is making theprobability table of each symbol. The process of makingHuffman code is by making binary tree which is calledHuffman tree.Fig. 3 Huffman tree for strings ‘KAKAKTUA’Leaf in the Huffman tree represented the symbol whichis used in the strings. Each code for symbol give label 0for the left branch and label 1 for the right branch. Aftermaking the path from root to leaf, we can get the code foreach symbol. From the Huffman tree above, we get thesecodes for each nCodeA33/8K01T11/8A1U11/8T000U001Table II Probability Table and the Huffman Code forstring ‘KAKAKTUA’Choose two symbols with the lowest probability. In theexample above we choose T and U). Those two symbolsis then combined to a node of symbol TU for the parent ofsymbol T and U with probability of 1/8 1/8 2/8. Thisnew symbol is treated as new node and included for thesearching of the next symbol which has lowestprobability.Choose the next two symbols (the new symbolincluded) with the lowest probability. In this example thenext two symbols are TU (2/8) and K (3/8). Do the samethings as the step before. The result is the new symbolTUK with the probability of 2/8 3/8 5/8.Do the same procedure for the next two symbols whichhave the lowest probability. The next two symbols are A(3/8) and TUK (5/8). The combination of them is ATUKwith the probability of 3/8 5/8 8/8.Makalah IF2091 Struktur Diskrit – Sem. I Tahun 2011/2012Table III Huffman code for each symbolWith the Huffman coding, the string „KAKAKTUA‟can be represented with these sets of bit:011011010000011The symbols that often exist are represented withshorter code than the other symbols. Code for everysymbol can‟t be the prefix of the other code because itwill cause the ambiguity in the decoding process. Thesymbols that often exist have the code with the fewest sumof bit. There is no symbol that has the prefix of othersymbols. Huffman code is not unique, this means code foreach characters is different on every strings depends onthe frequency of the character itself. Besides, the decisionwhether the node in the Huffman tree is placed left orright will also determine the final code but it‟s notinfluenced the length of code.With this Huffman coding, the sum of bits to make thestring „KAKAKTUA‟ is only 15 bit. If we compare withthe ASCII code, we get the difference of 64 – 15 49 bit.

In this case, Huffman code save 49/64 * 100% 76.56%of space. Huffman code typically saves space from 20%up to 90%.B. Huffman DecodingThe preceding method is used to make Huffman code.How to read the Huffman code and getting the stringsback? To recognize the information from Huffman code,we need to reverse the process of encoding which is oftencalled decoding. The reverse process, i.e. to reconstructthe sequence of symbols in the source, is called decoding[4]. Huffman code is also a compression algorithm, so thedecoding is just like decompression. The decompressionalgorithm involves the operations where the codeword fora symbol is obtained by „walking; down from the root ofthe Huffman tree to the leaf for each symbol.Take the previous example for the decoding process.We want to decode 011011010000011 to the originalstring using the Huffman tree.(a) 011011010000011(b) 011011010000011Fig. 4 Decoding string from Huffman treeFigure 4 shows the first two steps of decoding thestring. It shows symbol K and A. The decoder reads 0s or1s bit by bit. Current bit is highlighted (yellow). In step(a), starting from the root of the Huffman tree, we movealong the left branch one edge down to the left child sincea bit 0 is read. Then, we move along the right branch tothe right child since a bit 1 is read. When we reach a leaf,the symbol K at the leaf is output. This process starts fromthe root again until it reaches the leaf. In step (b) it willmove to the right child since a bit 1 is read and also itreaches the leaf. Then, the process begins from the rootagain. So, we can conclude the decoding process in thisthree points (Algorithm) [4].1. Read the coded string bit by bit. Starting from theroot, then traverse one edge down the tree to achild according to the bit value. If the current bitread 0 we move to the left child, otherwise, to theright child.2. Repeat this process until we reach a leaf. If wereach a leaf, we will decode one character andrestart the traversal from the root.3. Repeat this read-and-move procedure until the endof the message.Makalah IF2091 Struktur Diskrit – Sem. I Tahun 2011/2012IV. HUFFMAN FOR EXECUTABLE FILESMany people say that Huffman algorithm is the best forthe text file compression. Huffman is not good for binaryfile like executable files because they have their owncompression method. This is not a wrong statement, butwe will try how Huffman code works for binary file likeexecutable file.A. Lossless compressionIf we are talking about executable file, we are dealingwith sensitive data which means every single bit in the filecan‟t be removed. If it loses some data, it may not beexecuted anymore. Lossless data compression is a class ofdata compression algorithms that allows the exact originaldata to be reconstructed from the compressed data. Thelossless compression only allows constructing anapproximation of the original data, in exchange for bettercompression rates. The process of Huffman losslesscompression on executable is like text file Huffmanencoding. The different is we must split each binary codefrom the executable file as a character and encode it withHuffman method.For the comparison of Huffman compression with othercompressor, we see Zhuff compression tool that usesHuffman algorithm. Zhuff is compression softwaredesigned for speed (especially decompression speed). It isbased around LZSS & Huffman mechanisms [7].CompressorCompressed 834.1b3Pos.167 WINZIP 8.01732476203 Zhuff Table IV Comparison of Huffman compression forexecutable file with other compressorThe data above comes from the compression of anexecutable file Acrobat Reader 5.0 executable(acrord32.exe) with total file size of 3870784 bytes. If wecompare Huffman compression (Zhuff), it is below theother popular compressor such as WinRAR and WinZip.But, if we see the compression ratio, it‟s not very bad,Huffman can save 49.33% spaces for executable file. Itproves that Huffman can be used for executable file thatwe will use it for quarantining an executable virus file.V. HUFFMAN FOR QUARANTINENow, we are trying to decode the virus file into theHuffman code. In this case, we will not trying to detect a

virus file, but we assume that the file is a positive virusfile. Then, we will do quarantine for the virus file. TheHuffman code will be used for quarantine of antivirus.A. Virus ExperimentThe experiment needed to prove that Huffman can bedone in quarantining a virus. For example, we will try toquarantine a Worm.Win32.VB.kz (Winamps.exe). It is anIndonesian worm which has popular name of„Amburadul‟. This worm can spread and duplicate itself tothe system and make self-defense by auto running at thestartup.Firstly, we will use Zhuff compression tool to getinformation of how much Huffman works for thisexecutable virus file.Fig. 5 Compression of Worm.Win32.VB.kz fileFig. 6 Huffman Encoding process of the virus fileThe figure above shows the result of the Huffmanencoding of the virus file (Winamps.exe.bak). Thecompression ratio is 34% from 64KB to 21KB. This pureHuffman encoding can save 66% spaces. It meansHuffman encoding could be done for quarantining viruses.The next procedure is decoding the file. In the decodingprocess, it will identify whether the file is compressedusing Huffman encoding or not. Then it will createHuffman tree and begin the decoding process using theHuffman tree.From the figure above, we can see the Huffmancompression for this executable virus file. The actual sizeof virus file is 63KB can be compressed into 12KB byusing Huffman compression. It can save 80.7% spaces.The time needed for the encoding is 0.02 seconds in speedof 3.8MB/s. But, Zhuff is not a pure Huffman algorithm.Let‟s check about the compression of pure Huffmanencoding.B. Huffman for Virus FileWe can get the source of Huffman encoding anddecoding from planet-source-code.com by FredrikQvarfort written in VB (Visual Basic) [8]. This is the pureHuffman encoding for the file. We can use this source forour antivirus with the credits to Fredrik. This Huffmanencoding process is the same with the process of Huffmanencoding explained above. The encoded file will haveprefix of “HE3”, and the Huffman tree will be saved inthe first part of the file. If the destination file is larger thanthe source file, it will leave uncompressed and will give“HE0” prefix. Meanwhile, the decoding process includesthe identification of the file, extracting Huffman tree, anddecode the actual data. The identification process checksthe prefix of file whether it is “HE3” or not. If it is “HE0”,this will return the uncompressed data. Then, it will createa Huffman tree and decode the data from the Huffmantree. The usage of the Huffman encoding and decoding isby filling the parameter for original file path andcompressed/decompressed file path.Makalah IF2091 Struktur Diskrit – Sem. I Tahun 2011/2012Fig. 7 Huffman Decoding process of the virus fileThe decoding result is the same with the original file.We can prove by the size and the content of the decodedfile and the original file. The time needed for decoding isfaster than the encoding process that is 0.014s. It meansthat Huffman compression succeeds to convert thequarantined file back to the original file without losingany data (lossless compression).C. Huffman Coding BenefitsBased on the experiment above, we can get threeimportant factors which is the benefits from Huffmanencoding for quarantine of antivirus.

1.EncryptionVI. CONCLUSIONThe basic function of quarantine is to secure theexecutable virus file to be executed by the user. One ofthe ways to do this is by using encryption for theexecutable file. The encryption will prevent the user bydirectly access the original virus file. From the experimentof virus quarantine above, we get the encrypted virus fileafter encoding it using Huffman coding. Two figuresbelow show the original (after decoding) file and theencrypted file (after encoding).Huffman algorithm can be used for quarantine inantivirus because of these three things.1. Huffman algorithm can be used for encrypting thevirus file so that the file cannot be directlyexecuted by user.2. Huffman algorithm can reduce the virus file size sothat it can save more spaces in the user‟s disk.3. Huffman algorithm has a fairly efficient runningtime so that it can quarantine some viruses in alittle time.VII. ACKNOWLEDGMENTI want to say thanks for lecturer of Discrete Structure,Mrs. Harlili and Dr. Ir. Rinaldi for the patience to teach. Ialso thanked my parents for the support to me. Thanks toMr. Hirin for sharing a lot about virus and antivirus withme. Thanks to Fredrik Qvarfort for the Huffman encodingand decoding in Visual Basic source code.REFERENCES[1][2][3][4][5][6][7][8](a) Huffman Encoded File(b) Huffman Decoded FileFig. 8 Encryption by Huffman Coding2.SizeViruses have various file size. There are some virusesthat have a big file size. When we want to do quarantinefor that files, we will lose a lot of space just for holdingthe dangerous file in our disk. It means that we needcompression for quarantined virus file so that it doesn‟teat a lot of space in our disk. From the experiment above,we get 66% saved space (compression ratio 34%). Itmeans we can save more spaces after doing quarantineand use it for other purposes.3.SpeedDoing encryption and compression need some times.This Huffman coding, based on the experiment above,shows a relatively small time to encode and decode( 0.01s). The bigger the file being compressed, the slowerthe compression will be. But, Huffman algorithm have afairly efficient running time, because it has complexity ofO(n log n).Makalah IF2091 Struktur Diskrit – Sem. I Tahun 2011/2012C. W. Ko. Method and apparatus for detecting a macro computervirus using static analysis. United States Patent. February 2004.J. Aycock, Computer Viruses and Malware, Springer 2006.R. Munir. Diktat Kuliah IF 2091 Struktur Diskrit 4th ed. ProgramStudi Teknik Informatika STEI ITB. 2008.M. I. Pu. Fundamental Data Compression. Elsevier d at 16 December 2012 pretrieved at 17 December 2012 tmlretrieved at 17 December 2012 howCode.asp?txtCodeId 11000&lngWId 1retrieved at 17 December 2012 08:00DECLARATIONI hereby declare that the paper is my own writing, not anadaptation, nor translation from another person‟s paper,and not a form of plagiarism.Bandung, 18 Desember 2012Sonny Lazuardi Hermawan13511029

Makalah IF2091 Struktur Diskrit – Sem. I Tahun 2011/2012 II. QUARANTINE After finding a suspected virus file, antivirus software needs to quarantine the infected file. Quarantine is only a temporary measure, and may only be done until the user decides how to handle the file (e.g., giving approval to disinfect it) [2].