Quick Glance At Lazarus Group MATA APT On Linux (PART1) Using Ghidra

Here is my first attempt at analyzing MATA, a multi-platform malware framework attributed to the Lazarus Group APT. This particular sample targets Linux. We start with basic static analysis by running the file utility on the binary. The output shows an ELF binary that is statically linked, stripped of symbols and compiled for x86_64.

A quick look at the ELF header with the readelf utility confirms this: the type field tells us the binary is an executable (ELF type ET_EXEC).
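The same header fields can also be read programmatically. The following is a minimal sketch, assuming a 64-bit ELF file; the function name parse_elf_header and the small lookup tables are my own for illustration and cover only a few common values.

```python
import struct

# Small subsets of the values defined by the ELF specification.
ELF_TYPES = {1: "ET_REL", 2: "ET_EXEC", 3: "ET_DYN", 4: "ET_CORE"}
ELF_MACHINES = {3: "EM_386", 62: "EM_X86_64", 183: "EM_AARCH64"}


def parse_elf_header(data):
    """Return type, machine and entry point from the first 64 bytes of an ELF64 file."""
    if len(data) < 64 or data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file (or header truncated)")
    if data[4] != 2:  # EI_CLASS: 1 = 32-bit, 2 = 64-bit
        raise ValueError("this sketch only handles ELF64")
    endian = "<" if data[5] == 1 else ">"  # EI_DATA: 1 = little endian
    # e_type (2 bytes), e_machine (2), e_version (4), e_entry (8) follow e_ident[16].
    e_type, e_machine, _e_version, e_entry = struct.unpack_from(endian + "HHIQ", data, 16)
    return {
        "type": ELF_TYPES.get(e_type, hex(e_type)),
        "machine": ELF_MACHINES.get(e_machine, hex(e_machine)),
        "entry": e_entry,
    }
```

Feeding it the first 64 bytes of the sample, read in binary mode, should report the same ET_EXEC type and x86_64 machine that file and readelf show.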

Loading the binary into the Ghidra decompiler/disassembler, we see that there are no symbols, because the binary was stripped, as we found out in the previous step. This is particularly problematic: it will be difficult to distinguish between code written by the malware authors and library code pulled in at link time, which increases the time and difficulty of statically analyzing the sample. Luckily, Ghidra is capable of matching functions in the binary against a dataset of functions from frequently used libraries (libc, for example, is the standard library used by C programs on Unix platforms). With this approach, a large portion of the functions in the binary can be identified.
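To give a feel for the idea behind this kind of matching, here is a toy sketch. It is not Ghidra's actual Function ID algorithm: it simply hashes the raw bytes of each function and looks the hash up in a table built from known library functions. The names fingerprint, build_db and identify, and the byte strings in the usage note, are hypothetical.

```python
import hashlib


def fingerprint(code):
    """Hash the raw bytes of a function body (a real tool would mask out addresses first)."""
    return hashlib.sha256(code).hexdigest()


def build_db(known_functions):
    """Map fingerprint -> name for a dict of {name: function_bytes}."""
    return {fingerprint(body): name for name, body in known_functions.items()}


def identify(functions, db):
    """Return {address: name} for every function whose fingerprint is in the database."""
    matches = {}
    for address, body in functions.items():
        name = db.get(fingerprint(body))
        if name is not None:
            matches[address] = name
    return matches
```

For example, with db = build_db({"strlen_like": b"\x04\x05"}), calling identify({0x401000: b"\x04\x05", 0x402000: b"\xff"}, db) labels only the function at 0x401000. An exact byte hash like this breaks as soon as an embedded address or offset differs between builds, which is one reason matches in practice are imperfect.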

First we have to import the function ID dataset. Before we can do that, we must enable access to this functionality in Ghidra through File -> Configure.

Then we go to Tools -> Function ID -> Attach existing FidDb.

This presents a list of *.fidb datasets compiled from various libraries for different architectures. We know the binary targets x86_64, so all the *.fidb files for x86_64 will do for now. By default, Ghidra ships with function ID datasets for Visual Studio (not applicable to reversing UNIX binaries); the other function ID datasets you see here can be found in the references section of this writeup.

From here we simply have to re-analyze the binary. The keyboard shortcut Alt-A brings us to the auto analysis window.

We deselect all the options that are turned on, then select only Function ID. This way we don't have to redo a complete analysis, which isn't necessary and only consumes more time.

After running the analyzer we get a populated symbol tree, where Ghidra was able to match some functions in the binary against the function ID database. Multiple functions (a single example is shown below) were matched against the libc version shown in the snapshot of the decompiler window.

Of course, the function ID analysis was not perfect and will need tweaking to improve the matches. For now, however, we can begin to do some code analysis in the decompiler to get more information.

Ideally I like to start at main(), but Ghidra cannot locate this function, so we will find it manually. We can start by viewing the disassembly at the binary's entry point: type "entry" into the symbol tree filter field and click on the resulting entry function icon above the field.

The code above resembles the CRT (C runtime) startup code that GCC produces when linking against libc. This code has a few responsibilities to carry out before entering the typical int main(int argc, char **argv) and after main returns. To give some context without getting too granular: it sets up the environment for program execution, passes a function pointer to main along with the argument count (argc) and the command-line arguments (argv), and registers functions to be called when main() exits. All of this is accomplished through __libc_start_main.

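Based on the function prototype of __libc_start_main, whose first parameter is a pointer to main, and the x86_64 Linux (System V) calling convention, which passes the first integer or pointer argument in RDI, we can expect the RDI register to hold the address of the main() function (FUN_0040e699).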
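This lookup can be approximated with a simple byte scan. The sketch below assumes the entry stub loads main with a "mov rdi, imm32" instruction (encoded as 48 C7 C7 followed by a 4-byte immediate), which is common for non-PIE GCC startup code but is not guaranteed. The function name find_main_candidate and the SAMPLE_START bytes are hypothetical: the stub is hand-built to resemble a typical _start, with only the 0x40e699 address taken from this sample and the other immediates left as zero placeholders.

```python
import struct

# Encoding of "mov rdi, imm32": REX.W prefix, opcode C7 /0, ModRM selecting RDI.
MOV_RDI_IMM32 = b"\x48\xc7\xc7"


def find_main_candidate(entry_code):
    """Return the immediate loaded into RDI by the first 'mov rdi, imm32', or None."""
    idx = entry_code.find(MOV_RDI_IMM32)
    if idx < 0 or idx + 7 > len(entry_code):
        return None
    # The CPU sign-extends the 32-bit immediate to 64 bits.
    (imm,) = struct.unpack_from("<i", entry_code, idx + 3)
    return imm & 0xFFFFFFFFFFFFFFFF


# Hand-built stub resembling a GCC _start; r8/rcx immediates are placeholders.
SAMPLE_START = bytes.fromhex(
    "31ed"            # xor ebp, ebp
    "4989d1"          # mov r9, rdx
    "5e"              # pop rsi        (argc)
    "4889e2"          # mov rdx, rsp   (argv)
    "4883e4f0"        # and rsp, -16
    "50"              # push rax
    "54"              # push rsp
    "49c7c000000000"  # mov r8, <fini placeholder>
    "48c7c100000000"  # mov rcx, <init placeholder>
    "48c7c799e64000"  # mov rdi, 0x40e699  (main)
    "e800000000"      # call <__libc_start_main placeholder>
    "f4"              # hlt
)
```

Running find_main_candidate(SAMPLE_START) yields 0x40e699. A raw byte scan can hit a false match inside another instruction, so treat the result as a candidate to confirm in the disassembly rather than as proof.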

Double-clicking on the RDI register brings us to the main() function.

From here we can rename the function to main to help us stay oriented on possible paths of execution. This is done by highlighting the function name, which has the form FUN_*, and pressing the L key.

You can confirm this is in fact main. Looking at the decompiler window, we can see the type information for the arguments. We expect main to have the prototype int main(int argc, char **argv). The type int on param1 is indicative of argc, with a size of 4 bytes (GCC uses a 4-byte int on x86_64, so it is held in a 32-bit register). The second parameter has type long, an 8-byte value, which is suitable for a pointer in a binary compiled for x86_64, where addresses are 8 bytes long.

We can also verify the return type. The use of EAX instead of RAX tells us the return value is 4 bytes wide, again fitting the int main(int argc, char **argv) prototype. The code at label LAB_0040e72d in the disassembler view is responsible for returning from this function, and in every instance EAX is used as the return register.

We can now begin to work in the decompiler. By highlighting the function name in the decompiler window and right-clicking on it, we can rename parameters and supply more type information to produce decompiler output that is easier to read.

From here we fill in type information for arguments to the function, its return type and calling conventions.

So we have successfully matched quite a few functions using Ghidra Function IDs and established where main() is in the application, so we can begin to read the code the malware authors wrote. Since we have adequate type information thanks to the disassembly view, we can now work out what the other unknown functions called from main do. Some of them are likely library functions that Ghidra Function ID failed to match, so they might require manual reverse engineering or further tweaking of the function ID analysis for better results. Either way, the more of a target you reverse, the easier it is to make subsequent discoveries about what the code is doing.

References:

MATA sample hosted on VX-UNDERGROUND (Caution: do not run outside of a VM)

Creating Ghidra Function ID databases.

Ghidra Function ID dataset repository.