Reflective DLL Injection: Theory & Practice

Understanding reflective DLL loading - manually mapping a PE in memory without disk or LoadLibrary, with a complete walkthrough of the bootstrap, header parsing, section mapping, relocation fixups, import resolution, and modern OPSEC improvements.

Why Reflective Loading

Standard DLL injection requires the DLL on disk and triggers image load callbacks. Reflective loading parses and maps the PE entirely in memory; no disk, no LoadLibrary.

That single sentence sounds simple, but the consequences are huge from a defender’s perspective. Every conventional injection path Windows offers (LoadLibrary, LdrLoadDll, KnownDlls mapping, NtMapViewOfSection from a file-backed section) leaves loud telemetry behind:

The image load callback (PsSetLoadImageNotifyRoutine) fires in the kernel and is the primary trigger for EDR module-load events (Sysmon Event ID 7, ETW Microsoft-Windows-Kernel-Process ImageLoad).
The PEB Ldr doubly-linked list (InLoadOrderModuleList, InMemoryOrderModuleList, InInitializationOrderModuleList) is updated, leaving an entry that any user-mode scanner can enumerate.
The on-disk file is hashed, scanned by AV minifilters, and tied to your process by image path.

Reflective loading sidesteps every one of those: the DLL never touches the file system, and the OS loader is bypassed entirely. The trade-off is that you must implement the loader yourself, faithfully enough to make the DLL work as if Windows had loaded it.

Pioneered by Stephen Fewer in 2008, reflective DLL injection became the foundational primitive that almost every modern offensive framework still builds on, including Cobalt Strike, Sliver, Mythic agents, Metasploit’s Meterpreter, and countless custom red-team toolkits.

The Two Halves of the Trick

A reflective DLL is a normal-looking PE file with one extra exported function, conventionally called ReflectiveLoader, that knows how to map the rest of the file from raw bytes into a runnable image. Injection becomes a two-stage operation:

Get the raw bytes into the target process. A traditional VirtualAllocEx + WriteProcessMemory pair, an existing C2 download cradle, a shared section, a process hollowing payload: anything that places the unparsed DLL bytes somewhere in the victim’s address space.
Hand control to ReflectiveLoader. A CreateRemoteThread, NtCreateThreadEx, or RtlCreateUserThread call, a hijacked thread context, an APC queued to an alertable thread, or a thread pool callback. The start address is the buffer’s base plus the file offset of ReflectiveLoader (its export RVA translated to a raw offset), not an address inside a mapped image.

When ReflectiveLoader starts running, it is executing inside an image that has not been mapped yet: the bytes are laid out as they are on disk, with file alignment, no relocations applied, and no imports resolved. Its first job is to figure out where it is and bootstrap from there.

How It Works

The reflective loader is embedded inside the DLL itself. When injected as raw bytes, it: finds its own base, parses PE headers, allocates memory, maps sections, processes relocations, resolves imports, and calls DllMain.

That summary hides a lot of subtlety. Let’s walk each step the way the loader actually has to perform it, because every step is a potential blue-team detection if done lazily.

Step 1 - Locate Your Own Base

The loader cannot use any global or absolute address: relocations have not been applied, so any &g_variable reference in the binary is wrong. Instead, the loader uses a position-independent trick: take its own return address (or use a call $+5; pop reg sequence) and walk backwards page by page until it sees the MZ magic of a DOS header followed by a valid PE\0\0 signature at the offset stored in e_lfanew. Page-sized steps assume the raw bytes were written at a page-aligned address, which holds for a fresh VirtualAllocEx allocation.

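// Position-independent base discovery.
// x86 MSVC inline assembly; x64 MSVC has no inline asm and needs an intrinsic
// such as _ReturnAddress() instead.
uintptr_t caller_eip;
__asm {
    call here          // pushes the address of the next instruction
here:
    pop caller_eip     // ...which we pop straight back off the stack
}

uintptr_t base = caller_eip & ~(uintptr_t)0xFFF;   // page-align downward
for (;; base -= 0x1000) {
    PIMAGE_DOS_HEADER dos = (PIMAGE_DOS_HEADER)base;
    if (dos->e_magic != IMAGE_DOS_SIGNATURE) continue;
    // Sanity-check e_lfanew before following it, so a stray "MZ" cannot lead off the buffer.
    if (dos->e_lfanew < (LONG)sizeof(IMAGE_DOS_HEADER) || dos->e_lfanew >= 1024) continue;
    if (((PIMAGE_NT_HEADERS)(base + dos->e_lfanew))->Signature == IMAGE_NT_SIGNATURE) break;
}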

Step 2 - Resolve the APIs You Need to Be a Loader

The loader cannot import anything; the import directory will not be processed until much later. So before it can do its real job it must hand-resolve a small set of APIs from kernel32.dll and ntdll.dll. The standard recipe walks the PEB → Ldr → InMemoryOrderModuleList, then enumerates each module’s export directory, comparing API name hashes rather than strings (plaintext API names embedded in a loader are an easy signature for AV).

The minimal required set is:

API	Purpose
`LoadLibraryA`	Load dependent DLLs referenced in the import table
`GetProcAddress`	Resolve imported function addresses by name/ordinal
`VirtualAlloc`	Allocate the new image region
`VirtualProtect`	Set per-section protections after mapping
`NtFlushInstructionCache`	Flush I-cache before transferring control

A common hashing function is ROR13 (rotate-right-13) applied byte-by-byte to the export name, compared against precomputed constants. Metasploit’s stub uses exactly this; Cobalt Strike uses a slightly different hash but the structure is identical.
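To make the hashing step concrete, here is a minimal, portable sketch of a ROR13 name hash. It assumes the common rotate-then-add variant over the ASCII name without its terminating NUL; real stubs differ in details such as case folding or mixing in the module name. The function names ror32 and ror13_hash are illustrative only. Analysts can use the same routine in reverse, hashing every export of kernel32.dll and ntdll.dll to map constants found in a sample back to API names.

```c
#include <stdint.h>

/* Rotate a 32-bit value right by n bits (0 < n < 32). */
static uint32_t ror32(uint32_t v, unsigned n) {
    return (v >> n) | (v << (32u - n));
}

/* ROR13 over an ASCII name: rotate the running hash right by 13,
   then add the next byte. Assumption: the NUL terminator is not hashed. */
static uint32_t ror13_hash(const char *name) {
    uint32_t h = 0;
    for (const unsigned char *p = (const unsigned char *)name; *p; ++p) {
        h = ror32(h, 13);
        h += *p;
    }
    return h;
}
```

For the two-character name "AB", the first round leaves 0x41; the second rotates it to 0x02080000 and adds 0x42, giving 0x02080042.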

Step 3 - Allocate the New Image Region

VirtualAlloc(NULL, SizeOfImage, MEM_COMMIT|MEM_RESERVE, PAGE_READWRITE) - note PAGE_READWRITE, not PAGE_EXECUTE_READWRITE. Allocating RWX memory is one of the easiest behavioral signals an EDR can flag, so the modern pattern is RW now, RX (per-section) later.

SizeOfImage is taken from the optional header. The OS loader would use this same value, so the allocation matches the legitimate expected layout.

Step 4 - Map Headers and Sections

Copy the headers (SizeOfHeaders bytes from the start of the raw bytes) to offset 0 of the new region. Then walk the section table (IMAGE_SECTION_HEADER array immediately after the optional header) and for each section:

PIMAGE_SECTION_HEADER sect = IMAGE_FIRST_SECTION(nt_headers);
for (WORD i = 0; i < nt_headers->FileHeader.NumberOfSections; i++, sect++) {
    void* dst = (BYTE*)new_base + sect->VirtualAddress;
    void* src = (BYTE*)raw_bytes + sect->PointerToRawData;
    memcpy(dst, src, sect->SizeOfRawData);
    // BSS-style sections may have VirtualSize > SizeOfRawData; the tail needs no
    // copy because VirtualAlloc returns zero-filled pages.
}

This converts the file-aligned layout into the in-memory virtual layout the rest of the PE expects.
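The inverse translation, from an RVA back to an offset in the raw file, is what tooling needs whenever it works on the unmapped bytes, for example to find where an exported function sits in the file. The sketch below is a portable illustration that takes a simplified section table rather than the real IMAGE_SECTION_HEADER; the section_info struct and rva_to_file_offset name are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified view of one section header (illustrative, not the winnt.h layout). */
typedef struct {
    uint32_t virtual_address;  /* VirtualAddress   */
    uint32_t virtual_size;     /* Misc.VirtualSize */
    uint32_t raw_offset;       /* PointerToRawData */
    uint32_t raw_size;         /* SizeOfRawData    */
} section_info;

/* Translate an RVA into a file offset.
   Returns 1 and writes *file_offset when the RVA is backed by raw data,
   0 when it lies in the headers, in a zero-filled tail, or outside all sections. */
static int rva_to_file_offset(const section_info *sections, size_t count,
                              uint32_t rva, uint32_t *file_offset) {
    for (size_t i = 0; i < count; i++) {
        const section_info *s = &sections[i];
        if (rva < s->virtual_address) continue;
        uint32_t delta = rva - s->virtual_address;
        if (delta < s->raw_size) {
            *file_offset = s->raw_offset + delta;
            return 1;
        }
    }
    return 0;
}
```

With a section at RVA 0x1000 stored at file offset 0x400, RVA 0x1234 translates to file offset 0x634.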

Step 5 - Apply Base Relocations

Unless the allocation happens to land at the DLL’s preferred ImageBase, every absolute address baked into the image (function pointers, jump tables, vtables, and, on x86, string references) must be patched. On x64, RIP-relative references need no fix-up, but absolute 64-bit pointers do.

Walk the relocation directory (IMAGE_DIRECTORY_ENTRY_BASERELOC). It is a sequence of IMAGE_BASE_RELOCATION blocks, each describing a 4 KB page and a list of 16-bit fix-up entries:

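// Signed distance between where the image landed and where it was linked to run.
intptr_t delta = (intptr_t)((uintptr_t)new_base - (uintptr_t)nt_headers->OptionalHeader.ImageBase);

PIMAGE_BASE_RELOCATION reloc = /* base + reloc dir RVA */;
// Stop at the zero terminator; the SizeOfBlock check also avoids an endless loop on a malformed table.
while (reloc->VirtualAddress && reloc->SizeOfBlock >= sizeof(*reloc)) {
    DWORD count = (reloc->SizeOfBlock - sizeof(*reloc)) / sizeof(WORD);
    WORD* entry = (WORD*)(reloc + 1);
    for (DWORD i = 0; i < count; i++, entry++) {
        WORD type   = *entry >> 12;       // high 4 bits: relocation type
        WORD offset = *entry & 0x0FFF;    // low 12 bits: offset within the 4 KB page
        if (type == IMAGE_REL_BASED_DIR64)         // x64
            *(uintptr_t*)((BYTE*)new_base + reloc->VirtualAddress + offset) += delta;
        else if (type == IMAGE_REL_BASED_HIGHLOW)  // x86
            *(DWORD*)((BYTE*)new_base + reloc->VirtualAddress + offset) += (DWORD)delta;
        // Other types, including IMAGE_REL_BASED_ABSOLUTE padding entries, are skipped.
    }
    reloc = (PIMAGE_BASE_RELOCATION)((BYTE*)reloc + reloc->SizeOfBlock);
}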

Skipping this step produces a DLL that mostly works until the first global pointer dereference crashes the host process.
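The bit layout of a fix-up entry is easiest to see with numbers. The following portable sketch decodes a single 16-bit entry and computes how many entries a block holds; the type values match the PE format, while the helper names (decode_reloc, reloc_entry_count, reloc_target_rva) are made up for this illustration.

```c
#include <stdint.h>

/* Relocation types from the PE format (same values as in winnt.h). */
enum { REL_ABSOLUTE = 0, REL_HIGHLOW = 3, REL_DIR64 = 10 };

typedef struct {
    unsigned type;    /* high 4 bits of the entry */
    unsigned offset;  /* low 12 bits: byte offset inside the block's 4 KB page */
} reloc_entry;

static reloc_entry decode_reloc(uint16_t raw) {
    reloc_entry e = { (unsigned)(raw >> 12), (unsigned)(raw & 0x0FFFu) };
    return e;
}

/* Number of 16-bit entries in a block. The 8 bytes subtracted are the
   block's own VirtualAddress and SizeOfBlock fields. */
static uint32_t reloc_entry_count(uint32_t size_of_block) {
    return size_of_block < 8u ? 0u : (size_of_block - 8u) / 2u;
}

/* RVA patched by one entry, given the RVA of the block's page. */
static uint32_t reloc_target_rva(uint32_t page_rva, uint16_t raw) {
    return page_rva + decode_reloc(raw).offset;
}
```

For example, the entry 0xA123 in a block for page 0x2000 decodes to type 10 (DIR64) at offset 0x123, so the 64-bit value at RVA 0x2123 receives the delta.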

Step 6 - Resolve the Import Address Table

For each entry in IMAGE_DIRECTORY_ENTRY_IMPORT:

LoadLibraryA the Name field (e.g. WININET.dll).
Walk the parallel OriginalFirstThunk (lookup table) and FirstThunk (IAT) arrays. If OriginalFirstThunk is zero, which some linkers emit, read the lookups from FirstThunk instead.
For each entry, look up the function by ordinal if the high bit of the lookup is set, otherwise by IMAGE_IMPORT_BY_NAME->Name.
Write the resolved address into the IAT slot.

After this step, every call into kernel32!CreateFileA, wininet!HttpSendRequestA, etc. inside the mapped image goes to the right place.
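The ordinal-versus-name decision comes down to one bit test per lookup entry. Here is a small portable sketch of that classification for both 32-bit and 64-bit images; the flag values match IMAGE_ORDINAL_FLAG32 and IMAGE_ORDINAL_FLAG64, while the import_ref struct and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* High-bit flags, same values as IMAGE_ORDINAL_FLAG32 / IMAGE_ORDINAL_FLAG64. */
#define ORDINAL_FLAG32 0x80000000u
#define ORDINAL_FLAG64 0x8000000000000000ull

typedef struct {
    bool     by_ordinal;
    uint16_t ordinal;   /* meaningful when by_ordinal is true */
    uint32_t name_rva;  /* meaningful when by_ordinal is false:
                           RVA of the IMAGE_IMPORT_BY_NAME record */
} import_ref;

static import_ref classify_thunk64(uint64_t thunk) {
    import_ref r = { false, 0, 0 };
    if (thunk & ORDINAL_FLAG64) {
        r.by_ordinal = true;
        r.ordinal = (uint16_t)(thunk & 0xFFFFu);       /* ordinal is the low 16 bits */
    } else {
        r.name_rva = (uint32_t)(thunk & 0x7FFFFFFFu);  /* RVA is the low 31 bits */
    }
    return r;
}

static import_ref classify_thunk32(uint32_t thunk) {
    import_ref r = { false, 0, 0 };
    if (thunk & ORDINAL_FLAG32) {
        r.by_ordinal = true;
        r.ordinal = (uint16_t)(thunk & 0xFFFFu);
    } else {
        r.name_rva = thunk & 0x7FFFFFFFu;
    }
    return r;
}
```

A by-name result points at a record holding a 16-bit hint followed by the NUL-terminated function name, which is the string the loader hands to GetProcAddress.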

Step 7 - Apply Final Section Protections

Walk the section table again and call VirtualProtect per section based on the Characteristics flags:

Section flags	Final protection
`IMAGE_SCN_MEM_EXECUTE | IMAGE_SCN_MEM_READ`	`PAGE_EXECUTE_READ`
`IMAGE_SCN_MEM_EXECUTE | IMAGE_SCN_MEM_READ | IMAGE_SCN_MEM_WRITE`	`PAGE_EXECUTE_READWRITE` (avoid, see OPSEC below)
`IMAGE_SCN_MEM_READ | IMAGE_SCN_MEM_WRITE`	`PAGE_READWRITE`
`IMAGE_SCN_MEM_READ` only	`PAGE_READONLY`

This is the moment the image stops being a uniform RW blob and starts looking like a real loaded image with .text as RX and .data/.rdata distinct.
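The table can be written as a small pure function. This sketch redefines the relevant constants locally so it compiles without windows.h; the numeric values are the documented ones, but the SCN_ and PG_ names and the section_protection function are illustrative. Combinations the table does not list (execute-only, no flags at all) are mapped here to PAGE_EXECUTE and PAGE_NOACCESS, which is one reasonable choice rather than the only one.

```c
#include <stdint.h>

/* Section characteristics (values as in winnt.h). */
#define SCN_MEM_EXECUTE 0x20000000u
#define SCN_MEM_READ    0x40000000u
#define SCN_MEM_WRITE   0x80000000u

/* Page protection constants (values as in winnt.h). */
#define PG_NOACCESS          0x01u
#define PG_READONLY          0x02u
#define PG_READWRITE         0x04u
#define PG_EXECUTE           0x10u
#define PG_EXECUTE_READ      0x20u
#define PG_EXECUTE_READWRITE 0x40u

/* Map a section's Characteristics field to a page protection value. */
static uint32_t section_protection(uint32_t characteristics) {
    int x = (characteristics & SCN_MEM_EXECUTE) != 0;
    int r = (characteristics & SCN_MEM_READ) != 0;
    int w = (characteristics & SCN_MEM_WRITE) != 0;

    if (x) {
        if (w) return PG_EXECUTE_READWRITE;   /* there is no execute+write-only page type */
        return r ? PG_EXECUTE_READ : PG_EXECUTE;
    }
    if (w) return PG_READWRITE;               /* writable pages are always readable */
    return r ? PG_READONLY : PG_NOACCESS;
}
```

A typical .text section with Characteristics 0x60000020 maps to PAGE_EXECUTE_READ (0x20), and a typical .data section with 0xC0000040 maps to PAGE_READWRITE (0x04).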

Step 8 - Flush, Then Call `DllMain`

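On x86 and x64 the hardware keeps the instruction cache coherent with data writes, so this flush is largely a formality there. It is required on architectures such as ARM64, and a careful loader issues it anyway before running code it has just written: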

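// (HANDLE)-1 is the current-process pseudo-handle; GetCurrentProcess is not among the hand-resolved APIs.
NtFlushInstructionCache((HANDLE)-1, new_base, nt_headers->OptionalHeader.SizeOfImage);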

Then call the entry point with DLL_PROCESS_ATTACH:

typedef BOOL (WINAPI *DllMain_t)(HINSTANCE, DWORD, LPVOID);
DllMain_t entry = (DllMain_t)((BYTE*)new_base + nt_headers->OptionalHeader.AddressOfEntryPoint);
entry((HINSTANCE)new_base, DLL_PROCESS_ATTACH, NULL);

DllMain is where your real payload runs: staging the next-stage beacon, hooking, persistence, whatever.

What the Loader Skips That the Real OS Does

A fully faithful implementation would also handle the following; minimal loaders skip some or all of them:

TLS callbacks (IMAGE_DIRECTORY_ENTRY_TLS) - the OS loader runs these before DllMain, so a DLL that relies on them misbehaves unless the reflective loader runs them too. Some anti-analysis builds also plant code here.
Exception/SEH directory (IMAGE_DIRECTORY_ENTRY_EXCEPTION) - on x64, exception dispatch and unwinding fail for code whose function tables are not registered. Calling RtlAddFunctionTable on the .pdata entries makes SEH work.
Delay-load imports - only matter if the DLL uses them.
Manifest / activation context - usually skipped; modern implants avoid COM dependencies.
InMemoryOrderModuleList linkage - deliberately skipped. Linking the module makes it visible to EnumProcessModules. That visibility is exactly what reflective loading is trying to avoid.

Evasion Improvements

Use RW→RX permissions (not RWX), erase PE headers after loading, use indirect syscalls for API calls, and consider module stomping over fresh allocation.

Each of those is a small concept on its own; together they are what separates a 2008-era loader from something that survives a 2026-era EDR scan.

RW → RX, Never RWX

A page that is simultaneously writable and executable is one of the most reliable EDR signals there is. Behavioural detections in mainstream EDR products (CrowdStrike, Microsoft Defender, and Elastic all ship rules for suspicious or unbacked executable memory) trip the moment you allocate a private RWX region in another process. Always allocate PAGE_READWRITE, copy/relocate/fix-up, then VirtualProtect to PAGE_EXECUTE_READ per section.

Erase the PE Headers After Loading

Once DllMain has run successfully, the first 0x1000 bytes of the mapping (DOS header, NT headers, section table) are dead weight that screams “I am a PE in private memory”. Memory scanners look for MZ/PE\0\0 at page-aligned addresses. After loading, zero them out:

DWORD old;
VirtualProtect(new_base, 0x1000, PAGE_READWRITE, &old);
RtlSecureZeroMemory(new_base, 0x1000);
VirtualProtect(new_base, 0x1000, old, &old);

Tools like Moneta and pe-sieve hunt for PE images in private memory, and an intact header is the easiest way to find one. Scrubbing the headers removes that signature, although the private executable region itself remains an anomaly.

Indirect Syscalls

API hooks live in user-mode ntdll.dll. If your loader calls NtAllocateVirtualMemory through the normal kernel32!VirtualAlloc → ntdll!Nt... path, every EDR with user-mode hooks (most of them) sees the call. Direct syscalls bypass the hook by embedding the syscall stub in the loader itself, but the syscall instruction then executes from, and returns to, memory outside ntdll, which itself becomes a tell.

Indirect syscalls combine the best of both worlds: you locate the syscall stub inside ntdll.dll, recover its syscall number, set up the registers yourself, and jmp to the legitimate syscall; ret instruction pair inside ntdll, so the return address sits inside ntdll where the EDR expects it. Frameworks like SysWhispers3, HellsHall, and TartarusGate automate this for the full Nt-API set.

Module Stomping Over Fresh Allocation

Even with RW→RX and zeroed headers, a private memory region with executable code is anomalous. The fix is not to allocate at all but to overwrite a legitimate, on-disk-backed module’s .text instead. A loader picks a “victim” DLL the host process already has loaded but isn’t actively using (amsi.dll in non-AMSI processes, wpaxholder.dll, region-specific localization DLLs), VirtualProtects its .text to RW, copies the implant into it, and restores RX. The result is an executable region that is file-backed by a signed Microsoft DLL. Memory scanners that compare disk-vs-memory hashes get a hit, but private-region scanners see nothing: you’ve moved the detection to a different category.

The newer “Phantom DLL Hollowing” / “Process Doppelgänging” / “Process Herpaderping” family extends this idea by writing modified bytes to the file, mapping them, then reverting the file content so on-disk hashes look clean.

A related modern evolution of this concept is Process Mockingjay. Rather than stomping on a .text section and dealing with VirtualProtect calls (which EDRs monitor closely), a Mockingjay loader hunts for legitimate, signed DLLs that already ship with an RWX section (such as certain versions of msys-2.0.dll). By injecting the payload directly into this naturally occurring RWX space, the loader avoids calling memory allocation or protection APIs altogether, starving the EDR of its primary behavioral triggers.

Other Modern Hardening

Spoofed call stacks during the load: techniques like Thread Call Stack Spoofing or VEH (Vectored Exception Handling)-based stack spoofing make sure that when the loader calls VirtualProtect, the stack walk shows a legitimate-looking caller (KERNELBASE!VirtualProtect from a worker thread inside ntdll!TppWorkerThread) instead of the implant.
Encrypted import hashes: even the precomputed ROR13 hashes are static bytes. AES-encrypt them with a per-build key derived from a value only available at runtime.
Fragment the loader stub: split the loader code across multiple sections, chained via small jmp shims, so the loader itself doesn’t sit as a single contiguous YARA-able blob.
Shellcode-only variant (sRDI): Nick Landers’ sRDI converts any DLL into position-independent shellcode by prepending a small custom loader, so the DLL does not need to export a ReflectiveLoader. Many modern loaders build on this approach.

Detection Side

If you read this from the defender’s chair, the signals worth watching for are:

RtlCreateUserThread / NtCreateThreadEx / CreateRemoteThread with a start address pointing into a private (non-image-backed) region.
Private executable regions whose first bytes are not MZ (header-erased), or whose first bytes are MZ but no corresponding file is mapped.
VirtualProtect calls flipping RW → RX inside another process’s memory shortly after a WriteProcessMemory.
Calls into LoadLibraryA from threads whose start address is unbacked.
Discrepancies between Module32First/Next (Toolhelp) and NtQueryVirtualMemory enumeration of executable mappings: reflective modules show up in the latter but not the former.
ETW Threat Intelligence provider events (EtwTi) for AllocateVirtualMemory with PAGE_EXECUTE_* and unusual stack traces.
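As a defender-side illustration of the header-based signal, the sketch below scans a dumped memory region for MZ/PE signatures at page-aligned offsets. It is portable C that works on a plain byte buffer, so acquiring the dump (for example from a private executable region reported by NtQueryVirtualMemory) is outside its scope. The function names and the 0x40 to 0x1000 bound on e_lfanew are assumptions made for this example, not values taken from any particular scanner.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read a little-endian 32-bit value without alignment assumptions. */
static uint32_t read_le32(const uint8_t *p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Return 1 if buf[off..] starts with a plausible DOS header whose
   e_lfanew points at a "PE\0\0" signature inside the buffer. */
static int looks_like_pe(const uint8_t *buf, size_t len, size_t off) {
    if (off >= len || len - off < 0x40) return 0;           /* DOS header is 0x40 bytes */
    if (buf[off] != 'M' || buf[off + 1] != 'Z') return 0;
    uint32_t e_lfanew = read_le32(buf + off + 0x3C);
    if (e_lfanew < 0x40 || e_lfanew > 0x1000) return 0;     /* heuristic bound (assumption) */
    if (len - off < (size_t)e_lfanew + 4) return 0;
    return memcmp(buf + off + e_lfanew, "PE\0\0", 4) == 0;
}

/* Count page-aligned offsets in the dump that look like PE headers. */
static size_t count_pe_headers(const uint8_t *buf, size_t len) {
    size_t hits = 0;
    for (size_t off = 0; off < len; off += 0x1000)
        hits += (size_t)looks_like_pe(buf, len, off);
    return hits;
}
```

A hit in private, non-image-backed memory is a strong indicator. A miss proves little, since header erasure defeats exactly this check, which is why it belongs alongside the thread-start and protection-change signals above rather than in place of them.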

Reflective DLL injection is the foundation of modern offensive tooling. Cobalt Strike and most C2 frameworks use variants of this approach. Understanding it deeply, both the what and the why of every step, is what separates an operator who can copy-paste a loader from one who can write a custom loader that survives the next EDR update.