FastMove: Optimizing System.Move

This post describes a technique for optimizing some RTL functions at run-time. At the end of the post there’s a unit; add it to your uses list (preferably as the first one) and the System.Move routine will gain a significant speed boost. The unit patches the Move routine at run-time with an MMX, SSE, SSE2 or SSE3 version, depending on which SIMD instruction sets your CPU supports.

It performs these steps at run-time (in the unit's initialization section):

  1. Detect the supported SIMD instruction sets using the CPUID instruction. (see the GetSupportedSimdInstructionSets, CPUIIDSupports and CPUID functions)
  2. Detect the L2 cache size of your CPU, which is used to tune the move operation. (see the GetL2CacheSize and GetExtendedL2CacheSize functions)
  3. Select the best function variant that the CPU supports, in order of preference: SSE3, then SSE2, then SSE, and MMX as the last option.
  4. If your CPU doesn’t have support for any of those SIMD sets the original Move is left untouched.
  5. After the proper variant of the function has been selected, the System.Move routine is patched and a “JMP NEW_OPTIMIZED_VERSION” jump is written over the first instructions — this forces the use of our optimized versions. (see PatchMethod function)
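  6. The patch is wrapped in calls to the VirtualProtect Windows function: the first call un-protects the address space in which the System.Move routine resides, and the second restores the original protection after the jump has been written.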
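
Steps 1, 3 and 4 can be sketched roughly as follows. This is only an illustration, not the code from the unit: the names TMoveVariant, GetFeatureBits and PickVariant are hypothetical, and the sketch assumes a 32-bit Delphi compiler whose inline assembler knows the cpuid mnemonic. The feature bits themselves (CPUID leaf 1, EDX bits 23, 25 and 26 for MMX, SSE and SSE2, ECX bit 0 for SSE3) are the documented ones.

```delphi
program SelectVariantSketch;

{$APPTYPE CONSOLE}

type
  // Hypothetical names; the real unit may organise this differently.
  TMoveVariant = (mvOriginal, mvMMX, mvSSE, mvSSE2, mvSSE3);

const
  // Feature bits reported by CPUID leaf 1.
  EDX_MMX  = $00800000; // EDX bit 23
  EDX_SSE  = $02000000; // EDX bit 25
  EDX_SSE2 = $04000000; // EDX bit 26
  ECX_SSE3 = $00000001; // ECX bit 0

  VariantNames: array[TMoveVariant] of string =
    ('original Move', 'MMX', 'SSE', 'SSE2', 'SSE3');

// Runs CPUID leaf 1. With the 32-bit register calling convention,
// EAX and EDX hold the addresses of the two out parameters.
procedure GetFeatureBits(out FeatEdx, FeatEcx: Cardinal);
asm
  push ebx          // CPUID overwrites EBX, which must be preserved
  push esi
  push edi
  mov  esi, eax     // address of FeatEdx
  mov  edi, edx     // address of FeatEcx
  mov  eax, 1
  cpuid
  mov  [esi], edx
  mov  [edi], ecx
  pop  edi
  pop  esi
  pop  ebx
end;

// Picks the most capable variant, falling back to the untouched Move.
function PickVariant(FeatEdx, FeatEcx: Cardinal): TMoveVariant;
begin
  if (FeatEcx and ECX_SSE3) <> 0 then
    Result := mvSSE3
  else if (FeatEdx and EDX_SSE2) <> 0 then
    Result := mvSSE2
  else if (FeatEdx and EDX_SSE) <> 0 then
    Result := mvSSE
  else if (FeatEdx and EDX_MMX) <> 0 then
    Result := mvMMX
  else
    Result := mvOriginal;
end;

var
  FeatEdx, FeatEcx: Cardinal;
begin
  GetFeatureBits(FeatEdx, FeatEcx);
  Writeln('Selected variant: ', VariantNames[PickVariant(FeatEdx, FeatEcx)]);
end.
```

Keeping the decision in a pure function such as PickVariant makes the priority order easy to check without running on several different CPUs.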

Of course I could have just exported those function variants and made them available to consumer code, but the interesting side-effect of run-time patching is that all code in Delphi that uses Move will get that speed boost, including the RTL and VCL units. Anyway, no more words, get your copy now (while it’s hot):
[download#17]
(Available under the LGPL license, the original license of the YAWE project)
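
The jump patch from steps 5 and 6, and the reason existing callers pick it up without being recompiled, can be illustrated with a small stand-alone program. This is a sketch under stated assumptions, not the unit's PatchMethod: the names TNearJump, JumpDisplacement, PatchWithJump, OriginalRoutine and OptimizedRoutine are hypothetical, the code targets 32-bit Windows, and it patches a dummy routine instead of System.Move. The target routine must be at least 5 bytes long, because that is the size of the jump written over its first instructions.

```delphi
program JumpPatchSketch;

{$APPTYPE CONSOLE}
{$Q-,R-} // the address arithmetic below relies on 32-bit wrap-around

uses
  Windows;

type
  // Near jump: opcode $E9 followed by a 32-bit relative displacement.
  PNearJump = ^TNearJump;
  TNearJump = packed record
    OpCode: Byte;
    Displacement: Integer;
  end;

// The displacement is relative to the first byte after the 5-byte jump.
function JumpDisplacement(OldProc, NewProc: Pointer): Integer;
begin
  Result := Integer(NewProc) - Integer(OldProc) - SizeOf(TNearJump);
end;

// Overwrites the start of OldProc with "JMP NewProc".
function PatchWithJump(OldProc, NewProc: Pointer): Boolean;
var
  OldProtect: DWORD;
begin
  // Make the code page writable, remembering the previous protection.
  Result := VirtualProtect(OldProc, SizeOf(TNearJump),
    PAGE_EXECUTE_READWRITE, OldProtect);
  if not Result then
    Exit;
  // The fields are written directly, without calling Move,
  // because Move itself could be the routine being patched.
  PNearJump(OldProc)^.OpCode := $E9;
  PNearJump(OldProc)^.Displacement := JumpDisplacement(OldProc, NewProc);
  // Restore the previous protection and discard stale instructions.
  VirtualProtect(OldProc, SizeOf(TNearJump), OldProtect, OldProtect);
  FlushInstructionCache(GetCurrentProcess, OldProc, SizeOf(TNearJump));
end;

// Stand-ins for System.Move and its optimized replacement.
function OriginalRoutine: Integer;
begin
  Result := 1;
end;

function OptimizedRoutine: Integer;
begin
  Result := 2;
end;

begin
  Writeln('Before patch: ', OriginalRoutine);
  if PatchWithJump(@OriginalRoutine, @OptimizedRoutine) then
    // The call site is unchanged, yet it now lands in OptimizedRoutine.
    Writeln('After patch:  ', OriginalRoutine);
end.
```

Because the patch lives inside the old routine rather than at the call sites, every caller compiled against the original address is redirected, which is how RTL and VCL code benefits as well.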


Note: All SIMD versions of the Move routine were written by Seth and were initially included in our YAWE project. You can find much more interesting stuff if you look at the code we’ve laid down there.

About the Author: Alexandru Ciobanu

2 Comments

  1. Hi.

    I'm a Delphi programmer.
    Thank you for your code. It really is fast.

    But there are 2 problems:

    1. Instructions such as lddqu, movntdq, etc. are not recognized by the compiler in older Delphi versions.
    They have to be replaced with their DB equivalents.
    An example:

    {$IFDEF SSE2Basm}
    lddqu xmm0, [eax+ecx]
    lddqu xmm1, [eax+ecx+16]
    lddqu xmm2, [eax+ecx+32]
    lddqu xmm3, [eax+ecx+48]
    {$ELSE}
    DB $F2,$0F,$F0,$04,$01
    DB $F2,$0F,$F0,$4C,$01,$10
    DB $F2,$0F,$F0,$54,$01,$20
    DB $F2,$0F,$F0,$5C,$01,$30
    {$ENDIF}

    2. Certain move operations (such as the backward one) show no clear difference on Intel processors, while on AMD they run much slower than the classic Move.

    For example, this code:

    var
      s: AnsiString;
      ab: array of Byte;

    implementation

    procedure Test;
    var
      i: Integer;
      t: TTime;
    begin
      SetLength(s, 10485780);
      SetLength(ab, 1048578);
      t := Now;
      for i := 1 to 500 do
      begin
        // Overlapping moves towards lower addresses
        Move(s[11], s[1], 10485760);
        Move(ab[2], ab[1], 1048576);
      end;
      t := Now - t;
      ShowMessage(FormatDateTime('ss.zzz', t));
    end;

    On Intel processors it is only 1.08 times faster with Optimize.Move.

    My processor: AMD Athlon(tm) II x2 3.2 GHz (x86, x86-64, MMX, 3DNow!, SSE, SSE2, SSE3)
    OS + development environment: Windows 7 + Delphi XE5. I also have Windows XP + Delphi 7, but it does not compile there.

    Optimize.Move: 11.6 sec
    Classic Move: 1.8 sec

    I often use this kind of copy in my code (I can show it to you if you like).

    I hope this information helps you locate and fix the problem. If you need any other information, I'll gladly help.

    All the best,
    DavidB
