• C# ReadOnlySpan and static data

    Since C# 7 there have been a lot of point releases that contain all kinds of goodies. Many of them are performance focused, such as safe stack allocations using Span<T>, or interoperability improvements to fixed.

    One that I love, but is not documented well, is some special treatment that ReadOnlySpan<byte> gets when its contents are known at compile time.

    Here’s an example of a lookup table I used to aid with hex encoding, using a byte[]:

    private static byte[] LookupTable => new byte[] {
        (byte)'0', (byte)'1', (byte)'2', (byte)'3', (byte)'4',
        (byte)'5', (byte)'6', (byte)'7', (byte)'8', (byte)'9',
        (byte)'A', (byte)'B', (byte)'C', (byte)'D', (byte)'E',
        (byte)'F',
    };
    

    This binary data has to get stored somewhere in our produced library. If we use dumpbin we can see it in the .text section of the binary.

    dumpbin /RAWDATA /SECTION:.text mylib.dll
    

    Right at the bottom, we see:

    00402A40: 30 31 32 33 34 35 36 37 38 39 41 42 43 44 45 46  0123456789ABCDEF
    

    I won’t go into a lot of the details on how this data is compiled into the .text section; at this point we need to get that data into the array somehow.

    If we look at the jit assembly of LookupTable, we see:

    sub rsp, 0x28
    vzeroupper
    mov rcx, 0x7ffc4638746a
    mov edx, 0x10
    call 0x7ffc49b52630
    mov rdx, 0x1b51450099c
    lea rcx, [rax+0x10]
    vmovdqu xmm0, [rdx]
    vmovdqu [rcx], xmm0
    add rsp, 0x28
    ret
    

    Where 0x7ffc49b52630 is InitializeArray.

    With an array, our property leans on InitializeArray, the source of which is in the CoreCLR. For little-endian platforms, it boils down to a memcpy from a runtime field handle.

    Indeed, with a debugger we finally see:

    00007ffd`b18b701a e831a40e00       call    coreclr!memcpy (00007ffd`b19a1450)
    

    Dumping @rdx L10 yields:

    000001f0`4c552a90  30 31 32 33 34 35 36 37-38 39 41 42 43 44 45 46  0123456789ABCDEF
    

    So that was a very long-winded way of saying that, when using arrays, initializing a field or variable with bytes results in a memcpy from the image into a freshly allocated array on the heap.
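One consequence of that memcpy is easy to observe: every access of a byte[]-returning property runs the initializer again and produces a distinct heap array. A minimal sketch, with the table condensed into a small demo class of my own:

```csharp
using System;

static class Demo
{
    // Same table as above, as a byte[]-returning property.
    public static byte[] LookupTable => new byte[] {
        (byte)'0', (byte)'1', (byte)'2', (byte)'3',
        (byte)'4', (byte)'5', (byte)'6', (byte)'7',
        (byte)'8', (byte)'9', (byte)'A', (byte)'B',
        (byte)'C', (byte)'D', (byte)'E', (byte)'F',
    };
}

class Program
{
    static void Main()
    {
        // Each access allocates a fresh array and memcpys the 16 bytes
        // from the image into it, so the two references differ.
        Console.WriteLine(ReferenceEquals(Demo.LookupTable, Demo.LookupTable)); // False
    }
}
```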

    Now, starting with C# 7.3, we can avoid that memcpy when using ReadOnlySpan<byte>.

    private static ReadOnlySpan<byte> LookupTable => new byte[] {
        (byte)'0', (byte)'1', (byte)'2', (byte)'3', (byte)'4',
        (byte)'5', (byte)'6', (byte)'7', (byte)'8', (byte)'9',
        (byte)'A', (byte)'B', (byte)'C', (byte)'D', (byte)'E',
        (byte)'F',
    };
    

    Looking at the jit assembly:

    mov eax, 0x10
    xor edx, edx
    mov r8, 0x1b5144c0968
    mov [rcx], rdx
    mov [rcx+0x8], r8
    mov [rcx+0x10], eax
    mov rax, rcx
    ret
    

    We see that there is mov r8, 0x1b5144c0968. The contents of 0x1b5144c0968 are:

    000001b5`144c0968  30 31 32 33 34 35 36 37-38 39 41 42 43 44 45 46  0123456789ABCDEF
    

    So the method is now returning the data directly, omitting the memcpy entirely; our ReadOnlySpan<byte> points directly into the .text section.

    This works for property getters as shown above, but also as the return of a method:

    ReadOnlySpan<byte> GetBytes() {
        return new byte[] { ... };
    }
    

    This works similarly to the getter of the property. It also works for locals in a method body:

    void Write200Ok(Stream s) {
        ReadOnlySpan<byte> data = new byte[] {
            (byte)'H', (byte)'T', (byte)'T', (byte)'P',
            (byte)'/', (byte)'1', (byte)'.', (byte)'1',
            (byte)' ', (byte)'2', (byte)'0', (byte)'0',
            (byte)' ', (byte)'O', (byte)'K'
        };
        s.Write(data);
    }
    

    Which also produces a reasonable JIT disassembly:

    sub     rsp, 0x38
    xor     eax, eax
    mov     qword ptr [rsp+0x28], rax
    mov     qword ptr [rsp+0x30], rax
    mov     rcx, 0x1e595b42ade
    mov     eax, 0x0F
    lea     r8, [rsp+0x28]
    mov     qword ptr [r8], rcx
    mov     dword ptr [r8+8], eax
    mov     rcx, rdx
    lea     rdx, [rsp+0x28]
    cmp     dword ptr [rcx], ecx
    call    0x7ff89ede10c8 (Stream.Write(System.ReadOnlySpan`1<Byte>), mdToken: 0000000006000001)
    add     rsp, 0x38
    ret
    

    Here we see mov rcx, 0x1e595b42ade, which moves the address of the static data directly into the register with no additional work to create a byte array.

    This optimization currently only works with ReadOnlySpan<byte>. Other types will continue to use InitializeArray because the runtime must account for how different platforms handle endianness.
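To tie this together, here is a sketch of how such a table might be consumed for hex encoding. The Hex class and Encode method are my own illustration, not code from the original project (the conditional stackalloc requires C# 8):

```csharp
using System;

static class Hex
{
    // The compile-time span from above: no allocation, no memcpy.
    private static ReadOnlySpan<byte> LookupTable => new byte[] {
        (byte)'0', (byte)'1', (byte)'2', (byte)'3',
        (byte)'4', (byte)'5', (byte)'6', (byte)'7',
        (byte)'8', (byte)'9', (byte)'A', (byte)'B',
        (byte)'C', (byte)'D', (byte)'E', (byte)'F',
    };

    public static string Encode(ReadOnlySpan<byte> data)
    {
        // Stack-allocate for small inputs, fall back to the heap otherwise.
        Span<char> buffer = data.Length <= 64
            ? stackalloc char[data.Length * 2]
            : new char[data.Length * 2];
        for (var i = 0; i < data.Length; i++)
        {
            buffer[i * 2] = (char)LookupTable[data[i] >> 4];      // high nibble
            buffer[i * 2 + 1] = (char)LookupTable[data[i] & 0xF]; // low nibble
        }
        return new string(buffer);
    }
}
```

For example, `Hex.Encode(new byte[] { 0xDE, 0xAD, 0xBE, 0xEF })` yields `"DEADBEEF"`.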

  • C# 8 using declarations

    Visual Studio 2019 preview 2 was released a few days ago and I took the time to install it. Visual Studio itself is actually rather uninteresting to me, however the inclusion of the next C# 8 preview got my attention. I glanced at the feature highlights and posted “looks nice” on Twitter.

    Predictably, I got a few responses like “I’m not sure I like that”, and there is always a guarantee that if F# has a similar feature, an F# developer will appear and tell you F# has had this feature for 600 years.

    The one I like a lot is using declarations. This allows a local to automatically be disposed at the end of the block. Essentially, it hides the try/finally or the using() {...}. The .NET team’s blog kind of gave a bad example of this, so I’ll use one from Open OPC SignTool. Here is the original snippet:

    private static X509Certificate2 GetCertificateFromCertificateStore(string sha1)
    {
        using (var store = new X509Store(StoreName.My, StoreLocation.LocalMachine))
        {
            store.Open(OpenFlags.OpenExistingOnly | OpenFlags.ReadOnly);
            var certificates = store.Certificates.Find(X509FindType.FindByThumbprint, sha1, false);
            return certificates.Count > 0 ? certificates[0] : null;
        }
    }
    

    A using var can make this:

    private static X509Certificate2 GetCertificateFromCertificateStore(string sha1)
    {
        using var store = new X509Store(StoreName.My, StoreLocation.LocalMachine);
        store.Open(OpenFlags.OpenExistingOnly | OpenFlags.ReadOnly);
        var certificates = store.Certificates.Find(X509FindType.FindByThumbprint, sha1, false);
        return certificates.Count > 0 ? certificates[0] : null;
    }
    

    This has the same effect: store has Dispose called on it at the end of the method. The benefit is that there is less indentation and fewer braces, which keeps me focused on the code that matters. I don’t care when store is disposed in the method; I can just observe that the local has a using modifier and be assured that Dispose will be called.

    This isn’t the same as garbage collection or finalizers. Both of those are non-deterministic and can lead to unexpected program behavior. That matters less in the case of X509Store, so let’s look at another example:

    using Stream stream = entry.Open();
    var xmlDocument = XDocument.Load(stream, LoadOptions.PreserveWhitespace);
    return new OpcRelationships(location, xmlDocument, readOnlyMode);
    

    Not disposing a stream that is backed by a file can cause access errors later in software that might try to open that file again - it is already open. So not only is it a bad idea to leave streams to the GC, it is simply incorrect.

    Again, though, a using on the local ensures it is deterministically closed.

    When it gets disposed may be slightly unclear to the developer. The quick explanation is that disposal happens when the local is no longer reachable, not when it is last used. The C# 8 code above gets compiled roughly to:

    var stream = entry.Open();
    try
    {
        var xmlDocument = XDocument.Load(stream, LoadOptions.PreserveWhitespace);
        return new OpcRelationships(location, xmlDocument, readOnlyMode);
    }
    finally
    {
        if (stream != null)
        {
            ((IDisposable)stream).Dispose();
        }
    }
    

    The disposal is done after the return, when the local is no longer reachable, not after XDocument is created.

    I find this very helpful to keep code readable. This doesn’t work when you need fine control over when Dispose is called. A place where this does not work well is when the Dispose pattern is used for scopes, such as logging. The AzureSignTool project has code similar to this in SignCommand:

    var logger = loggerFactory.CreateLogger<SignCommand>();
    Parallel.ForEach(AllFiles, options, () => (succeeded: 0, failed: 0), (filePath, pls, state) =>
    {
        using (var loopScope = logger.BeginScope("File: {Id}", filePath))
        {
            logger.LogInformation("Signing file.");
            //Sign the file. omit a bunch of other code.
            logger.LogInformation("Done signing the file.");
        }
        logger.LogDebug("Incrementing success count.");
        return (state.succeeded + 1, state.failed);
    }, finalResult => { /* aggregate the per-thread counts; omitted */ });
    

    Here, we cannot change this to a using var, because then the LogDebug would be inside of that logging scope, which it wasn’t before. This is a place where we want Dispose to be called at a different time from when loopScope would no longer be in scope.
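If the narrower scope is still wanted, a using declaration can be nested inside an explicit block, since disposal happens at the end of the enclosing block. A minimal sketch with a stand-in disposable (the Scope type here is hypothetical):

```csharp
using System;

class Scope : IDisposable
{
    public static bool Disposed;
    public void Dispose() => Disposed = true;
}

class Program
{
    static void Main()
    {
        {
            using var scope = new Scope();
            Console.WriteLine($"inside block, disposed: {Scope.Disposed}");
        } // scope is disposed here, when the enclosing block ends
        Console.WriteLine($"after block, disposed: {Scope.Disposed}");
    }
}
```

The first line prints `inside block, disposed: False` and the second prints `after block, disposed: True`, showing disposal happens exactly at the closing brace.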

    My impression from C# developers is that they do not tend to call Dispose on resources as soon as it can be disposed, just at a reasonable point in the same method. Most developers do not write this code:

    public bool MightBeExe(string filePath)
    {
        var firstBytes = new byte[2];
        int bytesRead;
        using (var file = File.Open(filePath, FileMode.Open))
        {
            bytesRead = file.Read(firstBytes, 0, 2);
        }
        return bytesRead == 2 && firstBytes[0] == (byte)'M' && firstBytes[1] == (byte)'Z';
    }
    

    They will instead write something like:

    public bool MightBeExe(string filePath)
    {
        using (var file = File.Open(filePath, FileMode.Open))
        {
            var firstBytes = new byte[2];
            var bytesRead = file.Read(firstBytes, 0, 2);
            return bytesRead == 2 && firstBytes[0] == (byte)'M' && firstBytes[1] == (byte)'Z';
        }
    }
    

    Which is a perfect candidate for using var:

    public bool MightBeExe(string filePath)
    {
        using var file = File.Open(filePath, FileMode.Open);
        var firstBytes = new byte[2];
        var bytesRead = file.Read(firstBytes, 0, 2);
        return bytesRead == 2 && firstBytes[0] == (byte)'M' && firstBytes[1] == (byte)'Z';
    }
    

    There are of course some reasonable limitations to this feature. For example, it cannot be combined with out-variables.

    if (Crypto32.CryptEncodeObjectEx(
        // other stuff
        out var handle,
        ref size)
    )
    {
        using (handle)
        {
            // Do stuff
        }
    }
    

    This does not work:

    if (Crypto32.CryptEncodeObjectEx(
        // other stuff
        out using var handle,
        ref size)
    )
    {
        // Do stuff
    }
    

    Jared Parsons said on Twitter that C# folks thought of this, and decided that it had “Too much confusion about ownership.” Thinking about it myself, I agree, so it’s nice that the feature is limited in that regard.

    Another limitation is that the variable cannot be reassigned. For example:

    using var stream = entry.Open();
    stream = entry2.Open();
    

    This will produce error CS1656, “Cannot assign to ‘stream’ because it is a ‘using variable’”.

    All in all, I very much like this small feature in C# 8. It has reasonable guard rails to keep you from doing something too weird like re-assigning to it, while giving the benefit of fewer blocks, braces, and levels of indentation.

  • Secure Random Integers in .NET Core 3

    .NET Core 3.0 is tentatively set to include a new API for securely generating a random integer bound to a specific range.

    I won’t be shy in admitting that it was something I pushed for and made the initial attempt at implementing, though it’s unfair to say that I implemented it by myself given all of the outstanding feedback I got on the initial pull request (thanks Levi and Jeremy!).

    It’s been known for a while that System.Random shouldn’t be used when cryptographic randomness is required. Despite that, there wasn’t anything built in to .NET that made creating bounded random integers easy. You could either use System.Random and hope for the best, or use a CSPRNG like RandomNumberGenerator that gave back raw bytes, which requires some thought on how to properly convert them to a random integer without introducing any kind of bias.

    Starting in .NET Core 3.0, you’ll be able to do:

    var min = 1;
    var max = 1_000;
    var randomNumber = RandomNumberGenerator.GetInt32(min, max);
    

    If you need this before .NET Core 3, well, the source is right there. It can be adapted with a bit of effort to work on the .NET Framework as well as other environments that don’t have Span<T>.
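For illustration, here is a sketch of the usual bias-avoiding technique, rejection sampling, built on the raw-bytes API that exists everywhere. This is my own adaptation, not the implementation that shipped in .NET Core 3.0:

```csharp
using System;
using System.Security.Cryptography;

static class SecureRandom
{
    // Hypothetical helper: a uniform value in [min, max) from a CSPRNG's
    // raw bytes, using rejection sampling to avoid modulo bias.
    public static int GetInt32(int min, int max)
    {
        if (min >= max) throw new ArgumentOutOfRangeException(nameof(max));
        uint range = (uint)((long)max - min);
        // Accept only values below the largest multiple of range that fits
        // in a uint; anything above it would make small results more likely.
        uint limit = uint.MaxValue - (uint.MaxValue % range);
        var buffer = new byte[sizeof(uint)];
        using (var rng = RandomNumberGenerator.Create())
        {
            uint value;
            do
            {
                rng.GetBytes(buffer);
                value = BitConverter.ToUInt32(buffer, 0);
            } while (value >= limit);
            return (int)(min + (value % range));
        }
    }
}
```

A rejected draw simply loops and pulls four fresh bytes, so the result stays uniform over the whole range at the cost of an occasional retry.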

  • FixedTimeEquals in .NET Core

    .NET Core introduced FixedTimeEquals which I have personally found to be very helpful. It’s a small method, but given the kind of code I tend to write, I was writing it a lot from project to project, and am happy to see it in the box.

    This API is meant to prevent a timing side-channel when comparing two sequences of bytes. The goal is that the comparison should take the same amount of time regardless of the contents of the bytes, assuming they are the same length. This is often required when comparing MACs or simple possession demonstrations for APIs.

    Suppose you had a web API that required an API key, implemented in such a way that it just expected a magic string to act as a password for interacting with the API.

    It required an HTTP header, like

    X-Api-Key: SecretVooDooWord
    

    and TLS was used to ensure it was kept confidential in transit. The API did a naive implementation like this:

    [HttpGet]
    public IActionResult SuperSensitiveApi() {
        var key = Request.Headers["X-Api-Key"].Single();
        if (key != "SecretVooDooWord") {
            return Unauthorized();
        }
        return new JsonResult(new { AllTheSecretData = "tada" });
    }
    

    The issue here is that neither != nor == is fixed-time: both return false as soon as there is a difference. In pseudocode, it might look something like this:

    compare(word1, word2) {
        if word1.length != word2.length
            return false
        i = 0
        while (i < word1.length) {
            letter1 = word1[i]
            letter2 = word2[i]
            if letter1 != letter2
                return false
            else
                i = i + 1
        }
        return true
    }
    

    This function will compare the letters one by one. As soon as it encounters a difference, it returns false and doesn’t bother checking the rest. Why should it? The answer is going to be false every time.

    To an attacker that is trying to figure out the contents of the string, carefully measuring the time can leak the contents of the string. For argument’s sake, let’s say that checking each letter takes 2ns and the attacker has a very accurate way of measuring time over a network and can account for jitter and latency.

    GET /SuperSensitiveApi HTTP/1.1
    X-Api-Key: Raaaaaaaaaaaaaaa
    

    Checking the API key will take 2ns because the first letters do not match. The attacker moves on to the next letter.

    GET /SuperSensitiveApi HTTP/1.1
    X-Api-Key: Saaaaaaaaaaaaaaa
    

    This time, checking the API key takes 4ns because the first letter matched (2ns) and the second failed (another 2ns)

    GET /SuperSensitiveApi HTTP/1.1
    X-Api-Key: Sbaaaaaaaaaaaaaa
    

    Fails again taking 4ns to check the API key. After a few more tries…

    GET /SuperSensitiveApi HTTP/1.1
    X-Api-Key: Seaaaaaaaaaaaaaa
    

    This time it took 6ns to check the API key because the first and second letter were checked, and the third failed.

    The attacker can keep doing this by observing the amount of time each call takes. The longer it takes, the attacker can assume they have guessed the next letter. In practice, this is extremely difficult to do over a network and to get accurate enough timing information, but it is believed that it might be possible given enough persistence and an adversary that has the means to be in a network position that is very stable.

    These kinds of attacks do exist. An example timing side-channel is Lucky 13, which affected many libraries’ handling of CBC padding in TLS.

    So == in C# is a bad way to check strings for equality where timing side channels need to be mitigated. Perhaps, then, you’ll do something like this:

    private static bool CheckStringsFixedTime(string str1, string str2) {
        if (str1.Length != str2.Length) {
            return false;
        }
        var allTheSame = true;
        for (var i = 0; i < str1.Length; i++) {
            if (str1[i] != str2[i]) {
                allTheSame = false;
            }
        }
        return allTheSame;
    }
    

    This looks like it’s constant time, but on modern CPUs and .NET, it’s not. Branch statements, like if, have timing implications. We can’t take a branch.

    OK, what about this?

    private static bool CheckStringsFixedTime(string str1, string str2) {
        if (str1.Length != str2.Length) {
            return false;
        }
        var allTheSame = true;
        for (var i = 0; i < str1.Length; i++) {
            allTheSame &= str1[i] == str2[i];
        }
        return allTheSame;
    }
    

    This looks somewhat promising for C#. We know from the language specification that &= does not short circuit, so str1[i] == str2[i] will always be evaluated.

    We still have a few problems though, and this is where the JIT and x86 can make things more complicated for us.

    The first issue is that allTheSame is a simple true/false flag, which still leaves the .NET JIT and x86 instruction execution room for a few optimizations that reintroduce timing differences. Timing side-channel mitigation should avoid all decisions until the very end, and & on a bool is a decision. However, we can improve on that some:

    private static bool CheckStringsFixedTime(string str1, string str2) {
        if (str1.Length != str2.Length) {
            return false;
        }
        var result = 0;
        for (var i = 0; i < str1.Length; i++) {
            result |= str1[i] ^ str2[i];
        }
        return result == 0;
    }
    

    This is fairly close to fixed-time. We update an integer using | to combine the bits from the result of an XOR. XOR a value with itself and the result is zero; anything else gives a non-zero value. We use a binary OR to accumulate any set bits. At the end, if we have a non-zero value, we know one of the comparisons failed.

    There is some debate as to which arithmetic operators are better for fixed-time operations between str1[i] and str2[i]. Some implementations use XOR, as above; others may use SUB for subtraction. Unfortunately, most CPU architectures make no guarantee that any operation is constant-time.

    We still have a problem to address: the JIT could be making unintended optimizations. The C# compiler itself performs very few optimizations. The JIT, on the other hand, may do all sorts of things, like unroll a loop or make a variety of other optimizations so the code executes faster.

    We can tell the JIT to leave it alone, like so.

    [MethodImpl(MethodImplOptions.NoInlining | MethodImplOptions.NoOptimization)]
    private static bool CheckStringsFixedTime(string str1, string str2) {
        if (str1.Length != str2.Length) {
            return false;
        }
        var result = 0;
        for (var i = 0; i < str1.Length; i++) {
            result |= str1[i] ^ str2[i];
        }
        return result == 0;
    }
    

    So now the JIT will cease to optimize this method at all, and the generated code will be much closer to as-written.

    This is as close to time-independent as possible with the string type. We can improve things a bit by making the parameters ReadOnlySpan<char>, which generates very similar x86 as string parameters; however, the null checks are eliminated, since a ReadOnlySpan cannot be null.

    This is what .NET Core 2.1’s FixedTimeEquals does. It just does it over a ReadOnlySpan<byte> instead of characters. The x86 produced by the current JIT will be identical for ReadOnlySpan<char> and ReadOnlySpan<byte> with the exception of the size of the MOVZX.
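Usage is straightforward. For the API key example from earlier, it might look like this (the surrounding program and variable names are mine; the API key is the illustrative one from above):

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

class Program
{
    static void Main()
    {
        byte[] expected = Encoding.UTF8.GetBytes("SecretVooDooWord");
        byte[] presented = Encoding.UTF8.GetBytes("SecretVooDooWord");

        // Compares every byte regardless of where the first difference is,
        // so the duration does not leak how much of the key matched.
        bool match = CryptographicOperations.FixedTimeEquals(presented, expected);
        Console.WriteLine(match); // True
    }
}
```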

    It’s handy that this is just in the box for .NET Core now.

  • Playing with RISC-V

    Over the past few years I’ve started to sour on the x86 architecture for just about everything. From servers to desktops and mobile, I’ve long wished we had architectures competing with x86 at the high end.

    Fortunately, there are plenty of architectures out there. From ARM and ARM64 to MIPS, there are choices. One recent one has been getting my attention lately, and I thought it was time to start diving into it.

    RISC-V

    RISC-V (pronounced “Risk Five”) is a fairly new architecture with some interesting ideas behind it that make it worth looking at. Originally designed at UC Berkeley, it is an open architecture with the goal of being applicable to a wide range of devices.

    An open ISA that is available for commercial use is not unique, but it is rare. Contrast this with ARM: if you want to make your own ARM CPU, you need to license the ISA from ARM Holdings Group, to the tune of millions of dollars. While “open” is not a direct benefit to developers, it does mean that a variety of companies that fabricate their own silicon are interested.

    RISC-V is also designed for multiple device profiles. It supports 32-bit and 64-bit, and is broken down into a set of extension instruction sets, plus the always-available base integer set. The ISA also documents and reserves a 128-bit architecture.

    This “base” integer instruction set is present on any RISC-V implementation. Called “RV32I” for 32-bit or “RV64I” for 64-bit, it consists of the 47 instructions required to move memory and perform computation and arithmetic.

    As of writing, some parts of the RISC-V architecture are still under active development, while others are finalized. RV64I and RV32I are complete, while RV128I is still open for changes. The RV64I and RV32I sets guarantee 32 registers. For smaller implementations, there is RV32E, which limits the register count to 16.

    The extensions are as follows:

    • “M” extension instructions offer multiplication and division.
    • “A” extension instructions offer atomic instructions.
    • “C” extension instructions provide compressed, smaller encodings of common instructions.
    • “F” extension instructions offer single-precision floating point.
    • “D” extension instructions offer double-precision floating point.
    • “Q” extension instructions offer quad-precision floating point.
    • “L” extension instructions offer decimal floating point.

    There are more extensions than that, but these are the basic, complete ones. A vendor may choose to implement any, or none, of these extensions. Most of them can be emulated with the base integer instructions using compiler replacements, save for the atomic instruction set.

    A key aspect though is that the ISA leaves open guidance and design for allowing a vendor to implement their own extensions and describing how they would be encoded.

    The extension set is not entirely meant to be pick-and-choose. The RV32E base set was designed with only the M, A, and C extensions in mind, plus any additional extension sets implemented on top of it.
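The chosen extensions show up directly in the toolchain's -march string when cross-compiling. For an RV32IMAC part like the one discussed below, a GCC invocation might look like this (the flags shown are illustrative of GCC's RISC-V options):

```shell
# Target the 32-bit base integer set plus the M, A, and C extensions,
# with the ilp32 (32-bit int/long/pointer) ABI.
riscv32-unknown-elf-gcc -march=rv32imac -mabi=ilp32 -nostdlib -o example.elf example.S
```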

    Hardware

    That all sounds swell, but what about actual hardware? Currently, there is not a lot of hardware out there. SiFive, though, makes a board called the HiFive 1, which is a small board with an RV32IMAC-capable processor.

    Notably, it works with the Arduino IDE, so there is an easy way to get started with it.

    I didn’t have much interest in using it as a board to run code I already knew, however. I wanted to learn the instruction set, which meant writing assembly, throwing it at the board, and seeing what it did.

    I initially struggled doing this in WSL for some reason, the details of which I haven’t quite fully comprehended yet. After switching to macOS, things went a lot smoother, since much of the documentation targets Linux and Unix-like systems.

    I decided to start with a very simple program. Add two numbers.

    .section .text
    .globl _start
    _start:
        li t1, 42
        li t2, 48
        add t3, t1, t2
        nop
    

    li is “load immediate”, which puts a constant into a register. The t-prefixed registers are temporary registers. A simple compile might look something like this:

    riscv32-unknown-elf-gcc example.S -nostdlib -o example.elf
    

    This will produce a binary that is close to working but has a problem that took me a while to track down. The disassembly looks like this:

    example.elf:     file format elf32-littleriscv
    
    
    Disassembly of section .text:
    
    00010054 <_start>:
       10054:	02a00313          	li	t1,42
       10058:	03000393          	li	t2,48
       1005c:	00730e33          	add	t3,t1,t2
       10060:	0001                	nop
    

    Loading this on the HiFive 1 doesn’t work because it doesn’t load at the correct address. Reading through the HiFive 1 documentation, the entry point must be at address 0x20400000. Fortunately, the HiFive 1 comes with a GCC linker script that does all the right things: it handles the start address as well as other memory layout concerns. After re-compiling with their linker script:

    riscv32-unknown-elf-gcc example.S -nostdlib -o example.elf -T link.lds
    

    Re-dumping the assembly shows that _start appears in the correct location. Loading the program can be done with openocd, which is included in the HiFive software kit.

    Starting it with the correct configuration:

    openocd -f ~/.openocd/config/hifive1.cfg 
    

    I moved some files around to make things easier, but they should be easy enough to track down in their software kit.

    Once openocd is running, I can use the included RISC-V GDB debugger to debug remotely. openocd starts a GDB server on localhost:3333. Firing up the RISC-V GDB with riscv32-unknown-elf-gdb ~/Projects/riscv/example.elf, I can run the following commands to load the software and start debugging it.

    target extended-remote localhost:3333
    monitor reset halt
    monitor flash protect 0 64 last off
    load
    layout asm
    

    and if all goes well, GDB comes up with the program loaded and the disassembly on screen.

    This is fairly exciting! I’m debugging my first RISC-V process, even if it does something as simple as addition. After a few stepi commands to step up to the nop, I can use info register t3 (or just i r t3 for short) to see that we have 90 in the register, which is the result of adding 42 and 48.

    I’m hopeful that RISC-V will be competitive in the embedded and mobile CPU ISA space. It will take a very long time to get there, and even longer for less homogeneous environments like desktops and laptops, but I believe everyone will be better off if we simply had more to choose from. Even those that firmly continue to use x86 would benefit from additional innovation and research into architecture design.

    I’ll continue to play with RISC-V. If something interesting pops up, I’ll do my best to write about it.