Kinect v2 SDK: Gesture::get_Name expects a file path

The Short Version: When reading the name of a gesture, using the Gesture::get_Name method of the IGesture interface, bufferSize should be at least 260. Check out this gist for more detail.

For the last handful of years I have been exploring the development of accessible virtual reality experiences, with a specific focus on enabling 'full-body' access for those who use a wheelchair. This has been a rewarding experience, full of smiles, laughs and enjoyable challenges. Of those involved, the largest technical effort has been engineering a lightning-fast Kinect integration for Unreal Engine, to form the basis of an accessible control toolkit named Jester. The Kinect, also known as Project Natal during its development, is a line of motion sensors designed by Microsoft and first released in 2010. The Kinect supports a plethora of input sources, incorporating cameras, infrared projectors, microphone arrays and software-based artificial intelligence enabling real-time gesture- and speech-recognition as well as skeletal tracking of up to four people. A second-generation unit, released in 2013, also incorporated a time-of-flight sensor – capable of processing 2 gigabits of environmental data per second – and greatly increased fidelity.

A year later, Microsoft would release the same hardware for Windows, and, in 2018, they would discontinue it. Despite this, the Kinect 'v2' has found a loving home among academics and in commercial settings. My research focuses on utilising motion sensing capabilities to map wheelchair movements to control inputs, often specialised for a particular game or genre. The mature hardware, coupled with rich software support and documentation, provides a suitable basis for undertaking such an exercise.

One such element of the software is the 'Visual Gesture Builder,' a set of APIs and toolkits providing support for artificial-intelligence-based gesture detection. The Visual Gesture Builder toolkit analyses recordings of gestures and intelligently creates an AI model representing them. Each gesture can be discrete, meaning it is either detected or not; or continuous, where detection is exposed as a threshold. The result of this process is a 'gesture builder database' file containing the gestures, including their type and name, and the models associated with them. Internally to Jester, each gesture is mapped to a game-specific control scheme. One such example might be a 'left steer' gesture mapping to the negative x axis of the left stick on a controller.

Utilising a gesture builder database in Jester is made relatively straightforward by the API. Let's take the 'SampleDatabase' file from the Kinect for Windows v2 SDK as an example. This gesture builder database contains four such gestures: Steer_Left, a discrete left-hand steering motion; Steer_Right, a discrete right-hand steering motion; SteerProgress, a continuous gesture of which the threshold indicates the angle of steer; and SteerStraight, a continuous gesture indicating angle from camera forward. Let's go ahead and write some code to load this database.

#include <iostream>
#include <Kinect.VisualGestureBuilder.h>

/* ... */

IVisualGestureBuilderDatabase* database;
result = CreateVisualGestureBuilderDatabaseInstanceFromFile(TEXT("C:\\Program Files\\Microsoft SDKs\\Kinect\\v2.0_1409\\Tools\\KinectStudio\\databases\\SampleDatabase.gbd"), &database);
if (result != S_OK) {
    std::cout << "Failed to create database\n";
    return E_FAIL;
}

In this example we pass the address of an IVisualGestureBuilderDatabase pointer into the CreateVisualGestureBuilderDatabaseInstanceFromFile function, as well as the name of the file we wish to load – that being the sample database. All being well, this function will return an HRESULT of S_OK – or zero – and we can continue. The IVisualGestureBuilderDatabase interface exposes the number of gestures contained in the database through the get_AvailableGesturesCount method. Whilst we already know there are four, and we could make an assumption, let's do the right thing and use this API anyway.

UINT32 num_gestures;
result = database->get_AvailableGesturesCount(&num_gestures);
if (result != S_OK) {
    std::cout << "Failed to get available gesture count\n";
    return E_FAIL;
}

Here we pass the address of num_gestures, an unsigned 32-bit integer, into get_AvailableGesturesCount. Again assuming that the method returns S_OK, num_gestures (which, in our example, should now be four) will tell us how much memory we need to allocate going forward. With all of this in hand, we can now try to load the gestures.

IGesture** gestures = new IGesture * [num_gestures];
result = database->get_AvailableGestures(num_gestures, gestures);
if (result != S_OK) {
    std::cout << "Failed to get available gestures\n";
    return E_FAIL;
}

Using the value we stored in num_gestures above, we can new up an array of IGesture pointers. The IGesture interface encapsulates information about each gesture – specifically its name and type. These pointers don't refer to anything useful just yet, so we populate them with a call to get_AvailableGestures. Another S_OK later and we have a set of gesture structures in memory – great stuff! Now we want to know what they're called. In the interests of efficiency, let's allocate only as much memory as we need to. Since we know that the longest gesture name is exactly 14 wide characters ("SteerProgress" and a null terminator), we'll use that as our buffer size.

/* gesture_name_length = strlen("SteerProgress") + 1; */
const UINT32 gesture_name_length = 14;
for (UINT32 i = 0; i < num_gestures; i++) {
    wchar_t gesture_name[gesture_name_length];
    result = gestures[i]->get_Name(gesture_name_length, gesture_name);
    if (result != S_OK) {
        std::cout << "Failed to read gesture name\n";
        return E_FAIL;
    }
    std::wcout << TEXT("Gesture ") << i << TEXT(": ") << gesture_name << TEXT("\n");
}

We call Gesture::get_Name for each gesture, giving the maximum length and the address of a buffer into which the name is copied. Should everything go according to plan, the application will then write the zero-based index and name of each gesture to the standard output. Let's give it a whirl.

Wait a minute...

Not what we were hoping for.

Analysis

As with any programming problem, the very first thing any of us will do is check Stack Overflow. Unfortunately, the internet is full of folks experiencing exactly the same issue. One Stack Overflow user gets close to stumbling across the answer and there are several examples around that would work but, if you've read this far, you're interested in the why of the matter.

The next logical step, then, is to look into the headers and other documentation. The online documentation for this method gives the following signature.

public:
HRESULT get_Name(
         UINT bufferSize,
         wchar_t *gestureName
)

However, the headers offer a different signature, in which a clue to the nature of our problem is hidden.

 public:
 virtual /* [propget] */ HRESULT STDMETHODCALLTYPE get_Name( 
     UINT bufferSize,
     /* [annotation][out][retval] */ 
     _Out_writes_z_(bufferSize)  wchar_t *filePath) = 0;

The headers in our installation suggest that the Gesture::get_Name method takes a wchar_t* called filePath as its second argument. It is annotated as an 'out' parameter, and as the return value. If we entertain for a moment that this is the correct signature, and it is indeed expecting to provide us with a file path, then we can try setting bufferSize to some obnoxiously large value – say, 1000 – to see what happens. The result?

Success. From here, we could reasonably assume that Gesture::get_Name expects to receive a buffer capable of containing a full-length file path. This would give rise to the assumption that it ensures the buffer is at least MAX_PATH (or, 260 characters) in length. Sadly there are no publicly available debugging symbols for the Kinect for Windows v2 SDK, so we'll get creative.

Disassembling the Problem

For the next stage of my investigation, I had to dream back to my early days at the University of Wolverhampton. In my very first semester, we were taught the basics of understanding assembly languages using ARM7 as an example. An assembly language is a special kind of programming language that matches a processor's instructions almost one-to-one. Just as ARM7 assembly is specific to ARM7 processors, x86 assembly is specific to Intel and AMD x86 processors. We've used 64-bit libraries, so we will actually be looking at 'x64 assembly' – an extended version of x86 assembly.

In order to prove our suspicions that Gesture::get_Name is ensuring bufferSize is at least MAX_PATH, we need to disassemble Kinect20.VisualGestureBuilder.dll. This is the process by which we take the bytes that make up a binary executable and translate them into the matching assembly language – in this case, x64.

The first thing we need to find is the value 260, which may be written as 104h or 0x104. Since we suspect a greater-than-or-equal operation, we are looking for 0x104 occurring in a 'compare', or CMP, instruction immediately followed by a 'jump if above or equal', or JAE, instruction. This looks something like the following –

000000001:	cmp ab, 0x104
000000002:	jae 0x000000010

The former of these, compare, accepts two integers. It subtracts the second from the first and discards the result, setting bits in the flag register as appropriate. If, treating both as unsigned, the second integer is larger than the first, the 'carry flag' is set. The latter of these, jump if above or equal, resumes execution at the address given as its sole argument if, and only if, the carry flag is zero. This means that, if the value we give as the buffer size is greater than or equal to 260, the execution will 'jump' several instructions and, if not, it continues at the immediately following instruction. In our hypothetical example above, if the value in register 'ab' is greater than or equal to 260, then execution jumps to the instruction at 0x000000010.
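The compare-then-jump pair described above can be modelled in a few lines of C++. This is only a sketch of the two flags that matter here, not an emulator; the names Flags, cmp and jae_taken are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// The two flags relevant to this walkthrough (a simplification of the
// real flags register).
struct Flags {
    bool carry; // CF: the unsigned subtraction lhs - rhs needed a borrow
    bool zero;  // ZF: the result of the subtraction was zero
};

// Models `cmp lhs, rhs`: subtract, throw the result away, keep the flags.
Flags cmp(std::uint32_t lhs, std::uint32_t rhs) {
    return Flags{lhs < rhs, lhs == rhs};
}

// Models `jae`: the jump is taken when CF is clear, i.e. lhs >= rhs (unsigned).
bool jae_taken(Flags flags) {
    return !flags.carry;
}

// Walks the three interesting cases around the 0x104 (260) threshold.
void demo() {
    assert(jae_taken(cmp(260, 0x104)));  // equal: jump taken
    assert(jae_taken(cmp(1000, 0x104))); // above: jump taken
    assert(!jae_taken(cmp(14, 0x104)));  // below: execution falls through
}
```

With a buffer size of 14, the model falls through to the instruction after the jump, which is the behaviour we suspect of the real method.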

We can use regular expressions to locate the compare instructions --

(cmp[\t\s]*[^,]+,[\t\s]*(?>0x104|104h))

-- and apply it to the disassembled library, turning up two instances of a 'compare' operation where the second argument is 260. These are at offsets 0x1744 and 0x1A74, relative to the start of the code section, respectively. The former of these is followed by a JAE instruction, and the latter a JB, so we'll continue investigating the former.
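For anyone who would rather script the search, here is a rough C++ equivalent using std::regex. Two assumptions to flag: std::regex's ECMAScript grammar has no atomic groups, so the (?>...) above becomes a plain non-capturing (?:...), and I've added a trailing word boundary so that values such as 0x1040 are not picked up. The function name find_cmp_260 is hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Returns every line of a disassembly listing that compares something
// against 260, written either as 0x104 or as 104h.
std::vector<std::string> find_cmp_260(const std::vector<std::string>& listing) {
    static const std::regex pattern(
        R"(cmp\s+[^,]+,\s*(?:0x104|104h)\b)",
        std::regex::ECMAScript | std::regex::icase);

    std::vector<std::string> hits;
    for (const std::string& line : listing) {
        if (std::regex_search(line, pattern)) {
            hits.push_back(line);
        }
    }
    return hits;
}

// Runs the search over two lines taken from the listing in this post.
void demo() {
    const std::vector<std::string> listing = {
        "180001744:\tcmp edx, 0x104",
        "18000174a:\tjae 0x180001762",
    };
    assert(find_cmp_260(listing).size() == 1);
}
```

Feeding it the disassembly one line at a time keeps the pattern simple, at the cost of not being able to match the following JAE in the same expression.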

In the land of assembly, we can easily pick out subroutines by looking for a pattern of instructions that looks like the following.

000000001:	push ab
...
000000009:	pop ab
000000010:	ret

The first of these instructions pushes the contents of register ab onto the 'stack'. This is followed by any number of instructions that make up the subroutine body, and then a pop – which moves the value at the top of the stack into register ab. The last instruction, ret, returns control of execution to the calling subroutine. There may be several pushes and pops, though this depends on the subroutine and the application state.

The second and third values we are looking for are E_POINTER, 0x80004003, and E_INVALIDARG, 0x80070057. These are Windows HRESULT error codes, the former of which is returned when a pointer is null, and the latter if an argument is invalid in some way. These are returned from Gesture::get_Name if filePath is NULL or if bufferSize is invalid, respectively. These will likely be immediate values, meaning they will be numbers rather than data in registers, due to how applications are compiled to machine code.
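If the two hexadecimal values look arbitrary, it may help to pull them apart. An HRESULT packs a severity bit (bit 31), a facility (bits 16 to 26) and a code (bits 0 to 15) into 32 bits. The sketch below decodes those fields; HresultParts and decode_hresult are names invented for this example, not part of any SDK.

```cpp
#include <cassert>
#include <cstdint>

// The three fields of an HRESULT that we care about here.
struct HresultParts {
    bool failure;           // bit 31: 1 means failure, 0 means success
    std::uint16_t facility; // bits 16-26: which subsystem raised the error
    std::uint16_t code;     // bits 0-15: the error code within that facility
};

// Splits a raw 32-bit HRESULT value into its fields.
HresultParts decode_hresult(std::uint32_t hr) {
    return HresultParts{
        (hr >> 31) != 0,
        static_cast<std::uint16_t>((hr >> 16) & 0x7FF),
        static_cast<std::uint16_t>(hr & 0xFFFF)};
}

// Decodes the two values we are hunting for in the disassembly.
void demo() {
    const HresultParts invalid_arg = decode_hresult(0x80070057); // E_INVALIDARG
    assert(invalid_arg.failure);
    assert(invalid_arg.facility == 7); // FACILITY_WIN32
    assert(invalid_arg.code == 0x57);  // 87, ERROR_INVALID_PARAMETER

    const HresultParts pointer = decode_hresult(0x80004003); // E_POINTER
    assert(pointer.failure);
    assert(pointer.facility == 0);     // FACILITY_NULL
    assert(pointer.code == 0x4003);
}
```

Both values have the top bit set, which is why they show up as large 0x8xxxxxxx immediates and are easy to spot in a listing.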

With all this in mind, we can start to understand the structure of our mystery subroutine, which, after analysing the instructions therein, is the following block of assembly code.

180001730:	push rbx
180001732:	sub rsp, 0x30
180001736:	xor ebx, ebx
180001738:	mov rax, r8
18000173b:	test r8, r8
18000173e:	jz 0x180001779
180001740:	mov [r8], bx
180001744:	cmp edx, 0x104
18000174a:	jae 0x180001762
18000174c:	mov dword [rsp+0x20], 0x14
180001754:	lea r8, [rip+0x16e95]
18000175b:	mov ecx, 0x80070057
180001760:	jmp 0x18000178d
180001762:	lea r8, [rcx+0x20]
180001766:	mov edx, edx
180001768:	mov rcx, rax
18000176b:	call qword [rip+0x16b2f]
180001771:	mov eax, ebx
180001773:	add rsp, 0x30
180001777:	pop rbx
180001778:	ret 
180001779:	mov dword [rsp+0x20], 0x13
180001781:	lea r8, [rip+0x16e58]
180001788:	mov ecx, 0x80004003
18000178d:	lea r9, [rip+0x16e3c]
180001794:	xor edx, edx
180001796:	call dword 0x8000f850
18000179b:	add rsp, 0x30
18000179f:	pop rbx
1800017a0:	ret 
The x64 assembly of Gesture::get_Name(uint, wchar_t*)

We can see that the subroutine is bookended by push, pop and ret instructions as we expect. However, there is a second pair of pop and ret instructions at offset 0x1777 – these are the positive return case of our mystery subroutine. How do we know? Let's break it down: We can see two jumps – a JAE and JZ, or 'jump if zero'.

180001730:	push rbx
...
18000173b:	test r8, r8
18000173e:	jz 0x180001779
...
180001744:	cmp edx, 0x104
18000174a:	jae 0x180001762
...
18000179f:	pop rbx
1800017a0:	ret 
Branch control within Gesture::get_Name(uint, wchar_t*)

The JZ, or 'jump if zero', instruction in this code is new to us. If the zero flag, ZF, is set, JZ causes execution to resume at the address given as its sole argument. The preceding instruction, test, computes the bitwise AND of its two arguments, discards the result and sets the flags accordingly. Here both arguments are r8, so the result is zero, and ZF is set, only when r8 itself is zero. In that case, JZ causes execution to resume at 0x1779. In this code, the first jump is taken if the pointer argument is zero (or null, in high-level terms) and the second jump is taken if the value in edx is greater than or equal to 0x104.
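The test-then-jump idiom can be sketched the same way as the compare earlier. Again this is a toy model with invented names (test_sets_zf, jz_taken, null_check_branches), intended only to show why testing a register against itself amounts to a null check.

```cpp
#include <cassert>
#include <cstdint>

// Models `test a, b`: ZF is set when the bitwise AND of the operands is zero.
bool test_sets_zf(std::uint64_t a, std::uint64_t b) {
    return (a & b) == 0;
}

// Models `jz`: the jump is taken when ZF is set.
bool jz_taken(bool zero_flag) {
    return zero_flag;
}

// Models `test r8, r8` followed by `jz`: branches only for a null pointer.
bool null_check_branches(const void* pointer) {
    const auto bits = reinterpret_cast<std::uintptr_t>(pointer);
    return jz_taken(test_sets_zf(bits, bits));
}

// A null pointer takes the branch; a valid one does not.
void demo() {
    int value = 0;
    assert(null_check_branches(nullptr));
    assert(!null_check_branches(&value));
    // With two different operands the AND can be zero even though neither
    // operand is: 1 & 2 == 0.
    assert(test_sets_zf(1, 2));
}
```

The last assertion is the reason compilers emit the register-against-itself form: it is the only case where "the AND is zero" and "the value is zero" mean the same thing.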

180001779:	mov dword [rsp+0x20], 0x13
180001781:	lea r8, [rip+0x16e58]
180001788:	mov ecx, 0x80004003
18000178d:	lea r9, [rip+0x16e3c]
180001794:	xor edx, edx
180001796:	call dword 0x8000f850
18000179b:	add rsp, 0x30
18000179f:	pop rbx
1800017a0:	ret

Jumping to the instruction at 0x1779, we can see the outlines of some error handling statements. At 0x1788, the value 0x80004003 – or E_POINTER – is moved into the ecx register. For the purposes of this writeup, we won't break down the subroutine called at 180001796h – however, it appears to be a debugging routine used internally by Microsoft. Similarly, if the jump at 18000174a isn't triggered, the following statements are executed.

18000174c:	mov dword [rsp+0x20], 0x14
180001754:	lea r8, [rip+0x16e95]
18000175b:	mov ecx, 0x80070057
180001760:	jmp 0x18000178d

Can you see it? E_INVALIDARG is moved into the ecx register at 18000175b and then execution is resumed inside the error handling block. What happens if the JAE instruction results in a jump, then?

180001762:	lea r8, [rcx+0x20]
180001766:	mov edx, edx
180001768:	mov rcx, rax
18000176b:	call qword [rip+0x16b2f]
180001771:	mov eax, ebx
180001773:	add rsp, 0x30
180001777:	pop rbx
180001778:	ret 

This block is executed when all of the conditions (i.e. our buffer size and pointers being valid) are met. Three pieces of data are moved into position, and then a call is made to a subroutine whose address is stored at [rip+0x16b2f]. The rip register is the 64-bit instruction pointer, and contains the address of the next instruction. The value of rip during this call instruction is 0x1771, therefore we can easily work out that the address of interest is 0x182A0. This address lies in a data section of the library rather than in .text, so the call is indirect through a pointer stored there. However, based on our mystery method being a getter for a name, we can reasonably guess it's strcpy or a variant thereof.
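The address arithmetic above is simple enough to check by hand, but here it is as a small helper for completeness. The function name rip_relative_target is my own; the inputs come straight from the listing, where the call sits at 18000176b and the next instruction at 180001771.

```cpp
#include <cassert>
#include <cstdint>

// Address referenced by a RIP-relative operand: the address of the *next*
// instruction plus the signed 32-bit displacement encoded in the instruction.
std::uint64_t rip_relative_target(std::uint64_t next_instruction,
                                  std::int32_t displacement) {
    return next_instruction + static_cast<std::int64_t>(displacement);
}

// Resolves `call qword [rip+0x16b2f]` from the listing.
void demo() {
    // Using the full addresses shown in the disassembly.
    assert(rip_relative_target(0x180001771ULL, 0x16b2f) == 0x1800182A0ULL);
    // Using offsets from the image base, as in the prose.
    assert(rip_relative_target(0x1771, 0x16b2f) == 0x182A0);
}
```

The same calculation applies to the lea instructions in the error-handling blocks, which use RIP-relative operands too.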

In order to understand what is being called, I used a piece of software named IDA. IDA, or Interactive DisAssembler, is a program for disassembling and decompiling executables. It has support for numerous processor architectures, but we're most interested in Intel x64. Using IDA, we can take a peek at what lives at 0x182A0.

.idata:00000001800182A0 ; errno_t __cdecl wcscpy_s(wchar_t *Dst, rsize_t SizeInWords, const wchar_t *Src)
.idata:00000001800182A0                 extrn wcscpy_s:qword    ; CODE XREF: sub_180001730+3B↑p
.idata:00000001800182A0                                         ; sub_180005170+1C1↑p ...
wcscpy_s at 0x182A0

Just as we thought, it's a variant of strcpy. A Microsoft Visual C (MSVC) specific variant of the safe strcpy, wcscpy_s operates on wide characters. This is indicated to us by the wc prefix, and can be confirmed by referencing the MSVC documentation.

The last thing we might like to do before 'reimplementing' the Gesture::get_Name method is visualise the subroutine. This is somewhere else that IDA is very useful, generating a graph for us with a single click.

A flow diagram of the Gesture::get_Name subroutine.

The diagram generated for us by IDA shows the jumps in action, controlling the flow of execution and raising error conditions as appropriate.

We now know everything we need to know about this little block, and have confirmed our suspicions. For reasons unknown, Gesture::get_Name treats our pointer as if it is going to contain a file path and, as a result, ensures that the buffer is capable of receiving one. Taking everything we've learned throughout this analysis, we can present a C++ implementation of this method. Enjoy.

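HRESULT Gesture::get_Name(UINT bufferSize, wchar_t *filePath)
{
    HRESULT result = S_OK;
    if (!filePath) {
        result = E_POINTER;
    } else {
        // The output is cleared before the size check (mov [r8], bx at 180001740).
        filePath[0] = L'\0';
        if (bufferSize >= MAX_PATH) {
            // this->_gestureName stands in for the member at [rcx+0x20].
            wcscpy_s(filePath, bufferSize, this->_gestureName);
        } else {
            result = E_INVALIDARG;
        }
    }
    return result;
}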
A reimplementation of the Gesture::get_Name method, showing the implementation error.
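To see the consequence for calling code without a Kinect to hand, here is a portable stand-in that mirrors the checks recovered from the disassembly. Everything in it is an assumption made for illustration: FakeGesture is not an SDK type, the constants are local copies of the Windows values, HResult is unsigned purely to keep the hex literals simple, and std::wcsncpy replaces wcscpy_s so that the sketch builds outside of MSVC.

```cpp
#include <cassert>
#include <cstdint>
#include <cwchar>

// Local stand-ins for the Windows definitions.
using HResult = std::uint32_t;
constexpr HResult kOk = 0;                 // S_OK
constexpr HResult kInvalidArg = 0x80070057; // E_INVALIDARG
constexpr HResult kPointer = 0x80004003;    // E_POINTER
constexpr std::uint32_t kMaxPath = 260;     // MAX_PATH

// Hypothetical gesture that applies the same checks as the reconstruction.
class FakeGesture {
public:
    explicit FakeGesture(const wchar_t* name) : name_(name) {}

    HResult get_Name(std::uint32_t bufferSize, wchar_t* out) const {
        if (!out) {
            return kPointer;
        }
        out[0] = L'\0'; // mirrors `mov [r8], bx` ahead of the size check
        if (bufferSize < kMaxPath) {
            return kInvalidArg;
        }
        std::wcsncpy(out, name_, bufferSize - 1); // portable substitute for wcscpy_s
        out[bufferSize - 1] = L'\0';
        return kOk;
    }

private:
    const wchar_t* name_;
};

// A 14-character buffer is rejected; a MAX_PATH-sized one succeeds.
void demo() {
    const FakeGesture gesture(L"SteerProgress");

    wchar_t exact_fit[14];
    assert(gesture.get_Name(14, exact_fit) == kInvalidArg);

    wchar_t path_sized[kMaxPath];
    assert(gesture.get_Name(kMaxPath, path_sized) == kOk);
    assert(std::wcscmp(path_sized, L"SteerProgress") == 0);
}
```

In real code the practical takeaway is the same as the short version at the top: declare the name buffer with at least MAX_PATH elements and pass that size to get_Name.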