NVIDIA PTX generation and modification
Not sure how low level it is compared to the hardware (likely undocumented?), but it can be generated from C/C++-like GPU languages, modified and reused in a few ways: 
- OpenCL - clGetProgramInfo(program, CL_PROGRAM_BINARIES+- clCreateProgramWithBinary: minimal runnable example: How to use clCreateProgramWithBinary in OpenCL?
 - These are standardized OpenCL API's, which produce and consume implementation defined formats, which in driver version 375.39 for Linux happens to be human readable PTX. - So you can dump the  PTX, modify it, and reload. 
- nvcc: can compile CUDA  GPU-side code to ptx assembly simply with either:
 - nvcc --ptx a.cu
 - nvcccan also compile OpenCL C programs containing both device and host code: Compile and build .cl file using NVIDIA's nvcc Compiler? but I could not find how to get the ptx out with nvcc. Which kind of makes sense since it is just plain C + C strings, and not a magic C superset. This is also suggested by: https://arrayfire.com/generating-ptx-files-from-opencl-code/
 - And I'm not sure how to recompile the modified PTX and use it as I did with - clCreateProgramWithBinary: How to compile PTX code
 
Using clGetProgramInfo, an input CL kernel:
__kernel void kmain(__global int *out) {
    out[get_global_id(0)]++;
}
gets compiled to some PTX like:
//
// Generated by NVIDIA NVVM Compiler
//
// Compiler Build ID: CL-21124049
// Cuda compilation tools, release 8.0, V8.0.44
// Based on LLVM 3.4svn
//
.version 5.0
.target sm_20
.address_size 64
    // .globl   _Z3incPi
.visible .entry _Z3incPi(
    .param .u64 _Z3incPi_param_0
)
{
    .reg .pred  %p<2>;
    .reg .b32   %r<4>;
    .reg .b64   %rd<5>;
    ld.param.u64    %rd1, [_Z3incPi_param_0];
    mov.u32     %r1, %ctaid.x;
    setp.gt.s32 %p1, %r1, 2;
    @%p1 bra    BB0_2;
    cvta.to.global.u64  %rd2, %rd1;
    mul.wide.s32    %rd3, %r1, 4;
    add.s64     %rd4, %rd2, %rd3;
    ldu.global.u32  %r2, [%rd4];
    add.s32     %r3, %r2, 1;
    st.global.u32   [%rd4], %r3;
BB0_2:
    ret;
}
Then if for example you modify the line:
add.s32     %r3, %r2, 1;
to:
add.s32     %r3, %r2, 2;
and reuse the PTX modified, it actually increments by 2 instead of 1 as expected.