// I think that the optimizer removes the for loop:
// for _ in 0..10000 {
// asm::nop();
// }
// Thus the end variable is set right after the start variable, so the measured time difference is
// very small. But the speedup is nowhere near 10000 times, because reading the cycle count with
// DWT::get_cycle_count() takes much longer than one iteration of the loop.
//
// The unoptimized version has to push and pop all the registers every time the nop function is
// called. The optimized version does not have to do that, because it identifies that most of the
// registers are not used by the nop function. Thus it is ~68 times faster, because there are 32
// registers.