Flame graph of CPU implementation

Most of the computation time is spent in the cumsum operation inside get_eta and get_h. The input tensor size was increased by 2 dimensions for easier comparison with the original implementation.

To see the timing details, right-click the image and choose ‘Open image in new tab’.
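For readers who want to check a similar hotspot without a flame graph tool, the following is a minimal sketch that uses only Python's standard-library profiler. The bodies of `get_eta` and `get_h` here are hypothetical stand-ins that merely share names with the real functions, and a pure-Python `cumsum` built on `itertools.accumulate` is assumed in place of the tensor cumsum. It only illustrates how time inside a shared helper gets attributed to its callers.

```python
import cProfile
import io
import pstats
from itertools import accumulate


def cumsum(xs):
    # Pure-Python cumulative sum; a stand-in for the tensor cumsum.
    return list(accumulate(xs))


def get_eta(xs):
    # Hypothetical stand-in: not the real get_eta, it only shares the name.
    return [v * 0.5 for v in cumsum(xs)]


def get_h(xs):
    # Hypothetical stand-in: not the real get_h, it only shares the name.
    return [v + 1.0 for v in cumsum(xs)]


def forward(xs):
    # Calls both helpers, so cumsum runs twice per forward pass.
    return get_eta(xs), get_h(xs)


def profile_forward(n=200_000):
    # Profile one forward pass and return the report as text,
    # sorted by cumulative time per function.
    xs = [1.0] * n
    profiler = cProfile.Profile()
    profiler.enable()
    forward(xs)
    profiler.disable()
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    stats.sort_stats("cumulative").print_stats(10)
    return out.getvalue()


report = profile_forward()
print(report)
```

In the printed table, the `cumtime` column for `get_eta` and `get_h` includes the time spent in `cumsum`, while the `tottime` column for `cumsum` shows the time spent in the helper itself. A flame graph presents the same caller and callee relationship visually.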