Labels: Performance, cuIO, good first issue
Description
Currently, the CSV reader kernels launch one thread per row, so the number of blocks grows with the number of rows in the input, regardless of how large the file is. Using a grid-stride loop would allow these kernels to launch with a preset number of blocks, even for large inputs.
This applies to both the parser and the data inference kernels.
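A minimal sketch of the grid-stride pattern in CUDA, assuming a simplified per-row task. The kernel below only computes each row's length from an array of row offsets as a stand-in for real parsing or inference work. The names `row_lengths_kernel` and `row_lengths_on_gpu` and the "4 blocks per SM" grid size are hypothetical illustrations, not cuIO code. The key point is that the grid size is fixed and each thread loops over rows with a stride equal to the total thread count.

```cuda
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for per-row CSV work: compute each row's length from
// row offsets. Row i spans [row_offsets[i], row_offsets[i + 1]).
__global__ void row_lengths_kernel(const int* row_offsets, int* row_lengths, std::size_t num_rows)
{
  // Grid-stride loop: start at this thread's global index and advance by the
  // total number of threads in the grid, so a fixed grid covers any num_rows.
  std::size_t const stride = static_cast<std::size_t>(blockDim.x) * gridDim.x;
  for (std::size_t row = static_cast<std::size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
       row < num_rows;
       row += stride) {
    row_lengths[row] = row_offsets[row + 1] - row_offsets[row];
  }
}

// Hypothetical host wrapper: launches a preset number of blocks (an assumed
// heuristic of 4 blocks per SM) instead of one thread per row.
// Error checking is omitted for brevity.
std::vector<int> row_lengths_on_gpu(const std::vector<int>& offsets)
{
  std::size_t const num_rows = offsets.empty() ? 0 : offsets.size() - 1;
  std::vector<int> lengths(num_rows);
  if (num_rows == 0) return lengths;

  int device = 0, num_sms = 0;
  cudaGetDevice(&device);
  cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, device);
  constexpr int block_size = 256;
  int const num_blocks = num_sms > 0 ? 4 * num_sms : 1;  // independent of num_rows

  int* d_offsets = nullptr;
  int* d_lengths = nullptr;
  cudaMalloc(&d_offsets, offsets.size() * sizeof(int));
  cudaMalloc(&d_lengths, num_rows * sizeof(int));
  cudaMemcpy(d_offsets, offsets.data(), offsets.size() * sizeof(int), cudaMemcpyHostToDevice);

  row_lengths_kernel<<<num_blocks, block_size>>>(d_offsets, d_lengths, num_rows);

  // A blocking cudaMemcpy on the default stream waits for the kernel to finish.
  cudaMemcpy(lengths.data(), d_lengths, num_rows * sizeof(int), cudaMemcpyDeviceToHost);
  cudaFree(d_offsets);
  cudaFree(d_lengths);
  return lengths;
}
```

With this structure, a million-row input is handled by the same grid as a four-row input: threads whose starting index is past `num_rows` simply skip the loop, and the remaining threads each process several rows.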