Fix CSVReader chunk/thread orchestration and mmap error handling #282
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Pull Request: Fix CSVReader chunk/thread orchestration and mmap error handling
This PR fixes two user-visible failure modes when parsing large CSV files (10MB+) on Linux:
std::terminateduring file parsing, due tostd::error_codebeing thrown from a worker thread (mmap path).row["ets"]becomes a multi-line, non-numeric string (stream path), often seen downstream as “ets not a number”.Related issues
std::error_code→std::terminate)CSVReader(std::istream&, fmt)corruptsrow["ets"]around chunk boundary)User impact
A) mmap / filename path
Typical user code:
Before this PR:
terminate called after throwing an instance of 'std::error_code'std::thread, so user try/catch cannot intercept it.After this PR:
std::system_errorwith context:B) stream path (
CSVReader(std::istream&, fmt))Typical user code:
Before this PR:
row["ets"].get_sv()becomes a multi-line string containing commas and\n, even embedding the next record.After this PR:
etsstring at the problematic index.Root causes
1) Throwing
std::error_codedirectly inMmapParser::next()In
include/internal/basic_csv_parser.cpp, mmap errors were reported via a rawstd::error_codeand thrown directly.This is problematic because:
std::error_codeis not derived fromstd::exceptionstd::threadentry point, the program callsstd::terminate2) Exceptions escaping the CSVReader worker thread
CSVReaderperforms chunk reads in a worker thread (CSVReader::read_csv). Previously it did not catch exceptions, so any parsing/mmap failure could escape the thread and triggerstd::terminate.3) Chunk transition ordering / scheduling race in
CSVReader::read_row()When
recordsbecomes empty,read_row()decides whether to wait, join the worker, or start a new worker based onrecords->is_waitable()andparser->eof().The previous ordering allowed a window where:
is_waitable()did not reflect the correct stateeof()could be checked beforejoin()(missing a strict happens-before), resulting in an incorrect decision and a non-deterministic call sequence into the parserThis manifested as corrupted field boundaries in the stream path around the chunk boundary (e.g.,
etsswallowing the remainder of a row plus the next line).What changed
A) MmapParser: throw
std::system_errorwith contextstd::error_codetostd::system_errorfile / offset / lengthin the error message to makeEINVAL(22)and similar failures diagnosablemmap_poson failureFiles:
include/internal/basic_csv_parser.cppB) CSVReader: propagate worker-thread exceptions to the caller
CSVReader::read_csv()std::exception_ptrjoin()(inread_row(),begin(),initial_read())Files:
include/internal/csv_reader.cppinclude/internal/csv_reader.hppinclude/internal/csv_reader_iterator.cppC) CSVReader: fix chunk transition ordering / race
records.empty()branch: join → rethrow → eof-check → start next workerrecordsas waitable before starting the worker to close the “thread created but waitable still false” windowFiles:
include/internal/csv_reader.cppD) Regenerate single-include headers
This project ships both:
include/single_include/andsingle_include_test/After changing internal sources, the single-header outputs must be regenerated so that users including
single_include/csv.hppget the same fixes.Files:
single_include/csv.hppsingle_include_test/csv.hppGeneration command:
Verification (local)
On a ~13MB / ~79k-line / 32-column CSV dataset:
std::system_error(catchable).etsaroundn≈61640now shows numericets(no newline embedded).TA2601.csv