Fast Csv How to Read and Write From Stream at the Same Time
I recently had to undertake pre-processing on a CSV file with NodeJS+Typescript before ingesting it into a system.
The CSV file in question presents a number of challenges:
- The CSV file is large, at ~125k rows
- Includes a header row, but individual headers need to be renamed
- There are redundant columns to remove
- There may be additional columns we don't know about that also need to be dropped
- The columns need reordering
- Blank lines must be skipped
Via a quick Google I found fast-csv.
An initial & superficial look at fast-csv highlights a few qualities that make it promising enough to explore further:
- It is still actively developed (at the time of this post), giving some assurance around bug fixes
- Uses the friendly MIT open source license
- Has no runtime dependencies, minimizing any downstream license issues
In looking at the feature set, fast-csv comprises 'parse' and 'format' routines for ingesting and transforming CSV files. It also supports streams for fast processing of large files. The following describes how I made use of fast-csv features to meet the above requirements.
To start with, here's the initial CSV file we will ingest (note the blank line to be skipped):

```
beta,alpha,redundant,charlie,delta

betaRow1,alphaRow1,redundantRow1,charlieRow1,deltaRow1
betaRow2,alphaRow2,redundantRow2,charlieRow2,deltaRow2
betaRow3,alphaRow3,redundantRow3,charlieRow3,deltaRow3
```
Our goal is to rename and reorder the columns, drop the blank line, drop the 'redundant' column, and our program should also be able to drop the 'delta' column, which it won't know about at all. The final output should look like:

```
NewAlpha,NewBeta,NewCharlie
alphaRow1,betaRow1,charlieRow1
alphaRow2,betaRow2,charlieRow2
alphaRow3,betaRow3,charlieRow3
```
The following code shows the solution:

```typescript
import * as fs from 'fs';
import * as csv from 'fast-csv';

const inputFile = __dirname + '/../sample-data/input.csv';
const outputFile = __dirname + '/../sample-data/output.csv';

(async function () {
  const writeStream = fs.createWriteStream(outputFile);

  const parse = csv.parse({
    ignoreEmpty: true,
    discardUnmappedColumns: true,
    headers: ['beta', 'alpha', 'redundant', 'charlie'],
  });

  const transform = csv
    .format({ headers: true })
    .transform((row) => ({
      NewAlpha: row.alpha, // reordered
      NewBeta: row.beta,
      NewCharlie: row.charlie,
      // redundant is dropped
      // delta is not loaded by parse() above
    }));

  const stream = fs
    .createReadStream(inputFile)
    .pipe(parse)
    .pipe(transform)
    .pipe(writeStream);
})();
```
In explaining the solution:
parse() options
- ignoreEmpty takes care of skipping the blank line(s)
- discardUnmappedColumns will drop any columns we don't specify in the following headers option, taking care of dropping the 'delta' column
- headers maps the columns we are loading. Note how I've used discardUnmappedColumns to drop 'delta' but I'm still loading 'redundant'. The 'redundant' column is dropped in the format() options described next
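To make the parse() behaviour concrete, here's a small illustration in plain TypeScript (this is *not* fast-csv internals, just a sketch of what ignoreEmpty, discardUnmappedColumns, and headers conceptually do to each raw line of our sample file):

```typescript
// Sketch only: the effect of the parse() options on one raw CSV line.
const headers = ['beta', 'alpha', 'redundant', 'charlie']; // 5th column ('delta') is unmapped

function mapRow(rawLine: string): Record<string, string> | null {
  if (rawLine.trim() === '') return null; // ignoreEmpty: blank lines are skipped
  const cells = rawLine.split(',');
  const row: Record<string, string> = {};
  // discardUnmappedColumns: cells beyond the headers array are simply dropped
  headers.forEach((h, i) => (row[h] = cells[i]));
  return row;
}

const row = mapRow('betaRow1,alphaRow1,redundantRow1,charlieRow1,deltaRow1');
// row → { beta: 'betaRow1', alpha: 'alphaRow1', redundant: 'redundantRow1', charlie: 'charlieRow1' }
```

Note how 'delta' never makes it into the row object, while 'redundant' does (for now).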
format() options
- headers directs the output to include the header row
- The transform() row post-processor allows us to reorder the columns, rename the columns, and also drop the 'redundant' column
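The row mapping inside transform() is just an ordinary function from one object shape to another, so it can be illustrated (and unit tested) in isolation. A minimal sketch, using a hypothetical ParsedRow type matching the headers we configured:

```typescript
// Sketch only: the row mapping used in the transform() call above.
type ParsedRow = { beta: string; alpha: string; redundant: string; charlie: string };

const toOutputRow = (row: ParsedRow) => ({
  NewAlpha: row.alpha, // renamed and moved to first position
  NewBeta: row.beta,
  NewCharlie: row.charlie,
  // 'redundant' is simply never referenced, so it is dropped
});

const out = toOutputRow({
  beta: 'betaRow1',
  alpha: 'alphaRow1',
  redundant: 'redundantRow1',
  charlie: 'charlieRow1',
});
// out → { NewAlpha: 'alphaRow1', NewBeta: 'betaRow1', NewCharlie: 'charlieRow1' }
```

Because object key order is preserved in JavaScript, listing NewAlpha first is what reorders the output columns.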
With a larger CSV file in hand, testing shows the above routine can process ~125k rows with 126 columns, from a file of approx 135MB in size, in ~19 seconds on my 3.2GHz i7 MBP.
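As a rough back-of-envelope check on those figures (all numbers approximate, from the test above):

```typescript
// Approximate throughput from the quoted test run: ~125k rows, ~135MB, ~19s.
const rows = 125_000;
const megabytes = 135;
const seconds = 19;

const rowsPerSec = Math.round(rows / seconds);      // ≈ 6579 rows/s
const mbPerSec = +(megabytes / seconds).toFixed(1); // ≈ 7.1 MB/s
```

Respectable for a single-threaded Node stream pipeline that is parsing, transforming, and re-serializing every row.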
fast-csv indeed.
Source: https://dev.to/chriscmuir/fast-csv-for-csv-files-21a1