That moment when you spend all weekend generating a whole lotta data, only to realise there was an error in the input set and everything you created is invalid and needs to be thrown away…

I generated 1.1TB of string data for a project, overnight. It’s just one big text file on a disk. Now I just have to grep through it to find the particular patterns I need… that 1.1TB will probably come down to 500-600GB by the end of it, but I can see the pattern-matching process taking the rest of the weekend…
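
An aside on speed, since a single pass over 1.1TB can take hours: when the data is plain ASCII, GNU grep is often much faster in the C locale than in a UTF-8 one, so it’s worth prefixing the long-running searches with LC_ALL=C (SOME_PATTERN below is just a placeholder):

$ LC_ALL=C grep -E "SOME_PATTERN" input.csv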

Python and command-line utilities have been super useful for generating this data, and definitely helped the process along. As a reminder to myself, these are the commands I’m using to “post-process” the data:

Look for lines in input.csv which don’t match this pattern (a three-letter [A-K] group immediately repeated after a comma) and write them to output.csv:

$ grep -vE "([A-K]{3}),\1" input.csv > output.csv
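
To make the pattern concrete, here’s a toy run on made-up lines (not from the real data set). The backreference \1 matches a repeat of the captured three-letter group, and -v inverts the match, so the lines with a repeated group are the ones that get dropped. One caveat: backreferences inside an -E pattern are a GNU extension to POSIX ERE, so this assumes GNU grep:

$ printf 'ABC,ABC\nABC,DEF\n' | grep -vE "([A-K]{3}),\1"
ABC,DEF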

Split output.csv into files 800MB in size, called data_n, where n is an 8-digit incremental number starting from zero (e.g. data_00000000):

$ split -a 8 -d -b 800M output.csv data_
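
Two notes on GNU split’s flags here: -d makes the suffixes numeric, counting up from zero, and -b 800M actually means 800MiB (800 × 1024 × 1024 bytes; the MB suffix would give decimal megabytes). So the first few chunks come out named like this:

$ ls data_* | head -3
data_00000000
data_00000001
data_00000002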

For each data file in the directory, give it the .csv extension:

$ for f in data_*; do mv "$f" "$f.csv"; done
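
A safety check worth doing before a bulk rename like this: echo the mv commands first, and only remove the echo once the output looks right:

$ for f in data_*; do echo mv "$f" "$f.csv"; done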