Half the people in my office have been cited for jury duty in the last 3-4 months, and I guess today was my turn!
I appreciate you didn’t want to ask me my name for a third time. You tried.
That moment when you spend all weekend generating a whole lotta data, only to realise there was an error in the input set and everything you created is invalid and needs to be thrown away…
Reposting: a tweet
I generated 1.1TB of string data for a project, overnight. It’s just one big text file on a disk. Now I just have to grep through it to find the particular patterns I need… that 1.1TB will probably come down to 500-600GB by the end of it, but I can see the pattern-matching process taking the rest of the weekend…
Python and command-line utilities have been super useful for generating this data, and definitely helped the process along. As a reminder to myself, these are the commands I’m using to “post-process” the data:
Look for lines in input.csv which don’t match this pattern, and echo them to output.csv:
$ grep -vE "([A-K]{3}),\1" input.csv > output.csv
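A quick way to sanity-check that pattern before letting it loose on the full file is to run it on a handful of rows. This is only a sketch: sample.csv, filtered.csv and the four rows are made up for illustration, and it assumes a grep that accepts backreferences with -E (GNU grep does; strict POSIX extended regexes do not guarantee it).

```shell
# Hypothetical sample rows, not taken from the real input.csv.
grepdir=$(mktemp -d)
printf '%s\n' 'ABC,ABC,1' 'ABC,DEF,2' 'KKK,KKK,3' 'XYZ,XYZ,4' > "$grepdir/sample.csv"

# -E: extended regex; -v: keep only the lines that do NOT match.
# ([A-K]{3}),\1 matches three letters in A-K, a comma, then the same three letters again.
grep -vE '([A-K]{3}),\1' "$grepdir/sample.csv" > "$grepdir/filtered.csv"

# Prints ABC,DEF,2 and XYZ,XYZ,4: the first differs across the comma,
# the second repeats but uses letters outside A-K.
cat "$grepdir/filtered.csv"
```

The pattern is not anchored, so it will drop a line if the repeated triple shows up anywhere in it, not just in the first two columns.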
Split output.csv into files of 800MB each (with GNU split the M suffix means 1024×1024 bytes), called data_n, where n is an 8-digit incremental number starting from zero (e.g. data_00000000, data_00000001):
$ split -a 8 -d -b 800M output.csv data_
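To see what those flags produce without waiting on hundreds of gigabytes, the same command can be tried on a tiny file. A minimal sketch, assuming GNU split (for -d) and using a made-up 2500-byte stand-in for output.csv with 1000-byte pieces instead of 800M:

```shell
# Hypothetical stand-in for output.csv: 2500 bytes of filler.
splitdir=$(mktemp -d)
head -c 2500 /dev/zero | tr '\0' 'x' > "$splitdir/output.csv"

# -a 8: eight-character suffix; -d: numeric suffixes; -b: size of each piece in bytes.
split -a 8 -d -b 1000 "$splitdir/output.csv" "$splitdir/data_"

# Lists data_00000000, data_00000001, data_00000002 (plus output.csv itself).
ls "$splitdir"

# The pieces concatenate back to the original, byte for byte.
cat "$splitdir"/data_* | cmp - "$splitdir/output.csv"
```

One thing worth knowing: -b cuts on byte boundaries, so a CSV row can end up straddling two files. If that matters, GNU split also has -C (--line-bytes), which keeps whole lines together while staying under the size limit.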
For each data file in the directory, give it the .csv extension:
$ for f in data*; do mv "$f" "$f.csv"; done
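One catch with a data* glob is that running the rename a second time also picks up the files that already end in .csv and turns them into .csv.csv. A possible rerun-safe variant is sketched below; the temp directory and the two empty files are invented purely to show the behaviour.

```shell
# Hypothetical directory with one fresh piece and one already-renamed piece.
renamedir=$(mktemp -d)
touch "$renamedir/data_00000000" "$renamedir/data_00000001.csv"

for f in "$renamedir"/data_*; do
  case "$f" in
    *.csv) continue ;;   # already has the extension, leave it alone
  esac
  mv -- "$f" "$f.csv"
done

# Lists data_00000000.csv and data_00000001.csv, with no .csv.csv in sight.
ls "$renamedir"
```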