![]() |
version seven.   http://demongin.org |
The Simplest Possible CSV Batcher
How to split a CSV file on a column -or- a post in which I play a little "python golf" against myself.
Wednesday, 2011-05-25 | Careerism, Programming
| The whole idea of drama is three things:
|
| David Mamet |
So a client dropped a monster, 18MB CSV file on me.
At work, we've got tools for analyzing CSV files--breaking them up, organizing columns, etc.--but they were choking on the file because of its (kind of) outrageous size.
So, I decided to take my file to the command line, fly out there for the day and straighten it out. Just to keep things interesting, I decided to see how short I could keep the script:
#!/usr/bin/env python import csv batch_suffix = "_batch.csv" # for naming output files split_on = 2 # the column to batch on path_to_file = "reports.csv" # c = file(path_to_file,"r").readlines() csv_reader = csv.reader(c) for r in csv_reader: fh = file(r[split_on - 1] + batch_suffix,"a") csv_writer = csv.writer(fh) csv_writer.writerow(r)
Alternately, you could import sys, swap out the path_to_file routine for a sys.argv[1] type of deal and run the thing interactively. In such a case, you'd get something like this:
toconnell@kumiko:~/whatever$ ls -ltar total 18268 drwxr-xr-x 7 toconnell toconnell 4096 2011-05-25 15:56 .. -rw-r--r-- 1 toconnell toconnell 18691905 2011-05-25 16:01 reports.csv -rwxr-xr-x 1 toconnell toconnell 365 2011-05-25 18:17 csv_split.py drwxr-xr-x 2 toconnell toconnell 4096 2011-05-25 18:30 . toconnell@kumiko:~/whatever$ ./csv_split.py reports.csv toconnell@kumiko:~/whatever$ ls -ltar total 36468 drwxr-xr-x 7 toconnell toconnell 4096 2011-05-25 15:56 .. -rw-r--r-- 1 toconnell toconnell 18691905 2011-05-25 16:01 reports.csv -rwxr-xr-x 1 toconnell toconnell 365 2011-05-25 18:17 csv_split.py -rw-r--r-- 1 toconnell toconnell 1742831 2011-05-25 18:30 COLUMN_VALUE_1_batch.csv -rw-r--r-- 1 toconnell toconnell 2080895 2011-05-25 18:30 COL_VAL_2_batch.csv -rw-r--r-- 1 toconnell toconnell 1250599 2011-05-25 18:30 THING_batch.csv ...
...and so on.
