The Now Platform® Washington DC release is live. Watch now!

Help
cancel
Showing results for 
Show  only  | Search instead for 
Did you mean: 
SlightlyLoony
Tera Contributor

Today we're going to take the things you've already learned, and apply them to something a bit more like the real world. Let's say we've imported that old standby, a .csv file (comma-separated values, with optional quoting of strings containing commas or quotes) — a .csv file with a set of line items on an invoice. Our sample data, from the Dweezle-Dwee Government Surplus Store, looks like this:


dat,qty,itm,amt,ext,cur,typ
10/14/2011,1,"Meat Cleaver, Sharp",47.8,47.8,dollars,purchase
10/12/2011,10,Ballpoint Pen,0.44,4.4,dollars,purchase
7/4/2011,3,"Space Shuttle, Used",-500,-1500,dollars,rma
8/21/2011,1,"Country, ""Greece""",1,1,euros,purchase
9/30/2011,5,"Computer, PC, Obsolete",169.99,849.95,dollars,purchase

Your job is to parse that mess into a nice JavaScript object, ready for some code to deal with it. Our accountant (at right) is eagerly anticipating your success...

Most of this is very straightforward, boring regex stuff. The one challenging bit is the "item" column — it's contents might or might not be quoted. If it is quoted, we don't want those quotes in our result. Here's what we came up with:

var test = 'dat,qty,itm,amt,ext,cur,typ\n';
test += '10/14/2011,1,"Meat Cleaver, Sharp",47.8,47.8,dollars,purchase\n';
test += '10/12/2011,10,Ballpoint Pen,0.44,4.4,dollars,purchase\n';
test += '7/4/2011,3,"Space Shuttle, Used",-500,-1500,dollars,rma\n';
test += '8/21/2011,1,"Country, ""Greece""",1,1,euros,purchase\n';
test += '9/30/2011,5,"Computer, PC, Obsolete",169.99,849.95,dollars,purchase\n';
var result = parse(test);
JSUtil.logObject(result);

function parse(csv) {
var parser = /^(\d+)\/(\d+)\/(\d+),(\d+),((?:"(?:(?:""|[^"])+)")|(?:[^,]+)),(-?[\d.]+),(-?[\d.]+),([^,]+),([^,]+?)$/gm;
var result = [];
var x;
while ((x = parser.exec(csv)) != null) {
var item = {};
item.date = x[3] + '/' + x[1] + '/' + x[2];
item.quantity = x[4] - 0;
item.description = x[5].replace(/^"/, '').replace(/"$/, '').replace(/""/g,'"');
item.amount = x[6] - 0;
item.extended = x[7] - 0;
item.currency = x[8];
item.type = x[9];
result.push(item);
}
return result;
}

When we run this test code, here's the result:

Array of 5 elements
[0]: Object
quantity: number = 1
description: string = Meat Cleaver, Sharp
type: string = purchase
amount: number = 47.8
date: string = 2011/10/14
currency: string = dollars
extended: number = 47.8
[1]: Object
quantity: number = 10
description: string = Ballpoint Pen
type: string = purchase
amount: number = 0.44
date: string = 2011/10/12
currency: string = dollars
extended: number = 4.4
[2]: Object
quantity: number = 3
description: string = Space Shuttle, Used
type: string = rma
amount: number = -500
date: string = 2011/7/4
currency: string = dollars
extended: number = -1500
[3]: Object
quantity: number = 1
description: string = Country, "Greece"
type: string = purchase
amount: number = 1
date: string = 2011/8/21
currency: string = euros
extended: number = 1
[4]: Object
quantity: number = 5
description: string = Computer, PC, Obsolete
type: string = purchase
amount: number = 169.99
date: string = 2011/9/30
currency: string = dollars
extended: number = 849.95

That JavaScript object is a lot easier than the raw .csv file text for other code to work with.

Let's take it one piece at a time...

The first thing to note is that the column names don't appear in the result. The reason is very simple: that big ol' regex doesn't match on that line. The regex is looking for dates and numbers that don't appear in the header; no match, no output.

Now note that we have two flags after the regex: g (for global), and m for multiline. This time we really need the multiline flag. Our regex starts with a "^" (beginning of line) and ends with a "$" (end of line), and we need those to be interpreted as the beginning and end of a line, not of the entire text.

The part of the regex that parses the date looks like this: (\d+)\/(\d+)\/(\d+),. This should be easy for you — we're looking for digits, a slash, digits, a slash, digits, and a comma. The slashes ("/") have to be escaped because the slash character is a metacharacter that brackets the regex literal. If we don't escape it, JavaScript will get very confused and give you hateful error messages when you try to run it. Note that we have three separate capture groups here, one for each group of digits (month, day, year). In the code where we're using the date information, we're reordering it (to year, month, day):
item.date = x[3] + '/' + x[1] + '/' + x[2];

Next comes the bit that parses the quantity: (\d+),. At this point, that should be quite obvious to you!

Now we get the tricky bit: the possibly-quoted item description. The piece of the regex that deals with this is: ((?:"(?:(?:""|[^"])+)")|(?:[^,]+)),. If you look closely, there are really two separate things going on there, with one or the other working: one for quoted strings ((?:"(?:(?:""|[^"])+)")) and one for unquoted strings ((?:[^,]+)). These two bits are surrounded by (a|b), (where the a and b stand for our two parts), which just says to capture one or the other and look for a comma after it. Regexes try alternatives like this from left to right, so it will first look for the quoted string, then if that fails, it will look for the unquoted string.

The quoted string piece is a little tricky all by itself. It's easy to see that it's looking for a beginning and ending quote, but what is that stuff in the middle ((?:(?:""|[^"])+))? Well, it's looking for one or more of either a pair of quotes or a non-quote character. This little trick lets it match on the doubled quotes that Microsoft used to specify a quote within a quoted string. We don't want those to match as the end-of-string, and this little trick neatly prevents that.

The unquoted string piece is easy: it's just looking for a sequence of characters that don't include a comma.

In the JavaScript code you'll see this line for dealing with the item description:

item.description = x[5].replace(/^"/, '').replace(/"$/, '').replace(/""/g,'"');

This obscure-looking line first replaces any leading quote with nothing, then any trailing quote with nothing, and finally any double quotes with a single quote. That scrubs the quotes and makes the string look like what we really expect it to be.

Next we're looking for the amount and extended amount (quantity times amount). These both use the same regex: (-?[\d.]+),. The only new thing for you here is the -?, which tells the regex to match on either 0 or 1 dashes. This matches the optional minus sign (for a negative number) that might precede the digits. The rest of it just matches any combination of digits and decimal points (periods). Note that inside a character class, the period is not a metacharacter, so we don't need to escape it.

The last bits (currency and type) are just looking for an unquoted string. But the very last one (type) has one very slightly tricky bit to it: ([^,]+?). The tricky bit is the question mark near the end. In this case, it's indicating that the quantifier (the plus sign) should be reluctant. Why do we need that here? It actually only matters on the very last line of the input text. In that case, the line is terminated with a newline (the \n). If we didn't have the question mark in there, the last line would match on the end of text instead of the end of line (like all the other lines did), just because with a greedy quantifier, it could. Matching at the end of text would mean that the newline itself would be included in the result, which we don't want. By making that quantifier reluctant, it will stop matching at the newline and won't include it in the result.

And our accounting girl is overjoyed...