Wednesday, May 02, 2007

Taking out the Slow Parts

Sage Timberline Office, like many accounting systems, allows you to customize the captions that appear on-screen for most fields. To make it easier for the user, it makes sense for Stonefield Query for Sage Timberline Office, the newest member of the Stonefield Query family of products, to display the user's own captions. However, one complication is that the captions aren't stored in a manner that's easily readable. Instead, we have to call a COM object to retrieve them. This object returns XML specifying the name of each field and its caption. Given that there are tens of thousands of fields in Timberline, this XML can be over 1 MB in size. Here's what the XML looks like:

<captions>
<newrecord>
<tablename>SomeTableName</tablename>
<fieldname>SomeFieldName</fieldname>
<fieldcaption>Caption for Field</fieldcaption>
</newrecord>
...
</captions>
We want to use this XML to update the Stonefield Query data dictionary. Our first attempt used XMLTOCURSOR() to put the XML into a cursor, which we then process. It took somewhere around 30 seconds, which seems like forever because it happens every time the user starts Stonefield Query. XMLTOCURSOR() uses the XML DOM object, which gets dramatically slower as the XML gets larger.
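For reference, that first attempt amounts to a single call (a minimal sketch; the flag value assumes lcXML holds the XML as a string rather than a file name):

```foxpro
* First attempt (sketch): let XMLTOCURSOR() parse the XML via the DOM.
* The 512 flag tells XMLTOCURSOR() the first parameter is an XML
* string rather than the name of an XML file.
xmltocursor(lcXML, '__CAPTIONS', 512)
* __CAPTIONS now has one record per <newrecord> element;
* scan it to update the data dictionary.
```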

The next approach was to manually parse the XML using STREXTRACT(). STREXTRACT() is the perfect function for parsing XML because it was designed to find text between delimiters. If you know the XML structure, it's really easy to use. Here's the code we used:

create cursor __CAPTIONS (FIELDNAME C(119), CAPTION C(60))
for lnI = 1 to occurs('<tablename>', lcXML)
	lcTable = strextract(lcXML, '<tablename>', '</tablename>', lnI)
	lcField = strextract(lcXML, '<fieldname>', '</fieldname>', lnI)
	lcCaption = strextract(lcXML, '<fieldcaption>', '</fieldcaption>', lnI)
	insert into __CAPTIONS values (lcTable + '.' + lcField, lcCaption)
next lnI
Interestingly, this turned out to be way slower than XMLTOCURSOR() (in fact, I stopped it after a couple of minutes, so I don't know how long it would've taken).

It occurred to me while looking at the XML that if we could convert it into comma-delimited text, we could use APPEND FROM to suck it into the cursor. Here's the code for that:

#define ccCRLF chr(13) + chr(10)
* Strip the tags and insert delimiters so each record becomes a
* comma-delimited line: TableName.FieldName,"Caption"
lcXML = strtran(lcXML, '<tablename>')
lcXML = strtran(lcXML, '</tablename>' + ccCRLF + '<fieldname>', '.')
lcXML = strtran(lcXML, '</fieldname>' + ccCRLF + '<fieldcaption>', ',"')
lcXML = strtran(lcXML, '</fieldcaption>' + ccCRLF + '</newrecord>' + ;
	ccCRLF + '<newrecord>', '"')
lcXML = strtran(lcXML, '</fieldcaption>' + ccCRLF + '</newrecord>' + ;
	ccCRLF + '</captions>', '"')
* Write everything after the XML header to a temporary file, then
* pull it into the cursor with APPEND FROM and clean up.
lcTemp = forcepath(sys(2015) + '.txt', sys(2023))
strtofile(substr(lcXML, 91), lcTemp)
create cursor __CAPTIONS (FIELDNAME C(119), CAPTION C(60))
append from (lcTemp) delimited
erase (lcTemp)
This took 0.7 seconds, 42 times faster than using XMLTOCURSOR().

But still, I wondered why the STREXTRACT() approach was so slow; my experience is that the VFP text handling functions are blindingly fast. I wondered if it had to do with the fourth parameter, the one that specifies the occurrence. I rewrote the code to this:
create cursor __CAPTIONS (FIELDNAME C(119), CAPTION C(60))
lnLines = alines(laLines, lcXML)
for lnI = 1 to lnLines
	lcLine = laLines[lnI]
	do case
		case '<tablename>' $ lcLine
			lcTable = strextract(lcLine, '<tablename>', '</tablename>')
		case '<fieldname>' $ lcLine
			lcField = strextract(lcLine, '<fieldname>', '</fieldname>')
		case '<fieldcaption>' $ lcLine
			lcCaption = strextract(lcLine, '<fieldcaption>', '</fieldcaption>')
			insert into __CAPTIONS values (lcTable + '.' + lcField, lcCaption)
	endcase
next lnI
This took 0.4 seconds. Wow! Nearly two orders of magnitude improvement over the initial attempt. It really does pay to take out the slow parts!

8 comments:

Anonymous said...

Doug,

Thank you for an excellent post. Your post 100% solved the problem I was facing.

We have a log file stored in SQL Server where the log can contain hundreds of thousands of records. To manage this mountain of data, we archive any data beyond 7 days. The data is brought from SQL Server to VFP in a cursor, translated into an XML string and then compressed and stored back in SQL Server. This process is very fast and usually only takes a matter of seconds to complete.

Now has come the time to decompress and extract the XML back to a VFP cursor to get at the archived data. However, when I attempted an XMLTOCURSOR() function, I thought I was making a mistake or that the system had hung.

I went to my smallest archive entries first and noted that the time to translate them through XMLTOCURSOR was getting longer and longer as the XML string grew in size. I did not know why until I read your post.

Realizing now what is going on, I copied your code example, re-worked it to fit our needs here and then put it in to practice.

I am happy to report that a 9.5 MB XML file is translated using the STREXTRACT method you've outlined in about 10 seconds or less. The data comes through perfectly.

Final note: On these big XML files, the XMLTOCURSOR was complaining about control characters like CHR(6). This seemed odd to me because we use CURSORTOXML() to make the conversion in the first place. So, all things being equal, I would expect the complaint to also show up on the CURSORTOXML() as well as the XMLTOCURSOR().

Using the code you've inspired, the problem does NOT show up. I really DO hope someone in Redmond will provide a little help on the matter and get the XMLTOCURSOR() more efficient AND have it work as well as the CURSORTOXML().

Regardless -- again, thank you for your hard work and your post. Without it, I would have been struggling with this for a long time to come. As it was, I spent the better portion of the last 3 days on the issue.

Larry
Marietta, GA

Doug Hennig said...

You're welcome, Larry -- glad this was of help.

Anonymous said...

Hi Doug,

Sorry for asking dumb question, but where do you get the lcXML or what does it contain? Thanks. NP

Doug Hennig said...

lcXML contains the XML shown in the first part of the entry.

Anonymous said...

Thanks Doug, how would I load that data into lcXML? NP.

Doug Hennig said...

As I mentioned, in this case it's a COM object that returns that XML. Otherwise, the XML could come from anywhere: a file, a call to a function, TEXT ... ENDTEXT statements, etc. IOW, the same way you'd get the content for any string (which is all XML really is).
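For example (the file name is illustrative; the sample content is the XML shown at the top of the post):

```foxpro
* Read the XML from a file on disk (file name is illustrative):
lcXML = filetostr('captions.xml')

* Or build it inline with TEXT ... ENDTEXT:
text to lcXML noshow
<captions>
<newrecord>
<tablename>SomeTableName</tablename>
<fieldname>SomeFieldName</fieldname>
<fieldcaption>Caption for Field</fieldcaption>
</newrecord>
</captions>
endtext
```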

Anonymous said...

Hi,

If I have an XML string that contains a collection of tables, how can I specify that only one particular table in lcXML should be checked? With the XML-to-cursor approach, you load the XML and use oXML.Tables(arrayposition).ToCursor() to specify the table you want from the collection. How can I do the same in the algorithm you specified?
Thanks,
Nazha

Doug Hennig said...

Hi Nazha.

Change the STREXTRACT() statement to extract the table name following the table tag as well.
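For instance, one way to restrict the ALINES() loop from the post to a single table (lcWantedTable is a hypothetical variable name; this is a sketch, not production code):

```foxpro
* Sketch: only insert captions for one table of interest.
lcWantedTable = 'SomeTableName'
create cursor __CAPTIONS (FIELDNAME C(119), CAPTION C(60))
lnLines = alines(laLines, lcXML)
for lnI = 1 to lnLines
	lcLine = laLines[lnI]
	do case
		case '<tablename>' $ lcLine
			lcTable = strextract(lcLine, '<tablename>', '</tablename>')
		case '<fieldname>' $ lcLine
			lcField = strextract(lcLine, '<fieldname>', '</fieldname>')
		case '<fieldcaption>' $ lcLine and lcTable == lcWantedTable
			* Only records for lcWantedTable reach the INSERT.
			lcCaption = strextract(lcLine, '<fieldcaption>', '</fieldcaption>')
			insert into __CAPTIONS values (lcTable + '.' + lcField, lcCaption)
	endcase
next lnI
```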

Doug