XML Storage

Cost-effective XML storage and querying at scale

by Aleck Kulabukhov, Managing Consultant Sydney

Introduction

XML parsing is a compute-expensive exercise. XML is quite a popular way to store unstructured data so we often face a “Big XML Dilemma” – we can either:

  • pre-parse XML, extract the required fields and store them in a most efficient and fast-to-query way (such as Data Warehouse for example) or
  • store XMLs in a NoSQL database and process them on-the-fly while running a query.

There are several major drawbacks with each approach along with the obvious benefits.

MethodProsCons
Pre-processingFast and efficient querying.Only a limited set of XML attributes is usually extracted.
XML parsing is a part of non-time critical ETL process that can be optimized/scaled.Bringing new attributes is a massive re-development effort.
 Changing XML schema can break the ETL.
Processing-on-flyXML is captured as a whole.Querying is either much slower or more expensive.
No complex ETL to extract attributes from the XML required.Limited support by the current DBMS offerings for XML column types.
 Efficient indexing is problematic, slowing down the querying even further.

A proposed approach

The approach pioneered in Altis combines benefits of both paradigms with little, if any, added drawbacks.

Let’s take an extremely simple XML document as an example (we will omit XML header information for the simplicity and focus on the document contents instead):

<Note Importance=”high” Classification=”non-sensitive” Type=”Private”>
<To Bcc=”NSA”>Mike michaelj@company1.com.au</To>
<From>Jani janik@company2.com.au</From>
<Heading>Reminder</Heading>
<Body>Fishing this weekend!</Body>
<Meta>
<Sent Time=”20191018104902″/>
<Received/>
</Meta>
</Note>

If we take the pre-processing approach, then flattening this extremely simple tiny document is not a trivial task and adding support for additional fields (marked in yellow) that can be added to this kind of document later is full of grief and pain.

The solution discussed dissects XML as a parent-child tree structure with attributes being leaves.

For example, the mentioned above XML can be presented in a tabular form:

Document IDIDParent IDItem TypeLabelValue
0000Note 
0101TypePrivate
0201Classificationnon-sensitive
0301Importancehigh
0400ToMike michaelj@company1.com.au
0541BccNSA
0600FromJani janik@company2.com.au
0700HeadingReminder
0800BodyFishing this weekend!
0900Meta 
01090Sent 
011101Time20191018104902
01290Received 

Item type is 0 for the nodes, and 1 for the attributes.

The Document ID allows us to use multiple XML documents in the same table. The root element is determined by both Parent ID and ID being equal 0.
 

Analysis

If we add the proposed approach to the analysis table presented above the benefits are quite striking:

MethodProsCons
Pre-processingFast and efficient querying.Only a limited set of XML attributes is usually extracted.
XML parsing is a part of non-time critical ETL process that can be optimized/scaled.Bringing new attributes is a massive re-development effort.
 Changing XML schema can break the ETL.
Processing-on-flyXML is captured as a whole.Querying is either much slower or more expensive.
No complex ETL to extract attributes from the XML required.Limited support by the current DBMS offerings for XML column types.
 Efficient indexing is problematic, slowing down the querying even further.
XML-flatteningXML pre-processing is done at ETL stage.Querying within a hierarchy requires more complex queries with no massive impact on efficiency.
XML is captured fully, no extracts. 
Indices can and should be added for a fast querying. 
Any XML can be stored in the same DBMS, even in the same table if needed. 
Any modern RDBMS can store the data, same applies to the Data Warehouse. 
Data Lake ready. 

Outcome

The resulting flattened data structured can be easily stored in RDBMS (or in a Data Warehouse), with full indexing for efficient querying or in a Data Lake in of the efficient storage formats (parquet, avro, orc, etc.).

Querying such structure is quite simple.

Retrieving full list of messages sent by Mike would be as simple as:

SELECT DISTINCT document_id FROM messages WHERE label=’From’ AND value=’ Mike michaelj@company1.com.au’

All messages matching ‘John’ in the From field and a word ‘fishing’ in the body:

SELECT DISTINCT document_id
FROM messages m1
JOIN messages m2 ON m1.document_id=m2.document_id
WHERE m1.label=’From’ AND POSITION(‘Mike’ IN m1.value)>0
AND m2.label=’Body’ AND POSITION(‘fishing’ IN m2.value)>0

If required, for even more efficient querying a set of materialised views can be created and easily maintained by most organisations.
 

Impact

Being able to store a potentially unlimited number of XMLs in a columnar storage system (potentially indexed if we consider RDBMS) with an ability to query it without incurring massive CPU-costs (compared to XML-processing on fly) and having access to all XML attributes at once makes a clear option for organisations acquiring/storing their as XMLs.

If you would like to know more about XML Storage or if you are interesting in using our services – contact us here.
 

Similar blog posts:

https://altisconsulting.com/au/putting-together-a-roadmap-to-guide-your-data-analytics-journey/
https://altisconsulting.com/au/greater-control-of-your-power-bi-datasets-with-the-power-bi-rest-api/

Share

Facebook
Twitter
LinkedIn
WhatsApp

One Response

Leave a Reply

Your email address will not be published.

Recent

Connect with us

If you’d like to be kept in the loop on courses, events and other related topics, simply complete your details and we’ll add you to our list.

We use cookies to improve your experience and support our mission.
Read more about it here. By using our sites, you agree to our use of cookies.