Blog

Processing XML in Python - ElementTree

Robert Anasch via Unsplash

Learn how to parse, explore, modify and populate XML files with the Python ElementTree package, using for loops and XPath expressions. As a data scientist, you’ll find that understanding XML is valuable for both web-scraping and general practice in parsing structured documents.

Extensible Markup Language (XML) is a markup language that encodes documents in a format that is both machine-readable and human-readable. Derived from SGML (Standard Generalized Markup Language), it lets us describe a document’s structure. In XML, we can define custom tags, and we can use XML as a standard format to exchange information.

  • XML documents have sections, called elements, defined by a beginning and an ending tag. A tag is a markup construct that begins with < and ends with >. The characters between the start-tag and end-tag, if there are any, are the element’s content. Elements can contain markup, including other elements, which are called “child elements”.
  • The largest, top-level element is called the root, which contains all other elements.
  • Attributes are name–value pairs that exist within a start-tag or empty-element tag. An XML attribute can have only a single value, and each attribute can appear at most once on each element.

Here’s a snapshot of movies.xml that we will be using for this tutorial:

<?xml version="1.0"?>
<collection>
    <genre category="Action">
        <decade years="1980s">
            <movie favorite="True" title="Indiana Jones: The raiders of the lost Ark">
                <format multiple="No">DVD</format>
                <year>1981</year>
                <rating>PG</rating>
                <description>
                'Archaeologist and adventurer Indiana Jones 
                is hired by the U.S. government to find the Ark of  the Covenant before the Nazis.'
                </description>
            </movie>
               <movie favorite="True" title="THE KARATE KID">
               <format multiple="Yes">DVD,Online</format>
               <year>1984</year>
               <rating>PG</rating>
               <description>None provided.</description>
            </movie>
            <movie favorite="False" title="Back 2 the Future">
               <format multiple="False">Blu-ray</format>
               <year>1985</year>
               <rating>PG</rating>
               <description>Marty McFly</description>
            </movie>
        </decade>
        <decade years="1990s">
            <movie favorite="False" title="X-Men">
               <format multiple="Yes">dvd, digital</format>
               <year>2000</year>
               <rating>PG-13</rating>
               <description>Two mutants come to a private academy for their kind whose resident superhero team must oppose a terrorist organization with similar powers.</description>
            </movie>
            <movie favorite="True" title="Batman Returns">
               <format multiple="No">VHS</format>
               <year>1992</year>
               <rating>PG13</rating>
               <description>NA.</description>
            </movie>
               <movie favorite="False" title="Reservoir Dogs">
               <format multiple="No">Online</format>
               <year>1992</year>
               <rating>R</rating>
               <description>WhAtEvER I Want!!!?!</description>
            </movie>
        </decade>    
    </genre>

    <genre category="Thriller">
        <decade years="1970s">
            <movie favorite="False" title="ALIEN">
                <format multiple="Yes">DVD</format>
                <year>1979</year>
                <rating>R</rating>
                <description>"""""""""</description>
            </movie>
        </decade>
        <decade years="1980s">
            <movie favorite="True" title="Ferris Bueller's Day Off">
                <format multiple="No">DVD</format>
                <year>1986</year>
                <rating>PG13</rating>
                <description>Funny movie on funny guy </description>
            </movie>
            <movie favorite="FALSE" title="American Psycho">
                <format multiple="No">blue-ray</format>
                <year>2000</year>
                <rating>Unrated</rating>
                <description>psychopathic Bateman</description>
            </movie>
        </decade>
    </genre>

    <!-- the Comedy genre is snipped from this snapshot -->
</collection>

Introduction to ElementTree

The XML tree structure makes navigation, modification, and removal relatively simple programmatically. Python has a built-in library, ElementTree, with functions to read and manipulate XML (and other similarly structured files).

First, import ElementTree. It’s a common practice to use the alias of ET:

import xml.etree.ElementTree as ET

Parsing XML Data

In the XML file provided, there is a basic collection of movies described. The only problem is the data is a mess! There have been a lot of different curators of this collection and everyone has their own way of entering data into the file. The main goal in this tutorial will be to read and understand the file with Python — then fix the problems.

First you need to read in the file with ElementTree.

tree = ET.parse('movies.xml')
root = tree.getroot()
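If your XML arrives as a string rather than a file (say, pulled down while web-scraping, which the intro mentions), ElementTree also offers ET.fromstring(), which parses the string and returns the root element directly, so no getroot() call is needed. A minimal sketch, with an inline snippet standing in for movies.xml:

```python
import xml.etree.ElementTree as ET

# An inline snippet standing in for movies.xml.
xml_data = "<collection><genre category='Action'/></collection>"

# ET.fromstring() parses a string and returns the root element directly.
root = ET.fromstring(xml_data)
print(root.tag)        # collection
print(root[0].attrib)  # {'category': 'Action'}
```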

Now that you have initialized the tree, you should look at the XML and print out values in order to understand how the tree is structured.

root.tag

'collection'

At the top level, you see that this XML is rooted in the collection tag.

root.attrib

{}

So the root element carries no attributes of its own.

For Loops

You can easily iterate over subelements (commonly called “children”) in the root by using a simple “for” loop.

for child in root:
    print(child.tag, child.attrib)
genre {'category': 'Action'}
genre {'category': 'Thriller'}
genre {'category': 'Comedy'}

Now you know that the children of the root collection are all genre elements. To designate the genre, the XML uses the attribute category. There are Action, Thriller, and Comedy movies according to the genre elements.

Typically it is helpful to know all the elements in the entire tree. One useful function for doing that is root.iter().

[elem.tag for elem in root.iter()]

['collection',
 'genre',
 'decade',
 'movie',
 'format',
 'year',
 'rating',
 'description',
 'movie',
 .
 .
 .
 .
 'movie',
 'format',
 'year',
 'rating',
 'description']

There is a helpful way to see the whole document: if you pass the root into the ET.tostring() function, it returns the whole document. Within ElementTree, this call takes a slightly strange form: you must specify an encoding, and then decode the resulting bytes, in order to display the document as a string.
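Sketched on a tiny inline tree so the block is self-contained (on the movies tree you would pass its root instead), the call looks like this:

```python
import xml.etree.ElementTree as ET

# A tiny inline tree standing in for the movies root.
root = ET.fromstring("<collection><genre category='Action'/></collection>")

# ET.tostring() returns bytes when an encoding is given,
# so decode the result back to str before printing.
print(ET.tostring(root, encoding='utf8').decode('utf8'))
```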

You can expand the use of the iter() function to help with finding particular elements of interest: root.iter('tag') will list all subelements under the root that match the tag specified. Here, you will list the attributes of every movie element in the tree:

for movie in root.iter('movie'):
    print(movie.attrib)

{'favorite': 'True', 'title': 'Indiana Jones: The raiders of the lost Ark'}
{'favorite': 'True', 'title': 'THE KARATE KID'}
{'favorite': 'False', 'title': 'Back 2 the Future'}
{'favorite': 'False', 'title': 'X-Men'}
{'favorite': 'True', 'title': 'Batman Returns'}
{'favorite': 'False', 'title': 'Reservoir Dogs'}
{'favorite': 'False', 'title': 'ALIEN'}
{'favorite': 'True', 'title': "Ferris Bueller's Day Off"}
{'favorite': 'FALSE', 'title': 'American Psycho'}
{'favorite': 'False', 'title': 'Batman: The Movie'}
{'favorite': 'True', 'title': 'Easy A'}
{'favorite': 'True', 'title': 'Dinner for SCHMUCKS'}
{'favorite': 'False', 'title': 'Ghostbusters'}
{'favorite': 'True', 'title': 'Robin Hood: Prince of Thieves'}

XPath Expressions

Many times, elements will not have attributes; they will only have text content. Using the .text attribute, you can print out this content.

Now, print out all the descriptions of the movies.

for description in root.iter('description'):
    print(description.text)

'Archaeologist and adventurer Indiana Jones is hired by the U.S. government to find the Ark of the Covenant before the Nazis.'
None provided.
Marty McFly
Two mutants come to a private academy for their kind whose resident superhero team must oppose a terrorist organization with similar powers.
NA.
WhAtEvER I Want!!!?!
"""""""""
Funny movie on funny guy 
psychopathic Bateman
What a joke!
Emma Stone = Hester Prynne
Tim (Rudd) is a rising executive who “succeeds” in finding the perfect guest, IRS employee Barry (Carell), for his boss’ monthly event, a so-called “dinner for idiots,” which offers certain 
advantages to the exec who shows up with the biggest buffoon.
Who ya gonna call?
Robin Hood slaying

Printing out the XML is helpful, but XPath is a query language used to search through an XML document quickly and easily, and understanding XPath is critically important to scanning and populating XMLs. ElementTree has a .findall() function that will traverse the immediate children of the referenced element.

Here, you will search the tree for movies that came out in 1992:

for movie in root.findall("./genre/decade/movie[year='1992']"):
    print(movie.attrib)

{'favorite': 'True', 'title': 'Batman Returns'}
{'favorite': 'False', 'title': 'Reservoir Dogs'}

The function .findall() always begins at the element specified. This type of function is extremely powerful for a “find and replace”. You can even search on attributes!

Now, print out only the movies that are available in multiple formats (an attribute).

for movie in root.findall("./genre/decade/movie/format[@multiple='Yes']"):
    print(movie.attrib)

{'multiple': 'Yes'}
{'multiple': 'Yes'}
{'multiple': 'Yes'}
{'multiple': 'Yes'}
{'multiple': 'Yes'}

Brainstorm why, in this case, the print statement returns the “Yes” values of multiple. Think about how the “for” loop is defined: the loop variable is the format element itself, so its attributes are all you see.

Tip: append '/..' to an XPath to return the parent of the matched element.

for movie in root.findall("./genre/decade/movie/format[@multiple='Yes']/.."):
    print(movie.attrib)

{'favorite': 'True', 'title': 'THE KARATE KID'}
{'favorite': 'False', 'title': 'X-Men'}
{'favorite': 'False', 'title': 'ALIEN'}
{'favorite': 'False', 'title': 'Batman: The Movie'}
{'favorite': 'True', 'title': 'Dinner for SCHMUCKS'}

Modifying an XML

Earlier, the movie titles were an absolute mess. Now, print them out again:

for movie in root.iter('movie'):
    print(movie.attrib)

{'favorite': 'True', 'title': 'Indiana Jones: The raiders of the lost Ark'}
{'favorite': 'True', 'title': 'THE KARATE KID'}
{'favorite': 'False', 'title': 'Back 2 the Future'}
{'favorite': 'False', 'title': 'X-Men'}
{'favorite': 'True', 'title': 'Batman Returns'}
{'favorite': 'False', 'title': 'Reservoir Dogs'}
{'favorite': 'False', 'title': 'ALIEN'}
{'favorite': 'True', 'title': "Ferris Bueller's Day Off"}
{'favorite': 'FALSE', 'title': 'American Psycho'}
{'favorite': 'False', 'title': 'Batman: The Movie'}
{'favorite': 'True', 'title': 'Easy A'}
{'favorite': 'True', 'title': 'Dinner for SCHMUCKS'}
{'favorite': 'False', 'title': 'Ghostbusters'}
{'favorite': 'True', 'title': 'Robin Hood: Prince of Thieves'}

Fix the ‘2’ in Back 2 the Future. That should be a find and replace problem. Write code to find the title ‘Back 2 the Future’ and save it as a variable:

b2tf = root.find("./genre/decade/movie[@title='Back 2 the Future']")
print(b2tf)

<Element 'movie' at 0x10ce00ef8>

Notice that using the .find() method returns an element of the tree. Much of the time, it is more useful to edit the content within an element.

Modify the title attribute of the Back 2 the Future element variable to read “Back to the Future”. Then, print out the attributes of your variable to see your change. You can easily do this by accessing the attribute of an element and then assigning a new value to it:

b2tf.attrib["title"] = "Back to the Future"
print(b2tf.attrib)

{'favorite': 'False', 'title': 'Back to the Future'}

Write out your changes back to the XML so they are permanently fixed in the document. Print out your movie attributes again to make sure your changes worked. Use the .write() method to do this:

tree.write("movies.xml")

tree = ET.parse('movies.xml')
root = tree.getroot()

for movie in root.iter('movie'):
    print(movie.attrib)

{'favorite': 'True', 'title': 'Indiana Jones: The raiders of the lost Ark'}
{'favorite': 'True', 'title': 'THE KARATE KID'}
{'favorite': 'False', 'title': 'Back to the Future'}
{'favorite': 'False', 'title': 'X-Men'}
{'favorite': 'True', 'title': 'Batman Returns'}
{'favorite': 'False', 'title': 'Reservoir Dogs'}
{'favorite': 'False', 'title': 'ALIEN'}
{'favorite': 'True', 'title': "Ferris Bueller's Day Off"}
{'favorite': 'FALSE', 'title': 'American Psycho'}
{'favorite': 'False', 'title': 'Batman: The Movie'}
{'favorite': 'True', 'title': 'Easy A'}
{'favorite': 'True', 'title': 'Dinner for SCHMUCKS'}
{'favorite': 'False', 'title': 'Ghostbusters'}
{'favorite': 'True', 'title': 'Robin Hood: Prince of Thieves'}

Fixing Attributes

The multiple attribute is incorrect in some places. Use ElementTree to fix the designator based on how many formats the movie comes in. First, print the format attribute and text to see which parts need to be fixed.

for form in root.findall("./genre/decade/movie/format"):
    print(form.attrib, form.text)

{'multiple': 'No'} DVD
{'multiple': 'Yes'} DVD,Online
{'multiple': 'False'} Blu-ray
{'multiple': 'Yes'} dvd, digital
{'multiple': 'No'} VHS
{'multiple': 'No'} Online
{'multiple': 'Yes'} DVD
{'multiple': 'No'} DVD
{'multiple': 'No'} blue-ray
{'multiple': 'Yes'} DVD,VHS
{'multiple': 'No'} DVD
{'multiple': 'Yes'} DVD,digital,Netflix
{'multiple': 'No'} Online,VHS
{'multiple': 'No'} Blu_Ray

There is some work that needs to be done on this tag.

You can use a regular expression to find commas in the format text, which will tell you whether the multiple attribute should be “Yes” or “No”. Adding and modifying attributes can be done easily with the .set() method.

import re

for form in root.findall("./genre/decade/movie/format"):
    # Search for the commas in the format text
    match = re.search(',',form.text)
    if match:
        form.set('multiple','Yes')
    else:
        form.set('multiple','No')

# Write out the tree to the file again
tree.write("movies.xml")

tree = ET.parse('movies.xml')
root = tree.getroot()

for form in root.findall("./genre/decade/movie/format"):
    print(form.attrib, form.text)

{'multiple': 'No'} DVD
{'multiple': 'Yes'} DVD,Online
{'multiple': 'No'} Blu-ray
{'multiple': 'Yes'} dvd, digital
{'multiple': 'No'} VHS
{'multiple': 'No'} Online
{'multiple': 'No'} DVD
{'multiple': 'No'} DVD
{'multiple': 'No'} blue-ray
{'multiple': 'Yes'} DVD,VHS
{'multiple': 'No'} DVD
{'multiple': 'Yes'} DVD,digital,Netflix
{'multiple': 'Yes'} Online,VHS
{'multiple': 'No'} Blu_Ray

Moving Elements

Some of the data has been placed in the wrong decade. Use what you have learned about XML and ElementTree to find and fix the decade data errors. It will be useful to print out both the decade tags and the year tags throughout the document.

for decade in root.findall("./genre/decade"):
    print(decade.attrib)
    for year in decade.findall("./movie/year"):
        print(year.text)

{'years': '1980s'}
1981 
1984 
1985 
{'years': '1990s'}
2000 
1992 
1992  
{'years': '1970s'}
1979 
{'years': '1980s'}
1986 
2000 
{'years': '1960s'}
1966 
{'years': '2010s'}
2010 
2011 
{'years': '1980s'}
1984 
{'years': '1990s'}
1991

The two years that are in the wrong decade are the movies from the 2000s. Figure out what those movies are, using an XPath expression.

for movie in root.findall("./genre/decade/movie[year='2000']"):
    print(movie.attrib)

{'favorite': 'False', 'title': 'X-Men'}
{'favorite': 'FALSE', 'title': 'American Psycho'}

You have to add a new decade tag, the 2000s, to the Action genre in order to move the X-Men data. The ET.SubElement() function can be used to add this tag to the end of the XML.

action = root.find("./genre[@category='Action']")
new_dec = ET.SubElement(action, 'decade')
new_dec.attrib["years"] = '2000s'
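As an aside, ET.SubElement() also accepts attributes as keyword arguments, so the new tag and its years attribute can be created in one call. A short sketch on a small stand-in tree (since the full movies.xml may not be at hand):

```python
import xml.etree.ElementTree as ET

# A small stand-in for the movies root.
root = ET.fromstring("<collection><genre category='Action'/></collection>")
action = root.find("./genre[@category='Action']")

# Keyword arguments become attributes on the newly created element.
new_dec = ET.SubElement(action, 'decade', years='2000s')
print(new_dec.attrib)  # {'years': '2000s'}
```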

Now append the X-Men movie to the 2000s and remove it from the 1990s, using .append() and .remove(), respectively.

xmen = root.find("./genre/decade/movie[@title='X-Men']")
dec2000s = root.find("./genre[@category='Action']/decade[@years='2000s']")
dec2000s.append(xmen)
dec1990s = root.find("./genre[@category='Action']/decade[@years='1990s']")
dec1990s.remove(xmen)
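To double-check the move on a self-contained example, the same append/remove pattern can be verified on a trimmed stand-in for movies.xml:

```python
import xml.etree.ElementTree as ET

# A trimmed stand-in for movies.xml, just enough to demonstrate the move.
root = ET.fromstring("""
<collection>
  <genre category="Action">
    <decade years="1990s">
      <movie title="X-Men"><year>2000</year></movie>
    </decade>
    <decade years="2000s"/>
  </genre>
</collection>""")

xmen = root.find("./genre/decade/movie[@title='X-Men']")
root.find("./genre/decade[@years='2000s']").append(xmen)
root.find("./genre/decade[@years='1990s']").remove(xmen)

# X-Men should now sit only under the 2000s decade.
for decade in root.findall("./genre/decade"):
    print(decade.attrib, [m.get('title') for m in decade.findall('movie')])
```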

Build XML Documents

Nice, so you were able to essentially move an entire movie to a new decade. Save your changes back to the XML.

tree.write("movies.xml")

tree = ET.parse('movies.xml')
root = tree.getroot()

print(ET.tostring(root, encoding='utf8').decode('utf8'))

Conclusion

ElementTree is an important Python library that allows you to parse and navigate an XML document. ElementTree breaks the XML document down into a tree structure that is easy to work with. When in doubt, print it out (print(ET.tostring(root, encoding='utf8').decode('utf8'))): use this helpful print statement to view the entire XML document at once.

The Dynamics of Data Roles & Teams

I’m continually surprised by the responsibilities and titles of new roles emerging within the ‘data profession’. Admittedly, this is a fairly nebulous concept, and I suspect there are a variety of opinions amongst practitioners as to what the composition of this space looks like. However, there are certain trends within this area that practitioners would agree on. Data is being taken more seriously by organisations than ever before, with comparable growth in dedicated ‘data people’, investment and technology.

For the sake of convenience and readability, I would like to go over data roles briefly, categorised by the tech revolutions that influenced a substantial change, especially ones that will keep evolving in future. In addition, I recently wrote a piece on the Evolution of Analytics with Data that helps give better context for this article.

As an amateur blogger, this is clearly one perspective, and it could be a long read for them drowsy eyes. A word of advice: grab a cup of coffee.

Business Intelligence Roles

Quite rightly so, ‘BI’ doesn’t qualify to compete with the trendy buzzonyms around the tech ecosystem in 2018 and isn’t pleasing to the ears of our data-savvy generation. Are ETL tools and strategies no longer in use? Is the scope of BI overshadowed by the vast application of big-data and data science methodologies? Hell no!

How traditional BI roles were structured in accordance with the 
business model of the organisation. Source: Microsoft Technet Wiki

Business Intelligence has seen a considerable decline in the last year or two. However, I wouldn’t go so far as to call BI dead, as its application is still critical to major businesses. Roles like BI Analyst, Data Architect, ETL Developer, DW Engineer and BIDW Admin will only become more crucial, placing extra emphasis on market-leading tools and technology over the jack-of-all-trades roles in present domains.

Business intelligence(BI) concept icons

Scope of Business Intelligence techniques employed in 2018.
Source: DepositPhotos

According to a recent Wisdom of Crowds® Business Intelligence Market Study, BI will continue to provide competent job salaries and dominate certain areas of the market. Here are some of its key numeric takeaways for 2018:

  • Executive Management, Operations, & Sales: 3 areas driving BI adoption.
  • Dashboards, reporting, end-user self-service, advanced visualisation, and data warehousing: 5 technologies and initiatives strategic to BI.
  • Small organisations of up to 100 employees have the highest rate of BI penetration.
  • 50% of vendors offer perpetual on-premises licensing and cloud subscriptions.
  • Fewer than 15% of respondent organisations have a Chief Data Officer.

If you still have a difference of opinion, I recommend you read the full post: The State of Business Intelligence, 2018

Big Data & Data Science Roles

Before we take a deep dive into the current roles, let’s take a step back to understand how and where it all started. My idea is to demonstrate these roles with a storytelling narrative rather than the traditional plaintext definitions, the latter being easily accessible around the Internet. Additionally, every new wave in the industry gives birth to confusing buzzwords, false renditions and surrealistic stipulations (which is a mouthful, to say the least).

The Change

‘Big data’ was coined to distinguish it from small data, as it was not generated purely by the firm’s transaction systems. The phase also held that predictive analytics offered better insight into data trends than fact-based comprehension, going beyond intuition when making decisions. And if the new dimensions and analytics weren’t justification enough, it welcomed community-driven “open source” tools over highly priced licences.

I usually refrain from citing tools by name in my posts, but it’s fairly impossible to describe this revolution without mentioning Apache Hadoop. The technology stack and extensible projects, the functional programming paradigms (scalable, concurrent and distributed systems), the rise of NoSQL database systems, job scheduling and cluster resource management, the changing aspects of drag-and-drop ETL and better data modelling techniques: all of this was brought together by Hadoop, which ultimately emphasised the last point, that code is the best abstraction for software. And it introduced, in a broad sense, the idea of having a custom architecture ready for future integration with Data Science and Machine Learning.

From the developers’ perspective, this meant you didn’t necessarily have to work for the tech big guns to develop new disruptive projects. You had the backing of a community at your disposal and emerging collaboration platforms like GitHub to showcase your work.

Hierarchy of roles in Big Data & Analytics-driven companies.

From an organisational view, Software Engineers (Java developers), DW engineers (BI/ETL developers, data architects) and infra admins (DBAs, Linux SAs) explored fancier titles, as Big-Data Engineer, Hadoop Developer, Hadoop Architect and Big-Data Support Engineer roles began to flourish in the job market. BI roles fell down the pecking order, and the years when line-of-business users and data personnel used the same tools were all but over.

BI roles gradually moving out of the circle of Big Data teams. 
Source: DataFlair

At an industrial level, it had the most impact: it’s not just tech firms and online companies that can create products and services from big-data analytics; it’s practically every firm in the industry.

The Fusion

The tech industry suddenly became divided over the rising demand to employ big data with data science strategies. The field roles were classified into three buckets: Software Engineering (strong programming: front- and back-end engineers, web developers, infra admins, middleware specialists, iOS/Android developers) and Data Engineering (strong data backgrounds: ETL developers, DWH architects, BI analysts, Hadoop engineers, DBAs), and a third set of individuals was welcomed as the next-generation quantitative analysts (possessing both computational and analytical skills), who specialised in a growing field of study: Data Science.

Venn Diagram showing tools & techniques under SE vs DE vs DS domains.
Source: Ryan Swanstrom, Data Science 101

In my view, this classification yielded a significant transition, with the positives best leveraged by small-scale firms (< 50 employees) like emerging startups and research facilities, as well as large-scale enterprises (> 1000 employees) in telecom, e-commerce, social media and so on. Startups had the liberty of combining multiple roles into one and encouraging multi-disciplinary growth opportunities, while the mainstream giants had no trouble employing distinct roles across different departments, thereby adding areas for generating more business.

Entrepreneurs with now medium-sized companies (SMBs), striving to gain commercial reckoning while competing with the big players in their respective markets, were arguably affected the most. Initial success, through series funding rounds or venture capital investment, allowed them to grow larger in headcount (50-300+ employees). They rushed into indefinite hires, redundant roles and poor decision-making strategies. Eventually, the constant pressure to stay in the market under quarterly timelines forced unprecedented lay-offs and stock-distribution losses, and even resulted in liquidation at an early stage. Some tech-savvy investors (whom I’d like to refer to as guardian angels) offered M&A assistance, but the industry saw the downside of absorbing roles for the first time.

The Overlap

Meanwhile, it wasn’t just companies having a hard time with evolving data roles. This era saw a rising number of data science enthusiasts (both academic and experienced) coming out of their comfort caves and expanding their skill sets. And why not: each of these applicants (mathematicians, PhD holders, analysts) had every right to apply for one of the finest-paid jobs of the 21st century. Along came esteemed university professors and philanthropists with their versions of the ideal candidature, but that didn’t stop the mob.

Titles with Data prefixes helped make early distinctions between roles with similar lines of tasks. The intent was to identify skill coverage and harness the right potential. Data Analysts shied away from business and turned their eyes to statistics and engineering, while Data Architects kept their deep focus on publishing models (not to be confused with ML), database design and governance, with their trademark politically neutral attitude.

Radar chart explaining overlap of skills between Data-driven roles.
Ignore "Mad Skillz" as it implies "Natural Abilities". Source: edX

Businesses started to gather more understanding by nurturing capabilities in prescriptive analytics with machine learning on their premises. They began competing on analytics not only in the traditional sense, by improving internal business decisions, but also by creating more valuable products and services. The sheer need (or greed) to attain concrete goals, i.e. better results than last quarter, proportionally showered an overhead of roles and responsibilities on these people. As such, a promising yet challenging position like that of a Data Scientist also beckoned for a central figure across teams: the daily go-to person for anything related to data. Not a lot has been said about the stress and fatigue of many such burdened individuals. If a person of such calibre invested most of their time in analysing, they also managed to find time to pursue better opportunities for themselves. Here’s a satirical treat on KDnuggets supporting my claim.

The Trade-Off

Two big questions came to light. First: is data science the next bubble? My answer: no, but the “Data Scientist” title was arguably becoming one. It was a textbook demand-and-supply problem, where every aspirant wants a fair share of the goods and commodities, but only a few prove worthy of claiming them. Still confusing? Consider how you deal with a fresh graduate applying for this role, or what you do when your data scientist is likely to leave and you’re left with a pack of “self-proclaimed” ones knocking on your door.

Secondly, with data accessed directly from sources like websites, APIs and social media, the need for software programming languages, and the prowess to use them efficiently, couldn’t be compromised. “Not all data scientists held great software foundations.” “Why were software engineering concepts ignored amidst all the buzz for Data Science?” Companies soon realised that only a role reallocation could normalise such inclinations, as they looked to broader engineers to heavily support their data scientists and find that equilibrium amongst different roles.

Software engineers who appeared to have a knack for data science and machine learning stepped up to help with this dilemma and strengthened the data engineer club, while those practising core web programming with stack-driven ambitions moved on to a bigger challenge: the Full-Stack Engineer.

Full-Stack by past roles (left) & by tech-stack areas (right).

A win-win situation: data scientists got a reliable sidekick and a sigh of relief (the inflated hype around their ‘crown’ lowered), along with an equally competent role on the horizon to challenge them. The collusion not only sent those craving enthusiasts spinning but also opened another door, making data engineering one of the most sophisticated disciplines today. The modern-day Data Engineer complements every other role, is a must-have handyman in every firm, and is practically the first hire in startups these days.

An Infographic-take on Data Engineers and Data Scientists.
Source: Read Full Post on DE vs DS, by Karlijn Willems

The gamble of balancing mutually distinct roles (a workaround play that clicked) paid off perfectly, but the tech industry knew it couldn’t afford another setback and had to be prepared, with the increasing acceptance of Artificial Intelligence looming around the corner.

The Resolution

Inevitably, companies identified the flaws in their organisational structure (positions, priorities and capabilities) and incepted data-driven teams. The prime focus was on role distinctions, division of labour, avoiding task conflicts and proper rules of collaboration. An extended example of role-based leaders pioneering their respective units inside such a team would be the Principal Data Scientist and the Engineering Lead.

An early look of a well-structured Data Science team under the 
same roof. Source: DataCamp Blog Community

Today, a perfect data science team is a myth, or otherwise an engaging subject of heated debate. What companies expect from their teams is to assemble as a group of superheroes (The Avengers); what they fail at miserably on occasion is appointing a person who provides such teams with context (Nick Fury). This is where Chief Data Officers come into powerful existence. With data becoming an integral business strategy, the CDO is becoming a more critical role in an organisation. According to a Forbes survey, more than 50% of CDOs are likely to report directly to the CEO in 2018. They’re bound to take on more active roles in shaping their businesses’ initiatives.

I often get disappointed upon seeing job descriptions containing “Advanced English Skills” or “Native candidates only”. So I proactively question (or troll) such job posters every single time (I do enjoy their apparent pause). Language shouldn’t be deemed a barrier; rather, it can be utilised as a formidable source of unifying teams. The best example in 2018 to make my stance clear is indeed a language in itself: Python. Founders (CEOs and CDOs) must trickle these little communications down to their teams and, most importantly, to their first focal point: the Talent Requisition team.

How Python brings a team of diversified role-types together.
Source: ActiveWizards

These days, HR coordinators, recruiters and outsourcing head-hunters all have access to ample data resources (Medium, DataCamp) and data-friendly platforms (LinkedIn Recruiter, Glassdoor) to refine their search for improved hiring, thereby making their own roles data-driven as well.

Machine Learning & AI-driven Roles

Perhaps the most compelling aspect of Machine Learning is its seemingly limitless applicability. There are already so many fields being impacted by ML, and now AI, including education, finance and more. Machine Learning techniques are being applied to critical areas within the healthcare sphere, impacting everything from care-variation reduction efforts to medical scan analysis.

There are a number of companies for which their data (or their data analysis platform) is their product. In this case, the data analysis or machine learning going on can be pretty intense. This is probably the ideal situation for someone who has a formal mathematics, statistics, or physics background and is hoping to continue down a more academic path.

“Machine Learning Engineers often focus more on producing great data-driven products than they do answering operational questions for a company.”

Data-Science-Skills-Udacity-Matrix

New addition to the Data Science team working on ML. Source: Udacity

Companies have become more encouraging and are constantly on the lookout for Machine Learning Engineers: open-minded candidates ranging across all levels, from academic interns to research scientists. The social-media generation also shows far more appreciation for these roles than before, as seen on LinkedIn, Medium and GitHub.

ML-Graph.png

Bird's-eye view of multiple ML roles in AI firms. Source: Udacity

AI-driven companies successfully implementing intelligent machines (like chatbots) are already a step ahead of others. Roles organised into software, applied and core tracks are a clear indication that they’re serious about their product development and service offerings. Since there isn’t any standardisation of profile and seniority today, they are at full liberty to improvise AI titles in the future.

Encompassing Roles

There are many roles that complement data-driven teams on a day-to-day basis. They are a must-have in any organisation, irrespective of the teams they belong to. You’d probably wonder why I didn’t mention them earlier. Honestly, I was hesitant for the reasons below:

  • I have limited expertise on these profiles and their scope.
  • They are not primarily seen under the category of data-driven roles.
  • Their domain versatility allows them to operate across different teams.

Let me try to explain before the knife-wielding mob gets here.

  • Graphic Designers : The Creative Heads in every sense. A complete package of art, science, programming, ideas and imagination with endless capabilities. They add value with their vocal-presence & fearless attitude. My personal favourites.
  • Decision-Makers : A role often misconstrued and overlooked, especially in domain-specific startups. Before hiring that PhD-trained data scientist, make sure you have a decision-maker who understands the art and science of decision-making.
  • DevOps & Site-Reliability Engineers : Broadly split into two categories: “business capabilities teams” and “agile operations teams”. Data Architects & Engineers can coordinate, learn and implement tasks like cloud-based (IaaS, PaaS, SaaS) configuration, containers, micro-services deployment and virtualisation. Meanwhile, DataOps is an emerging practice allowing continuous data flow within the enterprise.
  • Cloud Architects : Technology specialists who usually take up consulting roles (and charge by the hour, like their cloud services). Again, if your Data Engineer is familiar with cloud concepts or is a certified associate/professional, you may not need to hire them.
  • Project & Delivery Managers – Some data science & analytics firms still bend to the established norms of Agile & Scrum methodologies. Before they start consulting clients to orchestrate sales of their products and services, they need experienced managers to ensure PoC (proof-of-concept) timelines and resources are well-allocated.
  • Network & Cyber Security Engineers : Often seen as internal teams but, among all the above mentions, they will soon be an integral part of data-driven teams. With data security already raising menacing concerns in 2018, these roles have been recognised as “critical”, as most companies operate daily with an online presence.

Parting Thoughts

Certainly on the tool front, the technology is becoming more accessible and intuitive than ever before. There is an array of adaptors, for instance, in most cleansing, modelling, reporting and visualisation tools, meaning that loading data is itself no longer a hugely significant hurdle. However, this has also encouraged a somewhat ubiquitous view of data: it should just work with minimal effort. There is an ominous risk that less and less time will be dedicated to getting the fundamentals right.

Tech & Industries to watch out for in 2018-19:

  • Progressive Web Apps (PWAs) – A mixture of mobile and web apps.
  • Blockchain & Fintech – Metamodel building, reliable trading & credit scoring.
  • Healthcare Technology – Diagnosis by Medical Imaging (Computer vision & ML).
  • AR/VR – Sport Analysis, Business Cards (Image Tracking), Techno eSports (Hado).
  • AI Speech Assistants, smarter Chat-bot integrations.
  • Smart Supply Chain – Digital twins (IoT Sensors).
  • 5G – Big data, Mobile cloud computing, scalable IoT & network virtualisation (NFV).
  • 3D Printing – Prefabrication efficiency, defect detection, predictive ML maintenance.
  • Dark Data – Information that is yet to become available in digital format.
  • Quantum Computing – Cutting data processing times into fractions.

Finally, on the job front, it’s evident that roles won’t be able to keep up with the dynamics of technology. Landing that next opportunity will be difficult. As per many job advisors, there are two ways to keep that job security intact: be an expert in one domain, affirming a stance within a stable company, or seek challenging roles by identifying newer domains aligned with tech trends. As a Data Engineer, I follow a hybrid approach, maintaining a learning discipline between professional career and personal ambitions, which practically allows me to work in any tech-driven industry. If there’s any consolation, I surely know that I’m responsible for my successes and failures in the future.

Don’t ever let someone tell you that you can’t do something. You got a dream, you gotta protect it. People can’t do something themselves, they wanna tell you that you can’t do it. You want something, go get it. Period.

—  The Pursuit of Happyness

The Evolution of Analytics with Data

vineet-blog

We have made tremendous progress in the field of Information & Technology in recent times. Some of the revolutionary feats achieved in the tech ecosystem are truly commendable. Data and Analytics have been the most commonly used words of the last decade or two. As such, it’s important to know why they are inter-related, what roles in the market are currently evolving and how they are reshaping businesses.

Technology, often regarded as a boon to those already aware of its potential, can also be a curse to audiences who can’t keep up with its rapid growth. Each era has had its moments of breakthrough and an equal share of victims (or, as I’d like to call them, collateral damage). As of today, every monetary-driven industry completely relies on Data and Analytics for its survival.

This blog is an attempt to look over these different stages: simplifying the various buzzwords, narrating the scenarios that were never explained and keeping an eye on the road ahead. So, without further ado, grab your “cheat-day” meal and let’s take a walk down memory lane.

Analytics 1.0 → Need for Business Intelligence : This was the rise of the data warehouse, where customer (business) and production-process (transaction) data were centralised into one huge repository such as an eCDW (Enterprise Consolidated Data Warehouse). Real progress was made in gaining an objective, deep understanding of important business phenomena, thereby giving managers the fact-based comprehension to go beyond intuition when making decisions.

The data surrounding the eCDW was captured, transformed and queried using ETL & BI tools. The types of analytics exploited during this phase were mainly classified as descriptive (what happened) and diagnostic (why something happened).
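
To make that distinction concrete, here is a minimal pandas sketch. The table and figures are purely hypothetical (not from any real warehouse): a descriptive query aggregates what happened, while a diagnostic drill-down looks for why.

```python
import pandas as pd

# Hypothetical warehouse extract (illustrative numbers only)
sales = pd.DataFrame({
    "region":  ["EU", "EU", "US", "US"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "revenue": [120, 90, 200, 210],
})

# Descriptive: what happened? Total revenue per region.
total_by_region = sales.groupby("region")["revenue"].sum()

# Diagnostic: why did EU revenue change? Drill down quarter by quarter.
eu_by_quarter = sales[sales["region"] == "EU"].set_index("quarter")["revenue"]
```

In a real eCDW these would be SQL queries behind a BI dashboard, but the two questions, and their limits (both only look backwards), are the same.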

However, the main limitation observed during this era was that the potential of data was only utilised within organisations, i.e. business intelligence activities addressed only what had happened in the past and offered no predictions about future trends.

Analytics 2.0 → Big Data : The drawbacks of the previous era became more prominent by the day as companies stepped out of their comfort zone and began their pursuit of a wider (if not better) approach towards a more sophisticated form of analytics. Customers reacted surprisingly well to this new strategy and demanded information from external sources (clickstreams, social media, the internet, public initiatives, etc.). The need for powerful new tools, and the opportunity to profit by providing them, quickly became apparent. Inevitably, the term ‘big data’ was coined to distinguish it from small data, as it was not generated purely by a firm’s internal transaction systems.

What companies expected from their employees was help engineering platforms to handle large volumes of data with a fast processing engine. What they didn’t expect was a huge response from an emerging group of individuals, today better known as the “Open Source Community”. This was the hallmark of Analytics 2.0.

With the unprecedented backing of the community, roles like Big Data Engineer and Hadoop Administrator grew across the job sector and were now critical to every IT organisation. Tech firms rushed to build new frameworks that were capable not only of ingesting, transforming and processing big data around eCDWs/data lakes, but also of integrating predictive (what is likely to happen) analytics on top. Predictive analytics uses the findings of descriptive and diagnostic analytics to detect tendencies, clusters and exceptions, and to predict future trends, which makes it a valuable tool for forecasting.
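
As a rough illustration of that descriptive-to-predictive step, here is a minimal scikit-learn sketch. The spend/revenue figures are invented for the example, and the relationship is deliberately a clean straight line so the forecast is easy to check by hand:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical history: monthly ad spend vs. revenue (toy figures,
# constructed so that revenue = 2 * spend + 5 exactly)
spend = np.array([[10.0], [20.0], [30.0], [40.0]])
revenue = np.array([25.0, 45.0, 65.0, 85.0])

model = LinearRegression().fit(spend, revenue)

# Predictive: what is likely to happen at a spend level we haven't seen?
forecast = model.predict(np.array([[50.0]]))  # ≈ 105
```

Descriptive tools would stop at summarising the four historical rows; the fitted model is what lets you ask about the fifth.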

In today’s tech ecosystem, I personally think the term big data has been used, misused and abused on many occasions. So technically, ‘big data’ now really means ‘all data’, or just Data.

Analytics 3.0 → Data-Enriched Offerings : The pioneering big-data firms began investing in analytics to support customer-facing products, services and features. They attracted viewers to their websites through better search algorithms, recommendations, suggestions for products to buy and highly targeted ads, all driven by analytics rooted in enormous amounts of data. The big-data phenomenon spread like a virus, so now it’s not just tech firms and online companies that can create products and services from the analysis of data; it’s practically every firm in every industry.

On the other hand, the wide acceptance of big-data technologies had a mixed impact. While the tech-savvy giants forged ahead by making more money, a majority of other enterprises and non-tech firms suffered miserably at the expense of not knowing their data. As a result, a field of study, Data Science, was introduced, which used scientific methods, exploratory processes, algorithms and systems to extract knowledge and insights from data in various forms.

Indeed, it is an interdisciplinary field defined as a “concept to unify statistics, data analysis, machine learning and their related methods” in order to “understand and analyse actual phenomena” with data. In other words, well-refined data complemented with good training models yields better prediction results. The next generation of quantitative analysts were called data scientists, possessing both computational and analytical skills.

The tech industry exploded with the benefits of implementing Data Science techniques and leveraged the full power of predictive and prescriptive (what action to take) analytics, i.e. eliminating a future problem or taking full advantage of a promising trend. Companies began competing on analytics not only in the traditional sense, by improving internal business decisions, but also by creating more valuable products and services. This is the essence of Analytics 3.0.

There has been a paradigm shift in how analytics are used today. Companies are scaling at a speed beyond imagination, identifying disruptive services and encouraging more R&D divisions, many of which are strategic in nature. This requires a new organisational structure: positions, priorities and capabilities. A closely-knit team of data-driven roles (Data Scientists, Data Engineers, Solution Architects, Chief Analysts), when brought under the same roof, is a guaranteed recipe for success.

Analytics 4.0 → Automated Capabilities :

There have always been four types of analytics: descriptive, which reports on the past; diagnostic, which uses the data of the past to study the present; predictive, which uses insights based on past data to predict the future; and prescriptive, which uses models to specify optimal behaviours and actions. Although Analytics 3.0 includes all of the above types in a broad sense, it emphasises the last, and it introduces, typically on a small scale, the idea of automated analytics.

Analytics 3.0 provides an opportunity to scale decision-making processes to industrial strength. Creating many more models through machine learning can let an organisation become much more granular and precise in its predictions. Having said that, the cost and time of deploying such customised models weren’t entirely affordable and called for a cheaper or faster approach. The need for automation through intelligent systems finally arrived, and this once seemingly out-of-reach idea looming on the horizon is where Analytics 4.0 came into existence.
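
The “many more models” idea can be sketched in a few lines: instead of one global model, train one small model per segment. The segments and figures below are hypothetical (each segment’s data is an exact line, so the predictions are easy to verify); a real automated pipeline would loop over thousands of segments and retrain on a schedule.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical per-segment history: (feature, target) pairs for each region.
# EU follows y = 10x exactly; US follows y = 10x + 5 exactly.
history = {
    "EU": (np.array([[1.0], [2.0], [3.0]]), np.array([10.0, 20.0, 30.0])),
    "US": (np.array([[1.0], [2.0], [3.0]]), np.array([15.0, 25.0, 35.0])),
}

# One fitted model per segment, instead of a single global model.
models = {seg: LinearRegression().fit(X, y) for seg, (X, y) in history.items()}

eu_pred = models["EU"].predict(np.array([[4.0]]))[0]  # ≈ 40
us_pred = models["US"].predict(np.array([[4.0]]))[0]  # ≈ 45
```

The point is the granularity: each segment gets its own slope and intercept, which a single pooled model would blur together.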

There is no doubt that the use of artificial intelligence, machine learning and deep learning is going to profoundly change knowledge work. We have already seen their innovative capabilities in the form of neural machine translation, Smart Reply, chatbots, meeting assistants and the like, which will be extensively used for the next couple of years. The data involved here originates from vast heterogeneous sources, requires complex training methods and, above all, must be able to sustain itself (make recommendations, improve decision-making, take appropriate actions). Employing data-mining techniques and machine-learning algorithms alongside the existing descriptive-predictive-prescriptive analytics comes to full fruition in this era, which is one reason why automated analytics is seen as the next stage in analytic maturity.

Analytics 5.0 → Future of Analytics, and What’s Next :

Analytics 4.0 is filled with the promise of a utopian society run by machines and managed by peace-loving managers and technologists. We could reframe the threat of automation as an opportunity for augmentation — combining smart humans and smart machines to achieve an overall better result.

Now, instead of pondering “What tasks currently performed by humans will soon be replaced by machines?”, I’d rather optimistically ask: “What new feats could companies achieve if they had better-thinking machines to assist them? How could we reduce death tolls in a calamity-prone area with an improved AI evacuation routine? Why can’t AI-driven e-schools be implemented in poverty-ridden zones?”

Most organisations that are exploring “cognitive” technologies (smart machines that automate aspects of decision-making processes) are just putting a toe in the water, running pilots to explore the technology. Others are working on the concept of building a consumer-controlled AI platform: personal AI agents that can communicate with other AI services, so-called bots, to get the job done. No more manual intervention, with an AI-powered framework to steer your personal day-to-day activities.

I wouldn’t be surprised to see either of these technologies making giant leaps in the future. Surely, there’s an element of uncertainty tied to them but, unlike many, I’m rather optimistic about the intent. There’s always something waiting at the end of the road. If you’re not willing to see what it is, you probably shouldn’t be out there in the first place.

“Everything should be made as simple as possible, but not simpler.”

—  Albert Einstein

Apples != Oranges

applesoranges

Building a Career roadmap never featured first on my bucket list. I’m sure a lot of you have come across people who influenced a phase of your life. Well, I recently discovered mine in the most unconventional way and would really (3x) like to share it with the world.

The title is a Fuzz-Phrase (Oxford pending*) that helps me drive a Perspective model towards achieving my near-future career goal.

Initially, I started with an exhaustive approach without any plan and failed steadily for months. Then, I tried one of those alternative life hacks, which bought me some time, but I only fell deeper into the pit.

A few months ago, I noticed something unusual about my recent failures and observed a recurring pattern. I analysed it again in finer detail and found more (a routine day for a Data Engineer). At this point, I had identified the missing pieces of this jigsaw puzzle, and it’s been a positive experience every day since.

Firstly, I immediately turned OFF my “rat-race” pursuit of potential careers and devoted my maximum efforts to learning about myself.

Then I … <to be continued>

Food4Thought : “Data is today’s currency buying tomorrow’s job.” – Deepesh Nair