Thursday, July 3, 2014

The Structured Data Spectrum

Adrian Bowles 

"Unstructured data" is a terrible term that does more to obfuscate than to reveal the truth. Processing what passes for unstructured data today - like natural language text - is not alchemy or magic. It may be difficult to discover or identify the structure, but that doesn’t mean the data itself is unstructured. If there really is no structure, you have noise, not data.

The sciences are disciplines of discovery. An advance in science reveals and explains that which is true but perhaps not obvious. The engineering disciplines train students to design and use artifacts such as machines and structures, using principles based in science and mathematics. Computer science and software engineering use principles of both, but often lack the rigor of their more mature brethren. The term “unstructured data” is evidence of that lack of rigor. It implies that structure must be imposed - rather than discovered - and sets up a false dichotomy in place of a useful scale.

Let’s look at how another discipline updates its common knowledge as a model for clearing up similar misunderstandings. 

The Diagnostic and Statistical Manual of Mental Disorders (DSM) of the American Psychiatric Association (APA) has codified disorders since DSM I was published in 1952. Last year the APA published DSM V. Each revision attempts to capture the state of psychiatric practice. Checklists are provided for each disorder in the taxonomy to guide diagnoses according to symptoms. Disorders come and go in the DSM, but that doesn't mean that human nature and pathology necessarily change. It means that understanding changes. Truth doesn't change. Belief changes.

Autism first appeared in DSM III in 1980 as a subcategory of Pervasive Developmental Disorder (PDD). From 1952 to 1980, of course, people exhibited the symptoms of the disorder that "became" autism in DSM III. In fact, in the mid-1970s, I worked as a psychology student in a clinical setting with autistic children.


In 1994 in DSM IV, the class of PDDs expanded: PDD-Not Otherwise Specified and Autistic Disorder were joined by Asperger Disorder, Childhood Disintegrative Disorder, and Rett Syndrome. In 2013, though, something very interesting happened in DSM V. Asperger Disorder disappeared. Rett Syndrome disappeared. Childhood Disintegrative Disorder disappeared. They were all replaced by Autism Spectrum Disorder, which was no longer considered to be a PDD (acknowledging that symptoms appear in specific contexts, not pervasively). In other words, instead of discrete disorders, a continuous spectrum was defined to allow for cases with varying degrees of severity across numerous symptoms.

People with disorders - the truth - didn’t change as a result of relabeling. Relabeling occurred to reflect better understanding of the underlying truth. Gause and Weinberg (Exploring Requirements) note that military recruits are taught “when the map and the terrain don’t agree, believe the terrain.” My corollary is, when the map and terrain don’t agree, fix the map. If the APA can do it (granted, it took nearly 14 years, three drafts, and over 10,000 comments from practitioners to produce DSM V after the last revision of DSM IV), surely we can be more precise with the way we classify data. It's time for those of us who work in data management and cognitive computing to take a lesson from the APA. It's time to abandon the diagnosis of unstructured data. 


To replace the misleading structured/unstructured dichotomy, I propose a structured data spectrum that recognizes two simple facts:
    • If it is unstructured, it isn't data. 
    • If it is data, the difficulty of processing it can be codified based on factors like inherent ambiguity and context-dependencies. That yields a range from deep structure (hard to discover and process) to surface structure (easy to process because the structure is already known and obvious).
This approach creates a structural discovery effort metric like the degree of difficulty measure in Olympic sports. In the timeless words of Louis Sullivan, form ever follows function. When data is created to be processed by an application, structure is generally shallow and visible. When it is created for human consumption, structure may be deep and difficult. Whatever terms you use to describe the degrees of difficulty between these extremes, abandoning the term “unstructured data” is a necessary first step to clarity.
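As a rough illustration only, here is a minimal Python sketch of how such a metric might be codified. Nothing in it is a defined standard: the names DataSource, discovery_effort, and rank_by_effort, the two 0-to-1 factor ratings, the equal weighting, and the example ratings themselves are all assumptions made up for the illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSource:
    """A data source rated on two hypothetical factors, each in [0, 1]."""
    name: str
    ambiguity: float           # 0.0 = unambiguous, 1.0 = highly ambiguous
    context_dependence: float  # 0.0 = self-contained, 1.0 = meaning depends on context


def discovery_effort(source: DataSource) -> float:
    """Return a score in [0, 1]; higher means deeper, harder-to-discover structure."""
    for rating in (source.ambiguity, source.context_dependence):
        if not 0.0 <= rating <= 1.0:
            raise ValueError("factor ratings must lie in [0, 1]")
    # Equal weighting is an arbitrary, illustrative choice.
    return (source.ambiguity + source.context_dependence) / 2


def rank_by_effort(sources):
    """Order sources from surface structure (easy) to deep structure (hard)."""
    return sorted(sources, key=discovery_effort)


# Illustrative ratings only; real values would have to come from analysis.
csv_export = DataSource("CSV export", ambiguity=0.0, context_dependence=0.25)
server_log = DataSource("server log", ambiguity=0.25, context_dependence=0.5)
customer_email = DataSource("customer email", ambiguity=0.75, context_dependence=1.0)

for source in rank_by_effort([customer_email, csv_export, server_log]):
    print(f"{source.name}: {discovery_effort(source):.3f}")
```

The point of the sketch is that every source lands somewhere on a single scale. Nothing is labeled "unstructured"; a source is simply rated as requiring more or less effort to discover its structure.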