Understanding Your Data for AI

New Artificial Intelligence technologies have made understanding your data more important than ever. Data has always been the most elemental, informational asset of your business, but these days it is exponentially more plentiful and sophisticated than ever before.

There is structured data that your organization captures in databases and Customer Relationship Management (CRM) systems, and then there is unstructured data and Big Data arriving from a multitude of sources: video streams, social media feeds, information discovered across your business and website, and so forth. This newer data is often stored in newer repositories such as Data Lakes, NoSQL databases have become popular for capturing non-structured data, and a wide range of techniques is used to pull data from these varied sources.

Data Management Approaches

Data Warehouses

Data warehousing is a longstanding method of aggregating data from different relational databases into a single, central repository. Data is extracted from each source database and transformed to fit the data model of the Data Warehouse. Business Intelligence (BI) tools are then used to analyze the data in the Data Warehouse.
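The extract-transform-load flow described above can be sketched in a few lines. This is a minimal illustration, not any particular product's pipeline: the source systems, table names, and sample rows are all hypothetical, and in-memory SQLite stands in for real relational databases.

```python
import sqlite3

# Two "source" relational databases, sketched here as in-memory SQLite.
crm = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE customers (id INTEGER, full_name TEXT, region TEXT)")
crm.execute("INSERT INTO customers VALUES (1, 'Ada Lovelace', 'EMEA')")

orders = sqlite3.connect(":memory:")
orders.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL)")
orders.execute("INSERT INTO orders VALUES (100, 1, 250.0)")

# The "warehouse" has its own target data model (a denormalized fact table).
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE fact_sales (customer_name TEXT, region TEXT, amount REAL)")

# Extract from each source, transform to the warehouse schema, then load.
customers = {row[0]: (row[1], row[2]) for row in crm.execute("SELECT * FROM customers")}
for order_id, customer_id, amount in orders.execute("SELECT * FROM orders"):
    name, region = customers[customer_id]
    warehouse.execute("INSERT INTO fact_sales VALUES (?, ?, ?)", (name, region, amount))

print(list(warehouse.execute("SELECT * FROM fact_sales")))
# [('Ada Lovelace', 'EMEA', 250.0)]
```

The point of the transform step is visible in the fact table: data from two systems with different schemas lands in one shape that BI tools can query directly.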

Data Marts

A Data Mart is a smaller, more focused version of a Data Warehouse — geared to helping a particular department analyze its data. Data Marts emerged from the difficulties organizations had in setting up Data Warehouses.

Data Lakes

Data Lakes have emerged as the leading technique to house large amounts (petabytes) of structured and unstructured data. Data Lakes can handle the volume, velocity, and variety of big data. Just throw all the data into one big lake and then examine it.

Data Lake — graphic from https://k21academy.com/microsoft-azure/data-engineer/azure-data-lake/

Data scientists, data engineers, and developers use data discovery tools to access the data, and machine learning algorithms to analyze it.

The advantages of a Data Lake are that a) it supports both unstructured Big Data and structured data, and b) it is easier than a Data Warehouse to set up, in that you do not need to create a data model schema beforehand to 'gatekeep' the data. The goal of a Data Lake is to keep an open door for all of the organization's data, which makes data onboarding as simple as possible.
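The "no schema gatekeeping" point can be made concrete: at its simplest, a lake is cheap storage that accepts any format as-is. The sketch below is purely illustrative (the file names and contents are invented), using a temporary local directory in place of real object storage.

```python
import csv
import json
import tempfile
from pathlib import Path

# A local stand-in for a data lake's raw zone; real lakes typically use
# object storage, but the onboarding idea is the same.
lake = Path(tempfile.mkdtemp()) / "lake" / "raw"
lake.mkdir(parents=True)

# Onboard structured and unstructured data with no upfront schema.
(lake / "events.json").write_text(json.dumps({"event": "page_view", "user": 42}))
(lake / "notes.txt").write_text("Free-form support ticket text goes here.")
with open(lake / "sales.csv", "w", newline="") as f:
    csv.writer(f).writerows([["region", "amount"], ["EMEA", "250.0"]])

# Discovery is then a matter of scanning what landed, not consulting a schema.
print(sorted(p.name for p in lake.iterdir()))
# ['events.json', 'notes.txt', 'sales.csv']
```

Nothing rejected any of those files, which is exactly the open-door property; it is also exactly why, without further discipline, lakes become hard to understand.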

The tradeoff of a Data Lake is that it can become a massive heap of data that nobody fully understands; without added discipline, you may have little idea what any of it represents.

Data Architecture

Data Architecture within an Enterprise Architecture is now essential to running an efficient business, and to understanding how you can use the latest technologies, such as Artificial Intelligence techniques, to best and safest advantage. A well-architected Data Architecture can help you manage your Data Lake.

Developing and maintaining a Data Architecture within the constructs of an Enterprise Architecture enables you to:

  • Understand Data Sources — understand where the data came from, and how it has changed from its origination point.
  • Reduce Redundancy of Data — discover overlapping data fields across different sources, which otherwise lead to inconsistency and data inaccuracies. A good Data Architecture enables you to standardize how data is stored, reducing duplication and providing better data quality and analysis.
  • Enhance Quality of Data — A well-designed data architecture enables you to enforce data quality, data governance, and data security practices, helping you manage your Data Lake.
  • Provide Cross-Enterprise Integration — A good data architecture enables integration across domains, so that different departments in different geographies have access to each other's data. This enables a more holistic understanding of data across the enterprise, such as product info, customer info, revenue, expenses, and cost drivers.
  • Perform Lifecycle Management of Data — A good data architecture enables you to manage data over time. Typically data will become less useful as it ages. Knowing this, you can migrate it to slower, cheaper storage facilities so you can still access it for reports but reduce the cost of high-performance storage solutions.
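The lifecycle-management bullet above amounts to a simple routing rule: past a retention threshold, data moves to a cheaper tier. Here is a minimal sketch of that rule; the records, the 365-day threshold, and the "hot"/"cold" tier names are all assumptions for illustration.

```python
from datetime import date, timedelta

# Hypothetical records tagged with their last-access date.
records = [
    {"id": 1, "last_access": date.today()},
    {"id": 2, "last_access": date.today() - timedelta(days=400)},
]

# Assumed policy: data untouched for a year moves to cheaper storage.
HOT_RETENTION = timedelta(days=365)

def tier_for(record, today=None):
    """Route aging data to a cheaper tier while keeping it queryable."""
    today = today or date.today()
    age = today - record["last_access"]
    return "cold" if age > HOT_RETENTION else "hot"

print([(r["id"], tier_for(r)) for r in records])
# [(1, 'hot'), (2, 'cold')]
```

In practice the "cold" branch would trigger a migration job to slower, cheaper storage; the decision logic is the part a data architecture standardizes.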

Types of Data Architectures

All of that said, there are two basic types of Data Architectures currently in vogue: Data Fabrics and Data Meshes.

Data Fabrics

Data fabrics are an emerging architecture being used to enhance customer profiling, fraud detection, and preventative maintenance.  

A data fabric architecture focuses on automating data integration, data engineering, and governance in a value chain between data providers and data consumers. It is built on the notion of active metadata: the fabric uses knowledge graphs, semantics, data mining, and machine learning (ML) technology to discover patterns in various types of metadata, such as system logs, social media, and so forth. Then, it applies this insight to automate and orchestrate the data value chain.
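One concrete piece of this is the metadata knowledge graph. The toy version below stores lineage as subject-predicate-object triples and answers a question a fabric would automate: which data products feed a given consumer? The dataset names, predicates, and systems are all invented for illustration.

```python
# A tiny metadata "knowledge graph" as subject-predicate-object triples,
# of the kind a fabric might harvest from system logs and catalogs.
triples = [
    ("sales.csv", "produced_by", "pos_system"),
    ("sales.csv", "consumed_by", "revenue_dashboard"),
    ("events.json", "produced_by", "web_frontend"),
    ("events.json", "consumed_by", "revenue_dashboard"),
]

def neighbors(node, predicate):
    """Walk the graph: what does this node connect to via a predicate?"""
    return [o for s, p, o in triples if s == node and p == predicate]

# Lineage query: find every data product feeding a given consumer, which
# the fabric could use to orchestrate provisioning automatically.
feeds = [s for s, p, o in triples if p == "consumed_by" and o == "revenue_dashboard"]
print(feeds)
# ['sales.csv', 'events.json']
```

Real fabrics apply ML over far richer metadata, but the orchestration still reduces to queries like this over a lineage graph.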

Data Fabric graphic courtesy of IBM in this article: https://developer.ibm.com/articles/introduction-to-data-fabric/

For example, a Data Fabric architecture can automatically enable a data consumer to find a data product and then have that data product provisioned to them automatically. The increased data access between data consumers and data products reduces data silos. It also provides a better overall view of your organization's data.

Data Meshes

A data mesh is a decentralized data architecture that organizes data by business domain. Data lakes and data warehouses can be used as multiple decentralized data repositories to realize a data mesh.

A data mesh architecture leads you to stop thinking of data as a by-product of a process and start thinking of it as a product in its own right.

The producers of data become product owners. They are the subject matter experts of their data and use their understanding of the data's consumers to design APIs for them. These APIs can then also be accessed from other parts of the organization, or from external organizations, providing broader access to managed data.
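A "data as a product" sketch makes the ownership idea concrete: the domain team owns the data, and exposes it only through a small, consumer-oriented API. Everything here is hypothetical (the team name, the rows, and the `by_segment` query are invented to illustrate the pattern, not any specific mesh implementation).

```python
from dataclasses import dataclass, field

# A hypothetical domain team publishing its data as a product: the team
# owns the schema, documents it, and exposes a small read API.
@dataclass
class CustomerDataProduct:
    owner: str = "crm-domain-team"
    _rows: list = field(default_factory=lambda: [
        {"id": 1, "name": "Ada Lovelace", "segment": "enterprise"},
        {"id": 2, "name": "Grace Hopper", "segment": "smb"},
    ])

    def by_segment(self, segment):
        """A consumer-facing query designed around a known consumer need."""
        return [r for r in self._rows if r["segment"] == segment]

product = CustomerDataProduct()
print(product.by_segment("enterprise"))
# [{'id': 1, 'name': 'Ada Lovelace', 'segment': 'enterprise'}]
```

The design choice is that consumers never reach into `_rows` directly; they get the query the owning team designed for them, which is what lets the same product serve other departments or external organizations safely.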

Data Mesh principles. Image from https://d1ugv6dopk5bx0.cloudfront.net/s3fs-public/data-mesh-concept.webp

A data mesh can also work with a data fabric. The data fabric’s automation enables new data products to be created quickly and enforces global governance.

Data Cataloging & Master Data Management

Master Data Management is a method that Data Architects have practiced for years to understand the ontology and sources of data in an organization. In recent years, Data Cataloging efforts have been established to understand how data is sourced and transformed across an organization.

Data Cataloging aims to answer the following questions:

  • Who is the source of the data?
  • What is the data used for?
  • Where does the data reside?
  • When is the data transformed?
  • Why does the data exist?
  • How is the data used?
  • Which software programs and code does the data participate in?

A Data Catalog does not conform data; it simply lists the data in a catalog and identifies its uses. Consumers of data, who are spread all over the organization, use the Data Catalog in much the same way they use Yelp.
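A single catalog entry can be pictured as a record answering the who/what/where/when/why/how questions above. The field names and every value below are illustrative, not any specific cataloging product's schema; the storage path in particular is a made-up example.

```python
# A minimal catalog entry answering who/what/where/when/why/how for one
# dataset. All names and values are hypothetical.
catalog = {
    "sales.csv": {
        "who": "Point-of-sale system (finance domain)",
        "what": "Daily sales transactions used for revenue reporting",
        "where": "s3://corp-lake/raw/sales.csv",  # hypothetical location
        "when": "Transformed nightly by the revenue ETL job",
        "why": "Required for quarterly revenue statements",
        "how": "Joined with CRM data in the revenue dashboard",
        "programs": ["revenue_etl.py", "dashboard_loader.py"],
    },
}

def lookup(dataset):
    """Consumers browse the catalog the way they browse reviews on Yelp."""
    return catalog.get(dataset, {}).get("what", "not cataloged")

print(lookup("sales.csv"))
# Daily sales transactions used for revenue reporting
```

Note that the entry describes the data without conforming it: `sales.csv` itself is untouched, which is the distinction the article draws between cataloging and Master Data Management.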

Data Modeling

Traditional data modeling techniques — logical and physical data modeling — are excellent at understanding the structure of data housed in relational databases. In System Architect, you can even reverse-engineer an Oracle or Microsoft SQL Server database into a physical data model, and maintain the source of the data (what database it came from). The physical model can be mapped to a logical data model (or Entity Relationship (ER) diagram), which enables data architects to understand the entities and relationships. This data can then be tracked against how it is used by the business (via, for example, traditional Create-Read-Update-Delete (CRUD) matrices and so forth).
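A CRUD matrix is simple enough to sketch directly: it maps each (business process, data entity) pair to the operations the process performs. The processes and entities below are invented examples, not drawn from any particular model.

```python
# A sketch of a Create-Read-Update-Delete (CRUD) matrix: which business
# processes touch which data entities, and how. All entries hypothetical.
crud = {
    ("Order Entry", "Order"): "CRU",
    ("Order Entry", "Customer"): "R",
    ("Billing", "Invoice"): "CRUD",
    ("Billing", "Order"): "R",
}

def processes_touching(entity):
    """Which processes use a given entity at all?"""
    return sorted(p for (p, e) in crud if e == entity)

def can_delete(process, entity):
    """Does this process have delete rights on this entity?"""
    return "D" in crud.get((process, entity), "")

print(processes_touching("Order"))
# ['Billing', 'Order Entry']
```

Questions like "who can delete invoices?" or "which processes would a schema change to Order affect?" then become lookups over the matrix, which is exactly how architects use CRUD analysis.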

Conceptual Data Modeling for AI

Non-relational data and big data are better captured using Conceptual data modeling techniques, also provided by all the major frameworks in System Architect (TOGAF, DoDAF, UAF, ArchiMate, NAF, etc.).

In fact, if you search for "Conceptual Data Modeling AI" in Google right now, the AI-generated answer will tell you much the same.

Conceptual Modeling in DoDAF 2

If you are using the DoDAF 2 framework, you may model conceptual data in a DIV-1 (Data and Information Viewpoint 1, the Conceptual Data Model) in System Architect. You model "Information" as a definition type, and "Conceptual" relationships between Information elements.

Conceptual Modeling in UAF

If you are using the Unified Architecture Framework (UAF), then the taxonomy of the metamodel is slightly different — the main conceptual data definition is called StrategicInformation. The name of the diagram is St-If Strategic Information view.

Conceptual Modeling in TOGAF 10

If you are using the TOGAF 10 framework, you can use a Conceptual Data Model to model conceptual "Business Information".

Conceptual Modeling in ArchiMate

If you are using the ArchiMate framework, you can use an Information Structure Viewpoint to model “ArchiMate Data Objects”.

Enterprise Architecture and AI

Understanding your data — where it comes from, how good it is, how old it is, and so forth — is key to decision making using Artificial Intelligence. And then that data needs to be understood within the context of the Enterprise Architecture:

  • The processes that the organization performs, which transform the data,
  • The applications and technologies that the organization uses to get things done,
  • The organizational structure and decision making,
  • The key drivers and goals of the organization, and
  • The capabilities it brings to bear.

We will explore this in more detail in other articles. Thanks for reading. Please feel free to comment below.
