The Three Main Data Challenges of Machine Learning

by Tech Mainstream Staff

May 6, 2019

According to the Western Digital Blog article "3 Key Data Challenges of Machine Learning," there are three critical data challenges in machine learning: quality, sparsity, and integrity.

Quality is a concern whenever data comes from external sources, where there is "no quality control or guarantee on how the original data is captured" and "you need to understand the quality of the data and how to prepare it." Data from experiments and examples must be checked for errors and cleaned up before any analysis is conducted.
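As an illustration of what "cleaning up" externally sourced data can look like, here is a minimal Python sketch. It is not taken from the Western Digital article; the function name `clean_readings`, the `value` field, and the plausible range of -50 to 60 are all assumptions chosen for the example.

```python
import math


def clean_readings(rows, low=-50.0, high=60.0):
    """Split raw rows into (clean, rejected) lists.

    Hypothetical rule set: a row is kept only if its "value" field
    parses as a number, is not NaN, and falls inside [low, high].
    """
    clean, rejected = [], []
    for row in rows:
        try:
            value = float(row["value"])
        except (KeyError, TypeError, ValueError):
            # Missing field, None, or text that is not a number.
            rejected.append(row)
            continue
        if math.isnan(value) or not (low <= value <= high):
            # Parsed, but not a plausible measurement.
            rejected.append(row)
            continue
        clean.append({**row, "value": value})
    return clean, rejected


if __name__ == "__main__":
    raw = [
        {"sensor": "a", "value": "21.5"},
        {"sensor": "b", "value": "n/a"},
        {"sensor": "c", "value": "999"},
    ]
    kept, dropped = clean_readings(raw)
    print(len(kept), "kept;", len(dropped), "rejected")
```

Keeping the rejected rows, rather than silently discarding them, makes it possible to review how much of an external source failed the checks before trusting it.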

Sparsity involves incomplete metadata, especially when data comes from diverse sources that share no standard metadata definition. When data sources are combined, fields often do not correspond. "How do you correlate and filter data" when the same type of data arrives with different metadata fields populated? The answer is "through the metadata disclosing when it was captured. When scientists are doing historical analysis they need metadata in order to be able to adjust their models accordingly."
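One way to picture the mismatched-fields problem is a small mapping step that renames each source's metadata fields to a shared schema and reports what is still missing. The Python sketch below is only an assumption-laden example: the canonical names `captured_at` and `sensor_id`, the alias lists, and the function `normalize_metadata` are hypothetical and do not come from the original article.

```python
# Hypothetical canonical schema: each canonical field lists the names
# that different sources might use for the same piece of metadata.
FIELD_ALIASES = {
    "captured_at": ("captured_at", "timestamp", "ts"),
    "sensor_id": ("sensor_id", "device", "id"),
}


def normalize_metadata(record):
    """Return (normalized, missing) for one source record.

    normalized maps canonical field names to the first populated alias;
    missing lists canonical fields that no alias could fill.
    """
    normalized, missing = {}, []
    for canonical, aliases in FIELD_ALIASES.items():
        for alias in aliases:
            if record.get(alias) not in (None, ""):
                normalized[canonical] = record[alias]
                break
        else:
            # No alias was populated for this canonical field.
            missing.append(canonical)
    return normalized, missing


if __name__ == "__main__":
    source_a = {"ts": "2019-05-06T10:00:00Z", "device": "a1"}
    source_b = {"id": "b2"}  # capture time was never recorded
    print(normalize_metadata(source_a))
    print(normalize_metadata(source_b))
```

Records that come back with `captured_at` in the missing list are the ones that cannot be placed in time, which is the gap the quoted answer points to.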

Integrity is the assurance of data accuracy and consistency:

"The chain of data custody is critical to prove that data is not compromised as it moves through pipelines and locations."

When the capture and ingestion of data are controlled, data veracity is not an issue. Issues arise, however, when one cannot establish that the data was recorded as originally intended, or that the data obtained is the same as when it was originally recorded. Data integrity is therefore contingent on a combination of security technologies and policies, such as using HTTPS and encryption. Policy-driven access control helps eliminate human error.
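A common building block for showing that data "is the same as when it was originally recorded" is a keyed hash stored at capture time and rechecked later. The following Python sketch uses the standard library's `hmac` and `hashlib` modules; the function names `sign` and `verify` and the sample key are assumptions made for illustration, and the article itself does not prescribe this particular technique.

```python
import hashlib
import hmac


def sign(payload: bytes, key: bytes) -> str:
    """Compute an HMAC-SHA256 tag for the payload at capture time."""
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify(payload: bytes, key: bytes, expected_tag: str) -> bool:
    """Check that the payload still matches the tag recorded earlier."""
    # compare_digest avoids leaking information through timing.
    return hmac.compare_digest(sign(payload, key), expected_tag)


if __name__ == "__main__":
    key = b"example-key-kept-outside-the-pipeline"  # hypothetical secret
    original = b"sensor=a1,value=21.5"
    tag = sign(original, key)

    print(verify(original, key, tag))                 # unchanged data
    print(verify(b"sensor=a1,value=99.9", key, tag))  # altered in transit
```

The tag only supports a chain of custody if the key is managed under the kind of access policy the article describes; anyone who holds the key can produce a valid tag for altered data.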

In summary, organizations and businesses should begin improving their machine learning environments by defining a data collection policy, standardizing the metadata format, and applying standard security techniques.

