Machine Learning Marketing: Ignore Size

TELEGRID uses Machine Learning in many of its products, specifically behavioral biometrics for user authentication.  I personally spend countless hours researching Machine Learning and performing market analysis.  In doing so, I have noticed a troubling trend in Machine Learning marketing whereby size is promoted as the ultimate differentiator.  I believe this ignores the math behind Machine Learning, so I decided to focus this post on helping consumers ask the right questions.

My Data Set is Bigger

Recently, a team at the University of Manchester released a study about a system that identifies users by the way they walk.  The system uses pressure pads on the floor and a high-resolution camera to authenticate users based on their footsteps.  The article states that the team “compiled a database consisting of 20,000 footstep signals from more than 120 individuals.  It’s now the largest footsteps database in existence.”  I would love to hear the debate between this researcher and the researcher who has the second-largest footsteps database in existence.

I understand the importance of having a large data set to cross-validate and test an algorithm, but Machine Learning marketing should focus on the algorithm and not the data set size.  For instance, with algorithms that suffer from high bias (AKA underfitting), increasing the size of the data set will not have much of an impact, because the model is too simple to capture the underlying pattern no matter how many examples it sees.  Additionally, certain algorithms (e.g., Support Vector Machines) can be very slow to train if the data set is too large.  If the speed of your Machine Learning system is important, this should matter to you.
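To make the high-bias point concrete, here is a minimal sketch, assuming a toy problem that is not taken from any product or study mentioned in this post: a straight line is fitted to data that actually follows a curve.  The function name, the quadratic target, the noise level, and the sample sizes are all illustrative assumptions.

```python
import numpy as np

def underfit_test_error(n_train, seed=0):
    """Fit a straight line to quadratic data and return the test mean squared error."""
    rng = np.random.default_rng(seed)
    # Hypothetical ground truth: y = x^2 plus noise with variance 0.25.
    x_train = rng.uniform(-3, 3, n_train)
    y_train = x_train**2 + rng.normal(0, 0.5, n_train)
    x_test = rng.uniform(-3, 3, 5000)
    y_test = x_test**2 + rng.normal(0, 0.5, 5000)
    # A degree-1 polynomial is too simple for this data (high bias).
    slope, intercept = np.polyfit(x_train, y_train, deg=1)
    pred = slope * x_test + intercept
    return float(np.mean((y_test - pred) ** 2))

errors = {n: underfit_test_error(n) for n in (100, 1000, 10000, 100000)}
for n, err in errors.items():
    print(f"{n:>7} training samples -> test MSE {err:.2f}")
```

In this toy setup the test error should stay roughly flat as the training set grows a thousandfold, and it remains far above the noise variance of 0.25.  The extra data cannot fix a model that is the wrong shape for the problem.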

My Feature Set is Bigger

In Machine Learning, features are the inputs used to predict an outcome.  For user authentication, features include motion sensor data, keyboard clicking rhythm, GPS location, etc.  I recently saw an advertisement for a Machine Learning system that claimed its algorithm was the best because it used 1,000 features.

This Machine Learning marketing was claiming that the more features an algorithm uses, the better it is.  However, if your algorithm suffers from high variance (AKA overfitting), the number of features should be reduced, not increased.  Additionally, performing linear algebra operations on matrices with a high number of features can consume valuable resources.  This is an issue for Machine Learning systems that are designed to run on low-power, low-compute hardware such as mobile devices.

My Number of Iterations is Bigger

Researchers often base the superiority of their prediction on the number of times the underlying algorithm was run.  For instance, a recent study on the World Cup found that Germany had a 12.8% chance of winning.  As the Machine Learning marketing made clear, the support for this prediction was the fact that the algorithm was run 100,000 times.  Despite the number of iterations, Germany crashed out in the first round.  Now you can blame the algorithm, the human element, or simply the fact that the study was performed by German researchers and was therefore biased from the start.  One thing is clear, though: the number of iterations had little impact on the accuracy of this Machine Learning algorithm.
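One way to see why the run count says little about accuracy is the sketch below.  It is a hypothetical simulation and not the method used in the World Cup study: it assumes a model that already believes a team wins with probability 0.128 and simply replays that belief many times.  The function names and run counts are illustrative.

```python
import math
import random

def simulate_win_rate(assumed_p, n_runs, seed=0):
    """Replay a model's assumed win probability n_runs times and return the observed rate."""
    rng = random.Random(seed)
    wins = sum(rng.random() < assumed_p for _ in range(n_runs))
    return wins / n_runs

def standard_error(p, n_runs):
    """Standard error of a simulated proportion: sqrt(p * (1 - p) / n_runs)."""
    return math.sqrt(p * (1 - p) / n_runs)

assumed_p = 0.128  # the model's own assumption, which may or may not be right
results = {n: simulate_win_rate(assumed_p, n) for n in (100, 10_000, 100_000)}
for n, rate in results.items():
    print(f"{n:>7} runs -> simulated win rate {rate:.4f} "
          f"(standard error {standard_error(assumed_p, n):.4f})")
```

More runs shrink the standard error, so the simulated rate settles ever closer to the model's own assumption.  They do nothing to check whether that assumption matches reality, which is why iteration count is a measure of precision rather than accuracy.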

I believe the lesson from these examples is that we must cut through size-based Machine Learning marketing and challenge developers to justify their choices.  You wouldn’t select a software package simply because it was written by 10,000 engineers, would you?  We should ask developers why they picked specific features.  Are all the selected features necessary, or are some so highly correlated with each other that a few can be removed?  How is the large data set being used to improve the algorithm?  Also, how will the algorithm design affect its performance on your specific hardware?
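The question about correlated features can be checked directly.  The following is a minimal sketch, assuming a simple greedy rule and a 0.95 correlation threshold, both of which are illustrative choices rather than a recommended standard; the helper name `drop_correlated` and the synthetic columns are hypothetical.

```python
import numpy as np

def drop_correlated(X, threshold=0.95):
    """Return indices of columns kept after greedily dropping any column whose
    absolute correlation with an already-kept column exceeds the threshold."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    kept = []
    for j in range(X.shape[1]):
        if all(corr[j, k] <= threshold for k in kept):
            kept.append(j)
    return kept

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
# Column 2 is a noisy copy of column 0; column 3 is a rescaled copy of column 1.
X = np.column_stack([a, b, a + rng.normal(0, 0.05, 500), 2.0 * b + 1.0])

kept = drop_correlated(X)
print(f"kept {len(kept)} of {X.shape[1]} features: columns {kept}")
```

Here two of the four columns add almost no new information, so a vendor advertising "four features" would really be offering two.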

We need to refocus Machine Learning marketing away from size and towards justification of the Machine Learning model.

Eric Sharret is Vice President of Business Development at TELEGRID.  TELEGRID has unique expertise in secure authentication, PKI, Multi-Factor Authentication, and secure embedded systems.

Disclaimer: The opinions expressed here do not represent those of TELEGRID Technologies, Inc.  The Company will not be held liable for any errors, omissions, or delays in this information or any losses, injuries, or damages arising from its display or use.  All information is provided on an as-is basis.