Friday, 27 September 2019

On the modern data scientist: A blind empiricist is not a data scientist

Preamble

In the early 2010s, a good definition of a data scientist would have been someone who is a better software developer than a statistician and a better statistician than a software developer. By the late 2010s, the trend had changed dramatically: a data scientist is now identified as someone who can run any data set through machine learning libraries and obtain a model to deploy as a service. Unfortunately, this blind empiricism is now regarded as data science practice in many industrial settings. The term "scientist" has lost its intellectual meaning and has turned into the mass hysteria of producing "junk science" blindly in the name of the "democratisation of data science".
[Image: Hubble Space Telescope (Wikipedia)]
Computational science is to the modern data scientist as telescopes are to astrophysics.

Who is the modern data scientist? 

Modern data science goes beyond statistics and machine learning. Modern data scientists practise computational science, from dynamical systems to game theory and graph theory. One could also think of such practice as applied mathematics or statistical physics. For example, many neural network models, such as Hopfield networks and Boltzmann machines, have their origins in statistical physics. In that sense, a modern data scientist is a computational scientist building the mechanics of data. Their practice has the following characteristics:


  1. Exploratory analysis that goes beyond basic PCA or clustering, in order to form causal relationships or establish the mechanics of the data.
  2. Expressing the mechanics of data in mathematical models and building parametric inference. Not all parameter estimation is learning.
  3. Using machine learning algorithms from libraries while knowing the underlying algorithm and being able to relate it to the mechanics of data.
  4. Building algorithms that fuse the above work.
  5. Producing explainable and transparent work.
  6. Documenting the findings as in a scientific paper and scientific software.
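
As a minimal sketch of item 2 above, assume the mechanics of the data are exponential decay, x(t) = x0 exp(-k t). This model, the function names simulate_decay and fit_decay, and all numerical values are illustrative assumptions, not taken from any particular data set. The parameters are estimated by ordinary least squares on the log-transformed observations, which is parameter estimation without any machine learning library.

```python
import math
import random


def simulate_decay(x0, k, times, noise_sd=0.0, seed=0):
    """Generate observations from x(t) = x0 * exp(-k t).

    Noise is multiplicative (Gaussian on the log scale); this is an
    assumption made so that values stay positive.
    """
    rng = random.Random(seed)
    return [x0 * math.exp(-k * t + rng.gauss(0.0, noise_sd)) for t in times]


def fit_decay(times, values):
    """Estimate (x0, k) by least squares on log x = log x0 - k t."""
    n = len(times)
    logs = [math.log(v) for v in values]
    t_mean = sum(times) / n
    y_mean = sum(logs) / n
    sxx = sum((t - t_mean) ** 2 for t in times)
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in zip(times, logs))
    slope = sxy / sxx
    intercept = y_mean - slope * t_mean
    # The slope of the log-linear fit is -k; the intercept is log x0.
    return math.exp(intercept), -slope


# Hypothetical experiment: 20 measurements taken every 0.5 time units.
times = [0.5 * i for i in range(20)]
observed = simulate_decay(x0=10.0, k=0.3, times=times, noise_sd=0.05, seed=42)
x0_hat, k_hat = fit_decay(times, observed)
print(f"estimated x0 = {x0_hat:.3f}, estimated k = {k_hat:.3f}")
```

Here k carries a physical meaning (a decay rate), which is what distinguishes estimating the parameters of a mechanistic model from fitting a black box.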

Ignoring the above practice and treating data science like a web-based software development activity is unfair and an immense waste of time. Organisations should understand that investing in data science means investing in the new computational science of building the mechanics of data. Turning the outcome of such scientific practice into real-world impact depends on the novelty the scientist brings and, as with any scientific funding, this is a very risky investment.

Misconception in the democratisation of data science

The democratisation of data science does not mean that anyone should build learning or statistical models using machine learning libraries and feed in lots of data to get a black-box model, as a blind empiricist would. Democratisation was about the availability of tools and services at very low cost and an open culture of transparency in algorithmic and software work.

Artificial Intelligence is modern data science

The separation of AI from the above definition of data science is not really clear: AI combines the same characteristics to build so-called intelligent agents.

Conclusion

Having a perspective on, and an understanding of, what modern data science is about will help organisations orient themselves better when building modern data science capabilities.

Postscript: On the mechanics of data and further reading

We used the term "the mechanics of data". It implies the effort put into finding signatures of causal relationships and making sense of the correlations within the data. The reason is that one of the core scientific methods that gave rise to modern science lies in Newton-Leibniz mechanics. Coveney and his co-workers dive deep into the intricacies of practising science and data science:
  • Big data: the end of the scientific method?
    Sauro Succi and Peter V. Coveney
    [article]
  • Big data need big theory too
    Peter V. Coveney, Edward R. Dougherty and Roger R. Highfield
    [article]

3 comments:

Peter M. said...

I really like this perspective. Well done, sir.

Robin said...

Hi Mehmet,

Thanks for the post. There are definitely some key points you go over here. However, one question: I saw this on LinkedIn with the post "Agile manifesto is designed for data science". Could you tell me your thoughts on why you think the agile manifesto itself (https://agilemanifesto.org/) - and not specific agile implementations or methodologies - is inconsistent with your post above?

msuzen said...

Hi Robin; the Agile manifesto has nothing to do with data science. It is used in software product development as an umbrella term, and a manifesto itself is a religious movement. Data science is an emerging academic field. Unfortunately, agile methodology has been religiously forced upon data science teams, and now in most places data scientists are actually treated like software developers. Please see my other post for my views on how core data science should run:
"Core principles of sustainable data science, machine learning and AI product development: Research as a core driver"
http://memosisland.blogspot.com/2019/01/core-principles-of-sustainable-data.html

(c) Copyright 2008-2020 Mehmet Suzen (suzen at acm dot org)

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License