Memo's Island: ai


Tuesday, 12 May 2020

Collaborative data science: High level guidance for ethical scientific peer reviews

Preamble

Catalan Castellers are collaborating (Wikipedia)

The availability of distributed code tracking tools and associated collaborative tools makes life much easier when building collaborative scientific tools and products. This is now especially important in data science, as it is applied in many different industries as a de facto standard. Essentially, an academic computational science field has now become an industry-wide practice.

Peer-review is a pull request

Peer reviews usually appear as pull requests. A pull request usually entails a change to the base work that achieves a certain goal. It is a nice coincidence that the acronym PR corresponds to both peer review and pull request.

Technical excellence does come with decent behaviour

Aiming at technical excellence is all we need to practise. Requesting technical excellence in PRs is our duty as peers. However, it does come with decent behaviour. PRs are tools for collaborative work, even if it isn't your project or you work in a different function. Here we summarise some high-level points for PRs. A PR can manifest as software code, an algorithmic method, or a scientific or technical article:

Don’t be a jerk: We should not request things that we do not practise ourselves, or invent new standards on the fly. If we do, then we should give a hand in implementing them.
Focus on the scope and be considerate: We should not request things that extend the task much further than its originally stated scope.
Nitpicking is not attention to detail: Attention to detail is quite valuable, but excessive nitpicking is not.
Be objective and don’t seek revenge: If one of your recommendations on a PR is not accepted by a colleague, don’t retaliate by declining her/his suggestions on your PRs, and don’t create hostility towards that person.

Conclusion

We have provided some basic, high-level guidance on peer review processes. Life is funny; there is a saying in Cyprus, and probably in Texas too: what you seed, you will harvest.

Friday, 27 September 2019

On modern data scientist: A blind empiricist is not a data scientist

As graciously mentioned by Professor Pearl on Twitter

Preamble

Hubble Space Telescope (Wikipedia)
Computational science is to the modern data scientist as telescopes are to astrophysics.

"A better software developer than a statistician and a better statistician than a software developer" would have been a good definition of a data scientist in the early 2010s. In the late 2010s, trends changed dramatically: a data scientist is now identified as someone who can run any set of data through machine learning libraries and get a model to deploy as a service. Unfortunately, this blind empiricism is now considered data science practice in many industrial settings. The term "scientist" has lost its intellectual meaning and the practice has turned into the mass hysteria of producing "junk science" blindly in the name of the "democratisation of data science".

Who is the modern data scientist?

Modern data science actually goes beyond statistics and machine learning. Modern data scientists practise computational science, from dynamical systems to game theory or graph theory. One could think of such practice as applied mathematics or statistical physics as well. For example, most neural networks actually originate from statistical physics. In that sense, a modern data scientist is a computational scientist building the mechanics of data.

Takes exploratory analysis beyond basic PCA or clustering, to be able to form causal relationships or establish the mechanics of the data.
Can express the mechanics of data in mathematical models and build parametric inference. Not all parameter estimation is learning.
Uses machine learning algorithms from libraries while knowing the underlying algorithm, and can relate this to the mechanics of data.
Builds algorithms fusing the above work.
Produces explainable and transparent work.
Documents the findings as in a scientific paper and scientific software.

Ignoring the above practice and treating data science like a web-based software development activity is not fair practice and is an immense waste of time. Organisations should understand that investing in data science means investing in the new computational science of building the mechanics of data. Turning the outcome of such scientific practice into real-world impact depends on the novelty the scientist brings and, as with any scientific funding, this is a very risky investment.

Misconception in democratisation of data science

The democratisation of data science does not mean that anyone should build learning or statistical models using machine learning libraries and feed in lots of data to get a black-box model, as a blind empiricist would. Democratisation was about the availability of tools and services at very low cost and an open culture of transparency in algorithmic and software work.

Artificial Intelligence is modern data science

The separation of AI from the above definition of data science is not really clear: AI combines the same characteristics to build so-called intelligent agents.

Conclusion

Having a perspective on, and an understanding of, what modern data science is about will help organisations orient themselves better in building modern data science capabilities.

Postscript: Further reading and the mechanics of data

We used the term "the mechanics of data". It implies the effort put into finding signatures of causal relationships and making sense of the correlations within the data. The reason is that one of the core scientific methods that gave rise to modern science lies in Newton-Leibniz mechanics. The articles by Coveney and his co-workers dive deep into the intricacies of practising science and data science.

Big data: the end of the scientific method?
Sauro Succi and Peter V. Coveney
[article]
Big data need big theory too
Peter V. Coveney, Edward R. Dougherty and Roger R. Highfield
[article]

Post-Postscript

Judea Pearl, a pioneering scientist in the field of causal inference (a quiet revolution in statistics and data science) and a Turing Award laureate, has a similar critique of excessive empiricism. His post explains:

Radical Empiricism and Machine Learning Research, which is also published as an article here: doi

Thursday, 3 January 2019

Core principles of sustainable data science, machine learning and AI product development: Research as a core driver

Kindly reposted on KDnuggets by Gregory Piatetsky-Shapiro

Preamble

Almost all businesses and industries have embraced machine learning (ML) technologies. Apart from ROI concerns, as it is an expensive endeavour to develop and deploy a service driven by ML techniques, sustainability, as in going beyond proof-of-concept core development, appears to be one of the roadblocks in data science. In this post, we will outline basic core principles that can help organisations achieve a sustainable AI product development cycle, apart from reproducibility issues. The aim is to give a coarse view, rather than to list fine-grained good practice advice.

Research as a core driver: Research Environment

Regardless of the size of your organisation, if you are developing machine learning or AI products, the core asset you have is a research professional, data scientist or AI scientist, regardless of their academic background. Developing a model by using software libraries blindly won't resolve the issues you might encounter after deployment of the product. For example, even a simple hyperparameter search can easily lead to research. Why? Because most probably no one has ever tried building a model or attempted a modelling task using your dataset, and you might need a different approach than ML libraries provide. Even the simplest different angle or deviation from ML libraries will lead to a research problem.

No full 'black-box' approaches.
No blind usage of software libraries.
Awareness of, and skills in, the mathematical and algorithmic aspects in detail.

Figure: A schematic of core principles for AI product development.

Separate out research code and production code

Software development is an integral part of ML product development. However, during research, code development can go very wild, and scientists, even if they are very good software developers, can end up creating hard-to-follow and poor code. Once there is confidence in the reproducibility and robustness of the results, the production code should be rewritten with high-quality software engineering principles.

Data Standardisation: Release data-sets for research

A cold start problem for ML products is to design and release data-sets before even doing any research-like work. This, of course, has to be aligned with industrial requirements. Imagine datasets like MNIST or ImageNet for benchmarking. Released sets will be the first step for any model building or product development, and would constitute a data product themselves. Data versioning is also a must.
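As a minimal sketch of what data versioning can mean in practice, a released data-set can be tagged with a hash of its content, so that any change in the data gives a new version tag. The function name dataset_version and the toy records below are hypothetical and only for illustration; real releases would likely rely on a dedicated data versioning tool.

```python
import hashlib
import json


def dataset_version(records):
    """Return a short, deterministic version tag for a list of dict records.

    Hypothetical helper for illustration: the tag is a truncated SHA-256
    digest of a canonical JSON serialisation of the records.
    """
    # Sorted keys make the tag independent of key order inside each record.
    # The order of the records themselves still matters.
    canonical = json.dumps(records, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return digest[:12]


# Toy releases, invented for illustration only.
release_v1 = [{"id": 1, "label": "cat"}, {"id": 2, "label": "dog"}]
release_v2 = release_v1 + [{"id": 3, "label": "bird"}]

tag_v1 = dataset_version(release_v1)
tag_v2 = dataset_version(release_v2)
print(tag_v1, tag_v2)
```

Any model or report can then record the tag of the release it was built on, which ties results back to one exact version of the data.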

Do not obsess with workflows: All workflows are ad-hoc

There is no such thing as a universal or generic workflow. A workflow depends on a human understanding of processes and steps. Human understanding is based on language, and linguistically there is no such thing as a universal language, at least not a practical one yet (cf. universal grammar). Loosely defined steps are sufficient for research. However, once the work enters production, a much stricter workflow design might be needed, but be aware that all workflows are ad hoc.

Do not run sprints for core data science

Agile principles are suitable for software development innovations. Sprints, or Agile in general, are not suitable for AI research and the research environment; it is a different kind of innovation than software engineering. Thinking that Agile is a cure for scientific innovation is naive wishful thinking. Structuring a research group with periodic reviews and releases of the results via presentations and detailed technical reports, on top of mini-workshops, is much more suitable for data science. Simple proposal rounds can also be run to decide which direction to invest in, akin to research proposals.

Feedback loop: Service to Business decision making back to research

A service using ML technologies should produce more data. The very first service monitoring is A/null testing, meaning measuring what would happen in the absence of the AI product. Detailed analysis of the service data would bring more insights both to the business and to research.

Produce impact assessment: A/null testing.
Quality of service: This can be measured basically by how successful the ML model is; this has to be technical.
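An A/null comparison can be sketched roughly as follows: one group is served by the AI product, a null group is not, and the difference in a service metric is checked with a permutation test. This is only an illustration under stated assumptions. The function name a_null_effect, the choice of a permutation test, and the metric values are all hypothetical and not taken from a real service.

```python
import random
from statistics import mean


def a_null_effect(with_product, without_product, n_permutations=2000, seed=0):
    """Estimate the product's effect on a metric and a permutation p-value.

    Hypothetical helper for illustration: the effect is the difference in
    group means, and the p-value is two-sided.
    """
    rng = random.Random(seed)
    observed = mean(with_product) - mean(without_product)
    pooled = list(with_product) + list(without_product)
    n_with = len(with_product)
    extreme = 0
    for _ in range(n_permutations):
        # Reassign group membership at random and recompute the difference.
        rng.shuffle(pooled)
        diff = mean(pooled[:n_with]) - mean(pooled[n_with:])
        if abs(diff) >= abs(observed):
            extreme += 1
    # The add-one correction keeps the estimated p-value strictly positive.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Metric values per user, invented for illustration only.
served_by_model = [0.31, 0.28, 0.35, 0.33, 0.30, 0.34, 0.29, 0.32]
null_group = [0.22, 0.25, 0.21, 0.24, 0.23, 0.20, 0.26, 0.22]

effect, p_value = a_null_effect(served_by_model, null_group)
print(f"effect={effect:.3f}, p={p_value:.4f}")
```

The estimated effect feeds the impact assessment, while the technical quality of the model itself still has to be measured separately.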

Conclusion and outlook

There is no such thing as a free lunch, and developing AI products won't be fully automated soon. Tools may improve productivity immensely, but AI replacing a data scientist or AI scientist is far from reality, at least for now. If you are investing in AI products, you are basically investing in research at the core; missing that important point may cost organisations a lot. The basic core principles, or variations of them, may help in sustaining AI products longer and in forming your teams accordingly.
