Saturday, 5 August 2017

Post-statistics: Lies, damned lies and data science patents

US Patent (Wikipedia)
Statistics is such an important field in our daily lives nowadays. Data science, an emerging field some 50 years old, is now applied to almost every human activity. One could call it post-statistics, a kind of post-rock, fusing operations research, data mining, software and performance engineering and, of course, a multitude of fields from statistics to machine learning. The reputation of statistics is a bit shaky due to quotes like "Lies, damned lies, and statistics". Even so, post-statistics is still emerging and drives innovation in almost all industries that run on data.
One of the most important characteristics of data science appears to be an ideal it shares with the open source movement, i.e. free software. Note that "free" here means the freedom to use the source code and to share recipes, for example workflows or combinations of algorithms. The innovation in data science we have witnessed over the last 5 years or so has been fundamentally driven by this attitude, which is embraced by giants like Microsoft, Google and IBM, who support a huge number of enthusiastic individuals from industry and academia. These technology giants open source their workflows and tools, such as TensorFlow, to the entire community, and support the community via events or by investing in research that is partly made public. On the other hand, patents are traditionally designed to encourage a culture of innovation and invention. They are a kind of gift and a natural right for the innovator: for a given time frame, he, she or the organisation reaps some benefits.

A recent patent on predicting data science project outcomes, unfortunately, does not entirely serve this purpose: Data science project automated outcome prediction, US 9710767 B1. Even though it is a very well written patent and its scope reads as very restricted at first, the core principle is identical to the standard workflow a data science professional applies in a daily routine: producing an automated outcome prediction. The interpretation of 'data science project' is open to any activity involving outcome prediction. I am of course no legal expert, but based on this patent, which claims to invent an outcome prediction pipeline for a 'data science project', scikit-learn's workflow manager, pipelines, could be taken to court, since it facilitates the exact same outcome prediction pipeline this patent claims to invent. It does not matter how enforceable this is; it gives the patent holder an opportunity to sue everyone doing automated data science outcome prediction.
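
To show how ordinary such a workflow is, here is a minimal sketch in plain Python of a preprocessing step chained to a predictive model behind a single fit/predict interface. The class names (Standardizer, NearestMeanClassifier, SimplePipeline) and the toy data are hypothetical and made up for illustration. This is not scikit-learn's API and not the method described in the patent, only the general pattern that scikit-learn's Pipeline class offers in a far more complete form.

```python
from statistics import mean, pstdev


class Standardizer:
    """Preprocessing step: rescale one numeric feature to zero mean, unit spread."""

    def fit(self, values):
        self.mean_ = mean(values)
        # Guard against a zero spread so transform never divides by zero.
        self.std_ = pstdev(values) or 1.0
        return self

    def transform(self, values):
        return [(v - self.mean_) / self.std_ for v in values]


class NearestMeanClassifier:
    """Final step: predict the label whose training mean is closest."""

    def fit(self, values, labels):
        grouped = {}
        for v, label in zip(values, labels):
            grouped.setdefault(label, []).append(v)
        self.centroids_ = {label: mean(vs) for label, vs in grouped.items()}
        return self

    def predict(self, values):
        return [
            min(self.centroids_, key=lambda label: abs(v - self.centroids_[label]))
            for v in values
        ]


class SimplePipeline:
    """Chain preprocessing steps and a final model behind fit/predict."""

    def __init__(self, steps, model):
        self.steps = steps
        self.model = model

    def fit(self, values, labels):
        # Fit each step on the training data and pass its output to the next.
        for step in self.steps:
            values = step.fit(values).transform(values)
        self.model.fit(values, labels)
        return self

    def predict(self, values):
        # Reuse the fitted steps, then let the model predict the outcome.
        for step in self.steps:
            values = step.transform(values)
        return self.model.predict(values)


# Toy data, invented for illustration: hours of effort -> project outcome.
hours = [1.0, 2.0, 3.0, 8.0, 9.0, 10.0]
outcomes = ["fail", "fail", "fail", "success", "success", "success"]

pipeline = SimplePipeline([Standardizer()], NearestMeanClassifier()).fit(hours, outcomes)
predictions = pipeline.predict([2.5, 9.5])
print(predictions)
```

Anything that chains data preparation to a fitted model and then emits a prediction has this shape, which is why the broad reading of the patent is worrying.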

This patent, US 9710767 B1, is a tremendous disservice to the entire data science community and damaging to an industry and to professionals who are trying to use data in outcome prediction for the greater good of society and to solve problems. We definitely do not claim that data science is the solution to our problems in general, but it will help us to tackle important problems in industry and society. So maybe in the post-statistics world, we have to yell: lies, damned lies and data science patents. Holders of such patents may look like patent sharks or trolls rather than people with the intention of innovating or inventing.


Marcos Sanches said...

You, like many people working with data science, which has little of science, seem to understand little of statistics. I cannot see a post-statistics world, also because data science is useless in addressing statistical questions. For example, a clinical trial, or who is going to win the American election. Statistics may get it a bit wrong, but data science does not even try. So statistical theory, strongly founded in mathematics, will remain valid forever, despite folks, among them a lot of data scientists, who use statistics incorrectly and give it a bad reputation.

msuzen said...

@Marcos Sanches: Sorry, I did not understand what your objection was. This post was about a recent patent that seemed to patent a workflow that data scientists use daily. Was it the term 'post-statistics' you didn't like? Post-statistics does not mean without statistics of course. I am not sure what makes you think that data science and statistics are mutually exclusive.

msuzen said...

Unfortunately, from time to time I get trolls on this blog whose purpose is not to contribute but to create a toxic environment. As I believe in free speech, I let those who give their real names post.

Marcos Sanches said...

First, I apologize if I misinterpreted your post.

My problem is with the term 'post-statistics', and I think that is quite clear. What does it mean if not the end of statistics as we know it? I find the term VERY demeaning. Could you define it, please?

Data science folks and others, among them even statisticians, by using terms like 'post-statistics' and 'Lies, damned lies and statistics', only contribute to the misleading idea that statistics is useless, just lies, a way to mislead the uninformed. I would suggest you help get rid of people who do statistics incorrectly and spread this kind of misinformation; in doing that you would help us get rid of bad science. Or, don't talk about statistics; stick to data science.

I said "you (...) SEEM to". I do question how much you know of statistics, given what you said (see paragraph above) and given how many data scientists know very little of statistics. For the record, I could not care less about your PhD credentials or those of anybody else for that matter; like you said, let's discuss facts, not brag about our credentials.

You surely went personal, calling me directly a troll. I cannot see the basis for that. You did not give me time to reply. You stalked me on my work emails. All this despite the fact that I haven't posted the comment anonymously. All this makes me think I should have posted anonymously.

Finally, you said:

"I know so many people working in data science with superb academic credentials and with very high-quality statistics knowledge. So your premise is not true."

To which I agree. I have never said all data scientists are bad, ignorant or anything like that. So, this is just you trying to attack me personally again.

DouglasSkinner said...

I generally agree with what you said about the patent. I'd take it a bit further. I'm against intellectual property as a whole. To the extent there is intellectual property, it should do two things: protect an author's identity by giving him protection against those who would use his name against his wishes; and protect the author from claims by others that something he invented or created actually came from them. I do not think inventors, authors, etc. are entitled to royalties any more than any other enterprise. I don't accept the argument which says we have to provide such guarantees in order to stimulate inventions and various creative productions. The late F. A. Hayek makes a convincing argument that there is very little evidence to support this notion.

msuzen said...

@douglasskinner I think we need to look at Intellectual Property (IP) in terms of cost/benefit. Traditionally IP brings a competitive advantage to an individual or organisation. If, in the near future, sharing IP brings more benefit than the competitive advantage of keeping it, then maybe as a society or as individuals we would choose to share our IP with the community, not only because we are charitable people but because sharing will bring us more benefit than keeping it.

msuzen said...

@Marcos Sanches Thank you for your comments and contributions. You are very welcome to post your opinions as a professional statistician here. Sincere apologies if you are offended as a statistician by the term 'post-statistics'. All perspectives are welcome here.

I could be wrong on what constitutes 'post-statistics'. But I refer to 'post-statistics' as the capabilities we have today from a computational perspective in statistics. I use 'Post' there as 'Modern'. I didn't mean after statistics or the end of statistics. Statistics as a field is with us and, as you pointed out, its mathematical fundamentals are rock solid. A recent book from Efron-Hastie, "Computer Age Statistical Inference", has a collection of chapters called "Twenty-First-Century Topics"; probably those topics could be called 'post-statistics' in my opinion. Regarding the title of the post, it was just a word play on the phrase 'Lies, damn lies and statistics'.

The mentioned patent was the main reason I have written this blog post.
What do you think about this particular patent as a professional statistician? Do you think we should have patents on general statistical procedures?

Marcos Sanches said...

@msuzen. I was pleasantly surprised by your conciliatory and positive reply to my comment, given that I find it difficult to engage in civilized discussions on the internet nowadays. So, thanks for that and also for the link. I was not aware of Efron's book, which is quite an interesting book.

I still disagree with you. I think people usually interpret things their own way, and the immediate interpretation, which was my interpretation, is not what you meant. To be clear, I defend your right to use the expression and to say whatever you want about statistics; this is not about censoring. But I tend to make my opinion clear when I think expressions do more harm than good to a field that is already much misunderstood.

I do agree with you in terms of the patent, and I like the comment from @douglasskinner. I think things like data science, statistics, tech, medicine... are all important to lift people out of the misery they find themselves in in many places. It is not capitalism, markets, money... it is just technology. Then also, I think that many patents are for things that in large part were developed with taxpayers' money, either because the groundwork was done in public institutions like universities, or because the education that allowed the development came from public funds.

(c) Copyright 2008-2020 Mehmet Suzen (suzen at acm dot org)

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License