Stocks, Significance Testing & p-Hacking: How volatile is volatile?

October is historically the most volatile month for stocks, but is this a persistent signal or just noise in the data? Stocks, Significance Testing & p-Hacking. Follow me on Twitter (twitter.com/pdquant) for more. Over the past 32 years, October has been the most volatile month on average for the S&P500 and December the least, in this article we will use simulation to assess the statistical significance of this observation and to what extent this observation could occur by chance. All code […]

Read more

Visually Explained: How Can Executives Grasp What Programming Is All About?

Quite often, non-technical executives have difficulties understanding what programming, on a very fundamental level, is all about. Because of that knowledge-gap, they tend to hire and overburden experienced data professionals with tasks which they are hopelessly overqualified for. Such as, for example, doing ad-hoc SQL queries on CRM data: “You’re the go-to-guy for all things data, and we need the results for the board meeting tomorrow.” That’s a quite humbling and frustrating experience for anyone who calls himself a Data […]

Read more

Training with historical data! Surely, you’re joking says the IoT asset that just got connected

By Priya Sharma – Sr. Data Scientist -IoT Analytics, SAS Institute Inc. Saurabh Mishra – Product Management, IoT, SAS Institute Inc. June 12, 2020 Description: Majority of AI approaches are based on the construct of training against historical data and then inferencing new data. While this is a sound and proven approach, a lot of IoT assets coming online don’t have historical data and we don’t necessarily have the time to wait. Modern Machine Learning methods can be employed to […]

Read more

Random Forests Algorithm

One of the most popular methods or frameworks used by data scientists at the Rose Data Science Professional Practice Group is Random Forests. The Random Forests algorithm is one of the best among classification algorithms – able to classify large amounts of data with accuracy. Random Forests are an ensemble learning method (also thought of as a form of nearest neighbor predictor) for classification and regression that construct a number of decision trees at training time and outputting the class that is […]

Read more

Python Scikit-learn to simplify Machine learning : { Bag of words } To [ TF-IDF ]

Text (word) analysis and tokenized text modeling always give a chill air around ears, specially when you are new to machine learning. Thanks to Python and its extended libraries for its warm support around text analytics and machine learning. Scikit-learn is a savior and excellent support in text processing when you also understand some of the concept like “Bag of word”, “Clustering” and “vectorization”. Vectorization is  must-to-know technique for all machine leaning learners, text miner and algorithm implementor. I personally consider […]

Read more

Machine Learning – Anomaly Detection: “Finding a Needle in a Haystack”

After exploring formulation, classification, benchmarking, we explore another facet of Machine Learning: anomaly detection. This part is key in the IoT transformation, as it enables internet-connected AI devices to alert, adapt and respond accordingly. Once properly trained, an IoT could not only warn and prevent imminent failure, but also execute a response, adaptive to the anomaly detected. In this process, we’ll explore intrinsic hurdles that makes the anomaly detection process a non-trivial task of “finding a needle in haystack”. Opportunities abound to explore, and any univariate sequential […]

Read more

Naive Bayes Classification explained with Python code

Machine Learning is a vast area of Computer Science that is concerned with designing algorithms which form good models of the world around us (the data coming from the world around us). Within Machine Learning many tasks are – or can be reformulated as – classification tasks. In classification tasks we are trying to produce a model which can give the correlation between the input data  and the class  each input belongs to. This model is formed with the feature-values of the input-data. […]

Read more

Deep Learning with TensorFlow in Python

Classifying the letters with notMNIST dataset Let’s first learn about simple data curation practices, and familiarize ourselves with some of the data that are going to be used for deep learning using tensorflow. The notMNIST dataset to be used with python experiments. This dataset is designed to look like the classic MNIST dataset, while looking a little more like real data: it’s a harder task, and the data is a lot less ‘clean’ than MNIST. Preprocessing First the dataset needs to be downloaded and […]

Read more

Recommender Engine – Under The Hood

Many of us are bombarded with various recommendations in our day to day life, be it on e-commerce sites or social media sites. Some of the recommendations look relevant but some create range of emotions in people, varying from confusion to anger. There are basically two types of recommender systems, Content based and Collaborative filtering. Both have their pros and cons depending upon the context in which you want to use them. Content based: In content based recommender systems, keywords […]

Read more

Machine Learning with Signal Processing Techniques

Stochastic Signal Analysis is a field of science concerned with the processing, modification and analysis of (stochastic) signals. Anyone with a background in Physics or Engineering knows to some degree about signal analysis techniques, what these technique are and how they can be used to analyze, model and classify signals. Data Scientists coming from a different fields, like Computer Science or Statistics, might not be aware of the analytical power these techniques bring with them. In this blog post, we […]

Read more