Small quantitative changes add up, and eventually reach a tipping point where you see a large qualitative change, like water turning to ice.
“It is said that there are no sudden changes in nature, and the common view has it that when we speak of a growth or a destruction, we always imagine a gradual growth or disappearance. Yet we have seen cases in which the alteration of existence involves not only a transition from one proportion to another, but also a transition, by a sudden leap, into a … qualitatively different thing; an interruption of a gradual process, differing qualitatively from the preceding, the former state”
G. W. F. Hegel (1770-1831)
German philosopher, first to write about the hype cycle and the tipping point
import plotly
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
from plotly.graph_objs import *
from plotly.figure_factory import create_table
init_notebook_mode(connected=True)
import numpy as np
import pandas as pd
import scipy
import statsmodels
import statsmodels.api as sm
from statsmodels.formula.api import ols
# create a data set: cosine and sine waves plus a small linear trend and random noise
nobs = 4000
x = np.linspace(0, 6*np.pi, num=nobs)
y = -np.cos(x) + x*0.05 + np.random.normal(0, 0.25, nobs)
z = np.sin(x) + x*0.05 + np.random.normal(0, 0.25, nobs)
df = pd.DataFrame({'x' : x,'y': y,'z': z})
# chart it
def mychart(*args):
    # pass some 2d n x 1 arrays, x, y, z
    # 1st array is independent vars
    # reshape to 1 dimensional array
    x = args[0].reshape(-1)
    # following are dependent vars plotted on y axis
    data = []
    for i in range(1, len(args)):
        data.append(Scatter(x=x,
                            y=args[i].reshape(-1),
                            mode='markers',
                            marker=dict(size=2)
                            ))
    layout = Layout(
        autosize=False,
        width=800,
        height=600,
        yaxis=dict(
            autorange=True))
    fig = Figure(data=data, layout=layout)
    return iplot(fig)  # , image='png' to save notebook w/static image
mychart(x,y)
#table = create_table(df)
#iplot(table, filename='mydata')
# Very Important
### These are *in-sample* statistics on the training data
### Hence the disclaimer
formula = 'y ~ x'
model = ols(formula, df).fit()
ypred = model.predict(df)
model.summary()
Dep. Variable: | y | R-squared: | 0.114 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.114 |
Method: | Least Squares | F-statistic: | 514.3 |
Date: | Fri, 21 Sep 2018 | Prob (F-statistic): | 3.32e-107 |
Time: | 10:46:29 | Log-Likelihood: | -4493.4 |
No. Observations: | 4000 | AIC: | 8991. |
Df Residuals: | 3998 | BIC: | 9003. |
Df Model: | 1 | | |
Covariance Type: | nonrobust | | |

variable | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
---|---|---|---|---|---|---|
Intercept | 0.0100 | 0.024 | 0.426 | 0.670 | -0.036 | 0.056 |
x | 0.0490 | 0.002 | 22.677 | 0.000 | 0.045 | 0.053 |
Omnibus: | 2872.469 | Durbin-Watson: | 0.225 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 232.553 |
Skew: | 0.003 | Prob(JB): | 3.17e-51 |
Kurtosis: | 1.819 | Cond. No. | 21.9 |
mychart(x,y,np.array(ypred))
If you don't have a good linear model, you need to find a better model: transform the data, add variables, or change the functional form. If your data violates the assumptions of OLS, you need to understand why and fix it. You can't just throw data at statistical models without knowing what you're doing.
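For example, we happen to know the toy data above was generated from a cosine wave plus a trend, so adding that term to the formula repairs the fit; a minimal sketch using the same statsmodels formula interface (in real problems you don't get to peek at the data-generating process):

formula2 = 'y ~ x + np.cos(x)'   # patsy formulas allow numpy transforms of the predictors
model2 = ols(formula2, df).fit()
print(model2.rsquared)           # should be far above the 0.114 of the straight-line fit
mychart(x, y, np.array(model2.predict(df)))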
# visualizing how everything is an outlier when you add dimensions
# curse of dimensionality
import matplotlib.pyplot as plt
# set up the figure
fig = plt.figure()
ax = fig.add_subplot(111)
ax.set_xlim(0,11)
ax.set_ylim(0,11)
# draw lines
xmin = 0
xmax = 10
y = 5
height = 1
plt.hlines(y, xmin, xmax)
plt.hlines(y, 2.5, 7.5, color='r')
plt.vlines(xmin, y - height / 2., y + height / 2.)
plt.vlines(xmax, y - height / 2., y + height / 2.)
plt.vlines(2.5, y - height / 2., y + height / 2., color='r')
plt.vlines(7.5, y - height / 2., y + height / 2., color='r')
# add numbers
plt.text(xmin - 0.1, y, '0', horizontalalignment='right')
plt.text(xmax + 0.1, y, '100', horizontalalignment='left')
plt.axis('off')
plt.show()
fig = plt.figure()
ax = fig.add_subplot(111, aspect='equal')
plt.axis('on')
#plt.axes().set_aspect('equal', 'datalim')
# half the side length of an inner square whose area is 50 (50% of the 10 x 10 square)
offset = np.sqrt(50)/2
ax.set_xlim(0,10)
ax.set_ylim(0,10)
# (or if you have an existing figure)
# fig = plt.gcf()
# ax = fig.gca()
square1 = plt.Rectangle((0, 0), width=10, height=10, color='g', alpha=0.2)
square2 = plt.Rectangle((5-offset, 5-offset), width=offset*2, height=offset*2, color='blue', alpha=0.3)
ax.add_artist(square1)
ax.add_artist(square2)
plt.show()
# edge length of an inner cube that contains 50% of the unit cube's volume
sz = 0.5 ** (1/3)
start = (1-sz)/2
end = start + sz
print(start)
print(end)
# draw the inner cube (edge = 0.5 ** (1/3), about 0.79) that contains 50% of the unit cube's volume
data = [
    Mesh3d(
        x=[start, start, end, end, start, start, end, end],
        y=[start, end, end, start, start, end, end, start],
        z=[start, start, start, start, end, end, end, end],
        colorscale=[[0, 'rgb(255, 0, 255)'],
                    [0.5, 'rgb(0, 255, 0)'],
                    [1, 'rgb(0, 0, 255)']],
        intensity=[0, 0.142857142857143, 0.285714285714286,
                   0.428571428571429, 0.571428571428571,
                   0.714285714285714, 0.857142857142857, 1],
        i=[7, 0, 0, 0, 4, 4, 6, 6, 4, 0, 3, 2],
        j=[3, 4, 1, 2, 5, 6, 5, 2, 0, 1, 6, 3],
        k=[0, 7, 2, 3, 6, 7, 1, 1, 5, 5, 7, 6],
        name='y',
        showscale=True
    )
]
layout = Layout(
    xaxis=plotly.graph_objs.layout.XAxis(
        title='x',
    ),
    yaxis=plotly.graph_objs.layout.YAxis(
        title='y',
        range=[0, 1]
    )
)
fig = Figure(data=data, layout=layout)
iplot(fig, filename='3d-mesh-cube-python')
0.1031497370079501 0.8968502629920498
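The same arithmetic generalizes to any number of dimensions; a quick back-of-the-envelope sketch (not in the original notebook):

# fraction of each dimension's range needed to capture 50% of a unit hypercube's volume
for d in (1, 2, 3, 10, 100, 1000):
    print(d, round(0.5 ** (1 / d), 3))
# in high dimensions you need nearly the full range of every variable just to cover half
# the volume, so any local neighborhood of a point is almost certainly empty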
x1 = [n for n in range(1,50)]
y1 = [(0.9 ** n) for n in range(1,50)]
pd.DataFrame({'x' : x1, 'y' : y1})
index | x | y |
---|---|---|
0 | 1 | 0.900000 |
1 | 2 | 0.810000 |
2 | 3 | 0.729000 |
3 | 4 | 0.656100 |
4 | 5 | 0.590490 |
5 | 6 | 0.531441 |
6 | 7 | 0.478297 |
7 | 8 | 0.430467 |
8 | 9 | 0.387420 |
9 | 10 | 0.348678 |
10 | 11 | 0.313811 |
11 | 12 | 0.282430 |
12 | 13 | 0.254187 |
13 | 14 | 0.228768 |
14 | 15 | 0.205891 |
15 | 16 | 0.185302 |
16 | 17 | 0.166772 |
17 | 18 | 0.150095 |
18 | 19 | 0.135085 |
19 | 20 | 0.121577 |
20 | 21 | 0.109419 |
21 | 22 | 0.098477 |
22 | 23 | 0.088629 |
23 | 24 | 0.079766 |
24 | 25 | 0.071790 |
25 | 26 | 0.064611 |
26 | 27 | 0.058150 |
27 | 28 | 0.052335 |
28 | 29 | 0.047101 |
29 | 30 | 0.042391 |
30 | 31 | 0.038152 |
31 | 32 | 0.034337 |
32 | 33 | 0.030903 |
33 | 34 | 0.027813 |
34 | 35 | 0.025032 |
35 | 36 | 0.022528 |
36 | 37 | 0.020276 |
37 | 38 | 0.018248 |
38 | 39 | 0.016423 |
39 | 40 | 0.014781 |
40 | 41 | 0.013303 |
41 | 42 | 0.011973 |
42 | 43 | 0.010775 |
43 | 44 | 0.009698 |
44 | 45 | 0.008728 |
45 | 46 | 0.007855 |
46 | 47 | 0.007070 |
47 | 48 | 0.006363 |
48 | 49 | 0.005726 |
mychart(np.array(x1), np.array(y1))
With decision trees, you might use fewer or shallower trees; with ensemble models, you might use fewer models. The point is to make your model simpler so that it chases fewer outliers, is more robust, and generalizes better out-of-sample.
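As an illustration, a minimal sketch (scikit-learn assumed; not part of the original deck) of tuning tree depth by cross-validation on the toy data above:

from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

X = df[['x']].values
cv = KFold(n_splits=5, shuffle=True, random_state=0)
for depth in (1, 2, 5, 10, 20, None):
    scores = cross_val_score(DecisionTreeRegressor(max_depth=depth, random_state=0),
                             X, df['y'], cv=cv)
    print(depth, round(scores.mean(), 3))
# very shallow trees should underfit, unbounded depth should fit the noise;
# cross-validation picks the balance between bias and variance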
The 2 key things in machine learning are
Statistics | Machine learning |
---|---|
Small data | Big data (need a lot of data for complex models - exponential with #variables/kinks) |
Optimize model in-sample error | Optimize cross-validation error |
Assume linear (or some a priori functional form) | Algorithm finds model |
Choose predictors and functional form (usually parsimonious and linear) | Algorithm chooses from many predictors and models (greedy and nonlinear while being robust) |
Optimize as much as possible | Worse is usually better |
Can't overfit a parsimonious model with limited data | Use regularization to tune and find optimal balance between bias and variance |
Inference: description, prediction, attribution | Focus on prediction - attribution often opaque |
You need to know what you are doing | You need to know what you are doing |
These days big data has to be REALLY REALLY BIG before you need a Google cluster/Hadoop cluster approach.
Sometimes it makes sense to use the big-data scale-out cluster approach as opposed to scaling up to huge single instances.
Big data can refer to this type of stack, even if the data is not big data in a Google / Facebook web-scale sense or in a computational sense.
Big Data:
See also https://datascience.berkeley.edu/what-is-big-data/ - many definitions, mostly boiling down to: data of a size that was unreasonable to work with in the PC era but is tractable in a cloud environment, and big enough for machine learning. A lot of people say 'big data' when they really mean machine learning or modern predictive analytics.
You have labeled data: a sample of ground truth with features and labels. You estimate a model that predicts the labels using the features. Alternative terminology: predictor variables and target variables. You predict the values of the target using the predictors.
Regression. The target variable is numeric.
Classification. The target variable is discrete or categorical.
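A minimal sketch (scikit-learn and its bundled toy datasets assumed; not from the original deck) of the two supervised settings:

from sklearn.datasets import load_diabetes, load_iris
from sklearn.linear_model import LinearRegression, LogisticRegression

# regression: numeric target (a disease-progression score)
Xr, yr = load_diabetes(return_X_y=True)
reg = LinearRegression().fit(Xr, yr)
print(reg.predict(Xr[:3]))                # predicted numeric values

# classification: categorical target (iris species)
Xc, yc = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000).fit(Xc, yc)
print(clf.predict(Xc[:3]))                # predicted class labels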
You have a sample with unlabeled information. No single variable is the specific target of prediction. You want to learn interesting features of the data:
Anomaly detection. Which of these things are different?
Dimensionality reduction. How can you summarize a high-dimensional data set with a lower-dimensional representation that captures as much of the useful information as possible (possibly for further modeling with supervised or unsupervised algorithms)?
Representation: take a large body of text or movies or songs and create dense vectors describing them.
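A minimal sketch (scikit-learn assumed; not from the original deck) of two of these tasks on a standard toy dataset:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest

X, _ = load_iris(return_X_y=True)               # labels ignored: unsupervised

# dimensionality reduction: summarize 4 measurements in 2 components
pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_ratio_)            # information retained by each component

# anomaly detection: which observations look different from the rest?
flags = IsolationForest(random_state=0).fit_predict(X)   # -1 = anomaly, 1 = normal
print((flags == -1).sum(), 'flagged as anomalies')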
You might think unlabeled/labeled pretty much covers all the bases.
Reinforcement learning can be viewed as 'meta' supervised learning.
You are presented with a game or real-world task that responds sequentially or continuously to your inputs, and you learn to optimize behavior in the form of a policy function to maximize an objective through trial and error.
RL resembles supervised learning but at a higher level.
It's sufficiently important that nowadays people put it in its own category. (Also this is how you might implement a trading robot).
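To make the trial-and-error idea concrete, here is a minimal sketch in plain numpy (not from the original deck) of epsilon-greedy learning on a 3-armed bandit, about the simplest reinforcement-learning setting:

import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.5, 0.8])   # unknown to the learner
estimates = np.zeros(3)                  # learner's running estimate of each arm's value
counts = np.zeros(3)
epsilon = 0.1                            # exploration rate

for t in range(10000):
    # explore with probability epsilon, otherwise exploit the current best estimate
    if rng.random() < epsilon:
        arm = rng.integers(3)
    else:
        arm = int(np.argmax(estimates))
    reward = rng.normal(true_means[arm], 1.0)
    counts[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]   # incremental mean update

print(np.round(estimates, 2))   # estimates should approach the true means, mostly pulling arm 2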
Vision
Audio
The narrowness and brittleness of deep learning is its Achilles heel - it doesn't yet work well in unstable, hostile environments
This deck is based on a couple of blog posts I did
Get the Anaconda distribution and dive in via a course or tutorials
Python resources
Online books
Tutorials
Other good reads
Practice
MOOCs
Frameworks
Textbooks
Blogs/roadmaps
Data Science Hierarchy of Needs
Data engineering is 80% of the battle.
http://mattturck.com/the-new-gold-rush-wall-street-wants-your-data/
Matt Turck - http://mattturck.com/wp-content/uploads/2018/07/Matt_Turck_FirstMark_Big_Data_Landscape_2018_Final.png
Jupyter Notebook https://github.com/druce/HFTC2018Q3/blob/master/HFTC.ipynb
git clone https://github.com/druce/HFTC2018Q3 ( Jupyter notebook or HFTC.slides.html )
Follow @streeteye on Twitter