BigConnect Discovery Data Science
This tutorial will teach you the basics of data science with BigConnect Discovery. The main points addressed below are:
  • Using Jupyter Notebooks with data from BigConnect Discovery
  • Training a ML model for sales forecast
BigConnect Discovery integrates seamlessly with Jupyter Notebooks. You can create notebooks for each Workbook to address use cases such as custom data transformation, data science, and any other scenario you can think of.

Prerequisites

1. Follow the Preparing data section of the BigConnect Discovery Basics tutorial to create the salesdata datasource.

Tutorial

1. The first step is to connect Jupyter Notebook to BigConnect Discovery. Log in to BigConnect Discovery and choose MANAGEMENT / Notebook / Notebook Server from the menu. Click Add a server on the Notebook page:
2. Choose jupyter for the Type, enter http://localhost:9888 for the URL and BDL Jupyter Server for Name and click Done.
This sets up the connection to the Jupyter Notebook server provided by BDL.
3. Head to WORKSPACE / Admin Workspace from the menu and connect the notebook server to our workspace. Click the three dots on the right-hand side of the top pane and choose Set notebook server:
4. Select BDL Jupyter Server and click Done.
5. A new button named Notebook will be shown on the bottom pane of the Admin Workspace. Click it to create a new notebook.
6. Choose Datasource as the source type, select our previously created salesdata datasource and click Next.
7. Choose PYTHON for the Development language (R is not installed in the Sandbox), enter Sales Forecast for Name and click Done.
8. You will be taken to the notebook details page. Click on the Detail link to open the actual Jupyter Notebook.
9. The Jupyter Notebook will be opened in a new tab. It already contains the necessary code to load data from our datasource.
10. Copy the following piece of code into the empty analysis cell and run the notebook:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import linear_model
from statsmodels.tsa.stattools import acf, pacf
from statsmodels.tsa.arima_model import ARIMA
from sklearn.model_selection import train_test_split

df_sales = pd.json_normalize(dataset)
df_sales.head()
```
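To see what pd.json_normalize does outside the notebook, here is a minimal standalone sketch. The sample records below are made up for illustration; in the notebook, the dataset variable is provided by BigConnect Discovery:

```python
import pandas as pd

# Hypothetical records standing in for the BDL-provided `dataset` variable
dataset = [
    {"OrderDate": "21-01-05", "Sales": 120.0, "customer": {"region": "EU"}},
    {"OrderDate": "21-01-06", "Sales": 80.5, "customer": {"region": "US"}},
]

# json_normalize flattens nested fields into dotted column names
df = pd.json_normalize(dataset)
print(df.columns.tolist())  # → ['OrderDate', 'Sales', 'customer.region']
```

Nested attributes become columns like customer.region, so the flattened frame can be grouped and aggregated like any other pandas DataFrame.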
11. Add another cell with the following code and run the notebook again. Note that the OrderDate values are cast to float before being passed to scikit-learn, which does not accept raw datetime64 values as features:

```python
df_sales['OrderDate'] = pd.to_datetime(df_sales['OrderDate'], format='%y-%m-%d')
df_sales = df_sales.groupby('OrderDate').Sales.sum().reset_index()
train, test = train_test_split(df_sales, test_size=0.2)

regr = linear_model.LinearRegression()
regr.fit(train.OrderDate.values.astype(float).reshape(-1, 1), train.Sales.values.reshape(-1, 1))

# Make predictions using the testing set
y_pred = regr.predict(test.OrderDate.values.astype(float).reshape(-1, 1))
df_new = test.copy()
df_new['pred'] = y_pred

ax = df_sales.plot(x='OrderDate', y='Sales', color='black', style='.')
df_new.plot(x='OrderDate', y='pred', color='orange', linewidth=3, ax=ax, alpha=0.5)
```
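Beyond the visual check, you may want numeric quality metrics for the regression. The sketch below shows the idea with synthetic data so it runs standalone; in the notebook you would pass your own train/test splits instead:

```python
import numpy as np
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic "daily sales": a linear trend plus Gaussian noise
rng = np.random.default_rng(0)
X = np.arange(100, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + 5.0 + rng.normal(scale=3.0, size=100)

# Fit on the first 80 points, evaluate on the last 20
regr = linear_model.LinearRegression()
regr.fit(X[:80], y[:80])
y_pred = regr.predict(X[80:])

print("MSE:", mean_squared_error(y[80:], y_pred))
print("R^2:", r2_score(y[80:], y_pred))
```

An R² close to 1 indicates the linear model captures most of the variance in the holdout set; a low or negative R² suggests a plain linear trend is a poor fit and a time-series model (such as ARIMA, already imported above) may be worth trying.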
12. Add another cell with the following code and run the notebook again to save the trained model to a file:

```python
from joblib import dump, load
dump(regr, 'my_model.joblib')
!ls
```
The output should be similar to:
The trained model is now saved on the local disk as the my_model.joblib file.
You can use it in a Data Collector pipeline to apply it on new data as it arrives.