UCONN


Machine Learning Regression

 What is Machine Learning and Linear Regression?


Machine learning is a branch of artificial intelligence that focuses on developing algorithms that allow computers to learn patterns and make decisions based on data. It enables computers to analyze data and draw insights without explicit programming, allowing them to recognize patterns, make predictions, and improve their accuracy over time as they process additional information. Linear regression is a fundamental supervised learning technique that is commonly used to model relationships between variables and make predictions. Both are used in this project.


1. Introduction


In this project, we use linear regression to predict the closing price of a stock from its trading volume. The dataset contains various financial attributes such as opening price, closing price, adjusted closing price, and trading volume. Trading volume was selected as the independent variable and the closing price as the dependent variable. Using Python and its machine learning libraries, we aim to assess the relationship between these two features and evaluate the model's predictive performance.


Our approach involves preprocessing the data to ensure its quality. The model is trained on historical stock data and tested on its performance using relevant metrics like mean squared error (MSE) and R-squared values. Through this analysis, we assess both the strength of the relationship between trading volume and stock prices as well as the overall predictive accuracy of our model. This project not only demonstrates the practical application of machine learning in financial data analysis but also highlights the versatility of Python's ecosystem in solving real-world problems.
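The preprocessing step can be sketched as follows. This is only an illustration under assumptions: the helper name clean_prices and the sample rows are hypothetical, and the actual dataset may need different cleaning rules.

```python
import pandas as pd


def clean_prices(df: pd.DataFrame) -> pd.DataFrame:
    """Keep the two modelling columns and drop unusable rows (hypothetical helper)."""
    out = df[["Volume", "Close"]].copy()
    # Coerce to numeric so stray strings become NaN instead of failing later.
    out = out.apply(pd.to_numeric, errors="coerce")
    # Drop rows with missing values, then days with no trading activity.
    out = out.dropna()
    out = out[out["Volume"] > 0]
    return out


# Made-up sample rows, for illustration only.
raw = pd.DataFrame(
    {
        "Volume": [1_000_000, None, 0, 2_500_000],
        "Close": [150.0, 151.2, 149.8, "n/a"],
    }
)
clean = clean_prices(raw)
print(clean)
```

Only rows with a usable volume and closing price survive, so the model is never fitted on missing or placeholder values.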



2. Rationale for Import Statements and Libraries

Several Python libraries are used to streamline data analysis and model building. The pandas library was used for data manipulation and preprocessing, including loading and cleaning the dataset. 


The train_test_split function from sklearn.model_selection enabled the splitting of data into training and testing sets. This is a critical step in evaluating the model's performance. The LinearRegression class from sklearn.linear_model provided the framework for constructing and training the linear regression model. The mean_squared_error and r2_score functions from sklearn.metrics were used to evaluate the model's accuracy through key performance metrics. Lastly, matplotlib.pyplot was employed to create visualizations that illustrate the relationship between the independent variable (trading volume) and the dependent variable (closing price), helping to interpret the model's results effectively. We chose these libraries for their robustness and widespread adoption in machine learning workflows.


3. Data Loading and Preprocessing

The first step was to load the dataset from an Excel file using the pandas library.


The sklearn.model_selection module in Scikit-Learn provides functions for splitting data into training and test sets, evaluating machine learning models, and performing cross-validation. The train_test_split() function is the most commonly used function in the sklearn.model_selection module.


Split arrays or matrices into random train and test subsets. Quick utility that wraps input validation, next(ShuffleSplit().split(X, y)) , and application to input data into a single call for splitting (and optionally subsampling) data into a one-liner.
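A small sketch of the call described above, using made-up numbers rather than stock data. The array contents are assumptions chosen only to make the split easy to inspect.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up samples with one feature each (illustrative only).
X = np.arange(10).reshape(-1, 1)
y = 2 * np.arange(10)

# Hold out 20% of the rows; random_state makes the shuffle reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```

Each feature row stays paired with its target after shuffling. The number of test rows is rounded up, which is why a 20% split of 252 rows gives 51 test rows and 201 training rows.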


linear_model is a module of the sklearn package; it contains classes and functions for performing machine learning with linear models. The term linear model implies that the model is specified as a linear combination of features.


Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables by fitting a linear equation to observed data. It's a fundamental tool in data science and machine learning, allowing for predictions and understanding of relationships between variables. 
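To make the idea of fitting a linear equation concrete, here is a minimal sketch on synthetic data that follows y = 3x + 5 exactly. The data values are assumptions for illustration and have nothing to do with stock prices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data lying exactly on the line y = 3x + 5 (illustrative values).
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = 3.0 * X.ravel() + 5.0

model = LinearRegression()
model.fit(X, y)

slope = model.coef_[0]
intercept = model.intercept_

# A prediction is the linear combination intercept + slope * x.
x_new = np.array([[10.0]])
manual = intercept + slope * x_new[0, 0]
predicted = model.predict(x_new)[0]
print(slope, intercept, predicted)
```

The fitted slope and intercept recover the line that generated the data, and predict() returns the same value as evaluating the equation by hand.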


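The sklearn.metrics module implements several loss, score, and utility functions to measure model performance, for both classification and regression. Some classification metrics might require probability estimates of the positive class, confidence values, or binary decision values; the regression metrics used here, mean_squared_error and r2_score, need only the true and predicted target values.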


In the fields of regression analysis and machine learning, the Mean Square Error (MSE) is a crucial metric for evaluating the performance of predictive models. It measures the average squared difference between the predicted and the actual target values within a dataset.
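The definition above amounts to MSE = (1/n) * sum((y_i - yhat_i)^2). A short sketch with made-up numbers, comparing a hand computation against scikit-learn's mean_squared_error:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up actual and predicted values (illustrative only).
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])

# Errors are 1, 0, -1, -2; their squares are 1, 0, 1, 4; the mean is 1.5.
mse_manual = np.mean((y_true - y_pred) ** 2)
mse_sklearn = mean_squared_error(y_true, y_pred)
print(mse_manual, mse_sklearn)
```

Because the errors are squared, MSE is expressed in squared units of the target (squared dollars for a closing price), and large misses weigh more heavily than small ones.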


In statistics, the coefficient of determination, denoted R² or r² and pronounced "R squared", is the proportion of the variation in the dependent variable that is predictable from the independent variable.
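R² can be written as 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the sum of squared deviations of the observed values from their mean. A minimal sketch with made-up numbers, checked against scikit-learn's r2_score:

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up actual and predicted values (illustrative only).
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])

ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
r2_manual = 1.0 - ss_res / ss_tot
r2_sklearn = r2_score(y_true, y_pred)
print(r2_manual, r2_sklearn)
```

Here SS_res is 6 and SS_tot is 20, so the predictions account for 70% of the variation in the observed values.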


To install the scikit-learn (sklearn) library in Python, use the following steps:

  • Ensure Python and pip are installed: Scikit-learn requires Python 3, and the minimum supported version depends on the scikit-learn release (recent releases need a newer interpreter than Python 3.6). Check your Python version by running python --version or python3 --version in your terminal or command prompt. If Python is not installed or is outdated, download and install the latest version from the official Python website. Pip usually comes with Python, but if it is not installed, you might need to install it separately.

  • Open your terminal or command prompt: This is where you will enter the installation command.

  • Install scikit-learn using pip: Type the following command and press Enter:

Code

   pip install scikit-learn

If you encounter permission errors, you might need to run the command with administrator privileges or use the --user flag:

Code

   pip install --user scikit-learn
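After the steps above, a quick way to confirm the installation is to import the package from Python and print its version. This is a small sanity check, not part of the project script.

```python
import sklearn
from sklearn.linear_model import LinearRegression

# If both imports succeed, scikit-learn is installed for this interpreter.
version = sklearn.__version__
print("scikit-learn version:", version)
```

An ImportError here usually means pip installed the package for a different Python interpreter than the one running the check.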



import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression

from sklearn.metrics import mean_squared_error, r2_score


import yfinance as yf                # yfinance API

from datetime import date, timedelta # Date / time modules

from matplotlib import pyplot as plt # Plotting Library

from google.cloud import storage     # Leverage Google Storage Buckets


#Google Cloud project ID

# use  your project ID from Google cloud

PROJECT_ID = 'sentiment-company'


# The ID of your GCS bucket (bucket name)

bucket_name = "yastocks"

storage_client = storage.Client()

bucket = storage_client.bucket(bucket_name)


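# Calculate the start and end dates of the range you wish to retrieve.
# The download below uses fixed dates; pass start_str and end_str instead for a rolling window.


Start = date.today() - timedelta(365)

start_str = Start.strftime('%Y-%m-%d')  # keep the formatted string instead of discarding it


End = date.today() + timedelta(2)

end_str = End.strftime('%Y-%m-%d')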


# function to get daily stock volume of shares traded


# Download historical stock data

ticker = 'AMZN'

df = yf.download(ticker, start='2024-01-01', end='2025-01-01')


X = df['Volume'] #Independent Variable

y = df['Close'] #Dependent Variable

X_train, X_test, y_train, y_test = train_test_split(

X, y, test_size=0.2, random_state=42

)

print(f"\nTraining Set Size: {X_train.shape[0]} samples")

print(f"Testing Set Size: {X_test.shape[0]} samples")



print(f"\nTraining Set Size: {X_train.shape[0]} samples")

print(f"Testing Set Size: {X_test.shape[0]} samples")

#Initialize and train linear regression Model

model = LinearRegression()

model.fit(X_train, y_train)

print("\nLinear Regression model trained successfully.")

y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)

r2 = r2_score(y_test, y_pred)

print(f"\nModel Evaluation Metrics:")

print(f"Mean Squared Error (MSE): {mse:.2f}")

print(f"R-squared (R²) Score: {r2:.4f}")

plt.figure(figsize=(10, 6))

plt.scatter(X_test, y_test, color='blue', label='Actual Close Prices')

plt.plot(X_test, y_pred, color='red', linewidth=2, label='Predicted Close Prices')

plt.xlabel('Volume')

plt.ylabel('Close Price')

plt.title('Linear Regression: Volume vs Close Price')

plt.legend()

plt.grid(True)

plt.savefig('amzn_reg.png')  # save before show(); some backends clear the figure once it is shown

plt.show()




plot_filename = 'amzn_reg.png'


destination_blob_name = f'stocks/{plot_filename}'


source_file_name = 'amzn_reg.png'


blob = bucket.blob(destination_blob_name)


blob.upload_from_filename(source_file_name)  #upload file to specified destination


print(f'File {source_file_name} uploaded to {destination_blob_name}.')
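Depending on the installed yfinance version, the frame returned by yf.download may have flat column names or a two-level (field, ticker) column index, and scikit-learn expects a 2-D feature array either way. The following is a minimal sketch, not the method used in the script above: the helper name column_as_1d and the stand-in frames are hypothetical, and the (field, ticker) level order is an assumption.

```python
import numpy as np
import pandas as pd


def column_as_1d(df: pd.DataFrame, field: str, ticker: str) -> np.ndarray:
    """Return one price field as a 1-D array for flat or two-level columns (hypothetical helper)."""
    if isinstance(df.columns, pd.MultiIndex):
        values = df[(field, ticker)]  # assumes (field, ticker) level order
    else:
        values = df[field]
    return values.to_numpy()


# Stand-in frames with made-up values; no network access is needed.
flat = pd.DataFrame({"Volume": [10, 20, 30], "Close": [1.0, 2.0, 3.0]})
multi = flat.copy()
multi.columns = pd.MultiIndex.from_tuples(
    [("Volume", "AMZN"), ("Close", "AMZN")]
)

X = column_as_1d(multi, "Volume", "AMZN").reshape(-1, 1)  # 2-D features for scikit-learn
y = column_as_1d(multi, "Close", "AMZN")                  # 1-D target
print(X.shape, y.shape)
```

Reshaping the feature to one column and keeping the target one-dimensional gives LinearRegression.fit the shapes it expects regardless of which column layout the download produced.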



ohn_iacovacci1@cloudshell:~/ml (sentiment-analysis-379200)$ python3 stock_reg.py

YF.download() has changed argument auto_adjust default to True

[*********************100%***********************]  1 of 1 completed


Training Set Size: 201 samples

Testing Set Size: 51 samples


Training Set Size: 201 samples

Testing Set Size: 51 samples


Linear Regression model trained successfully.


Model Evaluation Metrics:

Mean Squared Error (MSE): 280.20

R-squared (R²) Score: -0.0049

File amzn_reg.png uploaded to stocks/amzn_reg.png.


What is Mean Squared Error?

Model Evaluation is an important part of system model development. In cases when making predictions is the goal of the model, the mean squared error of predictions is a good metric to use when assessing the model’s accuracy.

  • Mean Squared Error evaluates the proximity of a regression line to a group of data points. It is a risk function that corresponds to the expected value of the squared error loss.

The mean squared error is computed by averaging the squared errors, that is, the squared differences between each predicted value and the corresponding observed value.

Mean squared error (MSE) is a measure of the error in prediction algorithms. This statistic quantifies the average squared difference between observed and predicted values. When a model has no error, the MSE equals 0, and its value increases as the model's error increases. The mean squared error is also called the mean squared deviation (MSD).

The mean squared error in regression, for instance, might indicate the average squared residual.

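The R-squared score, also known as the coefficient of determination, indicates how well a statistical model predicts an outcome. It represents the proportion of the variance in the dependent variable that is explained by the independent variable(s). A higher R-squared value (closer to 1) suggests a better fit, meaning the model explains more of the data's variability. On held-out test data the score can also fall below 0, which means the model predicts worse than always guessing the mean of the observed values. The score of -0.0049 reported above therefore indicates that trading volume alone explains essentially none of the variation in the closing price over this period.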
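One way to anchor the R-squared scale is to compare against a baseline that ignores the input and always predicts the mean. The sketch below uses scikit-learn's DummyRegressor on made-up values; the numbers are assumptions for illustration only.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import r2_score

# Made-up target values; the feature is irrelevant to the baseline.
X = np.arange(5).reshape(-1, 1)
y = np.array([10.0, 12.0, 9.0, 11.0, 13.0])

# Baseline: always predict the mean of y.
baseline = DummyRegressor(strategy="mean").fit(X, y)
r2_mean = r2_score(y, baseline.predict(X))

# A constant guess that is further from the data than the mean scores below zero.
worse_guess = np.full_like(y, y.mean() + 2.0)
r2_worse = r2_score(y, worse_guess)
print(r2_mean, r2_worse)
```

The mean baseline scores 0, so a model whose R-squared is near 0 adds little beyond that baseline, and a negative score means it does worse.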
