Personal history

Column

Background

Hi there, my name is Sujith Pai. I hold two degrees in engineering, one in mechanical engineering and one in aerospace engineering, from the University of California, Irvine. I am currently pursuing a master's in mechanical engineering at the University of Dayton, Ohio. This is my final project for Math 543.

RESEARCH INTERESTS
Combustion
Internal combustion engines
Designing of internal combustion engines

PROFESSIONAL EXPERIENCE
Canara springs, Karnataka, India
Techno marine, UAE

CONTACT
paisujith@yahoo.com
9493446755

Column {data-height=150}

Profile Picture

Sujith Pai

Introduction

Column {data-width=25}

Abstract

In today’s fast-moving world, we focus more on making life easier for human beings than on the impact of our actions. A dataset from the UCI Machine Learning Repository is used to build a linear model that predicts the toxicity of chemicals towards the fathead minnow. First, diagnostic plots are used to check that the assumptions of linear regression are met. Two of the regressors, the atom-type counts NdsCH and NdssC, overlap heavily with the other regressors, so they are recoded as binary indicators using the ifelse function to improve performance. One variable is then dropped after analysing all possible models with the regsubsets function. The final model has an R^2 of about 54.5%.

Row

Introduction

     X1    X2    X3 X4 X5    X6     Y
1 3.260 0.829 1.676  0  1 1.453 3.770
2 2.189 0.580 0.863  0  0 1.348 3.115
3 2.125 0.638 0.831  0  0 1.348 3.531
4 3.027 0.331 1.472  1  0 1.807 3.510
5 2.094 0.827 0.860  0  0 1.886 5.390
6 3.222 0.331 2.177  0  0 0.706 1.819

In today’s fast moving world we are more focused on making life easier for human beings and are less focused on the impact of our actions. Several of our rivers and water sources are highly polluted with plastics and chemicals making these water bodies uninhabitable for fish and other aquatic life. Each day the global human population keeps rising and as a result the extent of pollution created by industrial chemicals and plastic keeps rising as well.

The dataset comes from the UCI Machine Learning Repository and was assembled to predict acute aquatic toxicity towards the fish Pimephales promelas (fathead minnow) on a set of 908 chemicals. This species was selected because it tolerates a wide range of conditions in both water clarity and pH. The largest populations are found in streams or bog ponds where conditions are rather poor for most other species of fish [2]. Coincidentally, the highest population of the fathead minnow is in Ohio, USA. LC50, the concentration that causes death in 50% of test fish over a test duration of 96 hours, is the model response. The model comprises 6 molecular descriptors: MLOGP (molecular properties of the tested chemical), CIC0 (information indices, which are entropy-based measures that quantify discrepancies between probability distributions), GATS1i (2D autocorrelations, which describe the degree to which one subject is similar to others), NdssC (atom-type counts), NdsCH (bonded atom-type counts), and SM1_Dz (2D matrix-based descriptor of the tested chemical). The goal of the study is to predict the LC50 of a chemical accurately from these descriptors. This concentration tells us how poisonous the chemical is to the fish. To give a better sense of the fish we are dealing with, a picture of the fathead minnow is included below.

Fathead Minnow

https://www.usgs.gov/media/images/fathead-minnow

Data Exploration

column {data-width=25}

Analysis

I started my data exploration by removing duplicate rows with the unique function. Only one row was duplicated, so the data set has 907 observations instead of 908. I also changed the column names using the colnames function to make it easier to see what each regressor stands for. A linear model with all regressors was used for the following plots. The scale of the y axis is small, which further supports the existence of a linear relationship. This was just an initial diagnosis to determine whether the model with all variables was viable and to get a better sense of the data.

Instance analysis
Next I looked at how often each value (0, 1, 2, 3, 4, 5, 6) occurs in the numeric NdsCH and NdssC regressors. Most of the instances occur at 0 and 1 for both variables. Even though values greater than 1 are relatively rare, they seem to have a negative effect on the significance of NdssC and NdsCH.

Linearity assumption
First we want to judge whether a linear model is appropriate, so I examined the fit using several diagnostic plots. I built a linear model with all of the regressors and LC50 as the response and looked at the pattern of residuals against fitted values. The plot looks good: the residuals scatter evenly around zero, indicating that a linear relationship is reasonable for this dataset.

Normality Assumption
The Q-Q plot looks good, as most of the points lie on the 45-degree line, indicating that the normality assumption is satisfied. There is a slight deviation at the ends, but this is no cause for concern.

Equal Variance Assumption
The equal variance assumption does not look that great, as the values tend to clump at certain points, but it is still acceptable for a real-world model. This will need to be investigated further. There might be one or two variables that are slightly collinear with each other, but the general lack of distinct patterns suggests that the residuals are independent.

Cook's distance
There seem to be no issues in the Residuals vs Leverage plot. Since the scale is small, the 3 outliers do not have much of an impact on the performance of the model.

Performance of initial model
The initial summary table showed an R^2 value of 49%. All of the variables are significant except for the NdssC and NdsCH variables.

First six values of the dataset

   CIC0 SM1_Dz GATS1i NdsCH NdssC MLOGP  LC50
1 3.260  0.829  1.676     0     1 1.453 3.770
2 2.189  0.580  0.863     0     0 1.348 3.115
3 2.125  0.638  0.831     0     0 1.348 3.531
4 3.027  0.331  1.472     1     0 1.807 3.510
5 2.094  0.827  0.860     0     0 1.886 5.390
6 3.222  0.331  2.177     0     0 0.706 1.819

Instances of NdsCH and NdssC


  0   1   2   3   4 
759 107  29   5   7


  0   1   2   3   4   5   6 
621 176  81  18   8   1   2

Column

Linearity plot

Normality plot

Equality Variance plot

Residuals vs Leverage plot

Analysis

Column

Plot for CIC0 vs LC50

Plot of GATS1i vs LC50

Plot of SM1_Dz vs LC50

Plot of MLOGP vs LC50

Diagnosis plot

column {data-width=25}

Analysis

Plots were made for all variables, and the NdssC and NdsCH variables showed troubling results. For outputs to be reliable, regressors need to have separate and distinct patterns.

To improve the performance of the model, NdsCH and NdssC values of 1 or more were all assigned the value 1. Since toxicity is determined by the presence of a chemical group, the output is assumed not to depend on the number of atoms. The plots on the left show CIC0, SM1_Dz, GATS1i, and MLOGP in turn on the X axis and LC50 on the Y axis, with points colored by NdssC. Normally we would see separate patterns with little to no overlap. Here NdssC shows severe overlap, and this will need to be fixed. There is a possibility that NdsCH is multicollinear with the other regressors. Since the performance is poor, the count regressors are recoded using the ifelse function.

The first plot on the left relates CIC0 to LC50 on the X and Y axes respectively, colored by NdssC. Here the results are troublesome as well. There are no separate patterns, and there appears to be collinearity that we cannot ignore. Double-clicking the 0 entry in the legend at the top right of the interactive plot isolates the instances of 0. This can be used to get a better sense of the spread.

Since the groups mesh together, CIC0 cannot be used to separate either NdssC or NdsCH as a predictor in the final model.

Results

Column

R^2 Evaluation

Reordering variables and trying again:

[1] 5

Goodness of Fit

Considering the summary table on the right, we see the t values of all regressors. We use the t-test to assess the contribution of each regressor to the model, and the F-test to assess the model as a whole.

H0:β1=β2=β3=β4=β5=0 versus H1: at least one of βi≠0,i=1,2,3,4,5.

where H0 is the null hypothesis.

The F-statistic is 216.3 on 5 and 901 degrees of freedom. The p-value is < 2.2e-16, which is below 0.05, so we reject H0 at the α = 0.05 level. There is sufficient evidence to conclude that the multiple regression model with CIC0, SM1_Dz, GATS1i, NdsCH and MLOGP is better than just using the mean toxicity level.

CIC0 is significant to the model at the α = 0.05 level given that all other variables are included in the model.

SM1_Dz is significant to the model at the α = 0.05 level given that all other variables are included in the model.

GATS1i is significant to the model at the α = 0.05 level given that all other variables are included in the model.

NdsCH1 is not significant to the model at the α = 0.05 level (p(t)=0.202) given that all other variables are included in the model.

MLOGP is significant to the model at the α = 0.05 level given that all other variables are included in the model.

Column

Variable Selection

To find the highest-performing model, we have to analyse all possible combinations of the 6 variables. To do this we use the regsubsets function, and the result is displayed on the graph to the left. Using the which.max function, we find the number of variables in the model with the highest adjusted R^2. Looking at the graph, it appears that the NdssC variable hurts the final R^2 rather than adding to it. The model in this scenario therefore includes only the CIC0, SM1_Dz, GATS1i, NdsCH and MLOGP regressors.

Final R^2
Recoding the count variables and dropping NdssC raised the R^2 value from 49% initially to 54.5%, which can be gleaned from the summary table below.

Summary of the fitted model


Call:
lm(formula = LC50 ~ CIC0 + SM1_Dz + GATS1i + NdsCH + MLOGP, data = fish)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.1080 -0.3857 -0.0643  0.3524  3.8663 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.02043    0.02756  -0.741    0.459    
CIC0         0.22203    0.03193   6.953 6.84e-12 ***
SM1_Dz       0.35222    0.02558  13.768  < 2e-16 ***
GATS1i      -0.20989    0.02843  -7.382 3.55e-13 ***
NdsCH1       0.06480    0.05073   1.277    0.202    
MLOGP        0.38262    0.03407  11.229  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.676 on 901 degrees of freedom
Multiple R-squared:  0.5456,    Adjusted R-squared:  0.5431 
F-statistic: 216.3 on 5 and 901 DF,  p-value: < 2.2e-16

Conclusion

The goal of this study was to determine whether a linear model could be built from the provided dataset. We were successful in building a model that explains ~54% of the variability in the response, which is adequate for real-world use. In other words, researchers would be able to use this linear model to predict whether a chemical is toxic to the fathead minnow, given that the other variables are known, which could improve the chances of survival for these fish in polluted waters. In conclusion, all models are wrong, but some models are useful. This model was limited by the small number of observations in the dataset. If this regression analysis were repeated in the future, performance could be improved by using a non-parametric model (which is beyond the scope of this class) rather than a linear model. As more data is collected, we will be able to increase the accuracy of this model and expand it to other fish as well.

References

Row

[1] UCI Machine Learning Repository

https://archive.ics.uci.edu/ml/index.php

[2] Dr. Tessa Chen, University of Dayton, Ohio.

---
title: "Build a linear model to determine the levels of toxicity of water towards the Fathead Minnow"
output: 
  flexdashboard::flex_dashboard:
    storyboard: true
    theme: cerulean
    social: menu
    source: embed
---

```{r setup, include=FALSE}
# load necessary packages
library(ggplot2)
library(plotly)
library(plyr)
library(flexdashboard) 
library(readr)
fish <- read.csv("C:/Users/tessa/Dropbox/MTH543/Projects/data/Pai/qsar_fish_toxicity.csv")


```

 Personal history
=======================================================================

Column {data-width=500}
-----------------------------------------------------------------------

### Background 

Hi there, my name is Sujith Pai. I hold two degrees in engineering, one in mechanical engineering and one in aerospace engineering, from the University of California, Irvine. I am currently pursuing a master's in mechanical engineering at the University of Dayton, Ohio. This is my final project for Math 543.

RESEARCH INTERESTS 

Combustion 

Internal combustion engines

Designing of internal combustion engines 



PROFESSIONAL EXPERIENCE 

Canara springs, Karnataka, India 

Techno marine, UAE 



CONTACT  

paisujith@yahoo.com

9493446755



Column {data-height=150 data-width=150}
-----------------------------------------------------------------------

### Profile Picture
```{r , echo=FALSE, fig.cap="Sujith Pai", out.width = '100%'}
knitr::include_graphics("capture.png")
```

Introduction
=======================================================================

Column {data-width=25 data-height=20}
-----------------------------------------------------------------------
### Abstract 


In today's fast-moving world, we focus more on making life easier for human beings than on the impact of our actions. A dataset from the UCI Machine Learning Repository is used to build a linear model that predicts the toxicity of chemicals towards the fathead minnow. First, diagnostic plots are used to check that the assumptions of linear regression are met. Two of the regressors, the atom-type counts NdsCH and NdssC, overlap heavily with the other regressors, so they are recoded as binary indicators using the ifelse function to improve performance. One variable is then dropped after analysing all possible models with the regsubsets function. The final model has an R^2 of about 54.5%.
 


Row {data-height=650}
-----------------------------------------------------------------------
### Introduction 
```{r}
head(fish)
```



  In today's fast moving world we are more focused on making life easier for human beings and are less focused on the impact of our actions. Several of our rivers and water sources are highly polluted with plastics and chemicals making these water bodies uninhabitable for fish and other aquatic life. Each day the global human population keeps rising and as a result the extent of pollution created by industrial chemicals and plastic keeps rising as well.

  

The dataset comes from the UCI Machine Learning Repository and was assembled to predict acute aquatic toxicity towards the fish Pimephales promelas (fathead minnow) on a set of 908 chemicals. This species was selected because it tolerates a wide range of conditions in both water clarity and pH. The largest populations are found in streams or bog ponds where conditions are rather poor for most other species of fish [2]. Coincidentally, the highest population of the fathead minnow is in Ohio, USA. LC50, the concentration that causes death in 50% of test fish over a test duration of 96 hours, is the model response. The model comprises 6 molecular descriptors: MLOGP (molecular properties of the tested chemical), CIC0 (information indices, which are entropy-based measures that quantify discrepancies between probability distributions), GATS1i (2D autocorrelations, which describe the degree to which one subject is similar to others), NdssC (atom-type counts), NdsCH (bonded atom-type counts), and SM1_Dz (2D matrix-based descriptor of the tested chemical). The goal of the study is to predict the LC50 of a chemical accurately from these descriptors. This concentration tells us how poisonous the chemical is to the fish. To give a better sense of the fish we are dealing with, a picture of the fathead minnow is included below.
 
 

### Fathead Minnow

```{r pressure, echo=FALSE, out.width = '40%'}
knitr::include_graphics("download.png")
```


https://www.usgs.gov/media/images/fathead-minnow

Data Exploration
=======================================================================


column {data-width=25 data-height=20}
-----------------------------------------------------------------------

Analysis  

***
I started my data exploration by removing duplicate rows with the unique function. Only one row was duplicated, so the data set has 907 observations instead of 908. I also changed the column names using the colnames function to make it easier to see what each regressor stands for. A linear model with all regressors was used for the following plots. The scale of the y axis is small, which further supports the existence of a linear relationship. This was just an initial diagnosis to determine whether the model with all variables was viable and to get a better sense of the data.
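
The de-duplication and renaming step can be sketched on a toy data frame. This is only an illustration: the three rows, the reduced set of columns, and the names `raw`, `clean`, and `n_dup` are assumptions made for the example and are not the project data.

```r
# Toy data frame standing in for the raw file (illustrative values only).
raw <- data.frame(
  X1 = c(3.260, 2.189, 2.189),
  X2 = c(0.829, 0.580, 0.580),
  Y  = c(3.770, 3.115, 3.115)
)

# Count rows that repeat an earlier row exactly.
n_dup <- sum(duplicated(raw))

# unique() on a data frame keeps the first copy of each distinct row.
clean <- unique(raw)

# Replace the generic headers with descriptive names.
colnames(clean) <- c("CIC0", "SM1_Dz", "LC50")

n_dup
nrow(clean)
```

Counting duplicates with `duplicated()` before calling `unique()` makes it easy to report how many rows were dropped.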

Instance analysis  
Next I looked at how often each value (0, 1, 2, 3, 4, 5, 6) occurs in the numeric NdsCH and NdssC regressors. Most of the instances occur at 0 and 1 for both variables. Even though values greater than 1 are relatively rare, they seem to have a negative effect on the significance of NdssC and NdsCH.    

Linearity assumption   

First we want to judge whether a linear model is appropriate, so I examined the fit using several diagnostic plots. I built a linear model with all of the regressors and LC50 as the response and looked at the pattern of residuals against fitted values. The plot looks good: the residuals scatter evenly around zero, indicating that a linear relationship is reasonable for this dataset.

Normality Assumption  

The Q-Q plot looks good, as most of the points lie on the 45-degree line, indicating that the normality assumption is satisfied. There is a slight deviation at the ends, but this is no cause for concern.
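
As a rough numerical companion to reading the Q-Q plot by eye, one can check how closely the standardized residuals track the theoretical normal quantiles. The sketch below uses simulated data with normal errors (an assumption for illustration, not the fish data), and the names `fit`, `std_res`, and `qq` are hypothetical.

```r
set.seed(1)

# Simulated regression with normally distributed errors.
x <- rnorm(200)
y <- 1 + 2 * x + rnorm(200)
fit <- lm(y ~ x)

# Standardized residuals and their matching theoretical normal quantiles.
std_res <- rstandard(fit)
qq <- qqnorm(std_res, plot.it = FALSE)

# A correlation close to 1 means the points hug a straight line.
qq_cor <- cor(qq$x, qq$y)
qq_cor
```

This correlation is only an informal summary; the plot itself remains the main diagnostic.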

Equal Variance Assumption  

The equal variance assumption does not look that great, as the values tend to clump at certain points, but it is still acceptable for a real-world model. This will need to be investigated further. There might be one or two variables that are slightly collinear with each other, but the general lack of distinct patterns suggests that the residuals are independent.

Cook's distance  

There seem to be no issues in the Residuals vs Leverage plot. Since the scale is small, the 3 outliers do not have much of an impact on the performance of the model.
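
Influence can also be checked numerically with `cooks.distance()`. The sketch below uses simulated data with one planted outlier. The 4/n cutoff is a common rule of thumb rather than something used in this project, and the names `d`, `threshold`, and `flagged` are assumptions for the example.

```r
set.seed(2)

# Simulated data with one deliberately planted outlier in row 1.
x <- rnorm(100)
y <- 3 + 0.5 * x + rnorm(100)
y[1] <- y[1] + 10
fit <- lm(y ~ x)

# Cook's distance for every observation.
d <- cooks.distance(fit)

# Rule-of-thumb cutoff: flag points with distance above 4 / n.
threshold <- 4 / length(d)
flagged <- which(d > threshold)
flagged
```

Flagged points are candidates for a closer look, not automatic deletions.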

Performance of initial model  

The initial summary table showed an R^2 value of 49%. All of the variables are significant except for the NdssC and NdsCH variables.




### First six values of the dataset

```{r}
colnames(fish) <- c("CIC0", "SM1_Dz", "GATS1i", "NdsCH", "NdssC", "MLOGP", "LC50")
fish <- unique(fish)
head(fish)
```




### Instances of NdsCH and NdssC

```{r}
table(fish$NdsCH)
table(fish$NdssC)
```

Column {.tabset data-width=400}
-----------------------------------------------------------------------

### Linearity plot 


```{r}

# Fit the full model on the original numeric counts for the diagnostic plots
ak <- lm(LC50~CIC0+SM1_Dz+GATS1i+NdsCH+NdssC+MLOGP,fish)
df <- fish
# Recode the two atom-type counts as presence/absence factors
fish$NdsCH <- as.factor(ifelse(fish$NdsCH>0, 1, 0))
fish$NdssC <- as.factor(ifelse(fish$NdssC>0, 1, 0))


# Obtain the values needed for the diagnostic plots
# Extract fitted values
Fitted.Values <- ak$fitted.values

# Extract residuals
Residuals <- ak$residuals

# Calculate standardized residuals (each scaled by its estimated standard error)
Standardized.Residuals <- rstandard(ak)

# Theoretical normal quantiles matching the standardized residuals
Theoretical.Quantiles <- qqnorm(Standardized.Residuals, plot.it = FALSE)$x

# Square root of |standardized residuals|
Root.Residuals <- sqrt(abs(Standardized.Residuals))

# Calculate Leverage
Leverage <- lm.influence(ak)$hat

# Create data frame 
# Will be used as input to plot_ly

diagnostics <- data.frame(Fitted.Values, 
                     Residuals, 
                     Standardized.Residuals, 
                     Theoretical.Quantiles,
                     Root.Residuals,
                     Leverage)



m <- list(
  l = 100,
  r = 100,
  b = 100,
  t = 100,
  pad = 4
)

# Fitted vs Residuals
p1 <- plot_ly(diagnostics, x = Fitted.Values, y = Residuals, 
          type = "scatter", mode = "markers", hoverinfo = "x+y", name = "Data", 
          marker = list(size = 10, opacity = 0.5))%>%
  layout(title = "Residuals vs Fitted Values", 
       xaxis = list(title="Fitted Values", font=list(size=14)), 
       yaxis = list(title="Residuals", font=list(size=14)), 
       plot_bgcolor = "#e6e6e6", 
       font=list(size=14), margin=m)

ggplotly(p1)
```




### Normality plot

```{r}
p2 <- plot_ly(diagnostics, x = Theoretical.Quantiles, y = Standardized.Residuals, type = "scatter", mode = "markers", hoverinfo = "x+y", name = "Data", marker = list(size = 10, opacity = 0.5), showlegend = F)%>%
    add_trace(x = Theoretical.Quantiles, y = Theoretical.Quantiles, type = "scatter", mode = "lines", name = "", line = list(width = 2))%>%
  layout(title = "Q-Q Plot", plot_bgcolor = "#e6e6e6",
              xaxis = list(title="Theoretical Quantiles", font=list(size=14)), 
       yaxis = list(title="Standardized Residuals", font=list(size=14)), font=list(size=14), margin=m)

ggplotly(p2)

```




### Equality Variance plot
```{r}
p3 <- plot_ly(diagnostics, x = Fitted.Values, y = Root.Residuals, 
          type = "scatter", mode = "markers", hoverinfo = "x+y", name = "Data",
          marker = list(size = 10, opacity = 0.5), showlegend = F)%>%
  layout(title = "Scale-Location", plot_bgcolor = "#e6e6e6", xaxis = list(title="Fitted Values", font=list(size=14)), 
       yaxis = list(title="sqrt(|Standardized Residuals|)", font=list(size=14)), font=list(size=14), margin=m)

ggplotly(p3)

```



### Residuals vs Leverage plot
```{r}
s <- loess.smooth(Leverage, Residuals)
p4 <- plot_ly(diagnostics, x = Leverage, y = Residuals, 
            type = "scatter", mode = "markers", hoverinfo = "x+y", name = "Data", marker = list(size = 10, opacity = 0.5), showlegend = F) %>% 
    add_trace(x = s$x, y = s$y, type = "scatter", mode = "lines", name = "Smooth", line = list(width = 2)) %>% 
    layout(title = "Leverage vs Residuals", plot_bgcolor = "#e6e6e6", xaxis = list(title="Leverage", font=list(size=14)), 
       yaxis = list(title="Residuals", font=list(size=14)), font=list(size=14), margin=m)
ggplotly(p4)  
```





 Analysis 
=======================================================================


Column {.tabset data-width=1200} 
-----------------------------------------------------------------------

###  Plot for CIC0 vs LC50
```{r}
library(ggplot2)
library(plotly)

# Scatter of CIC0 against LC50, colored by the NdssC presence/absence factor
pai <- plot_ly(fish, x = ~CIC0, y = ~LC50, color = ~NdssC, type = "scatter", mode = "markers", marker = list(size = 15), colors = c("#FF0000", "#1403FF"))
ggplotly(pai)
```



###  Plot of GATS1i vs LC50

```{r}
library(ggplot2)
library(plotly)
# Make sure NdssC is a factor so it is drawn as discrete color groups
fish$NdssC <- as.factor(fish$NdssC)
pai1 <- plot_ly(fish, x = ~GATS1i, y = ~LC50, color = ~NdssC, type = "scatter", mode = "markers", marker = list(size = 10), colors = c("#FF0000", "#20B2AA"))
ggplotly(pai1)
```

###  Plot of SM1_Dz vs LC50

```{r}
library(ggplot2)
library(plotly)
# Make sure NdssC is a factor so it is drawn as discrete color groups
fish$NdssC <- as.factor(fish$NdssC)
pai2 <- plot_ly(fish, x = ~SM1_Dz, y = ~LC50, color = ~NdssC, type = "scatter", mode = "markers", marker = list(size = 10), colors = c("#FF0000", "#20B2AA"))
ggplotly(pai2)
```

###  Plot of MLOGP vs LC50

```{r}
library(ggplot2)
library(plotly)
# Make sure NdssC is a factor so it is drawn as discrete color groups
fish$NdssC <- as.factor(fish$NdssC)
pai3 <- plot_ly(fish, x = ~MLOGP, y = ~LC50, color = ~NdssC, type = "scatter", mode = "markers", marker = list(size = 10), colors = c("#FF0000", "#20B2AA"))
ggplotly(pai3)
```

### Diagnosis plot
```{r}
 library(lindia)
gg_diagnose(ak)
```

column {data-width=25 data-height=20}
-----------------------------------------------------------------------


Analysis

***
Plots were made for all variables, and the NdssC and NdsCH variables showed troubling results.
For outputs to be reliable, regressors need to have separate and distinct patterns. 

To improve the performance of the model, NdsCH and NdssC values of 1 or more were all assigned the value 1. Since toxicity is determined by the presence of a chemical group, the output is assumed not to depend on the number of atoms. The plots on the left show CIC0, SM1_Dz, GATS1i, and MLOGP in turn on the X axis and LC50 on the Y axis, with points colored by NdssC. Normally we would see separate patterns with little to no overlap. Here NdssC shows severe overlap, and this will need to be fixed. There is a possibility that NdsCH is multicollinear with the other regressors. Since the performance is poor, the count regressors are recoded using the ifelse function.  
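
The recoding described above can be sketched on a small made-up vector. The values in `counts` are hypothetical and only mimic the 0 to 4 range seen in NdsCH. The `ifelse` call mirrors the one applied to the fish data.

```r
# Hypothetical atom-type counts (made-up values).
counts <- c(0, 1, 2, 0, 4, 1, 0, 3)

# Any count above zero becomes 1, so the factor marks presence vs. absence.
present <- as.factor(ifelse(counts > 0, 1, 0))

# Compare the distribution before and after recoding.
table(counts)
table(present)
```

Collapsing the sparse higher counts into a single level trades detail for a more stable estimate of the presence effect.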

The first plot on the left relates CIC0 to LC50 on the X and Y axes respectively, colored by NdssC. Here the results are troublesome as well. There are no separate patterns, and there appears to be collinearity that we cannot ignore. Double-clicking the 0 entry in the legend at the top right of the interactive plot isolates the instances of 0. This can be used to get a better sense of the spread.

Since the groups mesh together, CIC0 cannot be used to separate either NdssC or NdsCH as a predictor in the final model.  



 Results
=======================================================================

Column {.tabset data-width=400}
-----------------------------------------------------------------------

### R^2 Evaluation 
```{r, echo=FALSE}
library(leaps)
best <- regsubsets(LC50~., fish, nbest=1, nvmax=NULL, force.in=NULL, force.out = NULL, method="exhaustive" )
                  
result <- summary(best)
plot(best, scale = "adjr2", main = "Adjusted R^2", col='pink' )

which.max(result$adjr2)

fish[,c(1,2,3,6,7)] <- apply(fish[,c(1,2,3,6,7)], 2, scale) 
best.model <- lm(LC50~CIC0+SM1_Dz+GATS1i+NdsCH+MLOGP,
                    fish)
```

### Goodness of Fit

Considering the summary table on the right, we see the t values of all regressors. We use the t-test to assess the contribution of each regressor to the model, and the F-test to assess the model as a whole.

H0:β1=β2=β3=β4=β5=0 versus H1: at least one of βi≠0,i=1,2,3,4,5.




where H0 is the null hypothesis.




The F-statistic is 216.3 on 5 and 901 degrees of freedom. The p-value is < 2.2e-16, which is below 0.05, so we reject H0 at the α = 0.05 level. There is sufficient evidence to conclude that the multiple regression model with CIC0, SM1_Dz, GATS1i, NdsCH and MLOGP is better than just using the mean toxicity level.
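
As an arithmetic check, the overall F-statistic and the adjusted R^2 can be recovered from the multiple R^2 and the degrees of freedom reported in the model summary. The inputs below are the rounded values from that summary. The variable names are assumptions for the example, and the result differs slightly from the printed 216.3 because R^2 is rounded.

```r
r2     <- 0.5456  # multiple R-squared from the model summary
k      <- 5       # number of slope coefficients
df_res <- 901     # residual degrees of freedom
n      <- df_res + k + 1  # 907 observations

# Overall F-statistic: explained variance per regressor over residual variance.
f_stat <- (r2 / k) / ((1 - r2) / df_res)

# Upper-tail p-value of the F distribution.
p_val <- pf(f_stat, k, df_res, lower.tail = FALSE)

# Adjusted R-squared penalizes the number of regressors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_res

f_stat
p_val
adj_r2
```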




CIC0  is significant to the model at the α = 0.05 level given that all other variables are included in the model.




SM1_Dz is significant to the model at the α = 0.05 level given that all other variables are included in the model.




GATS1i is significant to the model at the α = 0.05 level given that all other variables are included in the model.




NdsCH1 is not significant to the model at the α = 0.05 level (p(t)=0.202) given that all other variables are included in the model.




MLOGP is significant to the model at the α = 0.05 level given that all other variables are included in the model.




Column {data-width=600}
-----------------------------------------------------------------------



### Variable Selection 

***

To find the highest-performing model, we have to analyse all possible combinations of the 6 variables. To do this we use the regsubsets function, and the result is displayed on the graph to the left. Using the which.max function, we find the number of variables in the model with the highest adjusted R^2. Looking at the graph, it appears that the NdssC variable hurts the final R^2 rather than adding to it. The model in this scenario therefore includes only the CIC0, SM1_Dz, GATS1i, NdsCH and MLOGP regressors. 
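
To show what an exhaustive search does under the hood, here is a minimal base-R sketch that fits every subset of a few regressors and picks the one with the highest adjusted R^2. It is not the `regsubsets` implementation. It uses the built-in `mtcars` data as a stand-in for the fish data, and the choice of `candidates` is arbitrary.

```r
# Built-in mtcars data stands in for the fish data; mpg is the response.
response   <- "mpg"
candidates <- c("wt", "hp", "disp", "qsec")

# Enumerate every non-empty subset of the candidate regressors.
subsets <- unlist(
  lapply(seq_along(candidates),
         function(m) combn(candidates, m, simplify = FALSE)),
  recursive = FALSE
)

# Fit each subset and record its adjusted R^2.
adj_r2 <- sapply(subsets, function(vars) {
  fit <- lm(reformulate(vars, response = response), data = mtcars)
  summary(fit)$adj.r.squared
})

# Regressors of the model with the highest adjusted R^2.
best_vars <- subsets[[which.max(adj_r2)]]
best_vars
```

Brute force like this is fine for a handful of regressors, since p candidates give only 2^p - 1 models. `regsubsets` reaches the same kind of answer more efficiently.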

Final R^2  

Recoding the count variables and dropping NdssC raised the R^2 value from 49% initially to 54.5%, which can be gleaned from the summary table below.




### Summary of the fitted model 

```{r}

ak1 <- lm(LC50~CIC0+SM1_Dz+GATS1i+NdsCH+MLOGP,fish)
summary(ak1)
```


### Conclusion 

The goal of this study was to determine whether a linear model could be built from the provided dataset. We were successful in building a model that explains ~54% of the variability in the response, which is adequate for real-world use. In other words, researchers would be able to use this linear model to predict whether a chemical is toxic to the fathead minnow, given that the other variables are known, which could improve the chances of survival for these fish in polluted waters. In conclusion, all models are wrong, but some models are useful. This model was limited by the small number of observations in the dataset. If this regression analysis were repeated in the future, performance could be improved by using a non-parametric model (which is beyond the scope of this class) rather than a linear model. As more data is collected, we will be able to increase the accuracy of this model and expand it to other fish as well.


 References
=======================================================================

Row {data-height=650}
-----------------------------------------------------------------------



[1] UCI Machine Learning Repository 

https://archive.ics.uci.edu/ml/index.php

[2] Dr. Tessa Chen, University of Dayton, Ohio.