This page documents the R code used to predict the fatigue state of participants in two common manufacturing tasks.

Our detailed code is available at: https://github.com/zahrame/FatigueManagement.github.io

The reader can display any code chunk by clicking on its Code button. We chose this option to improve the readability of the proposed approach and to avoid mixing the code and its description for each step of the proposed framework.

Figure 1 provides an overview of our proposed data-driven fatigue management framework.

Figure 1: The Proposed Framework for Fatigue Management

1 Section: Loading Data and Data Preparation

1.1 Section: Required Packages

This section presents the set of R packages utilized in the different sections of this study. These packages are used for feature selection, model development, and parallel computing.

rm(list=ls()) # clear global environment
### import all needed packages
  library(randomForest)  
  library(ROCR)
  library(bestglm)
  library(gtools)
  library(bst)
  library(caret)
  library(pROC)
  library(plyr)
  library(ipred)
  library(e1071)
  library(kernlab)
  library(klaR)
  library(MASS)
  library(C50)
  library(glmnet)
  library(snow)   ### for the parallel computation
  library(readr)  ### for read_csv online
  library(xlsx)   ### save outputs into an excel file
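
If some of these packages are not installed yet, a minimal one-time setup along the following lines can be used to install the missing ones before the library() calls above (the vector simply mirrors the packages listed in this section).

### one-time setup: install any packages from the list above that are missing
needed <- c("randomForest", "ROCR", "bestglm", "gtools", "bst", "caret", "pROC",
            "plyr", "ipred", "e1071", "kernlab", "klaR", "MASS", "C50",
            "glmnet", "snow", "readr", "xlsx")
missing <- setdiff(needed, rownames(installed.packages()))
if (length(missing) > 0) install.packages(missing)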

1.2 Section: Data Collection, Data Preprocessing, Creating Training and Testing Sets

In this section, we load the data files used in this study, covering 15 participants for the Manual Material Handling task and 13 participants for the Supply Insertion task, presented in two subsections. The output of this step is a set of training and testing data that can be used for model development. The processed raw data with the defined features can be obtained through the GitHub links in the code below.

1.2.1 Manual Material Handling task

The output of this step is 105 training and test sets obtained using 2-participants-out cross-validation for the Manual Material Handling task.

# data are loaded from github
### data preparation
myfile <- "https://raw.githubusercontent.com/zahrame/FatigueManagement.github.io/master/MMH_15p.csv"
alldata <- read.csv(url(myfile))

#define the participants
g1 <- c("P1","P2","P3","P4","P5","P6",
        "P7","P8","P9","P10","P11","P12",
        "P13","P14","P15")

#define the combinations of 2-participants out for cross-validation
allparticipants <- cbind(c(g1))
allcombinations2 <- t(data.frame(combn(c(g1), 2)))
allcombinations <- apply(allcombinations2, 2, as.character)
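
As a point of reference, the sketch below shows how each row j of allcombinations defines one 2-participants-out split; with choose(15, 2) = 105 rows, looping over them produces the 105 training and test sets, and the feature-selection and model-development chunks in the later sections run inside this loop.

### sketch of the cross-validation loop: row j of allcombinations holds the two held-out participants
nrow(allcombinations)   ### 105 = choose(15, 2) combinations
for (j in 1:nrow(allcombinations)) {
  test.index  <- which(alldata$subject %in% allcombinations[j, ])    ### held-out pair
  train.index <- which(!(alldata$subject %in% allcombinations[j, ])) ### remaining 13 participants
  ### feature selection and model development for this split go here
}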

1.2.2 Supply Insertion task

The output of this step is 78 training and test sets obtained using 2-participants-out cross-validation for the Supply Insertion task.

# data are loaded from github
### data preparation
myfile <- "https://raw.githubusercontent.com/zahrame/FatigueManagement.github.io/master/WLK_13p.csv"
alldata <- read.csv(url(myfile))

#define the participants
g1 <- c("P1","P2","P4","P6","P8","P12","P13","P14","P15","P16","P17","P18","P19")

#define the combinations of 2-participants out for cross-validation
allparticipants <- cbind(c(g1))
allcombinations2 <- t(data.frame(combn(c(g1), 2)))
allcombinations <- apply(allcombinations2, 2, as.character)

1.2.3 Data Scaling

In the snippet below, we define a function to scale the continuous features. The function subtracts the mean of each continuous column (feature) from every sample in that column and then divides by the column's standard deviation; binary (0/1) columns are left unchanged.

## scale the features: continuous columns are centered and scaled, binary (0/1) columns are left as-is
scaleContinuous <- function(data) {
    binary = apply(data, 2, function(x) {all(x %in% 0:1)})  # flag the binary (0/1) columns
    data[!binary] = scale(data[!binary])                    # center and scale the continuous columns
    return(data)
  }
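
As a quick illustration (using a made-up two-column data frame, so the column names here are hypothetical), the function standardizes the continuous column and leaves the binary column untouched:

### toy example: the binary (0/1) column is returned as-is, the continuous column is standardized
toy <- data.frame(on_break = c(0, 1, 0, 1), heart.rate = c(80, 95, 110, 125))
scaleContinuous(toy)
### on_break keeps its 0/1 values; heart.rate now has mean 0 and standard deviation 1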

2 Section: Model Development and Evaluation

In this section, we develop several analytical models for the Manual Material Handling and Supply Insertion tasks. The output of this section is the analytical model with the best performance in predicting both fatigued and non-fatigued states for each of the two tasks.

2.1 Section: Feature Selection

In this step, we implement feature selection to reduce the dimensionality of our dataset. To do so, we first scale the training set and then use (a) best subset selection with BIC as the criterion, and (b) LASSO. The output of this section is the set of important features for the training set. It should be noted that this procedure is repeated for each cross-validation set: 105 times for the Manual Material Handling task and 78 times for the Supply Insertion task overall.

2.1.1 Best Subset Selection

The subsection below shows the implementation of best subset selection for feature selection using BIC as the criterion.

test.index <- which(alldata$subject %in% allcombinations[j,])
train.index <- which(!(alldata$subject %in% allcombinations[j,]))
X <- scaleContinuous(alldata[train.index,-seq(1,6)])
y <- alldata[train.index, 4]
Xy <- cbind(X, fatigue.level = y)
out1 <- bestglm(Xy, IC = "BIC", intercept = FALSE)
##select the best GLM model 
best.temp <- out1$BestModels[1,]
BEST <- colnames(best.temp)[best.temp==TRUE]
BEST <- c("subject", "fatiguestate1", "fatiguestate", BEST)

2.1.2 LASSO for Feature Selection

The subsection below presents the implementation of LASSO for feature selection.

test.index <- which(alldata$subject %in% allcombinations[j,])
train.index <- which(!(alldata$subject %in% allcombinations[j,]))
X1 <- alldata[train.index, -seq(1,6)]
X  <- as.matrix(scaleContinuous(X1))
y  <- as.matrix(alldata[train.index, 4])
cvfit <- glmnet::cv.glmnet(X, y, alpha = 1, intercept = FALSE)
Coefficients  <- data.frame(as.matrix(coef(cvfit, s = cvfit$lambda.min)))
Coefficients1 <- data.frame(feature = rownames(Coefficients), coef = Coefficients[,1])
### keep the features with non-zero coefficients and drop the intercept term by name
BEST <- as.vector(Coefficients1[!rowSums(Coefficients1 == 0), 1])
BEST <- setdiff(BEST, "(Intercept)")
BEST <- c("subject", "fatiguestate1", "fatiguestate", BEST)

3 Section: Model Construction

In this step, the analytical models are constructed. We use the train function from the caret package. Within the train function, we use rf for random forest, treebag for bagging, gbm for boosting, svmRadial for SVM, and glm for logistic regression. We use several performance metrics to evaluate each developed model: sensitivity, specificity, accuracy, consistency, and G-means.
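
The G-means value is not computed explicitly in the chunks that follow; assuming the usual definition of G-means as the geometric mean of sensitivity and specificity, it can be derived from those two metrics as in the short sketch below.

### G-means as the geometric mean of sensitivity and specificity (assumed definition)
gmeans <- function(sensitivity, specificity) sqrt(sensitivity * specificity)
gmeans(0.80, 0.70)   ### example: about 0.748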

3.1 Section: Manual Material Handling Task Model Development

For each of the 105 cross-validation sets in the Manual Material Handling task, we first generate bootstrapped samples from the training set (equal in size to the data of the 13 (= 15 - 2) training participants), which results in 117 (13 x 9) samples for each of the fatigued and non-fatigued states. Then we develop these models: bagging, boosting, random forest, SVM, logistic regression, and penalized logistic regression. We repeat the bootstrapping procedure 200 times for each training set and record the average of the performance metrics for each training set.

Train.0 <- which(alldata$fatiguestate1[train.index]==0)
Train.1 <- which(alldata$fatiguestate1[train.index]==1)
  
accuracy <- rep(NA, 200)
sen <- rep(NA, 200)
spe <- rep(NA, 200)
  
for (k in 1:200){
    set.seed((1994+k))
    index.train <- c(sample(Train.0, 117, replace=TRUE), sample(Train.1, 117, replace=TRUE))
    test <- alldata[test.index, BEST]
    train <- alldata[index.train, BEST]
    fatigued.states<- sum(test$fatiguestate1)
    not.fatigued.states<- nrow(test)-sum(test$fatiguestate1)
    valid.participant<-allcombinations[j,]
    
    train_x <- scaleContinuous(train[,-c(1,2,3)])
    dataset <- data.frame(cbind(class=train[,c("fatiguestate")],train_x))
    train.colsd <- apply(train[,-c(1,2,3)], 2, sd)
    train.colmeans <- apply(train[,-c(1,2,3)], 2, mean)
    prepare_test <- test[,-c(1,2,3)]
    
    ### prepare the test data
    ### Center and Scale the features in the test data using the training data
    for (o in 1: ncol(prepare_test)){
      prepare_test[,o] <- (prepare_test[,o]-train.colmeans[o])/train.colsd[o]
    }
    
    test_forfeatureselection <- data.frame(cbind(class=test[,c("fatiguestate")],prepare_test))
    test_forfeatureselection$class <- as.factor(test[,c("fatiguestate")])
    dataset_test <- data.frame(test_forfeatureselection)
    testsamplesize <- nrow(dataset_test) 
    
    ### model development
    set.seed((123+k))
    mod <- train(class ~ ., method = "rf", data = dataset) ### replace "rf" with "treebag", "gbm", "svmRadial", or "glm" for the other models
    pred <- predict(mod, dataset_test)
    
    ### Check the performance of the model
    comparison <- table(dataset_test$class,pred)
    accuracy[k] <- (comparison[1,1]+comparison[2,2])/testsamplesize
    if(not.fatigued.states==0){
      spe[k] <- 0
    }else{
      spe[k] <- comparison[2,2]/ (not.fatigued.states)
    }
    
    if(fatigued.states==0){
      sen[k] <- 0
    }else{
      sen[k] <- comparison[1,1]/(fatigued.states)
    }
  }
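
After the 200 bootstrap replicates, the averages described above can be obtained directly from the accuracy, sen, and spe vectors; a minimal sketch is given below (the na.rm = TRUE guards are only needed if some replicates could not be evaluated).

### average the performance metrics over the 200 bootstrap replicates for this training set
mean.accuracy    <- mean(accuracy, na.rm = TRUE)
mean.sensitivity <- mean(sen, na.rm = TRUE)
mean.specificity <- mean(spe, na.rm = TRUE)
sd.accuracy      <- sd(accuracy, na.rm = TRUE)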

3.2 Section: Supply Insertion Task Model Development

For each of the 78 cross-validation sets in the Supply Insertion task, we first generate bootstrapped samples from the training set (equal in size to the data of the 11 (= 13 - 2) training participants), which results in 99 (11 x 9) samples for each of the fatigued and non-fatigued states. Then we develop the same models as for the Manual Material Handling task and record the average of the performance metrics for each training set.

Train.0 <- which(alldata$fatiguestate1[train.index]==0)
Train.1 <- which(alldata$fatiguestate1[train.index]==1)
  
accuracy <- rep(NA, 200)
sen <- rep(NA, 200)
spe <- rep(NA, 200)
  
for (k in 1:200){
    set.seed((1994+k))
    index.train <- c(sample(Train.0, 99, replace=TRUE), sample(Train.1, 99, replace=TRUE))
    test <- alldata[test.index, BEST]
    train <- alldata[index.train, BEST]
    fatigued.states<- sum(test$fatiguestate1)
    not.fatigued.states<- nrow(test)-sum(test$fatiguestate1)
    valid.participant<-allcombinations[j,]
    
    train_x <- scaleContinuous(train[,-c(1,2,3)])
    dataset <- data.frame(cbind(class=train[,c("fatiguestate")],train_x))
    train.colsd <- apply(train[,-c(1,2,3)], 2, sd)
    train.colmeans <- apply(train[,-c(1,2,3)], 2, mean)
    prepare_test <- test[,-c(1,2,3)]
    
    ### prepare the test data
    ### Center and Scale the features in the test data using the training data
    for (o in 1: ncol(prepare_test)){
      prepare_test[,o] <- (prepare_test[,o]-train.colmeans[o])/train.colsd[o]
    }
    
    test_forfeatureselection <- data.frame(cbind(class=test[,c("fatiguestate")],prepare_test))
    test_forfeatureselection$class <- as.factor(test[,c("fatiguestate")])
    dataset_test <- data.frame(test_forfeatureselection)
    testsamplesize <- nrow(dataset_test) 
    
    ### model development
    set.seed((123+k))
    mod <- train(class ~ ., method = "rf", data = dataset) ### replace "rf" with "treebag", "gbm", "svmRadial", or "glm" for the other models
    pred <- predict(mod, dataset_test)
    
    ### Check the performance of the model
    comparison <- table(dataset_test$class,pred)
    accuracy[k] <- (comparison[1,1]+comparison[2,2])/testsamplesize
    if(not.fatigued.states==0){
      spe[k] <- 0
    }else{
      spe[k] <- comparison[2,2]/ (not.fatigued.states)
    }
    
    if(fatigued.states==0){
      sen[k] <- 0
    }else{
      sen[k] <- comparison[1,1]/(fatigued.states)
    }
  }

3.3 Section: Consolidating Results of Analytical Models

In this section, we consolidate the output from the model development section. It should be noted that all of the computations are performed on the Ohio Supercomputer Center. The code below relies on each cross-validation case contributing four consecutive elements to the Result list (see Section 4): a summary row (model, number of features, held-out pair, and metrics), the vector of averaged metrics with their standard deviations, the selected important features, and the held-out participants. The following code merges the performance metrics, their descriptive statistics, and the important features across all cross-validation models into one Excel file for each task.

sen_spe <- data.frame(matrix(0, nrow = ncases, ncol = 10))
colnames(sen_spe) <- c("model","#features","par1","par2","accuracy", "sd(accuracy)","sensitivity","sd(sensitivity)", "specificity", "sd(specificity)") 
best_features <- list()
participant.valid <- list()
metric <- list()

for (i in 1:ncases){
  sen_spe[i,] <- as.vector(Result[[(4*i-3)]])
  metric[[i]] <- Result[[(4*i-2)]]
  best_features[[i]] <- Result[[(4*i-1)]]
  participant.valid[[i]] <- Result[[(4*i)]]
}


metric1 <- (data.frame(plyr::ldply(metric, rbind)))
metric2 <- metric1[complete.cases(metric1), ]
mean.model <- apply(metric2, 2, mean) 
sd.model <- apply(metric2, 2, sd) 
median.model <- apply(metric2, 2, median) 
q.model <- apply( metric2, 2, quantile ,probs=c(0.25,0.75))

all.metrics <- rbind(mean.model,sd.model,median.model, q.model)
colnames(all.metrics) <- c("accuracy", "sd(accuracy)", "sensitivity","sd(sensitivity)","specificity", "sd(specificity)")
model_features <- (data.frame(plyr::ldply(best_features, rbind)))

model <- "name of the  model_"
sampling <- "bootstrap_"
par <- "2out_"
feature.selection <- "BIC_"
task <- "name of the task, for example MMH"


### export outputs to an excel file
Myfile <- paste(task,"_", model,"_", par,"_", feature.selection,".xlsx", sep="")
write.xlsx(sen_spe, file = Myfile, sheetName=paste("sen_spe", sampling, sep=""), row.names=FALSE)
write.xlsx(model_features, file = Myfile, sheetName=paste("features", sampling, sep=""), append=TRUE, row.names=FALSE)
write.xlsx(all.metrics, file = Myfile, sheetName=paste("all metrics", sampling, sep=""), append=TRUE, row.names=TRUE)

4 Section: Parallel Processing Implementation

The section below shows the script used to run the parallel processing code on the Ohio Supercomputer Center. This code also records the amount of time required to finish the job.

time.begin <- proc.time()[3]
cl <- makeCluster(, type="SOCK") ### number of cores 
ncases <- nrow(allcombinations) 
Result <- parSapply(cl, 1:ncases, Mainfun, alldata, allparticipants, allcombinations)
stopCluster(cl)
time.end <- proc.time()[3]-time.begin
paste("It took", time.end, "seconds to run the program.")

  1. Adelphi University, zmaman@adelphi.edu

  2. University of Dayton, ychen4@udayton.edu

  3. University of Calgary, amirbagh@buffalo.edu

  4. Massachusetts Institute of Technology, seamuslo@mit.edu

  5. University at Buffalo, loracavu@buffalo.edu

  6. Miami University, fmegahed@miamioh.edu