MODELING THE BEHAVIOR IN CHOOSING THE TRAVEL MODE FOR LONG-DISTANCE TRAVEL USING SUPERVISED MACHINE LEARNING ALGORITHMS

Resume The long-distance travel (LDT) mode choice modeling is important for transportation planners. This study investigated alternative mode choice behavior for the LDT between the intercity buses and trains. A questionnaire survey, consisting of important mode choice attributes, was conducted on various groups of people in Bangladesh. Numerous travel mode choice contributing features (e.g., travel time, travel costs, origin-destination, comfort, safety, travel time reliability, ticket availability and schedule flexibility) were considered and the LDT mode choice models were developed using various machine learning algorithms typically applied for classification problems. With 95.31 % accuracy and 0.95 F1-score, Random Forest model was the best performing model for the dataset. According to the findings of this study, the intercity bus is preferred over the intercity train for LDT in Bangladesh.

In travel behavior research, discrete choice models, such as the multinomial logit model [7], are usually used. According to Cheng et al. [8] machine learning (ML) methods are a viable alternative to statistical models to predict the mode preferences for traveling. Many researchers have employed the ML techniques to predict travel mode preference behavior in recent years. However, differences can be identified between ML and conventional statistical methodologies in terms of understanding the data structure [9]. A logit model presupposes the data structure using assumptions regarding behavior and statistics, whereas many popular ML methods are non-parametric, devoid of theoretical assumptions about the underlying data structure and rely on computers for analyzing the data [10]. Hence, more flexible structures can be formed using the ML algorithms to yield better predictive ability on test samples. Several recent works in travel behavior research have shown that ML models are outperforming logit models in predictive capacity, especially in research related to travel behavior [8][9][11][12][13].
Preference for travel mode or modal split has been studied over the past few decades for transportation planning and policymaking. Mode choice depends on

Introduction
Understanding the causal variables is vital in predicting the travel demand in the transportation planning domain. Factors such as individual traits, household type, security, comfort level, weather and built environment affect a person's travel mode choice [1][2][3]. Two of the most competitive public transport modes for the long-distance travel (LDT) within a country are intercity train and intercity bus. Bus uses shared road space with other vehicles on highways, whereas the train uses exclusive right-of-way with high ridership potential without occupying any road space [4]. Hence, prioritizing the train services will lead to more efficient use of land for transportation than bus services. On the contrary, intercity buses provide more accessibility and flexibility than intercity trains. Besides, construction, operation and improvement of the bus service systems are less expensive and time-consuming than the intercity train systems [5]. It has been argued by Kampf et. al. [6] that different modes of transport are essential parts of a sustainable transport system. Therefore, there is a need for research to assess peoples' preferences between intercity train and intercity bus services.
knowledge and activities at the cognitive level into mode choice behavior; Tang et al. [26] discussed the travel mode switching behavior of people who were only given two options using DTs. On the other hand, Cheng et al. [8] modeled the behavior in travel mode preference using the RF algorithm. Instead of using logit model, travel mode preference can be modeled using a classification problem-based approach. Many researchers have argued that ML classifiers can effectively model individual travel behavior [8-9, 11, 13].
Across the world, many researchers have focused on revealing different factors affecting the LDT. Moeckel et al. [27] explained why it is important to predict and understand the LDT mode in terms of vehicle miles traveled, as well as looked at different logit and combined choice models for proposing a new (modified R 3 logit) nested multinomial logit model to predict LDT mode for the state of North Carolina, USA. A study conducted in Japan by Shen [28] found that the Latent Class Model performed better than the Mixed Logit Model when it came to choosing transport mode among monorail, car and bus. Bok et al. [29] conducted an empirical study for LDT in Portugal by car, train, or bus. Similarly, many researchers for example Rohr et al. [30] in the UK, MVA [31] in the Netherlands, De Jong and Gunn [32] in Italy, Mandel et al. [33] in Germany and RAVE [34] in Portugal, conducted studies regarding the long-distance travel using different logit models. Furthermore, Gasparik et al. [35] explored the technical and non-technical obstacles to the operation of longdistance rail services in the European Union.
In addition, ML classifiers have been found to outperform conventional logit models in forecasting the travel-mode preferences, e.g., RF classifier achieves better accuracy in less computational time and with less modeling effort than the multinomial logit model (MNL) [12,36]. ML models increase compatibility with the empirical data by allowing flexible model structures, whereas logit models work on a predetermined model structure; thus, ML models perform better than logit models in predicting mode preferences for travel [9,37]. This is due to the fact that the logit model prioritizes the estimation of the parameter to increase the predictive precision of the model [38]. Wang and Ross [39] compared the performance between extreme gradient boosting (XGB) and the MNL model in predicting travel mode choice and found that in overall, the XGB model is better at making predictions than the MNL model, especially when the data set is not extremely unbalanced. Omrani [40] applied four ML methods (neural net-RBF, neural net-MLP, multinomial logistic regression and SVM) to predict how people in Luxembourg will choose to travel and found that the ML methods perform better.
Despite showing a significant overall model fit, the mixed logit model was found to have worse prediction accuracy than the simpler multinomial logit model [41]. According to Mullainathan and Spiess [38], although the ML approaches yield better predictive accuracy, these are several features, such as time required for travel, travel expenses, level of comfort, safety, convenience and so on [14][15]. Each feature has underlying relation with the mode choice model individually and in combination. The mode choice model varies with demographics, socioeconomic and geographic conditions. Two of the most significant factors in choosing among available alternative travel modes are found to be the adequacy of transportation infrastructure and level of service [16].
The LDT differs from short-distance daily travel in various aspects. According to the European DATELINE study [17], the LDT is defined as trips that cover 100 kilometers or more, whereas the US Bureau of Transportation Statistics [18] defines LDT as trips greater or equal to 50 miles (83.33 kilometers) travel from the origin. Though few prior pieces of literature have been found to distinguish clearly between the different features and aspects of LDT than shortdistance travel, it is apparent that people used to make a distinctive choice for both cases. Short-distance travels are typically work trips, non-work trips, shopping trips which are most frequent, whereas the LDT is mostly infrequent and subjected to non-work trips and vacation trips.
Several studies performed nested logit model [19], structured equation model [20], neural network [21], decision tree (DT) [10] and random forest (RF) algorithm [8,12,22] to understand modal split behavior. Each of the models has its pros and cons. This research aimed to use popular supervised ML algorithms. used for classification problems such as Naive Bayes (NB), Support Vector Machine (SVM), K-Nearest Neighbors (KNN), RF and DT models, to develop a mode preference model for LDT, using stated preference survey data. Further, the attributes important in predicting the travel mode preference have been identified by investigating the users' LDT mode choice behavior using the best-performing ML-based classification method in the context of Dhaka, the capital city of Bangladesh. According to the authors' knowledge, none has performed such study for modeling the LDT mode choice using ML approaches in a low-income country like Bangladesh.

Literature review
The growing challenge of increasing demand for travel, safety concerns, energy exhaustion, emission of deadly gases and environmental deterioration has prompted transportation engineers to adopt ML techniques to solve these dynamic problems [23]. The ML is an assemblage of methodologies or algorithms that allow computers to program the development of the data-driven model by detecting patterns in statistically significant data [24]. Recently, a variety of ML approaches have been employed for modeling the travel mode choice. Using artificial neural networks, Pulugurta et al. [25] were able to incorporate human railway stations. The questionnaire survey is designed with close-ended multiple-choice questions containing two segments. The questions in the questionnaire have been adapted from similar research [45][46] performed to model travel mode choice for the LDT. The first part of the questionnaire includes general questions regarding route traveling, frequency of travel and demographic information such as gender and income level. The second portion asks about travel mode preference and various features related to mode choice, e.g., the time required for travel, i.e., travel time, expenses incurred for travel, i.e., travel costs, comfort during journey, safety, reliability of journey time, stop or station closer (proximity) to destination, stop or station closer (proximity) to origin, availability of tickets and flexible schedule. The respondents were asked to choose between intercity train and intercity bus mode for LDT in the questions related to each of the features. Figures 1 and 2 show the distribution of respondents in terms of the travel frequency and gender. It is observed from Figure 1 that 59.86 % of respondents regularly travel while 40.14% are occasional travelers and among them, 73.24 % are male and 26.76 % are female respondents (ref. Figure 2). Female passengers in Bangladesh rarely travel long distances alone, which explains the less dominance of female respondents in the survey.
According to monthly household income and origin-destination of the journey, the distributions of respondents are represented in Figures 3 and 4. Among the income group, 38.50 % of the respondents have a monthly income of less than 20,000 Bangladeshi Taka (BDT) (equivalent to 212 USD, currency conversion rate as of 16 July, 2022) and 37.09 % have a monthly income between 50,000 BDT (equivalent to 578 USD) to 1,00,000 BDT (equivalent to 1,156 USD), 19.13 % earn 20,000 BDT (equivalent to 231 USD) to 50,000 BDT (equivalent to 532 USD) on a monthly basis and only 5.28 % have a monthly household income over 1,00,000 BDT (equivalent to 1,064 USD).
From Figure 4 can be observed that the portion of respondents traveling from Dhaka to four major metropolitan cities of Bangladesh in descending order often thought to have a lower level of explanatory power. Furthermore, the ML models are termed as hard to be explainable due to their inability to facilitate behavioral interpretation [42]. However, a development has been made recently in the ML domain to facilitate decisionmaking with the availability of various interpretation tools which can be applied in extracting knowledge from these uninterpretable models [43][44].
Variable importance is being commonly used as an aid to ML tools for modeling the mode preferences for travel [8,10,12]. Recently, Hagenauer and Helbich [12] distinguished the variable importance results between the ML methods and the multinomial logit model. Cheng et al. [8] have assessed the relative value of the explanatory variables of RF model by using the variable importance tool to formulate transportation policies. Therefore, it is clear that the analysis of the variable outputs of ML models can show which factors drive prediction decisions. This research focuses on identifying the attributes affecting travel mode choice for LDT in Bangladesh.

Data collection
In this study, four routes have been considered to understand the LDT mode choice behavior. The origin node of the routes selected for this study is Dhaka, the capital city of Bangladesh and the destination nodes are four other major metropolitans of Bangladesh, i.e., Chittagong, Rajshahi, Khulna and Sylhet. The distances between Dhaka and Chittagong, Rajshahi, Khulna and Sylhet are 244 kilometers, 247.7 kilometers, 270.3 kilometers and 240.5 kilometers, respectively; therefore, these trips can be considered as long-distance trips according to the Bureau of Transportation Statistics [18]. A total of 852 responses have been collected, out of which 302 responses were collected through an online questionnaire survey circulated via google forms and the remaining 550 responses were collected in person by a group of enumerators from different bus stands and dataset has been used in ML algorithms for further analysis.

Travel mode choice classification model development
A labeled dataset of 852 participants is used to classify travel mode choice to train different supervised ML algorithms, i.e., NB, SVM, DT, RF and KNN. NB, based on Bayes' theorem [47], is a classification strategy that predicts the probability of an occurrence on the basis of past knowledge about associated factors [48]. It works best in two scenarios: features that are fully independent and features that are functionally dependent [49]. The DT classifier creates a tree-like structure by categorizing the data set into smaller nodes, with terminal nodes indicating decision outcomes [48,50]. The RF is an ensemble classifier that is made up of numerous DTs, similar to a forest being made up of many trees [51]. Distinct parts of the dataset used for training are employed to train various DTs of an RF model. The RF classifier selects the classification that receives the greatest number of votes if the outcome is discrete and the mean of all trees is considered for numeric categorization [48]. In an m-dimensional space, an SVM generates a separation hyperplane, where m is the number of features. These hyperplanes

Data preprocessing
The collected data stored in google forms is exported as comma-separated values (.csv) file and then the data is converted to categorical dummy variables. The travel mode choice option "Intercity Train" was converted to 0 and "Intercity Bus" was converted to 1, the four different routes, i.e. Dhaka-Rajshahi, Dhaka-Chittagong, Dhaka-Khulna and Dhaka-Sylhet, are coded as 0, 1, 2 and 3 respectively, frequency of travel being occasionally or regularly is converted to 0 and 1, age of the respondents classified as less than 18 years, 18 to 40 years and more than 40 years are coded as 0, 1 and 2, males are coded as 0, in contrast, females are coded as 1, monthly household income being less than BDT 20,000 is converted to 0 and similarly, income levels of BDT 20,000 to 50,000, BDT 50,000 to 1,00,000 and more than BDT 1,00,000 are converted to 1, 2 and 3, the occupation of the respondents classified as service holder, student, businessman and housewife are coded as 0, 1, 2 and 3. For training the model, 70 % of the dataset was chosen randomly and the rest was used for testing purposes. Using various python libraries, i.e., NumPy, Pandas, Matplotlib, Seaborn and Scikit-learn, the preprocessed

Mode choice classification models
A variety of supervised ML algorithms have been used to classify the preferences for the LDT mode choice. The accuracy and F1-score achieved by different methods are depicted in Table 1. It shows that the RF classifier produces the highest accuracy and F1-score among all of the methods tested. Hence, the RF classifier is selected as the best method for classifying mode choice preferences for LDT considering the F1-score and accuracy achieved in the testing dataset. Besides that, the RF method can determine out-of-bag accuracy by measuring the average prediction accuracy obtained from the trees whose samples are not considered in the bootstrap sample. Therefore, the over-fitting problem can be detected by observing the out-of-bag accuracy. The out-of-bag accuracy of RF classifier was found to be 93.12 % which is very high, so it is evident that the model does not suffer from an over-fitting problem. Table 1 shows that the accuracy and F1-scores of various algorithms are very close and Grandini et al. [54] showed that the accuracy and F1-score might be the same in some cases.
Furthermore, a random search cross-validation technique has been utilized to discover the optimum hyperparameters for refining the RF model's performance in order to improve its accuracy. From a grid of a range of function as decision boundaries for classifying both linear and non-linear data points. All the data are first mapped onto an m-dimensional feature space, then the SVM determines the separation hyperplane having the maximum margin with minimum classification errors [48,52]. The KNN algorithm is a simplified form of the NB classifier where probability values are not required [48,53]. The letter K in the KNN stands for the number of nearest neighbors who are regarded to be eligible to receive a 'vote' from the algorithm. In some cases, differing values for the variable K can result in different categorization results for the same datasets.
To evaluate which model is the best suited to the datasets used in this study, the accuracy and F1-score of each model are compared to one another. The hyperparameters of the model achieving the highest accuracy and F1-score are tuned to obtain an optimized model. Furthermore, the receiver operating characteristic (ROC) curve, F1-score, confusion matrix, precision, recall and area under the curve (AUC) of the selected ML model are all examined to evaluate its overall performance. Additionally, variable importance scores are computed to identify the most significant factors in predicting the travel mode choice.

Mode choice preferences
Mode choice preferences stated by the respondents corresponding to nine independent variables (i.e., station closer to origin or destination, flexibility of travel schedule, availability of tickets, reliability of travel time, safety, comfort, travel costs and time required for travel) and one dependent variable (preferred travel mode for a particular route) are summarized in Figure  5. In terms of the mode choice for LDT, 55.75 % of the respondents preferred to use the bus, while 44.25 % The performance evaluation report of the optimized RF model in terms of different metrics is provided in Table 4. Table 4 portrays different metrics, i.e., precision, recall and F1-score, for evaluating the performance of the optimized RF model. The support values for class 0 and class 1 are very close. This makes the testing dataset a balanced set leading to similar precision, recall and F1 scores for the classifier. The precision value of 0.95 reveals that 95 % of the positive predictions made by the model are also positive observations, the recall value of 0.95 means that 95 % of the positive observations are also predicted as positive labels by the model and the F1-score of 0.95 explains that 95 % of the positive predictions are correctly classified. All of these metrics are close to 1.0, representing a good predictive power of the developed classification model. Figure 6 shows the ROC curve of the optimized RF model.
An ROC curve denotes connection amid false positive (FP) rate and true positive (TP) rate where the FP rate = FP/ (FP + true negative) and TP rate = TP/ (TP + false negative). The AUC is used to summarize the ROC curve since it measures the capacity of a classifier to distinguish between different classes. The AUC different hyperparameters, i.e. the total number of trees used, maximum number of features used to split a node, the maximum number of steps performed in each DT, the minimum points placed in a node before splitting the node, the minimum number of points a leaf node can hold, methodology (with or without replacement) used to sample data points, samples are randomly picked from the grid and ten-fold cross-validation is performed with each combination of values using Scikit-Learn's RandomizedSearchCV method. A summary of the hyperparameter values used, testing accuracy and validation accuracy of the RF model after optimization is presented in Table 2.
After optimizing the RF model, the testing accuracy has been unchanged, but the out-of-bag accuracy has improved by 0.36 %. Confusion matrix for the testing dataset generated by the optimized model is shown in Table 3.
From the confusion matrix for the testing dataset, presented in Table 3, can be observed that only seven testing instances of the bus are falsely predicted as train and five testing samples of the train are falsely predicted as bus. In contrast, the 115 testing samples of train and 129 samples of train are correctly classified.   can be considered as effective in distinguishing between two classes.

Determining important features
In order to determine the important features in choosing a travel mode for LDT, the feature importance scores of all the features used in developing the reveals how satisfactorily the model differentiates amid positive and negative classes. As AUC increases, so does the model performance. Usually, the AUC value ranges between 0.5 and 1.0. From Figure 6 can be observed that the ROC curve of the optimized RF model is way above the random classifier line and very close to the perfect separation point. In addition, the AUC score of the developed model is 0.991, which is very close to 1.0, representing the perfect classifier. As a result, the model  long-distance travel by airways and waterways can be considered as a future scope of the study. Besides, only machine learning models have been used to model the mode choice preferences. As an extension of this study, discrete choice modeling techniques can be used and compared to the performance of the machine learning models.

Conclusions
This study investigates travelers' alternative mode choice behavior between the intercity trains and intercity buses for the LDT. Data related to demographics, socioeconomic status of the respondents and various features of mode choice are collected from a questionnaire survey conducted on various groups of people in the capital city, Dhaka, in Bangladesh. Among the features considered for modeling travel mode choice, time required for travel, costs associated with travel, the proximity of origin or destination from stop or station, comfort, safety, reliability of journey time, availability of tickets and flexibility of schedule have been considered. Using the acquired data, several ML algorithms are used to predict the travel mode choice behavior. Considering the model accuracy and F1-score, the RF model outperformed all the others, with 95.31 % accuracy and 0.95 F1-score. Further, the model has been optimized by tuning different hyper-parameters, which led to an unchanged accuracy but an increased out-of-bag accuracy of 93.46 %. The feature importance score determined from the model revealed that reliable journey time, time required for travel, stop or station closer to destination, journey location and frequency of travel are the most critical features in forecasting travel mode choice.  Figure 7. The feature importance score represents the contribution of the feature in making the decision regarding the travel mode choice for LDT.

Nomenclature
In the RF classification model, Gini importance, or mean decrease in impurity, is used to measure the importance of features [55]. The features shown in Figure 7 are ranked as per decreasing importance score. Reliable journey time is the most important feature for the proposed LDT mode choice, time required for travel, stop or station closer to destination, journey location and frequency of travel are the second, third, fourth and fifth most important features. The other features, deemed important in deciding on the travel mode, are flexible schedule, stop or station closer to the origin, availability of tickets, safety, comfortability, monthly household income, occupation, gender, travel costs and age of the respondent in descending order of importance.
So, it is evident that for the LDT, travelers in Bangladesh emphasize more the reliable journey time, time required for travel, stop or station closer to destination, journey location and frequency of travel compared to other factors. On the other hand, flexible schedule, stop or station closer to the origin, availability of tickets, safety, comfortability, monthly household income, occupation, gender, travel costs and age of the respondent have less influence in choosing travel mode for LDT. Although the intercity trains are more preferable to intercity buses in terms of comfort and travel costs, buses are more preferred by the respondents in overall (see Figure 5). This is because these two factors contribute less in deciding on the travel mode in comparison to other factors. Hence, the intercity bus is the preferred mode of LDT over intercity train in Bangladesh, especially due to less reliable journey time, time required for travel, stop or station closer to destination, journey location and frequency of travel. However, it is to be kept in mind that flexible roads have been found to deteriorate early in Bangladesh [56][57], if such situation continues to degrade in the future, people's mode choice in the LDT might get changed.

5
Limitations of the study One of the limitations of this study is that it only considered two modes of the long-distance travel, i.e., bus and train. There are two other modes of transportation available for long-distance travel in Bangladesh, i.e., airways and waterways. The railway network in Bangladesh covers approximately 2877.10 kilometers connecting 44 of the 64 districts, whereas only eight districts are connected by air and only a few districts in the Barishal division have waterways (launch) for longdistance travel. So, buses and trains represent most of the long-distance travel in Bangladesh, while the