Analysis of Train Accidents 2001-2011 Exploratory Data Analysis

Published on January 2017

Analysis of Train Accidents in the U.S. During
2001-2011

1. Problem Description
1.1. Situation
Between 2001 and 2011, Amtrak lost more than $4.2 billion in accident damages from
almost 40 thousand accidents. The human toll over the same period was 619 deaths and
6,504 injuries [1, 2, 7]. As alarming as these totals are, most of the accidents were minor, and a
relatively small number of catastrophes are responsible for the bulk of the cost and damages.
Additionally, rail generally accounts for a small proportion of the total transportation fatalities
each year in the U.S. [4].
Amtrak has developed an impressive database to catalog and capture key elements of each
event over the past ten years. With a few exceptions, the data categories have remained
consistent, with over 140 different components [3].

Figure 1: All Accident Pair Comparison

Figure 1 is a preliminary example of component comparisons using all accident data [2, 7].
Total killed and equipment damage had correlations of 0.31 and 0.76, respectively, with total
accident damage. The type of accident and total tonnage had a correlation of 0.29. None of
these relationships was surprising, but they do support further component analysis. I included
a number of categorical variables, such as weather type, since I initially suspected that weather
might play a role in accidents. According to this chart, there appears to be no real correlation
between weather and the other components, as most of the accidents appear to have taken place
on clear days. This might be misleading since it does not yet account for extreme events.
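
The correlation values quoted above are pairwise Pearson coefficients. As a minimal sketch of how one such value is computed (in Python rather than the R code used for the figures, and with made-up numbers that are not taken from the accident data), the variable names below are illustrative assumptions:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares of deviations from the mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up illustrative values, NOT taken from the accident data.
equipment_damage = [10.0, 25.0, 40.0, 80.0, 120.0]
accident_damage = [15.0, 30.0, 70.0, 95.0, 160.0]

r = pearson(equipment_damage, accident_damage)
print(round(r, 2))
```

A coefficient near 1 or -1 indicates a strong linear relationship; for a categorical variable such as weather type the number depends on how the categories are coded, so it should be read with caution.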

Figure 2: Total Accident Damage (Log$) and Total Killed by Accident Cause

Figure 2 breaks out the total cost and total killed into five cause categories [3]. The box plots
show that all five have similar average costs. However, the miscellaneous, human factors, and
track categories had higher total killed as well as a significant number of extreme accident costs.

Figure 3: Total Accident cost ($) and Type

Next, I separated the total accident cost by type of accident, displayed in the box plots in Figure
3. From this perspective, explosive accidents (type 10) have the largest total cost. These events
were rare (as evidenced by the span of their fifty percent box), but had costly consequences. A
number of other accident types had extreme event outliers, such as head-on collisions (type 2),
derailments (type 1), and side collisions (type 4), warranting possible further analysis.
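
For reference, box plots conventionally flag as outliers the points lying more than 1.5 times the interquartile range beyond the box. The following is a minimal Python sketch of that rule, assuming the conventional 1.5 multiplier (the setting used for Figure 3 is not stated) and using made-up costs rather than the accident data:

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartile box."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]

# Made-up costs (in thousands of dollars), NOT from the accident data.
costs = [1, 2, 3, 4, 5, 6, 7, 100]
outliers = iqr_outliers(costs)
print(outliers)
```

Different software computes quartiles slightly differently, so the fences from this sketch may not match the R plots exactly.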

Figure 4: Total Number of Accidents by Cause and Train Type

As displayed in Figure 4, most accidents were categorized as train operations (human factors)
errors or track, roadbed and structures issues, and most of the accidents occurred on freight
trains or during rail yard switching. As highlighted in yellow (indicating roughly one thousand
events) in Figure 4, most of the freight train accidents are caused by track, roadbed and
structures, followed by human factors, and most of the yard or switching accidents are caused by
human factors. Based on this data, it may be useful to focus on freight trains and rail yard operations.
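
The counts behind a chart like Figure 4 amount to a two-way frequency table of cause by train type. A minimal Python sketch of that tabulation follows; the field names, category labels, and records are hypothetical and are not taken from the accident data:

```python
from collections import Counter

def cross_tab(records, row_key, col_key):
    """Count records for every (row, column) pair of category values."""
    return Counter((r[row_key], r[col_key]) for r in records)

# Made-up records, NOT from the accident data; labels are illustrative.
accidents = [
    {"cause": "track", "train_type": "freight"},
    {"cause": "track", "train_type": "freight"},
    {"cause": "human factors", "train_type": "yard/switching"},
    {"cause": "human factors", "train_type": "freight"},
    {"cause": "human factors", "train_type": "yard/switching"},
]

table = cross_tab(accidents, "cause", "train_type")
for (cause, train_type), count in table.most_common():
    print(f"{cause:15s} {train_type:15s} {count}")
```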

Next, I chose the year with the highest 4th quartile (2010) and focused on possible components
of extreme events [9]. There were 51 events in the 4th (top) quartile of total accident cost that
had at least one person killed or injured. Figure 5 shows six possibly important variables and
their correlations when filtered for extreme events. This chart shows correlations above 0.5 for
total accident cost and total killed, total killed and tonnage, as well as accident temperature and
visibility. Since visibility is a categorical variable, the trend lines and numbers are of little value,
but there are two distinct clumps of data at 2 (daytime) and 4 (nighttime).

Figure 5: Extreme Events (2010 4th Qtl) Pair Comparison

Figure 6: biplot of 2010 extreme events

As the biplot in Figure 6 shows, total accident damage and total equipment damage are the
strongest variables since their arrows are nearly parallel and have by far the largest magnitudes.
This also detracts from the importance of the other chosen variables since their respective arrows
have relatively small magnitudes.

Figure 7: Loadings Plot of all Events for Extreme Data

Figure 7 shows the loadings plot for the top three components. It shows that the top component
is composed almost entirely of total equipment damage and total accident cost, which are
positively related.
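
To illustrate what the loadings of the top component represent, the following minimal Python sketch approximates the first principal component by power iteration on the correlation matrix. This is a swapped-in technique chosen to keep the example self-contained, not the R routine behind Figures 6 and 7, and whether those figures were based on the correlation or the covariance matrix is an assumption here. The data rows are made up.

```python
import math

def first_principal_component(rows, iterations=200):
    """Approximate the first PCA loading vector by power iteration
    on the correlation matrix of the columns in `rows`."""
    n = len(rows)
    p = len(rows[0])
    # Standardize each column to zero mean and unit sample variance.
    means = [sum(row[j] for row in rows) / n for j in range(p)]
    sds = [
        math.sqrt(sum((row[j] - means[j]) ** 2 for row in rows) / (n - 1))
        for j in range(p)
    ]
    z = [[(row[j] - means[j]) / sds[j] for j in range(p)] for row in rows]
    # Correlation matrix of the standardized columns.
    corr = [
        [sum(zr[a] * zr[b] for zr in z) / (n - 1) for b in range(p)]
        for a in range(p)
    ]
    # Power iteration: repeatedly multiply by the matrix and renormalize.
    v = [1.0] * p
    for _ in range(iterations):
        w = [sum(corr[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Made-up rows: (accident cost, equipment damage, temperature).
# NOT taken from the accident data.
rows = [
    (100, 80, 60),
    (200, 150, 45),
    (300, 260, 70),
    (400, 310, 55),
    (500, 420, 65),
    (600, 480, 50),
]

loadings = first_principal_component(rows)
print([round(x, 2) for x in loadings])
```

In this made-up example the first two loadings are large and share a sign while the third is close to zero, which is the pattern described for the top component: two positively related damage variables dominate it.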

1.2. Goal
The overall goal of this study is to provide safety recommendations based on statistical evidence
to Amtrak in order to improve its safety record. The objective of this effort is to minimize the
severity of accidents, measured in injuries, deaths, and total accident costs.

1.3. Metrics
I define a severe accident as an extreme cost accident (one whose total cost falls in the top
quartile) in which at least one person was killed or injured.
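
As a minimal sketch of this metric, the Python function below flags records whose cost exceeds the third quartile and that have at least one casualty. The field names `cost`, `killed`, and `injured` are hypothetical, the records are made up, and the quartile is computed with Python's default method, which may differ slightly from the R calculation used in this study.

```python
import statistics

def severe_accidents(records):
    """Return records in the top cost quartile with at least one casualty."""
    costs = [r["cost"] for r in records]
    q3 = statistics.quantiles(costs, n=4)[2]  # third quartile (75th percentile)
    return [
        r for r in records
        if r["cost"] > q3 and (r["killed"] + r["injured"]) > 0
    ]

# Made-up records, NOT from the accident data.
records = [
    {"cost": 1000, "killed": 0, "injured": 0},
    {"cost": 2000, "killed": 0, "injured": 3},  # casualties but low cost
    {"cost": 3000, "killed": 0, "injured": 0},
    {"cost": 4000, "killed": 0, "injured": 0},
    {"cost": 5000, "killed": 0, "injured": 0},
    {"cost": 6000, "killed": 0, "injured": 0},
    {"cost": 7000, "killed": 0, "injured": 0},  # high cost but no casualties
    {"cost": 8000, "killed": 1, "injured": 0},  # high cost with a casualty
]

severe = severe_accidents(records)
print([r["cost"] for r in severe])
```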

1.4. Hypothesis
Based on the preliminary data observations, the null hypothesis I chose is that the severity of
the accident (measured by total accident cost and number injured/killed) cannot be mitigated
by decreasing the number of track derailments.

The second null hypothesis I chose is that the severity of the accident cannot be mitigated by
improving rail yard operations through operator training and regulations.

2. Approach
2.1. Data

2.1.1. Bias
A few possible biases could affect the results. By reducing the bin size in the
accident temperature histogram, it became apparent that there were large fluctuations in the
data, as displayed in Figure 8. There is likely a strong bias in data recording towards rounding
temperatures to values ending in five or zero. This should not have a
significant effect on the hypothesis testing or regression analysis, but should be reconsidered if
any abnormal results are generated.
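
One quick way to quantify this rounding bias is to measure the share of recorded values that are multiples of five; without rounding, roughly one in five integer readings would be. The following is a minimal Python sketch with made-up temperature readings, not values from the accident data:

```python
def share_ending_in_0_or_5(values):
    """Fraction of integer readings whose last digit is 0 or 5."""
    hits = sum(1 for v in values if v % 5 == 0)
    return hits / len(values)

# Made-up temperature readings (degrees F), NOT from the accident data.
temps = [70, 65, 72, 80, 55, 60, 68, 75, 50, 45]

share = share_ending_in_0_or_5(temps)
print(share)  # 8 of the 10 made-up readings are multiples of five -> 0.8
```

A share far above 0.2 would support the suspicion that recorders rounded their readings.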

Figure 8: Total Accident Damage: Frequency vs. Temp (F)

As mentioned in Section 1, Amtrak has developed an extensive database of accident
measurements. However, there were a few discrepancies in the components and categories
used from year to year. In 2008, Amtrak removed six data types from the data collected per
event, as displayed in Figure 9. This discrepancy did not appear to cause any issues for the
analysis.

Figure 9: Different Data Components

A number of data components were recorded as "Not Available," however. This could have an
effect on the results if significant events were missing critical components [5]. Figure 10 shows
two possible examples of significant missing data. In 2011, for example, 1,830 events
did not have any data on the role of alcohol in the accident. Accident
causes could be missed because of this omission of data.
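
A count like the one above can be produced by tallying missing entries per field. The following is a minimal Python sketch; the field names `alcohol` and `drug`, the set of markers treated as missing, and the records are all hypothetical and not taken from the accident data:

```python
from collections import Counter

# Markers treated as missing; an assumption for illustration only.
MISSING = {"", "NA", "Not Available"}

def missing_counts(records, fields):
    """Count, per field, how many records have no usable value."""
    counts = Counter()
    for rec in records:
        for field in fields:
            value = rec.get(field)
            if value is None or str(value).strip() in MISSING:
                counts[field] += 1
    return {field: counts[field] for field in fields}

# Made-up records, NOT from the accident data.
events = [
    {"alcohol": "Not Available", "drug": "0"},
    {"alcohol": "0", "drug": "Not Available"},
    {"alcohol": "Not Available", "drug": "Not Available"},
    {"alcohol": "1"},  # the drug field is absent entirely
]

print(missing_counts(events, ["alcohol", "drug"]))
```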

Figure 10: Missing Drug and Alcohol Data

2.2. Analysis
Describe the analysis techniques and modeling approaches you applied to this problem. At a
minimum, this should include the following:
• The linear models you used to provide evidence;
• The feature and model selection techniques you used to find appropriate models for this
problem; and
• Your treatment of ordinal and categorical variables (i.e., how were they coded).
• box plot of ACCDMG
• scatter plot matrix (SPM)
• categorical box plots
• biplot
• loadings plot
• box plot of all years
3. Evidence
Describe your results from the application of the methods in Subsection 2.2 to the data in Subsection
2.1 to produce evidence for the hypothesis(es) in Subsection 1.4. These results should answer the
question posed by the problem in Subsection 1.2. You should also formally describe your confidence
in the results by explaining
• How you assessed your models (e.g., adjusted R2, AIC, etc.);
• How you diagnosed problems with the models; and
• How you adjusted the models based on these assessments.
• Ex: biplot with vectors - collapsing objectives down

4. Recommendation
State your findings and your recommendations for safety improvements based on the evidence you
have discovered. Be sure to back-up your recommendations with formal measures of confidence (i.e.,
confidence intervals), model validity (e.g., adjusted R2) and possibly visual displays. You do not need to
repeat your results or graphics in this section. Instead, you can summarize them and refer the reader
to the appropriate sections, tables, or figures.

5. References
[1] PCOPlots.R and VizHO.R code packages, D. E. Brown and L. Barnes, "Laboratory 1: Train accidents,"
November 2012, assignment in class AMPA 6430.
[2] Accident Data, "Laboratory 1: Train accidents," November 2012, assignment in class AMPA 6430.
[3] Rail Equipment Accident/Incident Form F 6180.54, August 2012, assignment in class AMPA
6430.
[4] "2010 Transportation Fatalities," 2010 data. [Online]. Data and Statistics - NTSB - National
Transportation Safety Board. Available: http://www.ntsb.gov/data/index.html
[5] John Maindonald and W. John Braun, Data Analysis and Graphics Using R.
[6] Deepayan Sarkar, Lattice Graphics, R code package.
[7] Federal Railroad Administration, "Federal Railroad Administration Office of Safety Analysis," August 2012. [Online].
Available: http://safetydata.fra.dot.gov/officeofsafety/
[8] "List of State FIPS Codes." [Online]. U.S. Department of Labor. Available:
http://www.bls.gov/lau/lausfips.htm
[9] Chris Dunham, 29 November 2012. Chris gave me the idea of how to filter the accident data by year
using the highest 4th quartile data (2010). The author developed all code.

Appendix A: Additional Graphs and Background Data
1. Additional Graphs
I included other R generated graphs in support of the general situation. Figure 11 is the numerical
display of Figure 4.

Figure 11: Total Accident Frequency by Category and Type of Train

Figure 12: Total Accident Frequency by State

During my search for bias in the data, I grouped the total accidents by state, as shown in
Figure 12. According to the FIPS codes used in the Amtrak data [8], Illinois,
California, and Texas have the largest number of accidents.

Figure 13: Maximum Accident Damage by Year

Another possibly interesting point is the periodic spike in maximum accident cost every few
years, as shown in Figure 13. There appears to be a general downward trend in the worst
accident cost since 2001, however. At this point, it is unknown whether there are any underlying
causes. It may be a change in regulations, personnel periodically joining work crews, a slip into
complacency, or some unpredictable "black swan" events.

Figure 14: Accident Damage ($) and Total Killed vs. Type of Weather

Figure 14 displays the total cost and total killed separated by type of weather conditions. As
displayed in both box plots, most of the events occur in cloudy or clear conditions, which are
situations that should not predispose trains to severe accidents.

Figure 15: Box Plot of Total Accident Cost by Type of Train

Figure 15 shows a different display of the total cost versus type of train. Similar to Figure 4,
freight trains (type 1) and yard/switching (type 7) had a high number of extreme events. Also of
possible significance, the top outlier in terms of total cost was categorized as a "cut of cars"
accident.

Figure 16: Train Speed vs Frequency

Figure 16 displays the total frequency of accidents versus the train speed. It is quickly apparent
that most of the accidents are at low speeds, but there may be some bias (rounding to nearest 5
or 0) in the higher speeds, as evidenced by the small spikes in frequency between 40 and 60
mph.
