The tsibble package extends the tidyverse to temporal data. Built on top of the tibble, a tsibble (or tbl_ts) is a data-centric format that follows the tidy data principle (Wickham 2014). Compared to the conventional time series objects in R, for example mts, the tsibble preserves time indices as an essential component and makes heterogeneous data structures possible. Beyond the tibble-like representation, new syntax is introduced to impose additional, informative structure on the tsibble through what are referred to as “key” variables. Multiple keys separated by a vertical bar (|) or a comma (,) express nested or crossed variables, respectively. This binds hierarchical and grouped time series together into the tbl_ts class. The tsibble package aims to manage temporal data and get analysis done in a tidy and modern manner.
tsibble() creates a tsibble object, and as_tsibble() is an S3 generic that coerces other objects to a tsibble. An object with an underlying vector or matrix, such as hts, can be converted to a tsibble using as_tsibble() without any further specification. For a tibble or data frame, as_tsibble() requires a little more setup in order to identify the index and key variables.
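The tibble used in the rest of this section can be prepared roughly as follows. This is a minimal sketch that assumes the dplyr and nycflights13 packages are installed; the choice of columns is inferred from the printed output rather than taken from the original code.

```r
library(dplyr)

# Keep the station identifier, the timestamp and three measurements
# (column selection inferred from the printed tibble).
weather <- nycflights13::weather %>%
  select(origin, time_hour, temp, humid, precip)

weather
```

The row count may differ from the output shown here, depending on the installed version of nycflights13.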
#> # A tibble: 26,130 x 5
#>   origin time_hour            temp humid precip
#>   <chr>  <dttm>              <dbl> <dbl>  <dbl>
#> 1 EWR    2013-01-01 11:00:00  37.0  54.0      0
#> 2 EWR    2013-01-01 12:00:00  37.0  54.0      0
#> 3 EWR    2013-01-01 13:00:00  37.9  52.1      0
#> 4 EWR    2013-01-01 14:00:00  37.9  54.5      0
#> 5 EWR    2013-01-01 15:00:00  37.9  57.0      0
#> # ... with 2.612e+04 more rows
The weather data included in the package nycflights13 contains hourly meteorological records (such as temperature, humidity and precipitation) over the year 2013 at three stations (JFK, LGA and EWR) in New York City. Since time_hour is the only column consisting of timestamps, as_tsibble() detects it as the index variable; alternatively, it can be stated explicitly, if more verbosely, with the argument index = time_hour. A tsibble is comprised of an index and key variables. In this case, the origin variable is the identifier, created via id() and passed to the key argument of as_tsibble(). The key together with the index uniquely identifies each observation, which gives a valid tsibble. In other words, each unit of observation is measured at a time point for a key or each combination of keys. The remaining variables (temp, humid and precip) are considered measured variables.
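The coercion described above can be sketched as follows. This assumes the tsibble release that this text documents, in which id() is used to declare keys; later releases of the package may expose a different interface. The weather object is rebuilt here so that the snippet stands alone.

```r
library(dplyr)
library(tsibble)

weather <- nycflights13::weather %>%
  select(origin, time_hour, temp, humid, precip)

# origin is declared as the key via id(); index = time_hour is optional here
# because time_hour is the only timestamp column and would be detected anyway.
weather_tsbl <- as_tsibble(weather, key = id(origin), index = time_hour)

weather_tsbl
```

If two rows shared the same origin and time_hour, the key and index would no longer identify each observation uniquely, and the result would not be a valid tsibble.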
#> The 'index' variable: time_hour
#> # A tsibble: 26,130 x 5 [1HOUR]
#> # Keys:      origin
#>   origin time_hour            temp humid precip
#>   <chr>  <dttm>              <dbl> <dbl>  <dbl>
#> 1 EWR    2013-01-01 11:00:00  37.0  54.0      0
#> 2 EWR    2013-01-01 12:00:00  37.0  54.0      0
#> 3 EWR    2013-01-01 13:00:00  37.9  52.1      0
#> 4 EWR    2013-01-01 14:00:00  37.9  54.5      0
#> 5 EWR    2013-01-01 15:00:00  37.9  57.0      0
#> # ... with 2.612e+04 more rows
The printed display of weather_tsbl reports its one-hour interval and origin as the key. It should be noted that the tsibble does not attempt to arrange the data in time order. Given this format, it is much easier for users, particularly those familiar with the tidyverse, to perform common data tasks in a temporal context. For example, tsummarise() (summarise over time) is used to examine daily highs and lows at each station. As a result, the index is updated from time_hour to date, with a one-day interval; two new variables are computed for the daily maximum and minimum temperatures.
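A call along the following lines would produce such a daily summary. It is a sketch under two assumptions: that the installed tsibble release still provides tsummarise() and id() as described in this text, and that the new index is given as the first named expression. as_date() comes from lubridate, and daily_temp is an illustrative name.

```r
library(dplyr)
library(lubridate)
library(tsibble)

weather_tsbl <- nycflights13::weather %>%
  select(origin, time_hour, temp, humid, precip) %>%
  as_tsibble(key = id(origin), index = time_hour)

daily_temp <- weather_tsbl %>%
  group_by(origin) %>%
  tsummarise(
    date = as_date(time_hour),            # new, coarser index: one value per day
    temp_high = max(temp, na.rm = TRUE),  # daily maximum temperature
    temp_low = min(temp, na.rm = TRUE)    # daily minimum temperature
  )

daily_temp
```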
#> # A tsibble: 1,095 x 4 [1DAY]
#> # Keys:      origin
#> # Groups:    origin
#>   origin date       temp_high temp_low
#>   <chr>  <date>         <dbl>    <dbl>
#> 1 EWR    2013-01-01      39.9     37.0
#> 2 EWR    2013-01-02      41.0     24.1
#> 3 EWR    2013-01-03      34.0     25.0
#> 4 EWR    2013-01-04      34.0     27.0
#> 5 EWR    2013-01-05      39.9     32.0
#> # ... with 1,090 more rows
The key is not constrained to a single variable; it can also express nested and crossed data structures (Wilkinson 2005). A built-in dataset, tourism, includes the quarterly overnight trips from 1998 Q1 to 2016 Q4 across Australia, sourced from Tourism Research Australia. The key structure is imposed by Region | State, Purpose. Region and State naturally form a two-level geographical hierarchy: the lower-level regions are nested within the higher-level states. This nesting (hierarchical) structure is indicated using a vertical bar (|). Crossing Purpose (purpose of visiting) with the geographical variables suffices to validate the tsibble, where a comma (,) separates these two groups. Each observation is the number of trips made to a specific region for a certain purpose of travel in one quarter of the year.
#> # A tsibble: 23,408 x 5 [1QUARTER]
#> # Keys:      Region | State, Purpose
#>   Quarter Region          State           Purpose Trips
#>   <qtr>   <chr>           <chr>           <chr>   <dbl>
#> 1 1998 Q1 Sydney          New South Wales Holiday 828
#> 2 1998 Q1 Blue Mountains  New South Wales Holiday 104
#> 3 1998 Q1 Capital Country New South Wales Holiday  99.2
#> 4 1998 Q1 Central Coast   New South Wales Holiday 279
#> 5 1998 Q1 Central NSW     New South Wales Holiday 170
#> # ... with 2.34e+04 more rows
The commonly used dplyr verbs, such as mutate(), have been implemented to support the tsibble. To obtain numerical summaries for the geographical nesting, summarise() is used in conjunction with Region | State in group_by(). This specification retains the hierarchical structure. The tsibble summarise() never collapses the rows over the time index, which is slightly different from the dplyr summarise().
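The grouped summary described above might be written as shown next. This is a sketch that assumes the tsibble release documented here, where the Region | State nesting syntax is accepted inside group_by() and tourism carries the nested key; the column name Geo_Trips follows the printed output, and geo_trips is an illustrative object name.

```r
library(dplyr)
library(tsibble)

geo_trips <- tourism %>%
  group_by(Region | State) %>%        # keep the Region-within-State nesting
  summarise(Geo_Trips = sum(Trips))   # sums over Purpose, one row per quarter and region

geo_trips
```

Because the time index is never collapsed, each region keeps one row per quarter instead of being reduced to a single total.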
#> # A tsibble: 5,852 x 4 [1QUARTER]
#> # Keys:      Region | State
#> # Groups:    Region | State
#>   Quarter Region                     State              Geo_Trips
#>   <qtr>   <chr>                      <chr>                  <dbl>
#> 1 1998 Q1 Adelaide                   South Australia       659
#> 2 1998 Q1 Adelaide Hills             South Australia         9.80
#> 3 1998 Q1 Alice Springs              Northern Territory     20.2
#> 4 1998 Q1 Australia's Coral Coast    Western Australia     133
#> 5 1998 Q1 Australia's Golden Outback Western Australia     162
#> # ... with 5,847 more rows
This syntactic approach appears advantageous for structural variables when it comes to forecasting hierarchical and grouped time series.
So far the tsibble has been shown to handle regularly spaced temporal data well, from seconds to years. The option regular is set to TRUE by default in as_tsibble(); set regular to FALSE to create a tsibble for data collected at irregular time intervals. The scheduled date-times of the flights in New York City serve as an example.
The keys are comprised of the flight, origin and dest variables. With regular = FALSE, the result is an irregularly spaced tsibble, where [!] highlights the irregularity.
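One way to build such an irregular tsibble is sketched below. Several details are assumptions rather than the original code: the timestamp is assembled with lubridate::make_datetime() from the scheduled departure fields of nycflights13::flights, the last two selected columns (air_time and distance) are guessed from the truncated names in the printed output, and id() is used for the keys as in the release this text documents.

```r
library(dplyr)
library(lubridate)
library(tsibble)

flights <- nycflights13::flights %>%
  # Scheduled departure date-time; hour and minute hold the scheduled time.
  mutate(sched_date_time = make_datetime(year, month, day, hour, minute)) %>%
  # air_time and distance are assumed from the truncated headers in the output.
  select(flight, origin, dest, sched_date_time,
         dep_delay, arr_delay, air_time, distance)

flights_tsbl <- as_tsibble(
  flights,
  key = id(flight, origin, dest),  # three crossed key variables
  index = sched_date_time,
  regular = FALSE                  # do not assume a fixed time interval
)

flights_tsbl
```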
#> # A tsibble: 336,776 x 8 [!]
#> # Keys:      flight, origin, dest
#>   flight origin dest  sched_date_time     dep_delay arr_delay air_t… dist…
#>    <int> <chr>  <chr> <dttm>                  <dbl>     <dbl>  <dbl> <dbl>
#> 1   1545 EWR    IAH   2013-01-01 05:15:00      2.00      11.0    227  1400
#> 2   1714 LGA    IAH   2013-01-01 05:29:00      4.00      20.0    227  1416
#> 3   1141 JFK    MIA   2013-01-01 05:40:00      2.00      33.0    160  1089
#> 4    725 JFK    BQN   2013-01-01 05:45:00     -1.00     -18.0    183  1576
#> 5    461 LGA    ATL   2013-01-01 06:00:00     -6.00     -25.0    116   762
#> # ... with 3.368e+05 more rows
More functions for dealing with irregular temporal data are planned for future releases.
Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software 59 (10). Foundation for Open Access Statistics:1–23.
Wilkinson, Leland. 2005. The Grammar of Graphics (Statistics and Computing). Secaucus, NJ: Springer-Verlag New York, Inc.