The tsibble package extends the tidyverse to temporal data. Built on top of the tibble, a tsibble (or tbl_ts) is a data-centric format that follows the tidy data principle (Wickham 2014). Compared with the conventional time series objects in R, such as ts, zoo, and xts, the tsibble preserves the time index as an essential component of the data and makes heterogeneous data structures possible. Beyond the tibble-like representation, a “key” comprising one or more variables is introduced to uniquely identify units over time. The key is declared with a user-oriented syntax that imposes nested or crossed structures on the data: variables separated by a vertical bar (|) are nested, and variables separated by a comma (,) are crossed. This brings hierarchical and grouped time series together in the tbl_ts class. The tsibble package aims to manage temporal data and support analysis in a succinct and transparent workflow.

tsibble() creates a tsibble object, and as_tsibble() is an S3 generic that coerces other objects to a tsibble. An object built on a vector or matrix, such as ts, mts, or hts, can be coerced to a tsibble with as_tsibble() without any further specification. A tibble or data frame requires a little more setup, because as_tsibble() needs to identify the index and key variables.
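
As a small illustration of the automatic coercion described above, the sketch below converts AirPassengers, a monthly ts object shipped with R's datasets package, to a tsibble. The choice of dataset and the name air_tsbl are ours, not the package authors'; no index or key has to be supplied.

```r
library(tsibble)

# AirPassengers is a univariate monthly `ts` object from the datasets package
air_tsbl <- as_tsibble(AirPassengers)

# The time index is recovered from the ts attributes, so no index or key is given
air_tsbl
```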

Index and key
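
The tibble printed below can be obtained along these lines. This is a sketch that assumes the nycflights13 and dplyr packages are installed, and it keeps only the five columns that appear in the output.

```r
library(dplyr)

# Keep the station identifier, the timestamp, and three measured variables
weather <- nycflights13::weather %>%
  select(origin, time_hour, temp, humid, precip)
weather
```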

#> # A tibble: 26,115 x 5
#>   origin time_hour            temp humid precip
#>   <chr>  <dttm>              <dbl> <dbl>  <dbl>
#> 1 EWR    2013-01-01 01:00:00  39.0  59.4      0
#> 2 EWR    2013-01-01 02:00:00  39.0  61.6      0
#> 3 EWR    2013-01-01 03:00:00  39.0  64.4      0
#> 4 EWR    2013-01-01 04:00:00  39.9  62.2      0
#> 5 EWR    2013-01-01 05:00:00  39.0  64.4      0
#> # ... with 2.611e+04 more rows

The weather data in the nycflights13 package contains hourly meteorological records (such as temperature, humidity and precipitation) for 2013 at three stations in New York City (JFK, LGA and EWR). Because time_hour is the only column containing timestamps, as_tsibble() detects it as the index variable; the index can also be stated explicitly with the argument index = time_hour. A tsibble consists of an index and key variables. In this case, origin is the identifier, wrapped in id() and passed to the key argument of as_tsibble(). The key together with the index uniquely identifies each observation, which gives a valid tsibble. The remaining variables (temp, humid and precip) are treated as measured variables.
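
The uniqueness requirement can be checked by hand before coercion. The following sketch uses only dplyr (it is not a tsibble function) to count the rows for each origin and time_hour pair; the name dup_rows is illustrative, and the weather tibble is rebuilt here so that the snippet runs on its own.

```r
library(dplyr)

weather <- nycflights13::weather %>%
  select(origin, time_hour, temp, humid, precip)

# Count rows per (key, index) pair; a count above 1 would signal a duplicate
dup_rows <- weather %>%
  count(origin, time_hour) %>%
  filter(n > 1)

# Zero rows here means the key and index jointly identify each observation
nrow(dup_rows)
```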

weather_tsbl <- as_tsibble(weather, key = id(origin))
#> The `index` is `time_hour`.
#> # A tsibble: 26,115 x 5 [1HOUR]
#> # Key:       origin [3]
#>   origin time_hour            temp humid precip
#>   <chr>  <dttm>              <dbl> <dbl>  <dbl>
#> 1 EWR    2013-01-01 01:00:00  39.0  59.4      0
#> 2 EWR    2013-01-01 02:00:00  39.0  61.6      0
#> 3 EWR    2013-01-01 03:00:00  39.0  64.4      0
#> 4 EWR    2013-01-01 04:00:00  39.9  62.2      0
#> 5 EWR    2013-01-01 05:00:00  39.0  64.4      0
#> # ... with 2.611e+04 more rows

The tsibble builds on the print method from tibble: the header reports the tsibble's dimensions and time interval, followed by its key variables. The output above shows that weather_tsbl has a one-hour interval and uses origin as the key. Given the nature of temporal ordering, a tsibble must be sorted by its time index. If a key is explicitly declared, the rows are sorted by the key first and then by time in ascending order.

This tidy representation supports thinking of operations on the data as building blocks that form a “data pipeline” in a time-based context. Users familiar with the tidyverse should find it easier to perform common time series analysis tasks. For example, index_by() is the counterpart of group_by() in a temporal context, but it groups only the time index. Here index_by() and summarise() are combined to compute the daily highs and lows at each station. As a result, the index changes from time_hour to date, with a one-day interval, and two new variables hold the daily maximum and minimum temperatures.
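
A pipeline along the following lines yields the daily summary shown next. It is a sketch under two assumptions: as_date() from lubridate is used to truncate timestamps to dates, and the key is written as key = origin, the bare-name form accepted by more recent tsibble releases, where this document writes key = id(origin). The name daily_temp is illustrative.

```r
library(dplyr)
library(lubridate)
library(tsibble)

weather <- nycflights13::weather %>%
  select(origin, time_hour, temp, humid, precip)
# Assumption: newer tsibble releases take a bare key; this document uses id(origin)
weather_tsbl <- as_tsibble(weather, key = origin, index = time_hour)

daily_temp <- weather_tsbl %>%
  group_by(origin) %>%                      # one series per station
  index_by(date = as_date(time_hour)) %>%   # regroup the hourly index by day
  summarise(
    temp_high = max(temp, na.rm = TRUE),
    temp_low = min(temp, na.rm = TRUE)
  )
daily_temp
```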

#> # A tsibble: 1,092 x 4 [1DAY]
#> # Key:       origin [3]
#>   origin date       temp_high temp_low
#>   <chr>  <date>         <dbl>    <dbl>
#> 1 EWR    2013-01-01      41       28.0
#> 2 EWR    2013-01-02      34.0     24.1
#> 3 EWR    2013-01-03      34.0     26.1
#> 4 EWR    2013-01-04      39.9     28.9
#> 5 EWR    2013-01-05      44.1     32  
#> # ... with 1,087 more rows

Nested and crossed structures

The key is not constrained to a single variable; it can also express nested and crossed data structures (Wilkinson 2005). The built-in dataset tourism contains quarterly overnight trips across Australia from 1998 Q1 to 2016 Q4, sourced from Tourism Research Australia. Its key is specified as Region | State, Purpose. Region and State naturally form a two-level geographical hierarchy: the lower-level regions are nested within the higher-level states. This nesting is indicated with a vertical bar (|). Crossing Purpose (the purpose of the visit) with the geographical variables is enough to make the tsibble valid, and a comma (,) separates these two groups. Each observation is the number of trips made to a specific region for a given purpose of travel in one quarter.

as_tsibble(tourism, key = id(Region | State, Purpose), index = Quarter)
#> # A tsibble: 23,408 x 5 [1QUARTER]
#> # Key:       Region | State, Purpose [308]
#>   Quarter Region   State           Purpose  Trips
#> *   <qtr> <chr>    <chr>           <chr>    <dbl>
#> 1 1998 Q1 Adelaide South Australia Business  135.
#> 2 1998 Q2 Adelaide South Australia Business  110.
#> 3 1998 Q3 Adelaide South Australia Business  166.
#> 4 1998 Q4 Adelaide South Australia Business  127.
#> 5 1999 Q1 Adelaide South Australia Business  137.
#> # ... with 2.34e+04 more rows

The commonly used dplyr verbs, such as filter(), summarise() and mutate(), have been implemented to support the tsibble. To obtain numerical summaries for the geographical variables, summarise() is applied after grouping by Region and State in group_by(). The Purpose variable is dropped from the key, while Region | State keeps its hierarchical structure. Unlike the dplyr summarise(), the tsibble summarise() never collapses rows over the time index.

tourism %>%
  group_by(Region, State) %>%
  summarise(Geo_Trips = sum(Trips))
#> # A tsibble: 5,852 x 4 [1QUARTER]
#> # Key:       Region | State [77]
#> # Groups:    Region [77]
#>   Region   State           Quarter Geo_Trips
#>   <chr>    <chr>             <qtr>     <dbl>
#> 1 Adelaide South Australia 1998 Q1      659.
#> 2 Adelaide South Australia 1998 Q2      450.
#> 3 Adelaide South Australia 1998 Q3      593.
#> 4 Adelaide South Australia 1998 Q4      524.
#> 5 Adelaide South Australia 1999 Q1      548.
#> # ... with 5,847 more rows

This syntactic approach to declaring structural variables should prove advantageous when forecasting hierarchical and grouped time series.

Intervals

As seen above, the tsibble handles regularly spaced temporal data well, from seconds to years, based on its time representation (see ?tsibble). The regular option of as_tsibble() is TRUE by default. Set regular = FALSE to create a tsibble for data collected at irregular time intervals. The scheduled departure date-times of flights out of New York City provide an example.
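
An irregular tsibble of this kind might be built as sketched here, assuming the flights table from nycflights13 together with dplyr and lubridate. The scheduled departure date-time is assembled from the calendar columns, and flight_num is formed by pasting carrier and flight. The call uses key = flight_num, the bare-name form accepted by more recent tsibble releases; in the syntax used elsewhere in this document it would read key = id(flight_num).

```r
library(dplyr)
library(lubridate)
library(tsibble)

flights_tsbl <- nycflights13::flights %>%
  mutate(
    # Combine the scheduled date and clock columns into one timestamp
    sched_dep_datetime = make_datetime(
      year, month, day, hour, minute, tz = "America/New_York"
    ),
    # Carrier code plus flight number identifies a recurring flight
    flight_num = paste0(carrier, flight)
  ) %>%
  as_tsibble(
    key = flight_num,
    index = sched_dep_datetime,
    regular = FALSE  # the timestamps are not equally spaced
  )
```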

The key variable is flight_num. With regular = FALSE, the result is an irregularly spaced tsibble, and [!] in the header flags the irregularity.

#> # A tsibble: 336,776 x 21 [!]
#> # Key:       flight_num [5,725]
#>    year month   day dep_time sched_dep_time dep_delay arr_time
#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>
#> 1  2013    11     3     1531           1540        -9     1653
#> 2  2013    11     4     1539           1540        -1     1712
#> 3  2013    11     5     1548           1540         8     1708
#> 4  2013    11     6     1535           1540        -5     1657
#> 5  2013    11     7     1549           1540         9     1733
#> # ... with 3.368e+05 more rows, and 14 more variables:
#> #   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>, flight <int>,
#> #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#> #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>,
#> #   sched_dep_datetime <dttm>, flight_num <chr>

More functions for handling irregular temporal data are planned for future releases.

References

Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software 59 (10): 1–23.

Wilkinson, Leland. 2005. The Grammar of Graphics (Statistics and Computing). Secaucus, NJ: Springer-Verlag New York, Inc.