Before using this package, several prerequisites must be met: First, your eye gaze data must have been collected using an SR Research EyeLink eye tracker. Second, your data must have been exported using SR Research Data Viewer software. For this basic example, it is assumed that you specified an interest period relative to the onset of the critical stimulus in Data Viewer. However, this package can also preprocess data without a specified relative interest period. If you have not aligned your data to a particular message in Data Viewer, please refer to the Message Alignment vignette for functions related to this.
The Sample Report should be exported with all available columns (this will ensure that you have all of the necessary columns for the functions contained in this package to work). Additionally, it is preferable to export to a .txt file rather than a .xlsx file.
The following preprocessing assumes that, in your experiment, interest area IDs and Labels were assigned consistently to the object types displayed on the screen. For example, in a typical VWP experiment, the target was always in interest area 1, the competitor was always in interest area 2, et cetera. This is typically done by dynamically moving the interest areas trial-by-trial to correspond with the position of the objects. If, instead, your interest areas were static and you have columns indicating the location of each object for each trial, you will need to reassign your interest areas. Specific functions for this are available in this package; please see the Interest Areas vignette for illustration. Once that is complete, you can follow the preprocessing procedure below. Note that the functions presented here are capable of handling data with a maximum of 8 interest areas. If you have more than 8 interest areas, it is necessary to adjust the source code to accommodate the number needed (please contact the package maintainer for an example).
Lastly, the functions included here internally make use of dplyr for manipulating and restructuring data. For more information about dplyr, please refer to its reference manual and extensive collection of vignettes.
First, load the sample report. By default, Data Viewer will assign “.” to missing values; therefore it is important to include this in the na.strings parameter, so R will know how to handle any missing data.
library(VWPre)
VWdat <- read.table("1000HzData.txt", header = T, sep = "\t", na.strings = c(".", "NA"))
However, for the purposes of this vignette we will use the sample dataset included in the package.
In order for the functions in the package to work appropriately, the data need to be in a specific format. The prep_data function examines the presence and class of specific columns (LEFT_INTEREST_AREA_ID, RIGHT_INTEREST_AREA_ID, LEFT_INTEREST_AREA_LABEL, RIGHT_INTEREST_AREA_LABEL, TIMESTAMP, and TRIAL_INDEX) to ensure they are present in the data and appropriately assigned (e.g., categorical variables are encoded as factors). It also checks for the columns SAMPLE_MESSAGE, RIGHT_GAZE_X, RIGHT_GAZE_Y, LEFT_GAZE_X, and LEFT_GAZE_Y, which are not required for basic preprocessing, but are needed to use the functions align_msg and custom_ia.
Additionally, the Subject parameter is used to specify the column containing the subject identifier. Typical Data Viewer output contains this identifier in a column called RECORDING_SESSION_LABEL. The function will rename it Subject and will ensure it is encoded as a factor.
If your data contain a column corresponding to an item identifier, please specify it in the Item parameter. In doing so, the function will standardize the name of the column to Item and will ensure it is encoded as a factor. If you don’t have an item identifier column, this parameter defaults to NA.
Lastly, a new column called Event will be created, which indexes each unique recording sequence and corresponds to the combination of Subject and TRIAL_INDEX. This Event variable is required internally for subsequent operations. Should you choose to define the Event variable differently, you can override the default; however, do so cautiously, as the variable must uniquely index each time sequence in the data, and an ill-defined Event column may impact subsequent operations. Upon completion, the function prints a summary of the results.
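For example, a call along the following lines (the data frame name dat0 is chosen here purely for illustration) produces the summary below:
dat0 <- prep_data(data = VWdat, Subject = "RECORDING_SESSION_LABEL", Item = "itemid")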
## Checking required columns...
## All required columns are present in the data.
## Checking optional columns...
## The following optional is not present in the data: EYE_TRACKED
## Working on required columns...
## RECORDING_SESSION_LABEL renamed to Subject.
## itemid renamed to Item.
## Subject converted to factor.
## LEFT_INTEREST_AREA_ID converted to numeric.
## LEFT_INTEREST_AREA_LABEL converted to factor.
## RIGHT_INTEREST_AREA_ID converted to numeric.
## RIGHT_INTEREST_AREA_LABEL converted to factor.
## TIMESTAMP converted to numeric.
## TRIAL_INDEX converted to numeric.
## Event variable created from Subject and TRIAL_INDEX
## Working on optional columns...
## SAMPLE_MESSAGE converted to factor.
## LEFT_GAZE_X converted to numeric.
## LEFT_GAZE_Y converted to numeric.
## RIGHT_GAZE_X converted to numeric.
## RIGHT_GAZE_Y converted to numeric.
## LEFT_IN_BLINK converted to numeric.
## RIGHT_IN_BLINK converted to numeric.
## LEFT_IN_SACCADE converted to numeric.
## RIGHT_IN_SACCADE converted to numeric.
At this point, it is safe to remove the columns output by Data Viewer that are not needed for preprocessing with this package. Removing them will reduce the amount of system memory consumed and result in a final dataset that consumes less disk space. This can be done straightforwardly using the function rm_extra_DVcols. By default it will remove all the Data Viewer columns that are not needed for preprocessing (if they are present in the data). However, if desired, it is possible to keep specific columns from this set using the Keep parameter, which accommodates a string or character vector. If you are using the sample dataset included in this package, this step is not necessary, as these columns have already been removed.
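For datasets that do need this step, a minimal call (shown here without the optional Keep argument, and reassigning to the same data frame name for illustration) would look like this:
dat0 <- rm_extra_DVcols(dat0)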
When the data were loaded, samples that were outside of any interest area were labeled as NA. The relabel_na function examines the interest area columns (LEFT_INTEREST_AREA_ID, RIGHT_INTEREST_AREA_ID, LEFT_INTEREST_AREA_LABEL, and RIGHT_INTEREST_AREA_LABEL) for cells containing NAs. It then assigns 0 to the ID columns and “Outside” to the LABEL columns to indicate those eye gaze samples which fell outside of the interest areas defined in the study. The number of interest areas you defined in your experiment should be supplied to the parameter NoIA.
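In the example below, NoIA is set to 4, matching the four interest areas (target, rhyme competitor, onset competitor, and distractor) present in the sample data; the name dat1 is illustrative:
dat1 <- relabel_na(data = dat0, NoIA = 4)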
## LEFT_INTEREST_AREA_LABEL: Number of levels DO NOT match NoIA.
## RIGHT_INTEREST_AREA_LABEL: Number of levels match NoIA.
Notice that the output informs us that the number of levels in LEFT_INTEREST_AREA_LABEL does not match the number of interest areas listed in NoIA. This is because we only have data from the right eye (hence, all samples in LEFT_INTEREST_AREA_LABEL are listed as “Outside”).
The subsequent preprocessing requires that the interest area IDs are numerically coded, with values ranging from 0 (i.e., outside all interest areas) up to a maximum of 8. So, it is important to check that the IDs present in the dataset conform to this. The check_ia function does just this and indicates how those IDs are mapped to the interest area labels.
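For example, applied to the data frame created above:
check_ia(data = dat1)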
## RIGHT_IA_ID RIGHT_IA_LABEL
## 0 Outside
## 1 Target_IA
## 2 RhymeComp_IA
## 3 OnsetComp_IA
## 4 Distract_IA
## LEFT_IA_ID LEFT_IA_LABEL
## 0 Outside
## Interest Area IDs for the right eye are coded appropriately between 0 and 8.
## Interest Area IDs for the left eye are coded appropriately between 0 and 8.
## Interest Area ID and label mapping combinations for the right eye are consistent.
## Interest Area ID and label mapping combinations for the left eye are consistent.
If your interest area IDs do not conform to the required coding, or you would like to create new labels for your existing interest areas, please consult the Interest Areas vignette. That vignette illustrates how to relabel existing interest area codings (as well as remap the gaze data to entirely new interest areas, should you so desire).
The function create_time_series creates a time series (a new column called Time) which is required for subsequent processing, plotting, and modeling of the data. It is common to export a period of time prior to the onset of the stimulus as a baseline. In this case, an adjustment (equal to the duration of the baseline period) must be applied to the time series, specified in the Adjust parameter. In effect, the adjustment simply subtracts the given value from each time point. So, a positive value will shift the zero point forward (making the initial zero a negative time value), while a negative value will shift the zero point backward (making the initial zero a positive time value). An example illustrating this can be found in the Message Alignment vignette. In the example below, the data were exported with a 100 ms pre-stimulus interval.
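The call below applies that 100 ms adjustment:
dat2 <- create_time_series(data = dat1, Adjust = 100)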
## 100 ms adjustment applied.
Note that if you have used the align_msg function (illustrated in the Message Alignment vignette), you may need to specify a column name in Adjust. That column can be used to apply a recording-event-specific adjustment to each trial. Consult that vignette for further details.
The function check_time_series can be used to verify the time series. It outputs the unique start times present in the data. These will all be the same standardized time point relative to the stimulus if you have exported your data from Data Viewer with a pre-defined interest period relative to a message. By specifying the parameter ReturnData = T, the function can return a summary data frame that can be used to inspect the start time of each event. As you can see below, by providing Adjust with a positive value, we have effectively shifted the zero point forward along the number line, causing the first sample to have a negative time value.
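For example, using the default setting of ReturnData:
check_time_series(data = dat2)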
## # A tibble: 1 × 1
## Start_Time
## <dbl>
## 1 -100
## Set ReturnData to TRUE to output full, event-specific information.
Another way to check that your time series has been created correctly is to use the check_msg_time function. By providing the appropriate message text, we can see that the onset of our target now occurs at Time = 0. Note that the Msg parameter can handle exact matches or matches based on regular expressions. As with check_time_series, the parameter ReturnData = T will return a summary data frame that can be used to inspect the message time of each event.
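For example, using the target onset message present in the sample data:
check_msg_time(data = dat2, Msg = "TargetOnset")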
## # A tibble: 1 × 2
## SAMPLE_MESSAGE Time
## <fct> <dbl>
## 1 TargetOnset 0
## Set ReturnData to TRUE to output full, event-specific information.
If you do not remember the messages in your data, you can output all existing messages and their corresponding timestamps using check_all_msgs. Optionally, the output of the function can be saved using the parameter ReturnData = T.
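For example:
check_all_msgs(data = dat2)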
## # A tibble: 4 × 1
## SAMPLE_MESSAGE
## <fct>
## 1 Preview
## 2 TargetOnset
## 3 VowelOnset
## 4 TIMER_search
## Set ReturnData to TRUE to output full, event-specific information.
Depending on the design of the study, right, left, or both eyes may have been recorded during the experiment. Data Viewer outputs gaze data by placing it in separate columns for each eye (LEFT_INTEREST_AREA_ID, LEFT_INTEREST_AREA_LABEL, RIGHT_INTEREST_AREA_ID, RIGHT_INTEREST_AREA_LABEL). However, it is preferable to have gaze data in a single set of columns, regardless of which eye was recorded during the experiment. The function select_recorded_eye provides the functionality for this purpose, returning three new columns (IA_ID, IA_LABEL, IA_Data).
The function select_recorded_eye requires that the parameter Recording be specified. This parameter instructs the function about which eye(s) was used to record the gaze data. It takes one of four possible strings: “LandR”, “LorR”, “L”, or “R”. “LandR” should be used when any participant had both eyes recorded. “LorR” should be used when some participants had their left eye recorded and others had their right eye recorded. “L” should be used when all participants had their left eye recorded. “R” should be used when all participants had their right eye recorded.
If in doubt, use the function check_eye_recording, which will do a quick check to see if LEFT_INTEREST_AREA_ID and RIGHT_INTEREST_AREA_ID contain data. It will then suggest the appropriate Recording parameter setting. When in complete doubt, use “LandR”. The “LandR” setting requires an additional parameter (WhenLandR) to be specified. This instructs the function to select either the right eye or the left eye when data exist for both.
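For example:
check_eye_recording(data = dat2)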
## Checking gaze data using Data Viewer columns LEFT_INTEREST_AREA_ID and RIGHT_INTEREST_AREA_ID.
## The dataset contains recordings for ONLY the right eye.
## Set the Recording parameter in select_recorded_eye() to 'R'.
After executing, the function prints a summary of the output. While the function check_eye_recording indicated that the parameter Recording should be set to “R”, the example below sets the parameter to “LandR”, which can act as a catch-all. Consequently, it can be seen in the summary that there were recordings from only the right eye.
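Following the description above, such a call might look like this (a sketch; with “LandR”, the WhenLandR argument determines which eye is selected when both were recorded):
dat3 <- select_recorded_eye(data = dat2, Recording = "LandR", WhenLandR = "Right")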
## Selecting gaze data using Data Viewer columns LEFT_INTEREST_AREA_ID and RIGHT_INTEREST_AREA_ID and the Recording argument: R
## Gaze data summary for 160 events:
## 0 event(s) contained gaze data for both eyes, for which the Right eye has been selected.
## The final data frame contains 158 event(s) using gaze data from the right eye.
## The final data frame contains 2 event(s) with no samples falling within any interest area during the given time series.
Prior to binning the data, some researchers might prefer to remove trials with excessive trackloss. Because Data Viewer does not provide a specific column for trackloss, it is possible to determine this using a combination of information, namely, the column In_Blink and/or the X and Y coordinates (Gaze_X and Gaze_Y).
The function mark_trackloss uses this information to determine the status of a given sample. The argument Type can be set to “Blink”, “OffScreen”, or “Both”. When set to “OffScreen” or “Both”, ScreenSize must be supplied as a numeric vector of the X and Y dimensions of the computer screen used during the experiment.
Once the samples corresponding to trackloss have been identified, events with less than the required amount of quality data can be removed from the dataset using the function rm_trackloss_events. The argument RequiredData represents the percentage of (non-trackloss) data required in order to retain the event. In the example below, each event must contain 75% quality data; in other words, no more than 25% trackloss.
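A sketch of these two optional steps is given below. The screen dimensions (1920 by 1080 pixels) are purely illustrative and should be replaced with those of the display actually used; the subsequent examples continue from dat3, as this step is not obligatory.
dat3a <- mark_trackloss(data = dat3, Type = "Both", ScreenSize = c(1920, 1080))
dat3b <- rm_trackloss_events(data = dat3a, RequiredData = 75)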
In order to obtain proportion looks, it is necessary to bin the data; that is, to group samples into chunks of time, count the number of samples in each of the interest areas, and calculate the proportions based on the counts. The sampling rate at which the eye gaze data were recorded must be provided. For EyeLink trackers, this is typically 250 Hz, 500 Hz, or 1000 Hz. If in doubt, use the function check_samplingrate to determine it. The sampling rate can then be supplied to the function bin_prop.
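For example:
check_samplingrate(dat3)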
## Sampling rate(s) present in the data are: 1000 Hz.
## Set ReturnData to TRUE to output full, event-specific information.
Note that the check_samplingrate function returns a printed message indicating the sampling rate(s) present in the data. Optionally, it can return a new column called SamplingRate by specifying the parameter ReturnData as TRUE. In the event that data were collected at different sampling rates, this column can be used to subset the dataset by the sampling rate before proceeding to the next processing step.
The function bin_prop calculates the proportion of looks (samples) to each interest area in a particular span of time (bin size). In order to do this, it is necessary to supply the parameters BinSize and SamplingRate. BinSize should be specified in milliseconds, representing the chunk of time within which to calculate the proportions.
Not all bin sizes work for all sampling rates, due to downsampling constraints. If unsure which are appropriate for your current sampling rate, use the ds_options function. When provided with the current sampling rate in SamplingRate (see above), the function will return a printed summary of the bin size options and their corresponding downsampled rates. By default, this returns the whole number downsampling rates users are likely to want; however, it can also return all possible (valid) downsampling rates, even if they are not round numbers.
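For example, for data recorded at 1000 Hz:
ds_options(SamplingRate = 1000)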
## Suggested binning/downsampling options:
## Bin size: 1 ms; Samples per bin: 1 samples; Downsampled rate: 1000 Hz
## Bin size: 2 ms; Samples per bin: 2 samples; Downsampled rate: 500 Hz
## Bin size: 4 ms; Samples per bin: 4 samples; Downsampled rate: 250 Hz
## Bin size: 5 ms; Samples per bin: 5 samples; Downsampled rate: 200 Hz
## Bin size: 8 ms; Samples per bin: 8 samples; Downsampled rate: 125 Hz
## Bin size: 10 ms; Samples per bin: 10 samples; Downsampled rate: 100 Hz
## Bin size: 20 ms; Samples per bin: 20 samples; Downsampled rate: 50 Hz
## Bin size: 25 ms; Samples per bin: 25 samples; Downsampled rate: 40 Hz
## Bin size: 40 ms; Samples per bin: 40 samples; Downsampled rate: 25 Hz
## Bin size: 50 ms; Samples per bin: 50 samples; Downsampled rate: 20 Hz
## Bin size: 100 ms; Samples per bin: 100 samples; Downsampled rate: 10 Hz
The SamplingRate parameter in bin_prop should be specified in Hertz (see check_samplingrate), representing the original sampling rate of the data, and the BinSize should be specified in milliseconds (see ds_options), representing the span of time over which to calculate the proportion. The bin_prop function returns new columns corresponding to each interest area ID (e.g., IA_1_C, IA_1_P). The extension ‘_C’ indicates the count of samples in the bin and the extension ‘_P’ indicates the proportion.
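The call below bins the 1000 Hz sample data into 20 ms bins, producing the summary that follows:
dat4 <- bin_prop(dat3, NoIA = 4, BinSize = 20, SamplingRate = 1000)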
## Binning information:
## Original rate of 1000 Hz with one sample every 1 ms.
## Downsampled rate of 50 Hz using 20 ms bins.
## New bins contain 20 samples.
## Binning...
## Calculating proportions...
## There are 103 data points with less than 20 samples per bin.
## These can be examined and/or removed using the column 'NSamples'.
## Subsequent Empirical Logit calculations may be influenced by the number of samples (depending on the number of observations requested).
## These all occur in the last bin of the time series (typical of Data Viewer output).
In performing the calculation, the function effectively downsamples the data. To check this and to know the new sampling rate, simply call the function check_samplingrate again.
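For example:
check_samplingrate(dat4)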
## Sampling rate(s) present in the data are: 50 Hz.
## Set ReturnData to TRUE to output full, event-specific information.
Proportions are inherently bound between 0 and 1 and are therefore not suitable for many types of analysis. Logits provide a transformation resulting in an unbounded measure (as well as weights which estimate the variance). The calculations contained in this package are based on Barr, D. J. (2008). Analyzing ‘visual world’ eyetracking data using multilevel logistic regression. Journal of Memory and Language, 59(4), 457–474. However, they have been modified to allow greater flexibility.
When using an empirical logit transformation it is important to keep two things in mind. The first is the number of observations (or samples) on which to base the calculation. Typically, this is the number of samples per bin, which varies depending on your original sampling rate and bin size.
To determine the number of samples per bin present in the data, use the function check_samples_per_bin.
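For example:
check_samples_per_bin(dat4)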
## There are 20 samples per bin.
## One data point every 20 millisecond(s)
##
## There are data points with less than 20 samples per bin.
## Subsequent Empirical Logit calculations may be influenced by the number of samples (depending on the number of observations requested).
## These all occur in the last bin of the time series (typical of Data Viewer output).
However, a user may choose to define a different number of observations (because the number of samples is inherently linked to the sampling rate). Note, though, that changing this value can drastically impact the results of the transformation and weight calculations. There are some safeguards within the transformation function to prevent users from choosing inadvisable values (though these safeguards can be overridden with the parameter ObsOverride). So, if in doubt, it is safest to use the number of samples present in your data (as indicated by check_samples_per_bin). The second thing to keep in mind is the constant to be added in the transformation. Note that, by default, the calculation uses a constant of 0.5; however, the user can specify a different value to be used.
If you are interested in visualizing the effect of both the number of observations and the constant on the result of the empirical logit transformation and weight calculations, please refer to the Plotting vignette, which illustrates and discusses the function plot_transformation_app.
The function transform_to_elogit transforms the proportions to empirical logits and also calculates a weight for each value. The weight estimates the variance in each bin (because the variance of the logit depends on the mean). This is particularly important for regression analyses and should be specified in the model call (e.g., weight = 1 / IA_1_wts). As mentioned above, the function takes the number of observations in the parameter ObsPerBin. Here we use the number of samples per bin present in the data.
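With 20 ms bins of 1000 Hz data there are 20 samples per bin, so:
dat5 <- transform_to_elogit(dat4, NoIA = 4, ObsPerBin = 20)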
## Number of Observations equal to Number of Samples.
## Calculation will be based on Number of Samples.
Some researchers may prefer to perform a binomial analysis. Therefore, the function create_binomial uses (previously calculated) proportions and numbers of observations to create a success/failure column for each IA. This column is then a suitable response variable for logistic regression of the time series. As with the empirical logit transformation, a user may choose to define a number of observations that is different from the number of samples per bin. Because this can create artifacts in the scaling or yield counts larger than the number of samples actually present in the data, safeguards are in place to prevent users from choosing inadvisable values (though these safeguards can be overridden with the parameter ObsOverride).
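A parallel call creates the binomial columns (the data frame name dat5a is illustrative):
dat5a <- create_binomial(data = dat4, NoIA = 4, ObsPerBin = 20)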
## Number of Observations equal to Number of Samples.
## Counts will remain as present in the data.
By default, the function will create a success/failure column for each IA in the data; however, it is also possible to create a custom column comparing looks between two specific interest areas. This is done by specifying the parameter CustomBinom with a vector of two integers (e.g., CustomBinom = c(1,2)), in which the two integers correspond to the IDs of the desired interest areas.
For advanced users who have worked with the package functions before and who are familiar with the required steps and output, there is a meta-function called fasttrack, which runs through the previous functions and outputs a data frame containing either empirical logits or binomial data. Note that using this function will still require the user to manually remove unneeded columns (see above). This meta-function takes as parameters all the required arguments to the component functions. It also assumes that dynamic interest areas were used (and thus do not need to be relabeled or reassigned) and that an interest period was defined in Data Viewer relative to the critical stimulus, so separate message alignment is not required. Again, this is only recommended for users who have previously worked with visual world data and with the functions contained in this package, and who are confident that their data meet the requirements and assumptions of the fasttrack function.
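A sketch of such a call is given below, gathering the argument values used in the walkthrough above; this is not an exhaustive argument list, and the exact set of parameters (e.g., Output) should be checked against ?fasttrack:
dat <- fasttrack(data = VWdat, Subject = "RECORDING_SESSION_LABEL", Item = "itemid",
  NoIA = 4, Adjust = 100, Recording = "LandR", WhenLandR = "Right", BinSize = 20,
  SamplingRate = 1000, ObsPerBin = 20, Constant = 0.5, Output = "ELogit")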
Some may wish to rename the interest area columns created by the functions to something more meaningful than the numeric coding scheme. To do so, use the function rename_columns. This will convert column names like IA_1_C and IA_2_P to IA_Target_C and IA_Rhyme_P, respectively. It will perform the operation on all the IA_ columns for up to 8 interest areas.
dat6 <- rename_columns(dat5, Labels = c(IA1="Target", IA2="Rhyme",
IA3="OnsetComp", IA4="Distractor"))
## Renaming 4 interest areas.
You can now check the column names in the data.
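For example:
colnames(dat6)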
##  [1] "Subject"                   "LEFT_GAZE_X"
##  [3] "LEFT_GAZE_Y"               "LEFT_IN_BLINK"
##  [5] "LEFT_IN_SACCADE"           "LEFT_INTEREST_AREA_ID"
##  [7] "LEFT_INTEREST_AREA_LABEL"  "RIGHT_GAZE_X"
##  [9] "RIGHT_GAZE_Y"              "RIGHT_IN_BLINK"
## [11] "RIGHT_IN_SACCADE"          "RIGHT_INTEREST_AREA_ID"
## [13] "RIGHT_INTEREST_AREA_LABEL" "SAMPLE_MESSAGE"
## [15] "TIMESTAMP"                 "TRIAL_INDEX"
## [17] "talker"                    "Rating"
## [19] "Exp"                       "Item"
## [21] "Event"                     "Time"
## [23] "EyeRecorded"               "EyeSelected"
## [25] "IA_ID"                     "IA_LABEL"
## [27] "Gaze_X"                    "Gaze_Y"
## [29] "In_Blink"                  "In_Saccade"
## [31] "IA_Data"                   "DS"
## [33] "NSamples"                  "IA_outside_C"
## [35] "IA_Target_C"               "IA_Rhyme_C"
## [37] "IA_OnsetComp_C"            "IA_Distractor_C"
## [39] "IA_outside_P"              "IA_Target_P"
## [41] "IA_Rhyme_P"                "IA_OnsetComp_P"
## [43] "IA_Distractor_P"           "Obs"
## [45] "IA_outside_ELogit"         "IA_outside_wts"
## [47] "IA_Target_ELogit"          "IA_Target_wts"
## [49] "IA_Rhyme_ELogit"           "IA_Rhyme_wts"
## [51] "IA_OnsetComp_ELogit"       "IA_OnsetComp_wts"
## [53] "IA_Distractor_ELogit"      "IA_Distractor_wts"
Before embarking on a statistical analysis, it is probably necessary to take a couple of steps, such as paring the data down to only the columns that will be needed later and ensuring the data are ordered appropriately. This is straightforward using dplyr.
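One possible way to do this is sketched below; the columns selected are illustrative, and the Event column is retained so the ordering remains unambiguous:
library(dplyr)
FinalDat <- dat6 %>%
  select(Subject, Item, Event, Time, starts_with("IA_")) %>%
  arrange(Subject, Event, Time)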
You are now ready to plot your data. Please refer to the Plotting vignette for details on the various plotting functions contained in the package.