RD - Raw Data

Setting raw data to production status is the most involved process of any data type in AQS and requires the most user interaction. At each stage of the process, the Status Indicator is set to a different value:

| Status | Description / Meaning |
| --- | --- |
| F | Data is loaded and the load process is not complete. Data at F status is only visible to members of the submitting screening group while the data is loading. |
| R | Data has been successfully loaded, automated relational checks have passed and data is ready for review. Data is only visible to members of the screening group responsible for the monitor and will not be included in any reports except for those specifically designed to view pre-production data. |
| S | Statistical Analysis and Critical Review tests (StatCR) have been done and reports are available (see below). Some manual editing and further review may be required. Data is only visible to members of the screening group responsible for the monitor and will not be included in any reports except for those specifically designed to view pre-production data. |
| P | Data is at Production status and is readable by all AQS users and public web pages. |

Rules

Only one sample value is allowed for the same monitor, date, and begin time.

Sample values may not have overlapping durations.
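The two rules above can be checked locally before a file is submitted. The following is a minimal sketch, not the AQS implementation. It assumes the samples for a single monitor are available as (begin, duration) pairs; the function name find_conflicts is hypothetical.

```python
from datetime import datetime, timedelta

def find_conflicts(samples):
    """Return (earlier_begin, later_begin) pairs whose sampling periods collide.

    samples: list of (begin: datetime, duration: timedelta) for ONE monitor.
    A later sample conflicts when it begins before the furthest end time
    seen so far, which also catches duplicate begin times.
    """
    conflicts = []
    furthest_begin = None
    furthest_end = None
    for begin, duration in sorted(samples):
        end = begin + duration
        if furthest_end is not None and begin < furthest_end:
            conflicts.append((furthest_begin, begin))
        # Remember the sample that reaches furthest forward in time.
        if furthest_end is None or end > furthest_end:
            furthest_begin, furthest_end = begin, end
    return conflicts
```

A duplicate begin time is reported as a conflict because the second sample begins before the first one ends, so both rules are covered by the same comparison.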

Statistical Tests

The pattern and gap tests performed on hourly data (duration of 1) are described briefly below. See EPA document Screening Procedures for Ambient Air Quality for more detailed information (publication # EPA-450/2-78-037, July 1978). The values are validated via the following statistical tests, and results are included in the Statistical Critical Review Report for user review prior to posting as production data.

Pattern Tests

Pattern tests are performed on hourly data for pollutants 44201 ozone (O3), 42101 carbon monoxide (CO), 42401 sulfur dioxide (SO2), and 42602 nitrogen dioxide (NO2). Exceptional event data are excluded from the tests. The tests are run on a month of hourly data. Essentially, each test scans the month’s values and compares them against empirically derived thresholds to determine if they are questionable. If so, the value is flagged as failing that particular test. The raw data values are converted to the appropriate units before the tests are applied. The factors used to convert from reporting units to the units of the tests are given in Table 6-1. The threshold values for each pollutant and each test are listed in Table 6-2. As Table 6-2 shows, different threshold values pertain depending on the season of the year and the time of day.

The Dixon test (not applied to CO) scans each day’s values and determines the highest, second highest, and lowest values in that day. It then computes the Dixon ratio, defined as (max - secmax) / (max - low). If this value is greater than 0.55, the day fails the Dixon test and all hours are marked as failing.
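The Dixon ratio can be illustrated with a short sketch. This is not the AQS code. It assumes missing hours are represented as None, that at least three values are needed for the ratio to be meaningful, and that a day whose values are all equal is skipped; none of these details are stated in the text above.

```python
def dixon_ratio(day_values):
    """Compute (max - secmax) / (max - low) for one day of hourly values."""
    vals = sorted(v for v in day_values if v is not None)
    if len(vals) < 3:          # assumption: too few values to test
        return None
    low, secmax, high = vals[0], vals[-2], vals[-1]
    if high == low:            # assumption: flat day, ratio undefined
        return None
    return (high - secmax) / (high - low)

def fails_dixon(day_values, limit=0.55):
    """True when the day's Dixon ratio is greater than the limit."""
    ratio = dixon_ratio(day_values)
    return ratio is not None and ratio > limit
```

For example, a day with values 1, 2, 3, and 10 has a ratio of (10 - 3) / (10 - 1), or about 0.78, which is greater than 0.55, so every hour of that day would be marked as failing.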

The max hour test compares each value in the month to a constant to determine if the value is too high. If so, it is marked as failing.

The high difference test compares the value at each hour in the month to the previous hour and the subsequent hour. If the difference between any two hours is greater than allowable, then the hour under inspection is marked as failing.
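A sketch of the high difference test follows. It is an illustration under stated assumptions rather than the AQS logic: the difference is taken as an absolute value, missing hours are None and are skipped, and an hour whose own value is below the optional minimum is not inspected.

```python
def high_difference_flags(values, max_diff, min_value=None):
    """Flag each hour that differs from either neighbor by more than max_diff."""
    flags = [False] * len(values)
    for i, v in enumerate(values):
        if v is None:
            continue
        if min_value is not None and v < min_value:
            continue  # assumption: hours below the minimum are not tested
        for j in (i - 1, i + 1):
            if 0 <= j < len(values) and values[j] is not None:
                if abs(v - values[j]) > max_diff:
                    flags[i] = True
    return flags
```

Under the absolute-difference reading, a single high hour also causes its two neighbors to be flagged, because each neighbor differs from the high hour by the same amount.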

The spike test works much like the high difference test, except that both differences must be greater than allowable for the test to fail. Also, the percentage difference between the hour in question and both its adjacent hours must be greater than allowable for the test to fail. If either the difference or the percentage comparison fails, the value is rejected.

The high consecutive values test looks at each hour and the subsequent three hours. If all four values are greater than allowable, then all four hours are marked as failing the test.
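The high consecutive values test can be sketched as a sliding window of four hours. This is a minimal illustration, assuming missing hours are None and that a window containing a missing hour cannot fail.

```python
def high_consecutive_flags(values, limit, run=4):
    """Flag every hour that belongs to a run of `run` values all above limit."""
    flags = [False] * len(values)
    for start in range(len(values) - run + 1):
        window = values[start:start + run]
        if all(v is not None and v > limit for v in window):
            for k in range(start, start + run):
                flags[k] = True
    return flags
```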

Threshold Values for the Pattern Tests

| Pollutant | Data Stratification | Maximum Hour Test | High Difference Test | Spike Test | Consecutive Values Test | Minimum Value[1] |
| --- | --- | --- | --- | --- | --- | --- |
| Ozone (pphm[2]) | Summer Day (Months 05-10, Hours 10-17) | 50 | 15 | 10 (300%) | 26 | 5 |
| Ozone (pphm[2]) | Summer Night (Months 05-10, Hours 18-09) | 38 | 10 | 5 (300%) | 26 | 5 |
| Ozone (pphm[2]) | Winter Day (Months 11-04, Hours 10-17) | 26 | 13 | 10 (300%) | 26 | 5 |
| Ozone (pphm[2]) | Winter Night (Months 11-04, Hours 18-09) | 15 | 10 | 5 (300%) | 26 | 5 |
| Carbon Monoxide (ppm) | Rush Traffic (Hours 06-10, Hours 16-20) | 65 | 22 | 17 (500%) | 35 | 17 |
| Carbon Monoxide (ppm) | Non Rush Traffic (Hours 11-15, Hours 21-05) | 44 | 22 | 17 (500%) | 35 | 17 |
| Sulfur Dioxide (pphm[2]) | Zone 1 (Regions 1, 5, 6, 7) | 99 | 19 | 8 (500%) | 38 | 25 |
| Sulfur Dioxide (pphm[2]) | Zone 2 (Regions 2, 3, 4) | 50 | 11 | 8 (500%) | 38 | 25 |
| Sulfur Dioxide (pphm[2]) | Zone 3 (Regions 8, 9, 10) | 30 | 8 | 8 (500%) | 38 | 25 |
| Nitrogen Dioxide (pphm[2]) | None | 64 | 27 | 11 (300%) | 53 | 12 |

[1]Values below the minimum are excluded from the high difference and spike tests.

[2]pphm = parts per hundred million.
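The stratified thresholds can be looked up from the month and hour of each value. The sketch below covers ozone only and applies the max hour test. It assumes the five numbers in each ozone row follow the column order maximum hour, high difference, spike (difference and percentage), consecutive values, minimum value, and that "too high" means strictly greater than the threshold. The names Thresholds, OZONE_PPHM, ozone_thresholds, and fails_max_hour are hypothetical.

```python
from typing import NamedTuple

class Thresholds(NamedTuple):
    max_hour: float
    high_diff: float
    spike_diff: float
    spike_pct: float
    consecutive: float
    minimum: float

# Ozone thresholds in pphm, keyed by (season, period).
OZONE_PPHM = {
    ("summer", "day"):   Thresholds(50, 15, 10, 300, 26, 5),
    ("summer", "night"): Thresholds(38, 10, 5, 300, 26, 5),
    ("winter", "day"):   Thresholds(26, 13, 10, 300, 26, 5),
    ("winter", "night"): Thresholds(15, 10, 5, 300, 26, 5),
}

def ozone_thresholds(month, hour):
    """Pick the ozone stratum: months 05-10 are summer, hours 10-17 are day."""
    season = "summer" if 5 <= month <= 10 else "winter"
    period = "day" if 10 <= hour <= 17 else "night"
    return OZONE_PPHM[(season, period)]

def fails_max_hour(value_pphm, month, hour):
    """True when an ozone value (already in pphm) exceeds its stratum maximum."""
    return value_pphm > ozone_thresholds(month, hour).max_hour
```

With these values, a reading of 20 pphm fails the max hour test at 03:00 in January (threshold 15) but passes at 12:00 in July (threshold 50).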

Gap Test

The gap test is performed on data for pollutants 44201 (O3), 42101 (CO), 42401 (SO2), and 42602 (NO2). Exceptional events data are excluded. The gap test is so named because it looks for gaps in the Frequency Distribution Table for a month’s values. The test is run on a month of hourly data (duration of 1). For each pollutant, AQS builds a Frequency Distribution Table and computes constants associated with the frequency distribution. If there is not enough data to compute the constants, a warning is issued. Having determined the constants, a largest reasonable gap is estimated. Then the largest actual gap in the data is determined and compared to the largest estimated gap to determine whether the month passes or fails the gap test.

Patterns and Gap Failure Report

The patterns and gap failure report is produced by the Critical Review (CR) statistics program. The report identifies the day in which a pattern test failed or the month in which the gap test failed, and it shows the first and last keys of the transactions involved in the failed test. Under the heading hourly values failing test(s)/test(s) failed, the report gives additional information to help identify the values that failed the test.

For pattern test failures, the report shows all the values for each day in which a value failed a test. If database values are being changed with update transactions, some of the values listed may be from the database and some from the transactions in the batch transaction file. One-letter codes under the values indicate which of the pattern tests a value failed. If a particular value failed more than one test, multiple codes are listed. The codes are:

| Code | Test |
| --- | --- |
| C | High consecutive values test |
| D | Dixon test |
| H | High difference test |
| M | Max hour test |
| S | Spike test |

The codes are also listed on each page of the reports as part of the page heading.

Determining which values are involved in a failure of the gap test is a bit more difficult. The gap test identifies a gap in the frequency distribution of a month’s values. Often, but not always, the gap is due to an outlier, a value unusually high or low relative to the bulk of the data for the month.

The patterns and gap failure report does not identify the date and hour of the value(s) failing the gap test, but it does give information that may allow manual identification of the value(s). Gap size is the difference in magnitude between the two values on either side of the gap in the frequency distribution, expressed in the units used for the gap test (ppm or pphm). Num above gap is the number of values above the gap. If the number of values is large, the gap is in the smallest values for the month. Finally, slot below gap is the value on the low end of the gap, expressed in the units of the test (ppm or pphm).
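The manual identification described above can be sketched in code. The example below works on the raw values rather than on the binned Frequency Distribution Table that AQS builds, so it only approximates the report fields. The function names largest_gap and suspects, and the (label, value) representation of a sample, are hypothetical.

```python
def largest_gap(values):
    """Return report-style fields for the widest gap between sorted values."""
    ordered = sorted(values)
    if len(ordered) < 2:
        return None
    # Widest difference between neighbors; ties resolve to the higher slot.
    gap_size, slot_below = max(
        (high - low, low) for low, high in zip(ordered, ordered[1:])
    )
    num_above = sum(1 for v in ordered if v > slot_below)
    return {
        "gap_size": gap_size,
        "slot_below_gap": slot_below,
        "num_above_gap": num_above,
    }

def suspects(samples, slot_below_gap):
    """Return the smaller group of (label, value) samples on either side of the gap."""
    above = [s for s in samples if s[1] > slot_below_gap]
    below = [s for s in samples if s[1] <= slot_below_gap]
    return above if len(above) <= len(below) else below
```

Given the slot below gap from the report, suspects lists the samples on the less populated side of the gap, which matches the observation that a large num above gap points to the smallest values of the month.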

Shewhart Test

The Shewhart test, which is performed on daily data, is described briefly here. See EPA document Screening Procedures for Ambient Air Quality for more detailed information (publication # EPA-450/2-78-037, July 1978).

The Shewhart test is performed on daily data (duration of 7, 24-hour) for pollutants 12128 (Pb), 42401 (SO2), 42602 (NO2), 88101 (PM-2.5), and 81102 (PM-10). Exceptional events data are excluded. The test is run on a month of daily data. The program counts the number of valid samples for the current month and each of the three previous months. If there is insufficient data to perform the test, a warning message is issued. Given sufficient data for at least two of the three previous months, the program computes the mean and range for the current month. It then computes the historical mean and range from the mean and range of the data for the three historical months. The mean and range for the current month are compared against the historical values to determine whether the current month passes or fails the Shewhart test.
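The quantities that feed the Shewhart comparison can be sketched as follows. This is an illustration only. The minimum sample count (min_samples) is a hypothetical placeholder, and averaging the monthly means and ranges to obtain the historical values is an assumption; the pass or fail limits are not given here and are left out.

```python
from statistics import mean

def month_stats(values):
    """Return (mean, range) for one month of daily values."""
    return mean(values), max(values) - min(values)

def shewhart_inputs(current, history, min_samples=4):
    """Compute current and historical mean and range, or None if data is short.

    current: daily values for the month under test.
    history: list of three lists, one per previous month.
    min_samples: hypothetical sufficiency threshold, not taken from AQS.
    """
    usable = [m for m in history if len(m) >= min_samples]
    if len(current) < min_samples or len(usable) < 2:
        return None  # corresponds to the insufficient-data warning
    cur_mean, cur_range = month_stats(current)
    stats = [month_stats(m) for m in usable]
    return {
        "current_mean": cur_mean,
        "current_range": cur_range,
        "historical_mean": mean(s[0] for s in stats),
        "historical_range": mean(s[1] for s in stats),
    }
```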

Critical Review Tests

For monitors defined as SLAMS, NAMS, and PAMS, measurement data is flagged in the StatCR report if it meets certain “Critical Review” conditions. The specific tests are as follows:

Criteria Pollutants

  • Any value that exceeds the 3-year historical maximum. The 3-year historical maximum is defined as the maximum non-exceptional event value obtained between the current year and the current year minus 3 years.

  • The first maximum of the dataset is 125% greater than the second maximum on a per monitor-year basis.

  • Any value that exceeds any associated NAAQS multiplied by the critical review factor located in the AQS “state_threasholds” table.

  • NAAQS exceedances that are deleted from the database.

Non-Criteria Pollutants

  • Any monitor-quarter whose first maximum is 175% greater than the 3-year historical maximum.

  • The first maximum of the dataset is 150% greater than the second maximum on a per monitor-year basis.

Certified Data

  • Any monitor whose certified data has changed due to a change in the raw data for a monitor-year. This change may be in the form of an insertion, modification, or deletion of the supporting raw data.

Examples:

RD transaction in default mode
RD|I|17|031|4201|42101|1|1|007|554|20200501|00:00|0.192|||||||||||||||
RD transaction in Tribal mode
RD|I|TT|207|0762|85101|1|7|105|808|20200101|00:00|2.47190||||||||||||||.42619|.29072

Format

RD - Raw Data transaction format

| Seq. | Name | Description | Formatting Rules | Required | Key |
| --- | --- | --- | --- | --- | --- |
| 1 | Transaction Type | Raw Data Sample transaction identifier. | Must = RD | Always | |
| 2 | Action Indicator | Indicator for Insert, Update, or Delete action. | Must = I, U, or D | Always | |
| 3 | State Code / Tribal Indicator | The FIPS state code of the monitor, or “TT” to indicate that the next field on the transaction is a Tribal code. | Must exist in STATES Reference Table or be ‘TT’ for Tribal Site | Always | Y |
| 4 | County Code / Tribal Code | The FIPS County Code of the site. If the previous field on the transaction contains “TT”, then the Tribal Code of the site. | Must exist in COUNTIES (for state) or TRIBAL_AREAS Reference table | Always | Y |
| 5 | Site Number | Four-digit number to uniquely identify site in county and/or tribe. | Must exist in SITES table with {State Code, County Code} or Tribal Code | Always | Y |
| 6 | Parameter Code | The AQS parameter code assigned to the monitor in AQS. | Must exist in PARAMETERS reference table. | Always | Y |
| 7 | POC | Parameter Occurrence Code: One- or two-digit number identifying a specific monitor for a parameter at the site. | Must exist in MONITOR table for non-Insert actions. | Always | Y |
| 8 | Sample Duration Code | Represents the amount of time the monitor samples before reporting a measurement. | Must exist in SAMPLE DURATIONS reference table. | Insert | |
| 9 | Reported Unit Code | The unit of measure of the reported value. | Must exist in UNITS reference table. | Insert | |
| 10 | Method Code | Identifies a particular method for collecting and analyzing samples. | Must exist in SAMPLING METHODOLOGIES reference table. | Insert | |
| 11 | Sample Date | The calendar date for which the observation is being reported. | YYYYMMDD. | Always | Y |
| 12 | Sample Begin Time | The time at which the sampling for the reported observation began, in standard time at the location of the monitoring site. | HH:MM. | Always | Y |
| 13 | Reported Sample Value | The value of an observation being reported. | Decimal. | Sample value or null data code required | |
| 14 | Null Data Code | A code to explain why no sample value was reported. | Must exist in QUALIFIERS reference table. | Sample value or null data code required | |
| 15 | Collection Frequency Code | Indicates the elapsed time period between observations. | Must exist in COLLECTION FREQUENCIES reference table. | No | |
| 16 | Monitor Protocol ID | The ID used to distinguish combinations of sample duration, unit, method, collection frequency, composite type, and alternate method detection limit (MDL) for a monitor. | Must exist in MONITOR PROTOCOLS table for monitor. | No | |
| 17 | Qualifier Code - 1 | Qualifications used to describe the raw data. | Must exist in QUALIFIERS reference table. | No | |
| 18 | Qualifier Code - 2 | Qualifications used to describe the raw data. | Must exist in QUALIFIERS reference table. | No | |
| 19 | Qualifier Code - 3 | Qualifications used to describe the raw data. | Must exist in QUALIFIERS reference table. | No | |
| 20 | Qualifier Code - 4 | Qualifications used to describe the raw data. | Must exist in QUALIFIERS reference table. | No | |
| 21 | Qualifier Code - 5 | Qualifications used to describe the raw data. | Must exist in QUALIFIERS reference table. | No | |
| 22 | Qualifier Code - 6 | Qualifications used to describe the raw data. | Must exist in QUALIFIERS reference table. | No | |
| 23 | Qualifier Code - 7 | Qualifications used to describe the raw data. | Must exist in QUALIFIERS reference table. | No | |
| 24 | Qualifier Code - 8 | Qualifications used to describe the raw data. | Must exist in QUALIFIERS reference table. | No | |
| 25 | Qualifier Code - 9 | Qualifications used to describe the raw data. | Must exist in QUALIFIERS reference table. | No | |
| 26 | Qualifier Code - 10 | Qualifications used to describe the raw data. | Must exist in QUALIFIERS reference table. | No | |
| 27 | Alternate Method Detection Limit | User-supplied alternate to the federal method detection limit (MDL). | Decimal. | No | |
| 28 | Measurement Uncertainty | The value of method uncertainty associated with the sample. | Decimal. | No | |
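The 28 pipe-delimited fields of an RD transaction can be split and checked before submission. The following is a minimal sketch, not an AQS utility. It covers only the checks that need no reference tables (field count, transaction type, action indicator, date and time format, and required fields). The snake_case field names and the function parse_rd are hypothetical, and treating the "sample value or null data code" requirement as applying to Insert actions only is an assumption.

```python
from datetime import datetime

RD_FIELDS = [
    "transaction_type", "action_indicator", "state_code", "county_code",
    "site_number", "parameter_code", "poc", "sample_duration_code",
    "reported_unit_code", "method_code", "sample_date", "sample_begin_time",
    "reported_sample_value", "null_data_code", "collection_frequency_code",
    "monitor_protocol_id",
    *[f"qualifier_code_{i}" for i in range(1, 11)],
    "alternate_method_detection_limit", "measurement_uncertainty",
]

ALWAYS_REQUIRED = RD_FIELDS[:7] + ["sample_date", "sample_begin_time"]
INSERT_REQUIRED = ["sample_duration_code", "reported_unit_code", "method_code"]

def parse_rd(line):
    """Split one RD transaction into a dict and apply format-level checks."""
    parts = line.rstrip("\r\n").split("|")
    if len(parts) != len(RD_FIELDS):
        raise ValueError(f"expected {len(RD_FIELDS)} fields, got {len(parts)}")
    rec = dict(zip(RD_FIELDS, parts))
    if rec["transaction_type"] != "RD":
        raise ValueError("transaction type must be RD")
    if rec["action_indicator"] not in ("I", "U", "D"):
        raise ValueError("action indicator must be I, U, or D")
    required = list(ALWAYS_REQUIRED)
    if rec["action_indicator"] == "I":
        required += INSERT_REQUIRED
        # Assumption: an insert needs a sample value or a null data code.
        if not (rec["reported_sample_value"] or rec["null_data_code"]):
            raise ValueError("sample value or null data code required")
    for name in required:
        if not rec[name]:
            raise ValueError(f"missing required field: {name}")
    datetime.strptime(rec["sample_date"], "%Y%m%d")      # YYYYMMDD
    datetime.strptime(rec["sample_begin_time"], "%H:%M")  # HH:MM
    # "TT" in field 3 means field 4 carries a Tribal Code.
    rec["is_tribal"] = rec["state_code"] == "TT"
    return rec
```

Checks against the reference tables (STATES, PARAMETERS, QUALIFIERS, and so on) are left out because they depend on data held in AQS.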