Class ParquetAppendDataStateData
java.lang.Object
edu.stanford.slac.archiverappliance.plain.AppendDataStateData
edu.stanford.slac.archiverappliance.plain.parquet.ParquetAppendDataStateData
A stateful class that manages the process of writing and appending EPICS event data to Parquet files.
This class extends AppendDataStateData to provide Parquet-specific logic. It handles
file partitioning, writing new events, and merging data.
Key behaviors:
- Incremental Appends: When appending new events to an existing file, this class writes the new data to a temporary file. When the stream is closed, this temporary file is efficiently merged with the original data file.
- Bulk Appends (ETL): For ETL operations, it uses an optimized path that leverages
ParquetRewriterto merge multiple source Parquet files directly into the destination file without full deserialization.
-
Field Summary
Fields inherited from class edu.stanford.slac.archiverappliance.plain.AppendDataStateData
currentEventsYear, desc, lastKnownTimeStamp, partitionGranularity, previousFilePath, previousYear, pv2key, rootFolder, writer -
Constructor Summary
ConstructorsConstructorDescriptionParquetAppendDataStateData(PartitionGranularity partitionGranularity, String rootFolder, String desc, Instant lastKnownTimestamp, org.apache.parquet.hadoop.metadata.CompressionCodecName compressionCodecName, PVNameToKeyMapping pv2key, org.apache.parquet.ParquetReadOptions readOptions, PathResolver pathResolver) Constructs a new state manager for appending data to Parquet files. -
Method Summary
Modifier and TypeMethodDescriptionbooleanbulkAppend(String pvName, ETLContext context, ETLBulkStream bulkStream, String extension, String extensionToCopyFrom) Performs a bulk append optimized for ETL processes.voidCloses any open writers and merges temporary files.protected EventFileWritercreateNewWriter(String pvName, Path pvPath, EventStream stream) toString()voidupdateStateBasedOnExistingFile(String pvName, Path currentPVFilePath) Initializes the writer state by reading metadata from an existing Parquet file.Methods inherited from class edu.stanford.slac.archiverappliance.plain.AppendDataStateData
checkStream, getPathResolver, partitionBoundaryAwareAppendData, preparePartition, shouldISkipEventBasedOnTimeStamps, shouldISwitchPartitions
-
Constructor Details
-
ParquetAppendDataStateData
public ParquetAppendDataStateData(PartitionGranularity partitionGranularity, String rootFolder, String desc, Instant lastKnownTimestamp, org.apache.parquet.hadoop.metadata.CompressionCodecName compressionCodecName, PVNameToKeyMapping pv2key, org.apache.parquet.ParquetReadOptions readOptions, PathResolver pathResolver) Constructs a new state manager for appending data to Parquet files.- Parameters:
partitionGranularity- The granularity for partitioning data files (e.g., YEARLY).rootFolder- The root directory for storage.desc- A description for logging purposes.lastKnownTimestamp- The timestamp of the last known event in this storage partition. Used to prevent writing out-of-order data.compressionCodecName- The compression codec to use for writing new files.pv2key- The mapping from PV name to storage key.readOptions- Configuration for reading Parquet files.pathResolver- The resolver for determining file paths.
-
-
Method Details
-
updateStateBasedOnExistingFile
public void updateStateBasedOnExistingFile(String pvName, Path currentPVFilePath) throws IOException Initializes the writer state by reading metadata from an existing Parquet file. This is called when appending data to a file that already contains events.- Specified by:
updateStateBasedOnExistingFilein classAppendDataStateData- Parameters:
pvName- The name of the PV.currentPVFilePath- The path to the existing Parquet file.- Throws:
IOException- if the PV name in the file does not match or an I/O error occurs.
-
createNewWriter
protected EventFileWriter createNewWriter(String pvName, Path pvPath, EventStream stream) throws IOException - Specified by:
createNewWriterin classAppendDataStateData- Throws:
IOException
-
bulkAppend
public boolean bulkAppend(String pvName, ETLContext context, ETLBulkStream bulkStream, String extension, String extensionToCopyFrom) throws IOException Performs a bulk append optimized for ETL processes.This method expects an
ETLParquetFilesStreamand usescombineFiles(List, Path, boolean)to merge the source files directly, bypassing event-by-event processing.- Specified by:
bulkAppendin classAppendDataStateData- Parameters:
pvName- The PV namecontext- The ETL contextbulkStream- The ETL bulk streamextension-extensionToCopyFrom-- Returns:
trueif the append was successful.- Throws:
IOException- if the stream is not of the expected type or an I/O error occurs.
-
closeStreams
Closes any open writers and merges temporary files.This method is critical for ensuring data is committed to its final destination. It closes the
ParquetWriter(flushing its buffers) and then callscombineWithTempFiles()to merge the newly written data into the main partition file.- Overrides:
closeStreamsin classAppendDataStateData- Throws:
IOException- if an I/O error occurs.
-
toString
-