An optional, comma-separated list of regular expressions that match names of schemas specified in table.include.list for which you want to take the snapshot.
Simply overwrites storage with latest delta record
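For forward compatibility, our data engineering colleague Karl added the OverwriteNonDefaultsWithLatestAvroPayload class, which overrides combineAndGetUpdateValue to handle the problem described above, and contributed it back to the community as [HUDI-1255] Add new Payload (OverwriteNonDefaultsWithLatestAvroPayload) for updating specified fields in storage[5]. The community has many similar requests, such as [HUDI-1160] Support update partial fields for CoW table[6], and we look forward to more developers helping to refine this feature.
There is a limitation, of course: if you genuinely want to update a field to a null value, OverwriteNonDefaultsWithLatestAvroPayload cannot do it.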
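To make the merge rule and its limitation concrete, here is a minimal sketch in plain Python. It is a simplified model rather than Hudi's actual implementation: records are dicts, and we assume an incoming field is ignored when its value equals the schema default (null here). The function name `overwrite_non_defaults` and the sample records are hypothetical.

```python
from typing import Any, Dict


def overwrite_non_defaults(
    stored: Dict[str, Any],
    incoming: Dict[str, Any],
    defaults: Dict[str, Any],
) -> Dict[str, Any]:
    """Merge `incoming` into `stored`, skipping fields left at their default.

    Simplified model of the partial-update idea: a field in the incoming
    record only overwrites storage when it differs from the schema default.
    """
    merged = dict(stored)
    for field, value in incoming.items():
        if value == defaults.get(field):
            # Field was not set by the producer: keep what is in storage.
            continue
        merged[field] = value
    return merged


# Assumed schema defaults: every field defaults to null.
defaults = {"id": None, "name": None, "city": None}
stored = {"id": 1, "name": "alice", "city": "Shanghai"}

# A partial update that only carries a new city.
partial = {"id": 1, "name": None, "city": "Beijing"}
merged = overwrite_non_defaults(stored, partial, defaults)

# The limitation: an explicit null cannot be told apart from "not set".
set_to_null = {"id": 1, "name": None, "city": None}
unchanged = overwrite_non_defaults(stored, set_to_null, defaults)
```

In this model `merged` keeps the stored `name` and takes the new `city`, while `unchanged` is identical to `stored`, which is exactly why a field cannot be reset to null with this payload.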
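We also extended the community's Compaction strategy with a time-based scheduling strategy: Compaction can be triggered not only by the number of delta commits but also by elapsed time. This work has likewise been contributed back to the community; see [HUDI-1381] Schedule compaction based on time elapsed[7]. It gives more flexibility to anyone who needs Compaction to run within a specified time window.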
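The scheduling decision can be sketched roughly as follows. This is an illustration only: the class `CompactionPolicy`, its fields, and the choice to fire when either threshold is reached are our assumptions for the example, not the actual Hudi configuration or code.

```python
from dataclasses import dataclass


@dataclass
class CompactionPolicy:
    """Hypothetical thresholds for scheduling a compaction."""
    max_delta_commits: int       # commit-count trigger
    max_elapsed_seconds: float   # time-based trigger


def should_schedule_compaction(
    policy: CompactionPolicy,
    delta_commits_since_last: int,
    seconds_since_last: float,
) -> bool:
    # Assumption: either condition alone is enough to schedule compaction.
    return (
        delta_commits_since_last >= policy.max_delta_commits
        or seconds_since_last >= policy.max_elapsed_seconds
    )


policy = CompactionPolicy(max_delta_commits=5, max_elapsed_seconds=3600)

# Few commits, but an hour and a half has passed: the time trigger fires.
by_time = should_schedule_compaction(policy, delta_commits_since_last=2, seconds_since_last=5400)

# Many commits in a short period: the commit-count trigger fires.
by_count = should_schedule_compaction(policy, delta_commits_since_last=5, seconds_since_last=60)

# Neither threshold reached yet.
not_yet = should_schedule_compaction(policy, delta_commits_since_last=2, seconds_since_last=60)
```

A low-traffic table that rarely reaches the commit threshold would still be compacted at a predictable interval under this rule.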
Hudi uses Avro as the internal canonical representation for records, primarily due to its nice schema compatibility & evolution[9] properties. This is a key aspect of having reliability in your ingestion or ETL pipelines. As long as the schema passed to Hudi (either explicitly in DeltaStreamer schema provider configs or implicitly by Spark Datasource's Dataset schemas) is backwards compatible (e.g., no field deletes, only appending new fields to the schema), Hudi will seamlessly handle read/write of old and new data and also keep the Hive schema up to date.
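As a rough illustration of the "no field deletes, only appending new fields" rule, the check below compares two schemas modelled as ordered lists of (name, type) pairs. This is a deliberately narrow reading of the sentence above: real Avro schema resolution has more rules (defaults, type promotion), and the helper name `is_append_only_evolution` is hypothetical.

```python
from typing import List, Tuple

Field = Tuple[str, str]  # (field name, type name)


def is_append_only_evolution(old: List[Field], new: List[Field]) -> bool:
    """True if `new` keeps every old field unchanged and only appends fields."""
    return len(new) >= len(old) and new[: len(old)] == old


old_schema = [("id", "long"), ("name", "string")]

# Appending a field is accepted under this rule.
appended = old_schema + [("city", "string")]

# Dropping a field is rejected.
dropped = [("id", "long")]

ok = is_append_only_evolution(old_schema, appended)
bad = is_append_only_evolution(old_schema, dropped)
```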
The indexing component is a key part of the Hudi writing and it maps a given recordKey to a fileGroup inside Hudi consistently. This enables faster identification of the file groups that are affected/dirtied by a given write operation.
Hudi supports a few options for indexing as below:
• HoodieBloomIndex (default) : Uses a bloom filter and ranges information placed in the footer of parquet/base files (and soon log files as well)
• HoodieGlobalBloomIndex : The default indexing only enforces uniqueness of a key inside a single partition i.e the user is expected to know the partition under which a given record key is stored. This helps the indexing scale very well for even very large datasets[11]. However, in some cases, it might be necessary instead to do the de-duping/enforce uniqueness across all partitions and the global bloom index does exactly that. If this is used, incoming records are compared to files across the entire dataset and ensure a recordKey is only present in one partition.
• HBaseIndex : Apache HBase is a key value store, typically found in close proximity to HDFS. You can also store the index inside HBase, which could be handy if you are already operating HBase.
You can implement your own index if you'd like, by subclassing the HoodieIndex class and configuring the index class name in configs.
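The difference between partition-scoped and global uniqueness can be shown with a toy in-memory index. This is only a sketch of the concept: both classes and their `tag`/`lookup` methods are hypothetical and unrelated to the real HoodieIndex API, and letting the latest location win in the global index is an assumption made for brevity.

```python
from typing import Dict, Optional, Tuple


class PartitionScopedIndex:
    """Key is unique only within a partition (like the default behaviour)."""

    def __init__(self) -> None:
        self._locations: Dict[Tuple[str, str], str] = {}

    def tag(self, partition: str, record_key: str, file_group: str) -> None:
        self._locations[(partition, record_key)] = file_group

    def lookup(self, partition: str, record_key: str) -> Optional[str]:
        # The caller must know the partition to find the record.
        return self._locations.get((partition, record_key))


class GlobalIndex:
    """Key is unique across the whole table."""

    def __init__(self) -> None:
        self._locations: Dict[str, Tuple[str, str]] = {}

    def tag(self, partition: str, record_key: str, file_group: str) -> None:
        # Assumption: the latest location replaces any earlier one.
        self._locations[record_key] = (partition, file_group)

    def lookup(self, record_key: str) -> Optional[Tuple[str, str]]:
        return self._locations.get(record_key)


scoped = PartitionScopedIndex()
scoped.tag("2020-11-01", "order-42", "fg-a")
scoped.tag("2020-11-02", "order-42", "fg-b")  # same key, second partition

global_index = GlobalIndex()
global_index.tag("2020-11-01", "order-42", "fg-a")
global_index.tag("2020-11-02", "order-42", "fg-b")  # replaces the first entry
```

With the scoped index the same `order-42` lives in two partitions at once, whereas the global index resolves it to a single location, at the cost of comparing against the whole table.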
After discussing this with the community, we lean towards using HBaseIndex or a similar k-v store to manage the index.
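■ Updates
Slow upserts are caused partly by some files being large, but they are also tied to the nature of CDC. The range of files touched by updates to mutable data is unpredictable. In the extreme case, 1000 records to be updated belong to 1000 different files; update performance is then hard to improve through code optimization, and the only option is to add CPU resources to increase processing parallelism. We plan to approach this from several angles:
Parameter tuning: check whether there is a way to balance the number of files against their size.
Try MOR mode for some business tables. With MOR, updates are first written to log files and merged into Parquet later, which in theory lowers how often Parquet files are rewritten.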
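The worst case and the expected benefit of MOR can be put into numbers with a toy cost model. This is a back-of-the-envelope sketch, not a measurement: we assume a COW write rewrites every base file it touches, while MOR only rewrites the files dirtied since the last compaction. All function names and the `compact_every` parameter are hypothetical.

```python
from typing import Dict, Iterable, List, Set


def files_touched(updates: Iterable[str], key_to_file: Dict[str, str]) -> Set[str]:
    """Distinct files that contain at least one updated record key."""
    return {key_to_file[k] for k in updates if k in key_to_file}


def cow_rewrites(batches: List[List[str]], key_to_file: Dict[str, str]) -> int:
    """COW model: every batch rewrites each base file it touches."""
    return sum(len(files_touched(b, key_to_file)) for b in batches)


def mor_rewrites(batches: List[List[str]], key_to_file: Dict[str, str], compact_every: int) -> int:
    """MOR model: updates go to logs; base files are rewritten at compaction."""
    rewrites = 0
    dirty: Set[str] = set()
    for i, batch in enumerate(batches, start=1):
        dirty |= files_touched(batch, key_to_file)
        if i % compact_every == 0:
            rewrites += len(dirty)
            dirty.clear()
    return rewrites


# Extreme case from the text: 1000 updated records spread over 1000 files.
key_to_file = {f"k{i}": f"file-{i}" for i in range(1000)}
worst_case_batch = [f"k{i}" for i in range(1000)]
touched = len(files_touched(worst_case_batch, key_to_file))

# Four such batches in a row, with MOR compacting once every four batches.
batches = [worst_case_batch] * 4
cow = cow_rewrites(batches, key_to_file)
mor = mor_rewrites(batches, key_to_file, compact_every=4)
```

Under these assumptions the COW model rewrites 4000 base files against 1000 for MOR. The model ignores the cost of reading log files at query time, which is the price MOR pays for cheaper writes.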