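Suggestion: the TSDB deduplication policy should include the partition key

The current design, which deduplicates on the sort key alone, is very counterintuitive and can easily cause unintended data overwrites. I suggest that the deduplication policy cover both the partition key and the sort key, or that the sort key include the partition key by default.

I designed a narrow table by following the best practices. The sort key shown in the documentation does not include the partition key, which led to data being overwritten.

Code to reproduce: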

// Composite partitioning: HASH on factor (20 buckets) and VALUE on month
d_db1 = database("", HASH, [SYMBOL, 20])
d_db2 = database("", VALUE, 2010.01M..2030.01M)

d_db = database("dfs://test", COMPO, [d_db1, d_db2], engine="TSDB")

// Narrow table: one row per (time, instrument, factor)
day_col_names = `time`instrument`factor`value
day_col_types = [DATE, SYMBOL, SYMBOL, FLOAT]
factor_day_table = table(1:0, day_col_names, day_col_types)
// Partitioned by factor and time, but the sort key is only instrument + time
d_db.createPartitionedTable(factor_day_table, `factor_day, `factor`time, compressMethods={time:`delta}, sortColumns=`instrument`time, keepDuplicates=LAST)

// Two rows with the same instrument and time but different factors
t1 = table(2010.01.01 as time, `test as instrument, `open as factor, 1 as value)
t2 = table(2010.01.01 as time, `test as instrument, `low as factor, 1 as value)
t = loadTable("dfs://test", `factor_day)
t.append!(t1)
t.append!(t2)
select * from t

Expected two rows, but the result contains only one.

1 Answer

saki

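First, the sort key does not sort data globally. It serves two purposes: building the data index and deduplicating. By design, the partition key and the sort key are independent of each other, so you can include the partition columns in the sort key or leave them out. The result in your example is as expected: the deduplication key is `instrument`time, so only one row is kept. If you want two rows, set sortColumns to `factor`instrument`time.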
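
As a minimal sketch of that suggestion, the script below rebuilds the table from the question with factor added to the sort key. The database path dfs://test_fixed is a hypothetical scratch path chosen for illustration, and the existence check drops whatever is stored there, so do not point it at real data. Everything else follows the schema in the question.

```
// Hypothetical scratch database; anything already stored at this path is dropped
dbPath = "dfs://test_fixed"
if (existsDatabase(dbPath)) {
    dropDatabase(dbPath)
}

// Same composite partitioning as in the question
d_db1 = database("", HASH, [SYMBOL, 20])
d_db2 = database("", VALUE, 2010.01M..2030.01M)
d_db = database(dbPath, COMPO, [d_db1, d_db2], engine="TSDB")

schema = table(1:0, `time`instrument`factor`value, [DATE, SYMBOL, SYMBOL, FLOAT])

// The only change: factor is part of the sort key, so the dedup key
// becomes (factor, instrument, time) instead of (instrument, time)
d_db.createPartitionedTable(schema, `factor_day, `factor`time, compressMethods={time:`delta}, sortColumns=`factor`instrument`time, keepDuplicates=LAST)

t = loadTable(dbPath, `factor_day)
t.append!(table(2010.01.01 as time, `test as instrument, `open as factor, 1.0f as value))
t.append!(table(2010.01.01 as time, `test as instrument, `low as factor, 1.0f as value))

// The two rows differ in factor, so neither should overwrite the other
select * from t
```

With this sort key, keepDuplicates=LAST still overwrites rows that share the same factor, instrument and time, which is usually the intended behavior for a narrow factor table.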
