Factor tables are stored in narrow (vertical) format in a DolphinDB database, but whether you are extracting factors for model training or running other factor computations, the data usually has to be presented as a wide table. The demo below compares two approaches: preprocessing the data first and then pivoting it into a wide table with `pivot by`, versus pivoting into a wide table first and then preprocessing. Pivoting first turns out to be about 4 times faster.
We simulate data for the trading days from 2008.01.01 to 2022.05.30, covering 4,500 codes and 1,000 factors, on DolphinDB server 2.00.6 with 8 workers.
```
if(existsDatabase("dfs://factor")){
    dropDatabase("dfs://factor")
}
// COMPO partitioning: RANGE by month on date, VALUE on factor name
datePar = database("", RANGE, date(datetimeAdd(2008.01M, 0..40*12, 'M')))
namePar = database("", VALUE, ["DEFAULT_NAME"])
factorDB = database("dfs://factor", COMPO, [datePar, namePar], engine="TSDB")

factorDoubleSchema = table(1:0, `date`code`name`value, [DATE, SYMBOL, SYMBOL, DOUBLE])
factorDoubleTable = factorDB.createPartitionedTable(factorDoubleSchema, "factor_double", partitionColumns=`date`name, sortColumns=`name`date, keepDuplicates=ALL)

path = "dfs://factor"
tbName = "factor_double"
tb = loadTable(path, tbName)

// keep only weekdays between 2008.01.01 and 2022.05.30
tradedates = 2008.01.01..2022.05.30
tradedates = tradedates[1<=weekday(tradedates)<=5]
n = size(tradedates)
codes = string(600001..604500)
factornames = `factor + decimalFormat(1..1000, "0000")
tradedates = sort(take(tradedates, 4500*n))

// write one factor at a time: n dates x 4500 codes rows per factor
for (ifactor in factornames){
    print(ifactor)
    idata = table(tradedates as date, take(codes, 4500*n) as code, take(ifactor, 4500*n) as name, rand(10.0, 4500*n) as value)
    upsert!(tb, idata, keyColNames=`name`date)
}
```
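For readers more familiar with pandas, a miniature sketch of the same narrow-format layout, purely illustrative (the dates, codes, and sizes below are made up and much smaller than the simulated data):

```python
import numpy as np
import pandas as pd

# Toy narrow (long-format) factor table: one row per (date, code, name),
# mirroring the date/code/name/value schema above at a tiny scale.
rng = np.random.default_rng(0)
dates = pd.to_datetime(["2020-01-02", "2020-01-03", "2020-01-06"])
codes = [str(c) for c in range(600001, 600005)]
factors = ["factor0001", "factor0002"]

narrow = pd.DataFrame(
    [(d, c, f, rng.uniform(0, 10)) for f in factors for d in dates for c in codes],
    columns=["date", "code", "name", "value"],
)
print(narrow.shape)  # one row per date x code x factor combination
```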
Define the functions for data normalization and exponential smoothing:
```
def factorFiterAndNormize(x, fiter){
    // winsorize at the given quantile, then z-score normalize
    v = winsorize(x, fiter)
    return (v - mean(v)) \ std(v)
}

def factorewmMean(x){
    // exponential smoothing with center of mass 0.5
    return ewmMean(X=x, com=0.5)
}

// register as function views so they can be used in distributed queries
addFunctionView(factorFiterAndNormize)
addFunctionView(factorewmMean)
```
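As an illustration only, the same two preprocessing steps can be sketched in pandas; the quantile clipping below approximates DolphinDB's `winsorize`, and all function names are made up:

```python
import pandas as pd

def factor_winsorize_normalize(x: pd.Series, limit: float = 0.03) -> pd.Series:
    # Clip to the [limit, 1 - limit] quantiles, then z-score: (v - mean) / std.
    v = x.clip(x.quantile(limit), x.quantile(1 - limit))
    return (v - v.mean()) / v.std()

def factor_ewm_mean(x: pd.Series) -> pd.Series:
    # Exponential smoothing with center of mass 0.5, like ewmMean(X=x, com=0.5).
    return x.ewm(com=0.5).mean()

# After normalization the series has mean ~0 and sample std ~1.
z = factor_winsorize_normalize(pd.Series([1.0, 2.0, 3.0, 4.0, 100.0]))
e = factor_ewm_mean(pd.Series([1.0, 2.0, 3.0]))
```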
Method 1: extract two years of data for 100 factors, preprocess first and then `pivot by` into a wide table. This takes about 186 s (just over 3 minutes):
```
def pivotby_last(){
    factors = `factor + decimalFormat(1..100, "0000")
    // query the narrow table and normalize within each (date, name) group
    data = select date, code, name, factorFiterAndNormize(value, 0.03) as value from loadTable("dfs://factor", "factor_double") where name in factors and date between 2020.01.01 : 2022.01.01 context by date, name
    // exponential smoothing within each (code, name) group, ordered by date
    data = select date, code, name, factorewmMean(value) as value from data context by code, name csort date
    data = select value from data pivot by date, code, name
    return data
}
timer data = pivotby_last()
// Time elapsed: 186393.562 ms
```
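The flow of method 1 (normalize per (date, name), smooth per (code, name) ordered by date, then pivot) can be sketched on a toy pandas table; every name here is illustrative, not the DolphinDB API:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
narrow = pd.DataFrame(
    [(d, c, f, rng.uniform(0, 10))
     for f in ["factor0001", "factor0002"]
     for d in pd.to_datetime(["2020-01-02", "2020-01-03"])
     for c in ["600001", "600002", "600003"]],
    columns=["date", "code", "name", "value"],
)

def zscore(v: pd.Series) -> pd.Series:
    # toy stand-in for winsorize + standardize
    v = v.clip(v.quantile(0.03), v.quantile(0.97))
    return (v - v.mean()) / v.std()

# context by date, name  ->  groupby(["date", "name"]).transform
narrow["value"] = narrow.groupby(["date", "name"])["value"].transform(zscore)
# context by code, name csort date  ->  sort by date, then groupby transform
narrow = narrow.sort_values("date")
narrow["value"] = (narrow.groupby(["code", "name"])["value"]
                   .transform(lambda v: v.ewm(com=0.5).mean()))
# pivot by date, code, name  ->  wide table with one column per factor
wide = narrow.pivot_table(index=["date", "code"], columns="name",
                          values="value").reset_index()
```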
Method 2: extract the same two years of data for 100 factors, but `pivot by` into a wide table first and then preprocess. This takes about 47 s, roughly 4 times faster than method 1:
```
def pivotby_first(){
    factors = `factor + decimalFormat(1..100, "0000")
    // pivot the narrow table into a wide table first
    data = select value from loadTable("dfs://factor", "factor_double") where name in factors and date between 2020.01.01 : 2022.01.01 pivot by date, code, name
    factors = colNames(data)[2:]
    colCount = factors.size()
    colDefs = array(ANY, colCount + 2)
    colDefs[0] = sqlCol(`date)
    colDefs[1] = sqlCol(`code)
    // normalize every factor column within each date
    for (i in 0:colCount){
        colDefs[i+2] = sqlCol(factors[i], factorFiterAndNormize{,0.03}, factors[i])
    }
    temp = sql(colDefs, data, groupBy=sqlCol(`date), groupFlag=0).eval()
    // exponentially smooth every factor column within each code, ordered by date
    for (i in 0:colCount){
        colDefs[i+2] = sqlCol(factors[i], factorewmMean, factors[i])
    }
    return sql(colDefs, temp, groupBy=sqlCol(`code), groupFlag=0, csort=sqlCol(`date)).eval()
}
timer data = pivotby_first()
// Time elapsed: 46751.896 ms
```
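A pandas sketch of method 2 on the same kind of toy data: pivot to wide first, then apply the per-date normalization and per-code smoothing to every factor column. Working column-wise on the wide table is what the speedup comes from; again, all names below are illustrative:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
narrow = pd.DataFrame(
    [(d, c, f, rng.uniform(0, 10))
     for f in ["factor0001", "factor0002"]
     for d in pd.to_datetime(["2020-01-02", "2020-01-03"])
     for c in ["600001", "600002", "600003"]],
    columns=["date", "code", "name", "value"],
)

# pivot by date, code, name first
wide = narrow.pivot_table(index=["date", "code"], columns="name",
                          values="value").reset_index()
cols = [c for c in wide.columns if str(c).startswith("factor")]

def zscore(v: pd.Series) -> pd.Series:
    # toy stand-in for winsorize + standardize
    v = v.clip(v.quantile(0.03), v.quantile(0.97))
    return (v - v.mean()) / v.std()

# normalize all factor columns within each date in one pass
wide[cols] = wide.groupby("date")[cols].transform(zscore)
# smooth all factor columns within each code, ordered by date
wide = wide.sort_values(["code", "date"])
wide[cols] = wide.groupby("code")[cols].transform(lambda v: v.ewm(com=0.5).mean())
```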