Scala High Performance Programming by 2016

Author:2016
Language: eng
Format: epub, mobi
Publisher: Packt Publishing


Data clean up

The return algorithm is now blazingly fast. That is, blazingly fast to return incorrect results! Remember that we still have to handle some edge cases and clean up the input data. Our algorithm only works if there is exactly one midpoint per minute, and Dave informed us that we are likely to see more than one midpoint computed for the same minute.

To handle this problem, we create a dedicated MidpointSeries module and make sure that an instance of MidpointSeries, wrapping a series of Midpoint instances, is properly created without duplicates:

    import scala.annotation.tailrec

    class MidpointSeries private(val points: Vector[Midpoint]) extends AnyVal

    object MidpointSeries {

      private def removeDuplicates(v: Vector[Midpoint]): Vector[Midpoint] = {

        @tailrec
        def loop(
          current: Midpoint,
          rest: Vector[Midpoint],
          result: Vector[Midpoint]): Vector[Midpoint] = {
          // Gather the current midpoint and every following one in the same minute
          val sameTime = current +: rest.takeWhile(_.time == current.time)
          val average = sameTime.map(_.value).sum / sameTime.size

          val newResult = result :+ Midpoint(current.time, average)
          // Skip the duplicates that were just averaged (current is not part of rest)
          rest.drop(sameTime.size - 1) match {
            case h +: r => loop(h, r, newResult)
            case _ => newResult
          }
        }

        v match {
          case h +: rest => loop(h, rest, Vector.empty)
          case _ => Vector.empty
        }
      }

      def fromExecution(executions: Vector[Execution]): MidpointSeries = {
        new MidpointSeries(removeDuplicates(
          executions.map(Midpoint.fromExecution)))
      }
    }

Our removeDuplicates method uses a tail-recursive helper (refer to Chapter 3, Unleashing Scala Performance). The helper groups all the consecutive midpoints that share the same execution time, calculates the average value of these data points, and builds a new series with these average values. Our module provides a fromExecution factory method to build an instance of MidpointSeries from a Vector of Execution. This factory method calls removeDuplicates to clean up the data.
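To see the averaging on concrete data, here is a minimal, self-contained sketch. It is not the listing above: it is a compact variant that uses span instead of takeWhile and drop, and the shape of Midpoint (a time expressed as a whole minute in a Long, and a Double value) is an assumption made only for this illustration, as are the DedupSketch object and the sample numbers.

```scala
import scala.annotation.tailrec

object DedupSketch {
  // Assumed, simplified shape: the real Midpoint is defined elsewhere in the book.
  final case class Midpoint(time: Long, value: Double)

  // Collapses each run of consecutive midpoints that share a minute
  // into a single midpoint carrying the average value of that run.
  def removeDuplicates(v: Vector[Midpoint]): Vector[Midpoint] = {
    @tailrec
    def loop(rest: Vector[Midpoint], result: Vector[Midpoint]): Vector[Midpoint] =
      rest.headOption match {
        case None => result
        case Some(current) =>
          // sameTime is never empty here: it always contains current itself
          val (sameTime, remaining) = rest.span(_.time == current.time)
          val average = sameTime.map(_.value).sum / sameTime.size
          loop(remaining, result :+ Midpoint(current.time, average))
      }

    loop(v, Vector.empty)
  }

  // Hypothetical sample data: minute 1 appears twice, minute 3 three times.
  val input: Vector[Midpoint] = Vector(
    Midpoint(1L, 10.0),
    Midpoint(1L, 12.0),
    Midpoint(2L, 11.0),
    Midpoint(3L, 9.0),
    Midpoint(3L, 9.5),
    Midpoint(3L, 10.0))

  // One midpoint per minute: values 11.0, 11.0 and 9.5
  val cleaned: Vector[Midpoint] = removeDuplicates(input)
}
```

Like the original method, this sketch only merges duplicates that sit next to each other, so it relies on the input being ordered by time.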

To improve our module, we add our previous computeReturns method to the MidpointSeries class. That way, once constructed, an instance of MidpointSeries can be used to compute any return series:

    class MidpointSeries private(val points: Vector[Midpoint]) extends AnyVal {
      def returns(rollUp: MinuteRollUp): Vector[Return] = {
        for {
          i <- (rollUp.value until points.size).toVector
        } yield Return.fromMidpoint(points(i - rollUp.value), points(i))
      }
    }

This is the same code that we previously wrote, but this time, we are confident that points does not contain duplicates. Note that the constructor is marked private, so the only way to create an instance of MidpointSeries is via our factory method. This guarantees that it is impossible to create an instance of MidpointSeries with a "dirty" Vector. You release this new version of the program, wish good luck to Dave and his team, and leave for a well-deserved lunch break.
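To make the index arithmetic of returns concrete, the following sketch runs it on four clean midpoints with a two-minute roll-up. Several things here are assumptions made for illustration only: the simplified Midpoint, MinuteRollUp, and Return types, the percentage-change formula in Return.fromMidpoint, and the fromCleanPoints factory, which stands in for fromExecution so that the example stays self-contained. The sketch also omits extends AnyVal to keep the nesting simple.

```scala
object ReturnsSketch {
  // Assumed, simplified shapes; the book defines the real types elsewhere.
  final case class Midpoint(time: Long, value: Double)
  final case class MinuteRollUp(value: Int)
  final case class Return(time: Long, value: Double)

  object Return {
    // Assumption: a return is the percentage change between two midpoints.
    def fromMidpoint(start: Midpoint, end: Midpoint): Return =
      Return(end.time, (end.value - start.value) / start.value * 100)
  }

  // Private constructor: instances can only come from the companion object.
  final class MidpointSeries private (val points: Vector[Midpoint]) {
    // Pairs each point with the one rollUp.value positions earlier.
    def returns(rollUp: MinuteRollUp): Vector[Return] =
      for {
        i <- (rollUp.value until points.size).toVector
      } yield Return.fromMidpoint(points(i - rollUp.value), points(i))
  }

  object MidpointSeries {
    // Hypothetical factory standing in for fromExecution; it trusts the
    // caller to pass points that already hold one midpoint per minute.
    def fromCleanPoints(points: Vector[Midpoint]): MidpointSeries =
      new MidpointSeries(points)
  }

  // Hypothetical sample data: one midpoint for each of four minutes.
  val series: MidpointSeries = MidpointSeries.fromCleanPoints(Vector(
    Midpoint(1L, 100.0),
    Midpoint(2L, 200.0),
    Midpoint(3L, 150.0),
    Midpoint(4L, 250.0)))

  // i = 2 pairs minute 1 with minute 3; i = 3 pairs minute 2 with minute 4.
  val twoMinuteReturns: Vector[Return] = series.returns(MinuteRollUp(2))
}
```

With a roll-up of n, the first n points have no earlier partner, so a series of k points yields k - n returns (none at all when n is at least k). The pairing is purely positional, which is why the series must hold exactly one midpoint per minute for the results to be meaningful.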

As you return, you are surprised to find Vanessa, one of the data scientists, waiting at your desk. "The return series code still doesn't work", she says. The team was so excited to finally be given a working algorithm that they decided to skip lunch to play with it. Unfortunately, they discovered some inconsistencies in the results. You try to collect as much data as possible, and spend an hour looking at the invalid results that Vanessa is talking about. You notice that they all involve trade executions for two specific symbols: FOO and BAR. A surprisingly small number of trades is recorded for these symbols, and it is not unusual for several minutes to elapse between trade executions.
