CSC 5003 – Semantic Web And Big Data Architecture

CI1 : Introduction To Scala

Install Scala and practice.

Install Scala (∼10 min – easy)

To begin with, we need to install the tools to run Scala. You can follow the instructions on the Scala website.

Next, you want to use your Java IDE to code in Scala. To do so, you need to install an additional plugin or a variant of the IDE you are used to. In this lecture, we recommend using IntelliJ.

For IntelliJ IDEA: Scala Plugin
For Eclipse: Scala IDE
For Visual: Scala Plugin

Peter Pan (∼45 min – medium)

In this exercise, we will write functions that manipulate text (Peter Pan, here). The goal is to get familiar with Scala and higher-order functions.

In this exercise, do not use the traditional for and while loops unless stated otherwise.

Using your IDE, create a new Scala project. Then, create a new Scala object called TextReader. In Scala, the object keyword creates a class that has exactly one instance (a singleton). Add the main method to your code. You should have something like:

object TextReader {
    def main(args: Array[String]): Unit = {
    }
}
		

Check that you can run Scala by printing the traditional Hello, World!

object TextReader {
  def main(args: Array[String]): Unit = {
    println("Hello, World!")
  }
}

Follow this link and download the book Peter Pan. Then, read the lines of this file into a list. We can use the following:

val lines = Source.fromFile(path, "UTF-8").getLines().toList

You will have to import Source with import scala.io.Source. Print the first element of this list.

println(lines.head)

Although the length method already exists in Scala, reimplement it as a recursive function. You can use if/else and the isEmpty method. Check that you get the same result as the built-in length method. (Optional) Try to write the same function with pattern matching.

def length(l: List[String]): Int = {
    if (l.isEmpty) {
      0
    } else {
      length(l.tail) + 1
    }
}
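
For the optional part, one possible pattern-matching version is sketched below. The names LengthDemo and lengthPM are made up for this illustration, and the sample lines are invented rather than taken from the book file.

```scala
object LengthDemo {
  // Hypothetical name: lengthPM is not part of the exercise statement.
  def lengthPM(l: List[String]): Int = l match {
    case Nil       => 0                   // empty list: nothing to count
    case _ :: tail => 1 + lengthPM(tail)  // one element plus the length of the rest
  }

  def main(args: Array[String]): Unit = {
    val sample = List("All children", "except one", "grow up")
    // Compare with the built-in method
    println(lengthPM(sample) == sample.length) // prints true
  }
}
```

The two cases mirror the if/else branches: Nil plays the role of isEmpty, and the `_ :: tail` pattern gives direct access to the tail.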

Now, we want to get statistics about this text. Write a function that returns the total number of characters. Here, use only higher-order functions (no for/while loops and no recursion).

def totalChars(l: List[String]): Int = {
    l.map(_.length).reduce(_ + _)
}

Write a function to get the number of lines containing the word Peter (String has a method contains).

def getNumberLinesPeter(l : List[String]): Int = {
    l.filter(_.contains("Peter")).length
}

Using currying, generalize the previous function so that it accepts any word (not just Peter) and returns the corresponding counting function.

def getNumberLinesWord(word: String)(l: List[String]) : Int = {
    l.filter(_.contains(word)).length
}
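
To see what "returning a function" means here, the sketch below supplies only the first argument list. The object name CurryDemo, the value countPeter, and the sample lines are assumptions made for the illustration.

```scala
object CurryDemo {
  def getNumberLinesWord(word: String)(l: List[String]): Int =
    l.filter(_.contains(word)).length

  // Fixing only the word yields a function List[String] => Int.
  // The explicit type annotation lets the compiler build that function.
  val countPeter: List[String] => Int = getNumberLinesWord("Peter")

  def main(args: Array[String]): Unit = {
    val sample = List("Peter flew in", "Wendy slept", "Peter and Wendy")
    println(countPeter(sample))                  // prints 2
    println(getNumberLinesWord("Wendy")(sample)) // prints 2
  }
}
```

The same countPeter value can then be reused on any list of lines without repeating the word.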

Using only higher-order functions, apply the following transformation to each line:

If the line contains love, turn it into the Char +
Else if the line contains hate, turn it into the Char -
Else, turn it into the Char =

Then, remove the = and print what remains on a single line. You should get something like: +++--+-++---.

def printPlusMinus(l: List[String]): Unit = {
    l.map(x => if (x.contains("love")) '+' else if (x.contains("hate")) '-' else '=')
      .filter(x => x != '=')
      .foreach(print)
}

(hard) Using only higher-order functions, get the most frequent letter in the text, excluding the space. If you want to transform something into a list, you can use the .toList method.

We propose the following algorithm:

Turn all the lines into a list of Char.
Merge all the lists into a single one.
Remove all spaces.
Group all the similar letters together.
Turn the Map into a list and, for each group of letters, keep only its length.
Using a reduce, find the most used character

Be careful with the types at each step (an IDE helps a lot, here).

def mostCommonLetter(l: List[String]): Char = {
    l.map(_.toList).flatten.filter(_ != ' ').groupBy(x => x).toList
      .map(x => (x._1, x._2.length))
      .reduce((x, y) => if (x._2 < y._2) y else x)._1
}
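
To make the types at each step explicit, here is the same pipeline split into annotated intermediate values on a tiny made-up input. The object name LetterSteps and the function name steps are hypothetical.

```scala
object LetterSteps {
  // Returns the most frequent non-space character together with its count.
  def steps(sample: List[String]): (Char, Int) = {
    val chars: List[List[Char]] = sample.map(_.toList)           // each line as a list of Char
    val merged: List[Char] = chars.flatten                       // one single list
    val noSpaces: List[Char] = merged.filter(_ != ' ')           // spaces removed
    val groups: Map[Char, List[Char]] = noSpaces.groupBy(x => x) // identical letters together
    val counts: List[(Char, Int)] = groups.toList.map(x => (x._1, x._2.length))
    counts.reduce((x, y) => if (x._2 < y._2) y else x)           // keep the larger count
  }

  def main(args: Array[String]): Unit = {
    println(steps(List("a bb", "b c"))) // prints (b,3)
  }
}
```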

Data Analysis (∼25 min – medium)

In this exercise, we will perform some data analysis on a simple dataset of Pokémon statistics.

In this exercise, do not use the traditional for and while loops unless stated otherwise.

Download the Pokemon dataset. Next, create a new Scala object called Pokemon with a main method, as we did in the previous exercise. Load the file and print the first line.

import scala.io.Source

object Pokemon {
  def main(args: Array[String]): Unit = {
    val lines = Source.fromFile(YOUR_PATH).getLines().toList
    println(lines.head)
  }
}

Write a function that extracts the header and the content. This function must split the header and the lines of the content to make the columns appear. It must return a pair composed of an array of String (the header) and a list of array of String (a list of all the rows, where each row is an array of String representing each column). You can use the split method on a String.

def getHeaderContent(l: List[String]): (Array[String], List[Array[String]]) = {
    val header = l.head
    val content = l.tail
    (header.split(","), content.map(_.split(",")))
}
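
One detail of split is worth knowing when rows end with empty columns: by default, trailing empty strings are dropped, so such rows come back shorter than the header. The sketch below illustrates this on a made-up row. SplitDemo and fields are hypothetical names, and whether this matters depends on the dataset you downloaded.

```scala
object SplitDemo {
  // keepEmpty = true passes a negative limit, which preserves trailing empty fields.
  def fields(row: String, keepEmpty: Boolean): Array[String] =
    if (keepEmpty) row.split(",", -1) else row.split(",")

  def main(args: Array[String]): Unit = {
    val row = "Pidgey,45,," // made-up row whose last two columns are empty
    println(fields(row, keepEmpty = false).length) // prints 2
    println(fields(row, keepEmpty = true).length)  // prints 4
  }
}
```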

Write a function that takes the header as input and returns a Map from the column names to their index. We can use the .toMap method on an iterable of pairs. Print the index of the column attack.

def getIndex(header: Array[String]): Map[String, Int] = {
    header.zipWithIndex.toMap
}
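
As a quick illustration of how the resulting Map can be queried, here is a sketch on a made-up header. The object name IndexDemo and the column names are assumptions, since the real header depends on the dataset.

```scala
object IndexDemo {
  def getIndex(header: Array[String]): Map[String, Int] =
    header.zipWithIndex.toMap

  def main(args: Array[String]): Unit = {
    val header = Array("name", "attack", "defense") // made-up header
    val index = getIndex(header)
    println(index("attack"))    // prints 1
    println(index.get("speed")) // prints None: the column does not exist
  }
}
```

Calling index("speed") directly would throw a NoSuchElementException, whereas get returns an Option that can be checked first.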

Write a function to get a column, given the content and the column index.

def getColumn(content: List[Array[String]], index: Int): List[String] = {
    content.map(x => x(index))
}

Write a function to compute the mean value of a column given by its index. You can use the .toDouble conversion function.

def getMeanValue(content: List[Array[String]], index: Int) : Double = {
    val column = getColumn(content, index)
    column.map(_.toDouble).reduce(_ + _) / column.length
}

Given the content, a column index, and a number of bins, compute the size of each bin of the histogram of the data in that column. Recall that the N bins are created by splitting the data range into N equal intervals and then counting how many data points fall into each interval. You can use the .toDouble conversion function. Return a list sorted by bin (you can use sortBy).

def getHistogram(content: List[Array[String]], index: Int, nBins: Int): List[(Double, Int)] = {
    val column = content.map(x => x(index).toDouble)
    val start = column.min
    val end = column.max
    val width = (end - start) / nBins
    // Bin number of a value; the maximum is clamped into the last bin so that at most nBins bins exist
    def bin(x: Double): Int = math.min(((x - start) / width).toInt, nBins - 1)
    // Each bin is identified by its lower bound
    column.groupBy(x => bin(x) * width + start).view.mapValues(_.length).toList.sortBy(_._1)
}
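
To check the binning arithmetic by hand, the sketch below works on a small made-up list of values and labels each bin by its number instead of its lower bound. The names HistogramDemo and binCounts are hypothetical. It assumes the maximum value should be counted in the last bin, and bins that receive no value do not appear in the result.

```scala
object HistogramDemo {
  // Returns (bin number, count) pairs, sorted by bin number.
  def binCounts(values: List[Double], nBins: Int): List[(Int, Int)] = {
    val start = values.min
    val width = (values.max - start) / nBins
    values
      // Clamp so that the maximum falls into bin nBins - 1 rather than a bin of its own
      .groupBy(x => math.min(((x - start) / width).toInt, nBins - 1))
      .toList
      .map { case (bin, xs) => (bin, xs.length) }
      .sortBy(_._1)
  }

  def main(args: Array[String]): Unit = {
    // Range [0, 10] split into 2 bins of width 5
    println(binCounts(List(0.0, 1.0, 2.0, 3.0, 4.0, 10.0), 2)) // prints List((0,5), (1,1))
  }
}
```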

What is the Bird Pokémon with the highest attack? You can use the method maxBy.

val strongestBird = content
  .filter(x => x(index("classification")) == "Bird Pokémon")
  .maxBy(x => x(index("attack")).toDouble) // compare numerically: as strings, "90" > "100"
println(strongestBird(index("name")))