Notes taken by Luizo ( @LuigiBrosNin on Telegram)

  • Exam info collapsed:: true
    • Presentation, like the thesis of a paper, in the drive “StudentCBD” folder
    • Free choice of paper, no overlaps allowed (you need to fill in the form to get in) (links for 2024/25)
    • CHANGE THE REFERENCES FROM THE TEMPLATE PLS
    • You’ll get the slides in Google Slides through mail after filling in the form
    • Presentation within a 30m time slot
    • “expand the scopes of the paper”

Theory

Francesco Corigliano’s notes (in Italian); I studied from these anyway

1 Intro Models

Relational β†’ structured (basically)

NON RELATIONAL MODELS

  • the boolean model is not usable here: searches are not 100% relevant.

  • no schema

    Queries can’t manipulate data, but can retrieve it ordered by relevance

    RELATIONAL MODELS

  • set-based

  • not nested, not ordered

  • distinction between schema and data

  • boolean model

  • XML can represent it

    SEMI-STRUCTURED MODELS

  • show properties of both structured and non-structured data

  • retrospectively-built schema, always evolving

  • list-based

  • XML is the primary format

  • many queries can’t be written in SQL without compromises (recursion, slowness)

2 Intro XML

eXtensible Markup Language, comes from SGML (Standard Generalized Markup Language)

both can be used to define markup languages for many domains

XML Structure:

  • Prolog β†’ info for doc interpretation (eg. xml version, document type definition DTD)
  • Document Body β†’an element that can contain more nested elements and comments
    • each element is enclosed in a <tag></tag>

    • names are case sensitive

    • values of attributes must be enclosed in ‘’ or “”

    • an element can’t have the same attribute more than once

    • data must be contained between the opening and closing tag

2.2 SQL/XML - Relational Data and XML

collapsed:: true

XML is used to exchange data between apps and to represent structured data.

SQL/XML is a bridge for archiving this data locally in a relational manner

2 main functions to achieve conversions

  1. XML Extraction (SQL → XML, easy)
     - Mapping: each table becomes an XML document (name of table → name of doc, every line is a row element and each value (column) is included in an element with the column name)
     - Extracting data using XQuery: SQL → XML (SQL/XML)
  2. XML Archiving (XML → SQL, complex)
     - Columns are Object-Relational, to archive XML fragments in a single field
     - Fragmenting documents to be stored piece by piece

    SQL/XML Language β†’ SQL extension that contains constructors, routines, functions to support manipulation and archiving of XML in a SQL db.

  • Operators
    • XMLELEMENT β†’ creates an XML element with args

      • Name β†’ given explicitly trough a constant
      • Content β†’ you can specify any objects, elements or strings ( concat op β†’ || )
      • attributes β†’ optional list
    • XMLATTRIBUTES β†’ params corresponding to attributes, default name as the attr name

    • XMLFOREST → produces a list of simple elements, same params as XMLATTRIBUTES

    • XMLCONCAT β†’ concats a forest of elements

    • XMLAGG β†’ groups tuples based on one or more attributes, with GROUP BY op

    • XMLGEN β†’ specify the XML code to insert, you can use variables between {}


      The result gets calculated considering the SQL, making a SELECT * and then building the requested attributes 1 by 1.
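As an illustration (not the SQL/XML API itself), here is a sketch in Python of the XML that XMLELEMENT + XMLATTRIBUTES + XMLFOREST would produce for one row of a hypothetical Emp(name, salary) table:

```python
# Illustration only: building, with ElementTree, the XML that
# XMLELEMENT(NAME "Emp", XMLATTRIBUTES(...), XMLFOREST(...)) would emit
# for one row of a made-up Emp(name, salary) table.
import xml.etree.ElementTree as ET

row = {"name": "John", "salary": 50000}

emp = ET.Element("Emp", {"empname": row["name"]})  # element + attribute
for col in ("name", "salary"):                     # XMLFOREST: one element per column
    child = ET.SubElement(emp, col)
    child.text = str(row[col])

print(ET.tostring(emp, encoding="unicode"))
# <Emp empname="John"><name>John</name><salary>50000</salary></Emp>
```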

2.3 XQuery - XQuery language

XQuery can be used to access XML expressed data and has XSLT functions (eXtensible Stylesheet Language Transformations).

XQuery operates on sequences, which can be atomic values (eg. the string “hello”, the integer 3) or nodes.

an XQuery expression can receive one or more sequences as input, and outputs an ordered and not-nested sequence.

  • Ordered → items keep their order

  • Not-nested → a sequence of sequences gets flattened, eg. (1, (2, 3)) = (1, 2, 3)

  • sequence = element → a sequence containing a single item is equal to that item collapsed:: true

    Operators: , (concatenation), to (integer range), union (same as | ), intersect, except

    1. , → concatenates sequences
    2. union → nodes in either operand
    3. intersect → nodes in both operands
    4. except → nodes in the first operand but not in the second

    Nodes

    Attributes: when XQuery extracts attributes, they’re not ordered, but they get inserted in a sequence, preserving the order.

    Text: the string value of each node of type document / element is the ordered concatenation of all its text children.

    two text nodes can’t be adjacent (eg. <name>Jane Doe</name> → “Jane Doe” is one text node; “Jane” and “Doe” are not 2 different text nodes)

    Nodes can be of text, integer, date type.

    Path Expressions

    Used to extract values and properties from nodes and trees.

    "doc_name.xml"/child:doc/child:chapter/child:selection

    Each path gets evaluated in a context (a sequence of nodes with extra info, such as position) and produces a sequence.

    It then gets evaluated, and the context used is the sequence in the previous step, and so on.

    Step structure → a path expression is a stair of steps; each step has 3 parts:

    1. Axis → selects the nodes relative to the position of the context node
    • self::

    • child::

    • parent::

    • ancestor::

    • descendant::

    • following-sibling::

    • preceding-sibling::

    • attribute::

    • descendant-or-self::

    • ancestor-or-self::

    2. Test → filters result nodes based on:

    • Name

      • child::section β†’ return only the section tag within the children
      • child::* β†’ return all children
      • attribute::xml:lang β†’ returns xml:lang attribute
    • Type

      • descendant::node() β†’ returns all descendant nodes.
      • descendant::text() β†’ returns all descendant text nodes.
      • descendant::element() β†’ returns all nodes of descendant elements.
    • Both β†’ descendant::element(person, xs:decimal) β†’ returns person elements of type decimal 3. Predictate β†’ filters nodes on generic criteria, each step can terminate with one or more predicate, inb [] to filter even more, like an array


      predicate is true if

    • the expression returns a single integer value and the position of the context node corresponds to that value (child::chapter[2] returns the 2nd chapter child)

    • the expression returns a non-empty node sequence

      • child::chapter[child::title] → returns all chapter tags that have a title tag child.

      • child::chapter[attribute::xml:lang = "it"] → returns all chapters with the lang attribute set to “it”.

        predicate is false if the expression returns an empty sequence.

        Complete path expressions → start with / or // to select from the root of the document or from all the nodes, respectively.

        Shortened syntax

  • Omission of Child β†’ child::section/child::paragraph = section/paragraph

  • Substitution of the attribute axis with @ → para[attribute::type=“warning”] → para[@type=“warning”]

  • Substitution of descendant-or-self::node() with **//** β†’ div/descendant-or-self::node()/child::paragraph β†’ div//paragraph

  • Substitution of self::node() with . β†’ self::node()/descendant-or-self::node()/child::para β†’ .//para

  • Substitution of parent::node() with **..** → parent::node()/child::section → ../section
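The shortened syntax can be tried with Python’s ElementTree, which implements a small XPath subset (child paths, //, @attr predicates, positions); a toy sketch:

```python
# Toy demo of the shortened XPath syntax via ElementTree's XPath subset.
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<doc>"
    "<chapter><title>Intro</title><para type='warning'>careful</para></chapter>"
    "<chapter><para>plain</para></chapter>"
    "</doc>"
)

warnings = doc.findall(".//para[@type='warning']")   # .// + attribute predicate
second = doc.find("chapter[2]")                      # positional predicate (1-based)
with_title = doc.findall("chapter[title]")           # chapters having a <title> child

print([p.text for p in warnings], second.find("para").text, len(with_title))
```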

    FLWOR Expressions

    For, Let, Where, Order by, Return → the XQuery equivalent of SELECT-FROM-WHERE, but defined by binding variables

    1. For → associates one or more variables to expressions; iterates over all elements. For and Let create a list of tuples (all possible associations)
    2. Let → declares a variable, basically; you can aggregate expressions with functions, grouping elements to evaluate them (min, max if they’re numbers)
    3. Where → filters the list based on a condition (basically an if inside a for)
    4. Order by → orders the list based on parameters
    5. Return → creates the result
    6. Comments → (: like this :)
  • Join example


    tables

    We want to produce a screening schedule listing songs, author and birth + death dates.


    query


    output
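Not XQuery, but the FLWOR pattern maps almost one-to-one onto a Python comprehension (hypothetical data): for → for, where → if, order by → sorted, return → the result expression:

```python
# FLWOR mapped onto Python, on made-up data:
# for $b in $books  where $b/price > 35
# order by $b/price descending  return $b/title
books = [
    {"title": "XML Basics", "price": 30},
    {"title": "XQuery",     "price": 50},
    {"title": "NoSQL",      "price": 45},
]

result = [
    b["title"]
    for b in sorted(books, key=lambda b: b["price"], reverse=True)  # order by
    if b["price"] > 35                                              # where
]
print(result)  # ['XQuery', 'NoSQL']
```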

    Conditional expressions

    basically if-then-else, in xQuery constructed like this

    if ($product1/price < $product2/price)
    then $product2
    else $product1

    Comparison operators → =, !=, <, <=, >, >=

    they act on sequences, meaning $book1/author = β€œLuizo” is true if at least 1 selected node β€œauthor” is equal to Luizo.

    These operations are not transitive:

    (1, 2) = (2, 3) = TRUE

    (1, 2) != (2, 3) = TRUE

    XPath 2.0 adds → eq, ne, lt, le, gt, ge

    Expressions with quantifiers → XQuery variables are often associated with sets of objects instead of single values (eg. for $lib in doc("books.xml")/books/book assigns <book> elements)

    specific operators can verify properties of the objects, eg:

    some $emp in //employee → ∃ equivalent (true if at least one satisfies the condition)

    every $emp in //employee → ∀ equivalent (true if all satisfy the condition)

    Standard functions

    input functions, necessary to obtain the XML code to make the query

  • fn:doc('bib.xml') β†’ access a document

  • fn:collection('compositors') β†’ access a set of documents

    Node sequence functions

  • fn:position()β†’ current node position

  • fn:last() → position of the last node (= number of nodes in the context sequence)

  • fn:count($elements) β†’ node sequence cardinality

    Aggregated functions

2.4 XQuery in DBMS - XQuery on DB2, Oracle and SQL Server

DB2

DB2 can insert XML type columns into relational tables through INSERT + the XMLPARSE function.

XQuery queries are inserted in the SELECT clause through XMLQUERY, which returns XML fragments

XMLEXISTS → in the WHERE clause

XMLTABLE → in the FROM clause

DB2 supports all path expression axes, all nodes and all tests of the XPath standard.

to create an XML table, specify XML as the data type of a column.

  • DB2 Examples of XMLPARSE | XMLQUERY (literally just an XPath tutorial lol)


    to insert XML data, we can use XMLPARSE; we can choose whether to preserve whitespace.


    SELECT example that extracts departments and puts them in <deptName> through XMLQUERY


    passing → specifies the XML fragment we’re working on; in the example what gets passed is the content of the departments as a list, so it creates a $list variable that we can use inside the query.

    example that extracts names and salaries of all employees who earn > 50k, inserted in <empSalary> and ordered by salary.


    example that uses ==let and avg== to see the average salary of the department in which each employee works


    Union operator example (basically an OR between the two selections)


    Join equivalent example. We’re passing both employees content and departments content for it to work.


    Join + let example to get for each department, name + sum of salaries from that department


    Mishmash of all previous examples:

    for each department, we want the number of employees (BLUE), the average salary (GREEN). results must be sorted by the sum of the salaries (RED) of each department, and filter out departments with only 1 or less employees (PURPLE)

    XQuery in Oracle DB

    non-supported functions:

  • version encoding

  • xml:id

  • xs:duration (use xs:yearMonthDuration or xs:dayTimeDuration)

  • schema validation feature

  • module feature

  • no standard functions for regex

    • fn:id, fn:idref are not supported

    • fn:collection with no arguments not supported

      Table creation β†’ with XMLType keyword

      XMLType can be implemented as

      1. LOB (Large OBject) β†’ string
      2. Structured Storage β†’ automatic tables that follow the XML schema, preserving the DOM

      inserting an XML fragment in a table → use XMLType in an INSERT, followed by the XML as a string

  • Oracle Examples

    XQuery example in Oracle, same example as DB2


    Join equivalent example


    we can use doc() to query XML files directly without inserting data in tables

    a DUAL dummy table is present by default in all Oracle databases, used as a placeholder since “FROM” is required in SQL

    XQuery on SQL Server 2012

    non-supported features:

  • Sequences

    • must be homogeneous (composed only of nodes or only of atomic values)
    • the to range constructor is not supported
    • union, intersect, except cannot be used to combine node sequences
  • Path Expressions: only these are supported:

    • child, descendant, parent, attribute, self, descendant-or-self Axes
    • comment(), node(), processing-instruction(), text() Types
  • FLWOR

    • LET gets inlined at every mention of the variable, meaning the expression is recomputed each time the variable is referenced
  • idiv and not() are not supported

    SQL Server

    Table creation → declare a column with the xml data type

    to insert data in the table, INSERT op with an XML fragment as a string

  • Query example in Oracle Berkeley DB XML


    same query as before, but without β€œpassing”

    Oracle Berkeley β†’ set of libraries that can be used inside common programming languages to manage an XML DB

    Major Operations

    Create a container

    1. document-based container β†’ document archived as delivered to the system
    2. node-based container β†’ document gets turned into nodes and archived separately, they can be modified afterwards

    containers are comparable to tables in relational DBMSs

    we can use β€œputDocument <namePrefix> <string> [f|s|q]” to insert documents

    parameters

    f β†’ string is a path to an XML

    s β†’ string is a XML fragment

    it’s possible to bind an XML schema to a container so that you can only insert documents that satisfy that schema into the database

    Indexes can be created to improve performance

    retrieving documents from a container is done through XQuery queries

2.5 NoSQL

means non-relational databases; there are 225+ systems that are NoSQL

Properties of BASE (the ACID counterpart: Basically Available, Soft state, Eventual consistency)

  • no ACID properties

    • weak consistency
    • more powerful
    • easy to use and fast
    • optimistic
  • no Schemas → replaced by non-relational models: key-value, document, graph

  • Availability → data is always available, but “failure” or inconsistent data are also possible results

  • Soft state → the state of the system can change over time

  • Eventual consistency → sooner or later data will be coherent, but there is no consistency verification on every transaction

    CAP compromise (triangle)

    1. Consistency → every node sees the same data at the same time
    2. Availability → every request gets an answer
    3. Partition Tolerance → still works on data loss; a single node error doesn’t cause a system collapse

    in NoSQL, there is no schema, no unused cells and data types are implicit. most considerations are executed at application level.

  • Aggregation β†’ elements are gathered in an aggregate (document)

  • Aggregate β†’ cluster of domain objects that can be treated as an unity. aggregates are the base example of transaction and archiving of data

  • Polygot persistence β†’ Hybrid approach at data persistence, there are many db models made to solve different needs.

    Schema replacements

    Key - Value → hash table

    Key → string

    Value → any data type, usually a block of uninterpreted data (binary / JSON), stored as a BLOB.

    4 base operations

    1. Insert(key, value)
    2. Fetch(key)
    3. Update(key)
    4. Delete(key)
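A minimal sketch of the four operations over a Python dict (values kept as opaque blobs, as described above); the class and method names are made up for illustration:

```python
# Minimal in-memory key-value store: the four base operations over a dict.
class KVStore:
    def __init__(self):
        self._data = {}                 # key (str) -> value (opaque bytes blob)

    def insert(self, key, value):
        self._data[key] = value

    def fetch(self, key):
        return self._data.get(key)      # None if absent

    def update(self, key, value):
        if key not in self._data:
            raise KeyError(key)         # update only touches existing keys
        self._data[key] = value

    def delete(self, key):
        self._data.pop(key, None)

kv = KVStore()
kv.insert("user:1", b'{"name": "Ada"}')      # value is not interpreted
kv.update("user:1", b'{"name": "Ada L."}')
print(kv.fetch("user:1"))
kv.delete("user:1")
```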

    Data model of documents β†’ XML, JSON, text or binary BLOB

    any tree structure can be represented as XML or JSON, similar to key-value, but the value is a document.

    pros:

  • Rich data structure β†’ incorporates data in secondary documents and arrays, supports advanced queries and indexing

  • aggregated data in a single structure β†’ good performance

  • dynamic schema → the data can quickly adapt to changing needs

    Columnar data model β†’ archiving tables as columns of data instead of rows

    Columnar β†’ extensions to traditional tables (adding info)

    Graph data model β†’ graph algorithms are used, graph theory is used to structure data

  • vertical scaling

  • transaction

  • ACID

3.1 IR - Information Retrieval

Means finding non structured material that satisfies an information need, usually about web search, social media, local laptop usage.

Document β†’ web page, email, book, article, messages, word, pwp, pdf, forum posts, etc

Common proprieties β†’ meaningful text content + non structured data

Collection β†’ series of documents (static collection)

Query β†’ set of key words to express an informatory need

Our objective β†’ recover relevant docs and help the user complete the activity

IR vs Data Retrieval

Data retrieval (DBMS) β†’ database record

  • structured into fields

  • consistent values and types

  • pre-defined formats for storage

    IR β†’ text document

  • unstructured

  • can contain any kind of data

  • expressed in natural language

    Query language

  • based on a formal grammar

  • answer is independent from implementation

  • answer β†’ predictable set of records that are true under a query

    Search language

  • based on keywords

  • can be interpreted differently

  • answer β†’ ranked list of results based on likeliness from query

    Text comparison

    comparing the query text with the document text and determining their coherence is the main problem in IR.

    We need a pertinence-based ranking instead of an exact match, to accommodate natural language.

    Retrieval models

    1. classic models β†’ boolean retrieval, spatial-vectorial

    Boolean retrieval → simple, sees documents as bags of words, precise

    Incidence vectors → even slightly complex queries would require a term-document matrix of billions of cells; to save space we store only the positions of the 1s.

    Inverted index → we use posting lists to avoid wasting memory with fixed arrays

    1. probabilistic models β†’ BM25, linguistic models
    2. combining evidence β†’ inference networks, learning to rank
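The inverted index above can be sketched in a few lines of Python: posting lists of doc ids per term, with a boolean AND answered by intersecting them (toy documents):

```python
# Inverted index sketch: term -> postings (set of doc ids);
# a boolean AND query intersects the postings lists.
from collections import defaultdict

docs = {
    0: "brutus killed caesar",
    1: "caesar was ambitious",
    2: "brutus was honourable",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for token in text.split():          # trivial tokenizer
        index[token].add(doc_id)

def boolean_and(*terms):
    """Doc ids containing every term (intersection of postings)."""
    postings = [index[t] for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

print(boolean_and("brutus", "caesar"))  # [0]
```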

    Text processing

  • Tokenization β†’ cut character sequence into word tokens

  • Normalization β†’ map text terms to same β€œnormal” form (eg. U.S.A and USA)

  • Lemmatization → reduce inflected forms to base forms (eg. the boy’s cars are different colors → the boy car be different color)

  • Stemming β†’ different forms of a root to match (authorize, authorization)

  • Stop words β†’ omit very common words (the, a, to, of)

  • Dictionary β†’ lexical indexes, used for autocorrection too

3.2 Ranking

Ranked retrieval → the system returns a ranked list of the top documents for a query

Vector Space Model (VSM)

Ranking mode, documents are represented by vectors or matrix based on weight of words

Docs as vecs properties

  • the vector space is |V|-dimensional

  • terms are axes of space

  • documents are points

  • millions of dimensions on search engines, most empty

    Query as vecs properties

  • represents queries as vectors in space

  • classifies documents based on distance (near β†’ vectors are alike)

  • classifies documents on relevance, from least distant

    Term Frequency (TF) → assigning a weight to each term in the vector space, based on frequency.

    tf(t,d) → term frequency of term t in document d

    Vector space distance

    D1,D2,D3 are phrases, Q is a query

    Euclidean distance → straight-line distance between the endpoints; it doesn’t work here because it is sensitive to vector length (a longer document about the same topic ends up far from the query)

    Using angles

    by taking the angle most similar to the vector phrase, we achieve what we need

    ranking documents by ascending angle and by descending cosine is the same thing, so we use the cosine

    Cosine is a monotonically decreasing function on [0°, 180°], meaning it decreases as the angle grows

    Cosine similarity

    cos(q, d) = (q · d) / (|q| |d|) = Σ_i q_i d_i / ( sqrt(Σ_i q_i²) · sqrt(Σ_i d_i²) )

    q_i → tf weight of term i in the query

    d_i → tf weight of term i in the document

    cos(q, d) is the cosine similarity between q and d, i.e. the cosine of the angle between q and d.

    Search engines

    knowing what the user wants

  • information need β†’ localizing info that we’re looking for

  • user query β†’ the terms that express the information needs

  • query intent β†’ the user activity

    how to search through millions of documents

  • Topological network measures (PageRank)

  • field boosting measures (Title, subtitle, body on different weights)

  • semantic and time-space properties (freshness of page)

    Search Engine Results Page (SERP)

    users click the better results: the 1st result is clicked about 10x as often as the 6th one

    to be in the 1st position, the cost per click is about 10 times higher than for the 6th one

    common words are less informative than less common terms

    Document Frequency df_t → number of documents that contain term t

    Inverse Document Frequency idf_t = log(N / df_t) → how rare a word is

  • N → number of docs

  • we apply the logarithm to mitigate the exponential growth of really common words (Zipf’s law)

    TF × IDF weights get used a lot together for better estimates

    Probabilistic Models

    using probability to estimate relevance is the most efficient way of retrieving docs

    1. Okapi BM25 (weighting scheme) → uses doc length, term frequency and term rarity to weight documents
    2. similarity scores are not bounded to [0, 1] (unlike cosine similarity)
    3. params
      1. k1 → controls saturation of term frequency, default is 1.2
      2. b → controls how much weight doc-length normalization has, default is 0.75 (0 → no normalization, 1 → complete normalization)
    4. Language model → assigns a probability to a word sequence through a probability distribution
    5. a language model M_d is tied to each doc of the collection; docs are classified by the probability of the query Q under the model M_d
    6. Unigram LM → text generation is extracting words from a bucket following their probability
    7. N-gram LM → same, but uses the N previous words to calculate the next one
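A minimal BM25 scorer in Python with the defaults above (k1 = 1.2, b = 0.75); the idf variant used here is the simplified log(N / df), one of several in the literature:

```python
# Minimal Okapi BM25 sketch: idf * tf saturation * doc-length normalization.
import math

def bm25_score(query, doc, docs, k1=1.2, b=0.75):
    N = len(docs)
    avgdl = sum(len(d.split()) for d in docs) / N      # average doc length
    terms = doc.split()
    dl = len(terms)
    score = 0.0
    for t in query.split():
        df = sum(1 for d in docs if t in d.split())
        if df == 0:
            continue
        idf = math.log(N / df)                         # simplified idf
        tf = terms.count(t)
        # k1 caps the tf contribution; b scales the length penalty
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
    return score

docs = ["black cat", "black black dog", "white bird"]
print(bm25_score("black", docs[1], docs))
```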

    Ranking efficiency

    ranking is computationally costly, even more so if the query is complex

    some efficiency checks:

  • how we access documents is relevant

  • we’re mostly not interested in less relevant documents

    we have a latency limit to satisfy the user, and we can’t afford to compute a complete score for every document for every query

    TAAT vs DAAT

    TAAT (Term At A Time) β†’ scores for all docs concurrently, one query term at a time

    DAAT (Document At A Time) β†’ total score for each doc

    Safe ranking → guarantees the K returned results are the highest scored overall

    Non-safe ranking → docs close to the top-K could be enough

  • index elimination β†’ consider only rare terms (high IDF), consider only documents that contain a lot of query terms

  • champion lists β†’ pre-compute scores of top relevant documents

3.3 Web Information Retrieval

Web crawling β†’ process of gathering pages from the web

Crawler (spider)

takes known seed URLs, fetches and parses them and puts them on a queue called URL frontier, then fetches each URL on the frontier and repeats

a crawler requires

Robustness β†’ web scan on distributed machines

  • malicious pages (spam - spider traps (endless pages))

  • latency, webmaster conditions, depth of hierarchy scan, mirror sites and dupe pages

    Politeness → respect web servers’ policies

    explicit β†’ specs from webmasters on which portions can be crawled

    implicit β†’ avoid hitting same site too often

    robots.txt → explicit protocol, gives limited access to a website, defines rules

    URL Frontier

    can search breadth-first (only the 1st pages of each site first) or depth-first (follows links until a page without further links)

    rankings are constructed combining

  • content β†’ can use RI models like probability based ones

  • popularity β†’ calculated trough hypertext links analysis using one or more link analysis models

    Simple link analysis β†’ good nodes do not go into bad ones, so we mark them as good

    Citation analysis β†’ popularity estimate

  • co-citation of articles means they’re correlated

  • indexing of citation for author searches

  • Page-Rank preview

    Page-Rank β†’ 0 to 1 assignment to each node from the link structure. for each query, this gets integrated into the search engine results to define the classification list.

    the probability of a user being on a given site (during browsing) is that site’s page-rank.
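The idea can be sketched with power iteration on a toy three-page graph (damping factor 0.85, a commonly cited value; this is an illustration, not Google’s actual implementation):

```python
# PageRank by power iteration on a tiny link graph.
def pagerank(links, d=0.85, iters=50):
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}          # uniform start
    for _ in range(iters):
        new = {}
        for p in pages:
            # rank flowing into p from every page q that links to it
            inflow = sum(pr[q] / len(links[q]) for q in pages if p in links[q])
            new[p] = (1 - d) / n + d * inflow
        pr = new
    return pr

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))  # C collects the most rank here
```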

    some factors Google uses for ranking:

  • Page ranking

  • language (phrases, errors, synonyms)

  • time characteristics (some queries look for recent indexing, like news)

  • personalization (history, social ambient, etc)

    HITS β†’ Hyperlink-Induced Topic Search

    useful for wide argument queries, 2 series of correlated pages

  • Hub page β†’ points to many trusted pages

  • Authority page β†’ pointed by good hub pages

    Semantic search → graph search over structured knowledge rather than plain text matching

3.4 IR Evaluation

measuring relevance

  1. benchmark doc collection
  2. benchmark suite of queries
  3. human assessment of relevancy

user needs are translated into a query, but relevancy is assessed by the user.

Relevance judgments

  • Binary (relevant / not relevant)

  • considering all sets of docs is expensive

  • depth-k pooling solution → take the top-k docs of n different information retrieval systems, obtaining a small pool to judge instead of the entire collection

    Collections Text REtrieval Conference (TREC) β†’ biggest available web collection, 25m pages.

    Mechanical Turk β†’ presenting query-doc pairs to low-cost labor on online crowd-sourcing platforms, a lot of variance in resulting judgements

    the effectiveness of an IR system is dictated by

  • Precision β†’ relevant results / total results

  • Recall β†’ relevant results / relevant docs in the collection

    @K β†’ parameter, threshold rank, made to ignore documents < K

    Average Precision β†’ the average of @K precision calculated on Ks were it was retrieved the relevant result, stops at Recall@K = 1

    Mean Average Precision (MAP) β†’ macro-average, aka measured from the average on all queries

    Precision 0 β†’ if a relevant doc never gets retrieved

    non-binary notion of relevance β†’ spectrum of relevancy, measure is DCG (Discounted Cumulative Gain) or NDCG (Normalized Discounted Cumulative Gain)
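Precision@K and Average Precision sketched in Python, with a ranking given as relevance flags (1 = relevant) in rank order:

```python
# Precision@K and Average Precision over a ranked list of relevance flags.
def precision_at_k(ranking, k):
    return sum(ranking[:k]) / k

def average_precision(ranking, total_relevant):
    # average of P@K at the ranks K where a relevant doc was retrieved
    hits, total = 0, 0.0
    for k, rel in enumerate(ranking, start=1):
        if rel:
            hits += 1
            total += hits / k
    return total / total_relevant

ranking = [1, 0, 1, 0, 1]          # relevant docs at ranks 1, 3, 5
print(precision_at_k(ranking, 3))  # 2/3
print(average_precision(ranking, total_relevant=3))
```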

3.5 IR Advanced methods

Synonymy β†’ using different words to refer to the same thing, it’s a problem that impacts the majority of IR tools’ recall.

Query extension β†’ expanding the query with synonyms

eg. β€œcar” β†’ β€œcar automobile vehicle automobiles”

there are many query extension techniques

Correlated terms

  • trough wordnets

  • co-occurring terms in the same documents or in an adjacent relevant document

    Thesaurus query expansion β†’ automatic expansion based on vocabulary is not effective (a word means different things in different contexts)

    we can use co-occurrence measures, based on:

    1. Dice’s coefficient → dice(a, b) = 2 · n_ab / (n_a + n_b); the higher the score, the more correlated the words.

    where:

    a, b → words

    n_a, n_b → documents containing a and b respectively

    n_ab → documents containing both

    2. Mutual information → based on probability: MI(a, b) = log( P(a, b) / (P(a) · P(b)) )

    where:

    a, b → words

    P(a), P(b) → probability of the word occurring in a window of text

    P(a, b) → probability of the two words occurring together in a window of text

    This method inherently favors terms of low frequency

    3. Expected mutual information → fixes the low-frequency problem by weighting by P(a, b): EMI(a, b) = P(a, b) · log( P(a, b) / (P(a) · P(b)) )

    4. Pearson’s Chi-squared (χ²) → compares the number of co-occurrences with the expected number of co-occurrences if the words were independent, and normalizes it (dividing it) by the expected number

    measures are based on documents or smaller parts.
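Dice’s coefficient computed from document counts on a made-up toy collection (Python sketch):

```python
# Dice coefficient between two terms: dice(a, b) = 2 * n_ab / (n_a + n_b),
# with n_a, n_b, n_ab counted over a toy document collection.
docs = [
    "tropical fish tank",
    "tropical fish food",
    "aquarium fish",
    "tropical island",
]

def dice(a, b):
    n_a = sum(1 for d in docs if a in d.split())
    n_b = sum(1 for d in docs if b in d.split())
    n_ab = sum(1 for d in docs if a in d.split() and b in d.split())
    return 2 * n_ab / (n_a + n_b) if n_a + n_b else 0.0

print(dice("tropical", "fish"))  # 2*2 / (3+3)
```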

    Query expansion with relevance feedback (RF) β†’ makes use of refinement techniques based on user feedback

    Pseudo RF (blind relevance feedback) β†’ automates the manual RF, calculating it based on the first results

    Machine learning (ML) and IR β†’ building a classifier using AI, based on term frequency in body and title, doc length and doc popularity (pagerank).

    our goal is to build a classifier for relevant and non-relevant docs.

    β€œluckily”, lots of training data is available.

4.1 Data analytics

DIKW Pyramid β†’ how information is defined along with knowledge and wisdom

Data analysis β†’ extract information from the data

we can extract information as

  • summaries, such as the average of a group of numbers

  • associations, creating relations

    Population → set of relevant objects to get info from

    eg. all Los Angeles houses, all university students, all recipes of grocery stores

    Record → tuple of values, characterizes a population element (eg. a house’s square meters, price, etc)

    Variable β†’ name of the value of a record

  • Numerical variable: we can apply arithmetic to it. the numbers can be discrete or continuous.

  • Categorical variable

    • Ordinal, if they can be ordered (eg. school grades)

    • Nominal, eg. colors

      Dataset β†’ a set of records becomes a unique table

      Descriptive statistics → indicators that can identify the statistical properties of a population based on a single variable

      Centrality

      Arithmetic mean → used to fill missing data, sensitive to anomalies (outliers)

      we can maintain the same mean by replacing missing data with the mean itself.

      Median β†’ effective value of the observations, immune to anomalies, but needs ordered data.

      Mode β†’ most frequent effective value, robust to anomalies, can be applied to categorical variables.

      Variation

      Squared Deviation → sums the squared difference between each observation and the average → SS = Σ_i (x_i − x̄)²

      deviation is higher the more values are far from the average, and the more values there are, the higher the deviation.

      Variance → (written as σ² or s²) normalizes the Squared Deviation by the number of values: σ² = SS / n

      Single variable indicators

  • Min β†’ minimum value in observations

  • Max β†’ maximum value in observations

  • Range β†’ difference between max and min Multiple variable indicators in descriptive statistic, they describe the relation between variables, aka how and if they’re linked (eg. houses price to squared meters).

  • Covariance

  • Correlation

    Association measures

  • Covariance → strength of the relation between 2 variables: if the covariance is high (many X and Y diverge at the same time) there’s a relation.

    Positive covariance → directly proportional, X and Y diverge in the same direction

    Negative covariance → inversely proportional, X and Y diverge in opposite directions

    Covariance changes value based on the measurement units: big values give big results, which is not a problem with correlation

  • Correlation → divides covariance by the product of the standard deviations, limiting values between -1 and 1.

    • tends to -1 → opposite variation, inversely proportional
    • tends to 1 → same variation, directly proportional
    • tends to 0 → independent variation, no linear relation

    Dispersion (scatter) and regression graphs → useful to visualize the values and see if there’s an association
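Population covariance and correlation sketched in Python on made-up house data:

```python
# Population covariance and Pearson correlation: correlation rescales
# covariance into [-1, 1] by dividing by the standard deviations.
import math

def covariance(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

def correlation(xs, ys):
    sx = math.sqrt(covariance(xs, xs))   # standard deviation of x
    sy = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (sx * sy)

sizes  = [50, 70, 90, 110]     # square meters (toy data)
prices = [100, 150, 190, 240]  # thousands (toy data)
print(round(correlation(sizes, prices), 4))
```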

4.2 Data Warehouse

Data Warehouse → unlike operative databases (which support software apps), a DW is a big repository that consolidates data from different sources.

  • Is updated offline

  • Follows the multidimensional data model

  • Needs complex queries to analyse and export data (computationally costly)

  • Implemented using the star or snowflake schema to efficiently support OLAP queries.

    Decision process based on data → companies may use data to take informed decisions; decision makers need to know the techniques of the data models

  • Data Warehouse properties:

    • Oriented at subject β†’ targets one or several subjects of analysis, according to requirements

    • Integrated β†’ the contents result from the integration of data from various operational and external systems

    • Nonvolatile β†’ accumulates data from operational systems for a long period of time, modification and removal are not allowed, except purging of obsolete data

    • Time-varying β†’ keeps track of how the data evolved over time

      Design of databases

  • Requirement specification

  • Conceptual design

  • Logical design

  • Physical design

    The final result is a relational database.

    Design of data warehouses: the relational paradigm is not appropriate for data warehouses, as they are aimed at specific end users, need to be comprehensible and must have good performance.

    multidimensional modeling paradigm → data structured as facts linked to dimensions.

  • Fact β†’ the object of interest (eg. sales if we’re sellers)

  • Measures β†’ numeral values that quantify the facts

  • dimensions

    • Temporal β†’ changes trough time

    • Positional β†’ facts trough geographical distribution

    • Hierarchy β†’ different levels of detail for exporting

    • Aggregation β†’ measure are aggregated after a hierarchy passage (eg. a month passes, aggregate months sales)

      Redundancy β†’ every dimension has a univocal table No redundancy β†’ there are more normalized tables

      Populating a DW: ETL → Extraction, Transformation, Loading

  • Extraction β†’ relevant data gets selected and extracted

    • Static extraction β†’ 1st population, makes an β€œinstant” of operative data
    • Incremental extraction β†’ update to the DW, captures the changes from the last extraction (based on timestamp and DBMS register)
  • Transformation β†’ converts data from operative to the standardized of the DW.

    1. Cleaning β†’ removes errors and incongruities, converts to standardized format
    2. Integration β†’ reconciles data from different sources, schema and data
    3. Aggregation β†’ summarizes data based on granularity
  • Loading β†’ feeds the DW with the transformed data, including the updates (updates propagation)

    OLTP Systems β†’ Operative databases (elaboration of online transactions) OLTP are traditional databases optimized to support daily operations:

  • Rapid access to data

  • Elaborating transaction

  • Coherent, up-to-date online data

    Properties of OLTP:

  • Detailed data

  • No historical data

  • Highly normalized

  • Bad performances on complex queries

    Data analysis requires a new paradigm, OLAP

  • Focuses on analytical queries, which involve data reconstruction and aggregation

  • Support for loads of heavy queries

  • Indexing aimed to access few records

  • OLAP queries include aggregation

    At the DW backend we have:

  • ETL tools

  • Data staging β†’ intermediate database where all the modification processes get executed at DW level we have

  • Enterprise DW β†’ centralized warehouse, includes organization

  • Data mart β†’ DW of specialized departments

  • Metadata β†’

    • Enterprise metadata β†’ describe semantics and rules, constraints and politics of data. Security and monitor info. They describe data sources (schemes, proprieties, update frequencies, limits, access methods)
    • Technical metadata β†’ describe data storing and apps and processing acting on data ( ETL operations ). They describe the structure of DW and Data mart, at a conceptual/logical level, and physical level
    • Repository β†’ memorizes info on DW and its contents at OLAP level we have β†’ a server that provides a multidimensional view of data a Frontend has
  • visualization of data

  • client tools to use the contents of the DW

    • OLAP tools β†’ provide interactive exploration and manipulation of data, and the formulation of complex queries ad hoc

    • Reporting tools β†’ reports management and production, they use predefined queries used regularly

    • Static tools β†’ analysis and visualization of data cubes trough static methods

    • Data mining tools β†’ allow users to analyze data to acquire knowledge on models and tendencies, used to make predictions

      Multidimensional model (MultiDim)

      MultiDim visualizes data in an n-dimensional space: a cube composed of dimensions (the perspectives of data analysis) and facts.

  • Example Dimensions β†’ Client, Time, Product. Dimensions are described by attributes. Facts β†’ Sales. every cell of the cube is the no. of units sold per category, time unit and a client’s city Measures β†’ Quantities

    • Additive β†’ summary trough addition (the most common one)

    • Semi-additive β†’ summary trough addition of some dimensions (not all)

    • Non-additive

    • Distributive β†’ defined by an aggregation function that can be calculated in a distributive way

    • Algebraic β†’ defined by an aggregation function that can be expressed as a scalar function way

    • Holistic β†’ cannot be calculated by under aggregated data. Members β†’ instances of a dimension each cube of data contains different measures, a cube can be sparse or dense (since not all data could be present, eg. when there’s no data at all)

      Data granularity β†’ level of detail of measurements for each cube dimension (eg. Time: All - Year - Semester)

      Hierarchies β†’ can visualize data at different granularity levels. we can define mappings for the granularity levels, as well as a schema of a dimension

  • OLAP queries for cubes (kinda useless)

    Roll-up → going to a higher level (specific to general)
    Drill-down → going to a more specific level (general to specific)

    In SQL, you access levels like so:

    SELECT SUM(F.vendite)
    FROM Time T JOIN Fatto F ON T.id = F.time_id -- join condition assumed
    WHERE T.year = 2018
    GROUP BY T.year <---- the GROUP BY picks the hierarchy level

    Pivot β†’ rotation of the cube, basically changing the group by Slice β†’ just a WHERE statement Dice β†’ selecting more than 1 slice, literally an AND/OR in the WHERE

4.3 Data Mining

Data mining β†’ computing process of discovering patterns in large data sets involving methods at the intersection of machine learning, statistics and db systems Patterns β†’ regularities, patterns (eg. house prices are roughly 2500€ per , buying cereals and buying milk) Knowledge β†’ understanding the logic behind the patterns. to get knowledge we need big data and different observations on homogeneous data

  • Data extraction steps

    • Problem definition

    • Identify required data

    • Pre-elaboration β†’ clean the data so that can be input to a ML algorythm

    • Hypothesis formulation

    • Training and testing

    • Verify and distribute

      Machine Learning

  • Does the abstraction from data to patterns.

  • Non linear interactions are hard to spot

  • Manual codification of some algorithms can be hard

  • (almost) never 100% accurate

    Supervised techniques → we use a "training set" of data with a variable that is the "solution", acting as output examples

    Supervised learning splits into:

  • Regression (if output is a numerical value) β†’ easy to calculate accuracy or error

  • Classification (if output is a class) →

    • Binary classification (cat / not cat)
    • Multi-class classification → classes are a finite set of outputs

    Unsupervised techniques → no output variable, but there are fixed objectives to reach:
  • Clustering β†’ divides examples in clusters, similar examples in same cluster

  • Association rules β†’ variable relations (eg. {onion, potato} β†’ {hamburger})

  • Spotting anomalies β†’ sus data gets sussed out

  • Generative models β†’ from examples they generate an instance with similar characteristics

  • Characteristics extraction → reduces the number of variables by generating new characteristics
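Clustering, the first unsupervised technique above, can be sketched in a few lines. A minimal one-round k-means-style example with hypothetical 1D data (real clustering iterates until the centroids stop moving):

```python
# One round of k-means-style clustering in 1D: similar examples end up in the
# same cluster; points far from every centroid could be flagged as anomalies.

def assign_clusters(points, centroids):
    # Each point joins the cluster of its nearest centroid
    clusters = {c: [] for c in centroids}
    for p in points:
        nearest = min(centroids, key=lambda c: abs(p - c))
        clusters[nearest].append(p)
    return clusters

def update_centroids(clusters):
    # Each centroid moves to the mean of its cluster
    return [sum(ps) / len(ps) for ps in clusters.values() if ps]

points = [1.0, 1.2, 0.8, 9.9, 10.1, 10.0]
centroids = [0.0, 10.0]                    # initial guesses
clusters = assign_clusters(points, centroids)
centroids = update_centroids(clusters)
print(centroids)  # ≈ [1.0, 10.0] — two clear groups
```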

    Data pre-treatment

    Pre-treatment is common, since real-world data is incomplete, noisy and incoherent.

    Pre-elaboration phases:

  • Cleaning

    1. Infer missing data
    2. Identify wrong data, smooth noisy data
    3. Reconcile incoherent data
    4. Resolve redundancy
  • Integration

  • Transformation

    Denoising data Binning β†’ populating bins of similar data and leveling each bin locally (averaging out values for each bin so they’re the same) Linear regression β†’ searching the best line that can adapt attributes to favor previsions. Clustering β†’ similar values are grouped together, values outside of clusters are anomalies and get scrapped.

    Data Warehousing β†’ uploading phase Data mining β†’ machine learning phase

    Data integration β†’ combines entities and attributes from different sources

    Entity identification: given tables A and B, does A.customer_id correspond to B.customer_number? How do we disambiguate for sure?

    attribute metadata:

  • Attribute name β†’ string distance between names

  • Attribute type β†’ both string type

  • Value interval β†’ both have the same interval of values impossible to be 100% accurate

    Data transformation β†’ data into vectors

    Recall β†’ hypothesis function modeled by AI

    Codification problem β†’ not all data is numerical

    Scale problem β†’ big numbers are not good in ML, since they work well on normalized values

    Text codification: to solve the codification problem, we encode the strings into vectors (vectorization)

    Bag-Of-Words (BOW) β†’ every variable of a vector is a specific word with boolean values, true if the word appears in the text.

    Category codification: one-hot codification → same as BOW, but every variable is a category, and only one can be true
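Both encodings are a few lines each. A minimal sketch with a hypothetical vocabulary and category set:

```python
# Bag-of-words with boolean values, and one-hot encoding of a category.

def bag_of_words(text, vocabulary):
    # One boolean component per vocabulary word: True if the word appears
    words = set(text.lower().split())
    return [w in words for w in vocabulary]

def one_hot(value, categories):
    # Like BOW, but exactly one component is True
    return [value == c for c in categories]

vocab = ["milk", "cereal", "onion"]
print(bag_of_words("buy milk and cereal", vocab))  # [True, True, False]

colors = ["red", "green", "blue"]
print(one_hot("green", colors))                    # [False, True, False]
```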

    Resizing (normalization) → better performance and precision in ML, usually obtained through linear rescaling.
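Min-max normalization is the simplest linear rescaling, mapping a variable into [0, 1]. A minimal sketch with hypothetical house prices:

```python
# Min-max normalization: a linear rescaling of values into [0, 1].

def min_max_normalize(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

prices = [100_000, 250_000, 400_000]
print(min_max_normalize(prices))  # [0.0, 0.5, 1.0]
```

This fixes the scale problem above: the big raw prices become small values on a common scale.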

Exam

Exam questions with answers (in Italian)