Big Data Quality: the next semantic challenge
Maria Teresa PAZIENZA
Academic year 2018-19

(BIG) DATA IS ONLY AS USEFUL AS ITS QUALITY

Introduction
Since Big Data is big and messy, its challenges can be classified into engineering tasks (managing data at an unimaginable scale) and semantics (finding and meaningfully combining information that is relevant to your needs).

Challenges for Big Data
Identify relevant pieces of information in messy data.
Named entity resolution (event extraction in tweets and short texts)
Coreference resolution (deciding whether two mentions refer to the same entity; indexing billions of RDF triples; easy-to-use data formats: RDF/RDFS, OWL)
Information extraction (difficult to scale)
Paraphrase resolution (identifying the entry in a given knowledge base to which an entity mention in a document refers)
Ontology population, entity consolidation (organizing extracted tuples in a queryable form, such as instances of ontologies, tuples of a database schema, or sets of quads: subject, predicate, object, context)

Basic assumptions
Datasets published on the web of data cover a diverse set of domains.
Data on the web reveals a large variation in data quality.
Data extracted from semi-structured sources (DBpedia etc.) often contain inconsistencies as well as misrepresented and incomplete information.
Even datasets with quality problems might be useful for certain applications, as long as the quality is in the required range (in different application contexts).

Variety as a Big Data issue
Variety as a Big Data issue is distinct in that established small-scale methods are insufficient.
The Big Data notion of variety is a generalization of semantic heterogeneity, as studied for many years in the fields of databases, artificial intelligence, the semantic web and cognitive science.

Quality on the Web: specific aspects
Coherence via links to external datasets
Data representation quality
Consistency with regard to implicit information (inference mechanisms for knowledge representation formalisms on the web, such as OWL, usually follow an open world assumption, whereas databases usually adopt closed world semantics)
Ontology quality
No consensus on how data quality dimensions and metrics should be defined

Quality on the Web: specific aspects
The challenges are related to the openness of the web of data, the diversity of the information, and the unbounded, dynamic set of autonomous data sources and publishers.

Dimensions of data quality
Organized into two categories: contextual, referring to attributes that are dependent on the context in which the data are observed or used, and intrinsic, referring to attributes that are objective and native to the data.

Contextual dimensions of data quality
Include at least relevancy, value added, quantity, believability, accessibility, understandability, availability, verifiability and reputation of the data.
Contextual dimensions of data quality lend themselves more towards information as opposed to data, because these dimensions are formed by placing data within a situation or problem-specific context.

Intrinsic dimensions of data quality
Intrinsic data quality has 4 dimensions:
Accuracy (degree to which data are equivalent to their corresponding real values)
Timeliness (degree to which data are up-to-date: currency, the length of time since the record's last update, and volatility, which describes the frequency of updates)
Consistency (degree to which related data records match in terms of format and structure)
Completeness (degree to which data are full and complete in content, with no missing data)
Example: address

Intrinsic dimensions of data quality
Table 1. Dimensions of data quality.

| Data quality dimension | Description | Supply chain example |
| --- | --- | --- |
| Accuracy | Are the data free of errors? | Customer shipping address in a customer relationship management system matches the address on the most recent customer order |
| Timeliness | Are the data up-to-date? | Inventory management system reflects real-time inventory levels at each retail location |
| Consistency | Are the data presented in the same format? | All requested delivery dates are entered in a DD/MM/YY format |
| Completeness | Are necessary data missing? | Customer shipping address includes all data points necessary to complete a shipment (i.e. name, street address, city, state, and zip code) |
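
The four intrinsic dimensions in Table 1 can be turned into simple measurable scores. The following is a minimal Python sketch, not a standard metric definition: the field names, the scoring as fractions, and the function names are all assumptions made for illustration, following the supply chain examples in the table.

```python
import re
from datetime import date

# Hypothetical field names for a shipping address (assumption, see Table 1).
REQUIRED_FIELDS = ("name", "street", "city", "state", "zip")
# Checks only the DD/MM/YY shape, not whether the date exists in the calendar.
DATE_SHAPE = re.compile(r"^\d{2}/\d{2}/\d{2}$")


def completeness(record):
    """Fraction of required fields that are present and non-empty."""
    filled = sum(1 for f in REQUIRED_FIELDS if str(record.get(f) or "").strip())
    return filled / len(REQUIRED_FIELDS)


def consistency(delivery_dates):
    """Fraction of delivery dates written in the DD/MM/YY shape."""
    if not delivery_dates:
        return 1.0
    ok = sum(1 for d in delivery_dates if DATE_SHAPE.match(d))
    return ok / len(delivery_dates)


def accuracy(crm_address, latest_order_address):
    """1.0 if the CRM address matches the most recent order address
    (ignoring case and extra whitespace), else 0.0."""
    def normalize(text):
        return " ".join(text.lower().split())
    return 1.0 if normalize(crm_address) == normalize(latest_order_address) else 0.0


def currency_days(last_update, today):
    """Timeliness as currency: days elapsed since the record's last update."""
    return (today - last_update).days


record = {"name": "Ada Rossi", "street": "Via Roma 1", "city": "Roma",
          "state": "RM", "zip": ""}
print(completeness(record))                                   # 4 of 5 fields filled
print(consistency(["01/02/19", "2019-02-01", "15/03/19", "3/4/19"]))
print(accuracy("Via Roma 1,  Roma", "via roma 1, roma"))
print(currency_days(date(2019, 1, 1), date(2019, 1, 31)))
```

Volatility (the frequency of updates) is left out of the sketch because it needs an update history rather than a single record.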

The question from knowledge management experts
Can Big Data leverage semantics? Yes.
Commonly used data in the BD context:
Data generated by humans (mainly disseminated through web tools such as social networks, cookies, emails, ...)
Data generated from connected objects
The Internet of human beings and the Internet of Things become a mix of big data that must be targeted to understand, plan and act in a predictive way.

Bidirectionality
The relation between Big Data and semantics is bidirectional.
Just as BD leverages semantics, some semantic tasks are optimized by using tools designed for processing large data sets.

Challenges for Big Data
a) Meaningful data integration challenges:
1. Define the problem to solve
2. Identify relevant pieces of data in Big Data
3. ETL it into appropriate formats and store it for processing
4. Disambiguate it
5. Solve the problem

Challenges for Big Data
b) The Billion Triple Challenge, which aims to process a large scale target vocabulary and to link that entity to the corresponding sources
c) The Linked Open Data ripper, for providing good use cases for LOD and being able to link them with non-LOD data efficiently
d) The value of the use of semantics in data integration and in the design of future DBMS

Challenges for Big Data
Semantics could be considered as a magic word to bridge the gap of the heterogeneity of data.
Semantics can be used in a decidable system, which makes it possible to:
detect inconsistency of data,
generate new knowledge using an inference engine, or
simply link more accurately specific data not relevant for machine learning based techniques.

Challenges for Big Data
To determine the quality of datasets published on the web and make this quality information explicit.
Assuring data quality is particularly a challenge in LOD, as it involves a set of autonomously evolving data sources.
Information quality criteria for:
Web documents: page trustworthiness versus page rank
Structured information: correctness of facts, adequacy of semantic representation, degree of coverage

Trustworthiness of web sources
The trustworthiness or accuracy of a web source is defined as the probability that it contains the correct value for a fact, assuming that it mentions any value for that fact.
Trustworthiness is orthogonal to PageRank.
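
The definition of trustworthiness above is a conditional probability, which can be estimated as a simple ratio. The sketch below is illustrative only and rests on a strong assumption: that a gold standard of true values (the hypothetical `true_values` dictionary) is available. In practice the true values are usually unknown and have to be estimated as well. The function name and the fact encoding are assumptions too.

```python
def trustworthiness(source_claims, true_values):
    """Estimate P(value is correct | the source mentions a value for the fact).

    source_claims: {fact: value stated by the source}
    true_values:   {fact: correct value}  (hypothetical gold standard)
    Facts the source does not mention do not lower the score, and claims
    about facts missing from the gold standard cannot be judged, so both
    are skipped.
    """
    judged = [fact for fact in source_claims if fact in true_values]
    if not judged:
        return None  # nothing to judge: trustworthiness is undefined
    correct = sum(1 for fact in judged
                  if source_claims[fact] == true_values[fact])
    return correct / len(judged)


gold = {
    ("Rome", "capital_of"): "Italy",
    ("Paris", "capital_of"): "France",
    ("Berlin", "capital_of"): "Germany",
    ("Madrid", "capital_of"): "Spain",
}
claims = {
    ("Rome", "capital_of"): "Italy",
    ("Paris", "capital_of"): "Germany",   # wrong value
    ("Berlin", "capital_of"): "Germany",
}
print(trustworthiness(claims, gold))  # 2 correct out of 3 mentioned facts
```

Note that the source is not penalized for saying nothing about Madrid, and that nothing in the computation depends on how many pages link to the source, which is why the measure is independent of PageRank.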

Data quality assessment methodology
A data quality assessment methodology is defined as the process of evaluating whether a piece of data meets the information consumer's needs in a specific use case.
The process involves measuring the quality dimensions that are relevant to the user and comparing the assessment results with the user's quality requirements.
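
The two steps of the methodology (measure the relevant dimensions, then compare with the user's requirements) can be sketched as follows. This is a minimal illustration under assumptions: scores are taken to lie in [0, 1] with higher meaning better, requirements are expressed as minimum thresholds, and the names `assess`, `fit_for_use` and the two use cases are hypothetical.

```python
def assess(measured, requirements):
    """Compare measured quality scores with the user's minimum thresholds.

    measured:     {dimension: score in [0, 1]}
    requirements: {dimension: minimum acceptable score}
    Only the dimensions named in the requirements are checked; a dimension
    that was never measured counts as not satisfied.
    """
    report = {}
    for dimension, minimum in requirements.items():
        score = measured.get(dimension)
        report[dimension] = score is not None and score >= minimum
    return report


def fit_for_use(measured, requirements):
    """True if every required dimension meets its threshold."""
    return all(assess(measured, requirements).values())


measured = {"completeness": 0.8, "consistency": 0.5, "accuracy": 1.0}

mailing_needs = {"completeness": 0.75, "accuracy": 0.9}
analytics_needs = {"consistency": 0.9}

print(fit_for_use(measured, mailing_needs))    # requirements met
print(fit_for_use(measured, analytics_needs))  # consistency too low
```

The same dataset passes for one use case and fails for another, which matches the earlier assumption that data with quality problems can still be useful when the quality is in the range a given application requires.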