CEDAR: Rich Data Analytics at Cloud Scale https://team.inria.fr/cedar Mon, 18 Mar 2024 15:04:15 +0000

Nelly Barret’s PhD defense https://team.inria.fr/cedar/phd-defense-of-nelly-barret/ Sat, 16 Mar 2024 09:56:56 +0000 https://team.inria.fr/cedar/?p=9509
Even the best things have an end! Nelly Barret defended her PhD thesis on March 15th.

Her PhD is titled “User-oriented exploration of semi-structured datasets”. Her work has contributed to ConnectionLens and ConnectionStudio, at the heart of SourcesSay, and led to the standalone projects Abstra and Pathways.

As Nelly embarks on the next chapter of her career, we extend our heartfelt congratulations and best wishes for continued success and fulfillment.

Defense jury

  • Stefano Ceri – Full professor, Politecnico di Milano, Dipartimento di Elettronica, Informazione e Bioingegneria (DEIB)
  • Katja Hose – Full professor, TU Wien, Databases and Artificial Intelligence research unit
  • Fatemeh Nargesian – Assistant professor, University of Rochester, Department of Computer Science
  • Jean-Marc Petit – Full professor, INSA Lyon, Laboratoire d’InfoRmatique en Image et Systèmes d’information (LIRIS)
  • Fatiha Sais – Full professor, Université Paris Saclay, Laboratoire Interdisciplinaire des Sciences du Numérique (LISN)

Supervisors

  • Ioana Manolescu – Senior researcher, Inria and Ecole Polytechnique – PhD advisor
  • Karen Bastien – WeDoData CEO – PhD co-supervisor

 

 

]]>
Fatemeh Nargesian “Tabular Data Discovery in Data Lakes” https://team.inria.fr/cedar/fatemeh-nargesian-tabular-data-discovery-in-data-lakes/ Tue, 12 Mar 2024 11:26:58 +0000 https://team.inria.fr/cedar/?p=9504
Fatemeh Nargesian will present her work on March 12th, 2024, at 2 pm. The seminar will be held in the Grace Hopper room, in the Alan Turing Building (Palaiseau), and online here.

Title : 

Tabular Data Discovery in Data Lakes

Abstract: 

Tabular data discovery streamlines the construction and integration of tables, utilized in downstream data science tasks, from massive collections of data sources such as data lakes. This process involves efficiently identifying relevant tables and discovering queries that facilitate the construction of datasets beneficial for downstream data science tasks. In this talk, I will first describe how to develop efficient index structures for table discovery, based on equi-join, semantic-join, and union operations. Next, I will show how leveraging table version histories can reveal fine-grained schematic links among columns in data lakes. We will also see how to construct a navigational structure over data lakes, presenting an alternative discovery method to conventional keyword searches. Finally, I will conclude with a discussion of the challenges of using discovered queries for approximate query answering.

Bio:

Fatemeh Nargesian is an assistant professor of computer science at the University of Rochester. She obtained her PhD at the University of Toronto. Her research interests are in data acquisition for AI and scientific time-series management. Her work has appeared at top-tier venues including VLDB, SIGMOD, and ICDE, and won the best demo award at VLDB 2017.

 

]]>
Six CEDAR short papers in Infox sur Seine 2024 https://team.inria.fr/cedar/six-cedar-short-papers-in-infox-sur-seine-2024/ Tue, 12 Mar 2024 11:02:38 +0000 https://team.inria.fr/cedar/?p=9502
    The following short papers have been accepted for publication in the workshop Infox sur Seine:
    • Antoine Gauquier, Ioana Manolescu and Pierre Senellart: “Efficient and Focused Web Crawling for Statistical Data Sources Retrieval”
    • Oana Balalau, Théo Galizzi, Isotta Magistrali, Ioana Manolescu and Gabriele Mura: “Improved Detection of Statistical Entities”
    • Samuel Da Silva Guimarães and Oana Goga:  “The web of challenges of disinformation on Meta platforms in Brazil and France”
    • Sarra Bendaho, Nada Hanad, Nardjes Amieur, Asmaa El fraihi and Oana Goga: “Beyond misinformation: using generative AI in advertising”   
    • Nardjes Amieur, Salim Chouaki and Oana Goga: “Misinformation on Facebook: Who Gets Exposed and who Engages?”
    • Hiba Louzzani, Ines Abdelaziz, Asmaa El fraihi, Nardjes Amieur and Oana Goga:  “Targeting Vulnerable Groups with Misinformation through Placement-Based Ads on YouTube”
     
    ]]>
    Two short papers have been accepted at SEAGraph24 https://team.inria.fr/cedar/two-short-papers-have-been-accepted-at-seagraph24/ Tue, 27 Feb 2024 09:20:27 +0000 https://team.inria.fr/cedar/?p=9493
      The following short papers have been accepted for publication in SEAGraph24:

      • “Finding the PG schema of any (semi)structured dataset: a tale of graphs and abstraction” by Nelly Barret, Tudor Enache, Ioana Manolescu and Madhulika Mohanty
      • “Graph lenses over any data: the ConnectionLens experience” by Oana Balalau, Nelly Barret, Simon Ebel, Théo Galizzi, Ioana Manolescu and Madhulika Mohanty
      ]]>
      MediumAI: Associated team between CEDAR and CWI team Human-Centered Data Analytics https://team.inria.fr/cedar/mediumai-associated-team-between-cedar-and-cwi-team-human-centered-data-analytics/ Mon, 19 Feb 2024 12:56:04 +0000 https://team.inria.fr/cedar/?p=9488
      The Inria Associated Team MediumAI was accepted! We will start a three-year collaboration project (an Inria Associated Team) with the CWI team Human-Centered Data Analytics: https://www.cwi.nl/en/groups/human-centered-data-analytics/

      We will be working together on evaluating types of bias and limitations prevalent in AI-based journalistic tools.
      ]]>
      PETS 2024: Client-side and Server-side Tracking on Meta: Effectiveness and Accuracy https://team.inria.fr/cedar/pets-2024-how-effective-are-tracking-restrictions-a-case-study-on-metas-tracking-technologies/ Sat, 03 Feb 2024 18:18:53 +0000 https://team.inria.fr/cedar/?p=9470
      The paper “Client-side and Server-side Tracking on Meta: Effectiveness and Accuracy” by Asmaa El fraihi, Nardjes Amieur, Oana Goga and Walter Rudametkin has been accepted for publication at the Privacy Enhancing Technologies Symposium (PETS 2024).

      ]]>
      Character encodings and how to live with them https://team.inria.fr/cedar/character-encodings-and-how-to-live-with-them/ Sat, 20 Jan 2024 21:20:28 +0000 https://team.inria.fr/cedar/?p=9447
      Problems may occur when software encounters a character not supported by its designated encoding, resulting in errors such as character replacements or omissions. This tutorial provides guidance on preventing and resolving issues related to character encoding.

      What are character encodings?


      Human languages (think French, English, Chinese, etc.), but also other sets of symbols (think math, emojis, etc.), have led to the need to represent numerous symbols, each as a character of a character set. A character is encoded on a small number of bytes (one to four in UTF8, for instance). A character set is a finite set of characters, together with an encoding (that is, a sequence of bytes) of each character present in the set.
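As a small illustration (a sketch added here, not part of the original tutorial), the Python snippet below shows the bytes that two encodings assign to a few sample characters. The sample characters and the helper name `encoded_bytes` are chosen purely for illustration.

```python
def encoded_bytes(char: str, encoding: str) -> str:
    """Return the bytes of `char` in `encoding` as a hex string (illustrative helper)."""
    return char.encode(encoding).hex(" ")


# Sample characters, written as escapes so the source stays pure ASCII:
# "e", e-acute, the euro sign, and a smiley emoji.
samples = ["e", "\u00e9", "\u20ac", "\U0001F600"]

for ch in samples:
    # Print the code point rather than the character itself, so the output
    # does not depend on the terminal's own encoding.
    print(f"U+{ord(ch):04X} in UTF-8: {encoded_bytes(ch, 'utf-8')}")

# The same character can have a different byte sequence in another encoding.
print(f"U+00E9 in Latin-1: {encoded_bytes(chr(0x00E9), 'latin-1')}")
```

In UTF-8, the four samples take one, two, three, and four bytes respectively, while Latin-1 stores the accented letter in a single byte.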

      Where are character encodings present?


      A character encoding is present whenever a piece of software does something with a character (or more characters, such as a string, or a file). For instance:

      • When typing characters on a keyboard, the terminal expects a certain character encoding
      • When storing a file in a file system, the file system views the file as using a certain character encoding
      • When looking at a file in a file editor, the editor uses a certain character encoding
      • When creating a database in Postgres, one can assign it a given character encoding through a dedicated parameter
      • When reading a database in Postgres, the client program (psql) also uses its own character encoding.
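To make some of these layers visible, here is a minimal Python sketch that reports the encodings the interpreter associates with the terminal, the file system, and text files. It only illustrates the idea for a Python process; the function name `encoding_report` and the layer labels are assumptions of this example, and other programs (editors, psql, etc.) have their own settings.

```python
import locale
import sys


def encoding_report() -> dict:
    """Collect the encodings Python sees at a few of the layers listed above."""
    return {
        # getattr guards against environments where a stream is missing (None).
        "terminal input (stdin)": getattr(sys.stdin, "encoding", None),
        "terminal output (stdout)": getattr(sys.stdout, "encoding", None),
        "file names (file system)": sys.getfilesystemencoding(),
        "default for text files": locale.getpreferredencoding(False),
    }


for layer, enc in encoding_report().items():
    print(f"{layer}: {enc}")
```

If the values differ from one another (for instance, UTF-8 for file names but another encoding for text files), that mismatch is a likely source of the problems described below.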

      What character encoding is best?


      We usually need one large enough to include accented French, Spanish (remember the inverted question mark in Spanish!) or German letters. For our daily needs, UTF8 does this and is the best choice. It makes sense to use it whenever there is no strong specific requirement to use another encoding; and it makes sense not to use any character encoding that has fewer characters than UTF8.

      Below, we consider UTF8 to be the reference encoding. If you need any other encoding E, just replace UTF8 by E everywhere below. It is sometimes possible to handle characters encoded using an encoding E1 with software that is configured to use another encoding E2 (if E2 somehow subsumes or extends E1). However, this makes for complicated settings, and will not be considered below.

      What can go wrong with character encodings?


      If a piece of software encounters a character that is not in the character encoding that it expects, this leads to an error. The software may then handle the error in one of the following ways:

      • Silently replace the character with the closest equivalent, e.g., if the software does not expect é, it will show e.
      • Not show the character at all, e.g., Hernàndez becomes Hernndez.
      • Show something very ugly instead of the character not being handled, e.g., an unprintable character.
      • Complain, e.g., refuse to save a file containing a character that is not in the set associated to the file.
      • Break, i.e., throw an error of some sort.
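The behaviours listed above can be reproduced with Python's standard codecs, as in the sketch below. It is only an illustration: the target encoding (ASCII), the variable names, and the choice of Unicode normalization to mimic the "closest equivalent" are assumptions of this example, not a description of how any particular software behaves.

```python
import unicodedata

name = "Hern\u00e0ndez"  # the example name from the list above, with a-grave

# Closest equivalent: decompose accented letters, then drop the accents.
closest = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")

# Character not shown at all: unsupported characters are silently dropped.
dropped = name.encode("ascii", errors="ignore").decode("ascii")

# Placeholder shown instead of the unsupported character.
replaced = name.encode("ascii", errors="replace").decode("ascii")

# Something ugly: UTF-8 bytes read back with the wrong encoding.
garbled = name.encode("utf-8").decode("latin-1")

# Break: the default "strict" error handler raises an exception.
try:
    name.encode("ascii")
    strict_error = None
except UnicodeEncodeError as exc:
    strict_error = exc

print(closest)         # Hernandez
print(dropped)         # Hernndez
print(replaced)        # Hern?ndez
print(ascii(garbled))  # shown with escapes so the output is terminal-independent
print(type(strict_error).__name__)  # UnicodeEncodeError
```

Only the last variant makes the problem visible at the point where it occurs; the others let the altered string travel further into the application.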

      Note that most of the above may lead to other hard-to-catch errors down the road. For instance:

      • a replacement of é with e may lead to similar but different names somewhere;
      • a replacement of Hernàndez with Hernndez may make an entity extraction fail because the extractor has never seen such a name;
      • a string may fail to be saved in a database, throwing an error or not (depending on how the code is written), etc.

      Therefore, it’s important to avoid or solve such problems.

      How to avoid or solve character encoding problems?


      Try to enforce UTF8 all over:

      1. Make sure all new files are in UTF8 encoding. You can find out a file’s encoding using the file command, like this:
        ioanamanolescu@im22 submission % file the.bib 
        the.bib: BibTeX text file, UTF-8 Unicode text, with very long lines
      2. If needed, you can change a file’s encoding; see, e.g., https://stackoverflow.com/questions/132318/how-do-i-correct-the-character-encoding-of-a-file. You will have succeeded when the file command reports UTF-8.
      3. Make sure your shell uses UTF8. This helps you view the contents of UTF8 files correctly, and ensures that your terminal does not change file encodings without your being aware of it, e.g., when you create a new file by pasting in the terminal and saving it, or when copying or downloading a file. https://stackoverflow.com/questions/5306153/how-to-get-terminals-character-encoding gives some hints on how to find and change your terminal’s encoding. The locale command shows the values, as known by your current shell, of the environment variables related to the character encoding and language:
      ioanamanolescu@im22 submission % locale
      LANG=""
      LC_COLLATE="C"
      LC_CTYPE="UTF-8"
      LC_MESSAGES="C"
      LC_MONETARY="C"
      LC_NUMERIC="C"
      LC_TIME="C"
      LC_ALL=
      

      The above page suggests changing the value of the LC_ALL environment variable to control the default language and default character encoding, by including the following in one’s .bashrc:

      export LC_ALL=pt_PT.utf8
      export LANG="$LC_ALL"
      

      Note that connecting remotely from one computer to another may set the locale on the remote machine, even if you did not intend it.

      • When connecting from a Mac to another machine, you can avoid this by unchecking the box “Set locale environment variables at start-up”, at the bottom of Terminal > Preferences > Advanced.
      4. Make sure every newly created database is in UTF8, e.g., for Postgres (https://www.postgresql.org/docs/current/multibyte.html), but likely also for other systems, which offer similar settings:
        initdb -E UTF8
        
      5. Make sure every client connecting to the database uses UTF8. In Postgres:
        SET CLIENT_ENCODING TO 'UTF8';

        Some JDBC drivers accept the encoding as a parameter of the connection URL, e.g.:
        jdbc:ctree:port@host_name:db_name?characterEncoding=UTF-8

        but this does not seem to be part of the JDBC standard. MORE INFO/RESOLUTION NEEDED HERE

      6. tmux also brings its own encoding issues:

        tmux attempts to guess if the terminal is likely to support UTF-8 by checking the first of the LC_ALL, LC_CTYPE and LANG environment variables to be set for the string “UTF-8”. This is not always correct: the -u flag explicitly informs tmux that UTF-8 is supported.
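The first two steps above (checking a file's encoding and converting it to UTF8) can also be scripted. The following Python sketch shows one possible way; the function names `is_utf8` and `convert_to_utf8` and the sample file are hypothetical. It assumes you already know the file's current encoding, since that cannot be detected reliably from the bytes alone.

```python
import tempfile
from pathlib import Path


def is_utf8(path: Path) -> bool:
    """Return True if the file's bytes decode cleanly as UTF-8."""
    try:
        path.read_bytes().decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def convert_to_utf8(path: Path, source_encoding: str) -> None:
    """Rewrite the file in UTF-8, assuming it is currently in `source_encoding`."""
    text = path.read_bytes().decode(source_encoding)
    path.write_bytes(text.encode("utf-8"))


# Usage on a throwaway file that is deliberately written in Latin-1.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bib"
    sample.write_bytes("Hern\u00e0ndez".encode("latin-1"))
    before = is_utf8(sample)   # False: the Latin-1 byte is not valid UTF-8 here
    convert_to_utf8(sample, "latin-1")
    after = is_utf8(sample)    # True after the conversion
    print(before, after)
```

Note that a pure-ASCII file passes the `is_utf8` check as well, since ASCII text is also valid UTF-8.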

      ]]>
      ACM IMC 2023: Understanding the Privacy Risks of Popular Search Engine Advertising Systems https://team.inria.fr/cedar/acm-imc-2023-understanding-the-privacy-risks-of-popular-search-engine-advertising-systems/ Thu, 12 Oct 2023 11:12:18 +0000 https://team.inria.fr/cedar/?p=9363
      The paper “Understanding the Privacy Risks of Popular Search Engine Advertising Systems” by Salim Chouaki, Oana Goga, Hamed Haddadi, and Peter Snyder has been accepted for publication at the ACM Internet Measurement Conference 2023.

      ]]>
      EMNLP 2023: Evaluating the Factual Faithfulness of Graph-to-Text Generation https://team.inria.fr/cedar/emnlp-2023-evaluating-the-factual-faithfulness-of-graph-to-text-generation/ Mon, 09 Oct 2023 06:59:00 +0000 https://team.inria.fr/cedar/?p=9359
      The paper “FactSpotter: Evaluating the Factual Faithfulness of Graph-to-Text Generation” by Kun Zhang, Oana Balalau and Ioana Manolescu has been accepted for publication in Findings of EMNLP 2023.
      ]]>
      PhD Defense of Vera Sosnovik https://team.inria.fr/cedar/phd-defense-of-vera-sosnovik/ Wed, 06 Sep 2023 09:08:41 +0000 https://team.inria.fr/cedar/?p=9349
      Vera Sosnovik defended her PhD thesis entitled “Detection and analysis of online issue and political ads”.

       

      Supervisors:

      Mme. Oana Goga – CNRS

      M. Patrick Loiseau – Inria

      The thesis committee consists of:

      Rapporteurs:

      M. Kévin Huguenin – Université de Lausanne

      M. Walter Rudametkin – University of Lille

      Examiners:

      M. Gilles Bastin – Sciences Po Grenoble

      M. Paolo Frasca – CNRS

      Mme. Juhi Kulshrestha – Aalto University



      The defense took place at Auditorium IMAG (Grenoble) on the 4th of September at 14:00. 

       

      Abstract

       

      Online political advertising has become the cornerstone of political campaigns. The budget spent solely on political advertising in the U.S. has increased by more than 100%, from $700 million during the 2017-2018 U.S. election cycle to $1.6 billion during the 2020 U.S. presidential elections. Naturally, the capacity offered by online platforms to micro-target ads with political content has been worrying lawmakers, journalists, and online platforms, especially after the 2016 U.S. presidential election, where Cambridge Analytica targeted voters with political ads congruent with their personality.

       

      To curb such risks, both online platforms and regulators (through the Digital Services Act (DSA) proposed by the European Commission) have agreed that researchers, journalists, and civil society need to be able to scrutinize the political ads running on large online platforms. Consequently, online platforms such as Meta and Google have implemented Ad Libraries that contain information about all political ads running on their platforms.

       

      The thesis consists of three contributions related to the online political advertising problems. The first project investigates whether we can reliably distinguish political ads from non-political ads. We take an empirical approach to analyze what kind of ads are deemed political by ordinary people and what kind of ads lead to disagreement. Our results show a significant disagreement between what ad platforms, ordinary people, and advertisers consider political and suggest that this disagreement mainly comes from diverging opinions on which ads address social issues. Overall our results imply that it is important to consider social issue ads as political, but they also complicate political advertising regulations. 

       

      In the second project, we focus on political ads that are related to policy. Understanding which policies politicians or organizations promote, and to whom, is essential in determining dishonest representations. We propose automated methods based on pre-trained models to classify ads into 14 main policy groups identified by the Comparative Agendas Project (CAP). We discuss several inherent challenges that arise. Finally, we analyze policy-related ads featured on Meta platforms during the 2022 French presidential election period.

       

      In the final contribution, we propose a set of practical benchmarks to evaluate the “goodness” of political ad definitions. The benchmarks aim to assess whether a definition captures a set of truly problematic ads (the true positives), such as ads with divisive messages across demographic groups, and whether it avoids capturing ads that have only a humanitarian scope (the false positives). We evaluate two definitions from online platforms and two definitions from policymakers based on our benchmarks. Our results show that definitions that only cover ads from/about political actors and elections miss the highest percentage of advertisements that are divisive across different demographic groups.

      ]]>