Back to the project page

Design brainstorm

The wizard consists generally of four parts:

1. Selecting data sources

Prior to anything else, the ODBC sources containing the data must be known. There is very little we can do to make this step easier. It should just consist of selecting the source. It should also be possible to configure this step in an external configuration file by administrators, so the people who use the wizard afterwards needn’t make do this.

2. Selecting source data tables and fields

This step is the main step of the wizard. It determines which fields of which tables will be used in the integration step.

3. Constructing destination tables

In this step the user creates new tables, and selects which of the source fields it should contain. Join conditions can be specified, but the join should be a natural join by default. Constructing conditions must be done with UI and in an intelligent fashion (could I possibly be more vague about this?)

4. Data integration

This step requires little or no user interaction, but it will require some feedback to the user as the data integration is carried out.

Step 2 in depth

It is assumed that the user knows at the very least the basics of the kind of data that is stored, but not anything about the structure of the database.

The user will be presented with the opportunity to do a search. This search is flexible, which means that the user can provide both a sample of the data it’s looking for or say that it’s looking for a date (by entering “date” in the search field).

The wizard will compare the entered data with a set of rules to determine it’s semantic meaning based on syntax and contents. For instance date patterns are recognized, and names (something like multiple words, first and last capitalized?), currency (exactly two decimals, prefixed with currency symbol) etc. These rules are defined in an xml format (possibly derived from xsd). It will recognize data type names such as “date” and “name” and “price”

It then comes back to the user presenting the most likely interpretation of the data entered, allowing the user to confirm, or change the settings. If no likely candidate was found, it will allow the user to forcibly select a data type (if that’s possible), or select plain text search. For plain text search it can be selected whether field contents or names must be preferred. It can select fuzzy and case-insensitive search options, etc. Data type formats may be ambiguous. The date 5/1/2004 might be January 5th or May 1st. Because of the / separator, the former is more likely (American date format), but we must still ask the user “did you mean January 5th or May 1st”.

Defining this in XML (in a way dat doesn’t treat dates differently than any other types) will be difficult. Somehow it must be possible to specify that ##/## is unambiguous if either is above 12, but ambiguous if neither is. Then it must be possible to specify unambiguous ways to display the ambiguous alternatives (in this case January 5th). A syntax based on regular expression syntax with grouping would be most useful. Something like this perhaps:

<DataType type="date">
   <!-- other formats-->
   <format pattern="{:d+}/{:d+}/{:d^4}" 
       condition="\1 > 0 && \1 < 13 && \2 > 0 && \2 < 32">
      <alternative="day:\2 month:\1" />
   </format>
   <format pattern="{:d+}/{:d+}/{:d^4}" 
       condition=”\2 > 0 && \2 < 13 && \1 > 0 && \1 < 32”>
      <alternative="day:\1 month:\2" />
   </format>
</DataType>

The pattern 5/1/2004 will match both formats, and thus be considered ambiguous. The alternatives will be used to ask the user for which he meant (a better alternative is needed for dates of course, but I didn’t feel like figuring out a way to convert numbers to months right now). This imposes the requirement that the alternatives may never be the same if a format can be an ambiguous match.

Some way of scoring the formats for likelihood is also needed (perhaps just by their ordering), and a way to update that scoring based on the user selections (adaptive learning of the likelihood) would be nifty.

For dates specifically, you’d need specific formats like the ones above that give a certain choice for what is month or day a better likelihood based on the separator (so that 5/1/2004 is most likely May 1st because the / indicates American format, but 5-1-2004 is most likely January 5th because the – indicates a European format). Lastly you’d have a specification that cares not about the separators, and gives no preference either way.

A method must be found to map the parts of the string to their semantic meanings (named groups perhaps, ie. ($<dd>:d) matches the day part) and determine what equivalence means meaning left.dd=right.dd, indicating that it’s a numerical comparison, not string, etc. Then we still need to handle actual DATE data types in the database.

The wizard will then search the database(s) in an intelligent fashion. This means that if a date is searched for, it will not assume that all dates in the database properly use a DATE data type. It will also check the format of text fields, determine their semantic meaning in much the same manner as above (only without user intervention). The end result will be that if the user is searching for “25-11-1981” but the database contains the string (not date) “11/25/81” or “25-NOV-81” or “1981-11-25” or “November 25th '81” it will still match. If the user is searching for “November 1981” all dates in November 1981 will match. A search for “25-11” will match all occurrences regardless of the year. Another example, if a user searches for “Roderick de Jong”, it should also match occurrences of “Jong, Roderick de” or tables with separate fields contains “Roderick” and “de Jong”.

To accomplish this, prior to searching for the first time, a semantic mapping of the database is made (without user intervention). Here we determine which fields have which kind of data. In addition to using (of course) the ODBC data types and the kind of pattern rules above, this will also use the field names as clues. A ‘number’ field with name “price” is recognized to be a currency. A string field whose name contains the words “first” and “name” is assumed to be a first name. Note that these kinds of rules must be constructed with care. We cannot assume that all fields named “name” identify personal names. It might be a product name instead, or something else entirely (note: although it is possible to localize this by adjusting the xml files containing these rules, a possibility to provide an external xml file with localized field name strings would be better, since it doesn’t require a localizer to fully understand the mapping rules. A skeleton localization table can be generated from the rules, and filled in by the localizer).

The semantic mapping that is made is stored for future reference, so this step need only be done once.

Using this mapping, it becomes possible to construct SQL statements that compare fields determined to be compatible with the keyword data.

The results are displayed to the user, giving not only the tables and fields found, but also their context (all fields + other rows from the same table). ID keys must be hidden as much as possible, and foreign keys must be expanded so the user can see the kinds of relations this data participates in without needing to know the relational structure of the database. Database administrators can provide a friendly name mapping for the field names for this display step.

The user sees the result, and determines it is what he’s looking for. If it’s not, he can search again with different keywords, or he can expand the search. An expanded search is pretty much a (possibly fuzzy) plain text search. It pays not so much mind to the semantic mapping it found on the database, but performs a much broader search. Again results are displayed. If the user indicates he found what he’s looking for this time round, the data can be used to update the semantic mapping. If the user’s keyword was determined to be a name, but the result was found in a field that the semantic mapping had not determined to be a (part of a) name, the mapping can be updated to say that this field does represent a name, making future searches more effective.

Once the table is found, the user indicates which fields of this table and related tables (expanded foreign keys) he wants to keep. Fields from foreign tables are automatically assigned a join condition to preserve only those relating to the main data selected. Specifying constraints on the data to be selected is technically step 3, but it might be more intuitive to do that here (and in fact step 3 and step 2 may be heavily intermixed altogether).

This process then repeats until the user has all the data it wants.