In the reviewed literature we did not identify any existing approach that converts data from semi-structured (websites) or unstructured web sources to RDF and subsequently integrates them into the Linked Data cloud. Our motivation and objective was therefore to develop an intelligent assistant for extracting semi-structured web data. The intelligent assistant should automatically identify and select part of the web data, let a business user without any technical skills select further data, and automatically prepare a wrapper for extracting these data.
We implemented a prototype that uses specific algorithms to automatically identify the main search form and the repeated results on a website, and that identifies the data inside these results and their detail data. It also allows the user to select additional data and automatically proposes names for them. The intelligent assistant can also export the data to RDF. It can extract data from very dynamic websites (websites with a large amount of JavaScript and AJAX code), where similar approaches have many issues.
We evaluated the intelligent assistant by attempting to extract web data from many different websites, including very dynamic websites, static websites, and websites secured against extraction. We found that our approach has advantages over others in extracting web data from very dynamic websites, and that it allows explicit conversion of web data to the fourth or fifth level of the five-star Linked Data ranking, whereas other approaches in most cases reach only the third level. In addition, it automatically identifies repeated results on a website with a specific algorithm, a feature that most other approaches do not offer.
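To make the RDF export and the upper Linked Data levels more concrete, the following is a minimal sketch in Python of how one extracted record could be serialized as N-Triples. It is not the prototype's implementation. The namespace `http://example.org/`, the function names, and the sample record are assumptions made for illustration only. Denoting each record and property by a URI corresponds to the fourth level of the ranking, and the optional `owl:sameAs` link to an external resource corresponds to the fifth.

```python
from urllib.parse import quote

# Hypothetical namespace for extracted items and properties (assumption).
BASE = "http://example.org/"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def escape_literal(value: str) -> str:
    """Escape the characters that N-Triples requires inside string literals."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def record_to_ntriples(record_id, fields, same_as=None):
    """Return a list of N-Triples lines for one extracted record.

    record_id -- identifier of the extracted result (illustrative)
    fields    -- mapping of proposed data names to extracted values
    same_as   -- optional URI of an equivalent resource in another dataset
    """
    subject = f"<{BASE}item/{quote(str(record_id), safe='')}>"
    lines = []
    for name, value in fields.items():
        predicate = f"<{BASE}prop/{quote(name, safe='')}>"
        lines.append(f'{subject} {predicate} "{escape_literal(str(value))}" .')
    if same_as:
        # Linking to external data is what the fifth level asks for.
        lines.append(f"{subject} <{OWL_SAME_AS}> <{same_as}> .")
    return lines


if __name__ == "__main__":
    triples = record_to_ntriples(
        "42",
        {"title": 'A "quoted" name'},
        same_as="http://dbpedia.org/resource/Example",
    )
    print("\n".join(triples))
```

Plain string literals are used here for every value; a fuller export would presumably add datatypes or language tags and reuse existing vocabularies instead of a single ad hoc namespace.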