Automatic RDF-ization of big data semi-structured datasets

Authors

  • Ronald Gualán, Departamento de Ciencias de la Computación, Universidad de Cuenca, Avenida 12 de abril, Cuenca, Ecuador.
  • Renán Freire, Departamento de Ciencias de la Computación, Universidad de Cuenca, Avenida 12 de abril, Cuenca, Ecuador.
  • Andrés Tello, Departamento de Ciencias de la Computación, Universidad de Cuenca, Avenida 12 de abril, Cuenca, Ecuador.
  • Mauricio Espinoza, Departamento de Ciencias de la Computación, Universidad de Cuenca, Avenida 12 de abril, Cuenca, Ecuador.
  • Víctor Saquicela, Departamento de Ciencias de la Computación, Universidad de Cuenca, Avenida 12 de abril, Cuenca, Ecuador.

Abstract
Linked data adoption continues to grow at a considerable pace in many fields. However, some of the most important datasets remain underexploited for two main reasons: their huge volume and the lack of methods for automatic conversion to RDF. This paper presents an automatic approach that tackles these problems by combining recent Big Data tools with a program for automatic conversion from a relational model to RDF. The process can be summarized in three steps: 1) bulk transfer of data from different sources to Hive/HDFS; 2) transformation of the data in Hive to RDF using D2RQ; and 3) storage of the resulting RDF in CumulusRDF. These Big Data tools allow the platform to handle the large amounts of data available in different sources, which may be structured or semi-structured. Moreover, since the RDF data are stored in CumulusRDF in the final step, users or applications can consume the resulting data through web services or SPARQL queries. Finally, an evaluation in the hydro-meteorological domain demonstrates the soundness of our approach.
Keywords: Automatic transformation to RDF, data integration, Semantic Web, NoSQL, RDF, semi-structured sources, big data, D2RQ, Apache Hive, CumulusRDF, Apache ServiceMix.
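
To make the second step of the pipeline more concrete, the following is a minimal sketch of how a relational row (such as one read from a Hive table) might be turned into RDF triples in N-Triples syntax. It is not the authors' implementation and it does not call D2RQ, Hive, or CumulusRDF. The base URI, the table name `observation`, the column names, and the `vocab/<table>_<column>` naming scheme are all assumptions chosen for illustration.

```python
from urllib.parse import quote

# Hypothetical base URI; a real deployment would use its own namespace.
BASE = "http://example.org/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD = "http://www.w3.org/2001/XMLSchema#"


def to_literal(value):
    """Serialize a Python value as an N-Triples literal."""
    if isinstance(value, int) and not isinstance(value, bool):
        return f'"{value}"^^<{XSD}integer>'
    if isinstance(value, float):
        return f'"{value}"^^<{XSD}double>'
    # Escape backslashes, quotes, and newlines in plain string literals.
    escaped = (
        str(value)
        .replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )
    return f'"{escaped}"'


def row_to_triples(table, key_column, row):
    """Map one relational row to a list of N-Triples lines.

    The row becomes one resource typed with a class named after the table,
    and each non-null, non-key column becomes one property of that resource.
    """
    subject = f"<{BASE}{table}/{quote(str(row[key_column]))}>"
    triples = [f"{subject} <{RDF_TYPE}> <{BASE}vocab/{table}> ."]
    for column, value in row.items():
        if column == key_column or value is None:
            continue  # the key is encoded in the subject URI; NULLs emit nothing
        predicate = f"<{BASE}vocab/{table}_{column}>"
        triples.append(f"{subject} {predicate} {to_literal(value)} .")
    return triples


if __name__ == "__main__":
    # Hypothetical hydro-meteorological observation row.
    sample = {"id": 1, "station": "Cuenca", "rainfall_mm": 12.5}
    for line in row_to_triples("observation", "id", sample):
        print(line)
```

In this sketch the primary key is folded into the subject URI so that each row yields a stable resource identifier, and null columns are skipped rather than emitted as empty literals. The resulting lines could then be bulk-loaded into any triple store that accepts N-Triples.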

Published

2017-01-18

How to cite

Gualán, R., Freire, R., Tello, A., Espinoza, M., & Saquicela, V. (2017). Automatic RDF-ization of big data semi-structured datasets. Maskana, 7(Supl.), 117–127. Retrieved from https://publicaciones.ucuenca.edu.ec/ojs/index.php/maskana/article/view/1082
