StreamSets Transformer – Apache Spark, machine learning and ETL pipelines

Modern ETL from different data sources

With StreamSets Transformer, modern ETL can also be used with Apache Spark. Typical areas of application include training machine learning models or collecting data from different cloud platforms.

StreamSets Transformer has the advantage that users do not have to program anything at first, but can configure the required connections and settings in a web interface. (Image: T.Joos)

Tools like Hadoop and Spark can process massive amounts of data in a short amount of time. The challenge is to feed these big data tools with data at the speed and volume required for effective processing. StreamSets Transformer makes it possible to integrate and automate different data sources, for example for machine learning workloads. This allows data pipelines to be created that process massive amounts of data from different sources.

Together with Apache Spark, this makes it possible to combine, connect and enrich training datasets. The data preparation can thus be completely automated. In parallel, Scala and PySpark code can also be used as part of the data pipeline. StreamSets Transformer is available under an Apache 2.0 license.
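
To illustrate what such custom code can look like, here is a minimal PySpark sketch that joins and enriches two datasets into a training set, roughly the kind of logic a Scala or PySpark stage in a pipeline might carry. The paths and column names are made-up placeholders, not part of StreamSets itself.

    # Minimal PySpark sketch: combine and enrich two datasets for ML training.
    # All paths and column names are hypothetical placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("enrich-training-data").getOrCreate()

    events = spark.read.parquet("s3a://example-bucket/events/")
    customers = spark.read.parquet("s3a://example-bucket/customers/")

    # Join the sources, drop incomplete rows, derive a feature column
    training = (
        events.join(customers, on="customer_id", how="inner")
              .dropna(subset=["amount"])
              .withColumn("amount_log", F.log1p(F.col("amount")))
    )

    training.write.mode("overwrite").parquet("s3a://example-bucket/training/")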

The advantage of the environment is that users do not have to program anything at first, but can configure the connections and settings required for StreamSets Transformer in the web interface, which also supports drag & drop. In addition, several pipelines and connections can be used in parallel in the interface. The provider also offers a free version.

With StreamSets Transformer, multiple pipelines can be used in parallel. The configuration is done in the web-based dashboard. (Image: T.Joos)

Connectors to AWS, Azure, GCP, Snowflake, Databricks and SAP HANA

When data is distributed across different sources and cloud platforms, consolidation and alignment become an important issue for companies. This data must in turn be available efficiently on the platforms where it is to be processed. As a data pipeline engine, StreamSets Transformer has the task of creating ETL, ELT and data transformation pipelines. These can in turn be run natively on Snowflake or Apache Spark.
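
As a sketch of what the Snowflake hand-off of such an ELT pipeline looks like at the Spark level, the following PySpark snippet writes a DataFrame into a Snowflake table via the separate spark-snowflake connector. All connection values are placeholders and the connector package must be on the Spark classpath; this is illustrative, not StreamSets code.

    # Hedged sketch: writing a DataFrame to Snowflake with the spark-snowflake
    # connector. All connection values are placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("snowflake-elt-sketch").getOrCreate()
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])

    sf_options = {
        "sfURL": "example_account.snowflakecomputing.com",
        "sfUser": "EXAMPLE_USER",
        "sfPassword": "***",
        "sfDatabase": "ANALYTICS",
        "sfSchema": "PUBLIC",
        "sfWarehouse": "COMPUTE_WH",
    }

    (df.write
       .format("net.snowflake.spark.snowflake")
       .options(**sf_options)
       .option("dbtable", "EXAMPLE_TABLE")
       .mode("overwrite")
       .save())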

StreamSets supports numerous sources for building a data pipeline. (Image: T.Joos)

In order to stream data, connectors to the source platform are required. StreamSets Transformer offers connections to the most important cloud services such as AWS, Azure, GCP, Databricks and Snowflake. StreamSets also offers numerous other connectors; all options can be found on the project website.

Possible connections also include Hive, Hadoop, MapR, Kafka, MongoDB, Oracle, MySQL, PostgreSQL, Salesforce, SAP HANA, and Microsoft SQL Server. Even Windows event logs can be read and streamed. StreamSets supports more than 40 storage and database sources as well as Kafka and MapR streams. Microsoft Azure SQL Data Lake and more than 30 other database, storage and streaming platforms can be connected via the dashboard.
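
Under the hood, a relational origin boils down to a JDBC read in Spark. As a hedged illustration with one of the listed sources, the following snippet reads a PostgreSQL table with Spark's built-in JDBC data source; host, table and credentials are placeholders, and the PostgreSQL JDBC driver must be available on the classpath.

    # Illustrative Spark JDBC read from PostgreSQL; connection details are
    # placeholders and the JDBC driver jar must be on the Spark classpath.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("jdbc-source-sketch").getOrCreate()

    orders = (spark.read.format("jdbc")
              .option("url", "jdbc:postgresql://db.example.com:5432/shop")
              .option("dbtable", "public.orders")
              .option("user", "example_user")
              .option("password", "***")
              .option("driver", "org.postgresql.Driver")
              .load())

    orders.printSchema()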

Pipelines to the various data sources can be assembled in the graphical interface with drag & drop. (Image: T.Joos)

Understanding StreamSets Transformer components

StreamSets Transformer is based on “environments” and “deployments”. An “environment” defines where the StreamSets engines are to be deployed and represents the resources required for operating the engines. Several “environments” can also be created in the StreamSets dashboard. All tasks can be scripted or configured in the graphical user interface.

A “deployment” is a group of identical engine instances that are used in an “environment”. A deployment defines the type, version, and configuration of the StreamSets engine to be used. Here, too, it is possible to operate several deployments in one environment and thus build a flexible structure from a combination of different environments and deployments.
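
To give an idea of what scripting this setup could look like, here is a purely illustrative Python sketch that creates an environment and then a deployment over HTTP. The endpoint paths, payload fields and auth header are hypothetical stand-ins, not the documented Control Hub API.

    # Hypothetical sketch of scripting environments and deployments over HTTP.
    # Endpoint paths, payload fields and the auth header are invented stand-ins.
    import requests

    BASE_URL = "https://cloud.example.com/api"  # hypothetical Control Hub URL
    HEADERS = {"X-API-KEY": "***"}              # hypothetical auth header

    # 1) Create an environment: where the engines will run
    env = requests.post(f"{BASE_URL}/environments",
                        json={"name": "dev-env", "type": "SELF_MANAGED"},
                        headers=HEADERS).json()

    # 2) Create a deployment of Transformer engines inside that environment
    deployment = requests.post(
        f"{BASE_URL}/deployments",
        json={
            "name": "transformer-deployment",
            "environmentId": env["id"],
            "engineType": "TRANSFORMER",
            "engineVersion": "5.x",
            "instances": 2,
        },
        headers=HEADERS,
    ).json()

    print(deployment["id"])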

A Data Collector engine runs data ingestion pipelines that perform record-based data transformations in streaming, CDC, or batch mode. A deployment in an environment is necessary to set up a Data Collector engine.

A Transformer engine runs data processing pipelines on Apache Spark that perform set-based transformations such as joins, aggregates, and sorts on the entire dataset. Deployment in an environment is necessary to set up a Transformer engine.
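
The following minimal PySpark sketch shows the kind of set-based operations meant here, a join, an aggregation and a sort over entire datasets; the sample data is made up, and the snippet illustrates the Spark semantics rather than Transformer's internals.

    # Set-based transformations on whole datasets: join, aggregate, sort.
    # Sample data is made up.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("set-based-sketch").getOrCreate()

    orders = spark.createDataFrame(
        [(1, 101, 20.0), (2, 101, 35.5), (3, 102, 12.0)],
        ["order_id", "customer_id", "amount"])
    customers = spark.createDataFrame(
        [(101, "DE"), (102, "FR")], ["customer_id", "country"])

    revenue = (orders.join(customers, "customer_id")    # join
               .groupBy("country")                      # aggregate
               .agg(F.sum("amount").alias("revenue"))
               .orderBy(F.desc("revenue")))             # sort

    revenue.show()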

Getting started with StreamSets Transformer on the cloud platform

Via the website, the platform can also be used in the cloud free of charge. Various tutorials make it possible to test the platform's capabilities with sample data quickly and at no cost. Setup only takes a few minutes, and no local installation is required.

StreamSets Transformer can also be used free of charge in the platform's web interface. (Image: T.Joos)

The advantage of using the cloud platform is that a free version of StreamSets Transformer is available very quickly. Various instructions, help and documentation for setting up an initial environment can be found in the interface. If you need extended support and more functions, you can switch to the Professional or Enterprise editions.

The platform mainly consists of five components that can be adjusted in the web interface:

  • Control Hub is the dashboard for creating, delivering and operating data streams.
  • Data Collector is an open source tool for developing streaming data pipelines with a graphical user interface and a command line interface.
  • Data Collector Edge is a data collection and analysis tool for IoT and cybersecurity edge systems that runs on an agent.
  • Data Protector detects and secures data as it moves through a pipeline to support GDPR/DSGVO, HIPAA, and other regulatory compliance.
  • DataFlow Performance Manager adds historical comparisons and data SLAs for availability, accuracy, and security.

StreamSets comes with more than 50 pre-installed transformation processors that the user can drag and drop onto a graphical workspace. The processors can connect, remove, convert, parse and aggregate data from various sources. Developers can write their own custom processors in Java, Java Expression Language (EL), JavaScript, Jython, Groovy, and Scala.
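
As a hedged example of such a scripting processor, the following Jython-style snippet follows the classic Data Collector evaluator template with its records/output/error bindings; the exact binding names vary by engine version, and the field names are made up.

    # Jython-style scripting processor sketch, after the classic Data Collector
    # evaluator template; binding names vary by version, field names are made up.
    for record in records:
        try:
            # Derive a normalized field from a hypothetical 'amount' field
            if record.value.get('amount') is not None:
                record.value['amount_eur'] = float(record.value['amount']) / 100.0
            output.write(record)
        except Exception as e:
            # Route failing records to the stage's error stream
            error.write(record, str(e))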

Conclusion

If you want to set up a data stream from different sources for Apache Spark or Snowflake, StreamSets Transformer is an ideal platform that is also easy to use. The free version is of particular interest for test purposes and already offers the full range of StreamSets Transformer's functions. It's worth taking a look at the platform, as it can be up and running in minutes with virtually no effort.
