Set Up Your First Apache Beam Application (Java)

Masato Naka
3 min readFeb 28, 2023

Overview

Data processing has become more and more important and there are so many tools for data processing jobs. Apache Beam supports both batch and streaming processing, both of which are important and necessary for many data-driven services.

For a complete beginner, it’s hard to start learning a new framework without an experience of setting up a simple “Hello World” application.

This post will cover how to create and run your first Apache Beam application in Java.

Tools

  1. Gradle: build tool
  2. VS Code for editor + Gradle for Java extension

Steps

1. Create a Gradle project

gradle init

You’ll be asked several questions. Here are the values that I used for my project.

  1. Select type of project to generate: application
  2. Select implementation language: java
  3. Split functionality across multiple subprojects?: no
  4. Select build script DSL: Kotlin
  5. Generate build using new APIs and behavior: yes
  6. Select test framework: JUnit Jupyter
  7. Project name: apachebeamtraining
  8. Source package: apachebeamtraining

You can choose your preferred option.

Now you can run the Hello World application with the following command:

./gradlew run
Starting a Gradle Daemon, 3 incompatible Daemons could not be reused, use --status for details

> Task :app:run
Hello World!

BUILD SUCCESSFUL in 7s
2 actionable tasks: 2 executed

2. Update dependencies (app/build.gradle.kts)

Add the apache beam related dependencies (you can replace the version with the latest one) to dependencies

    // https://mvnrepository.com/artifact/org.apache.beam/beam-sdks-java-core
implementation("org.apache.beam:beam-sdks-java-core:2.45.0")

// https://mvnrepository.com/artifact/org.apache.beam/beam-runners-direct-java
runtimeOnly("org.apache.beam:beam-runners-direct-java:2.45.0")

3. Write the first Apache Beam logic

Write the wordcount Apache Beam logic in src/main/java/apachebeamtraining/App.java:

public class App {
public static void main(String[] args) {
PipelineOptions options = PipelineOptionsFactory.create();
// Create pipeline
Pipeline p = Pipeline.create(options);
// Read text data from Sample.txt
PCollection<String> textData = p.apply(TextIO.read().from("Sample.txt"));
// Write to the output file with wordcounts as a prefix
textData.apply(TextIO.write().to("wordcounts"));
// Run the pipeline
p.run().waitUntilFinish();
}
}

The steps are pretty simple:

  1. Create a PipelineOption.
  2. Create a Pipeline with the option.
  3. Add the logic to read data from Sample.txt to the pipeline and get the return value as PCollection, which is an abstraction of dataset in Apache Beam.
  4. Add another step to write the return value in the previous step to output file with name starting with wordcounts.
  5. Lastly, run and finish the pipeline.

Here we don’t dive into the details of DSL. You can check the details in programming guide.

You also need to import the dependencies to use PipelineOptions, Pipeline, etc.

If you’re using VSCode, you can reload all the dependencies with Gradle extension in VSCode by clicking on the icon:

4. Prepare Sample.txt

Prepare Sample.txt in app directory. You can write any text in the file like the following:

This post will cover how to start creating your first Apache Beam application in Java.

5. Run the first Apache Beam app

You can run the app by the command:

./gradlew run
Starting a Gradle Daemon, 4 incompatible Daemons could not be reused, use --status for details

> Task :app:run
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.

BUILD SUCCESSFUL in 7s
2 actionable tasks: 2 executed

or you can also run by Gradle extension of VSCode:

Now you’ll find a new file app/wordcounts-00000-of-00001 , which contains exactly the same contents as app/Sample.txt.

cat app/wordcounts-00000-of-00001
This post will cover how to start creating your first Apache Beam application in Java.

Summary

In this post, we set up a simple Apache Beam application in Java with Gradle and confirmed the app successfully read Sample.txt and write the contents to an output file with the specified name prefix wordcounts-.

This is the beginning of the Apache Beam journey. Let’s write some more logic in the next step.

References

  1. https://beam.apache.org/get-started/quickstart-java/
  2. https://beam.apache.org/documentation/programming-guide/

--

--

Masato Naka

An SRE engineer, mainly working on Kubernetes. CKA (Feb 2021). His Interests include Cloud-Native application development, and machine learning.