Skip to content

Read File in pyspark #160

@imsanjoykb

Description

@imsanjoykb
# spark is from the previous example
sc = spark.sparkContext

# A text dataset is pointed to by path.
# The path can be either a single text file or a directory of text files
path = "examples/src/main/resources/people.txt"

df1 = spark.read.text(path)
df1.show()
# +-----------+
# |      value|
# +-----------+
# |Michael, 29|
# |   Andy, 30|
# | Justin, 19|
# +-----------+

# You can use 'lineSep' option to define the line separator.
# The line separator handles all `\r`, `\r\n` and `\n` by default.
df2 = spark.read.text(path, lineSep=",")
df2.show()
# +-----------+
# |      value|
# +-----------+
# |    Michael|
# |   29\nAndy|
# | 30\nJustin|
# |       19\n|
# +-----------+

# You can also use 'wholetext' option to read each input file as a single row.
df3 = spark.read.text(path, wholetext=True)
df3.show()
# +--------------------+
# |               value|
# +--------------------+
# |Michael, 29\nAndy...|
# +--------------------+

# "output" is a folder which contains multiple text files and a _SUCCESS file.
df1.write.csv("output")

# You can specify the compression format using the 'compression' option.
df1.write.text("output_compressed", compression="gzip")

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions