PySpark - reading Google bucket data in Spark
I have followed this guide to read data stored in a Google bucket: https://cloud.google.com/dataproc/docs/connectors/install-storage-connector. It worked fine; the following command
hadoop fs -ls gs://the-bucket-you-want-to-list
gave me the expected results. But when I tried reading data with PySpark using
rdd = sc.textFile("gs://crawl_tld_bucket/")
it throws the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o20.partitions.
: java.io.IOException: No FileSystem for scheme: gs
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
How can this be done?
To access Google Cloud Storage you have to include the Cloud Storage connector jar:
spark-submit --jars /path/to/gcs/gcs-connector-latest-hadoop2.jar your-pyspark-script.py
or
pyspark --jars /path/to/gcs/gcs-connector-latest-hadoop2.jar
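Once the connector jar is on the classpath, the read itself works as in the question. Below is a minimal sketch of such a script; the jar path is a placeholder and the bucket name is the one from the question, so adjust both for your environment (the cluster also needs valid GCS credentials):

# read_gcs.py -- a minimal sketch, assuming the connector jar is supplied on the
# command line, e.g.:
#   spark-submit --jars /path/to/gcs/gcs-connector-latest-hadoop2.jar read_gcs.py
from pyspark import SparkContext

sc = SparkContext(appName="gcs-read-example")

# Note the camel case: the SparkContext method is textFile, not textfile.
rdd = sc.textFile("gs://crawl_tld_bucket/")

# Quick sanity check: count the lines read from the bucket.
print(rdd.count())

sc.stop()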