PySpark - reading Google bucket data in Spark


I followed the blog at https://cloud.google.com/dataproc/docs/connectors/install-storage-connector to read data stored in a Google bucket, and the setup worked fine. The following command

hadoop fs -ls gs://the-bucket-you-want-to-list

gave me the expected results, but when I tried reading data with PySpark using

rdd = sc.textFile("gs://crawl_tld_bucket/")

it throws the following error:

py4j.protocol.Py4JJavaError: An error occurred while calling o20.partitions.
: java.io.IOException: No FileSystem for scheme: gs
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)

How can this be done?

To access Google Cloud Storage you have to include the Cloud Storage connector jar when launching Spark; the connector is on the Hadoop classpath (which is why hadoop fs works) but not on Spark's, hence the "No FileSystem for scheme: gs" error:

spark-submit --jars /path/to/gcs/gcs-connector-latest-hadoop2.jar your-pyspark-script.py 

or

pyspark --jars /path/to/gcs/gcs-connector-latest-hadoop2.jar 
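
If you prefer to wire this up inside the script itself, the sketch below does the same thing programmatically. It is a minimal sketch, assuming the jar path /path/to/gcs/gcs-connector-latest-hadoop2.jar from the commands above and the bucket gs://crawl_tld_bucket/ from the question; the fs.gs.impl and fs.AbstractFileSystem.gs.impl property names follow the GCS connector's configuration and may differ for other connector versions.

from pyspark import SparkConf, SparkContext

# Put the GCS connector on the driver/executor classpath,
# equivalent to passing --jars on the command line (path is an assumption).
conf = (
    SparkConf()
    .setAppName("read-gcs-bucket")
    .set("spark.jars", "/path/to/gcs/gcs-connector-latest-hadoop2.jar")
)
sc = SparkContext(conf=conf)

# Register the gs:// scheme with Hadoop in case core-site.xml does not
# already declare it (property names per the GCS connector configuration).
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.gs.impl",
                "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
hadoop_conf.set("fs.AbstractFileSystem.gs.impl",
                "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")

# Note the camelCase textFile (the question's textfile would fail).
rdd = sc.textFile("gs://crawl_tld_bucket/")
print(rdd.take(5))

Either way, the key point is that the connector jar has to be visible to Spark itself, not just to the Hadoop CLI.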
